proceedings of the thirteenth workshop on graph-based ... · graph theory (gt) and natural language...

EMNLP-IJCNLP 2019

Graph-Based Methods forNatural Language Processing

Proceedings of the Thirteenth Workshop

November 4, 2019Hong Kong

c©2019 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)209 N. Eighth StreetStroudsburg, PA 18360USATel: +1-570-476-8006Fax: [email protected]

ISBN 978-1-950737-86-4

ii

Introduction

Welcome to TextGraphs, the Workshop on Graph-Based Methods for Natural Language Processing. Thethirteenth edition of our workshop is being organized on November 4, 2019, in conjunction with the2019 Conference on Empirical Methods in Natural Language Processing and 9th International JointConference on Natural Language Processing, being held in Hong Kong.

The workshops in the TextGraphs series have published and promoted the synergy between the field ofGraph Theory (GT) and Natural Language Processing (NLP) for over a decade. The target audience ofour workshop comprises of researchers working on problems related to either Graph Theory or graph-based algorithms applied to Natural Language Processing, Social Media, and the Semantic Web.

TextGraphs addresses a broad spectrum of research areas within NLP. This is because, besides traditionalNLP applications like parsing, word sense disambiguation, semantic role labeling, and informationextraction, graph-based solutions also target web-scale applications like information propagation insocial networks, rumor proliferation, e-reputation, language dynamics learning, and future eventsprediction.

The selection process was competitive: we received 31 submissions (17 long and 14 short papers) andaccepted 18 of them (9 long and 9 short papers) for presentation, resulting in the overall acceptance rateof 58%.

This year, for the first time in the history of TextGraphs, we organized a shared task on Multi-HopInference for Explanation Regeneration. The goal of the task was to provide detailed gold explanationsfor standardized elementary science exam questions by selecting facts from a knowledge base. Theshared task received public entries from four participating teams, substantially advancing the state-of-the-art in this challenging problem. The participants’ reports along with the shared task overview by itsorganizers are also presented at the workshop.

We thank Minlie Huang for his invited talk on Controllable Language Generation.

Finally, we are thankful to the members of the program committee for their valuable and high qualityreviews. All submissions have benefited from their expert feedback. Their timely contribution was thebasis for accepting an excellent list of papers and making the thirteenth edition of TextGraphs a success.

Dmitry Ustalov, Swapna Somasundaran, Peter Jansen, Goran Glavaš,Martin Riedl, Mihai Surdeanu, and Michalis VazirgiannisTextGraphs-13 OrganizersNovember 2019

iii

Organizers:

Dmitry Ustalov, Yandex, RussiaSwapna Somasundaran, Educational Testing Service, Princeton, USAPeter Jansen, University of Arizona, USAGoran Glavaš, University of Mannheim, GermanyMartin Riedl, Heidelberg Druckmaschinen, GermanyMihai Surdeanu, University of Arizona, USAMichalis Vazirgiannis, École Polytechnique, France

Program Committee:

Željko Agic, IT University Copenhagen, DenmarkTomáš Brychcín, University of West Bohemia, Czech RepublicFlavio Massimiliano Cecchini, Università Cattolica del Sacro Cuore, ItalyTanmoy Chakraborty, Indian Institute of Technology Delhi, IndiaMihail Chernoskutov, Krasovskii Institute of Mathematics and Mechanics, RussiaStefano Faralli, University of Rome Unitelma Sapienza, ItalyMichael Flor, Educational Testing Service, USADebanjan Ghosh, Massachusetts Institute of Technology, USACarlos Gómez-Rodríguez, Universidade da Coruña, SpainTomáš Hercig, University of West Bohemia, Czech RepublicAnne Lauscher, University of Mannheim, GermanySuman Kalyan Maity, Northwestern University, USAFragkiskos Malliaros, CentraleSupelec & University of Paris-Saclay, FranceGabor Melli, Sony PlayStation, USAMohsen Mesgar, Ubiquitous Knowledge Processing (UKP) Lab, GermanyClayton Morrison, University of Arizona, USAAnimesh Mukherjee, Indian Institute of Technology Kharagpur, IndiaGiannis Nikolentzos, École Polytechnique, FranceEnrique Noriega-Atala, University of Arizona, USAAlexander Panchenko, Skolkovo Institute of Science and Technology, RussiaSimone Paolo Ponzetto, University of Mannheim, GermanyJan Wira Gotama Putra, Tokyo Institute of Technology, JapanNatalie Schluter, IT University of Copenhagen, DenmarkRebecca Sharp, University of Arizona, USAKonstantinos Skianis, École Polytechnique, FranceAntoine Tixier, École Polytechnique, FranceNicolas Turenne, Paris Est University & INRA, FranceKateryna Tymoshenko, University of Trento, ItalyIvan Vulic, University of Cambridge, UKVikas Yadav, University of Arizona, USARui Zhang, Yale University, USA

Invited Speakers:

Minlie Huang, Tsinghua University, China

v

Table of Contents

Transfer in Deep Reinforcement Learning Using Knowledge GraphsPrithviraj Ammanabrolu and Mark Riedl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Relation Prediction for Unseen-Entities Using Entity-Word GraphsYuki Tagawa, Motoki Taniguchi, Yasuhide Miura, Tomoki Taniguchi, Tomoko Ohkuma, Takayuki

Yamamoto and Keiichi Nemoto . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

Scalable graph-based method for individual named entity identificationSammy Khalife and Michalis Vazirgiannis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

Neural Speech Translation using Lattice Transformations and Graph NetworksDaniel Beck, Trevor Cohn and Gholamreza Haffari . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .26

Using Graphs for Word Embedding with Enhanced Semantic RelationsMatan Zuckerman and Mark Last . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

Identifying Supporting Facts for Multi-hop Question Answering with Document Graph NetworksMokanarangan Thayaparan, Marco Valentino, Viktor Schlegel and André Freitas . . . . . . . . . . . . . . 42

Essentia: Mining Domain-specific Paraphrases with Word-Alignment GraphsDanni Ma, Chen Chen, Behzad Golshan and Wang-Chiew Tan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

Layerwise Relevance Visualization in Convolutional Text Graph ClassifiersRobert Schwarzenberg, Marc Hübner, David Harbecke, Christoph Alt and Leonhard Hennig . . . 58

TextGraphs 2019 Shared Task on Multi-Hop Inference for Explanation RegenerationPeter Jansen and Dmitry Ustalov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

ASU at TextGraphs 2019 Shared Task: Explanation ReGeneration using Language Models and IterativeRe-Ranking

Pratyay Banerjee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

Red Dragon AI at TextGraphs 2019 Shared Task: Language Model Assisted Explanation GenerationYew Ken Chia, Sam Witteveen and Martin Andrews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

Team SVMrank: Leveraging Feature-rich Support Vector Machines for Ranking Explanations to Ele-mentary Science Questions

Jennifer D’Souza, Isaiah Onando Mulang’ and Sören Auer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

Chains-of-Reasoning at TextGraphs 2019 Shared Task: Reasoning over Chains of Facts for ExplainableMulti-hop Inference

Rajarshi Das, Ameya Godbole, Manzil Zaheer, Shehzaad Dhuliawala and Andrew McCallum . 101

Joint Semantic and Distributional Word Representations with Multi-Graph EmbeddingsPierre Daix-Moreux and Matthias Gallé . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

Evaluating Research Novelty Detection: Counterfactual ApproachesReinald Kim Amplayo, Seung-won Hwang and Min Song . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

Do Sentence Interactions Matter? Leveraging Sentence Level Representations for Fake News Classifica-tion

Vaibhav Vaibhav, Raghuram Mandyam and Eduard Hovy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

vii

Faceted Hierarchy: A New Graph Type to Organize Scientific Concepts and a Construction MethodQingkai Zeng, Mengxia Yu, Wenhao Yu, JinJun Xiong, Yiyu Shi and Meng Jiang . . . . . . . . . . . . 140

Graph-Based Semi-Supervised Learning for Natural Language UnderstandingZimeng Qiu, Eunah Cho, Xiaochun Ma and William Campbell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

Graph Enhanced Cross-Domain Text-to-SQL GenerationSiyu Huo, Tengfei Ma, Jie Chen, Maria Chang, Lingfei Wu and Michael Witbrock . . . . . . . . . . . 159

Reasoning Over Paths via Knowledge Base CompletionSaatviga Sudhahar, Andrea Pierleoni and Ian Roberts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

Node Embeddings for Graph Merging: Case of Knowledge Graph ConstructionIda Szubert and Mark Steedman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

DBee: A Database for Creating and Managing Knowledge Graphs and EmbeddingsViktor Schlegel and André Freitas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

A Constituency Parsing Tree based Method for Relation Extraction from Abstracts of Scholarly Publica-tions

Ming Jiang and Jana Diesner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

viii

Conference Program

Monday, November 4, 2019

9:00–9:15 Opening remarksOrganizers

9:15–9:35 Transfer in Deep Reinforcement Learning Using Knowledge GraphsPrithviraj Ammanabrolu and Mark Riedl

9:35–9:50 Relation Prediction for Unseen-Entities Using Entity-Word GraphsYuki Tagawa, Motoki Taniguchi, Yasuhide Miura, Tomoki Taniguchi, TomokoOhkuma, Takayuki Yamamoto and Keiichi Nemoto

9:50–10:10 Scalable graph-based method for individual named entity identificationSammy Khalife and Michalis Vazirgiannis

10:10–10:25 Neural Speech Translation using Lattice Transformations and Graph NetworksDaniel Beck, Trevor Cohn and Gholamreza Haffari

10:25–11:00 Coffee Break

11:00–11:20 Using Graphs for Word Embedding with Enhanced Semantic RelationsMatan Zuckerman and Mark Last

11:20–11:40 Identifying Supporting Facts for Multi-hop Question Answering with DocumentGraph NetworksMokanarangan Thayaparan, Marco Valentino, Viktor Schlegel and André Freitas

11:40–11:55 Essentia: Mining Domain-specific Paraphrases with Word-Alignment GraphsDanni Ma, Chen Chen, Behzad Golshan and Wang-Chiew Tan

11:55–12:10 Layerwise Relevance Visualization in Convolutional Text Graph ClassifiersRobert Schwarzenberg, Marc Hübner, David Harbecke, Christoph Alt and LeonhardHennig

12:10–14:00 Lunch Break

14:00–15:00 Invited Talk: Towards more controllable language generation: knowledge and plan-ningMinlie Huang

ix

Monday, November 4, 2019 (continued)

15:00–15:30 TextGraphs 2019 Shared Task on Multi-Hop Inference for Explanation RegenerationPeter Jansen and Dmitry Ustalov

15:30–16:00 Coffee Break

16:00–17:20 Shared Task Poster Session

ASU at TextGraphs 2019 Shared Task: Explanation ReGeneration using LanguageModels and Iterative Re-RankingPratyay Banerjee

Red Dragon AI at TextGraphs 2019 Shared Task: Language Model Assisted Expla-nation GenerationYew Ken Chia, Sam Witteveen and Martin Andrews

Team SVMrank: Leveraging Feature-rich Support Vector Machines for Ranking Ex-planations to Elementary Science QuestionsJennifer D’Souza, Isaiah Onando Mulang’ and Sören Auer

Chains-of-Reasoning at TextGraphs 2019 Shared Task: Reasoning over Chains ofFacts for Explainable Multi-hop InferenceRajarshi Das, Ameya Godbole, Manzil Zaheer, Shehzaad Dhuliawala and AndrewMcCallum

16:00–17:20 TextGraphs Poster Session

Joint Semantic and Distributional Word Representations with Multi-Graph Embed-dingsPierre Daix-Moreux and Matthias Gallé

Evaluating Research Novelty Detection: Counterfactual ApproachesReinald Kim Amplayo, Seung-won Hwang and Min Song

Do Sentence Interactions Matter? Leveraging Sentence Level Representations forFake News ClassificationVaibhav Vaibhav, Raghuram Mandyam and Eduard Hovy

Faceted Hierarchy: A New Graph Type to Organize Scientific Concepts and a Con-struction MethodQingkai Zeng, Mengxia Yu, Wenhao Yu, JinJun Xiong, Yiyu Shi and Meng Jiang

x

Monday, November 4, 2019 (continued)

Graph-Based Semi-Supervised Learning for Natural Language UnderstandingZimeng Qiu, Eunah Cho, Xiaochun Ma and William Campbell

Graph Enhanced Cross-Domain Text-to-SQL GenerationSiyu Huo, Tengfei Ma, Jie Chen, Maria Chang, Lingfei Wu and Michael Witbrock

Reasoning Over Paths via Knowledge Base CompletionSaatviga Sudhahar, Andrea Pierleoni and Ian Roberts

Node Embeddings for Graph Merging: Case of Knowledge Graph ConstructionIda Szubert and Mark Steedman

DBee: A Database for Creating and Managing Knowledge Graphs and EmbeddingsViktor Schlegel and André Freitas

A Constituency Parsing Tree based Method for Relation Extraction from Abstractsof Scholarly PublicationsMing Jiang and Jana Diesner

17:20–17:30 Closing Remark

xi

Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 1–10Hong Kong, November 4, 2019. c©2019 Association for Computational Linguistics

Transfer in Deep Reinforcement Learning using Knowledge Graphs

Prithviraj AmmanabroluSchool of Interactive ComputingGeorgia Institute of Technology

Atlanta, [email protected]

Mark O. RiedlSchool of Interactive ComputingGeorgia Institute of Technology

Atlanta, [email protected]

AbstractText adventure games, in which players mustmake sense of the world through text descrip-tions and declare actions through text descrip-tions, provide a stepping stone toward ground-ing action in language. Prior work has demon-strated that using a knowledge graph as astate representation and question-answering topre-train a deep Q-network facilitates fastercontrol policy learning. In this paper, weexplore the use of knowledge graphs as arepresentation for domain knowledge transferfor training text-adventure playing reinforce-ment learning agents. Our methods are testedacross multiple computer generated and hu-man authored games, varying in domain andcomplexity, and demonstrate that our transferlearning methods let us learn a higher-qualitycontrol policy faster.

1 Introduction

Text adventure games, in which players mustmake sense of the world through text descrip-tions and declare actions through natural language,can provide a stepping stone toward more real-world environments where agents must communi-cate to understand the state of the world and af-fect change in the world. Despite the steadily in-creasing body of research on text-adventure games(Bordes et al., 2010; He et al., 2016; Narasimhanet al., 2015; Fulda et al., 2017; Haroush et al.,2018; Cote et al., 2018; Tao et al., 2018; Am-manabrolu and Riedl, 2019), and in addition to theubiquity of deep reinforcement learning applica-tions (Parisotto et al., 2016; Zambaldi et al., 2019),teaching an agent to play text-adventure games re-mains a challenging task. Learning a control pol-icy for a text-adventure game requires a signifi-cant amount of exploration, resulting in trainingruns that take hundreds of thousands of simula-tions (Narasimhan et al., 2015; Ammanabrolu andRiedl, 2019).

One reason that text-adventure games require somuch exploration is that most deep reinforcementlearning algorithms are trained on a task withouta real prior. In essence, the agent must learn ev-erything about the game from only its interactionswith the environment. Yet, text-adventure gamesmake ample use of commonsense knowledge (e.g.,an axe can be used to cut wood) and genre themes(e.g., in a horror or fantasy game, a coffin is likelyto contain a vampire or other undead monster).This is in addition to the challenges innate to thetext-adventure game itself—games are puzzles—which results in inefficient training.

Ammanabrolu and Riedl (2019) developed a re-inforcement learning agent that modeled the textenvironment as a knowledge graph and achievedstate-of-the-art results on simple text-adventuregames provided by the TextWorld (Cote et al.,2018) environment. They observed that a simpleform of transfer from very similar games greatlyimproved policy training time. However, gamesbeyond the toy TextWorld environments are be-yond the reach of state-of-the-art techniques.

In this paper, we explore the use of knowl-edge graphs and associated neural embeddings asa medium for domain transfer to improve train-ing effectiveness on new text-adventure games.Specifically, we explore transfer learning at mul-tiple levels and across different dimensions. Wefirst look at the effects of playing a text-adventuregame given a strong prior in the form of a knowl-edge graph extracted from generalized textualwalk-throughs of interactive fiction as well asthose made specifically for a given game. Next,we explore the transfer of control policies in deepQ-learning (DQN) by pre-training portions of adeep Q-network using question-answering and byDQN-to-DQN parameter transfer between games.We evaluate these techniques on two differentsets of human authored and computer generated

1

games, demonstrating that our transfer learningmethods enable us to learn a higher-quality con-trol policy faster.

2 Background and Related Work

Text-adventure games, in which an agent must in-teract with the world entirely through natural lan-guage, provide us with two challenges that haveproven difficult for deep reinforcement learningto solve (Narasimhan et al., 2015; Haroush et al.,2018; Ammanabrolu and Riedl, 2019): (1) Theagent must act based only on potentially incom-plete textual descriptions of the world around it.The world is thus partially observable, as the agentdoes not have access to the state of the world atany stage. (2) the action space is combinatoriallylarge—a consequence of the agent having to de-clare commands in natural language. These twoproblems together have kept commercial text ad-venture games out of the reach of existing deepreinforcement learning methods, especially giventhe fact that most of these methods attempt to trainon a particular game from scratch.

Text-adventure games can be treated as par-tially observable Markov decision processes(POMDPs). This can be represented as a 7-tupleof 〈S, T,A,Ω, O,R, γ〉: the set of environmentstates, conditional transition probabilities betweenstates, words used to compose text commands, ob-servations, conditional observation probabilities,the reward function, and the discount factor re-spectively (Cote et al., 2018).

Multiple recent works have explored the chal-lenges associated with these games (Bordes et al.,2010; He et al., 2016; Narasimhan et al., 2015;Fulda et al., 2017; Haroush et al., 2018; Coteet al., 2018; Tao et al., 2018; Ammanabrolu andRiedl, 2019). Narasimhan et al. (2015) introducethe LSTM-DQN, which learns to score the ac-tion verbs and corresponding objects separatelyand then combine them into a single action. Heet al. (2016) propose the Deep Reinforcement Rel-evance Network that consists of separate networksto encode state and action information, with a fi-nal Q-value for a state-action pair that is computedbetween a pairwise interaction function betweenthese. Haroush et al. (2018) present the ActionElimination Network (AEN), which restricts ac-tions in a state to the top-k most likely ones, us-ing the emulator’s feedback. Hausknecht et al.(2019b) design an agent that uses multiple mod-

ules to identify a general set of game play rules fortext games across various domains. None of theseworks study how to transfer policies between dif-ferent text-adventure games in any depth and sothere exists a gap between the two bodies of work.

Transferring policies across different text-adventure games requires implicitly learning amapping between the games’ state and actionspaces. The more different the domain of thetwo games, the harder this task becomes. Previ-ous work (Ammanabrolu and Riedl, 2019) intro-duced the use of knowledge graphs and question-answering pre-training to aid in the problems ofpartial observability and a combinatorial actionspace. This work made use of a system calledTextWorld (Cote et al., 2018) that uses grammarsto generate a series of similar (but not exact same)games. An oracle was used to play perfect gamesand the traces were used to pre-train portions ofthe agent’s network responsible for encoding theobservations, graph, and actions. Their resultsshow that this form of pre-training improves thequality of the policy at convergence it does notshow a significant improvement in the trainingtime required to reach convergence. Further, itis generally unrealistic to have a corpus of verysimilar games to draw from. We build on thiswork, and explore modifications of this algorithmthat would enable more efficient transfer in text-adventure games.

Work in transfer in reinforcement learning hasexplored the idea of transferring skills (Konidarisand Barto, 2007; Konidaris et al., 2012) or trans-ferring value functions/policies (Liu and Stone,2006). Other approaches attempt transfer inmodel-based reinforcement learning (Taylor et al.,2008; Nguyen et al., 2012; Gasic et al., 2013;Wang et al., 2015; Joshi and Chowdhary, 2018),though traditional approaches here rely heavilyon hand crafting state-action mappings across do-mains. Narasimhan et al. (2017) learn to playgames by predicting mappings across domains us-ing a both deep Q-networks and value iterationnetworks, finding that that grounding the gamestate using natural language descriptions of thegame itself aids significantly in transferring usefulknowledge between domains.

In transfer for deep reinforcement learning,Parisotto et al. (2016) propose the Actor-Mimicnetwork which learns from expert policies for asource task using policy distillation and then ini-

2

tializes the network for a target task using theseparameters. Yin and Pan (2017) also use policydistillation, using task specific features as inputsto a multi-task policy network and use a hierarchi-cal experience sampling method to train this multi-task network. Similarly, Rusu et al. (2016) attemptto transfer parameters by using frozen parameterstrained on source tasks to help learn a new setof parameters on target tasks. Rajendran et al.(2017) attempt something similar but use atten-tion networks to transfer expert policies betweentasks. These works, however, do not study the re-quirements for enabling efficient transfer for tasksrooted in natural language, nor do they explore theuse of knowledge graphs as a state representation.

3 Knowledge Graphs for DQNs

A knowledge graph is a directed graph formedby a set of semantic, or RDF, triples in theform of 〈subject, relation, object〉—for exam-ple, 〈vampires, are, undead〉. We follow theopen-world assumption that what is not in ourknowledge graph can either be true or false.

Ammanabrolu and Riedl (2019) introduced theKnowledge Graph DQN (KG-DQN) and touchedon some aspects of transfer learning, show-ing that pre-training portions of the deep Q-network using question answering system on per-fect playthroughs of a game increases the qual-ity of the learned control policy for a generatedtext-adventure game. We build on this work anduse KG-DQN to explore transfer with both knowl-edge graphs and network parameters. Specificallywe seek to transfer skills and knowledge from(a) static text documents describing game play and(b) from playing one text-adventure game to a sec-ond complete game in in the same genre (e.g., hor-ror games). The rest of this section describes KG-DQN in detail and summarizes our modifications.1

For each step that the agent takes, it automat-ically extracts a set of RDF triples from the re-ceived observation through the use of OpenIE(Angeli et al., 2015) in addition to a few rulesto account for the regularities of text-adventuregames. The graph itself is more or less a map ofthe world, with information about objects’ affor-dances and attributes linked to the rooms that theyare place in in a map. The graph also makes a dis-tinction with respect to items that are in the agent’s

1We use the implementation of KG-DQN found athttps://github.com/rajammanabrolu/KG-DQN

Graph embedding

Multi-head graph attention

Sliding bidirectional LSTM

Linear

Observation st

Updated graph

LSTM encoder

Actions a1 … an

Actions a’1 … a’n

Q(st, a’i)

Figure 1: The KG-DQN architecture.

possession or in their immediate surrounding en-vironment. We make minor modifications to therules used in Ammanabrolu and Riedl (2019) tobetter achieve such a graph in general interactivefiction environments.

The agent also has access to all actions acceptedby the game’s parser, following Narasimhan et al.(2015). For general interactive fiction environ-ments, we develop our own method to extract thisinformation. This is done by extracting a set oftemplates accepted by the parser, with the objectsor noun phrases in the actions replaces with a OBJtag. An example of such a template is ”place OBJin OBJ”. These OBJ tags are then filled in by look-ing at all possible objects in the given vocabularyfor the game. This action space is of the orderof A = O(|V | × |O|2) where V is the numberof action verbs, and O is the number of distinctobjects in the world that the agent can interactwith. As this is too large a space for a RL agent toeffectively explore, the knowledge graph is usedto prune this space by ranking actions based ontheir presence in the current knowledge graph andthe relations between the objects in the graph asin Ammanabrolu and Riedl (2019).

The architecture for the deep Q-network con-sists of two separate neural networks—encodingstate and action separately—with the finalQ-valuefor a state-action pair being the result of a pair-wise interaction function between the two (Fig-ure 1). We train with a standard DQN trainingloop; the policy is determined by the Q-value of aparticular state-action pair, which is updated using

3

the Bellman equation (Sutton and Barto, 2018):

Qt+1(st+1,at+1) =

E[rt+1 + γmaxa∈At

Qt(s, a)|st, at] (1)

where γ refers to the discount factor and rt+1 isthe observed reward. The whole system is trainedusing prioritized experience replay Lin (1993), amodified version of ε-greedy learning, and a tem-poral difference loss that is computed as:

L(θ) =rk+1+

γ maxa∈Ak+1

Q(st,a; θ)−Q(st,at; θ)(2)

where Ak+1 represents the action set at step k +1 and st,at refer to the encoded state and actionrepresentations respectively.

4 Knowledge Graph Seeding

In this section we consider the problem of trans-ferring a knowledge graph from a static text re-source to a DQN—which we refer to as seeding.KG-DQN uses a knowledge graph as a state repre-sentation and also to prune the action space. Thisgraph is built up over time, through the course ofthe agent’s exploration. When the agent first startsthe game, however, this graph is empty and doesnot help much in the action pruning process. Theagent thus wastes a large number of steps near thebeginning of each game exploring ineffectively.

The intuition behind seeding the knowledgegraph from another source is to give the agent aprior on which actions have a higher utility andthereby enabling more effective exploration. Text-adventure games typically belong to a particulargenre of storytelling—e.g., horror, sci-fi, or soapopera—and an agent is at a distinct disadvantageif it doesn’t have any genre knowledge. Thus, thegoal of seeding is to give the agent a strong prior.

This seed knowledge graph is extracted fromonline general text-adventure guides as well asgame/genre specific guides when available.2 Thegraph is extracted from this the guide using a sub-set of the rules described in Section 3 used toextract information from the game observations,with the remainder of the RDF triples comingfrom OpenIE. There is no map of rooms in the en-vironment that can be built, but it is possible to

2An example of a guide we use is found here http://www.microheaven.com/IFGuide/step3.html

peopletell

You

can

multipleitems

canuse

gonorth

gosouth

object

examine

knife

cut

takeis

lockkey

isis

opens

take drop

drop

drop

verb

is

is...

.

.

.can

can

do

do

is

command

inventory

.

.

.

. . .

Figure 2: Select partial example of what a seed knowl-edge graph looks like. Ellipses indicate other similarentities and relations not shown.

extract information regarding affordances of fre-quently occurring objects as well as common ac-tions that can be performed across a wide range oftext-adventure games. This extracted graph is thuspotentially disjoint, containing only this generaliz-able information, in contrast to the graph extractedduring the rest of the exploration process. An ex-ample of a graph used to seed KG-DQN is givenin Fig. 2. The KG-DQN is initialized with thisknowledge graph.

5 Task Specific Transfer

The overarching goal of transfer learning in text-adventure games is to be able to train an agent onone game and use this training on to improve thelearning capabilities of another. There is grow-ing body of work on improving training timeson target tasks by transferring network parameterstrained on source tasks (Rusu et al., 2016; Yin andPan, 2017; Rajendran et al., 2017). Of particularnote is the work by Rusu et al. (2016), where theytrain a policy on a source task and then use thisto help learn a new set of parameters on a targettask. In this approach, decisions made during thetraining of the target task are jointly made usingthe frozen parameters of the transferred policy net-work as well as the current policy network.

Our system first trains a question-answeringsystem (Chen et al., 2017) using traces given byan oracle, as in Section 4. For commercial text-adventure games, these traces take the form ofstate-action pairs generated using perfect walk-

4

through descriptions of the game found online asdescribed in Section 4.

We use the parameters of the question-answering system to pre-train portions of the deepQ-network for a different game within in the samedomain. The portions that are pre-trained are thesame parts of the architecture as in Ammanabroluand Riedl (2019). This game is referred to as thesource task. The seeding of the knowledge graphis not strictly necessary but given that state-of-the-art DRL agents cannot complete real games, thismakes the agent more effective at the source task.

We then transfer the knowledge and skills ac-quired from playing the source task to anothergame from the same genre—the target task. Theparameters of the deep Q-network trained on thesource game are used to initialize a new deep Q-network for the target task. All the weights indi-cated in the architecture of KG-DQN as shown inFig. 1 are transferred. Unlike Rusu et al. (2016),we do not freeze the parameters of the deep Q-network trained on the source task nor use the twonetworks to jointly make decisions but instead justuse it to initialize the parameters of the target taskdeep Q-network. This is done to account for thefact that although graph embeddings can be trans-ferred between games, the actual graph extractedfrom a game is non-transferable due to differencesin structure between the games.

6 Experiments

We test our system on two separate sets ofgames in different domains using the Jericho andTextWorld frameworks (Hausknecht et al., 2019a;Cote et al., 2018). The first set of games is “slice oflife” themed and contains games that involve mun-dane tasks usually set in textual descriptions ofnormal houses. The second set of games is “hor-ror” themed and contains noticeably more diffi-cult games with a relatively larger vocabulary sizeand action set, non-standard fantasy names, etc.We choose these domains because of the avail-ability of games in popular online gaming com-munities, the degree of vocabulary overlap withineach theme, and overall structure of games in eachtheme. Specifically, there must be at least threegames in each domain: at least one game to trainthe question-answering system on, and two moreto train the parameters of the source and target taskdeep Q-networks. A summary of the statistics forthe games is given in Table 1. Vocabulary overlap

is calculated by measuring the percentage of over-lap between a game’s vocabulary and the domain’svocabulary, i.e. the union of the vocabularies forall the games we use within the domain. We ob-serve that in both of these domains, the complex-ity of the game increases steadily from the gameused for the question-answering system to the tar-get and then source task games.

We perform ablation tests within each domain,mainly testing the effects of transfer from seed-ing, oracle-based question-answering, and source-to-target parameter transfer. Additionally, thereare a couple of extra dimensions of ablations thatwe study, specific to each of the domains andexplained below. All experiments are run threetimes using different random seeds. For all theexperiments we report metrics known to be impor-tant for transfer learning tasks (Taylor and Stone,2009; Narasimhan et al., 2017): average rewardcollected in the first 50 episodes (init. reward), av-erage reward collected for 50 episodes after con-vergence (final reward), and number of steps takento finish the game for 50 episodes after conver-gence (steps). For the metrics tested after conver-gence, we set ε = 0.1 following both Narasimhanet al. (2015) and Ammanabrolu and Riedl (2019).We use similar hyperparameters to those reportedin (Ammanabrolu and Riedl, 2019) for training theKG-DQN with action pruning, with the main dif-ference being that we use 100 dimensional wordembeddings instead of 50 dimensions for the hor-ror genre.

6.1 Slice of Life Experiments

TextWorld uses a grammar to generate simi-lar games. Following Ammanabrolu and Riedl(2019), we use TextWorld’s “home” theme to gen-erate the games for the question-answering sys-tem. TextWorld is a framework that uses a gram-mar to randomly generate game worlds and quests.This framework also gives us information such asinstructions on how to finish the quest, and a listof actions that can be performed at each step basedon the current world state. We do not let our agentaccess this additional solution information or ad-missible actions list. Given the relatively smallquest length for TextWorld games—games can becompleted in as little as 5 steps—we generate 50such games and partition them into train and testsets in a 4:1 ratio. The traces are generated on thetraining set, and the question-answering system is

5

Table 1: Game statistics

Slice of life HorrorQA/Source Target QA Source TargetTextWorld 9:05 Lurking Horror Afflicted Anchorhead

Vocab size 788 297 773 761 2256Branching factor 122 677 - 947 1918Number of rooms 10 7 25 18 28Completion steps 5 25 289 20 39Words per obs. 65.1 45.2 68.1 81.2 114.2New triples per obs. 6.4 4.1 - 12.6 17.0% Vocab overlap 19.70 21.45 22.80 14.40 66.34Max. aug. reward 5 27 - 21 43

You

're in

bathroom

have

goldwatch

soiledclothing

have

bedroom

northof

is

luxurious

keys

tele-phone

wallet

has

has

has

has

endtable

sink

toilet

shower

has

has

has

cleanerclothes

has dresser

has

extremelysparse

is

You

're in

bathroom

have

goldwatch

soiledchothing

have

bedroom

northof

is

luxurious

keys

tele-phone

wallet

has

has

has

has

endtable

sink

toilet

shower

has

has

has

BedroomThis bedroom is extremely spare, with dirty laundry scattered

haphazardly all over the floor. Cleaner clothing can be found in the

dresser.Abathroomliestothesouth,whileadoortotheeastleadstothe

livingroom.Ontheendtableareatelephone,awalletandsomekeys.

>inventoryYouarecarrying:

somesoiledclothing(beingworn)

agoldwatch(beingworn)

>gosouthBathroomThis is a far from luxurious but still quite functional bathroom,with a

sink,toiletandshower.Thebedroomliestothenorth.

cleanerclothes

has dresser

has

extremelysparse

is

Figure 3: Partial unseeded knowledge graph examplegiven observations and actions in the game 9:05.

evaluated on the test set.We then pick a random game from the test set to

train our source task deep Q-network for this do-main. For this training, we use the reward functionprovided by TextWorld: +1 for each action takenthat moves the agent closer to finishing the quest;-1 for each action taken that extends the minimumnumber of steps needed to finish the quest fromthe current stage; 0 for all other situations.

We choose the game, 9:053 as our target taskgame due to similarities in structure in addition tothe vocabulary overlap. Note that there are multi-ple possible endings to this game and we pick thesimplest one for the purpose of training our agent.

3https://ifdb.tads.org/viewgame?id=qzftg3j8nh5f34i2

alley

half-crumbling

brickwalls

tall,woodenfence

endshere atYou

wall of thenorthernbuilding

narrow,transom style

window

has

hastotter

oppressivelyover

outsidethe realestateoffice

southeastof

lane

its wayback

endshere at

winds grim

little cul-de-sac

ancientleads

essentiallynowhere

shadowy

corner of ...twistingavenues

isis is

istucked

away in

OutsidetheRealEstateOffice

A grim little cul-de-sac, tucked away in a corner of the claustrophobic

tangleofnarrow,twistingavenuesthatlargelyconstitutetheolderportion

ofAnchorhead.Likemostofthestreetsinthiscity,itisancient,shadowy,

andleadsessentiallynowhere.Thelaneendshereattherealestateagent's

office,whichliestotheeast,andwindsitswaybacktowardthecenterof

towntothewest.Anarrow,garbage-chokedalleyopenstothesoutheast.

>gosoutheastAlley

Thisnarrowaperturebetweentwobuildingsisnearlyblockedwithpilesof

rotting cardboard boxes and overstuffed garbage cans. Ugly, half-

crumblingbrickwallstoeithersidetotteroppressivelyoveryou.Thealley

ends here at a tall, wooden fence. High up on the wall of the northern

buildingthereisanarrow,transom-stylewindow.

're inalley

half-crumbling

brickwalls

tall,woodenfence

endshere atYou

wall of thenorthernbuilding

narrow,transom style

window

has

hastotter

oppressivelyover

outsidethe realestateoffice

southeastof

lane

its wayback

endshere at

winds grim

little cul-de-sac

ancientleads

essentiallynowhere

shadowy

corner of ...twistingavenues

isis is

istucked

away in

OutsidetheRealEstateOffice

A grim little cul-de-sac, tucked away in a corner of the claustrophobic

tangleofnarrow,twistingavenuesthatlargelyconstitutetheolderportion

ofAnchorhead.Likemostofthestreetsinthiscity,itisancient,shadowy,

andleadsessentiallynowhere.Thelaneendshereattherealestateagent's

office,whichliestotheeast,andwindsitswaybacktowardthecenterof

towntothewest.Anarrow,garbage-chokedalleyopenstothesoutheast.

>gosoutheastAlley

Thisnarrowaperturebetweentwobuildingsisnearlyblockedwithpilesof

rotting cardboard boxes and overstuffed garbage cans. Ugly, half-

crumblingbrickwallstoeithersidetotteroppressivelyoveryou.Thealley

ends here at a tall, wooden fence. High up on the wall of the northern

buildingthereisanarrow,transom-stylewindow.

're in

Figure 4: Partial unseeded knowledge graph examplegiven observations and actions in the game Anchor-head.

6.2 Horror Experiments

For the horror domain, we choose Lurking Hor-ror4 to train the question-answering system on.The source and target task games are chosen as Af-flicted5 and Anchorhead6 respectively. However,due to the size and complexity of these two gamessome modifications to the games are required forthe agent to be able to effectively solve them.

4https://ifdb.tads.org/viewgame?id=jhbd0kja1t57uop

5https://ifdb.tads.org/viewgame?id=epl4q2933rczoo9x

6https://ifdb.tads.org/viewgame?id=op0uw1gn1tjqmjt7

6

Figure 5: Reward curve for select experiments in theslice of life domain.

We partition each of these games and make themsmaller by reducing the final goal of the gameto an intermediate checkpoint leading to it. Thischeckpoints were identified manually using walk-throughs of the game; each game has a natural in-termediate goal. For example, Anchorhead is seg-mented into 3 chapters in the form of objectivesspread across 3 days, of which we use only thefirst chapter. The exact details of the games afterpartitioning is described in Table 1. For LurkingHorror, we report numbers relevant for the oraclewalkthrough. We then pre-prune the action spaceand use only the actions that are relevant for thesections of the game that we have partitioned out.The majority of the environment is still availablefor the agent to explore but the game ends uponcompletion of the chosen intermediate checkpoint.

6.3 Reward Augmentation

The combined state-action space for a commer-cial text-adventure game is quite large and thecorresponding reward function is very sparse incomparison. The default, implied reward signalis to receive positive value upon completion ofthe game, and no reward value elsewhere. Thisis problematic from an experimentation perspec-tive as text-adventure games are too complex foreven state-of-the-art deep reinforcement learningagents to complete. Even using transfer learningmethods, a sparse reward signal usually results inineffective exploration by the agent.

To make experimentation feasible, we augmentthe reward to give the agent a dense reward sig-nal. Specifically, we use an oracle to generatestate-action traces (identical to how as when train-

Figure 6: Reward curve for select experiments in thehorror domain.

ing the question-answering system). An oracle isan agent that is capable of playing and finishing agame perfectly in the least number of steps pos-sible. The state-action pairs generated using per-fect walkthroughs of the game are then used ascheckpoints and used to give the agent additionalreward. If the agent encounters any of these state-action pairs when training, i.e. performs the rightaction given a corresponding state, it receives aproportional reward in addition to the standard re-ward built into the game. This reward is scaledbased on the game and is designed to be less thanthe smallest reward given by the original rewardfunction to prevent it from overpowering the built-in reward. We refer to agents using this techniqueas having “dense” reward and “sparse” reward oth-erwise. The agent otherwise receives no informa-tion from the oracle about how to win the game.

7 Results/Discussion

The structure of the experiments are such that thefor each of the domains, the target task game ismore complex that the source task game. The sliceof life games are also generally less complex thanthe horror games; they have a simpler vocabularyand a more linear quest structure. Additionally,given the nature of interactive fiction games, itis nearly impossible—even for human players—to achieve completion in the minimum number ofsteps (as given by the steps to completion in Ta-ble 1); each of these games are puzzle based andrequire extensive exploration and interaction withvarious objects in the environment to complete.

Table 2 and Table 3 show results for the slice oflife and horror domains, respectively. In both do-

7

Table 2: Results for the slice of life games. “KG-DQN Full” refers to KG-DQN when seeded first, trained on thesource, then transferred to the target. All experiment with QA indicate pre-training. S, D indicate sparse and densereward respectively.

Experiment Init. Rwd. Final Rwd. StepsSource Game (TextWorld)

KG-DQN no transfer 2.6 ± 0.73 4.7 ± 0.23 110.83 ± 4.92KG-DQN w/ QA 2.8 ± 0.61 4.9 ± 0.09 88.57 ± 3.45KG-DQN seeded 3.2 ± 0.57 4.8 ± 0.16 91.43 ± 1.89

Target Game (9:05)KG-DQN untuned (D) - 2.5 ±0.48 1479.0 ± 22.3

KG-DQN no transfer (S) - - 1916.0 ± 33.17KG-DQN no transfer (D) 0.8 ± 0.32 16.5 ± 1.58 1267.2 ± 7.5

KG-DQN w/ QA (S) - - 1428.0 ± 11.26KG-DQN w/ QA (D) 1.3 ± 0.24 17.4 ± 1.84 1127.0 ± 31.22KG-DQN seeded (D) 1.4 ± 0.35 16.7 ± 2.41 1393.33 ± 26.5

KG-DQN Full (D) 2.7 ± 0.65 19.7 ± 2.0 274.76 ± 21.45

Table 3: Results for horror games. Note that the reward type is dense for all results. “KG-DQN Full“ refers toKG-DQN seeded, transferred from source. All experiment with QA indicate pre-training.

Experiment Init. Rwd. Final Rwd. StepsSource Game (Afflicted)KG-DQN no transfer 3.0 ± 1.3 14.1 ± 1.73 1934.7 ± 85.67

KG-DQN w/ QA 4.3 ± 1.34 15.1 ± 1.60 1179 ± 32.07KG-DQN seeded 4.1 ± 1.19 14.6 ± 1.26 1125.3 ± 49.57

Target Game (Anchorhead)KG-DQN untuned - 3.8 ± 0.23 -

KG-DQN no transfer 1.0 ± 0.34 6.8 ± 0.42 -KG-DQN w/ QA 3.6 ± 0.91 24.8 ± 0.6 4874 ± 90.74KG-DQN seeded 1.7 ± 0.62 26.6 ± 0.42 4937 ± 42.93

KG-DQN full 4.1 ± 0.9 39.9 ± 0.53 4334.3 ± 56.13

mains seeding and QA pre-training improve per-formance by similar amounts from the baseline onboth the source and target task games. A seriesof t-tests comparing the results of the pre-trainingand graph seeding with the baseline KG-DQNshow that all results are significant with p < 0.05.Both the pre-training and graph seeding performsimilar functions in enabling the agent to exploremore effectively while picking high utility actions.

Even when untuned, i.e. evaluating the agenton the target task after having only trained on thesource task, the agent shows better performancethan training on the target task from scratch usingthe sparse reward. As expected, we see a furthergain in performance when the dense reward func-tion is used for both of these domains as well. Inthe horror domain, the agent fails to converge toa state where it is capable of finishing the gamewithout the dense reward function due to the hor-ror games being more complex.

When an agent is trained using on just the tar-get task horror game, Anchorhead, it does not con-verge to completion and only gets as far as achiev-ing a reward of approximately 7 (max. observedreward from the best model is 41). This corre-

sponds to a point in the game where the player isrequired to use a term in an action that the playerhas never observed before, “look up Verlac” whenin front of a certain file cabinet—“Verlac“ beingthe unknown entity. Without seeding or QA pre-training, the agent is unable to cut down the ac-tion space enough to effectively explore and findthe solution to progress further. The relative effec-tiveness of the gains in initial reward due to seed-ing appears to depend on the game and the cor-responding static text document. In all situationsexcept Anchohead, seeding provides comparablegains in initial reward as compared to QA — thereis no statistical difference between the two whenperforming similar t-tests.

When the full system is used—i.e. we seedthe knowledge graph, pre-train QA, then train thesource task game, then the target task game usingthe augmented reward function—we see a signif-icant gain in performance, up to an 80% gain interms of completion steps in some cases. The bot-tleneck at reward 7 is still difficult to pass, how-ever, as seen in Fig. 6, in which we can see thatthe agent spends a relatively long time around thisreward level unless the full transfer technique is

8

used. We further see in Figures 5, 6 that transfer-ring knowledge results in the agent learning thishigher quality policy much faster. In fact, we notethat training a full system is more efficient thanjust training the agent on a single task, i.e. train-ing a QA system then a source task game for 50episodes then transferring and training a seededtarget task game for 50 episodes is more effectivethan just training the target task game by itself foreven 150+ episodes.

8 Conclusions

We have demonstrated that using knowledgegraphs as a state representation enables effi-cient transfer between deep reinforcement learn-ing agents designed to play text-adventure games,reducing training times and increasing the qualityof the learned control policy. Our results show thatwe are able to extract a graph from a general statictext resource and use that to give the agent knowl-edge regarding domain specific vocabulary, objectaffordances, etc. Additionally, we demonstratethat we can effectively transfer knowledge usingdeep Q-network parameter weights, either by pre-training portions of the network using a question-answering system or by transferring parametersfrom a source to a target game. Our agent trainsfaster overall, including the number of episodes re-quired to pre-train and train on a source task, andperforms up to 80% better on convergence than anagent not utilizing these techniques.

We conclude that knowledge graphs enabletransfer in deep reinforcement learning agentsby providing the agent with a more explicit–andinterpretable–mapping between the state and ac-tion spaces of different games. This mappinghelps overcome the challenges twin challenges ofpartial observability and combinatorially large ac-tion spaces inherent in all text-adventure gamesby allowing the agent to better explore the state-action space.

9 Acknowledgements

This material is based upon work supported by theNational Science Foundation under Grant No. IIS-1350339. Any opinions, findings, and conclusionsor recommendations expressed in this material arethose of the author(s) and do not necessarily reflectthe views of the National Science Foundation.

ReferencesPrithviraj Ammanabrolu and Mark O. Riedl. 2019.

Playing text-adventure games with graph-baseddeep reinforcement learning. In Proceedings of2019 Annual Conference of the North AmericanChapter of the Association for Computational Lin-guistics: Human Language Technologies, NAACL-HLT 2019.

Gabor Angeli, Johnson Premkumar, Melvin Jose, andChristopher D. Manning. 2015. Leveraging Lin-guistic Structure For Open Domain Information Ex-traction. In Proceedings of the 53rd Annual Meet-ing of the Association for Computational Linguisticsand the 7th International Joint Conference on Natu-ral Language Processing (Volume 1: Long Papers).

Antoine Bordes, Nicolas Usunier, Ronan Collobert,and Jason Weston. 2010. Towards understanding sit-uated natural language. In Proceedings of the 2010International Conference on Artificial Intelligenceand Statistics.

Danqi Chen, Adam Fisch, Jason Weston, and AntoineBordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computa-tional Linguistics (ACL).

Marc-Alexandre Cote, Akos Kadar, Xingdi Yuan, BenKybartas, Emery Fine, James Moore, MatthewHausknecht, Layla El Asri, Mahmoud Adada,Wendy Tay, and Adam Trischler. 2018. TextWorld :A Learning Environment for Text-based Games. InProceedings of the ICML/IJCAI 2018 Workshop onComputer Games, page 29.

Nancy Fulda, Daniel Ricks, Ben Murdoch, and DavidWingate. 2017. What can you do with a rock? affor-dance extraction via word embeddings. In Proceed-ings of the Twenty-Sixth International Joint Con-ference on Artificial Intelligence, IJCAI-17, pages1039–1045.

Milica Gasic, Catherine Breslin, Matthew Hender-son, Dongho Kim, Martin Szummer, Blaise Thom-son, Pirros Tsiakoulis, and Steve J. Young. 2013.Pomdp-based dialogue manager adaptation to ex-tended domains. In SIGDIAL Conference.

Matan Haroush, Tom Zahavy, Daniel J Mankowitz, andShie Mannor. 2018. Learning How Not to Act inText-Based Games. In Workshop Track at ICLR2018, pages 1–4.

Matthew Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Cote, and Xingdi Yuan. 2019a. Interac-tive fiction games: A colossal adventure. CoRR,abs/1909.05398.

Matthew Hausknecht, Ricky Loynd, Greg Yang,Adith Swaminathan, and Jason D. Williams. 2019b.Nail: A general interactive fiction agent. CoRR,abs/1902.04259.

9

Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Li-hong Li, Li Deng, and Mari Ostendorf. 2016. DeepReinforcement Learning with a Natural LanguageAction Space. In Association for ComputationalLinguistics (ACL).

Girish Joshi and Girish Chowdhary. 2018. Cross-domain transfer in reinforcement learning using tar-get apprentice. In Proceedings of the InternationalConference on Robotics and Automation, pages7525–7532.

George Konidaris and Andrew G. Barto. 2007. Build-ing portable options: Skill transfer in reinforcementlearning. In IJCAI.

George Konidaris, Ilya Scheidwasser, and AndrewG. Barto. 2012. Transfer in reinforcement learningvia shared features. The Journal of Machine Learn-ing Research, 13:1333–1371.

Long-Ji Lin. 1993. Reinforcement learning for robotsusing neural networks. Ph.D. thesis, Carnegie Mel-lon University.

Yaxin Liu and Peter Stone. 2006. Value-function-basedtransfer for reinforcement learning using structuremapping. In AAAI.

Karthik Narasimhan, Regina Barzilay, and TommiJaakkola. 2017. Deep transfer in reinforcementlearning by language grounding. Journal of Artifi-cial Intelligence Research, 63.

Karthik Narasimhan, Tejas Kulkarni, and ReginaBarzilay. 2015. Language Understanding for Text-based Games Using Deep Reinforcement Learning.In Conference on Empirical Methods in NaturalLanguage Processing (EMNLP).

Trung Thanh Nguyen, Tomi Silander, and Tze-YunLeong. 2012. Transferring expectations in model-based reinforcement learning. In NIPS.

Emilio Parisotto, Jimmy Ba, and Ruslan R. Salakhutdi-nov. 2016. Actor-mimic: Deep multitask and trans-fer reinforcement learning. CoRR, abs/1511.06342.

Janarthanan Rajendran, Aravind S. Lakshminarayanan,Mitesh M. Khapra, P. Prasanna, and BalaramanRavindran. 2017. Attend, adapt and transfer: At-tentive deep architecture for adaptive transfer frommultiple sources in the same domain. In ICLR.

Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Des-jardins, Hubert Soyer, James Kirkpatrick, KorayKavukcuoglu, Razvan Pascanu, and Raia Had-sell. 2016. Progressive neural networks. CoRR,abs/1606.04671.

Richard S Sutton and Andrew G Barto. 2018. Rein-forcement Learning: An Introduction. MIT Press.

Ruo Yu Tao, Marc-Alexandre Cote, Xingdi Yuan, andLayla El Asri. 2018. Towards solving text-based

games by producing adaptive action spaces. In Pro-ceedings of the 2018 NeurIPS Workshop on Word-play: Reinforcement and Language Learning inText-based Games.

Matthew E. Taylor, Nicholas K. Jong, and Peter Stone.2008. Transferring instances for model-based rein-forcement learning. In ECML/PKDD.

Matthew E. Taylor and Peter Stone. 2009. Trans-fer learning for reinforcement learning domains: Asurvey. Journal of Machine Learning Research,10:1633–1685.

Zhuoran Wang, Tsung-Hsien Wen, Pei hao Su,and Yannis Stylianou. 2015. Learning domain-independent dialogue policies via ontology param-eterisation. In SIGDIAL Conference.

H. Yin and S. J. Pan. 2017. Knowledge transferfor deep reinforcement learning with hierarchicalexperience replay. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence,AAAI’17, pages 1640–1646. AAAI Press.

Vinicius Zambaldi, David Raposo, Adam Santoro, Vic-tor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls,David Reichert, Timothy Lillicrap, Edward Lock-hart, Murray Shanahan, Victoria Langston, RazvanPascanu, Matthew Botvinick, Oriol Vinyals, and Pe-ter Battaglia. 2019. Deep reinforcement learningwith relational inductive biases. In InternationalConference on Learning Representations.

10


Relation Prediction for Unseen-Entities Using Entity-Word Graphs

Yuki Tagawa, Motoki Taniguchi, Yasuhide Miura, Tomoki Taniguchi,Tomoko Ohkuma, Takayuki Yamamoto, Keiichi Nemoto

Fuji Xerox Co., Ltd.tagawa.yuki taniguchi.motoki, yasuhide.miura, taniguchi.tomoki,

ohkuma.tomoko, yamamoto.takayuki, [email protected]

Abstract

Knowledge graphs (KGs) are generally usedfor various NLP tasks. However, as KGs stillmiss some information, it is necessary to de-velop Knowledge Graph Completion (KGC)methods. Most KGC researches do not focuson the Out-of-KGs entities (Unseen-entities),we need a method that can predict the relationfor the entity pairs containing Unseen-entitiesto automatically add new entities to the KGs.In this study, we focus on relation predictionand propose a method to learn entity represen-tations via a graph structure that uses Seen-entities, Unseen-entities and words as nodescreated from the descriptions of all entities. Inthe experiments, our method shows a signifi-cant improvement in the relation prediction forthe entity pairs containing Unseen-entities.

1 Introduction

Knowledge graphs (KGs) are a set of triplesin the form of (subject, relation, object), e.g.,(Tokyo, Capital, Japan). These are an importantresource for NLP tasks, such as entity linking(Radhakrishnan et al., 2018), question answering(Sun et al., 2018; Mohammed et al., 2018), andtext generation (Koncel-Kedziorski et al., 2019).

Although many researches utilize KGs, theseare still incomplete and miss some information.For example, in the Freebase (Bollacker et al.,2008), 71% of the person entities are missinga birthplace (Dong et al., 2014; Krompaß et al.,2015). In addition, as new information is in-creased with time, the KGs need to be updated.Therefore, to further increase the scale of theKGs, many researches have focused on Knowl-edge Graph Completion (KGC), which aims atpredicting the missing information; for example,relation prediction that predicts the relation thatholds between entities.

In recent years, embedding-based KGC meth-ods have been proposed to learn entities andrelations representations in KGs (Bordes et al.,2013; Xie et al., 2016b; Shi and Weninger,2017; Schlichtkrull et al., 2018; An et al., 2018;Nguyen et al., 2018, 2019). However, previousresearches did not focus on the Out-of-KGs(OOKG) problem: OOKG is a problem thatcan not learn entity representations that are notincluded in the training triples (Unseen-entities).Therefore, the OOKG problem needs to beaddressed to add new entities to the KGs such thatthey are automatically extended.

To cope with the OOKG problem, we buildUnseen-entity representations using the entity de-scriptions. Figure 1(a) illustrates examples of en-tity descriptions, the descriptions include distinc-tive words (e.g., game, Nintendo) that representthe entity. Entities with co-occurring distinctivewords in the descriptions are considered to be re-lated. For example, the Unseen-entity “Donkeykong” was created by the “Shigeru Miyamoto”;therefore, these are highly relevant entities.

DKRL (Xie et al., 2016a) addresses the OOKGproblem using entity descriptions. DKRL simplyencodes the Unseen-entity descriptions separatelywith Convolutional Neural Networks (CNN) toobtain its representations. However, DKRL doesnot consider the relations with other entities whileobtaining the Unseen-entity representations. Bycontrast, we can obtain better Unseen-entity rep-resentations by considering the information on re-lated entities using word co-occurrences from thedescriptions of all entities.

We address relation prediction and propose amethod using entity descriptions to address theOOKG problem. Our method creates an Entity-Word graph from the descriptions of all entities.Figure 1(b) illustrates a part of the graph cre-ated from the entity descriptions. By creating

11

SuperSmashBros.Brawl,knowninJapanasDairantōSmashBrothersX,...publishedbyNintendo fortheWiivideogame console.…Brawlisthefirstgame intheseriestoexpandpastNintendo charactersand…

DonkeyKong

Nintendo

ShigeruMiyamoto

SuperSmashBros.Brawl

Nintendo Co.,Ltd.isaJapanesemultinationalconsumerelectronicscompanyheadquarteredinKyoto,Japan.Nintendo istheworld'slargestvideogame companybyrevenue.…

DonkeyKongisaplatformgame developedin1994byNintendo…theplayertakescontrolofMario andmustrescuePaulinefromDonkeyKong.…

ShigeruMiyamotoisaJapanesevideogame designerandproducer.…MiyamotojoinedNintendo in1977,…MiyamotohascreatedincludeMario,DonkeyKong,…

(a) Entity descriptions

ShigeruMiyamoto

SuperSmashBros.Brawl

DonkeyKong

Nintendo

game

Nintendo

Donkey

Mario

Kong

(b) A part of an Entity-Word graph

Figure 1: Examples of (a) entity descriptions and (b) a part of Entity-Word graph. The green node indicates aSeen-entity and the gray node indicates an Unseen-entity. The orange node indicates a word. The Entity-Wordedge has a TF-IDF score and Word-Word edge has a PMI score.

the Entity-Word graph, even Unseen-entities canexplicitly be connected with other entities. Ourmethod encodes the graph with Graph Convolu-tional Networks (GCNs) (Kipf and Welling, 2017)to learn entity representations considering theglobal features of the entire graph. GCNs simplifythe convolutional operations on the graph, andlearn node representations based on their neigh-borhood information. GCNs are utilized for sev-eral NLP tasks (Zhang et al., 2018; De Cao et al.,2019). By encoding the Entity-Word graph withGCNs, not only the descriptions information butalso information of the related entities is propa-gated to the Unseen-entities through words. Weexpect that the entity representations learned viaour Entity-Word graph can contribute to the im-provement in the performance of the KGC.

In summary, our contributions are as follows:

• We propose a method to learn Seen- andUnseen-entity representations using entitydescriptions via GCNs. To the best of ourknowledge, our work is the first considerationof utilizing entity descriptions via a graphstructure to the KGC.

• In the experiments, our method significantlyoutperforms existing models at predicting therelation between the entity pairs containingUnseen-entities. Furthermore, our methodoutperforms existing models at predicting therelation between the pairs of Seen-entities.

2 Related work

TransE (Bordes et al., 2013) is a pioneering workon KGC. The energy function of TransE is defined

as: E(s, r, o) = |s+ r− o|, where, s, r and oform the representation of a fact triple (s, r, o). Theembedding table in TransE converts a one-hot vec-tor into a continuous vector space to learn entitiesand relations representations. TransE minimizesthe loss function L =

∑t∈S

∑t′∈S′ max(E(t) +

γ−E(t′), 0), where, γ is the margin, S representsthe fact triples in the KGs, and S′ represents theunfact triples that are not in the KGs. The un-fact triple t′ is created by replacing the subjector object entity in the fact triple t with anotherone. Some variants of TransE are also proposedin this branch (Wang et al., 2014; Ji et al., 2015;Lin et al., 2015b,a). However, these models can-not learn Unseen-entity representations as theseare not included in the training triples.

DKRL (Xie et al., 2016a) is an extension ofTransE that uses entity descriptions to cope withthe OOKG problem. While DKRL directly buildsUnseen-entity representation from its descriptionto avoid the OOKG problem, DKRL does notconsider the relation between Unseen-entities andother entities. In this study, we construct anEntity-Word graph from the descriptions of allentities. By encoding the graph using GCNs,the entities information is also propagated to theUnseen-entities through words. However, DKRLdoes not have a mechanism by which the infor-mation of the related entities is propagated to theUnseen-entities.

There is a model that addresses triple classifi-cation for triples including Unseen-entities, whichutilizes auxiliary knowledges instead of textualinformation to cope with the OOKG problem

12

(Hamaguchi et al., 2017). For example, to clas-sify the triple (Blade-Runner, is-a, Science-fiction)containing the Unseen-entity “Blade-Runner” aseither true or false, this model is provided withan auxiliary triple (Blade-Runner, based-on, Do-Androids-Dream-of-Electric-Sheep?). The au-thors deal with the special case and their problemsetting is different from ours.

3 Proposed method

Our method first creates a TF-IDF, PMI, and Self-loop graph from the descriptions of all entities andthen encodes each graph with different GCN lay-ers to learn entity representations. Finally, our de-coder predicts the relation between the entities us-ing entity representations.

3.1 Graph creationThe TF-IDF graph has edges between entity andword nodes, and the weight of the edge has a TF-IDF score. On this graph, the Unseen-entities areexplicitly connected to the other entities via wordnodes. By encoding the TF-IDF graph with GCNs,our method can learn Unseen-entity representa-tions considering the information on the relatedentities. The TF-IDF adjacency matrix Atf -idf isexpressed by the following formula:

Ai,jtf -idf =

TF-IDF(i, j) ( entity i, word j. )TF-IDF(j, i) ( entity j, word i. )0 ( otherwise. )

.

(1)

The PMI graph has edges between word nodes,and the weight of the edge has a PMI score. Wordinformation is propagated among words by encod-ing this graph with GCNs. The PMI adjacencymatrix Apmi is expressed by the following for-mula:

Ai,jpmi =

max(PMI(i, j), 0) ( word i, j. )0 ( otherwise. )

.

(2)

In addition to these graphs, we create a Self-loop graph with an edge from a node to itself. Thisconnection indicates that the node representationsare updated using their own representations whenour method encodes this graph with GCNs. TheSelf-loop adjacency matrix Aself is a diagonal ma-trix in which all diagonal entries equal to 1.

3.2 Encoder

Our encoder is expressed by the following formu-las:

hl+1tf -idf = σ(Atf -idfh

loutW

ltf -idf ), (3)

hl+1pmi = σ(Apmih

loutW

lpmi), (4)

hl+1self = σ(Aselfh

loutW

lself ), (5)

hl+1out = hl+1

tf -idf + hl+1pmi + hl+1

self , (6)

where htf -idf , hpmi and hself are the node repre-sentation matrices; hout is an output node repre-sentation matrix; Each A = D− 1

2AD− 12 is a nor-

malized adjacency matrix of each A, representinga graph, where D is the degree matrix of A; Wis a learning parameter matrix; and σ is an activa-tion function. We used two GCN layers for the en-coding. By aggregating the node representationsand stacking multiple GCN layers, it is possible tolearn the entity representations with higher-orderneighborhood information.

3.3 Decoder

A decoder is expressed by the following formula:

p = softmax(tanh(hsbjout ⊕ hobjout)Wdec + b), (7)

where p is a probability distribution for the all re-lations, hsbjout and hobjout are subject and object repre-sentations, respectively, ⊕ indicates concatenationof vector representation, Wdec is a learning param-eter matrix, and b is a bias term. We apply a crossentropy as a loss function.

4 Experiments

4.1 Datasets and evaluation metrics

In this study, we evaluated our method on apopular benchmark dataset FB15K (Bordes et al.,2013) from Freebase. Our method utilizes entitydescriptions; therefore, similar to DKRL, we re-moved 47 entities that have no descriptions, andthe triples containing these entities in FB15K. Toevaluate the relation prediction for the pair in-cluding the Unseen-entity, we also evaluated ourmethod on FB20K. Although FB20K shares thesame training and validation triples as FB15K, thesubject or object entity in the test triples is theUnseen-entity. The statistics of these datasets havebeen described in (Xie et al., 2016a).

Following previous researches on this branch,we used three metrics, namely Mean Rank (MR),

13

Pair type Methods MR MRR Hits@1

S-U [11,586]DKRL (CNN) 3 15.99 0.69 62.8Our method 8.03 0.76 67.2

U-S [18,753]DKRL (CNN) 12.72 0.71 64.8Our method 6.43 0.76 67.2

U-U [151]DKRL (CNN) 25.66 0.33 7.9Our method 35.09 0.73 68.7

TotalDKRL (CNN) 8.91 0.74 65.8Our method 7.18 0.76 67.5

Table 1: Evaluation results on the FB20K (Unseen set-ting) and Filtered setting. S-U means that the object isan Unseen-entity. U-S means that subject is an Unseen-entity. U-U means that both are Unseen-entities. Thenumber in the brackets indicates the number of sam-ples. The best scores are highlighted in bold.

Mean Reciprocal Rank (MRR), and Hits@1 forevaluation: a lower MR, higher MRR, and higherHits@1 indicate better performance. We also fol-lowed two evaluation settings, namely Raw (R)and Filtered (F) (Bordes et al., 2013).

4.2 Experimental settings

As a preprocessing for descriptions, we performedcleaning and tokenizing using (Kim, 2014) andthen removed the stop words defined by theNLTK1 and words less than 5 times in frequency.We calculated the TF-IDF and PMI values fromthe preprocessed descriptions, following which wecreated the Entity-Word graphs. The Entity-Wordgraphs contain the entities included in the train-ing and test sets. Our model learns on the giventhe Entity-Word graphs and the gold labels of thetraining entity pairs. For our model2, we selectedthe window size S as 5. In the 1st layer, we se-lected the dimension of W as 128 and ReLU asthe activation function and then applied dropoutto htf -idf , hpmi, and hself ; the ratio was 0.5. Inthe 2nd layer, we selected the dimension of W as64 and an identity function as the activation func-tion, and then applied L2 norm to htf -idf , hpmi,and hself . We selected one-hot node representa-tions as the initial node representation matrix h0outand applied Adam (Kingma and Ba, 2015) as anoptimizer with the learning rate of 0.02.

1https://www.nltk.org/2We determined parameters by performing grid search for

some candidates. We applied early-stopping on the loss onthe validation set and selected a best model for the test time.

3For comparison and analysis, we implemented DKRL.4We determined 0.5 as the weight of the DKRL(CNN)

from the validation results.

MethodsMR Hits@1

R F R FTransE (Bordes et al., 2013) 2.8 2.5 65.1 84.3TransR (Lin et al., 2015b) 2.49 2.09 70.2 91.6TKRL (RHE) (Xie et al., 2016b) 2.12 1.73 71.1 92.8PTransE (ADD, 2step) (Lin et al., 2015a) 1.7 1.2 69.5 93.6DKRL (CNN)+TransE4 2.15 1.80 70.8 90.5ProjE (wlistwise) (Shi and Weninger, 2017) 1.5 1.2 75.5 95.6CACL (Oh et al., 2018) 1.52 1.19 77.6 96.4Our method 1.51 1.13 72.6 95.0

Table 2: Evaluation results on the FB15K (Seen set-ting). The scores other than DKRL(CNN)+TransE aredirectly taken from (Oh et al., 2018) and the each orig-inal paper.

Entity1 : Bleach The DiamondDust RebellionBleach: The DiamondDust Rebellion is the second animatedfilm adaptation of the anime and manga series Bleach. ...Entity2 : Cameron Arthur ClarkeCameron Arthur “Cam” Clarke is a prolific American voiceactor and singer, ...Entity3 : Porco RossoPorco Rosso is a 1992 Japanese animated adventure film writ-ten and directed by Hayao Miyazaki. It is based on HikoteiJidai, a three-part watercolor manga by Miyazaki. ...

Table 3: Entities in the FB15K and their descriptions,and the relation “dubbing performance/actor” holdsbetween Entity1 and 2.

4.3 Evaluation results and discussionIn Table 1, our method shows a significant im-provement on FB20K. Our method treats theUnseen-entities differently from DKRL. Entity-Word graphs are created and encoded in ourmethod; therefore, our method can learn the rep-resentation of both the Seen- and Unseen-entities.Moreover, the information of the related entitiesis propagated to the Unseen-entities to assist thelearning of Unseen-entities in our method. On theother hand, since DKRL only uses the descriptionsinformation to obtain the Unseen-entity represen-tations, DKRL does not consider the related enti-ties information. Therefore, we concluded that ourmethod outperforms the DKRL on the FB20K.

In Table 25, our method shows a better MR thanthe previous models. Table 3 shows the entitiesin the FB15K and their descriptions, and the rela-tion “dubbing performance/actor” holds betweenEntity1 and 2. Our method correctly predicted thisrelation, whereas our implemented DKRL mispre-dicted this relation as “performance/actor”. Toaccurately predict this relation, the KGC modelneeds to recognize the attribute that Entity1 is

5In many previous models, the MRR was not reported;therefore, we compare our method with the others based onthe MR and Hits@1.

14

Japanese works and Entity2 is an American voiceactor. However, the description of Entity1 does notdirectly indicate that Entity1 can be attributed toJapanese works. Therefore, the KGC model needsto recognize this from a few features such as theJapanese-English words “manga” and “anime”.The word “manga” also appears in the descriptionof Entity3, which is a Japanese movie, and this de-scription also contains words specific to Japaneseworks such as “Japanese animated” and “HayaoMiyazaki” who is famous Japanese animator. Ourmethod makes use of the graph structure, Entity1is located near the Japanese works entities such asEntity3 on the graph, which propagates this infor-mation to Entity1. Therefore, our method recog-nizes that Entity1 is Japanese works, and can cor-rectly predict the relation.

5 Conclusion

In this study, we proposed a method to learn en-tity representations using entity descriptions viagraph structure. In the experiments, the perfor-mance of the proposed model showed a signifi-cant improvement on the FB20K; furthermore, itoutperforms the previous models on the FB15K.However, although the word order information(e.g., phrase) is an important clue for the re-lation prediction, our model disregards it whencreating the Entity-Word graph. Thus, in fu-ture research, we plan to integrade our encoderwith LSTM (Hochreiter and Schmidhuber, 1997)which can capture the word order information.

ReferencesBo An, Bo Chen, Xianpei Han, and Le Sun. 2018. Ac-

curate text-enhanced knowledge graph representa-tion learning. In Proceedings of NAACL.

Kurt Bollacker, Colin Evans, Praveen Paritosh, TimSturge, and Jamie Taylor. 2008. Freebase: a col-laboratively created graph database for structuringhuman knowledge. In Proceedings of SIGMOD.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko.2013. Translating embeddings for modeling multi-relational data. In Proceedings of NIPS.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019.Question answering by reasoning across documentswith graph convolutional networks. In Proceedingsof NAACL.

Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, WilkoHorn, Ni Lao, Kevin Murphy, Thomas Strohmann,

Shaohua Sun, and Wei Zhang. 2014. Knowledgevault: A web-scale approach to probabilistic knowl-edge fusion. In Proceedings of SIGKDD.

Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo,and Yuji Matsumoto. 2017. Knowledge transfer forout-of-knowledge-base entities: A graph neural net-work approach. In Proceedings of IJCAI.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Longshort-term memory. Neural computation.

Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and JunZhao. 2015. Knowledge graph embedding via dy-namic mapping matrix. In Proceedings of ACL.

Yoon Kim. 2014. Convolutional neural networks forsentence classification. In Proceedings of EMNLP.

Diederick P Kingma and Jimmy Ba. 2015. Adam: Amethod for stochastic optimization. In Proceedingsof ICLR.

Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutionalnetworks. In Proceedings of ICLR.

Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan,Mirella Lapata, and Hannaneh Hajishirzi. 2019.Text Generation from Knowledge Graphs withGraph Transformers. In Proceedings of NAACL.

Denis Krompaß, Stephan Baier, and Volker Tresp.2015. Type-constrained representation learning inknowledge graphs. In Proceedings of ISWC.

Yankai Lin, Zhiyuan Liu, Huan-Bo Luan, MaosongSun, Siwei Rao, and Song Liu. 2015a. Modelingrelation paths for representation learning of knowl-edge bases. In Proceedings of EMNLP.

Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, andXuan Zhu. 2015b. Learning entity and relation em-beddings for knowledge graph completion. In Pro-ceedings of AAAI.

Salman Mohammed, Peng Shi, and Jimmy Lin. 2018.Strong baselines for simple question answering overknowledge graphs with and without neural net-works. In Proceedings of NAACL.

Dai Quoc Nguyen, Tu Dinh Nguyen, Dat QuocNguyen, and Dinh Phung. 2018. A novel embed-ding model for knowledge base completion basedon convolutional neural network. In Proceedings ofNAACL.

Dai Quoc Nguyen, Thanh Vu, Tu Dinh Nguyen,Dat Quoc Nguyen, and Dinh Phung. 2019. A cap-sule network-based embedding model for knowl-edge graph completion and search personalization.In Proceedings of NAACL.

Byungkook Oh, Seungmin Seo, and Kyong-Ho Lee.2018. Knowledge graph completion by context-aware convolutional learning with multi-hop neigh-borhoods. In Proceedings of CIKM.

15

Priya Radhakrishnan, Partha Talukdar, and VasudevaVarma. 2018. Elden: Improved entity linking us-ing densified knowledge graphs. In Proceedings ofNAACL.

Michael Schlichtkrull, Thomas N Kipf, Peter Bloem,Rianne Van Den Berg, Ivan Titov, and Max Welling.2018. Modeling relational data with graph convolu-tional networks. In Proceedings of ESWC.

Baoxu Shi and Tim Weninger. 2017. Proje: Embed-ding projection for knowledge graph completion. InProceedings of AAAI.

Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, KathrynMazaitis, Ruslan Salakhutdinov, and William Co-hen. 2018. Open domain question answering usingearly fusion of knowledge bases and text. In Pro-ceedings of EMNLP.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and ZhengChen. 2014. Knowledge graph embedding by trans-lating on hyperplanes. In Proceedings of AAAI.

Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, andMaosong Sun. 2016a. Representation learning ofknowledge graphs with entity descriptions. In Pro-ceedings of AAAI.

Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2016b.Representation learning of knowledge graphs withhierarchical types. In Proceedings of IJCAI.

Yuhao Zhang, Peng Qi, and Christopher D. Manning.2018. Graph convolution over pruned dependencytrees improves relation extraction. In Proceedingsof EMNLP.

16


Scalable graph-based method for individual named entity identification

Sammy Khalife Michalis VazirgiannisLIX, CNRS, Ecole Polytechnique

Institut Polytechnique de Paris, 91128 Palaiseau, [email protected]@lix.polytechnique.fr

Abstract

In this paper, we consider the named entitylinking (NEL) problem. We assume a set ofqueries, named entities, that have to be iden-tified within a knowledge base. This know-ledge base is represented by a text databasepaired with a semantic graph, endowed witha classification of entities (ontology). Wepresent state-of-the-art methods in NEL, andpropose a new method for individual identi-fication requiring few annotated data samples.We demonstrate its scalability and perform-ance over standard datasets, for several on-tology configurations. Our approach is well-motivated for integration in real systems. In-deed, recent deep learning methods, despitetheir capacity to improve experimental preci-sion, require lots of parameter tuning alongwith large volume of annotated data.

1 Introduction

1.1 Basic concepts and definitionsThe purpose of Named entity discovery (NED) ininformation retrieval is two-fold. First, it aims atextracting pre-defined sets of words from text doc-uments: this corresponds to Named entity recog-nition (NER). These words are representations ofnamed entities (such as names, places, locations,...). Then, these entity mentions paired with theircontext are seen as queries to be identified withindatabase: this corresponds to named entity linking(NEL). NEL is also refered as named entity disam-biguation. The interest in NEL has grown recentlyin several fields: in bioinformatics, to obtain loc-ations of viral sequences from databases (Weis-senbacher et al., 2015), or to process biomedicallitterature (Zheng et al., 2015). It also revealedto be useful in recruitment in order to identifyemployer names in a database (Liu et al., 2018).Firstly, it is important to stress that the subtaskof NED, Named entity recognition (NER), is not

trivial since we do not have an exhaustive list ofthe possible spelling of named entities. Moreovertheir text representation can change (for example,“J. Kennedy" vs. “John Kennedy"). In this paperwe focus on the second task, Named entity linking(NEL).

Named entity (and Mention/Query): An entityis a real-world object and usually has a physicalexistence. It is denoted with a proper name. In theexpression “Named Entity", the word “Named"aims to restrict the possible set of entities to onlythose for which one or many rigid designatorsstands for the referent (Nadeau and Sekine, 2007).When a named entity appears in a document, itssurface form can also be refered as a mention. Fi-nally, a query refers to the mention, the contextwhere it appears, and a type of entity considered.

Ontology: In this paper, our definition of an on-tology is represented as a tree of entity types. Inthe following, the variable T represents the totalnumber of nodes of this tree minus one (we don’tcount the root node since it is uninformative). Ori-ginally, entities had a very limited number of types(Nadeau and Sekine, 2007), such as person (PER),organization (ORG), and localization (GPE) (i.eT = 3). These types play a central role fornamed entity recognition and identification. Anexample of ontology is in Fig. 1. More recently,due to the increase in the volume of the web se-mantics data, fine-grained classifications are avail-able, with hundreds of entity types, similarly toDBPedia ontology1 (Lehmann et al., 2015).

Knowledge base/graph: A Knowledge base is adatabase providing supplementary descriptive andsemantic information about entities. The semanticinformation is contained in a knowledge graph,where a node represents an entity, and an edge rep-resents a semantic relation. The knowledge graph

1http://wiki.dbpedia.org/services-resources/ontology

17

Entity

PER

PoliticianORG

Political Party Association

GPE

City

Figure 1: Example of ontology, T = 7

E1 - Politician - John F. KennedyJohn F. Kennedy is served as

the 35th President of the U.S.A

E2 - Political Party - Demo-cratic Party (United States)

The Democratic Party is a major con-temporary political party in the U.S.A

E3 - City - WashingtonWashington is the capital of the U.S.A

Figure 2: Representation of a unweighted directed se-mantic graph (Wikipedia/NIST TAC-KBP Challenge2010). An edge between two entities E1 and E2 rep-resents a url link from E1 web page to E2 web page.

can be of any kind (directed, weighted, ...). SeeFig. 2 for an example.

Named entity linking (NEL): Given a namedentity query, the purpose of named entity linkingis to identify the corresponding ground truth entity(gold entity) in a database (knowledge base). Fora detailed description of a concrete competition inentity linking, we refer to (Ji et al., 2014).

Individual & collective linking: Linking can bedone individually or collectively. In the first case,queries are independent. In the collective frame-work, we consider a set of queries that usually ori-ginates from the same document, and for whichgold entities (i.e ground truth entities) should havesome proximity, or coherence. In this work, wepropose individual linking approach.

1.2 Contributions

In this work, we provide a brief survey of exist-ing methods for named entity linking. Then, weinvestigate a method for individual named entitylinking. The first step of this method, refered asentity filtering, reduces entity candidates to top Kentities for one query. The second step, referedas entity identification, aims at identifying the trueentity among the remaining K candidates, based

on a new graph-based algorithm. We include anexperimental evaluation of our method with sev-eral datasets, with an analysis of the impact ofparameter K, the ontology parameter T , and a de-tailed comparison with existing approaches. Theimplementation used for experiments is availableat our repository2. We do not include in this pa-per work on Fine-grained named entity recogni-tion (Ling and Weld, 2012). Moreover, we do notinclude NIL-detection problem (detect if a query isreferring to an entity that is not in the knowledgebase, for instance (Ji et al., 2014)).

2 Related work

In the following subsections, we present threefamilies of algorithms for named entity linking.Notations: E = 1, ..., E ⊂ N: indexes of entit-ies and Q = 1, ..., Q ⊂ N: indexes of queries,ei: system’s output entity index for query index qi.

2.1 Graphs for NEDFormulation: Given a scoring function definedbetween queries and entities, let Wi,j the corres-ponding score between the query i and the entityj. For individual disambiguation, one wants toperform independent query-entity attribution. Astraightforward formulation is:

ei = argmaxj∈E

Wi,j (1)

In this case, the total cost is separable in the vari-able i, but the score Wi,j can use the knowledgegraph structure: this is the case in our approach.For the sake of completeness, we give a descrip-tion of the collective linking formulation. In thisframework the optimization formulation is differ-ent: the underlying gold entities should respectsome arbitrary semantic coherence. The coher-ence information is represented within a coher-ence function ψ : EQ → R between the en-tity candidates. Usually ψ is defined from theknowledge graph structure. For example ψ canbe defined using the opposite sign of the shortest-path function on the knowledge graph. With thesenotations, the set of selected entities are formallydefined as:

e1, ..., eQ = argmaxj1,..,jq∈EQ

[(

Q∑

l=1

Wl,jl)+ψ(j1, ..., jQ)]

(2)2If needed, please send your request at [email protected].

18

Eq. (2) can be formulated as a boolean integerprogram. Its NP-hardness (Cucerzan, 2007) doesnot allow to solve the general case for an importantnumber of queries. (Ratinov et al., 2011) evalu-ated local and global approaches to find approxim-ate solutions of an approximation of Eq. (2) withEq. (3), given a new coherence function ψ, and foreach query ql a disambiguation context of entitiesCl:

e1, ..., eQ = argmaxj1,..,jq∈EQ

[(

Q∑

l=1

Wl,jl+∑

k∈Cl

ψ(jl, jk))]

(3)The formulation with Eq. (3) is halfway

between individual and collective linking: it sug-gests to select a convenient set of disambiguationcontexts, and then solving locally for each query.In the same time, it still enforces some coherenceamong the predited entities. Collective linkingalso has other formulations: (Han et al., 2011)proposed a collective formulation for entity link-ing decisions, in which evidence can be reinforcedinto high-probability decisions.

For individual graph based linking, a rule basedon the importance of the entity node in the know-ledge graph rule has been studied experimentally(Guo et al., 2011).

a

b

e1

e2

e3

wa,e1

...

...

...we1,e2

we2,e3

...wb,e3

Figure 3: Directed query/entity bipartite weightedgraph. Nodes e1, e2, e3 are entities in the knowledgebase (same as Fig. 2). Nodes a and b are entity queriesextracted from text documents.

Dense subgraphs, PageRank: Other graph-based approaches have been developed. (Hof-fart et al., 2011), and (Alhelbawy and Gaizauskas,2014) proposed to link efficiently a query to itscorresponding entity using the weighted undirec-ted bipartite graph (Fig. 3). The idea is to extracta dense subgraph in which every query node isconnected to exactly one entity, yielding the mostlikely disambiguation. In general, this combin-

atorial optimization problem is NP-hard with re-spect to the number of nodes, since they gener-alize Steiner-tree problem (Hoffart et al., 2011).However heuristics to solve this problem havebeen experimented: (Hoffart et al., 2011) and (Al-helbawy and Gaizauskas, 2014) proposed a dis-carding algorithm using taboo search and localsimilarities with polynomial complexity. Adapt-ations of PageRank algorithm were carried out toprovide each entity a popularity score: (Usbecket al., 2014) built a weighted graph of all queriesand entities based on local and global similarit-ies, and capitalize on the Hyperlink-Induced TopicSearch (HITS) algorithm to produce node author-ity scores. Then, within similar entities to queries,only entities with high authority will be retained.

2.2 Probabilistic graphical modelsAnother interesting idea is to consider namedentity queries as random variables and theirgolden/true entities as hidden states. Unlike char-acter recognition where |E| = |Ei| = 26 for latinalphabet, the number of possible states S per en-tity is large (usually S ≥ 106). Since Viterbialgorithm has a O(N |S|2) complexity, where Nis the number of observations, inference is ineffi-cient. To overcome this issue, (Alhelbawy and Ga-izauskas, 2013) considers a reduced set of candid-ates per query: ei ∈ Ei using query text informa-tion. Using annotation, an Hidden Markov Model(HMM) is trained on the reduced set of candid-ates. Inference is made using message passing (Vi-terbi algorithm) to find the most probable namedentity sequence. Another approach using prob-abilistic graphical model has been provided by(Ganea et al., 2016), with a factor graph that usespopularity-based prior.

2.3 Embeddings and deep architecturesRecent advances in neural networks conceptionsuggested to use word embeddings and convolu-tional neural networks to solve the named entitylinking problem. (Sun et al., 2015) proposed tomaximize a corrupted cosine similarity between aquery, its annotated gold entity and a false entity.An example of learning representations for entitiesusing a neural architecture is achieved in (Yamadaet al., 2017), a linking system based on the simil-arity of average of pre-trained entity embeddingshas been proposed (Yamada et al., 2016), with aO(QE2) complexity. Finally other architectureshave been proposed (Sil et al., 2018; Raiman and

19

Raiman, 2018), the latter using a fine-grained on-tology type system and reaching promising resultson several datasets.

3 Methodology

In this section we present a novel graph-basedmethod for NEL. As a preprocessing step, we pro-pose a new but simple entity filtering method usinginformation retrieval techniques to obtain a lim-ited number of entity candidates. The novelty ofour method lies in the second subsection where wepresent a new graph-based method for final entityidentification. Source code is available at our re-pository2.

3.1 Entity filtering

To discard wrong entity candidates, we use thethree sources of information in the query q =(m, c, t): the mention name, the information con-tained in the rest of document, and the entity type.Obviously, the richer is the ontology (i.e the largeris T ), the easier the NEL problem (but harder isthe NER problem). In order to improve existingentity filtering algorithms, we propose a routinebased on three main components below. The al-gorithm is summarized in algorithm 1).

a - preProcess: For trivial queries having amention name equal to an existing entity nameand type, we implemented a naive match pre-processing. If a mention has the same name andthe same type, its gold entity is labelled as the cor-responding entity.

b - acronymDetection & acronymScore: Ac-ronym detection and expansion is a common topicin bioinformatics. We refer to (Ehrmann et al.,2013) as a survey of acronym detection methods.We implemented a simple rule-based decision foracronym detection, following (Gusfield, 1997): astring is tagged as an acronym if there are two ormore capital letters, and that consecutive distancebetween two capital letters is always one. The sim-ilarity score for acronym extension is chosen as thelength of longest common substring (Apostolicoand Guerra, 1987) between the acronym and cap-ital letters of the target.

c - JN & contextScore: When the named en-tity mention is not tagged as an acronym, compar-ison with entity titles is performed by computingN-grams for N ∈ 2, 3, 4, and use Jaccard In-dex of mention name and entity title. It is referedas JN in algorithm 1. We also mesure similarity

between the context of the query and the text de-scription of an entity in the knowledge base. Weexperimented several techniques: TF-IDF, BM25 ,BM25+ based on the probabilistic retrieval frame-work developed in the 1970s and 1980s (we referto (Robertson et al., 2009) for a recent descrip-tion). The experimental results were very similarin term of recall. We present results obtained withTFIDF (cf. Sec. 4).

Algorithm 1 Entity filtering (generation of entitycandidates)Require: Parameter K, Query (q = (m, c, t)), Entities

(ej , tj)1≤j≤E

1: preProcess(q, (ej , tj)1≤j≤E)2: ds = [ ]3: yacr ← acronymDetection(m)4: for j = 1→ E do5: if tj == t then6: if yacr == 1 then7: sn = acronymScore(m, ej)8: else9: sn = JN(m, ej)

10: end if11: st =

12(sn + contextScore(c, ej))

12: Sorted insertion by value of j : st in ds13: end if14: end for15: return ds[: K] (K top entities )

3.2 Graph-based identification

In this section, we present our graph-based methodfor named entity identification. This graph-basedmethod uses enriched features extraction from theknowledge graph, in order to re-rank top entitycandidates.

Feature extraction: Let q and e respectively aquery and an entity. T still represents the numberof distinct entity types in the ontology. Let s bea scoring function between a query and an entity.Let Nt(e) the set of entity neighbors of type t (cf.Fig 4 for an example). By convention if Nt(e) =∅, then s(q,Nt(e)) , 0. f(q, e) is the filteringscore obtained with algorithm 1. We define thefeatures vector associated with the couple (q, e),Xq,e as the scores concatenation:

(Xq,e)0 = f(q, e)

∀t ∈ 1, ...,T , (Xq,e)t = s(q,Nt(e))(4)

The label of a couple (q, e) is defined as:

Y q,e =

1 if e is the gold entity of q0 otherwise

(5)

20

Cambridge 1

Cambridgeshire

Fitzwilliam Museum

England

Wilf Mannion

Cambridge 2

Massachussets

United States

MIT Museum

Figure 4: Two homonyms: Cambridge cities. Eachcolor is assigned to a node in the ontology. If t1 is as-sociated to the entity type Country, and t2 to Footballplayer, thenNt1(Cambridge 1) = EnglandNt2(Cambridge 1) = Wilf MannionNt1(Cambridge 2) = United StatesNt2(Cambridge 2) = ∅

Supervised NEL: With this formulation, wecan train NEL standard regressors or classifiers ina supervised learning framework. At inference,the couple (q, e) maximizing the prediction scoreyields predicted entity e. If same scores are re-turned for different couples, we return the firstcandidate. (This situation didn’t occur in prac-tice). The feature extraction procedure and infer-ence are summed up in algorithm 2 and algorithm3 respectively.

Algorithm 2 Feature extraction using knowledgegraph and ontologyRequire: Knowledge Graph G, Query q, Entity candidate e

with initial filtering score s0, Types (tj)1≤j≤T

1: Xq,e = [s0]2: Get neighbor nodes of e3: for j = 1 to T do4: Aggregate text description of neighbors of type tj5: Compute score stj betweenNtj (e) and the query q6: Append stj to Xq,e

7: end for8: return Score vectors (Xq,e)1≤j≤T+1

Algorithm 3 Named entity identification (Infer-ence)Require: Knowledge base B and its graph GB , queries

(qi)1≤i≤M , scoring threshold K, trained predictor F1: for i = 1 to M do2: Use filtering on query qi and B, return a list of K

top ranked entities (e1h)1≤h≤K

3: Use algorithm 2 using GB , on K entity candidates,return new score vectors

4: Evaluate F on each vector score and use maximum aposteriori to infer estimated gold entity gi

5: end for6: return (gi)1≤i≤M (list of estimated gold entities)

Graph-based scoring functions: In the identi-

fication step, features defined from Eq. (4) requirethe choice of a scoring function. First of all, sev-eral representations for q and e are possible. Inour first experiment, we used the standard TFIDFrepresentation for the supervised learning proced-ure described previously, and the correspondingscoring function with cosine similarity. This al-lowed to increase slightly empirical accuracy overentity filtering.

In order to explore a broader class of scoringfunctions, let us introduce graph of words (GoW)representations. GoW is a representation builtover a sequence of objects in order to capturesequential relationships. Given a window size,nodes are added to the graph by their string rep-resentation, and edges are added between nodesin the same sliding window. This representa-tion has proven its efficiency for several inform-ation retrieval problems (Rousseau and Vazirgian-nis, 2013).

Indeed, bag of words representations can beconsidered as a special case of graph of wordsrepresentations, for which edge deleting opera-tions have been applied. Here, we consider thatthe query context and the entity description areboth composed of at least 10 words for GoW tobe meaningful. The final step to define a scor-ing function as in Eq. (4), is to compare the twograph structures (one from the query context andthe other from the entity description).

Given two graphs G and H, determining if G isisomorphic H allows to measure graph similarit-ies (Cordella et al., 2004). However, for severalapplications, including the topic of this paper, iso-morphic conditions are too rigid since two docu-ments can be similar without isomorphic GoWs.Also, we are interested in graph similarity meas-ures taking in account structure (word relations)and node attributes (words). For this reason, graphkernels have been popularized as a powerful toolto measure graph similarity in a continuous fash-ion.

Following the notations of Sec. 3.2, and k agraph kernel, we considered the family of scor-ing functions: (q, e) 7→ k(GoWq, GoWNt(e)) inour experiments. If Nt(e) contains more than onenode, we concatenate their text content and com-pute a GoW. As mentioned previously, this familyof functions countains some of the bag-of-wordsscoring functions, such as TFIDF. We obtainedbetter empirical results using standard graph ker-

21

nels (cf. next paragraph for examples). It shouldbe noted that we could not use graph kernels forthe first step (entity filtering), since the computa-tion time would be too long. On the contrary, theidentification step takes as input a limited amountof entity candidates, which makes the computationtime reasonable.

Graph-of-words window, graph kernels & re-gressors: We selected as a graph-of-word win-dow w = 4 (same results were obtained for w ∈3, 4, 5, 6), with different graph kernels, includ-ing Shortest-path kernel, Weisfeiler-Lehman Ker-nel. The accuracy results for each graph kernelwere very close, but higher than with TFIDF scor-ing (1% to 2% better). In Sec. 4, we report resultsfor the pyramid match graph kernel, for its lowcomplexity among standard kernels (Nikolentzoset al., 2017). Finally, we used several standardclassifiers: regression trees, support vector ma-chines, and logistic regression. We obtained bet-ter results with logistic regression (reported inTable 1).

Computational complexity: The total complex-ity (filtering and identification) is: O(M(E +KTG)). We report this in Table 2, along withsome experimental computing times.

4 Experimental setup and evaluation

The source code of our experiments along withdocumentation, and datasets samples are availableat our repository2.

4.1 Datasets, entity types, and ontology:

We used CONLL and NIST TAC-KBP 2009-2010as datasets. Each query contains its gold entityid and type. TAC-KBP: the corresponding know-ledge base is composed of 818741 entities. TAC09contains 1675 test queries, and TAC10 1074 fortrain and 1020 for test. CONLL/AIDA is com-posed of 22516 queries for training and 4379 quer-ies for test.

The other methods, mainly deep learning (DL)in Table 1 use millions of training examples fromWikipedia’s anchor links and corresponding entit-ies. In our method, we did not use this additionaltraining data, but only those provided by the ori-ginal challenges.

Also, we considered a more recent Knowledgebase (Wikipedia 2016 dump with 2880838 entit-ies) since the original Wikipedia 2010 dump is notavailable anymore. The ontology we considered

is available on DBPedia1. We provide the scriptthat builds the complete knowledge base and on-tology in our repository. To generate fine-grainedontology knowledge bases, we describe the pro-cedure (along with the code) in our repository.We must remind that our method does not includefined-grained entity recognition from the queries:we suppose this given as input in the data. Forthe implementation of graph kernels, we used theGraKeL software library (Siglidis et al., 2018).

4.2 Results:

We compare our methods with most performingbaselines. Table 1 sums up our experimental res-ults (averaged P@1 is also referred as accuracy(Sun et al., 2015)). We included standard devi-ation of the accuracy, but could not include p-significance of our method, due to the difficulty toreproduce other baselines experiments (no sourcecode is publicly available, or filtering methodis not detailed). Our method yields remarkableaccuracy on TAC09 dataset, CONLL/AIDA andTAC10 datasets. It performs better than any ex-isting graph-based methods, outperforms all ex-isting methods on two NIST TAC09 and TAC10,and is competitive with state-of-the arts methodson CONLL/AIDA. We also report impact of para-meter K on average precision P@1 (accuracy).Results are in Fig. 5. Low values of K, corres-ponding to limited exploration, leading to decreas-ing accuracy. High values of K yield too many en-tity candidates and an imbalanced learning prob-lem, resulting in a decrease of accuracy. Resultsare similar for 5 ≤K ≤ 10, and allow a 2% to 3%improvement over filtering. As expected, the pre-cision is strictly increasing with respect to T butthe variation is bounded by 5% for T ∈ [3, 249].

Table 2: Computing times rounded to the minute. Q =1000, E = 2.8× 106, G ≤ 200, K = 7, T = 249. Setup 1:Single CPU with 32Gb Ram, 4-cores 2.40GHz. Setup 2: Dis-tributed cluster with variety of 20 CPU processors equivalentto setup 1 (Spark/Hadoop technology)

Component ComplexityTime (mn.)

Setup 1 Setup 2Filtering O(ME) 196 15Identification O(QKTG) 153 10

5 Conclusion

In this paper, we proposed a new methodologyconcerning the problem of named entity linking.

22

Table 1: Comparison with state-of-the art methods for K = 7 and T = 249. PGMs stands for probabilisticgraphical model.

Method Nil detection Train. sizeP@1 (Accuracy) ± std %

TAC09 TAC10 AIDA

(Ganea et al., 2016) PGM No ∼ 106 / / 87.39

(Ganea and Hofmann, 2017) PGM/DL No ∼ 106 / / 92.22

(Sun et al., 2015) DL No ∼ 106 82.26 83.92 /

(Yamada et al., 2016) DL No ∼ 106 / 85.2 93.1

(Yamada et al., 2017) DL No ∼ 106 / 87.7 94.3

(Globerson et al., 2016) DL Yes ∼ 106 / 87.2 92.7

(Sil et al., 2018) DL Not detailed ∼ 106 / 87.4 93.0

(Raiman and Raiman, 2018) DL Not detailed ∼ 106 / 90.85 94.87

(Guo et al., 2011) Graphs Yes ∼ 104 84.89 82.40 /

(Hoffart et al., 2011) Graphs No ∼ 104 / / 81.91

Our method Graphs No ∼ 103, 104 93.67±0.06 94.70±0.05 93.56±0.06

1 5 10 15 2090

92

94

96

98

100

K

Ave

rage

accu

racy

(%)

P@1 = f(K), T=249

TAC09TAC10

CONLL AIDA

3 19 76 171 250

88

90

92

94

96

98

100

T

Ave

rage

accu

racy

(%)

P@1 = g(T), K=7

TAC09TAC10

CONLL AIDA

Figure 5: Impact of K and T on average P@1.

First, we presented an entity filtering algorithm toreturn entity candidates that improves over trivialassociation rules. Then, each entity candidate ismatched with a new representation built on a sub-graph centered on their node. These representa-tions use information contained in the ontologyof the knowledge base. Finally, we used stand-ard supervised learning to identify entities in thetop candidates from filtering. We showed experi-mentally with standard datasets that named entitylinking systematically improves over filtering us-ing graph-based identification (for 2 ≤ K ≤ 10),up to 3%. Our experiments show that our methodis competitive with state-of-the-art, and is stablewith respect to K and T , has a linear complex-ity and reasonable experimental computing time.Our linking system is relatively easy to implement,with few hyper-parameters. Last but not least, itdoes not require lots of data compared with deeplearning to reach good experimental performance:only a few thousands of training samples wereused to reach these results.

References

Ayman Alhelbawy and Robert Gaizauskas. 2013.Named entity disambiguation using hmms. In 2013IEEE/WIC/ACM International Joint Conferences onWeb Intelligence (WI) and Intelligent Agent Techno-logies (IAT), volume 3, pages 159–162. IEEE.

Ayman Alhelbawy and Robert Gaizauskas. 2014.Graph ranking for collective named entity disambig-uation. In Proceedings of the 52nd Annual Meet-ing of the Association for Computational Linguistics(Volume 2: Short Papers), pages 75–80.

23

A. Apostolico and C. Guerra. 1987. The longest com-mon subsequence problem revisited. Algorithmica,2(1):315–336.

Luigi P Cordella, Pasquale Foggia, Carlo Sansone, andMario Vento. 2004. A (sub) graph isomorphism al-gorithm for matching large graphs. IEEE transac-tions on pattern analysis and machine intelligence,26(10):1367–1372.

Silviu Cucerzan. 2007. Large-scale named entity dis-ambiguation based on wikipedia data. In Proceed-ings of the 2007 EMNLP-CoNLL, pages 708–716.

Maud Ehrmann, Leonida Della Rocca, Ralf Steinber-ger, and Hristo Tanev. 2013. Acronym recogni-tion and processing in 22 languages. arXiv preprintarXiv:1309.6185.

Octavian-Eugen Ganea, Marina Ganea, Aurelien Luc-chi, Carsten Eickhoff, and Thomas Hofmann. 2016.Probabilistic bag-of-hyperlinks model for entitylinking. In Proceedings of the 25th InternationalConference on World Wide Web, pages 927–938. In-ternational World Wide Web Conferences SteeringCommittee.

Octavian-Eugen Ganea and Thomas Hofmann. 2017.Deep joint entity disambiguation with local neuralattention. In Proceedings of the 2017 Conferenceon Empirical Methods in Natural Language Pro-cessing, pages 2619–2629.

Amir Globerson, Nevena Lazic, Soumen Chakra-barti, Amarnag Subramanya, Michael Ringaard, andFernando Pereira. 2016. Collective entity resolutionwith multi-focal attention. In Proceedings of the54th Annual Meeting of the Association for Com-putational Linguistics (Volume 1: Long Papers),volume 1, pages 621–631.

Yuhang Guo, Wanxiang Che, Ting Liu, and Sheng Li.2011. A graph-based method for entity linking. InProceedings of 5th International Joint Conferenceon Natural Language Processing, pages 1010–1018.

Dan Gusfield. 1997. Algorithms on strings, trees andsequences: computer science and computationalbiology. Cambridge university press.

Xianpei Han, Le Sun, and Jun Zhao. 2011. Collectiveentity linking in web text: a graph-based method. InProceedings of the 34th international ACM SIGIRconference on Research and development in Inform-ation Retrieval, pages 765–774. ACM.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bor-dino, Hagen Fürstenau, Manfred Pinkal, Marc Spa-niol, Bilyana Taneva, Stefan Thater, and GerhardWeikum. 2011. Robust disambiguation of namedentities in text. In Proceedings of the Conferenceon Empirical Methods in Natural Language Pro-cessing, pages 782–792. Association for Computa-tional Linguistics.

Heng Ji, Joel Nothman, Ben Hachey, et al. 2014. Over-view of tac-kbp2014 entity discovery and linkingtasks. In Proc. Text Analysis Conference (TAC2014),pages 1333–1339.

Jens Lehmann, Robert Isele, Max Jakob, AnjaJentzsch, Dimitris Kontokostas, Pablo N Mendes,Sebastian Hellmann, Mohamed Morsey, PatrickVan Kleef, Sören Auer, et al. 2015. Dbpedia–alarge-scale, multilingual knowledge base extractedfrom wikipedia. Semantic Web, 6(2):167–195.

Xiao Ling and Daniel S Weld. 2012. Fine-grained en-tity recognition. In Twenty-Sixth AAAI Conferenceon Artificial Intelligence.

Qiaoling Liu, Josh Chao, Thomas Mahoney, AlanChern, Chris Min, Faizan Javed, and ValentinJijkoun. 2018. Lessons learned from developing anddeploying a large-scale employer name normaliza-tion system for online recruitment. In Proceedingsof the 24th ACM SIGKDD, pages 556–565. ACM.

David Nadeau and Satoshi Sekine. 2007. A survey ofnamed entity recognition and classification. Lingv-isticae Investigationes, 30(1):3–26.

Giannis Nikolentzos, Polykarpos Meladianos, andMichalis Vazirgiannis. 2017. Matching Node Em-beddings for Graph Similarity. In Proceedings ofthe 31st AAAI Conference on Artificial Intelligence,pages 2429–2435.

Jonathan Raphael Raiman and Olivier Michel Rai-man. 2018. Deeptype: multilingual entity linkingby neural type system evolution. In Thirty-SecondAAAI Conference on Artificial Intelligence.

Lev Ratinov, Dan Roth, Doug Downey, and MikeAnderson. 2011. Local and global algorithmsfor disambiguation to wikipedia. In Proceedingsof the 49th Annual Meeting of the Associationfor Computational Linguistics: Human LanguageTechnologies-Volume 1, pages 1375–1384. Associ-ation for Computational Linguistics.

Stephen Robertson, Hugo Zaragoza, et al. 2009. Theprobabilistic relevance framework: Bm25 and bey-ond. Foundations and Trends R© in Information Re-trieval, 3(4):333–389.

François Rousseau and Michalis Vazirgiannis. 2013.Graph-of-word and tw-idf: New approach to ad hocir. In Proceedings of the 22Nd ACM InternationalConference on Information & Knowledge Manage-ment, CIKM ’13, pages 59–68, New York, NY,USA. ACM.

Giannis Siglidis, Giannis Nikolentzos, Stratis Lim-nios, Christos Giatsidis, Konstantinos Skianis,and Michalis Vazirgiannis. 2018. Grakel: Agraph kernel library in python. arXiv preprintarXiv:1806.02193.

24

Avirup Sil, Gourab Kundu, Radu Florian, and WaelHamza. 2018. Neural cross-lingual entity linking.In Thirty-Second AAAI Conference on Artificial In-telligence.

Yaming Sun, Lei Lin, Duyu Tang, Nan Yang, ZhenzhouJi, and Xiaolong Wang. 2015. Modeling mention,context and entity with neural networks for entitydisambiguation. In IJCAI, pages 1333–1339.

Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Mi-chael Röder, Daniel Gerber, Sandro Athaide Coelho,Sören Auer, and Andreas Both. 2014. Agdistis-graph-based disambiguation of named entities usinglinked data. In International semantic web confer-ence, pages 457–471. Springer.

Davy Weissenbacher, Tasnia Tahsin, Rachel Beard,Mari Figaro, Robert Rivera, Matthew Scotch, andGraciela Gonzalez. 2015. Knowledge-driven geo-spatial location resolution for phylogeographic mod-els of virus migration. Bioinformatics, 31(12):i348–i356.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, andYoshiyasu Takefuji. 2016. Joint learning of the em-bedding of words and entities for named entity dis-ambiguation. CoNLL 2016, page 250.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, andYoshiyasu Takefuji. 2017. Learning distributed rep-resentations of texts and entities from knowledgebase. Transactions of the Association for Compu-tational Linguistics, 5:397–411.

Jin G Zheng, Daniel Howsmon, Boliang Zhang, Juer-gen Hahn, Deborah McGuinness, James Hendler,and Heng Ji. 2015. Entity linking for biomedicalliterature. BMC medical informatics and decisionmaking, 15(1):S4.

25


Neural Speech Translation using Lattice Transformations and GraphNetworks

Daniel Beck† Trevor Cohn† Gholamreza Haffari‡†School of Computing and Information Systems

University of Melbourne, Australiad.beck,[email protected]

‡Faculty of Information TechnologyMonash University, Australia

[email protected]

AbstractSpeech translation systems usually follow apipeline approach, using word lattices as anintermediate representation. However, previ-ous work assume access to the original tran-scriptions used to train the ASR system, whichcan limit applicability in real scenarios. In thiswork we propose an approach for speech trans-lation through lattice transformations and neu-ral models based on graph networks. Experi-mental results show that our approach reachescompetitive performance without relying ontranscriptions, while also being orders of mag-nitude faster than previous work.

1 Introduction

Translation from speech utterances is a challeng-ing problem that has been studied both under sta-tistical, symbolic approaches (Ney, 1999; Casacu-berta et al., 2004; Kumar et al., 2015) and more re-cently using neural models (Sperber et al., 2017).Most previous work rely on pipeline approaches,using the output of a speech recognition system(ASR) as an input to a machine translation (MT)one. These inputs can be simply the 1-best sen-tence returned by the ASR system or a more struc-tured representation such as a lattice.

Some recent work on end-to-end systems by-pass the need for intermediate representations,with impressive results (Weiss et al., 2017). How-ever, such a scenario has drawbacks. From a prac-tical perspective, it requires access to the originalspeech utterances and transcriptions, which can beunrealistic if a user needs to employ an out-of-the-box ASR system. From a theoretical perspec-tive, intermediate representations such as latticescan be enriched through external, textual resourcessuch as monolingual corpora or dictionaries.

Sperber et al. (2017) proposes a lattice-to-sequence model which, in theory, can address bothproblems above. However, their model suffers

from training speed performance due to the lackof efficient batching procedures and they rely ontranscriptions for pretraining. In this work, weaddress these two problems by applying latticetransformations and graph networks as encoders.More specifically, we enrich the lattices by apply-ing subword segmentation using byte-pair encod-ing (Sennrich et al., 2016, BPE) and perform aminimisation step to remove redundant nodes aris-ing from this procedure. Together with the stan-dard batching strategies provided by graph net-works, we are able to decrease training time bytwo orders of magnitude, enabling us to matchtheir translation performance under the same train-ing speed constraints without relying on gold tran-scriptions.

2 Approach

Many graph network options exist in the litera-ture (Bruna et al., 2014; Duvenaud et al., 2015;Kipf and Welling, 2017; Gilmer et al., 2017): inthis work we opt for a Gated Graph Neural Net-work (Li et al., 2016, GGNN), which was re-cently incorporated in an encoder-decoder archi-tecture by Beck et al. (2018). Assume a directedgraph G = V, E , LV , LE, where V is a set ofnodes (v, `v), E is a set of edges (vi, vj , è) andLV and LE are respectively vocabularies for nodesand edges, from which node and edge labels (`vand è) are defined. Given an input graph withnode embeddings X, a GGNN is defined as

h0v = xv

rtv = σ

(crv∑

u∈Nv

Wrèh

(t−1)u + br

è

)

ztv = σ

(czv∑

u∈Nv

Wzèh

(t−1)u + bz

è

)

26

htv = ρ

(cv∑

u∈Nv

Wè

(rtu h(t−1)

u

)+ bè

)

htv = (1− ztv) h(i−1)

v + ztv htv

where e = (u, v, è) is the edge between nodes uand v, N (v) is the set of neighbour nodes for v, ρis a non-linear function, σ is the sigmoid functionand cv = czv = crv = |Nv|−1 are normalisationconstants.

Intuitively, a GGNN reduces to a GRU (Choet al., 2014) if the graph is a linear chain. There-fore, the GGNN acts as a generalised encoder thatupdates nodes according to their neighbourhood.Multiple layers can be stacked, allowing informa-tion to be propagated through longer paths in thegraph. Batching can be done by using adjacencymatrices and matrix operations to perform the up-dates, enabling efficient processing on a GPU.

2.1 Lattice TransformationsAs pointed out by Beck et al. (2018), GGNNs cansuffer from parameter explosion when the edge la-bel space is large, as the number of parameters isproportional to the set of edge labels. This is aproblem for lattices, since most of the informationis encoded on the edges. We tackle this problemby transforming the lattices into their correspond-ing line graphs, which swaps nodes and edges.1

After this transformation, we also add start andend symbols, which enable the encoder to prop-agate information through all possible paths in thelattice. Importantly, we also remove node scoresfrom the lattice in most of our experiments, butwe do revisit this idea in §3.3.

Having lattices as inputs allow us to incorpo-rate additional steps of textual transformations. Toshowcase this, in this work we perform subwordsegmentation on the lattice nodes using BPE. If anode is not present in the subword vocabulary, wesplit it into subwords and connect them in a left-to-right manner.

The BPE segmentation can lead to redundantnodes in the lattice. Our next transformation stepis a minimisation procedure, where such nodes arejoined into a single node in the graph. To performthis step, we leverage an efficient algorithm forautomata minimisation (Hopcroft, 1971), whichtraverses the graph detecting redundant nodes byusing equivalence classes, running in O(n log n)time, where n is the number of nodes.

1This procedure is also done in Sperber et al. (2017).

con quien

habla

hablo

<s> con quien

habla

hablo

</s>

<s> con quien

hab@@ la

hab@@ lo

</s>

<s> con quien hab@@

la

lo

</s>

<s> con quien hab@@

la

lo

</s>

Figure 1: Proposed lattice transformations. From topto bottom: 1) Original lattice with scores removed;2) Line graph transformation; 3) Subword segmenta-tion; 4) Lattice minimisation; 5) Addition of reverseand self-loop edges.

The final step adds reverse and self-loop edgesto the lattice, where these new edges have specificparameters in the encoder. This eases propaga-tion of information and is standard practice whenusing graph networks as encoders (Marcheggianiand Titov, 2017; Bastings et al., 2017; Beck et al.,2018). We show an example of all the transforma-tion steps on Figure 1.

In Figure 2 we show the architecture of our sys-tem, using the final lattice from Figure 1 as anexample. Nodes are represented as embeddingsthat are updated according to the lattice structure,resulting in a set of hidden states as the output.Other components follow a standard seq2seqmodel, using a bilinear attention module (Luonget al., 2015) and a 2-layer LSTM (Hochreiter and

27

Bilinear Attention

with

who

am

I

speaking

Embeddings GGNN Encoder Attention RNN Decoder

<s>

con

quién

hab@@

la

lo

</s>

...

Figure 2: Model architecture, using the final Spanish lattice from Figure 1 and its corresponding English translationas an example.

Schmidhuber, 1997) as the decoder.

3 Experiments

Data We perform experiments using theFisher/Callhome Speech Translation corpus,composed of Spanish telephone conversationswith their corresponding English translations.We use the original release by Post et al. (2013),containing both 1-best and pruned lattice outputsfrom an ASR system for each Spanish utterance.2

The Fisher corpus contain 150K instances and weuse the original splits provided with the datasets.Following previous work (Post et al., 2013;Sperber et al., 2017), we lowercase and removepunctuation from the English translations. Tobuild the BPE models, we extract the vocabularyfrom the Spanish training lattices, using 8K splitoperations.

Models and Evaluation All our models aretrained on the Fisher training set. For the 1-bestbaseline we use a standard seq2seq architectureand for the GGNN models, we use the same setupas Beck et al. (2018). Our implementation is basedon the Sockeye toolkit (Hieber et al., 2017) andwe use default values for most hyperparameters,except for batch size (16) and GGNN layers (8).3

For regularisation, we apply 0.5 dropout on the in-put embeddings and perform early stopping on thecorresponding Fisher dev set.

2We refer the reader to Post et al. (2013) for details on theASR system and how the lattices were generated.

3A complete description of hyperparameter values isavailable in the Supplementary Material.

1-best L L+S L+S+M

Median 32.4 34.4 34.5 34.3Ensemble 36.1 38.3 38.7 39.1

Table 1: Out-of-the-box scenario results, in BLEUscores. “L” corresponds to word lattice inputs, “L+S”and “L+S+M” correspond to lattices after subword seg-mentation and after minimisation, respectively.

Each model is trained using 5 different seedsand we report BLEU (Papineni et al., 2001) re-sults using the median performance according tothe dev set and an ensemble of the 5 models. Forthe word-based models, we remove any tokenswith frequency lower than 2 (as in Sperber et al.(2017)), while for subword models we do not per-form any threshold pruning. We report all resultson the Fisher “dev2” set.4

3.1 Out-of-the-box ASR scenario

In this scenario we assume only lattices and 1-bestoutputs are available, simulating a setting wherewe do not have access to the transcriptions. Ta-ble 1 shows that results are consistent with pre-vious work: lattices provide significant improve-ments over simply using the 1-best output. Moreimportantly though, the results also highlight thebenefits of our proposed transformations and weobtain the best ensemble performance using min-imised lattices.

4We also experimented with the Callhome test set, simi-lar to previous work. However, we did not see any differenttrends so we omit the results.

28

L+S+M L+S+M+T

Median 34.3 37.1Ensemble 39.1 42.3

Previous Work - no lattice scoresSperber et al. (2017) – 36.9

Previous Work - with lattice scoresPost et al. (2013) – 36.8Sperber et al. (2017) – 38.5

Table 2: Results with transcriptions, in BLEU scores.“L+S+M” corresponds to the same results in Table 1and “L+S+M+T” is the setting with gold transcriptionsadded to the training set.

3.2 Adding Transcriptions

The out-of-the-box results in §3.1 are arguablymore general in terms of applicability in real sce-narios. However, in order to compare with thestate-of-the-art, we also experiment with a sce-nario where we have access to the original Spanishtranscriptions. To incorporate transcriptions intoour model, we convert them into a linear chaingraph, after segmenting using BPE. With this, wecan simply take the union of transcriptions and lat-tices into a single training set. We keep the devand test sets with lattices only, as this emulates testtime conditions.

The results shown in Table 2 are consistent withprevious work: adding transcriptions further en-hance the system performance. We also slightlyoutperform Sperber et al. (2017) in the settingwhere they ignore lattice scores, as in our ap-proach. Most importantly, we are able to reachthose results while being two orders of magnitudefaster at training time: Sperber et al. (2017) reporttaking 1.5 days for each epoch while our archi-tecture can process each epoch in 15min. The rea-son is because their model relies on the CPU whileour GGNN-based model can be easily batched andcomputed in a GPU.

Given those differences in training time, it isworth mentioning that the best model in Sperberet al. (2017) is surpassed by our best ensemble us-ing lattices only. This means that we can obtainstate-of-the-art performance even in an out-of-the-box scenario, under the same training speed con-straints. While there are other constraints that maybe considered (such as parameter budget), we nev-ertheless believe this is an encouraging result forreal world scenarios.

3.3 Adding Lattice ScoresOur approach is not without limitations. In par-ticular, the GGNN encoder ignores lattice scores,which can help the model disambiguate betweendifferent paths in the lattice. As a simple first ap-proach to incorporate scores, we embed them us-ing a multilayer perceptron, using the score as theinput. This however did not produce good results:performance dropped to 32.9 BLEU in the singlemodel setting and 38.4 for the ensemble.

It is worth noticing that Sperber et al. (2017) hasa more principled approach to incorporate scores:by modifying the attention module. This is ar-guably a better choice, since the scores can di-rectly inform the decoder about the ambiguity inthe lattice. Since this approach does not affect theencoder, it is theoretically possible to combine ourGGNN encoder with their attention module, weleave this avenue for future work.

4 Conclusions and Future Work

In this work we proposed an architecture forlattice-to-string translation by treating lattices asgeneral graphs and leveraging on recent advancesin neural networks for graphs.5 Compared to pre-vious similar work, our model permits easy mini-batching and allows one to freely enrich the lat-tices with additional information, which we ex-ploit by incorporating BPE segmentation and lat-tice minimisation. We show promising results andoutperform baselines in speech translation, partic-ularly in out-of-the-box ASR scenarios, when onehas no access to transcriptions.

For future work, we plan to investigate betterapproaches to incorporate scores in the lattices.The approaches used by Sperber et al. (2017) canprovide a starting point in this direction. Thesame minimisation procedures we employ can beadapted to weighted lattices (Eisner, 2003). An-other important avenue is to explore this approachin low-resource scenarios such as ones involvingendangered languages (Adams et al., 2017; Anas-tasopoulos and Chiang, 2018).

Acknowledgements

This work was supported by the Australian Re-search Council (DP160102686). The researchreported in this paper was partly conducted atthe 2017 Frederick Jelinek Memorial Summer

5Code to replicate results available at github.com/beckdaniel/textgraphs2019_lat2seq.

29

Workshop on Speech and Language Technolo-gies, hosted at Carnegie Mellon University andsponsored by Johns Hopkins University with un-restricted gifts from Amazon, Apple, Facebook,Google, and Microsoft.

ReferencesOliver Adams, Trevor Cohn, Graham Neubig, and

Alexis Michaud. 2017. Phonemic transcription oflow-resource tonal languages. In Proceedings ofALTA, pages 53–60.

Antonis Anastasopoulos and David Chiang. 2018. TiedMultitask Learning for Neural Speech Translation.In Proceedings of NAACL.

Joost Bastings, Ivan Titov, Wilker Aziz, DiegoMarcheggiani, and Khalil Sima’an. 2017. GraphConvolutional Encoders for Syntax-aware NeuralMachine Translation. In Proceedings of EMNLP,pages 1947–1957.

Daniel Beck, Gholamreza Haffari, and Trevor Cohn.2018. Graph-to-Sequence Learning using GatedGraph Neural Networks. In Proceedings of ACL.

Joan Bruna, Wojciech Zaremba, Arthur Szlam, andYann LeCun. 2014. Spectral Networks and LocallyConnected Networks on Graphs. In Proceedings ofICLR, page 14.

Francisco Casacuberta, Hermann Ney, Franz JosefOch, Enrique Vidal, Juan Miguel Vilar, Sergio Bar-rachina, Ismael Garcıa-Varea, David Llorens, Car-los D. Martınez, Sirko Molau, Francisco Nevado,Moises Angeles Pastor, David Pico, Alberto San-chis, and Christoph Tillmann. 2004. Some ap-proaches to statistical and finite-state speech-to-speech translation. Computer Speech and Lan-guage, 18(1):25–47.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gul-cehre, Dzmitry Bahdanau, Fethi Bougares, Hol-ger Schwenk, and Yoshua Bengio. 2014. Learn-ing Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Pro-ceedings of EMNLP, pages 1724–1734.

David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gomez-Bombarelli, TimothyHirzel, Alan Aspuru-Guzik, and Ryan P Adams.2015. Convolutional Networks on Graphs forLearning Molecular Fingerprints. In Proceedings ofNIPS, pages 2215–2223.

Jason Eisner. 2003. Simpler and More General Min-imization for Weighted Finite-State Automata. InProceedings of NAACL, pages 64–71.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley,Oriol Vinyals, and George E. Dahl. 2017. NeuralMessage Passing for Quantum Chemistry. In Pro-ceedings of ICML.

Felix Hieber, Tobias Domhan, Michael Denkowski,David Vilar, Artem Sokolov, Ann Clifton, and MattPost. 2017. Sockeye: A Toolkit for Neural MachineTranslation. arXiv preprint, pages 1–18.

Sepp Hochreiter and Jurgen Schmidhuber. 1997.Long Short-Term Memory. Neural Computation,9(8):1735–1780.

John Hopcroft. 1971. An O(n log n) Algorithm forMinimizing States in a Finite Automaton. Theoryof machines and computations.

Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph ConvolutionalNetworks. In Proceedings of ICLR.

Gaurav Kumar, Graeme Blackwood, Jan Trmal, DanielPovey, and Sanjeev Khudanpur. 2015. A Coarse-Grained Model for Optimal Coupling of ASR andSMT Systems for Speech Translation. In Proceed-ings of EMNLP, pages 1902–1907.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, andRichard Zemel. 2016. Gated Graph Sequence Neu-ral Networks. In Proceedings of ICLR, 1, pages 1–20.

Minh-Thang Luong, Hieu Pham, and Christopher D.Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedingsof EMNLP, pages 1412–1421.

Diego Marcheggiani and Ivan Titov. 2017. Encod-ing Sentences with Graph Convolutional Networksfor Semantic Role Labeling. In Proceedings ofEMNLP.

Hermann Ney. 1999. Speech Translation: Couplingof Recognition and Translation. In Proceedings ofICASSP, pages 517–520.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. Bleu: a method for automatic eval-uation of machine translation. In Proceedings ofACL, pages 311–318.

Matt Post, Gaurav Kumar, Adam Lopez, DamianosKarakos, Chris Callison-Burch, and Sanjeev Khu-danpur. 2013. Improved Speech-to-Text Transla-tion with the Fisher and Callhome Spanish–EnglishSpeech Translation Corpus. In Proceedings ofIWSLT.

Rico Sennrich, Barry Haddow, and Alexandra Birch.2016. Neural Machine Translation of Rare Wordswith Subword Units. In Proceedings of ACL, pages1715–1725.

Matthias Sperber, Graham Neubig, Jan Niehues, andAlex Waibel. 2017. Neural Lattice-to-SequenceModels for Uncertain Inputs. In Proceedings ofEMNLP, pages 1380–1389.

30

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, YonghuiWu, and Zhifeng Chen. 2017. Sequence-to-sequence models can directly translate foreignspeech. In Proceedings of INTERSPEECH, pages2625–2629.

31


Using Graphs for Word Embedding with Enhanced Semantic Relations

Matan ZuckermanDepartment of Software

and Information Systems EngineeringBen-Gurion University of the Negev

Beer-Sheva 84105, [email protected]

Mark LastDepartment of Software

and Information Systems EngineeringBen-Gurion University of the Negev

Beer-Sheva 84105, [email protected]

Abstract

Word embedding algorithms have become acommon tool in the field of natural languageprocessing. While some, like Word2Vec, arebased on sequential text input, others are uti-lizing a graph representation of text. In thispaper, we introduce a new algorithm, namedWordGraph2Vec, or in short WG2V, whichcombines the two approaches to gain the ben-efits of both. The algorithm uses a directedword graph to provide additional informationfor sequential text input algorithms. Our ex-periments on benchmark datasets show thattext classification algorithms are nearly as ac-curate with WG2V as with other word embed-ding models while preserving more stable ac-curacy rankings.

1 Introduction

Graph embedding in traditional studies aims torepresent nodes as vectors in low-dimensionalspace. Nodes in a graph will have similar vec-tors if they share similar attributes. An exam-ple of a node attribute can be betweeness, whichis defined as the number of times a node ap-pears on the shortest path between two othernodes. Graph-based modeling has proved itselfin many applications including node classifica-tion (Gibert et al., 2012), link prediction (Groverand Leskovec, 2016), community detection (Wanget al., 2017) and more (Goyal and Ferrara, 2018).While graph representation can be applied in a va-riety of domains, this paper will focus on naturallanguage processing applications. Language canbe represented by a graph of words in multipleways. The node (word) connections can repre-sent semantic relationships between words such as”king” and ”queen”, different grammatical formsof the same word such as walk and walked, co-occurring words, etc. Semantic similarity, whichis an inherit part of word embedding, can help with

a variety of NLP applications, such as text summa-rization (Nallapati et al., 2016), document classifi-cation (Yang et al., 2016), and machine translation(Zou et al., 2013). Creating a good feature repre-sentation of words is crucial for achieving reason-able results in all these tasks.In this study, we propose a novel word em-bedding algorithm named WordGraph2Vec, orWG2V. WordGraph2Vec combines the benefits ofexisting word embedding algorithms and tries tominimize their downfalls. While existing algo-rithms are focused on input in the form of vec-tors alone or in the form of graphs alone, Word-Graph2Vec takes into consideration both of theseinput types in its training phase. The word graphconstruction procedure used by WordGraph2Vecis similar to the graph representation in Schenker’s(2003) work, where each unique word is repre-sented by a node in the graph and there is a di-rected edge between two nodes if the correspond-ing two words follow each other in at least onesentence. The difference is that Schenker’s orig-inal word graph was used for representing indi-vidual web documents whereas WordGraph2Vecgenerates a word graph from a large corpus oftext that is expected to represent a natural lan-guage. We evaluate WordGraph2Vec vs. the state-of-the art node embedding algorithms (Grover andLeskovec, 2016; Tang et al., 2015; Perozzi et al.,2014) on the tasks of semantic relationships detec-tion and text classification.

2 Related Work

In this section, we cover the related work on textrepresentation and the benefits of graph-based rep-resentation models for natural language process-ing, along with common word embedding meth-ods and state-of-the-art algorithms for graph em-bedding.

32

2.1 Text Representation

One of the challenges in natural language process-ing is how to represent a term. The most com-mon term representation is vector representation,such as ”one-hot” vector, a vector in the size of thevocabulary, where only one dimension is equal toone and all others are zeros. More advanced meth-ods of embedding terms (mostly, words) in a vec-tor space are covered in the next sub-section.

Graph representation of text, by its naturalstructure, defines relationships between graphnodes. Each graph node represents a term, whichcan be defined in various ways including words,sentences, and n-grams. The node connections candefine the ”closeness” of terms to each other in aricher way than the ”one-hot” vector representa-tion including lexical and semantic relations, con-textual overlap, etc. A potential advantage of us-ing a directed graph model over context-dependentrepresentations, such as word2vec, is preservinginformation about the word order in the input text.

Graph representation is not new in the worldof text processing (Schenker et al., 2005; Son-awane and Kulkarni, 2014). Graph representa-tions outperformed the classical vector represen-tations of text documents, such as TF-IDF, onseveral NLP tasks including document classifi-cation (Markov et al., 2008), text summariza-tion (Garcıa-Hernandez et al., 2009), word sensedisambiguation (Agirre and Soroa, 2009), andkeyword extraction (Litvak and Last, 2008).

2.2 Word Embedding

Word Embedding is the process of represent-ing words as dense vectors in a low-dimensionalspace. This process gained momentum since thepaper of Mikolov (2013), which proposed a noveldeep learning algorithm that yields word embed-ding. Mikolov’s Word2Vec algorithm can be im-plemented in two ways. The first one is CBOW,predicting a word based on the words surround-ing it. The second one is Skip-Gram, predictingthe surrounding words of a specific word based onthis word. The target word and its neighbors areselected from a sliding window, which traversesthe corpus in linear steps. Similar words that donot fall in the same window, because they do notappear next to each other in the corpus, might notbe ”close” to each other in the low-dimensionalspace as will be shown in the next section. TheWord2Vec training objective, as implemented in

Skip-Gram, is to maximize the average log proba-bility of:

1

T

T∑

t=1

∑

−c≤j≤c,j 6=0

log p(wt+j |wt) (1)

Where T is the entire input text, c is half of thewindow size, and w are all co-occurring words.GloVe (Pennington et al., 2014) is another algo-rithm, which produces word embedding. The al-gorithm is essentially a log-bilinear model. Themodel tries to minimize the discrepancy betweenthe estimated probability of two words appearingnext to each other and the true probability of theirco-occurrence. Both GloVe and Word2Vec areconsidered state-of-the-art algorithms for creatingword embedding. Those word embedding meth-ods are suffering from the limitation of ignoringthe high-level, conceptual context of a given word.The vector of each word is trained only on its raw,low-level context and not on relations of words outof that context, which may affect the quality ofword embedding. The word order is also ignoredby the above methods.

2.3 Graph Embedding

Graph embedding models can overcome the limi-tations of the sequential input methods mentionedin the previous section. Paths in a word graph mayconnect semantically related words, which do notnecessarily appear in the same low-level context.Those connections can be utilized for enrichingthe word embedding. Thus, a graph-based modelcan take into account both the structural and thesemantic information represented by text.

An example of the above can be demonstratedby the following two sentences from a given cor-pus: ”The boy went to school” and ”The man wentto work”. If the sentence ”The boy went to work”does not exist in the corpus, a sequential inputmodel, such as Word2Vec, would not be trainedon it, whereas a graph input algorithm can also betrained on this word sequence, because the node”went” is shared in the graph between the two sen-tences. Since ”The boy went to work” is a validpath in the graph, the corresponding sentence canbe included in the training set. This way we canenhance the semantic relationships between wordsrepresented by the embedded vectors. In addition,there is a need for NLP algorithms that can exploitthe benefits of the graph input on one hand andface the challenges it brings on the other hand.

33

There are several state-of-the-art algorithms forproducing graph embedding. Graph embeddingcan be node embedding, edge embedding or both.When defining an embedding, the ”similarity” be-tween nodes is the main key. In a graph represen-tation, there are two major node similarity mea-sures that can be applied:

• Nodes with the same neighbors are logicallysupposed to be similar. For example, in a so-cial network a friend of my friend has a highprobability to be my friend.

• Nodes that are not connected to each other,but have the same structural attributes, suchas hubs.

Different algorithms approached those similari-ties in different ways. DeepWalk (Perozzi et al.,2014) is an algorithm based on the notion of arandom walk. Random walk can be used to ap-proximate similarity (Fouss et al., 2007). It is amethod which is useful in large graphs when it isnot possible for a computer to handle all of thegraph data in its memory. DeepWalk algorithmtries to maximize the log probability of observingthe last k nodes and the next k nodes in a randomwalk. The length of the random walk is set as 2k +1. The algorithm generates multiple random walksand it tries to optimize the sum of log-likelihoodsfor each random walk. Random walks, as opposedto the Word2Vec model, do not scan the data lin-early, which can be beneficial if the data is not lin-ear. (Perozzi et al., 2014) did not evaluate theDeepWalk algorithm on text data.

Graph embedding algorithms achieved supe-rior performance in domains that can be naturallymodeled as graphs including social networks, arti-cle citations, word graphs, etc. A sample applica-tion is link prediction in social networks (Groverand Leskovec, 2016), where the Node2Vec algo-rithm achieved a high AUC score.

The Node2Vec algorithm is also based on a ran-dom walk. The algorithm tries to maximize theoccurrence of subsequent nodes in a random walk.The main difference between the Node2Vec andDeepWalk is that Node2Vec has a mechanism oftrade-off between breadth first search (BFS) anddepth first search (DFS). Each search will lead toa different embedding. While in BFS the similar-ity is mostly based on neighboring nodes, in DFSfurther nodes will have a higher impact and hencewill succeed to represent the community better.

This trade-off produces better, more informativeand higher quality embedding. An example ofusing Node2Vec algorithm for text representationcan be found in Figure 1.

Figure 1: Example for Node2Vec walks. The nodesin bold represent a valid walk that was generated fromNode2Vec algorithm. A sliding window will be trainedon this walk.

The text graph used by Node2Vec in (Groverand Leskovec, 2016) is different from Schenker’smodel. The Node2Vec graph is undirected. Eachtwo co-occurring words in the text have an edgebetween them. The paper does not mentionwhether they applied a minimal word frequencythreshold to graph nodes or took into account sen-tence boundaries when defining edges.

Another node embedding algorithm is LINE(Tang et al., 2015). LINE algorithm is based ontwo types of connections:

• first-order proximity - local pairwise proxim-ity between two nodes

• second-order proximity - proximity betweenthe neighborhood structures of the nodes

The algorithm objective is to minimize the com-bination of the first and the second-order pair-wise proximity. LINE defines for each twonodes their joint probability distributions using thetwo proximity measures and then minimizes theKullback-Leibler divergence of these two distri-butions. LINE was evaluated on a text graph. InLINE, the text graph is undirected, words with fre-quency smaller than five are discarded, and wordsare considered as co-occurring if they appear inthe same sliding-window of five words. It was notmentioned in the paper whether the graph edgeswere affected by the sentence boundaries.

Both DeepWalk and Node2Vec have the lim-itation of skipping some nodes by the randompaths the algorithms generate for training. Con-sequently, some semantically related words maynever appear on the same path. The LINE algo-rithm only considers first and second order rela-tions between words and ignores the relations of

34

a higher order. This restriction may create wordembeddings with lower quality.

3 Methodology

To overcome some of the limitations of exist-ing word embedding methods, we propose a new,graph-based methodology, which has the follow-ing original contributions:

• Word graph construction: utilizingSchenker’s directed graph model andextending it to a large language corpus.

• Graph-based word embedding: introducinga new algorithm named WordGraph2Vec,which combines the advantages of theWord2Vec and Node2Vec algorithms.

3.1 Word Graph Construction

The construction of the text graph in this paper issimilar to Schenker’s work (Schenker et al., 2005),but Schenker’s graphs were constructed from rel-atively short web documents rather than from alarge corpus representing an entire language. Eachunique word in the corpus becomes a node in thegraph and there is a directed edge between twonodes if the corresponding two words are consec-utive in at least one sentence in the corpus. Stopwords and words with less than 100 occurrencesacross the corpus were removed. The graph con-tains only words and not punctuation marks ofany kind. The edge label represents the num-ber of times the second word follows the firstone. The graph is directed and weighted by theco-occurrence frequency. We preserve the edgedirections, because the order of words in a sen-tence is usually important for its meaning, whichshould also affect the calculation of edge fre-quency weights. For example, the phrase ”hotdog” will be represented in the graph with a rel-atively higher weight than ”dog hot” which mighthave an edge in the graph but with a much lowerweight as this phrase is very rare in English lan-guage. The text graph we use is different from thegraphs in LINE and Node2Vec algorithms, whichwere discussed in the Related Work section.

3.2 WordGraph2Vec Algorithm

WordGraph2Vec builds upon the Word2Vec(Mikolov et al., 2013). The main difference be-tween WordGraph2Vec to Word2Vec is that Word-Graph2Vec enriches the text by adding target

words for a specific context word in a sliding-window. In each sentence, m random contextwords are chosen. From the sliding window,which contains the context words, t target wordsare chosen. For each target word, n neighboringwords in radius r are picked from the graph andadded as a new context-target combination. Herethe training objective is to maximize the log prob-ability of words in the same sliding window andthe log probability of the new context-target com-bination. This is done by maximizing equations 1and 2.

1

T ′

T ′∑

t=1

∑

t∈m,nt∈nlog p(wnt|wt) (2)

where t indices refer only to the words that werechosen as the m context words and nt are the newtarget words from the graph that are relevant to aspecific t. T ′ is the number of new word combi-nations. In this way, the context words can predictadditional target words from the word graph thatdo not necessarily appear in the same sentence butshare a semantic relation to the context words.

Algorithm 1 WordGraph2Vec algorithmInput: M: Number of context words, T: Num-

ber of target words, N: Number of neighboringwords, R: Radius in graph, Text: Text, Sliding-Window: Size of the sliding window, G: WordsGraph, E: embedding size

Output: Word Embedding for each nodev ∈ G

1: for Sentence ∈ Text do2: CWords ←SelectCWords(Sentence,M)

3: for SlidingWindow ∈ Sentence doWord2Vec ( SlidingWindow, E)

4: for word ∈ SlidingWindow do5: if word ∈ CWords then6: TargetWords ←SelectTWords(SlidingWindow, T )

7: for target ∈ TargetWordsdo

8: graphWords ←SelectNWords(N, target,G,R)

9: for node ∈ graphWordsdo Word2Vec ( combination of node+word,E)

Figure 2 shows an example of WordGraph2Vecalgorithm where the radius is one. Two context

35

Figure 2: Example for WordGraph2Vec algorithm.

words are chosen and for each context word, onetarget word is picked from a specific sliding win-dow in the size of five. For each target word, twonew target words are picked from the graph. Theadvantage of WordGraph2Vec over Word2Vec andNode2vec is that it overcomes the limitations ofboth models. While the Word2Vec model onlylooks at linear relations between words, our algo-rithm takes into account context-target word pairsthat are not necessarily taken from existing sen-tences in the corpus. In contrast, Node2Vec sam-ples a path of words from the word graph usingrandom walks rather than considering all neigh-boring words of a specific word. This means thatNode2Vec, unlike Word2Vec, may ignore seman-tic relations between some co-occurring words inthe original text.

The Algorithm 1 pseudocode presents the pro-posed algorithm (WordGraph2Vec). The SelectC-Words function selects C random context wordsfrom a sentence. SelectTWords selects T randomtarget words from the sliding window where atleast one word is in CWords. SelectNWords selectsN additional target words from the graph withinthe radius R. Word2Vec is trained combinationsof context and target words for embedding of sizeE.

4 Empirical Evaluation

In this section, we discuss the evaluation metrics,the data, and the algorithms used in our experi-ments.

4.1 Language CorpusThe Wikipedia dump [can be downloadedfrom https://dumps.wikimedia.org/backup-index.html] dataset is a publiclyavailable resource. Wikipedia publishes everycouple of months a free xml file for all theirarticles and metadata in several languages. TheWikipedia dump used in this paper is in theEnglish language. The dump was created on”2018-07-20” and consists of approximately5,700,000 articles. The dataset contains morewords than exist in the English language as there

Parameter ValueTokens 2,183,079,274

Unique words 13,744,507Tok’ less than 5 incidence 11,430,003

Tok’ less than 100 incidence 13,436,432

Table 1: Wikipedia dump statistic.

are also words that represent proper names suchas locations, people, organizations, dates etc. TheWikipedia corpus is a collection of millions ofsentences, which are expected to represent wellthe real-world distribution of a language.

While there are many unique words in theWikipedia dump, only a small amount of them isfrequently used. Thus, 83.1% of the words appearin the corpus less than five times as can be seen inTable 1 and only 2.2% of the words appear morethan 100 times. This small group of 2.2% uniquewords is responsible for 97.2% of all word occur-rences in the entire corpus.

In the view of these findings, we implementedthe graph construction method introduced bySchenker (Schenker et al., 2005). All words withfrequency of lower than 100 times were discarded.In addition, stopwords, as they usually do not con-tribute to the semantic meaning of other words,were also removed. Punctuation marks were re-moved from the graph as well.

4.2 Evaluation Tasks and MetricsEvaluation of the word embedding methods wasperformed on two main tasks:

• Analogy test, both by Gladkova (analogy test,a) and Mikolov (analogy test, b)

• Document classification, based on bench-mark datasets

4.2.1 Analogy testBoth Gladkova (2016) and Mikolov proposedanalogy tests, which aim to test the semantic re-lations between ”similar” words. The main dif-ference between the two tests is in the combina-tion of the questions. The test Gladkova proposedcontains 99,200 analogy questions in 40 morpho-logical and semantic categories. The 40 linguis-tic relations are separated into four types of rela-tions: inflectional and derivational morphology, aswell as lexicographic and encyclopedic semantics.Examples of these types can be seen in Table 2.

36

Types ExamplesInflections car:cars, go:goingDerivation able:unable, life:lifeless

Lexicography sea:water, clean:dirtyEncyclopedia spain:madrid, dog:bark

Table 2: Examples for the four different types of anal-ogy questions

All four types are balanced, i.e., there is the samenumber of questions for each type. Mikolov pro-posed a test which contains 19,544 questions from14 different relations. The amount of questions isunbalanced across the relations. For example, thecountry:capital relation appears in over 50% of thequestions. All questions are in the structure of a isto b as c is to d. Solving analogy questions canbe done using the similarity between words calcu-lated by equations 3 and 4.

d = argmaxd∈V (sim(d, c− a+ b)) (3)

where V is a set of all words in the vocabularyexcluding a,b and c, and sim is defined as:

sim(u, v) = cosine(u, v) =(u ∗ v)||u||||v|| (4)

An answer to a question was labeled as correctif the correct word was among the top ten mostsimilar words in the returned results.

Each question was tested on the word em-bedding vectors that were generated by the dif-ferent algorithms, which are described in sub-section 5.1. For each algorithm, the percentage ofquestions that were answered correctly was pre-sented as the accuracy of each word embeddingmethod.

4.2.2 Document classificationDocument classification is a common NLP task.While the application of this task is well under-stood, the implementation of classification algo-rithms can be difficult. Using word vectors as aninput to such algorithm can help the algorithm to”understand” the meaning of the entire documentbased on the semantic relations of its words. SinceWordGraph2Vec is supposed to generate enhancedsemantic vectors, document classification seemsto be a natural task for this model. In our classifi-cation experiments, each document is represented

Dataset Class train testAGs news 4 3000 1900DBpedia 14 40000 5000

Yelp reviews 5 130000 10000Yelp polarity 2 280000 19000

Yahoo 10 140000 5000Amazon review 5 600000 130000Amazon polarity 2 1800000 200000

Table 3: datasets information, columns are the num-ber of classes in the dataset, number of samples in thetraining set and number of samples in the testing set,respectively.

by a vector calculated as the average of the em-beddings of all document words (subject to the fil-tering specifications). The datasets that were usedin this study are based on (Zhang et al., 2015) andthey are listed in Table 3

Each document from the datasets was convertedto a vector in the size of the embedding. For theclassification task in this paper, we used the fol-lowing three algorithms:

1. Neural Network algorithm:

• Input layer size embedding size• three hidden layers of 256 neurons and

dropout of 20%• dense layer• softmax activation

2. Random forest - 10 estimators, two min sam-ples and no max depth

3. Logistic regression - L2 penalty, hinge lossand 1,000 iterations.

The parameters for Random Forest and Logis-tic regression are the default parameters in scikit-learn package in Python implementation. TheNeural Network architecture was based on ourprevious experience in similar classification tasks.

We evaluated the proposed word embed-ding model in terms of model accuracy againstthe competing algorithms LINE, Word2Vec andNode2Vec described in section 5.1 .

5 Experiments

In this section, we present the settings of our ex-periments and their results.

37

algorithm Mikolov test Inflectional Derivational Encyclopedic LexicographicWord2Vec 71.0 77.9 31.8 25.7 15.2Node2Vec 53.2 59.2 7.5 20.4 14.3Line-first 57.9 65.5 8.1 19.4 11.8Line-second 49.3 75.6 13.1 15.4 13.7Emb1 50.1 47.8 13.4 17.1 7.6Emb2 60.8 68.9 21.4 20.6 10.7Emb3 52.1 53.4 14.5 18.3 8.2Emb4 63.3 71.9 24.7 21.9 11.6

Table 4: Accuracy percentage results of Mikolov Analogy test and the four main question types created in theGladkova Analogy test.

5.1 Experimental Setups

We compared WordGraph2Vec with the followingbaselines:Word2VEc: Window size 10, Skip-Gram method,embedding size 300.Node2Vec: Window size 10. 80 nodes per walk. pand q equal to one. 10 rehearsals and embeddingsize 300.LINE: Both first and second proximities. Embed-ding size of 300.All baselines and WordGraph2Vec were trained onthe Wikipedia dump described in section 4.1.Using WordGraph2Vec we generated four differ-ent word embeddings, all were trained in windowsize 10, Skip-gram method and embedding size300:Emb1: Two context words (M = 2), one targetword (T = 1) per context word. Two (N = 2) extratarget words from the graph within the radius ofone (R = 1).Emb2: Two context words (M = 2), one targetword (T = 1) per context word. One (N = 1) ex-tra target word from the graph within the radius ofone (R = 1).Emb3: Three context words (M = 3), one targetword (T = 1) per context word. Three (N = 3) ex-tra target words from the graph within the radiusof one (R = 1).Emb4: Three context words (M = 3), one targetword (T = 1) per context word. One (N = 1) ex-tra target word from the graph within the radius ofone (R = 1).

5.2 Experimental Results

5.2.1 Analogy testWe first compared the four tested embeddings ofWordGraph2Vec against all baselines in terms ofaccuracy of the analogy test. In the Mikolov test,

only 19,303 questions remained after we removedquestions that contained words, which were fil-tered in advance. In the Gladkova test, only 89,484questions remained after the same filtering opera-tion as mentioned above. The four question typesremained balanced.

As can be observed from Table 4, Word2Vecachieved the best results in both Mikolov andGladkova analogy tests. Apparently, the low-level linear context is much more significant foranalogy tasks in a natural language like En-glish. However, Emb4 in WordGraph2Vec wasranked as the second best after Word2Vec. Itmay be no coincidence as Emb4 is the mostsimilar to the Word2Vec implementation. BothEmb2 and Emb4 achieved better results than thecompeting graph-based algorithms. The Glad-kova test, which is more balanced and does notcontain significant outliers (such as country:city),present a more comprehensible result. With Word-Graph2Vec, Emb2 and Emb4 achieved the high-est accuracy scores in the Gladkova test, similarto the Mikolov’s test. Only questions under Lex-icographic type resulted in lower accuracy scoresthan the other baselines.

One reason, which might cause the lower ac-curacy results of WordGraph2Vec in comparisonto Word2Vec, is the ”noise” WordGraph2Vec cre-ates. This ”noise” can be described by the examplebelow. Let us assume that there are two sentencesin the corpus. The first, ”University applicationcould be tough”. The second, ”The safari in Kenyacould be dangerous”. With WordGraph2Vec, avalid word sequence could be ”University appli-cation could be dangerous”. The above sentenceis less likely to appear in the language and mightcause the word vectors of ”university” and ”dan-gerous” to be ”closer” to each other though their

38

Algorithms DBpedia Ag News Yelp Yelp pol Yahoo Amazon Amazon pol Rank StdEmb1 97.56 91.00 71.08 85.19 81.96 69.25 82.24 1.40Emb2 97.62 91.12 71.36 85.61 82.05 69.44 82.70 1.56Emb3 97.63 91.01 70.96 85.16 81.83 69.27 82.29 1.54Emb4 97.65 91.04 70.79 85.75 81.98 69.53 82.78 2.13LINE-first 97.55 90.98 72.34 83.87 82.43 69.45 81.07 2.25LINE-second 97.52 91.14 71.15 85.81 82.22 69.11 82.54 2.01Node2vec 97.14 90.91 70.69 82.80 81.84 68.81 79.19 3.35word2vec 97.69 91.31 70.92 86.48 82.44 69.09 83.08 2.69

Table 5: Average results across the different classification algorithms. The values are the accuracy in percent-age. Rank Std is the standard deviation values of the ranking of each embedding across all 21 combinations ofclassification algorithms and datasets.

semantic relationship is quite weak.In further investigation of our results, we com-pared Word2Vec with Emb4 embedding, which isthe closest to Word2Vec in terms of the trainingsentence set and Mikolov’s analogy test results.

Out of 19,303 questions of the analogy test,2,027 questions were answered correctly byWord2Vec and incorrectly by WordGraph2Vec.On the other hand, 515 questions were answeredcorrectly by WordGraph2Vec only. The result foreach question is a ranked list, which was cal-culated by the equation 3, of all the vocabularywords. Incorrect answer means the correct wordis not in the top ten words on that list. Figure 3presents a histogram of the positions of the cor-rect words in the list mentioned above. The distri-bution can be described as a ”long-tail” where inmost cases (41.2%) the correct answer was rankedbetween positions 10-20. This could be becauseof the ”noise” that was created by unrelated sen-tences. In addition, we examined the end of thetail as it is interesting to understand why in somequestions the correct position drops from the topten to more than 1000.

Figure 3: WordGraph2Vec histogram for the correctanalogy results.

Figure 4 presents a table with Average and Me-

dian results for the amount of occurrences of aword c from equation 3 for each position group.The purpose of equation 3 is to find the best matchfor a word c that will be as close as possible to thesubtraction of word vectors a and b. One theorythat requires further investigation is presented infigure 4. WordGraph2Vec adds new sentences bychoosing random words from each existing sen-tence and picking random words from a graphbased on their occurrence frequency in the entirecorpus. As the word c occurrence frequency inthe corpus decreases, the results of the analogytests containing this word drop. Presumably, thishappens because a smaller amount of new train-ing sentences contain the word c. Furthermore,we conducted a statistical comparison between theoccurrence frequency of words in the ranking po-sition 10-50 and 50+ and it was shown with alpha= 0.001 that the average occurrence frequency ishigher in the group of 10-50.

Figure 4: Word ranking positions vs. word frequencies.

5.2.2 Document classificationEach classification algorithm from section 5.1 wasused to predict the document labels. The inputwas a vector representation of the document cre-

39

ated by the word embedding. In total 21 resultswere collected (seven datasets and three classifi-cation algorithm). As can be seen from Table 5,Word2Vec achieved the best average accuracy re-sults on five datasets out of seven though the dif-ferences vs. most other algorithms do not appearsignificant. Emb2 was ranked in the second placein four datasets out of seven. Mostly, Word2Vecreached higher accuracy scores than Emb2, but themost interesting insight is that when Word2Vecachieved lower scores, Emb2 was still ranked inthe second place, probably as a result of the newinformation gained from the graph.

Figure 5: Accuracy ranking results for the best config-uration of the proposed algorithm and the competingbaselines. The x axis represents the average ranking ofeach embedding across the datasets.

In Figure 5, we see the ranking of Emb2 com-pared to all other baselines. Emb2 did not achievethe best accuracy scores but from Table 5 Emb2had more stable results than the baselines as thestandard deviation of its ranking is much lowerthan Line, Node2Vec and Word2Vec. From ex-amining Figure 5 it seems that Line-2 is morestable but it has a higher standard deviation thanEmb2 and its average ranking is lower. Node2Vecon average produced the lowest accuracy scoresand it is inferior to the rest of the algorithms. Be-sides, Node2Vec is very unstable as its rankinghas a standard deviation of 3.35, which is morethan double than Emb2. In Figure 5, we canalso see how the ranking of Line-1 varies fromthe first place to almost the last one. In order toverify the significance of the accuracy difference,we conducted a paired t-test. The test has shownthat Emb2 has significantly higher accuracy re-sults than the other baselines.

Table 6 presents the statistical significance re-sults between all baselines and Emb2. For p =

baselines d-avg sd se(d) tW2V -0.16 0.55 0.001 -1.32N2V 1.22 2.18 0.004 2.57

Line-1 0.32 1.17 0.002 1.25Line-2 0.06 0.56 0.001 0.49

Table 6: Statistical significance results between the bestconfiguration of the proposed algorithm, Emb2, and thebaselines.

0.05 the t − value with 20 degrees of freedom is2.086. Node2Vec was found significantly worsethan Emb2. Word2Vec reached a higher accuracyscore than Emb2, but the difference was not statis-tically significant.

6 Conclusion

In this paper, we present WordGraph2Vec, a wordembedding algorithm with semantic enhancement.The algorithm makes use of both linear and graphinput in order to strengthen the semantic relationsbetween words. Our experimental results showthat the proposed embedding did not achieve thebest results on analogy and classification tasks butwas stable across the datasets and in most caseswas ranked at the second place in terms of docu-ment classification and analogy tests accuracy. Infuture work, further settings of WordGraph2Veccan be explored, such as additional word graphconfigurations and a larger radius R for the newtarget words, which should yield target words thatare not close to the context word in the originaltext. In addition, the proposed graph-based ap-proach to word embedding can be evaluated onother NLP tasks in multiple languages.

ReferencesEneko Agirre and Aitor Soroa. 2009. Personalizing

pagerank for word sense disambiguation. In Pro-ceedings of the 12th Conference of the EuropeanChapter of the Association for Computational Lin-guistics, pages 33–41. Association for Computa-tional Linguistics.

Francois Fouss, Alain Pirotte, Jean-Michel Renders,and Marco Saerens. 2007. Random-walk compu-tation of similarities between nodes of a graph withapplication to collaborative recommendation. IEEETransactions on knowledge and data engineering,19(3):355–369.

Rene Arnulfo Garcıa-Hernandez, Yulia Ledeneva,Griselda Matıas Mendoza, Angel Hernandez

40

Dominguez, Jorge Chavez, Alexander Gelbukh,and Jose Luis Tapia Fabela. 2009. Comparingcommercial tools and state-of-the-art methods forgenerating text summaries. In Artificial Intelligence,2009. MICAI 2009. Eighth Mexican InternationalConference on, pages 92–96. IEEE.

Jaume Gibert, Ernest Valveny, and Horst Bunke. 2012.Graph embedding in vector spaces by node attributestatistics. Pattern Recognition, 45(9):3072–3083.

Anna Gladkova, Aleksandr Drozd, and Satoshi Mat-suoka. 2016. Analogy-based detection of morpho-logical and semantic relations with word embed-dings: what works and what doesn’t. In Proceedingsof the NAACL Student Research Workshop, pages 8–15.

Palash Goyal and Emilio Ferrara. 2018. Graph embed-ding techniques, applications, and performance: Asurvey. Knowledge-Based Systems, 151:78–94.

Aditya Grover and Jure Leskovec. 2016. node2vec:Scalable feature learning for networks. In Proceed-ings of the 22nd ACM SIGKDD international con-ference on Knowledge discovery and data mining,pages 855–864. ACM.

Marina Litvak and Mark Last. 2008. Graph-basedkeyword extraction for single-document summa-rization. In Proceedings of the workshop onMulti-source Multilingual Information Extractionand Summarization, pages 17–24. Association forComputational Linguistics.

Alex Markov, Mark Last, and Abraham Kandel. 2008.The hybrid representation model for web documentclassification. International Journal of IntelligentSystems, 23(6):654–679.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jef-frey Dean. 2013. Efficient estimation of wordrepresentations in vector space. arXiv preprintarXiv:1301.3781.

Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre,Bing Xiang, et al. 2016. Abstractive text summa-rization using sequence-to-sequence rnns and be-yond. arXiv preprint arXiv:1602.06023.

Jeffrey Pennington, Richard Socher, and ChristopherManning. 2014. Glove: Global vectors for wordrepresentation. In Proceedings of the 2014 confer-ence on empirical methods in natural language pro-cessing (EMNLP), pages 1532–1543.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena.2014. Deepwalk: Online learning of social rep-resentations. In Proceedings of the 20th ACMSIGKDD international conference on Knowledgediscovery and data mining, pages 701–710. ACM.

Adam Schenker, Horst Bunke, Mark Last, and Abra-ham Kandel. 2005. Graph-Theoretic Techniques forWeb Content Mining. World Scientific PublishingCo., Inc., River Edge, NJ, USA.

Adam Schenker, Mark Last, Horst Bunke, and Abra-ham Kandel. 2003. Classification of web documentsusing a graph model. In Document Analysis andRecognition, 2003. Proceedings. Seventh Interna-tional Conference on, pages 240–244. IEEE.

Sheetal S Sonawane and Parag A Kulkarni. 2014.Graph based representation and analysis of text doc-ument: A survey of techniques. International Jour-nal of Computer Applications, 96(19).

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, JunYan, and Qiaozhu Mei. 2015. Line: Large-scale in-formation network embedding. In Proceedings ofthe 24th International Conference on World WideWeb, pages 1067–1077. International World WideWeb Conferences Steering Committee.

Gladkova analogy test. a. Gladkova analogytest. Available at http://vecto.space/projects/BATS/.

Mikolov analogy test. b. Mikolov analogy test. Avail-able at http://download.tensorflow.org/data/questions-words.txt.

Xiao Wang, Peng Cui, Jing Wang, Jian Pei, WenwuZhu, and Shiqiang Yang. 2017. Community pre-serving network embedding. In AAAI, pages 203–209.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He,Alex Smola, and Eduard Hovy. 2016. Hierarchi-cal attention networks for document classification.In Proceedings of the 2016 Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics: Human Language Technologies,pages 1480–1489.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015.Character-level convolutional networks for text clas-sification. In Advances in neural information pro-cessing systems, pages 649–657.

Will Y Zou, Richard Socher, Daniel Cer, and Christo-pher D Manning. 2013. Bilingual word embeddingsfor phrase-based machine translation. In Proceed-ings of the 2013 Conference on Empirical Methodsin Natural Language Processing, pages 1393–1398.

41


Identifying Supporting Facts for Multi-hop Question Answering withDocument Graph Networks

Mokanarangan Thayaparan∗, Marco Valentino∗, Viktor Schlegel∗, Andre FreitasDepartment of Computer Science

University of Manchesterthayaparan.mokanarangan,marco.valentino,viktor.schlegel,andre.freitas

@manchester.ac.uk

Abstract

Recent advances in reading comprehensionhave resulted in models that surpass humanperformance when the answer is contained ina single, continuous passage of text. However,complex Question Answering (QA) typicallyrequires multi-hop reasoning – i.e. the integra-tion of supporting facts from different sources,to infer the correct answer.

This paper proposes Document Graph Net-work (DGN), a message passing architecturefor the identification of supporting facts over agraph-structured representation of text.

The evaluation on HotpotQA shows that DGNobtains competitive results when compared toa reading comprehension baseline operatingon raw text, confirming the relevance of struc-tured representations for supporting multi-hopreasoning.

1 Introduction

Question Answering (QA) is the task of inferringthe answer for a natural language question in agiven knowledge source. Acknowledged as a suit-able task for benchmarking natural language un-derstanding, QA is gradually evolving from mereretrieval task to a well-established tool for testingcomplex forms of reasoning. Recent advances indeep learning have sparked interest in a specifictype of QA emphasising Machine Comprehension(MC) aspects, where background knowledge is en-tirely expressed in form of unstructured text.

State-of-the-art techniques for MC typically re-trieve the answer from a continuous passage oftext by adopting a combination of character andword-level models with various forms of attentionmechanisms (Seo et al., 2016; Yu et al., 2018). Byemploying unsupervised pre-training on large cor-pora (Devlin et al., 2018), these models are capa-

∗: equal contribution

Document A: Erik Watts

Erik Watts (born December 19, 1967) is an American semi-retired professional wrestler. He isbest known for his appearances with World Championship Wrestling and the World WrestlingFederation in the 1990s. He is the son of WWE Hall of Famer Bill Watts.

Document B: Bill Watts

William F. Watts Jr. (born May 5, 1939) is an American former professional wrestler, promoter,and WWE Hall of Fame Inductee (2009).Watts was famous under his "Cowboy" gimmick in hiswrestling career, and then as a tough, no-nonsense promoter in the Mid-South United States,which grew to become the Universal Wrestling Federation (UWF).

Document-Document

Sentence-Document

Sentence-Document

William F. Watts Jr. (born May 5,1939) is an American former

professional wrestler, promoter, andWWE Hall of Fame Inductee (2009).

Sentence-Document

Erik Watts is theson of WWE Hall

of Famer BillWatts.

Sentence-Document

Q: When was Erik Watts' father born?

Sentence-Document

(1)

(2)Erik Watts (born December 19,1967) is an American semi-retired

professional wrestler.

Bill Watts was famous under his "Cowboy" gimmick in hiswrestling career, and then as a tough, no-nonsense promoter

in the Mid-South United States, which grew to become theUniversal Wrestling Federation (UWF).

Erik Watts is best known for hisappearances with World Championship

Wrestling and the World WrestlingFederation in the 1990s

Erik WattsBill Watts

Figure 1: Is structure important for complex, multi-hop Question Answering (QA) over unstructured textpassages? To answer this question we explore the taskof identifying supporting facts (rounded rectangles) bytransforming a corpus of documents (1) into an undi-rected graph (2) connecting sentence nodes (rectan-gles) and document nodes (hexagons).

ble of outperforming humans in reading compre-hension tasks where the context is represented bya single paragraph (Rajpurkar et al., 2018).

However, when it comes to answering complexquestions on large document collections, it is un-likely that a single passage can provide sufficientevidence to support the answer. Complex QA typ-ically requires multi-hop reasoning, i.e. the abil-ity of combining multiple information fragmentsfrom different sources.

Moreover, recent studies have raised concernson inference capabilities, generalisation and inter-pretability of current MC models (Wiese et al.,2017; Dhingra et al., 2017; Kaushik and Lipton,2018), leading to the creation of novel datasets thatpropose multi-hop reading comprehension as abenchmark for evaluating complex reasoning andexplainability (Yang et al., 2018).

42

Consider the example in Figure 1. In order toanswer the question “When was Erik Watts’ fatherborn?” a QA system has to retrieve and combinesupporting facts stored in different documents:

1. Document A: “Erik Watts is the son of WWEHall of Famer Bill Watts”;

2. Document B: “William F. Watts Jr. (bornMay 5, 1939) is an American former profes-sional wrestler, promoter, and WWE Hall ofFame Inductee (2009)”.

The explicit selection of supporting facts has adual role in a multi-hop QA pipeline:

(a) It allows the system to consider all and onlythose facts that are relevant to answer a spe-cific question;

(b) It provides an explicit trace of the reasoningprocess, which can be presented as justifica-tion for the answer.

This paper explores the task of identifying sup-porting facts for multi-hop QA over large collec-tions of documents where several passages act asdistractors for the MC model. In this setting, wehypothesise that graph-structured representationsplay a key role in reducing complexity and im-proving the ability to retrieve meaningful evidencefor the answer.

As shown in Figure 1.1, identifying support-ing facts in unstructured text is challenging as itrequires capturing long-term dependencies to ex-clude irrelevant passages. On the other hand (Fig-ure 1.2), a graph-structured representation con-necting related documents simplifies the integra-tion of relevant facts by making them mutuallyreachable in few hops. We put this observation inpractice by transforming a text corpus in a globalrepresentation that links documents and sentencesby means of mutual references.

In order to identify supporting facts on undi-rected graphs, we investigate the use of messagepassing architectures with relational inductive bias(Battaglia et al., 2018). We present the DocumentGraph Network (DGN), a specific type of GatedGraph Neural Network (GGNN) (Li et al., 2015)trained to identify supporting facts in the afore-mentioned structured representation.

We evaluate DGN on HotpotQA (Yang et al.,2018), a recently proposed dataset for assessing

MC performance on supporting facts identifica-tion. The experiments show that DGN is able toobtain improvements in F1 score when comparedto a MC baseline that adopts a sequential readingstrategy. The obtained results confirm the value ofpursuing research towards the definition of novelMC architectures, which are able to incorporatestructure as an integral part of their learning andinference processes.

2 Document Graph Network

The following section presents the DocumentGraph Network (DGN), a message passing archi-tecture designed to identify supporting facts formulti-hop QA on graph-structured representationsof documents.

Here, we discuss in details the construction ofthe underlying graph, the DGN model, and a pre-filtering step implemented to alleviate the impactof large graphs on model complexity.

2.1 Graph-structured Representation

Given an arbitrary corpus of documents D =D1, D2, . . . , Dn, we aim to build an undirecteddocument graph DG as structured representationof D (Figure 1).

The advantage of using graph-structured repre-sentations lies in reducing the inference steps nec-essary to combine two or more supporting facts.Therefore, we want to extract a representation thatincreases the probability of connecting relevantsentences with short paths in the graph. We ob-serve that multi-hop questions usually require rea-soning on two concepts/entities that are describedin different, but interlinked documents. We put inpractice this observation by connecting two docu-ments if they contain mentions to the same enti-ties.

The Document Graph (DG) contains nodes oftwo types. We represent each document Di in Das a document node di and each of its sentencesSjDi as a sentence node sjDi . We then add anedge of type esentence−document that links them.This edge type represents the fact that a specificsentence belongs to a specific document. We ap-ply coreference resolution to solve implicit entitymentions within the documents. Subsequently,we add an edge of type edocument−document be-tween two document nodes d1, d2, if the entitiesdescribed in D1 are referenced in D2 or vice-versa.

43

Collection of Documents

Extract Document Graph

Document Graph

Step 1: Graph Construction Step 2: Prefiltering

Filter Top K relevant sentence nodes

Step 3: Supporting Facts Selection

h1t-1

h2t-1

h3t-1

h4t-1

h1t-1

h2t-1

h4t-1

h3t-1

T=1

h1t-1

h2t-1

h3t-1

h4t-1

h1t-1

h2t-1

h4t-1

h3t-1

T=n

Graph Propagation

Erik Watts

Gated Graph Neural Network

Output Network

Q: When was Erik Watts'father born?

Extracted Sub Graph Bi-L

inea

r Atte

ntio

n

He is the son of WWE Hall of Famer Bill Watts.

William F. Watts Jr. (born May 5, 1939) is anAmerican former professional wrestler,promoter, and WWE Hall of Fame Inductee (2009).

QuestionEmbedding

NodeEmbeddings

Self-Atte

ntion

h1n-1 h3n-1

Sentence Nodes

0 1

...

1

hkn-1

...Supporting

Facts

Figure 2: Overview of the approach for the identification of supporting facts in a multi-hop QA pipeline. Step 1 isapplied offline for extracting a graph-structured representation from large corpora (Sec. 2.1). In Step 2, we employa filtering algorithm (Sec. 2.3) to retrieve a sub-graph containing the top k relevant sentences nodes. The final step(Step 3) adopts the DGN model for message passing and binary classification of the supporting facts (Sec. 2.4).

Given a question q, we useDG (instead ofD) asinput for the DGN model. the representation doesnot include edges between sentences since we ob-served increasing complexity in the model with-out gaining substantial benefits in terms of perfor-mance.

2.2 Architectural Overview

Figure 2 highlights the main components of theDGN architecture.

From the target corpus, we automatically ex-tract a Document Graph DG encoding the back-ground knowledge expressed in a corpus of docu-ments (Step 1). This data and its graphical struc-ture is permanently stored into a database, readyto be loaded when it is required. The first step isperformed offline, allowing the integration of newknowledge regardless of the runtime pipeline im-plemented to address the task.

In order to speed up the computation and al-leviate current drawbacks of Gated Graph Neu-ral Networks (Li et al., 2015), the question an-swering pipeline is augmented with a prefilteringstep (Step 2). The adopted algorithm (Sec 2.3),based on a relevance score, is aimed at reducingthe number of nodes involved in the computation.Current limitations of Gated Graph Neural Net-works, in fact, are mainly connected with the sizeof the input graph used for learning and predic-

tion. Performance in terms of computational ef-ficiency and learning degrades proportionally tothe number of nodes and edge types in the in-put graph. In order to reduce the negative impactof large graphs, we adopt the prefiltering step toprune DG, and retrieve a set of sentence nodesS = S1, S2, . . . , Sk expected to contain sup-porting facts for a question q.

The subsequent step (Step 3) is aimed at select-ing supporting facts for q. For this task we em-ploy the Document Graph Network (DGN) on thesubset of DG induced by S (section 2.4). Specif-ically, we apply the aforementioned architectureto learn a distributed representation of each nodein the graph via message passing. This represen-tation is then used by an Output Network (ON)to perform binary classification on the sentencenodes in S and select a set of supporting factsSF = sf1, sf2, . . . , sfm with SF ⊆ S. Inthe experiments we perform supervised learningon the training set provided by HotpotQA (Yanget al., 2018) to correctly predict the elements be-longing to SF .

2.3 Prefiltering Step

Given a question q and a set of documents D =D1, D2, . . . , Dn as context, the aim of the pre-filtering step is to retrieve a subset of the contextcontaining the k most relevant sentences to q.

44

In order to achieve this goal, we adopt a rank-ing based approach similar to the one illustrated in(Narasimhan et al., 2018). Specifically, we con-sider all the sentences occurring in the documentsand compute the similarity between each wordin a sentence and each word in the question q.We adopt pre-trained GloVe vectors (Penningtonet al., 2014) to obtain the distributed representa-tion of each word. Subsequently, we produce therelevance score of each sentence by calculating themean among the m highest similarity values. Thefinal subset is obtained by selecting the sentenceswith the top k relevance scores.

An empirical analysis suggested that m = 5gives the best results on the development set. Weevaluated this approach by computing the recallof retrieving the top k supporting facts for k =20, 25, 30, obtaining values greater than 90%for k = 25 and k = 30. Since the average num-ber of candidate sentences for each question in thecorpus is 50.89, the described algorithm allows usto discard 60.7% (k = 20), 50.87% (k = 25) and41.05% (k = 30) of irrelevant context.

2.4 Identifying Supporting FactsThe Document Graph Network (DGN) is em-ployed for the identification of supporting facts.The DGN model is based on a standard GatedGraph Neural Network architecture (GGNN) (Liet al., 2015) where the inner representation of thenodes is customised to carry out this specific task.We apply DGN on the sub-graph retrieved by thefiltering module.

In alignment with prior research in the field weencode Question(Q), Nodes(N ) and Graph(G) asfollows:

1. Question Representation: The question isstripped of punctuation and stop words andtokenised to obtain W words. These wordsare subsequently converted into a tensorQ ∈ R|W |×D using pre-trained GloVe vec-tors (Pennington et al., 2014) of dimensionD.

2. Node Representation: Similar to the ques-tion representation, each node is also con-verted to V ∈ R|W |×D using entities for doc-ument nodes and sentences words for sen-tence nodes.

3. Graph Representation: Each documentgraphDG is represented by an adjacency ma-

trix A ∈ R|V |×2|E||V | where V and E denotethe vertices and edge types respectively.

Each node (vi) is conditioned on the question(qi) using Bi-Linear Attention (Kim et al., 2018).The attention weights αi of each word w in thenodes are determined by a learned function fBAN

as shown in Equation 2. Here fBAN computes theattention scores between two matrices using a bi-linear attention function. This function has a ma-trix of weights W and a bias vector b used to cal-culate the similarity between the two matrices asVWQT + b:

eiw = fBAN (viw, qi) (1)

αiw =exp(eiw)∑|W |k=1 exp(eik)

(2)

Following the calculation of the attentionscores, the question conditioned vectors are deter-mined as follows:

vi = φ(vi, αi) (3)

Here, φ is a learned function that combines theattention scores of each word by employing a non-linear transformation.

After conditioning the nodes representation onthe question, we employ a Self-Attention Modelfunction fSAN (Vaswani et al., 2017) to calculatethe weight of each vector δi. Here, the learnedfunction fSAN is responsible for computing theweights of each vector in a node. The rationale be-hind this operation is to condense the matrices to avector suitable for a Gated Graph Neural Networkarchitecture while retaining the most discrimina-tive semantic information.

riw = fSAN (viw) (4)

δiw =exp(riw)∑|W |k=1 exp(rik)

(5)

After computing the self-attention score, wecalculate the initial annotation vectors for theGGNN as follows:

xv = σ(vi, δi) (6)

where σ is a function that returns a single vectorby multiplying the corresponding attention scoresand summing them up. The basic recurrent unit ofa GGNN can be formalised as follows:

45

h(1)v = [xTv , 0]T (7)

a(t)v = ATv:[h

(t−1)T1 ...h

(t−1)T|V | ]T + b (8)

ztv = σ(W za(t)v + U zh(t−1)v ) (9)

rtv = σ(W ra(t)v + U rh(t−1)v ) (10)

htv = tanh(Wa(t)v + U(rtv h(t−1)v )) (11)

h(t)v = (1− ztv) h(t−1)v + ztv h(t)v (12)

We perform T time steps of propagation and re-trieve the distributed nodes representation by us-ing the final hidden state. The computed represen-tation of each node implicitly captures the seman-tic information of its neighbours at a distance up toT hops. In the experiments, we found it sufficientto set T = 3.

The graph is heterogeneous with nodes repre-senting questions, sentences and documents. Asthe supporting facts identification task requiressentence classification, we retain the final hiddenstate of the sentence nodes while discarding theothers. We use the sentence representations as in-put to a feed forward neural network called OutputNetwork. We perform binary classification of eachsentence to predict whether it is a supporting factor not:

ov = g(h(T )v , xv) (13)

3 Evaluation

The experiments are motivated by the guiding re-search question of the paper: Does structure playa role in identifying supporting facts for multi-hopQuestion Answering? We further break down thequestion in the following research hypotheses:

• RH1: Existing machine comprehensionmodels benefit from reducing the context toa small number of sentences necessary to an-swer a question.

• RH2: Models operating on a graph-structured representation perform better, sup-porting the identification of relevant factswhen compared to a baseline that uses a se-quential strategy.

We seek to provide evidence for those claims byconducting the following experiments:

• Experiment 1: investigate how a representa-tive state-of-the-art MC model performs ondifferent passages with varying coherencyand length.

• Experiment 2: evaluate the capability ofthe proposed approach to identify supportingfacts in a question answering scenario wherethe relevant facts are distributed across mul-tiple documents.

Specific tests are performed to identify con-tributing features and compare the overall perfor-mance of the approach with a sequential baselinereported in the literature.

HotpotQA We ran the experiments over the re-cently proposed HotpotQA dataset (Yang et al.,2018), which requires MC models to find support-ing passages in a large set of documents, and per-form multi-hop reasoning to arrive at the correctanswer. HotpotQA provides 105,547 first para-graphs extracted from Wikipedia articles, and cor-responding question-answer pairs created by hu-man annotators. Questions are designed to only beanswerable by combining information from twoarticles and require to bridge documents via a con-cept or entity mentioned in both articles. A subsetof questions require a comparison of similar con-cepts concerning their common or differing prop-erties. Furthermore, the dataset provides labels forsupporting sentences, making it possible to per-form quantitative analysis on the retrieval of sup-porting facts.

In all of the reported experiments, if not statedotherwise, training is performed on the HotpotQAtraining set while the evaluation is performed onthe development set in the distractor setting. Inorder to address this setting, a system has to re-trieve the answer and the supporting facts for agiven question by reasoning over a set of ten doc-uments. Only two of the supplied documents areguaranteed to contain the information that is suf-ficient and necessary to answer the question. Theremaining eight documents are similar documentsretrieved by an information retrieval model (hencethe name distractor).

3.1 State-of-the-Art MachineComprehension Performance

This experiment is designed to investigate the ca-pabilities of single passage MC models to retrievethe correct answer when provided with a context

46

of varying size and coherency. For this analysis weadopt BERT (Devlin et al., 2018), a neural trans-former architecture (Vaswani et al., 2017) consti-tuting the state-of-the-art latent representation forvarious NLP tasks.

The publicly available model is pre-trained in anunsupervised manner on a large text corpus withthe objective of language modelling and next sen-tence prediction. Fine-tuning this model to spe-cific NLP tasks has shown to achieve state-of-the-art-results for many NLP tasks, among othersquestion answering and machine reading compre-hension (Devlin et al., 2018). To that end, we fine-tune the model on the training split of HotpotQAand evaluate it on the evaluation split. Beforetraining, we manually remove all the questionsthat cannot be answered by retrieving a continu-ous passage in the supporting facts (e.g. we ex-clude comparison questions that typically requireyes/no type of answers).

We evaluate the performance of BERT withsupporting facts only, and then progressively en-rich the context by a rising number of sentencesretrieved by the filtering algorithm (Sec. 2.3). Theresults of this experiment are reported in Table 1.

Note that these results can not be interpreted asa resilient comparison baseline as (1) we don’t op-timise the set of hyper-parameters associated withthe model training and (2) we ensure the existenceof supporting facts in the evaluation, since we areinterested in the intrinsic performance of BERT inanswer retrieval.

Unsurprisingly, the best results are achievedwhen the context provided to BERT is composedof supporting facts only. Conversely, the perfor-mance of the model gradually deteriorates whendistracting information is added to the context.

These results reinforce our assumption that amodule capable of identifying the correct set ofsupporting facts represents a fundamental com-ponent in a multi-hop QA pipeline. Moreover,this component may be complementary to down-stream machine comprehension models, consti-tuting a valid support to improve overall perfor-mances in answer retrieval.

3.2 Supporting Facts Identification

We compare the DGN model on the task of iden-tifying supporting facts against the neural baselinereported in (Yang et al., 2018). In order to suit thetask, the baseline architecture extends the state-of-

Table 1: F1 and exact match (EM) score of BERT whenevaluated on answer retrieval over contexts of varyingsize.

Sentences SF only +5 +10 +20 +25 +30F1 0.75 0.60 0.52 0.44 0.42 0.42EM 0.60 0.47 0.40 0.33 0.32 0.31

Table 2: Supporting facts identification: Harmonicmean (F1), Precision and Recall

Model F1 P RBaseline (Yang et al., 2018) 66.66 - -Baseline Replication (Answer + SF) 65.28 63.28 67.43Baseline Replication (SF only) 46.44 48.80 44.31DGN (best) 68.02 61.51 76.07DGN (no edge types) 45.84 - -

the-art answer passage retrieval model (Seo et al.,2016) by an additional recurrent layer that classi-fies whether a sentence is a supporting fact or not.The model is trained jointly and under strong su-pervision on the objectives of retrieving both an-swer and supporting facts. We replicate the exper-iment on our infrastructure in order to obtain moredetailed measures, such as precision an recall. Theresults of the evaluation are reported in Table 2.

The experiments show that the DGN model out-performs the baseline in terms of F1 score (≈2%improvement compared to the results reported inthe paper, ≈3% improvement compared to ourreplication), and recall (≈14% improvement overour replication). However, the baseline implemen-tation has a higher precision. We attribute that tothe fact that the baseline optimises for both answerextraction and supporting facts retrieval.

In general we observe that recall is higher thanprecision throughout the experiments. Comparedto the DGN model, the baseline is less penalisedwhen the retrieved answer still matches the ex-pected answer, even if retrieved from an unrelatedsentence spuriously. in the absence of the answerselection optimisation criterion, the DGN modelis only penalised if it fails to predict the correctsupporting facts. This forces the model to priori-tise recall over precision during training. Adding aweight to the loss calculation as an additional hy-perparameter can balance the precision and recallmetric.

We don’t evaluate DGN on the task of answerretrieval, since the proposed architecture focuseson the classification of the relevant supportingfacts. The task of jointly retrieving answer andsupporting facts is left for future work.

47

Table 3: Precision, Recall and F1 score of the DGNmodel with different values of k assigned to the pre-filtering algorithm. The best values are highlighted.

k = 20 k = 25 k = 30 fullPrefiltering (R) 84.72 90.23 94.29 100.00DGN (F1) 63.40 67.38 68.02 66.00DGN (P) 57.83 61.92 61.51 53.37DGN (R) 70.17 73.90 76.07 86.47

3.3 Analysis

In order to understand the interaction of the keycontributing parts of the architecture, we analysethe behaviour of the full pipeline in different set-tings. Specifically, we measure the DGN perfor-mance when trained and evaluated on the outputof the filtering step. During the training, we en-sure the existence of the supporting facts in the in-put graph of the DGN model. We then evaluate iton the development set by performing predictionon the subset retrieved by the filtering algorithm.The results reported in Table 3 take into accountthe combined performance of the full pipeline withdifferent hyperparameters assigned to the prefilter-ing algorithm.

Firstly, we observe the increase of recall withthe increasing number of retrieved sentences. Thisfact is unsurprising and it is in line with the higherrecall score of the filtering module. More sen-tences means broader coverage, and thus higherrecall even before executing DGN prediction.

Secondly, across the experiments, we observethat k = 30 is the best number of sentences forthe model to learn from. This is confirmed by thebest precision and overall F1 score obtained whentraining and predicting on the top 30 sentences.Moreover, we observed that the application of thefiltering algorithm sensibly speeds up the training,decreasing at the same time the amount of mem-ory required to store matrices and weights of thegraph network. The application of a light filteringis then justified both in terms of performance andcomputational complexity.

Regarding the baseline model, we aim to anal-yse the impact of multi-task learning, where themodel is jointly trained to retrieve supporting factsand the final answer. We observe a significant dropin performance (≈20% F1 score) when we opti-mise the baseline only for supporting facts identi-fication (see Baseline Replication in Table 2). Thisobservation is perfectly in line with the literature.(Hashimoto et al., 2016) report improvements on

low level tasks when jointly optimised with higherlevel tasks in a hierarchical learning setting. Re-garding multi-hop QA, the identification of sup-porting facts directly depends on the answer beingpredicted correctly and vice-versa. A plausible fu-ture work may be to understand whether DGN canbenefit from a similar multi-task learning setup.

Finally, we investigate the role of the semanticinformation expressed explicitly in the DocumentGraph. To that end, we train the DGN model us-ing the same configuration of the best performingmodel without edge type information. This resultsin a notable drop of F1 score (see Table 2) rein-forcing the evidence that explicit semantic infor-mation encoded in relational form contributes to-wards the performance of the model. A promisingfuture direction will be to investigate whether dif-ferent types of semantic representation benefit theperformance of the model and to what extent.

4 Related Work

State-of-the-art approaches for Open-DomainQuestion Answering over large collections of doc-uments employ a combination of character-levelmodels, self-attention (Wang et al., 2017), and bi-attention (Seo et al., 2016) to operate over unstruc-tured paragraphs without exploiting any structuredtext representation. Despite these methods havedemonstrated impressive results reaching in somecases super-human performances (Seo et al., 2016;Chen et al., 2017; Yu et al., 2018), recent studieshave raised important concerns related to general-isation (Wiese et al., 2017; Dhingra et al., 2017)complex reasoning (Welbl et al., 2018) and ex-plainability (Yang et al., 2018). Specifically, thelack of structured representation makes it hard forcurrent Machine Comprehension models to findmeaningful patterns in large corpora, generalisebeyond the training domain and justify the answer.

Research efforts towards the creation ofmessage-passing architectures with relational in-ductive bias (Battaglia et al., 2018) have enabledmachine learning algorithms to incorporate graph-ical structures in their training process. Thesemodels, trained over explicit entities and rela-tions, have the potential to boost generalisation,interpretability and abstract reasoning capabilities.A variety of Graph Neural Network architectureshave already demonstrated remarkable results in alarge set of applications ranging from ComputerVision, Physical Systems and Protein-Protein In-

48

teraction (Zhou et al., 2018).

Our research is in line with recent trends inQuestion Answering prone to explore message-passing architectures over graph-structured rep-resentation of documents to enhance perfor-mance and overcome challenges involved in deal-ing with unstructured text. (Sun et al., 2018)fuse text corpus with manually-curated knowl-edge bases to create heterogeneous graphs of KBfacts and text sentences. Their model, GRAFT-Net, built upon Graph Convolutional Networks(Schlichtkrull et al., 2018), is used to propagateinformation between heterogeneous nodes in thegraph and perform binary classification on en-tity nodes to select the answer. Differently fromthe proposed approach, the latter work focuses onlinks between whole paragraphs and external en-tities in a Knowledge Base. Moreover, GRAFT-Net is designed for single-hop Question Answer-ing, assuming that the question is always about asingle entity.

The proposed approach is similar to (De Caoet al., 2018) and (Song et al., 2018), where the aimis to answer complex questions that require the in-tegration of multiple text passages. However, ourresearch is focused on the identification of sup-porting facts instead of answer retrieval.

Another line of research focuses on narrow-ing down the context for later Machine Compre-hension models by selecting relevant passages assupporting facts. Work in that direction includes(Watanabe et al., 2017) which present a neuralinformation retrieval system to retrieve a suffi-ciently small paragraph and (Geva and Berant,2018) which employ a Deep Q-Network (DQN)to solve the task by learning to navigate over anintra-document tree. A similar approach is cho-sen by (Clark and Gardner, 2017). However, in-stead of operating on document structure, theyadopt a sampling technique to make the modelmore robust towards multi-paragraph documents.These approaches are not directly comparable toour work since they focus either on single para-graphs or intra-document (local) structure.

Strongly related to our work is (Yang et al.,2018) which presents HotpotQA, a novel datasetfor multi-hop QA. The authors highlight the im-portance of identifying supporting facts for im-proving reasoning and explainability of currentsystems. We compare the proposed architecturewith the baseline described in their paper. The

model is based on a state-of-the-art MC model(Seo et al., 2016) that adopts a sequential readingstrategy to identifying supporting facts from largecollections of documents.

5 Conclusion

In this paper, we investigated the role played byinterlinked sentence representation for complex,multi-hop question answering under the focus ofsupporting facts identification, i.e. retrieving theminimum set of facts required to answer a givenquestion. We emphasise that this problem is worthpursuing, showing that the performance of state-of-the-art models substantially deteriorates as thesize of the accompanying context increases.

We present Document Graph Network (DGN),a novel approach for selecting supporting facts ina multi-hop QA pipeline. The model operates overexplicit relational knowledge, connecting docu-ments and sentences extracted from large text cor-pora. We adopt a pre-filtering step to limit thenumber of nodes and train a customised GraphGated Neural Network directly on the extractedrepresentation.

We train and evaluate the DGN model on anewly proposed dataset for complex, multi-hopquestion answering over unstructured text. Theevaluation shows that DGN outperforms a base-line adopting a sequential reading strategy. Addi-tionally, we show that when trained to retrieve justsupporting facts, the performance of the baselinedegrades by ≈20%.

Perhaps most importantly, we highlight a wayto combine structured and distributional sentencerepresentation models and propose further re-search lines in that direction. As future work, weaim to investigate the role and impact of differentstructured sentence representation models withinthe inference process, linking it with the Open In-formation Extraction (Cetto et al., 2018; Niklauset al., 2018) and sentence simplification (Niklauset al., 2019, 2017) literature.

We believe that further research can be dedi-cated to inject richer structured knowledge in themodel, allowing for fine-grained message passingand improved representation learning. Anotherimportant line of research will focus on the im-plementation of advanced mechanisms and tech-niques to scale the approach to massive text cor-pora such as the whole Wikipedia.

49

Acknowledgements

The authors would like to express their gratitudetowards members of the AI Systems lab at theUniversity of Manchester for many fruitful and in-tense discussions.

ReferencesPeter W Battaglia, Jessica B Hamrick, Victor Bapst,

Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Ma-teusz Malinowski, Andrea Tacchetti, David Raposo,Adam Santoro, Ryan Faulkner, et al. 2018. Rela-tional inductive biases, deep learning, and graph net-works. arXiv preprint arXiv:1806.01261.

Matthias Cetto, Christina Niklaus, Andre Freitas,and Siegfried Handschuh. 2018. Graphene:Semantically-linked propositions in open informa-tion extraction. Prooceedings of COLING.

Danqi Chen, Adam Fisch, Jason Weston, and An-toine Bordes. 2017. Reading wikipedia to an-swer open-domain questions. arXiv preprintarXiv:1704.00051.

Christopher Clark and Matt Gardner. 2017. Simpleand effective multi-paragraph reading comprehen-sion. arXiv preprint arXiv:1710.10723.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2018.Question answering by reasoning across documentswith graph convolutional networks. arXiv preprintarXiv:1808.09920.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2018. Bert: Pre-training of deepbidirectional transformers for language understand-ing. arXiv preprint arXiv:1810.04805.

Bhuwan Dhingra, Kathryn Mazaitis, and William WCohen. 2017. Quasar: Datasets for question an-swering by search and reading. arXiv preprintarXiv:1707.03904.

Mor Geva and Jonathan Berant. 2018. Learning tosearch in long documents using document structure.arXiv preprint arXiv:1806.03529.

Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsu-ruoka, and Richard Socher. 2016. A joint many-taskmodel: Growing a neural network for multiple nlptasks. arXiv preprint arXiv:1611.01587.

Divyansh Kaushik and Zachary C Lipton. 2018. Howmuch reading does reading comprehension require?a critical investigation of popular benchmarks.arXiv preprint arXiv:1808.04926.

Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang.2018. Bilinear attention networks. In Advancesin Neural Information Processing Systems, pages1571–1581.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, andRichard Zemel. 2015. Gated graph sequence neu-ral networks. arXiv preprint arXiv:1511.05493.

Medhini Narasimhan, Svetlana Lazebnik, and Alexan-der Schwing. 2018. Out of the box: Reasoning withgraph convolution nets for factual visual questionanswering. In Advances in Neural Information Pro-cessing Systems, pages 2659–2670.

Christina Niklaus, Bernhard Bermeitinger, SiegfriedHandschuh, and Andre Freitas. 2017. A sentencesimplification system for improving relation extrac-tion.

Christina Niklaus, Matthias Cetto, Andre Freitas, andSiegfried Handschuh. 2018. A survey on open infor-mation extraction. In Proceedings of the 27th Inter-national Conference on Computational Linguistics,pages 3866–3878, Santa Fe, New Mexico, USA. As-sociation for Computational Linguistics.

Christina Niklaus, Matthias Cetto, Andre Freitas, andSiegfried Handschuh. 2019. Transforming complexsentences into a semantic hierarchy. In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics, pages 3415–3427, Flo-rence, Italy. Association for Computational Linguis-tics.


Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018.Know what you don’t know: Unanswerable ques-tions for squad. arXiv preprint arXiv:1806.03822.

Michael Schlichtkrull, Thomas N Kipf, Peter Bloem,Rianne van den Berg, Ivan Titov, and Max Welling.2018. Modeling relational data with graph convolu-tional networks. In European Semantic Web Confer-ence, pages 593–607. Springer.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, andHannaneh Hajishirzi. 2016. Bidirectional attentionflow for machine comprehension. arXiv preprintarXiv:1611.01603.

Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang,Radu Florian, and Daniel Gildea. 2018. Exploringgraph-structured passage representation for multi-hop reading comprehension with graph neural net-works. arXiv preprint arXiv:1809.02040.

Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, KathrynMazaitis, Ruslan Salakhutdinov, and William Co-hen. 2018. Open domain question answering usingearly fusion of knowledge bases and text. In Pro-ceedings of the 2018 Conference on Empirical Meth-ods in Natural Language Processing, pages 4231–4242.

50

Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In Advances in Neural Information Pro-cessing Systems, pages 5998–6008.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang,and Ming Zhou. 2017. Gated self-matching net-works for reading comprehension and question an-swering. In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1: Long Papers), volume 1, pages 189–198.

Yusuke Watanabe, Bhuwan Dhingra, and RuslanSalakhutdinov. 2017. Question Answering fromUnstructured Text by Retrieval and Comprehension.

Johannes Welbl, Pontus Stenetorp, and SebastianRiedel. 2018. Constructing datasets for multi-hopreading comprehension across documents. Transac-tions of the Association of Computational Linguis-tics, 6:287–302.

Georg Wiese, Dirk Weissenborn, and MarianaNeves. 2017. Neural domain adaptation forbiomedical question answering. arXiv preprintarXiv:1706.03610.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Ben-gio, William W Cohen, Ruslan Salakhutdinov, andChristopher D Manning. 2018. Hotpotqa: A datasetfor diverse, explainable multi-hop question answer-ing. arXiv preprint arXiv:1809.09600.

Adams Wei Yu, David Dohan, Minh-Thang Luong, RuiZhao, Kai Chen, Mohammad Norouzi, and Quoc VLe. 2018. Qanet: Combining local convolutionwith global self-attention for reading comprehen-sion. arXiv preprint arXiv:1804.09541.

Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang,Zhiyuan Liu, and Maosong Sun. 2018. Graph neu-ral networks: A review of methods and applications.arXiv preprint arXiv:1812.08434.

51


Essentia: Mining Domain-specific Paraphraseswith Word-Alignment Graphs

Danni Ma1, Chen Chen2, Behzad Golshan2, Wang-Chiew Tan2

1Department of Computer and Information Science, University of Pennsylvania2Megagon Labs

[email protected], chen,behzad,[email protected]

Abstract

Paraphrases are important linguistic resourcesfor a wide variety of NLP applications. Manytechniques for automatic paraphrase miningfrom general corpora have been proposed.While these techniques are successful at dis-covering generic paraphrases, they often failto identify domain-specific paraphrases (e.g.,“staff ”, “concierge” in the hospitality do-main). This is because current techniquesare often based on statistical methods, whiledomain-specific corpora are too small to fit sta-tistical methods. In this paper, we present anunsupervised graph-based technique to mineparaphrases from a small set of sentences thatroughly share the same topic or intent. Oursystem, ESSENTIA, relies on word-alignmenttechniques to create a word-alignment graphthat merges and organizes tokens from inputsentences. The resulting graph is then used togenerate candidate paraphrases. We demon-strate that our system obtains high qualityparaphrases, as evaluated by crowd workers.We further show that the majority of the iden-tified paraphrases are domain-specific and thuscomplement existing paraphrase databases.

1 Introduction

Paraphrases are important linguistic resourceswhich are widely used in many NLP tasks, in-cluding text-to-text generation (Ganitkevitch et al.,2011), recognizing textual entailment (Daganet al., 2005), and machine translation (Martonet al., 2009). Today, mining paraphrases stillremains an active research area (Ferreira et al.,2018; Gupta et al., 2018; Iyyer et al., 2018; Zhanget al., 2019). Most existing work on this topicfocuses on mining general-purpose paraphrases(e.g., “prevalent”, “very common”), but fails toextract domain-specific paraphrases. For exam-ple, while “reservation”, “stay” are not para-phrases in general, they are interchangeable in the

The World Economy

fully recovered from

shrugged off

gotten rid

has

of

the crisis

already

completely

Figure 1: An instance of a word-alignment graph.

following sentence:Can we extend our reservation for two more days?

Existing paraphrase mining techniques are of-ten based on statistical methods. They cannot beimmediately applied to domain-specific corpora,because such corpora are usually smaller in sizeand lack parallel data. ESSENTIA overcomes thisproblem by using an unsupervised graph-basedmethod that mines domain-specific paraphrasesfrom a small set of short sentences sharing thesame topic or intent. ESSENTIA’s key insight isthat a collection of sentences from a specific do-main often exhibit common patterns. ESSENTIA

makes use of these properties to align tokens ofinput sentences. The resulting alignments are thensummarized in a directed acyclic graph (DAG)called the word-alignment graph. It illustrateswhich phrases can be used interchangeably andthus are potential paraphrases. Figure 1 showsthe word-alignment graph generated from thefollowing three sentences:

- The world economy has fully recovered from the crisis.

- The world economy has shrugged off the crisis completely.

- The world economy has gotten rid of the crisis already.

The word-alignment graph reveals that phrasesthat are not aligned, but share the same alignedcontext (i.e. surrounding words) are likely to bedomain-specific paraphrases. Hence, even though“fully recovered from”, “shrugged off ”, “gottenrid of ” are not aligned, they are likely para-phrases because they share the same patterns be-

52

Input

Senten

ces Word

Aligner

Word Alignment

Graph Generator

Paraphrase Generator

Figure 2: The architecture of ESSENTIA.

fore and after themselves.While this work is focused on mining para-

phrases, we believe that word-alignment graphshave other interesting applications, and we leavethem for future work. For instance, a word-alignment graph enables one to generate new sen-tences or phrases that do not appear in the originalset of sentences. “The world economy has gottenrid of the crisis completely” is a new sentence thatis generated using the graph in Figure 1.

Contributions. We present ESSENTIA, an unsu-pervised system for mining domain-specific para-phrases by creating rich graph structures fromsmall corpora. Experiments on datasets in real-world applications demonstrate that ESSENTIA

finds high-quality domain-specific paraphrases.We also validate that these domain-specific para-phrases complement and augment PPDB (Para-phrase Database), the most extensive paraphrasedatabase available in the community.

2 Essentia

The architecture of ESSENTIA (Figure 2) consistsof: (1) a word aligner which aligns similar words(and phrases) between different sentences basedon syntactic and semantic similarity; (2) a word-alignment graph generator that summarizes thealignments into a compact graph structure; and (3)a paraphrase generator that mines domain-specificparaphrases from the word-alignment graph. Wedescribe each component below.

2.1 Word alignerWe use the state-of-the-art monolingual wordaligner by Sultan et al. (2014). The input to theword aligner is a single pair of sentences and theoutput is a predicted mapping between tokens oftwo sentences. ESSENTIA uses the word aligner tocompute the alignments for all pairs of sentencesprovided as input.

Every sentence is first pre-processed by replac-ing numbers and named entities – which are iden-tified by spaCy (Honnibal and Montani, 2017) –with special symbols “NUM” and “ORG” respec-tively before it is passed to the word aligner.

The word aligner relies on paraphrase, lexicalresources and word embedding techniques to finda mapping between tokens. In other words, theword aligner finds general-purpose paraphrasesand maps their tokens accordingly. ESSENTIA fur-ther processes the output of the word aligner tomine domain-specific paraphrases.

2.2 Word-alignment graph generatorOnce the alignments between every pair of sen-tences are available, the word-alignment graphgenerator summarizes all the alignments intoa unified structure, referred to as the word-alignment graph. It is a DAG that represents allthe input sentences (see Figure 1 as an example).The process of creating the word-alignment graphis described as follows.

The first step partitions the set of input sen-tences into compatible groups. A group of sen-tences is compatible if their alignments adhere tothe following three conditions:

• Injectivity For any pair of sentences, eachword should be mapped to at most one word inthe other sentence.

• Monotonicity For any pair of sentences, if aword w1 appears before w2, then the word thatw1 maps to should also appears before the wordthat w2 maps to in the other sentence. Sentencepairs such as “Yesterday I saw him” and “I sawhim yesterday” violate this condition.

• Transitivity Given any three sentences s1, s2,and s3, if a word w1 in s1 is mapped to w2 ins2, and w2 in s2 is mapped to w3 in s3, then w1

should be only mapped to w3 in s3.

The above conditions are necessary to ensurethat the resulting representation is compact andforms a DAG. We start by partitioning the inputsentences into compatible groups. The partition-ing strategy is a simple greedy algorithm whichstarts with a single empty group. A sentence willbe added to the first group that remains compatibleupon adding this new sentence. If no such groupexists, a new empty group is created and the sen-tence is added to this group. This process repeatsuntil each sentence is assigned to one group.

Next, the word-alignment graph generator rep-resents each group as a DAG and then combinesall the DAGs using a shared start-node and end-node to create the final word-alignment graph.Specifically, a line graph is first created for each

53

sentence (i.e., a word-alignment graph for a sin-gle sentence). Then, the alignments are processed:for each pair of aligned words, their correspondingnodes are contracted to a single node. Due to theconstraints imposed earlier, one can easily showthat the resulting graph will be cycle-free.

2.3 Paraphrase generator

Given a word-alignment graph, the paraphrasegenerator considers all paths in the graph thatshare the same start and end node as paraphrasecandidates. For instance, in Figure 1, there arethree branches that start from the node “has” andend in “the”. Consequently, the phrases “fully re-covered from”, “shrugged off ”, “gotten rid of ”are extracted as paraphrase candidates.

However, not all extracted candidates are para-phrases. Consider the following sentences:- Give me directions to my parent’s place

- Give me directions to the Time Square

In this case, “my parent’s place”, “the TimeSquare” will be extracted as candidates, but it isclear that they are not valid paraphrases.

To avoid generating wrong paraphrases, we de-sign a filtering step – which can be implementedeither using rules (e.g., regular expressions) or sta-tistical methods (e.g., word similarity) – on topof the extracted candidates. Our current imple-mentation of this filtering functionality adopts arule-based heuristic that only considers candidatesof verb phrases containing three or fewer tokens,such as “access to Wi-Fi”, “hookup to Wi-Fi”.Our empirical study reveals that many such verbphrases are domain-specific paraphrases. Otherclasses of phrases, such as noun phrases, turnout to have much noise. For example, manynoun phrases are simply different options (e.g.,“today”,“tomorrow”). We leave the design ofadvanced filters for those classes as future work.

In the process of discovering paraphrases, weobserve that sentences can be “cleaned”. Thatis, some phrases can be removed without affect-ing the essential meaning of a sentence. Figure 1shows that the phrases “already” and “completely”share the same start and end node. Moreover, wesee that the start and end node are also directlyconnected with a single edge. Such phrases areoptional phrases and can be removed without af-fecting the core meaning of a sentence. By identi-fying optional phrases, we can simplify the set ofinput sentences to its “essence”, where the name

of ESSENTIA comes from.

Notes on scalability. The time required by theword aligner to compute alignments between twosentences is quite small and can be consideredas constant since the length of input sentencesis bounded in practice. Given that, the time-complexity of ESSENTIA’s pipeline for n inputsentences is O(n2) as we need to compute align-ments between all pairs of sentences. In prac-tice, the pipeline can be applied to roughly a hun-dred sentences within an hour. For a larger collec-tion of sentences, as described in Section 2.2, wefirst run a clustering algorithm to group sentencesinto smaller clusters, and then feed each cluster toESSENTIA’s pipeline.

3 Related Work

Collecting and curating a database of paraphrasesis a costly and time-consuming task in general.Although there are existing techniques to collectparaphrase pairs from crowd-workers more effi-ciently and with lower cost (Chen and Dolan,2011), there has been a great interest in developingtechniques for automatically mining paraphrasesfrom existing corpora. Barzilay and McKeown(2001) proposed the first unsupervised learning al-gorithm for paraphrase acquisition from a corpusof multiple English translations of the same sourcetext. Barzilay and Lee (2003) followed up withan approach that applied multiple-sequence align-ment to sentences gathered from parallel corpora.Pang et al. (2003) proposed a new syntax-basedalgorithm to produce word-alignment graphs forsentences. Finally, Quirk et al. (2004) applied sta-tistical machine translation techniques to extractparaphrases from monolingual parallel corpora.

The most extensive resource for paraphrases to-day is PPDB (Ganitkevitch et al., 2013; Ganitke-vitch and Callison-Burch, 2014; Pavlick et al.,2015b). PPDB consists of a huge number ofphrase pairs with confidence estimates, and hasalready been proven effective for multiple tasks.However, as our experiments show, PPDB andother resources fail to capture a large number ofdomain-specific paraphrases.

To extract domain-specific paraphrases, Pavlicket al. (2015a) extended Moore-Lewis method(Moore and Lewis, 2010) and learned para-phrases from bilingual corpora. Zhang et al.(2016) constructed Markov networks of wordsand picked paraphrases based on the frequency

54

Dataset # of extracted pairs # of valid pairs Precision

ESSENTIASnips 173 84 48.55%

HotelQA 2221 642 28.91%

FSA Snips 18 15 83.33%HotelQA 342 185 54.09 %

Table 1: Comparison between ESSENTIA and FSA baseline on paraphrase extraction

of co-occurrences. However, these systems relyon significantly large amounts of domain-specificdata (either for supervised training or conduct-ing frequency analysis), which may not alwaysbe available. ESSENTIA instead uses an un-supervised graph-based technique for paraphrasemining and does not rely on the presence of alarge amount of domain-specific data. The word-alignment graph constructed by ESSENTIA can beinterpreted as an extension of multi-sentence com-pression (Filippova, 2010). We compactly main-tain all paths and expressions in the constructedword-alignment graphs. As pointed out in Panget al. (2003), the extracted paraphrases can helpenrich the diversity of expressions regarding a spe-cific intention, and ultimately provide more train-ing examples for data-driven models.

4 Evaluation

ESSENTIA is evaluated on two datasets and isshown to generate high quality domain-specificparaphrases. We compare our system against asyntax-based alignment technique by Pang et al.(2003), which we refer to as FSA, as it generatesFinite-State Automata for compactly representingsentences in a setting similar to ours. Comparedto FSA, ESSENTIA generates 263% more para-phrases on those two datasets. We further demon-strate that most extracted paraphrases are trulydomain-specific and thus are missing from PPDB.Datasets We use two datasets to evaluateESSENTIA. The first one, commonly known as theSnips dataset (Coucke et al., 2018), is a collectionof queries submitted to smart conversational de-vices (e.g., Google Home or Alexa). Snips has tendocuments, each covering one intent such as “GetDirections”, “Get Weather” and so on. On aver-age, each document has 32 sentences, and eachsentence has 9 words. The other dataset – which iscalled HotelQA – is an industry proprietary datasetof various types of questions submitted by hotelguests regarding different amenities and services,such as “Check-out” or “Wi-Fi”. HotelQA alsoconsists of ten documents, with an average of 54

sentences per document and 10 words per sen-tence. HotelQA was our primary motivation forinvestigating this problem. The industry applica-tion requires an automatic method to identify a setof questions that are semantically equivalent.

4.1 Mining Paraphrases

Table 1 compares the performance of ESSENTIA

with the FSA baseline for paraphrase mining.Specifically, we show the number of phrasepairs extracted by ESSENTIA and FSA from bothdatasets (“# of extracted pairs” column), numberof valid paraphrases within these pairs (“# of validpairs” column), and precision (“Precision” col-umn). Although FSA has higher precision due toconservative sentence alignment, ESSENTIA ex-tracts significantly more paraphrases, improvingthe recall by 460% (Snips) and 247% (HotelQA)over the baseline. To identify valid paraphrases,we design a crowd-sourcing task on Figure-EightData Annotation Platform. In this task, we presentan extracted candidate pair (e.g., “log onto”,“connect to”) and a domain (e.g., “Wi-Fi”) tohuman annotators, and ask them to decide whetherthe two phrases are paraphrases or not.

ESSENTIA discovers a large number of para-phrases missing from PPDB, which has the high-est coverage among the existing paraphrase re-sources (Pavlick and Callison-Burch, 2016). Moreprecisely, we take the 726 correct extractions ofESSENTIA (as verified by human annotators) andsearch to see if they appear in PPDB even withlow confidence scores. We find that only 4% ofour discovered paraphrases appear in PPDB. Thisin turn shows the effectiveness of ESSENTIA indiscovering paraphrases, because it goes beyondPPDB by using only a few sentences. Table 2 listssome domains and examples of domain-specificparaphrases detected by ESSENTIA.

Finally, to better understand how ESSENTIA’sperformance can be improved and what opportu-nities lie ahead for further research, we review asample of ESSENTIA’s incorrect extractions andidentify two major classes of errors. One class

55

Domain Example paraphrasesRestaurant search recommend a good place

suggest a placeRestaurant reservation get me a place

get me a spotGet directions show me the way

get me directionsGet weather need the weather

want the weatherRequest ride find a taxi

need an uberShare location share my location

send my locationHotel Wi-Fi log onto the Wi-Fi

connect to Wi-FiHotel checkout extend our checkout

have a late checkout

Table 2: Examples of domain-specific paraphrases.

consists of expressions that are alternative optionsbut not necessarily paraphrases (e.g., “avoidingthe highway”, “avoiding toll road”). Anotherclass contains expressions that involve the sametopic but have slightly different intentions (e.g.,“tell me the Wi-Fi password”, “how to connect toWi-Fi”). While the two error classes we discusshere are the most prevalent ones, an in-depth anal-ysis of error classes and their frequencies (whichwe leave as future work) can be quite insightful.

5 Conclusion and Future Work

We present ESSENTIA, an unsupervised graph-based system for extracting domain-specific para-phrases, and demonstrate its effectiveness usingdatasets in real-world applications. Empirical re-sults show that ESSENTIA can generate high qual-ity domain-specific paraphrases that are largelyabsent from mainstream paraphrase databases.

Future work involves various directions. Onedirection is to derive domain-specific sentencetemplates from corpora. These templates can beuseful for natural language generation in question-answering systems or dialogue systems. Second,the current method can be extended to mine para-phrases from a wide range of syntactic units otherthan verb phrases. Also, the word aligner can beimproved to align prepositions more accurately, sothat the generated alignment graph would revealmore paraphrases. Finally, ESSENTIA can also beused to identify linguistic patterns other than para-phrases, such as phatic expressions (e.g., “Excuseme”, “All right”), which will in turn allow us toidentify the essential constituents of a sentence.

ReferencesRegina Barzilay and Lillian Lee. 2003. Learning

to paraphrase: an unsupervised approach usingmultiple-sequence alignment. In Proceedings ofNAACL-HLT 2003.

Regina Barzilay and Kathleen R McKeown. 2001. Ex-tracting paraphrases from a parallel corpus. In Pro-ceedings of NAACL-HLT 2003, pages 50–57.

David L. Chen and William B. Dolan. 2011. Collect-ing highly parallel data for paraphrase evaluation. InProceedings of ACL 2011, pages 190–200.

Alice Coucke, Alaa Saade, Adrien Ball, TheodoreBluche, Alexandre Caulier, David Leroy, ClementDoumouro, Thibault Gisselbrecht, Francesco Calta-girone, Thibaut Lavril, et al. 2018. Snips voice plat-form: an embedded spoken language understandingsystem for private-by-design voice interfaces. arXivpreprint arXiv:1805.10190.

Ido Dagan, Oren Glickman, and Bernardo Magnini.2005. The PASCAL recognising textual entailmentchallenge. In Machine Learning Challenges Work-shop, pages 177–190.

Rafael Ferreira, George DC Cavalcanti, Fred Freitas,Rafael Dueire Lins, Steven J Simske, and MarceloRiss. 2018. Combining sentence similarities mea-sures to identify paraphrases. Computer Speech &Language, pages 59–73.

Katja Filippova. 2010. Multi-sentence compression:Finding shortest paths in word graphs. In Proceed-ings of COLING 2010, pages 322–330.

Juri Ganitkevitch and Chris Callison-Burch. 2014. Themultilingual paraphrase database. In Proceedings ofLREC 2014, pages 4276–4283.

Juri Ganitkevitch, Chris Callison-Burch, CourtneyNapoles, and Benjamin Van Durme. 2011. Learningsentential paraphrases from bilingual parallel cor-pora for text-to-text generation. In Proceedings ofEMNLP 2011, pages 1168–1179.

Juri Ganitkevitch, Benjamin Van Durme, and ChrisCallison-Burch. 2013. PPDB: The paraphrasedatabase. In Proceedings of NAACL-HLT 2013,pages 758–764.

Ankush Gupta, Arvind Agarwal, Prawaan Singh, andPiyush Rai. 2018. A deep generative framework forparaphrase generation. In Thirty-Second AAAI Con-ference on Artificial Intelligence.

Matthew Honnibal and Ines Montani. 2017. spaCy 2:Natural language understanding with bloom embed-dings, convolutional neural networks and incremen-tal parsing. To appear.

Mohit Iyyer, John Wieting, Kevin Gimpel, and LukeZettlemoyer. 2018. Adversarial example generationwith syntactically controlled paraphrase networks.In Proceedings of NAACL-HLT 2018, pages 1875–1885.

56

Yuval Marton, Chris Callison-Burch, and PhilipResnik. 2009. Improved statistical machine trans-lation using monolingually-derived paraphrases. InProceedings of EMNLP 2009, pages 381–390.

Robert C Moore and William Lewis. 2010. Intelligentselection of language model training data. In Pro-ceedings of ACL 2010, pages 220–224. Associationfor Computational Linguistics.

Bo Pang, Kevin Knight, and Daniel Marcu. 2003.Syntax-based alignment of multiple translations:Extracting paraphrases and generating new sen-tences. In Proceedings of NAACL-HLT 2003, pages102–109.

Ellie Pavlick and Chris Callison-Burch. 2016. SimplePPDB: A paraphrase database for simplification. InProceedings of ACL 2016, pages 143–148.

Ellie Pavlick, Juri Ganitkevitch, Tsz Ping Chan,Xuchen Yao, Benjamin Van Durme, and ChrisCallison-Burch. 2015a. Domain-specific paraphraseextraction. In Proceedings of ACL-IJCNLP 2015,pages 57–62.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch,Benjamin Van Durme, and Chris Callison-Burch.2015b. Ppdb 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, andstyle classification. In Proceedings of ACL-IJCNLP2015, pages 425–430.

Chris Quirk, Chris Brockett, and William Dolan.2004. Monolingual machine translation for para-phrase generation. In Proceedings of EMNLP 2004,pages 142–149.

Md. Arafat Sultan, Steven Bethard, and Tamara Sum-ner. 2014. Back to basics for monolingual align-ment: Exploiting word similarity and contextual ev-idence. Transactions of the Association for Compu-tational Linguistics, pages 219–230.

Lilin Zhang, Zhen Weng, Wenyan Xiao, Jianyi Wan,Zhiming Chen, Yiming Tan, Maoxi Li, and Ming-wen Wang. 2016. Extract domain-specific para-phrase from monolingual corpus for automatic eval-uation of machine translation. In Proceedings ofWMT 2016, pages 511–517.

Yuan Zhang, Jason Baldridge, and Luheng He. 2019.PAWS: Paraphrase adversaries from word scram-bling. In Proceedings of NAACL-HLT 2019, pages1298–1308.

57


Layerwise Relevance Visualization in Convolutional Text GraphClassifiers

Robert Schwarzenberg, Marc Hubner, David Harbecke, Christoph Alt, Leonhard HennigGerman Research Center for Artificial Intelligence (DFKI), Berlin, Germany

[email protected]

Abstract

Representations in the hidden layers of DeepNeural Networks (DNN) are often hard to in-terpret since it is difficult to project them intoan interpretable domain. Graph ConvolutionalNetworks (GCN) allow this projection, but ex-isting explainability methods do not exploitthis fact, i.e. do not focus their explanations onintermediate states. In this work, we presenta novel method that traces and visualizes fea-tures that contribute to a classification decisionin the visible and hidden layers of a GCN. Ourmethod exposes hidden cross-layer dynamicsin the input graph structure. We experimen-tally demonstrate that it yields meaningful lay-erwise explanations for a GCN sentence clas-sifier.

1 Introduction

A Deep Neural Network (DNN) that offers – orfrom which we can retrieve – explanations for itsdecisions arguably is easier to improve and de-bug than a black box model. Loosely followingMontavon et al. (2018), we understand an expla-nation as a “collection of features (...) that havecontributed for a given example to produce a deci-sion.” For humans to make sense of an explanationit needs to be expressed in a human-interpretabledomain, which is often the input space (Montavonet al., 2018).

For instance, in the vision domain Bach et al.(2015) propose the Layerwise Relevance Propa-gation (LRP) algorithm that propagates contribu-tions from an output activation back to the firstlayer, the input pixel space. Arras et al. (2017)implement a similar strategy for NLP tasks, wherecontributions are propagated into the word vectorspace and then summed up over the word vectordimensions to create heatmaps in the plain text in-put space.

In the course of the backpropagation of con-tributions, hidden layers in the DNN are visited,but rarely inspected.1 A major challenge when in-specting and interpreting hidden states lies in thefact that a projection into an interpretable domainis often made difficult by the non-linearity, non-locality and dimension reduction/expansion acrosshidden layers.

In this work, we present an explainabilitymethod that visits and projects visible and hiddenstates in Graph Convolutional Networks (GCN)(Kipf and Welling, 2016) onto the interpretable in-put space. Each GCN layer fuses neighborhoodsin the input graph. For this fusion to work, GCNsneed to replicate the input adjacency at each layer.This allows us to project their intermediate repre-sentations back onto the graph.

Our method, layerwise relevance visualization(LRV), not only visualizes static explanations inthe hidden layers, but also tracks and visual-izes hidden cross-layer dynamics, as illustrated inFig. 1. In a qualitative and a quantitative anal-ysis, we experimentally demonstrate that our ap-proach yields meaningful explanations for a GCNsentence classifier.

2 Methods

In this section we explain how we combine GCNsand LRP for layerwise relevance visualization. Webegin with a short recap of GCNs and LRP to es-tablish a basis for our approach and to introducenotation.

2.1 Graph Convolutional NetworksAssume a graph G represented by (A,H(0))where A is a degree-normalized adjacency matrix,which introduces self loops to the nodes in G, and

1Although, Arras et al. (2017) propose inspecting hiddenlayers with LRP as a possible future direction.

58

A total of 116 patients were randomized .(2.08) (14.21) (8.56) (41.18) (0.0) (30.28) (2.56) (1.12)

det

nmod

case

nummod

nsubjpass

auxpass punct

GCN

GCN

FC


det

nmod

case

nummod

nsubjpass

auxpass punct

GCN

GCN

FC


det

nmod

case

nummod

nsubjpass

auxpass punctGCN

GCN

FC

Figure 1: Layerwise Relevance Visualization in a Graph Convolutional Network. Left: Projection of relevancepercentages (in brackets) onto the input graph structure (red highlighting). Edge strength is proportional to therelevance percentage an edge carried from one layer to the next. Right: Architecture (replicated at each layer) ofthe GCN sentence classifier (input bottom, output top). Node and edge relevance were normalized layerwise. Thepredicted label of the input was RESULT, as was the true label.

H(0) ∈ IRn×d an embedding matrix of the n nodesin G, ordered in compliance with A. A GCN layerpropagates such graph inputs according to

H(l+1) = σ(AH(l)W (l)

), (1)

which can be decomposed into a feature projec-tion H ′(l) = H(l)W (l) and an adjacency projec-tion AH ′(l), followed by a non-linearity σ.

The adjacency projection fuses each node withthe features in its effective neighborhood. At eachGCN layer, the effective neighborhood becomesone hop larger, starting with a one-hop neighbor-hood in the first layer. The last layer in a GCNclassifier typically is fully connected (FC) andprojects its inputs onto class probabilities.

2.2 Layerwise Relevance PropagationTo receive explanations for the classifications ofa GCN classifier, we apply LRP. LRP explains aneuron’s activation by computing how much eachof its input neurons contributed2 to the activation.Contributions are propagated back layerwise.

2We use the terms contribution and relevance inter-changeably.

Montavon et al. (2017) show that the positivecontribution of neuron h

(l)i , as defined in Bach

et al. (2015) with the z+-rule, is equivalent to

Ri =∑

j

h(l)i w

+ij∑

k h(l)k w

+kj

Rj , (2)

where∑

k sums over the neurons in layer l and∑j over the neurons in layer (l + 1). Eq. 2

only allows positive inputs, which each layer re-ceives if the previous layers are activated usingReLUs.3 LRP has an important property, namelythe relevance conservation property:

∑j Rj←k =

Rk, Rj =∑

k Rj←k, which not only conservesrelevance from neuron to neuron but also fromlayer to layer (Bach et al., 2015).

2.3 Layerwise Relevance VisualizationWe combine graph convolutional networks andlayerwise relevance propagation for layerwise rel-evance visualization as follows. During train-ing, GCNs receive input graphs in the form of(A,H(0)) tuples. This is efficient and allows

3LRP is capable of tracking negative contributions, too.In this work, we focus on positive contributions.

59

batching but it poses a problem for LRP: If theadjacency matrix is considered part of the input,one could argue that it should receive relevancemass. This, however, would make it hard to meetthe conservation property.

Instead, we treat A as part of the model in apost-hoc explanation phase, in which we constructan FC layer with A as its weights. The GCN layerthen consists of two FC sublayers. During the for-ward pass, the first one performs the feature pro-jection and the second one – the newly constructedone – the adjacency projection.

To make use of Eq. 2 in the adjacency layer,we need to avoid propagating negative activations.This is why we apply the ReLU activation early,right after the feature projection, which yields thepropagation rule

H(l+1) = Aσ(H(l)W (l)). (3)

When unrolled, this effectively just moves one ad-jacency projection from inside the first ReLU tooutside the last ReLU. Alternatively, it should bepossible to first perform the adjacency projectionand afterwards the feature projection.

Eq. 3 allows us to directly apply Eq. 2 to thetwo projection sublayers in a GCN layer. Dur-ing LRP, we cache the intermediate contributionmaps R(l) ∈ IRn×f that we receive right afterwe propagated past the feature projection sublayerand compute node i’s contribution in that layer asR(i(l)) =

∑cR

(l)ic . In addition, we also compute

the edge relevance e(l)(i,j) between nodes i and j asthe amount of relevance that it carried from layerl − 1 to layer l.

In what follows, we use R(i(l)) and e(l)ij to visu-alize hidden state dynamics (i.e. relevance flow) inthe input graph. Furthermore, similar to Xie andLu (2019), we make use of perturbation experi-ments to verify that our method indeed identifiesthe relevant components in the input graph.

3 Experiments

Our experiments are publicly available:https://github.com/DFKI-NLP/lrv/.We trained a GCN sentence classifier on a 20ksubset of the PubMed 200k RCT dataset (Der-noncourt and Lee, 2017). The dataset containsscientific abstracts in which each sentence isannotated with either of the 5 labels BACK-GROUND, OBJECTIVE, METHOD, RESULT,or CONCLUSION. Dernoncourt and Lee (2017)

describe the task as a sequential classification tasksince all sentences in one abstract are classifiedin sequence, but we treat it as a conventionalsentence classification task, classifying eachsentence in isolation.

In a preprocessing step, we used ScispaCy(Neumann et al., 2019) for tokenization and de-pendency parsing to generate the input graphs forthe GCN sentence classifier. For the node em-beddings in the dependency graph, we utilizedpre-trained fastText embeddings (Mikolov et al.,2018). Edge type information was discarded.

Our classifier, implemented and trained usingPyTorch (Paszke et al., 2017), consists of twostacked GCN layers (Eq. 3), followed by a max-pooling operation and a subsequent FC layer. Alllayers were trained without bias and optimizedwith the Adam optimizer (Kingma and Ba, 2014).After each optimization step we clamped the neg-ative weights of the last FC layer to avoid negativeoutputs.

During training, we batched inputs, but in thepost-hoc explanation phase, single graphs from thetest set were forwarded through the network to im-plement the strategy outlined in Sec. 2.3. We firstconstructed the adjacency layer with A, then per-formed a forward pass during which we cached theinputs at each layer for LRP.

LRP started at the output neuron with the max-imal activation, before the softmax normalization.For all intermediate layers we used Eq. 2. Sincecoefficients in the input word vectors are nega-tive, we used a special propagation rule (Montavonet al., 2017), which we omit here, due to spacecontraints. We cached the intermediate contribu-tion maps right after the max-pooling operation,after the second and after the first (input space)feature projection layers in the GCN layers.

For the qualitative analysis, we visualized thecontributions of each node at each layer as well asthe relevance flow over the graph’s edges acrosslayers, as outlined in Sec. 2.3. For the quantita-tive validation, we deleted a growing portion ofthe globally most relevant edges (starting with themost relevant ones) and monitored the model’sclassification performance. In a second experi-ment, we did the same starting with the least rele-vant edges. Here, global edge relevance refers tothe sum of an edge’s relevance across all layers,∑

l e(l)ij .

60

4 Results

After training, our model achieved a weighted F1

score of 0.822 on the official (20k) test set. Wethen performed the post-hoc explanation with thetrained model, receiving layerwise explanations asthe one depicted in Fig. 1. More examples can befound in the appendix.

Fig. 1 (top) shows which vectors in the finallayer the GCN bases its classification decision on.The middle and bottom segments reveal how muchrelevant information the model fused in these vec-tors from neighboring nodes during the forwardpass:

According to the LRV in Fig. 1, the modelprimarily bases its classification decision on thefusion of the vector representations of total,patients, and 116. The vector representationsof these nodes are accumulated in the vector of116 in two steps.

Interestingly, the vector of patients con-tributes a larger portion after it has been fused withtotal. Furthermore, randomized becomesmore relevant after the first GCN layer, contribut-ing to the second most relevant n-gram (were,randomized). Other n-grams, such as (A,total) or (of, patients) appear less rele-vant.

Note that if we had simply redistributed thecontributions in the final layer, using only theadjacency matrix, without considering the acti-vations in the forward pass as LRP does, thesen-grams would have received an unsubstantiatedshare. Furthermore, a conventional, single expla-nation in the input space would have hardly re-vealed the hidden dynamics discussed above.

As explained in Sec. 3, we also conducted in-put perturbation experiments to validate that ourmethod indeed identifies the graph componentsthat are relevant to the GCN decision. Fig. 2 sum-marizes the results: The performance of our modeldegrades much faster when we delete edges thatcontributed a lot to the model’s decision, accord-ing to our method. We take this as evidence thatour method indeed correctly identifies the relevantcomponents in the input graphs.

5 Related Work

Niepert et al. (2016); Wu et al. (2019) introducegraph networks that natively support feature visu-alization (explanations) but their models are not

0 20 40 60 80 100

0.78

0.8

0.82

Percentage of deleted edges

Wei

ghte

dF 1

least relevantmost relevant

Figure 2: Perturbation experiment.

graph convolutional networks in the sense of Kipfand Welling (2016) that we address here.

Velickovic et al. (2017) propose Graph Atten-tion Networks and also project an explanation of ahidden representation onto the input graph. Theydo not, however, continue across multiple layers.Furthermore, their explanation technique differsfrom ours. The authors use attention coefficientsto scale edge thicknesses in their explanatory visu-alization. In contrast, we base edge relevance onthe amount of relevance an edge carried from onelayer to the next, according to LRP. Our methoddoes this in a conventional GCN, which does notapply the attention mechanism.

The body of work on the explainability of graphneural nets also includes the GNN explainer byYing et al. (2019), a model-agnostic explainabil-ity method. Since their method is model-agnostic,it can be applied to GCNs. Nevertheless, it wouldnot reveal hidden dynamics, as our method does.

In parallel to our work, in the last couple ofmonths, several more works on explainability inthe context of graph neural nets were published.Pope et al. (2019), for instance, evaluate sev-eral explainability methods on data sets from thechemistry and vision domains. The methods donot include LRP, however, and the authors do notvisualize relevance layerwise. The two works thatare most related to our approach were publishedvery recently by Xie and Lu (2019) and Baldas-sarre and Azizpour (2019). Both implement LRPfor graph networks (inter alia). Xie and Lu (2019)introduce the notion of node importance visualiza-tion, which is also reflected in our explanationsand Baldassarre and Azizpour (2019), similar to

61

our approach, suggest to trace (and visualize) ex-planations over feature-less edges. Both works,however, address a different task from ours: Incontrast to our graph classification task, they per-form node classification and identify k-hop neigh-bor contributions in the proximity of a central nodeof interest.

6 Conclusions

We presented layerwise relevance visualization inconvolutional text graph classifiers. We demon-strated that our approach allows to track, visualizeand inspect visible as well as hidden state dynam-ics. We conducted qualitative and quantitative ex-periments to validate the proposed method.

This is a focused contribution; further researchshould be conducted to test the approach in otherdomains and on other data sets. Future versionsshould also exploit LRP’s ability to expose nega-tive evidence and the method could be extended tonode classifiers.

7 Acknowledgements

This research was partially supported by the Ger-man Federal Ministry of Education and Researchthrough the project DEEPLEE (01IW17001). Wewould also like to thank the anonymous reviewersfor their feedback on the paper and Arne Binderfor his feedback on the code base.

ReferencesLeila Arras, Franziska Horn, Gregoire Montavon,

Klaus-Robert Muller, and Wojciech Samek. 2017.”What is relevant in a text document?”: An in-terpretable machine learning approach. PloS one,12(8):e0181142.

Sebastian Bach, Alexander Binder, Gregoire Mon-tavon, Frederick Klauschen, Klaus-Robert Muller,and Wojciech Samek. 2015. On pixel-wise explana-tions for non-linear classifier decisions by layer-wiserelevance propagation. PLoS ONE, 10(7):e0130140.

Federico Baldassarre and Hossein Azizpour. 2019. Ex-plainability techniques for graph convolutional net-works. arXiv preprint arXiv:1905.13686.

Franck Dernoncourt and Ji Young Lee. 2017. Pubmed200k rct: a dataset for sequential sentence classifica-tion in medical abstracts. IJCNLP 2017, page 308.

Diederik P Kingma and Jimmy Ba. 2014. Adam: Amethod for stochastic optimization. arXiv preprintarXiv:1412.6980.

Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutionalnetworks. arXiv preprint arXiv:1609.02907.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski,Christian Puhrsch, and Armand Joulin. 2018. Ad-vances in pre-training distributed word representa-tions. In Proceedings of the International Confer-ence on Language Resources and Evaluation (LREC2018).

Gregoire Montavon, Sebastian Lapuschkin, AlexanderBinder, Wojciech Samek, and Klaus-Robert Muller.2017. Explaining nonlinear classification decisionswith deep taylor decomposition. Pattern Recogni-tion, 65:211–222.

Gregoire Montavon, Wojciech Samek, and Klaus-Robert Muller. 2018. Methods for interpreting andunderstanding deep neural networks. Digital SignalProcessing, 73:1–15.

Mark Neumann, Daniel King, Iz Beltagy, and WaleedAmmar. 2019. Scispacy: Fast and robust models forbiomedical natural language processing.

Mathias Niepert, Mohamed Ahmed, and KonstantinKutzkov. 2016. Learning convolutional neural net-works for graphs. In International conference onmachine learning, pages 2014–2023.

Adam Paszke, Sam Gross, Soumith Chintala, Gre-gory Chanan, Edward Yang, Zachary DeVito, Zem-ing Lin, Alban Desmaison, Luca Antiga, and AdamLerer. 2017. Automatic differentiation in pytorch.

Phillip E Pope, Soheil Kolouri, Mohammad Rostami,Charles E Martin, and Heiko Hoffmann. 2019. Ex-plainability methods for graph convolutional neuralnetworks. In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition, pages10772–10781.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova,Adriana Romero, Pietro Lio, and Yoshua Bengio.2017. Graph attention networks. arXiv preprintarXiv:1710.10903.

Felix Wu, Tianyi Zhan, Amauri Holonda de Souza Jr.,Christopher Fifty, Tao Yu, and Kilian Weinberger Q.2019. Simplifying graph convolutional network.arXiv preprint.

Shangsheng Xie and Mingming Lu. 2019. Interpret-ing and understanding graph convolutional neuralnetwork using gradient-based attribution methods.arXiv preprint arXiv:1903.03768.

Rex Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zit-nik, and Jure Leskovec. 2019. Gnn explainer: A toolfor post-hoc explanation of graph neural networks.arXiv preprint arXiv:1903.03894.

62


TextGraphs 2019 Shared Task on Multi-Hop Inference forExplanation Regeneration∗

Peter Jansen? and Dmitry Ustalov†, ‡

?School of Information, University of Arizona, [email protected]

†Data and Web Science Group, University of Mannheim, Germany‡Yandex, Russian Federation

[email protected]

Abstract

While automated question answering systemsare increasingly able to retrieve answers to nat-ural language questions, their ability to gener-ate detailed human-readable explanations fortheir answers is still quite limited. The SharedTask on Multi-Hop Inference for ExplanationRegeneration tasks participants with regener-ating detailed gold explanations for standard-ized elementary science exam questions by se-lecting facts from a knowledge base of semi-structured tables. Each explanation containsbetween 1 and 16 interconnected facts thatform an “explanation graph” spanning core sci-entific knowledge and detailed world knowl-edge. It is expected that successfully combin-ing these facts to generate detailed explana-tions will require advancing methods in multi-hop inference and information combination,and will make use of the supervised trainingdata provided by the WorldTree explanationcorpus. The top-performing system achieveda mean average precision (MAP) of 0.56, sub-stantially advancing the state-of-the-art overa baseline information retrieval model. De-tailed extended analyses of all submitted sys-tems showed large relative improvements inaccessing the most challenging multi-hop in-ference problems, while absolute performanceremains low, highlighting the difficulty of gen-erating detailed explanations through multi-hop reasoning.

1 Introduction

Multi-hop inference is the task of combining morethan one piece of information to solve an infer-ence task, such as question answering. This cantake many forms, from combining free-text sen-tences read from books or the web, to combininglinked facts from a structured knowledge base. The

∗The two authors contributed equally to this work.

Figure 1: The explanation regeneration task suppliesa model with a question and its correct answer (top),and the model must successfully regenerate the goldexplanation for why the answer to the question is cor-rect by selecting the appropriate set of interconnectedfacts from a knowledge base (bottom). Gold explana-tions range from having 1 to over 16 facts, with thisexample containing 3 facts.

Shared Task on Explanation Regeneration asks par-ticipants to develop methods to reconstruct goldexplanations for elementary science questions, us-ing a corpus of explanations that provides super-vision and instrumentation for this multi-hop in-ference task. Each explanation is represented asan “explanation graph”, a set of up to 16 atomicfacts drawn from a knowledge base of 4,950 factsthat, together, form a detailed explanation for thereasoning required to answer a question. The ex-planations include both core scientific facts as wellas detailed world knowledge, integrating aspectsof multi-hop reasoning and common-sense infer-ence. It is anticipated that linking these facts toachieve strong performance at rebuilding the goldexplanation graphs will require methods to performmulti-hop inference.

63

Large language models have recently demon-strated human-level performance on elementaryand middle school standardized multiple choicescience exams, achieving 90% on the elementarysubset, and 92% on middle-school exams (Clarket al., 2019). While these models are able to answermost questions correctly, they are generally unableto explain the reasoning behind their answers toa user, for example generating the explanation inFigure 1. This inability to perform interpretable,explanation-centered inference places strong limitson the utility of these underlying solution meth-ods. For example, an intelligent tutoring systemthat provides students correct answers but that isunable to explain why they are correct limits thestudent’s ability to acquire a deep understandingof the subject matter. Similarly, in the medical do-main, a system that recommends a patient receivea particular surgery but that is unable to explainwhy presents challenges towards trusting that thesystem has made the correct medical decision.

Multi-hop inference provides a natural mecha-nism for producing explanations by aggregatingmultiple facts into an “explanation graph”, or a se-ries of facts that were used to perform the inferenceand arrive at a particular answer. By providingthese same facts to a user in the form of a human-readable explanation, the user is able to inspect thereasoning made by an automated algorithm, both tounderstand its reasoning and evaluate its soundness.An additional implication of multi-hop inferenceis the ability to meaningfully combine facts usingsmaller, human-scale (or child-scale) knowledgeresources to perform the inference task. For exam-ple, the RoBERTa model (Liu et al., 2019) used toachieve 90% accuracy on science exam questionanswering by Clark et al. (2019) was pre-trainedon 160GB of text, while the WorldTree explanationcorpus (Jansen et al., 2018) used here shows thesesame questions can be answered and provided withdetailed explanations using only 500KB of text,a difference of more than 5 orders of magnitude.1

Unfortunately multi-hop reasoning is currently verychallenging, and current methods have strong limi-tations due to noise in this information aggregationprocess, the limitations of existing training data,and the ultimate numbers of facts required to builddetailed explanations. These contemporary chal-

1The knowledge base in the WorldTree explanation corpusis approximately 500KB, a factor of 320,000 times less thanthe 160GB of text used to train the RoBERTa language model(Liu et al., 2019).

lenges are briefly described in Section 2.We propose “explanation regeneration” as a

stepping-stone task on the path towards large-scalemulti-hop inference for question answering andexplanation generation. Explanation regenerationsupplies a model with both a question and correctanswer, and asks the model to regenerate a detailedgold explanation (generated by a human annotator)by selecting one or more facts in a knowledge basethat the model believes should be in the explana-tion. As the results of this shared task show, evenwith the question and correct answer provided, re-generating a detailed explanation proves to be anextremely challenging task, even when the factsare drawn from a comparatively small knowledgebase. It is our hope that this stepping-stone taskwill help inform methods of combining informationto support inference, and provide instrumentationto develop algorithms capable of combining largenumbers of facts (10+) that appear challenging toreach with current methods for multi-hop inference.

2 Contemporary Challenges inMulti-hop Inference

Semantic Drift. One of the central challenges toperforming multi-hop inference is that meaning-fully combining facts – i.e. traversing from onefact to another in a knowledge graph – is a noisyprocess, in large part because the signals we havefor knowing whether two facts are relevant to an-swering a question (and can thus be meaningfullycombined) are imperfect. Often times those signalsare as simple as lexical overlap – two sentences (ornodes) in a knowledge graph sharing one or moreof the same words. Sometimes this lexical over-lap is a useful traversal mechanism – for example,knowing both “a fly is a kind of [insect]” and “an[insect] has six legs”, two facts that connect on theword insect, helps answer the question about in-sect identification in Figure 1. Unfortunately, oftentimes these signals can lead to information that isnot on context or relevant to answering a particularquestion – for example, combining “a [tree] is akind of living thing” and “[trees] require sunlight tosurvive” would be unlikely to help answer a ques-tion about “Which adaptations help a tree survivethe heat of a forest fire?”.

The observation that chaining facts together onimperfect signals often leads inference to go off-context and become errorful is the phenomenon of“semantic drift” (Fried et al., 2015), and has been

64

Figure 2: An example gold explanation graph that contains 11 facts. Top: the question and its correct answer.Bottom: the 11 facts of the gold explanation. Each fact is represented as a row in a semi-structured table, drawnfrom a knowledge base of 62 tables totalling approximately 4,950 table rows. Colored edges represent how factsinterconnect with each other and/or the question or answer text based on lexical overlap (i.e. sharing one or moreof the same lemmas).

demonstrated across a wide variety of representa-tions and traversal algorithms including words anddependencies (Fried et al., 2015), embeddings (Panet al., 2017), sentences and sentence-level graphs(Jansen et al., 2017), as well as aggregating entireparagraphs (Clark and Gardner, 2018). Typicallymulti-hop models see small performance benefits(of between 1% to 5%) when aggregating 2 piecesof information, and may see small performancebenefits when aggregating 3 pieces of information,then performance decreases as progressively moreinformation is aggregated due to this “semanticdrift”. Khashabi et al. (2019) analytically show thatsemantic drift places strong limits on the amountof information able to be combined for inference.

Long Inference Chains. Jansen et al. (2016,2018) showed that even inferences for elementaryscience require aggregating an average of 6 facts(and as many as 16 facts) to answer and explainthe reasoning behind those answers when com-mon sense knowledge is included. With contempo-rary inference models infrequently able to combinemore than 2 facts, the current state-of-the-art is

still far from being able to meaningfully combineenough information to produce detailed and thor-ough explanations to 4th grade science questions.

Multi-hop methods are not required to an-swer questions on many “multi-hop” datasets.Chen and Durrett (2019) show that it is possibleto achieve near state-of-the-art performance ontwo popular multi-hop question answering datasets,WikiHop (Welbl et al., 2018) and HotPotQA (Yanget al., 2018), using baseline models that do not per-form multi-hop inference. Because new multi-hopinference algorithms are often characterized usingtheir accuracy on the question answering task asa proxy for their capacity to perform multi-hopinference, rather than explicitly evaluating an algo-rithm’s capacity to aggregate information by con-trolling the amount of information it can combine(as in Fried et al. (2015)), we currently do not havewell-controlled characterizations of the informa-tion aggregation abilities of many proposed multi-hop algorithms. The WorldTree explanation corpus(Jansen et al., 2018) used in this dataset providesdetailed supervised training and evaluation data

65

Question: A student placed an ice cube on a plate in the sun. Ten minutes later, only water was on the plate.Which process caused the ice cube to change to water?

Answer Candidates: (A) condensation (B) evaporation (C) freezing (*D) melting

Gold Explanation from WorldTree Corpus:Explanatory Role Fact (Table Row)

CENTRAL melting means changing from a solid into a liquid by adding heat energyGROUNDING an ice cube is a kind of solidGROUNDING water is a kind of liquid

CENTRAL water is in the solid state, called ice, for temperatures between -273C and 0 CLEXGLUE heat means heat energyLEXGLUE adding heat means increasing temperatureCENTRAL if an object absorbs solar energy then that object will increase in temperatureCENTRAL if an object is in the sunlight then that object will absorb solar energyCENTRAL the sun is a source of (light ; light energy) called sunlightLEXGLUE to be in the sun means to be in the sunlightCENTRAL melting is a kind of process

Explanation Regeneration Task (Ranking):Rank Gold Fact (Table Row)

1 * melting is a kind of process2 thawing is similar to melting3 melting is a kind of phase change4 melting is when solids are heated above their melting point5 amount of water in a body of water increases by (storms ; rain ; ice melting)6 an ice cube is a kind of object7 * an ice cube is a kind of solid8 freezing point is similar to melting point9 melting point is a property of a (substance ; material)10 glaciers melting has a negative impact on the glaicial environment11 plate tectonics is a kind of process12 sometimes piles of rock are formed by melting glaciers depositing rocks13 melting point can be used to identify a pure substance14 ice crystals means ice15 the (freezing point of water ; melting point of water) is 0C16 the melting point of iron is 1538C17 the melting point of oxygen is -218.8C18 * melting means changing from a solid into a liquid by adding heat energy19 adding salt to a liquid decreases the melting point of that liquid20 ice is a kind of food...

Ranks of gold rows: 1, 7, 18, 53, 102, 384, 408, 858, 860, 3778, 3956Average precision of ranking: 0.149

Figure 3: An example ranking from the tf.idf baseline system for the explanation reconstruction task. Top: theelementary science question and multiple choice answer candidates, with the correct answer highlighted (the cor-rect answer is supplied to the model). Middle: the gold explanation for this question, supplied by the WorldTreecorpus. Each fact/sentence is represented as a row in a semi-structured table (see Section 4 for a description ofthe explanation corpus and knowledge base). Bottom: the baseline system’s rankings of the facts in the knowledgebase, where facts believed to be in the gold explanation are preferentially ranked to the top of the list.

for how multiple facts can link to produce detailedexplanations, providing a targeted method of instru-menting multi-hop performance.

Chance Performance on Knowledge Graphs.Jansen (2018) empirically demonstrated that se-mantic drift can be overpoweringly large or decep-tively low, depending on the text resources used to

build the knowledge graph, and the criteria usedfor selecting nodes. While the chance of hoppingto a relevant node on a graph constructed from sen-tences in an open-domain corpus like Wikipediacan be very small, using a term frequency modelcan increase this chance performance by ordersof magnitude, increasing chance traversal perfor-mance beyond the performance of some algorithms

66

reported in the literature. Unfortunately evaluatingthe chance performance on a knowledge graph iscurrently a very expensive manual task, and wecurrently suffer from a methods problem of beingable to disentangle the performance of novel multi-hop algorithms from the chance performance of theknowledge graphs they use.

Explicit Training Data for Multi-hop Inferenceand Explanation Construction. Because of thedifficulty and expense associated with manually an-notating inference paths in a knowledge base, mostmulti-hop inference algorithms have lacked explicitsupervision for the multi-hop inference task. Asa result, models have often had to use other latentsignals – like answering a question correctly – asa proxy for doing well at the multi-hop inferencetask, even if they do not have a strong correlationwith producing meaningful combinations of infor-mation or strong explanations (Jansen et al., 2017).

3 Task Description

The explanation regeneration task supplies boththe question and correct answer, and requires amodel to build an explanation for why the answeris correct. We consider this a stepping-stone tasktowards multi-hop inference for question answer-ing as the model (strictly speaking) is only requiredto perform an explanation construction task, andis not required to perform the question answeringtask of inferring the correct answer to the question– though models are free to also undertake this stepif they wish.

To encourage a wide variety of techniques bothgraph-based and otherwise, the evaluation of expla-nation reconstruction is framed as a ranking task.For a given question, the model is given the ques-tion and correct answer text, and must selectivelyrank a list of knowledge base facts such that thosethe model believes are a part of a gold explanationfor that question are preferentially ranked to thetop of a list.

An example question and gold explanation graphare shown in Figure 2. The question asks a studentto infer what process causes an ice cube to turn intowater when placed in the sun. The detailed explana-tion is aimed at supplying all facts required to havea detailed understanding of the situation to arriveat the correct answer, and includes both core scien-tific knowledge (e.g. “if an object absorbs solarenergy then that object will increase in tempera-ture”) and world knowledge (e.g. “an ice cube is

a kind of solid”). This scientific and world knowl-edge is generally not supplied in the question, but isknowledge a computational algorithm would likelyrequire in order to arrive at a complete explanationthat would be meaningful to someone who may notpossess that world knowledge. In this way the levelof detail in the explanations is aimed at a youngchild that possesses minimal world knowledge, andthe explanations tend to represent instantiated ver-sions of scripts or frames (Schank and Abelson,1975; Baker et al., 1998) that a model would haveto understand or use to completely reconstruct theexplanation.

An example of the explanation reconstructiontask framed as a ranking problem is shown in Fig-ure 3. Here, an example model (the tf.idf baseline)must preferentially rank facts from the knowledgebase that it believes are part of the gold explanationto the top of the ranked list. In the case of the ex-ample question about ice melting in the sun, onlythree of the facts listed in the gold explanation areranked within the top 20 facts (here, “melting isa kind of process” and “an ice cube is a kind ofsolid”, ranked at positions 1 and 7, respectively,as well as the core scientific fact “melting meanschanging from a solid to a liquid by adding heatenergy”, ranked at position 18). Explanation recon-struction performance is evaluated in terms of meanaverage precision (MAP) by comparing the rankedlist of facts with the gold explanation. In Section 7,we perform extended analyses that further breakdown the performance of each system submittedto this shared task using both automated analysesas well as a manual analyses of the relevance ofhighly ranked explanation sentences.

4 Training and Evaluation Dataset

The data used in this shared task comes from theWorldTree explanation corpus (Jansen et al., 2018).The data includes approximately 2,200 standard-ized elementary science exam questions 3rd to 5th

grade drawn from the Aristo Reasoning Challenge(ARC) corpus (Clark et al., 2018). 1,657 of thesequestions include detailed explanations for theiranswers, in the form of graphs of separate atomicfacts that are connected together by having lexicaloverlap (i.e. shared words) with each other, and/orthe question or answer text. For this shared task,the corpus is divided into the standard ARC train,development, and test sets. Considering only ques-tions that contain gold explanations, this results in

67

Figure 4: The distribution of explanation lengths in thetraining set, represented as numbers of discrete facts(or “table rows”) in the explanation. On average, eachquestion contains 6.3 facts in its explanation.

Figure 5: The distribution of facts with a given explana-tory role, calculated within question. For the averageexplanation, 61% of explanation facts are labeled ashaving a central role, while 19% are labeled as ground-ing, and the remaining 21% as lexical glue. For theaverage explanation containing 6 facts, approximately4 of these facts will (on average) be labeled as central,one will be labeled grounding, and one will be labeledas lexical glue.

a total of 902 questions for training, 214 for devel-opment, and 541 for test.2 The remaining questionsthat do not have gold explanation graphs requiredspecialized reasoning (e.g. spatial or mathemati-cal reasoning) that did not easily lend itself to themethod of textual explanation used in constructingthis corpus.

Each explanation is represented as a reference toone or more facts in a semi-structured knowledgebase of tables (the “tablestore”). The tablestorecontains 62 tables, each organized around a particu-lar kind of knowledge (e.g. taxonomic knowledge,part-of knowledge, properties, changes, causality,coupled relationships, etc.) developed through adata-driven analysis of understanding the needs

2A new version of the WorldTree corpus that substantiallyexpands the size of the dataset is anticipated shortly.

of elementary science explanations (Jansen et al.,2016). Each “fact” is represented as one row in agiven table, and can be used as either structured orunstructured text. As a structured representation,each table row represents an n-ary relation whoserelational roles are afforded by the columns in eachtable. As an unstructured representation, each tableincludes “filler” columns that allow each row to beread off as a human-readable plain text sentence, al-lowing the same data to be used for both structuredand unstructured techniques.

The WorldTree tablestore contains 4,950 tablerows/facts, 3,686 of which are actively used in atleast one explanation. Explanation graphs com-monly reuse the same knowledge (i.e. the sametable row) used in other explanations. The mostcommon fact (“an animal is a kind of organism”)is used in 89 different explanations, and approxi-mately 1,500 facts are reused in more than one ex-planation. Explanations were designed to includedetailed world knowledge with the goal of being“meaningful to a 5 year old child”, and range fromhaving a single fact to over 16 facts, with the dis-tribution of explanation lengths shown in Figure 4.More details, analyses, and summary statistics areincluded in Jansen et al. (2018).

For each explanation, the WorldTree corpus alsoincludes annotation for how important each factis towards the explanation. There are three cate-gories of importance, with their distribution withinquestions shown in Figure 5:

CENTRAL: These facts are at the core of the ex-planation, and are often core scientific con-cepts in elementary science. For example, ina question primarily testing a knowledge ofchanges of states of matter, “melting meanschanging from a solid to a liquid by addingheat energy” would be considered having acentral role.

GROUNDING: These facts tend to link core sci-entific facts in the explanation with specificexamples found in the question. For example,a question might require reasoning about icecubes, butter, ice cream, or other solids melt-ing. These facts (e.g. “ice is a kind of solid”)are considered as having a grounding role.

LEXICAL GLUE: The explanation graphs inWorldTree require each explanation sentenceto be explicitly linked to either the question

68

text, answer text, or other explanation sen-tences based on lexical overlap (i.e. two sen-tences having one or more shared words).Facts with a “lexical glue” role tend to ex-press synonymy or short definitional relation-ships, potentially between short multi-wordexpressions, such as “adding heat means in-creasing heat energy”. These are used tobridge two facts in an explanation togetherwhen the facts use different words for simi-lar concepts (e.g. one fact refers to “addingheat”, while another fact refers to “increasingheat energy”). These facts are important forcomputational explanations, allowing explicitlinking between referents, but would likelybe considered excess detail when deliveringexplanations to most human users.

This explanatory role annotation makes it possi-ble to separately evaluate how many central facts,grounding facts, and lexical glue (or synonymy) re-lations that a given inference method reconstructs.For two algorithms with similar performance, thisallows determining whether one primarily recon-structs more of the core “central” facts, making itmore likely to be useful to a human user.

5 Shared Task Online Competition Setup

Similar to our previous experience in shared taskorganization (Panchenko et al., 2018), we used theCodaLab platform for running the competition on-line.3 For the convenience of the participants, theshared task was divided into two phases. In thePractice phase which began on May 20, 2019, wereleased the participant kit that included the fulltraining and development datasets along with thePython code of the scoring program used in thecompetition and the Scala code for the tf.idf base-line.4 In the Evaluation phase held from July 12till August 9, 2019, we provided the participantswith a masked version of the test set the rankingsof which were shuffled randomly. The actual testset was stored on CodaLab and was not availableto the participants, who had to upload their ownrankings to receive the MAP value computed onCodaLab. Each team was limited to 15 trials; onlyone result could be published on the leaderboard.

3https://competitions.codalab.org/competitions/20150

4https://github.com/umanlp/tg2019task

PerformanceTeam (MAP)

ChainsOfReasoning (COR) 0.563pbanerj6 (ASU) 0.413Red Dragon AI (RDAI) 0.402 0.477*jenlindadsouza (JS) 0.394Baseline (tf.idf) 0.296

Table 1: The leaderboard performance of the submit-ted systems for the explanation regeneration task onthe held out test set. (* denotes that the team ulti-mately achieved higher performance post-deadline, anddescribes this additional in their system description pa-per.)

6 System Descriptions and Performance

The shared task received public entries from 4 par-ticipating teams, with the performance of their sys-tems shown in Table 1. In this section we brieflydescribe these systems.

Baseline. A term frequency model that uses atf.idf weighting scheme (e.g. Ch. 6, Manning et al.,2008) to determine lexical overlap between eachrow in the knowledge base with the question andanswer text. For each row, the cosine similarity be-tween a vectors representing the question text androw text is calculated, and this process is repeatedfor the answer text. These two cosine similaritiesserve as features to an SVMrank ranking classifier(Joachims, 2006),5 which, for a given question, pro-duces a ranked list of rows in the knowledge basemost likely to be part of the gold explanation forthat question.

Model 1 (JS). The system by D’Souza et al.(2019) performs explanation regeneration first byidentifying facts that have high matches with ques-tions using a set of overlap criteria, then by ensur-ing this set of initial facts can meaningfully pairtogether using a set of coherency criteria. Overlapcriteria are evaluated using ConceptNet conceptsand triples (Liu and Singh, 2004), FrameNet predi-cates and arguments (Baker et al., 1998), OpenIEtriples (Angeli et al., 2015), as well as lexical fea-tures such as words and lemmas. This results in 76feature categories that are ranked using SVMrank.An error analysis identified 11 common and clearcategories of errors that were addressed by rerank-ing candidate rows using a series of hand-craftedrules, such as “if an explanation sentence contains

5http://svmlight.joachims.org/

69

a named entity that is not found in the question oranswer, reduce its rank”. This rule-based rerankingresulted in a large 5% performance boost to themodel.

Model 2 (ASU). The system by Banerjee (2019)models explanation regeneration using a re-rankingparadigm, where a first model is used to providean initial ranking, and the top-N facts ranked bythat system are re-ranked to improve overall per-formance. Initial ranking was explored using bothBERT (Devlin et al., 2019) and XLNet (Yang et al.,2019) transformer models, fine-tuned on the super-vised explanation reconstruction data provided bythe training set. Experiments showed that initialranking performance was improved when trainedwith additional contextual information, in the formof including parts of gold explanations with ques-tion text when training the row relevance rankingtask. The reranking procedure involved evaluatingboth relevance and cosine similarity between expla-nation rows in a shortlist of top ranked rows, wherea shortlist size of N=15 demonstrated maximumreranking performance.

Model 3 (RDAI). This series of systems by Chaiet al. (2019) explores fine-tuned variations of tf.idfand BERT-based models. A BERT model is aug-mented with a regression module trained to predictthe relevance score for each (question text, explana-tion row) pair, where this relevance score is calcu-lated using an improved tf.idf method. Due to thecompute time required, the model is used to rerankthe top 64 predictions made by the tf.idf module.

Model 4 (COR). This best-performing systemby Das et al. (2019) presents two models: a BERTbaseline that ranks individual facts, and a BERTmodel that ranks paths of facts. Where other sub-missions used BERT as a reranking model, herethe BERT baseline is used to rank the entire setof facts in the knowledge base, increasing perfor-mance to 0.56 MAP on the development set. Thisteam observed that for 76% of questions, all theremaining facts in the explanation are within 1-hop of the top 25 candidates returned by a tf.idfmodel. They then construct a path ranking model,where a BERT model is trained with valid shortchains of valid multi-hop facts from the top 25candidates. Because of the large number of possi-ble permutations of multi-fact combinations, thecomputational requirements of this chain model aresignificantly higher, and due to this limitation the

chain model was evaluated only using the top 25 ortop 50 candidates. While this path ranking modelslightly underperformed the BERT baseline, it didso while substantially undersampling the space ofpossible starting points for chains of reading (top25 candidate facts vs all 4,950 facts). The teamthen show how an ensemble method that uses thepath ranking model for high-confidence cases, andthe BERT baseline for low-confidence cases canachieve higher performance than either model in-dependently.

7 Extended Evaluation and Analysis

The annotation in the WorldTree corpus and itssupporting structured knowledge base allows per-forming detailed automated and semi-automatedcharacterizations of model performance. To helpassess each model’s capacity to perform multi-hopinference, we perform an evaluation of model per-formance using lexical overlap between questionsand facts as a means of determining the necessity ofrequiring multiple hops to find and preferentiallyrank a given fact. To mitigate issues with fully-automated evaluations of explanation regenerationperformance, we also include a manual evaluationof the relevance of highly-ranked facts in Table 3.We include additional automated characterizationsof performance in Table 4.

7.1 Performance by Lexical Overlap /Multiple Hops

Ostensibly the easiest explanatory facts for manymodels to locate are those that contain a large num-ber of shared words with the question and/or an-swer text,6 while those with only a single sharedword can be difficult or improbable to locate(Jansen, 2018). Those explanatory facts that donot contain shared words with the question or an-swer require multi-hop methods to locate, travers-ing from question text through one or more otherexplanatory facts before ultimately being identified.This distinction is shown in Figure 6.

Breaking down performance by the amount oflexical overlap (shared words) with the questionand/or answer helps characterize how well a givenmodel is performing at the multi-hop inferencetask. A model particularly able to retrieve factswith a high amount of lexical overlap may show

6Jansen (2018) empirically demonstrated that sentencescontaining 2 or more shared words with the question and/oranswer text can have an extremely high chance performanceat being retrieved.

70

Questions Baseline TeamMetric N tf.idf JS ASU RDAI COREvaluating overlap considering only nouns, verbs, adjectives, and adverbs:

(1-hop) Rows with 2 or more shared words with Q/A 541 0.44 0.55 0.59 0.64 0.68(1-hop) Rows with 1 shared word with Q/A 275 0.12 0.21 0.36 0.30 0.48(2+ hop) Rows without shared words with Q/A 88 0.00 0.13 0.20 0.15 0.31

Evaluating overlap without filtering (all words considered):(1-hop) Rows with 2 or more shared words with Q/A 541 0.35 0.45 0.50 0.54 0.61(1-hop) Rows with 1 shared word with Q/A 275 0.18 0.29 0.39 0.34 0.50(2+ hop) Rows without shared words with Q/A 88 0.00 0.12 0.24 0.19 0.35

Table 2: Explanation reconstruction performance broken down by the level of lexical overlap a given fact has withthe question and/or answer. 1-hop refers to facts that have at least one shared word with the question or answer.2+ hops refers to facts that do not have lexical overlap with question or answer text, and must be traversed to fromthe question text through other facts. Results across all models show that performance at finding facts generallydecreases as the proportion of lexical overlap between the question text and a given fact decreases. Performancereflects mean average precision on the explanation regeneration task. Note that average performance in this analysisis normalized by the number of questions a given criterion applies to (N), and not the total number of questions inthe evaluation corpus, and as such may vary from lexical overlap results reported in participant papers.

a large overall performance in explanation recon-struction, but be poor at performing multi-hop in-ference. Similarly, a model particularly able to per-form multi-hop inference without a strong retrievalcomponent may have its multi-hop performancemasked by an overall low score at the multi-hopinference task. Performance on identifying factsthat do not have lexical overlap with the questionor answer is a strong indicator of multi-hop infer-ence performance, as these facts can only be foundthrough indirect means, such as “hopping” to otherintermediate facts between them and the questionor answer text.

Model performance broken down by explanationrows that contain lexical overlap with question oranswer terms is included in Table 2. Here, lexicaloverlap is assessed by the intersection of the set oflemmas in both question and answer text, versus theset of words in a given table row. This means thatmultiple mentions of the same word, or words thatreduce to the same lemma, are considered only asingle word of overlap. For example, if the questionand answer contained three occurrences of the word“organisms”, and a given table row also containedtwo occurrences of “organism”, this would stillonly count as one word of lexical overlap betweenquestion and row text.

Table 2 shows that for all models submitted tothe shared task, the largest contributor to modelperformance is from locating explanation sentencesthat have 2 or more shared words with the questionor answer. Similarly, models also derive moderateperformance from locating explanation sentences

that contain only a single word of overlap betweenquestion and answer. All models show their lowestperformance on locating gold explanation facts thatdo not contain lexical overlap with the question oranswer, ranging from a MAP of nearly zero (for thetf.idf model, which exclusively uses lexical overlapto rank explanation sentences), to a MAP of upto one half of a given model’s “2+ shared word”performance, depending on whether only contentlemmas (nouns, verbs, adjectives, and adverbs) orall lemmas are considered for lexical overlap.

Recent work has demonstrated that it is possiblefor models to achieve high performance on multi-hop datasets without performing multi-hop infer-ence (Chen and Durrett, 2019; Min et al., 2019),highlighting the need to directly instrument multi-hop performance versus overall performance togauge progress on this challenging task. The eval-uation in Table 2 shows that higher overall expla-nation regeneration performance does not neces-sarily imply better multi-hop performance. Thebest-performing model achieves a MAP of 0.35on ranking 2+ hop facts, up from the negligible2+ hop performance of the baseline model. Whilethis 2+ hop performance is low in an absolute sense,it represents a substantial improvement in the state-of-the-art on this dataset.

It is important to note that examining the per-formance on facts without lexical overlap is nota complete assessment of multi-hop performance.Indeed, it is common for certain clusters of facts tocontain lexical overlap not only with the questionand answer, but also with each other. Identify-

71

QuestionRecycling newspapers is good for the

environment because it:Answer: helps conserve resources.

Gold Explanation Sentences that share2 or more words with Q or A

1. Recycling resources has a positive impact on theenvironment and the conservation of those resources.

Gold Explanation Sentences that share1 word with Q or A

2. A newspaper is made of paper.3. Trees are a kind of resource.4. “To be good for” means “to have a positive impact on”.

Gold Explanation Sentences thatdo not share words with Q or A

5. Trees are a source of paper.

Figure 6: Example explanation sentences with differ-ent degrees of lexical overlap with the question/answer.Top/Middle: gold explanation sentences that have twoor more (top) or exactly one (middle) shared wordswith the question or answer (bolded). Bottom: goldexplanation sentences that do not have shared wordswith the question or answer, and are only connectedbased on shared words with other explanation sen-tences (underlined).

ing this inter-fact cohesion to successfully locatethese clusters of explanatory facts is still a formof multi-hop inference, as it requires integratingknowledge from more than one fact – even if eachof those facts contains strong retrieval cues such aslexical overlap with question text. As such, assess-ing performance on facts without lexical overlapwith question text is only one method of assessingmulti-hop performance on particularly challengingmulti-hop problems, and not a complete characteri-zation of multi-hop performance.

7.2 Manual Evaluation of ExplanationQuality

For each question in the WorldTree corpus, an an-notator has provided a set of gold facts that providea detailed explanation for why the answer is correct.While this enables supervised training and fully au-tomatic evaluation of explanation generation, theexplanation annotation is non-exhaustive – that is,it is possible for there to be facts in the knowledgebase that may be relevant to building an explanationfor a given question, but that are not included in thegold explanation. This is a pragmatic limitation ofthe ability to perform entirely automated evaluationusing this dataset, as there are often multiple (poten-

tially overlapping) ways of building an explanationfor the answer to a question. As a result of thislimitation, rows ranked highly by some algorithmsmay be genuinely useful for building explanations,but would be marked incorrect by the automatedevaluation, under-estimating performance in somecircumstances. Performing a small-scale manualevaluation of explanation quality at regular mile-stones helps provide a balance between speed ofevaluation during model development, and accu-racy in model characterization. To address thisneed in evaluation accuracy, we performed a man-ual characterization of model performance for eachof the 4 shared task model submissions, as well asthe baseline model.

We performed a manual evaluation of fact rel-evance for all facts ranked within the top 20 foreach model on 14 randomly selected questions7 inthe held-out test set. This resulted in 758 manualevaluations of fact relevance. For a given question,all facts ranked in the top 20 across each modelwere pooled into a single list such that the annota-tor was blind to which model(s) selected them. Thefacts were ranked on a 4 point scale: (1) Gold, (2)Highly Relevant facts that could appear in a goldexplanation, (3) Possibly Relevant facts generallyon broadly similar topics to the question or entitiesin the question, and (4) facts that are Not Relevantto the question.8 Examples of these ratings can befound in Figure 7.

The results of this manual analysis are shownin Table 3, presented as proportions of the top-Nranked rows for each model. In Table 3, the propor-tion marked gold is equivalent to the Precision@Nmetric in Table 4, but measured using 14 ques-tions instead of the entire test set. Using this as agauge of accuracy, we observe that there is gener-ally strong agreement between the Precision@Nvalues in this sample of 14 questions as to the entiretest set, with values generally within a few percent-age points9. This manual evaluation shows that

7The 14 questions selected for manual evaluation were theinitial questions in the test set.

8One of the challenges with such a rating system is itssubjectivity. Different explanations can be written for thesame question that may contain many of the same facts, orlargely different facts. It is also possible that a very detailedexplanation that includes a large amount of world knowledgemight include more “possibly relevant/topical” facts than a lessdetailed, more high-level explanation. The methodologicalissues with manually evaluating explanation quality are left tofuture work.

9Notable exceptions are the manual evaluations of the Top-5 values for the ASU and RDAI models, which can vary by

72

Baseline TeamManual Rating tf.idf JS ASU RDAI CORTop 5 Ranked Rows

Marked Gold 27% 32% 46% 36% 49%Highly Relevant 19% 22% 26% 27% 20%Possibly Relevant or Topical 40% 25% 16% 30% 24%Not Relevant 14% 22% 13% 7% 7%Manual Relevance@5 (Gold+HR) 46% 54% 72% 63% 79%

Top 10 Ranked RowsMarked Gold 17% 23% 31% 29% 34%Highly Relevant 15% 15% 21% 26% 17%Possibly Relevant or Topical 45% 34% 25% 30% 33%Not Relevant 33% 29% 22% 15% 16%Manual Relevance@10 (Gold+HR) 32% 38% 52% 55% 51%

Top 20 Ranked RowsMarked Gold 11% 14% 18% 18% 20%Highly Relevant 13% 12% 18% 23% 19%Possibly Relevant or Topical 40% 39% 29% 38% 40%Not Relevant 36% 35% 35% 22% 21%Manual Relevance@20 (Gold+HR) 24% 26% 36% 41% 39%

Table 3: Manually rated relevance judgements for the top 5, top 10, and top 20 ranked rows, across 14 randomlysampled questions. Results at top 20 reflect 758 manual judgements. “Marked gold” performance is equivalent toPrecision@N performance in Table 4 with 14 samples instead of 541. Results show that generally between 12%and 27% of top-ranked facts that are not marked as gold may also be highly relevant to building an explanationfor a given question. This adjusted performance, summing both gold and manually-rated highly relevant facts, isprovided as “Manual Relevance@N”.

QuestionQ: Many grass snakes are green. The color of the snake most likely helps it to:

(A) climb tall trees(B) fit into small spaces.

(*C) hide when threatened(D) shed its skin

Manual Relevance Rating Example RowGold Camouflage is used for protection/hiding by prey from predators.Highly Relevant Camouflage is a kind of adaptation for hiding in an environment.Possibly Relevant/Topical Eyes can sense light energy for seeing.Not Relevant Many animals are herbivores.

Figure 7: Example manual ratings of fact relevance for the explanation reconstruction task. “Gold” ratings areautomatically determined by whether a fact is marked as gold in the explanation for a given question. For factsnot marked as gold, manual ratings of “Highly Relevant”, “Possibly Relevant/Topical”, and “Not Relevant” wereadded.

across all models, between 12% and 27% of themost highly ranked facts may also be highly rel-evant to building an explanation for the question,but are not included as part of the gold explanation.All in all, this highlights the importance of treat-ing gold explanation annotation as a minimum set

as much as ±7% from the full test set. All Top-20 values arewithin 1% of the full test sample in Table 4

of facts for assessing coverage and completeness,but that assessing relevance of highly ranked factsis still best accomplished by including at least amodest manual evaluation.

7.3 Additional Performance Evaluation

Additional automatic performance evaluations areshown in Table 4, which includes evaluation by

73

Questions Baseline TeamMetric N tf.idf JS ASU RDAI CORMean Average Precision (MAP)

MAP 541 0.30 0.40 0.42 0.48 0.57

MAP by Explanatory RoleCENTRAL rows 531 0.34 0.49 0.42 0.58 0.59GROUNDING rows 347 0.19 0.18 0.17 0.23 0.37LEXICALGLUE rows 382 0.07 0.12 0.31 0.18 0.32

MAP by Table Knowledge TypesRetrieval tables 541 0.31 0.43 0.43 0.46 0.54Inference-supporting tables 541 0.26 0.33 0.39 0.42 0.46Complex inference tables 541 0.20 0.15 0.28 0.31 0.33

Precision@NPrecision@1 541 55% 67% 63% 79% 79%Precision@2 541 43% 49% 47% 63% 69%Precision@3 541 36% 44% 44% 54% 59%Precision@4 541 31% 37% 41% 48% 53%Precision@5 541 27% 32% 39% 43% 48%Precision@10 541 17% 21% 27% 29% 34%Precision@20 541 11% 13% 18% 18% 21%

Table 4: Explanation reconstruction performance broken down by the explanatory role of facts, table knowledgetypes, and using a Precision@N metric. Note that average performance in this analysis is normalized by the numberof questions a given criterion applies to N, and not the total number of questions in the evaluation corpus, and assuch may vary from lexical overlap results reported in participant papers. Excluding a small number of missingrow references also causes a slight performance increase in some models in the second or third decimal placecompared to official leaderboard results.

explanatory role, table knowledge category, as wellas evaluations of model performance using Preci-sion@N that serves as an automated comparisonto the manual relevance evaluation in the previoussection.

7.3.1 Performance by Explanatory RolesTable 4 shows performance broken down by theexplanatory role of the facts being analysed. Here,each model creates a ranked list of facts, and thefacts in a given question’s gold explanation thatdo not match a filter (either CENTRAL, GROUND-ING, or LEXICAL GLUE) are removed both fromthe gold explanation and from the ranked list. Meanaverage precision is then calculated as normal. Thisevaluation allows assessing how well each modelis able to find facts that provide different roles inan explanation, and whether a given method fo-cuses on core or central scientific facts, or factsthat ground entities in the question or answer tothose core facts. Similarly, it allows assessingthe performance on the “lexical glue” facts thathelp bridge explanation facts together in compu-tational explanations through synonymy relations,

but would likely be considered overly verbose orunnecessary when read by an adult human.

The evaluation shows that all models generallyperform best on identifying core or central facts,while ranking grounding facts less highly. “Lex-ical glue” facts that serve as the connecting gluebetween facts that use different words to describethe same concept showed the highest variation inperformance, with the baseline model nearly ignor-ing these facts, while two teams rank these nearlyas high as the best performance on the groundingfacts.

7.3.2 Performance by Table KnowledgeTypes

Tables can express very different kinds of knowl-edge, with varying levels of complexity and roles inan explanation. The tables in the WorldTree corpusare broadly divided into three main types:

Retrieval Tables: Tables that generally supply tax-onomic (kind-of) or property knowledge.

Inference-Supporting Tables: Tables that in-clude knowledge of actions, object affordances,

74

uses of materials or devices, sources of things, re-quirements, or affect relationships.

Complex Inference Tables: Tables that includeknowledge of causality, processes, changes,coupled relationships, and if/then relationships.

Table 4 shows explanation reconstruction per-formance by table knowledge type. Generally forall models submitted to the shared task, perfor-mance for retrieval knowledge types was highest,followed by knowledge from inference supportingtables. Knowledge from complex inference tableswas the most challenging for models to find andpreferentially rank.

8 Conclusion

The TextGraphs 2019 Shared Task on Multi-HopInference for Explanation Regeneration receivedfour team submissions that exceeded the perfor-mance of the baseline system. The systems useda variety of methods from additional knowledgeresources (such as ConceptNet or FrameNet) todirectly training language models to perform multi-hop inference by predicting chains of facts. Thetop-performing system increased baseline perfor-mance by nearly a factor of two on this task, achiev-ing a new state-of-the-art.

Multi-hop performance. In spite of theseachievements, our extended analysis shows thatthe performance on the most challenging multi-hopinference problems – those facts that do not havelexical overlap with question or answer and must bereached by traversing indirectly through other facts– is still low. Though the bulk of the performanceof these systems clusters around identifying factsthat have large amounts of lexical overlap with thequestion or answer (i.e. 2 or more facts), we haveseen how these easier-to-locate facts can serve as aspring board to launch more targeted searches forother facts in the explanation.

Language models and training data. The high-est performing systems in this shared task madeuse of language models, which have repeatedlydemonstrated record-breaking performance on awide range of language classification tasks in re-cent years. These language models tend to havelarge requirements for supervised training data thatare difficult to meet in cases where large-scale man-ual annotation is required, such as in construct-ing detailed explanations containing world knowl-

edge. The WorldTree explanation corpus providesa unique resource of large, many-fact structured ex-planations to train the multi-hop inference task, butthe manual generation of these explanations meansthe corpus (at approximately 1.6k explanations) isorders of magnitude smaller than resources that aretraditionally used to train language models. In spiteof this, teams on this shared task have proposedmethods to address training relevance judgementswith this scale of data, and achieved state-of-the-art results. While it is unlikely that large-scalesupervised structured training data resources willsoon become available to test the ultimate limits oflanguage models for explanation generation, the re-sults of this shared task naturally pose the questionof whether statistical methods will continue to ex-ceed structured knowledge base approaches to ex-planation generation (using resources such as Con-ceptNet and FrameNet), particularly as the fieldturns to investigating common sense knowledgeand other world-knowledge-centered approachesto inference.

Explanation Regeneration as a proxy task formulti-hop inference models. While explanationregeneration does not require a model to find acorrect answer to a question, it does help distillthe problem of multi-hop inference to center onthe task of combining multiple facts together inmeaningful ways to support explainable inference.Explanation-centered inference and interpretablemachine learning currently take on a variety offorms, from using representations amenable to hu-man explanation for the inference process (e.g.,Swartout et al., 1991; Abujabal et al., 2017; Liet al., 2018), to using black-box systems to arriveat answers that are later mapped onto other, moreexplainable models (e.g., Ribeiro et al., 2016). Byfocusing on the task of meaningfully combiningmultiple facts to build explanations, our hope is thatexplanation regeneration can serve as a stepping-stone task toward complex inference systems capa-ble of building long chains of inference that bothautomatically answer questions while providing de-tailed human-readable explanations for why theirreasoning is correct.

9 Acknowledgements

The organizers wish to express their thanks to allshared task teams for their participation. We thankElizabeth Wainwright and Steven Marmorstein forcontributions to the WorldTree explanation corpus,

75

who were funded by the Allen Institute for Ar-tificial Intelligence (AI2). Peter Jansen’s workon the shared task was supported by NationalScience Foundation (NSF Award #1815948, “Ex-plainable Natural Language Inference”). DmitryUstalov’s work on the shared task at the Universityof Mannheim was supported by the Deutsche For-schungsgemeinschaft (DFG) foundation under the“JOIN-T” project.

ReferencesAbdalghani Abujabal, Rishiraj Saha Roy, Mohamed

Yahya, and Gerhard Weikum. 2017. QUINT: In-terpretable Question Answering over KnowledgeBases. In Proceedings of the 2017 Conference onEmpirical Methods in Natural Language Processing:System Demonstrations, EMNLP 2017, pages 61–66, Copenhagen, Denmark. Association for Compu-tational Linguistics.

Gabor Angeli, Melvin Jose Johnson Premkumar, andChristopher D. Manning. 2015. Leveraging Lin-guistic Structure For Open Domain Information Ex-traction. In Proceedings of the 53rd Annual Meet-ing of the Association for Computational Linguisticsand the 7th International Joint Conference on Natu-ral Language Processing (Volume 1: Long Papers),ACL-IJCNLP 2015, pages 344–354, Beijing, China.Association for Computational Linguistics.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe.1998. The Berkeley FrameNet Project. In Pro-ceedings of the 36th Annual Meeting of the Asso-ciation for Computational Linguistics and 17th In-ternational Conference on Computational Linguis-tics - Volume 1, ACL ’98/COLING ’98, pages 86–90, Montreal, QC, Canada. Association for Compu-tational Linguistics.

Pratyay Banerjee. 2019. TextGraphs-2019 SharedTask: Explanation ReGeneration using LanguageModels and Iterative Re-Ranking. In Proceedingsof the Thirteenth Workshop on Graph-Based Meth-ods for Natural Language Processing, TextGraphs-13, Hong Kong. Association for Computational Lin-guistics.

Yew Ken Chai, Sam Witteveen, and Martin Andrews.2019. Language Model Assisted Explanation Gen-eration TextGraphs-13 Shared Task System Descrip-tion. In Proceedings of the Thirteenth Workshop onGraph-Based Methods for Natural Language Pro-cessing, TextGraphs-13, Hong Kong. Associationfor Computational Linguistics.

Jifan Chen and Greg Durrett. 2019. UnderstandingDataset Design Choices for Multi-hop Reasoning.In Proceedings of the 2019 Conference of the NorthAmerican Chapter of the Association for Compu-tational Linguistics: Human Language Technolo-gies, Volume 1 (Long and Short Papers), NAACL-

HLT 2019, pages 4026–4032, Minneapolis, MN,USA. Association for Computational Linguistics.

Christopher Clark and Matt Gardner. 2018. Simpleand Effective Multi-Paragraph Reading Comprehen-sion. In Proceedings of the 56th Annual Meeting ofthe Association for Computational Linguistics (Vol-ume 1: Long Papers), ACL 2018, pages 845–855,Melbourne, VIC, Australia. Association for Compu-tational Linguistics.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,Ashish Sabharwal, Carissa Schoenick, and OyvindTafjord. 2018. Think you have Solved Question An-swering? Try ARC, the AI2 Reasoning Challenge.arXiv:1803.05457.

Peter Clark et al. 2019. From ‘F’ to ‘A’ on the N.Y.Regents Science Exams: An Overview of the AristoProject. arXiv:1909.01958.

Rajarshi Das, Ameya Godbole, Manzil Zaheer, She-hzaad Dhuliawala, and Andrew McCallum. 2019.Reasoning over Chains of Facts for ExplainableMulti-hop Inference. In Proceedings of the Thir-teenth Workshop on Graph-Based Methods for Nat-ural Language Processing, TextGraphs-13, HongKong. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. BERT: Pre-training ofDeep Bidirectional Transformers for Language Un-derstanding. In Proceedings of the 2019 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics: Human Lan-guage Technologies, Volume 1 (Long and Short Pa-pers), NAACL-HLT 2019, pages 4171–4186, Min-neapolis, MN, USA. Association for ComputationalLinguistics.

Jennifer D’Souza, Isaiah Onando Mulang’, and SorenAuer. 2019. Team SVMrank: Leveraging Feature-rich Support Vector Machines for Ranking Expla-nations to Elementary Science Questions. In Pro-ceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing,TextGraphs-13, Hong Kong. Association for Com-putational Linguistics.

Daniel Fried, Peter Jansen, Gustave Hahn-Powell, Mi-hai Surdeanu, and Peter Clark. 2015. Higher-orderLexical Semantic Models for Non-factoid AnswerReranking. Transactions of the Association for Com-putational Linguistics, 3:197–210.

Peter Jansen. 2018. Multi-hop Inference for Sentence-level TextGraphs: How Challenging is MeaningfullyCombining Information for Science Question An-swering? In Proceedings of the Twelfth Workshopon Graph-Based Methods for Natural LanguageProcessing, TextGraphs-12, pages 12–17, New Or-leans, LA, USA. Association for Computational Lin-guistics.

76

Peter Jansen, Niranjan Balasubramanian, Mihai Sur-deanu, and Peter Clark. 2016. What’s in an Explana-tion? Characterizing Knowledge and Inference Re-quirements for Elementary Science Exams. In Pro-ceedings of COLING 2016, the 26th InternationalConference on Computational Linguistics: Techni-cal Papers, COLING 2016, pages 2956–2965, Os-aka, Japan. The COLING 2016 Organizing Commit-tee.

Peter Jansen, Rebecca Sharp, Mihai Surdeanu, and Pe-ter Clark. 2017. Framing QA as Building and Rank-ing Intersentence Answer Justifications. Computa-tional Linguistics, 43(2):407–449.

Peter Jansen, Elizabeth Wainwright, Steven Mar-morstein, and Clayton Morrison. 2018. WorldTree:A Corpus of Explanation Graphs for Elemen-tary Science Questions supporting Multi-hop In-ference. In Proceedings of the Eleventh Interna-tional Conference on Language Resources and Eval-uation, LREC 2018, pages 2732–2740, Miyazaki,Japan. European Language Resources Association(ELRA).

Thorsten Joachims. 2006. Training Linear SVMs inLinear Time. In Proceedings of the 12th ACMSIGKDD International Conference on KnowledgeDiscovery and Data Mining, KDD ’06, pages 217–226, New York, NY, USA. ACM.

Daniel Khashabi, Erfan Sadeqi Azer, Tushar Khot,Ashish Sabharwal, and Dan Roth. 2019. On the Ca-pabilities and Limitations of Reasoning for NaturalLanguage Understanding. arXiv:1901.02522.

Qing Li, Qingyi Tao, Shafiq Joty, Jianfei Cai, and JieboLuo. 2018. VQA-E: Explaining, Elaborating, andEnhancing Your Answers for Visual Questions. InComputer Vision – ECCV 2018, 15th European Con-ference, Munich, Germany, September 8–14, 2018,Proceedings, Part VII, ECCV 2018, pages 570–586,Cham, Switzerland. Springer International Publish-ing.

Hugo Liu and Push Singh. 2004. ConceptNet — APractical Commonsense Reasoning Tool-Kit. BTTechnology Journal, 22(4):211–226.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,Luke Zettlemoyer, and Veselin Stoyanov. 2019.RoBERTa: A Robustly Optimized BERT Pretrain-ing Approach. arXiv:1907.11692.

Christopher D. Manning, Prabhakar Raghavan, andHinrich Schutze. 2008. Introduction to InformationRetrieval. Cambridge University Press, New York,NY, USA.

Sewon Min, Eric Wallace, Sameer Singh, Matt Gard-ner, Hannaneh Hajishirzi, and Luke Zettlemoyer.2019. Compositional Questions Do Not Necessi-tate Multi-hop Reasoning. In Proceedings of the57th Annual Meeting of the Association for Compu-tational Linguistics, ACL 2019, pages 4249–4257,

Florence, Italy. Association for Computational Lin-guistics.

Boyuan Pan, Hao Li, Zhou Zhao, Bin Cao, Deng Cai,and Xiaofei He. 2017. MEMEN: Multi-layer Em-bedding with Memory Networks for Machine Com-prehension. arXiv:1707.09098.

Alexander Panchenko, Anastasia Lopukhina, DmitryUstalov, Konstantin Lopukhin, Nikolay Arefyev,Alexey Leontyev, and Natalia Loukachevitch. 2018.RUSSE’2018: A Shared Task on Word Sense In-duction for the Russian Language. In Computa-tional Linguistics and Intellectual Technologies: Pa-pers from the Annual International Conference “Di-alogue”, pages 547–564, Moscow, Russia. RSUH.

Marco Tulio Ribeiro, Sameer Singh, and CarlosGuestrin. 2016. “Why Should I Trust You?”: Ex-plaining the Predictions of Any Classifier. In Pro-ceedings of the 22Nd ACM SIGKDD InternationalConference on Knowledge Discovery and Data Min-ing, KDD ’16, pages 1135–1144, San Francisco,CA, USA. ACM.

Roger C. Schank and Robert P. Abelson. 1975. Scripts,Plans, and Knowledge. In Proceedings of the FourthInternational Joint Conference on Artificial Intelli-gence, IJCAI-75, pages 151–157, Tbilisi, Georgia,USSR. IJCAI Organization.

William Swartout, Cecile Paris, and Johanna Moore.1991. Explanations in Knowledge Systems: De-sign for Explainable Expert Systems. IEEE Expert,6(3):58–64.

Johannes Welbl, Pontus Stenetorp, and SebastianRiedel. 2018. Constructing Datasets for Multi-hopReading Comprehension Across Documents. Trans-actions of the Association for Computational Lin-guistics, 6:287–302.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-bonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019.XLNet: Generalized Autoregressive Pretraining forLanguage Understanding. arXiv:1906.08237.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio,William Cohen, Ruslan Salakhutdinov, and Christo-pher D. Manning. 2018. HOTPOTQA: A Dataset forDiverse, Explainable Multi-hop Question Answer-ing. In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processing,EMNLP 2018, pages 2369–2380, Brussels, Belgium.Association for Computational Linguistics.

77


ASU at TextGraphs 2019 Shared Task: Explanation ReGeneration usingLanguage Models and Iterative Re-Ranking

Pratyay BanerjeeSchool of Computing, Informatics, and Decision Systems Engineering, Arizona State University

[email protected]

Abstract

In this work we describe the system fromNatural Language Processing group at Ari-zona State University for the TextGraphs 2019Shared Task. The task focuses on Expla-nation Regeneration, an intermediate step to-wards general multi-hop inference on largegraphs. Our approach consists of modeling theexplanation regeneration task as a learning torank problem, for which we use state-of-the-art language models and explore dataset prepa-ration techniques. We utilize an iterative re-ranking based approach to further improve therankings. Our system secured 2nd rank in thetask with a mean average precision (MAP) of41.3% on the test set.

1 Introduction

Question Answering in natural language often re-quires deeper linguistic understanding and reason-ing over multiple sentences. For complex ques-tions, it is very unlikely to build or have a knowl-edge corpora that contains a single sentence an-swer to all the questions from which a modelcan simply cherrypick. The knowledge requiredto answer a question may be spread over multi-ple passages (Rajpurkar et al., 2016; Lai et al.,2017). Such complex reasoning requires systemsto perform multi-hop inference, where they needto combine more than one piece of information(Welbl et al., 2018).

In this shared task of Explanation ReGenera-tion, the systems need to perform multi-hop infer-ence and rank a set of explanatory facts for a givenelementary science question and correct answerpair. An example is shown in Table 1. The taskprovides a new corpora of close to 5000 explana-tions, and a set of gold explanations for each ques-tion and correct answer pair (Jansen et al., 2018).The example highlights an instance for this task,where systems need to perform multi-hop infer-ence to combine diverse information and identify

Question: Which of the following is an exam-ple of an organism taking in nutrients?(A) a dog burying a bone (B) a girl eating anapple (C) an insect crawling on a leaf (D) aboy planting tomatoesGold Explanation Facts:A girl means a human girl. : GroundingHumans are living organisms. : GroundingEating is when an organism takes in nutrientsin the form of food. : CentralFruits are kinds of foods. : GroundingAn apple is a kind of fruit. : GroundingIrrelevant Explanation Facts:Some flowers become fruits.Fruit contains seeds.Organisms; living things live in their habitat;their homeConsumers eat other organisms

Table 1: An example of Explanation Regeneration

relevant explanation sentences needed to answerthe specific question.

Explanation ReGeneration is a challengingtask. This is due to the presence of other irrelevantsentences in the corpora with respect to the givenquestion, which have a good lexical and semanticoverlap (Jansen, 2018). Ideally, the explanationsneed to be in order, but for the sake of simplicity,ordering of the explanations are ignored.

In the dataset, to measure the performance ofthe system over different types of explanations, theexplanations are further categorized into classes.These classes differ in the importance of the expla-nation in explaining the correct answer. These cat-egories are Central, Grounding and Lexical Glue.Central facts are often core scientific facts rele-vant to answering the question. Grounding arefacts which connect to other core scientific facts,present in the explanation. Examples of Central

78

and Grounding facts are present in Table 2. Lexi-cal glue facts express synonymy or definitional re-lationships. An example of Lexical glue facts :“glowing means producing light”.

This paper describes a system developed by theNatural Language Processing group of ArizonaState University. We approach the task of expla-nation regeneration as a learning to rank (Burgeset al., 2005) problem. The system utilizes stateof the art neural language models (Devlin et al.,2019; Yang et al., 2019), finetunes them on theknowledge corpora and trains them to perform thetask of ranking using customized dataset prepara-tion techniques. We further improve on the rank-ing using an iterative re-ranking algorithm.

We make the following contributions in the pa-per: (a) We evaluate different ways for datasetpreparation to use neural language models for thetask of explanation generation; (b) We evaluatedifferent language models for ranking and analysetheir performance on the task; (c) We show howto use iterative re-ranking algorithm to further im-prove performance; (d) We also provide a detailedanalysis of the dataset.

In the following sections we first give anoverview of our system. We describe the individ-ual components of the system in Sections 3,4. Weevaluate each component on the provided valida-tion set and show the performance on the hiddentest set in Section 5. We conclude with a detailederror analysis and evaluating our model with rele-vant metrics in Section 6,7,8.

2 Approach

In recent years, several language models (Devlinet al., 2019; Yang et al., 2019; Peters et al., 2018)have shown considerable linguistic understand-ing and perform well in tasks requiring multi-hopreasoning such as question answering (Rajpurkaret al., 2016; Mihaylov et al., 2018; Khashabi et al.,2018; Lai et al., 2017), and document rankingtasks (Callan et al., 2009).

For the task of Explanation ReGeneration wechose BERT (Devlin et al., 2019) and XLNET(Yang et al., 2019), two state-of-the-art neurallanguage models and explore their effectivenessin capturing long inference chains and perform-ing multi-hop inference. BERT and XLNETare pretrained using Masked Language Modelling(MLM) and Probabilistic Masked Language Mod-elling (PMLM) respectively. These pretraining

tasks enables BERT and XLNET to understandthe dependencies between masked and unmaskedwords. This is needed to capture relevant con-cepts, words and entities linking between central,grounding and lexical glue facts. We finetunethese language models on the knowledge corporaof 5000 explanations using their respective lan-guage modelling tasks.

To rank the explanations, we learn the rele-vance of each explanation for a given questionand correct answer pair. We evaluate multipledataset preparation techniques for finetuning thelanguage models. We also evaluate different rele-vance learner models by attaching different kindsof classification and regression heads over the lan-guage models. From the relevance learner, we ob-tain the relevance scores and an initial ranking ofthe explanations. We perform further re-rankingusing a custom re-ranking algorithm similar to inBanerjee et al..

3 Dataset Preparation and RelevanceLearner

We prepare multiple datasets for the followingtasks. The preparation techniques are describedin the following sub-sections.

3.1 Language ModellingThe language models are initially finetuned on theExplanation Knowledge corpora using MLM andPMLM respectively. The dataset for this task isprepared using scripts from pytorch-transformerpackage. We prepare both MLM and PMLMdatasets and finetune the respective language mod-els. We follow the steps as mentioned by Devlinet al. and Yang et al. for generating the languagemodel datasets.

3.2 Relevance Learning using Classificationhead

We model the relevance learning task as a two-class classification task with class 0 represent-ing irrelevance and class 1 representing relevance.Here by relevance, we mean the fact is part ofthe explanation. Finally, we take the probabilityscores of class 1, and use them as relevance scores.Formally,

Rel(Ej , Qi, Ai) = P (Ej ∈ G|Qi, Ai) (1)

where Ej is the jth explanation, Q,A are the ithquestion and correct answer pair and G is the setof gold explanation facts.

79

The dataset provides the set of gold facts, butdoes not provide the set of irrelevant facts. Tocreate the irrelevant set, for each gold fact we re-trieve top k similar facts present in the explana-tion corpora, but not present in the gold fact set.This is done to make the model learn relevanceto the context of passage and correct answer, andnot focus on similar looking sentences. We com-pute the similarity between sentences using cosinesimilarity between sentence vectors (Honnibal andMontani, 2017). We repeat the gold explanation ktimes to maintain the balance between the classes.

We prepare another dataset where we providea context passage. This passage comprises of al-ready selected n facts, and the rest |G| − n goldfacts are labelled as relevant class 1. We find ir-relevant facts for the |G| − n gold facts using thesame process as above. In this case, we learn thefollowing probability:

Rel(Ej , Cn, Qi, Ai) = P (Ej ∈ G|Cn, Qi, Ai)(2)

where Cn represents the n already selected facts,1 ≤ n ≤ 16, as there are a maximum of 16 andminimum of 1 gold explanation facts. This contextis given only during the training phase, while dur-ing the validation and testing, we only provide thequestion and correct answer pair along with expla-nation Ej . Moreover, we ensure that the dataset isbalanced between two classes. To make the modellearn longer dependencies, we train using a con-text. This classification task optimizes classifica-tion accuracy metric.

3.3 Relevance Learning using Regressionhead

The datasets for the regression tasks are similar tothe datasets of classification head. Instead of twoclass classification, we provide target scores of 6,5, 4, 0 for Central, Grounding, Lexical Glue andIrrelevant facts respectively. The above scoringscheme was decided to give central and ground-ing facts higher precedence, as they are core fora proper explanation. All target scores were en-sured to be balanced. As described in the abovesection, we prepare two datasets, one with andanother without context explanation sentences.The regression task optimizes mean-squared-errorscores.

4 Iterative Re-Ranking

We sort the Relevance scores from the RelevanceLearner models and perform an initial ranking ofthe explanation sentences. We feed this initialranked explanation facts to our iterative re-rankingalgorithm, which is defined as follows.

Let N be the depth of re-ranking from the top,i.e, we run re-ranking for N rounds. Let E0 bethe top explanation fact in the initial ranking, Ei

be the last selected explanatory fact and Ej (i <j ≤ N + i) is the current candidate explanationfact for a given question Q and correct answer A.We compute a weighted relevance score using thetop i (i <= N ) selected facts as:

Wscore(Ej , Ei) =

∑ik=0Rel(Ek) ∗ Sim(Ej , Ek)∑i

k=0Rel(Ek)(3)

We sort ranking scores of the candidate expla-nation facts and choose the top explanation fact forthe i+1 th round, where the ranking score is givenby :

score(Ej , Ei, Q,A)

=Wscore(Ej , Ei) ∗ Sim(Ej , Q : A)(4)

Here Rel is the relevance score from the Rele-vance Learner models and Sim is the cosine sim-ilarity of the explanation sentence vectors fromSpacy (Honnibal and Montani, 2017). For thefacts whose initial rank is greater than depth N ,we keep the initial ranking as is. The itera-tive re-ranking algorithm is designed to exploitthe overlapping nature of the explanations. Theabove score gives importance to the initial rele-vance score (facts already ranked by relevance),the vector similarity of the candidate explanationand both the selected explanation sentences andquestion/correct answer pair.

5 Experiments and Test Results

The training dataset for the task contains 1191questions, each having 4 answers. The gold expla-nations set has a minimum size of 1 and maximumsize of 16. The validation dataset had 265 ques-tions and the hidden test set has 1248 questions.The explanation knowledge corpora has around5000 explanation sentences. The two relevancelearner training dataset has size of 99687 withoutcontext and 656250 with context. Several com-binations of context are generated using the gold

80

selected explanation facts, leading to such a largetraining corpus. We evaluate both BERT Largeand XLNET large, using both the tasks and the twodifferent datasets. In Table 2 are the results of ourevaluation on the validation set. All the metrics areMean-Average-Precision unless mentioned other-wise. All the metrics are on the validation set.

Task v/s Model BERT XLNETClassification 0.3638 0.3254Classification with Context 0.3891 0.3473Regression 0.3288 0.3164Regression with Context 0.3466 0.3273

Table 2: Comparison of the Relevance Learners withdifferent dataset preparation techniques without re-ranking

It can be observed in Table 2 that the two classclassification head with context performs best andBERT Large outperforms XLNET Large for thisparticular task. In Table 3, we compare the Rel-evance Learners before and after Iterative Re-ranking. It can be seen that Iterative Re-rankingimproves the scores of both the Relevance Learn-ers by around 2.5%.

Table 4 compares the MAP for different expla-nation roles before and after iterative re-ranking.It can be seen that Iterative re-ranking improvesMAP for Central and Grounding explanations butpenalizes Lexical Glue.

Figure 1 shows performance of the RelevanceLearner and Iterative Re-ranking for questionswith different length of gold explanations. It canbe seen that the model performs well for expla-nations whose length are less than or equal to 5.Performance decreases with increasing length ofgold explanations.

1Background and Neg roles were found in the gold expla-nation set but definition for them are not shared.

N v/s Model BERT XLNET1 0.3891 0.34733 0.4000 0.35565 0.4062 0.366110 0.4181 0.370115 0.4225 0.373820 0.4204 0.372130 0.4191 0.3665

Table 3: Comparison of Relevance Learners with Iter-ative Re-ranking till depth N

Explanation Roles BERT N=15CENTRAL 0.3589 0.3912GROUNDING 0.0631 0.0965LEXICAL GLUE 0.1721 0.1537BACKGROUND 1 0.0253 0.0226NEG 1 0.0003 0.000586

Table 4: MAP for different Explanation Roles forBERT trained with classification head and context, be-fore and after re-ranking till N=15

Figure 1: MAP v/s Length of the Gold Explanation

Table 5 shows the MAP scores for the best mod-els on both the Validation and the hidden Test set.

6 Error Analysis

In the following sub-sections we analyse our sys-tem components, the performance of the final re-ranked Relevance Learner system and the sharedtask dataset.

6.1 Model Analysis1. XLNET performs poorly compared to BERT

for this task. The difference arises due totwo reasons, the way the datasets are pre-pared and the way the language models arefinetuned. The dataset preparation tech-niques BERT captures deeper chains and bet-ter ranks those explanation which have low

Model Validation TestBERT with Context 0.3891 0.3983ReRanked N=15 0.4225 0.4130Baseline SVM Rank 0.28 0.2962

Table 5: Validation and Test MAP for the best Rele-vance Learner, Reranked and the provide baseline mod-els

81

Gold Explanation Predicted Explanationheat means temperature increases adding heat means increasing temperaturesunlight is a kind of solar energy the sun is the source of solar energy called sunlightlook at means observe observe means seea kitten is a kind of young; baby cat a kitten is a young; baby cat

Table 6: Similar Explanations present in top 30

Gold Explanation Predicted Explanation

an animal is a kind of living thingan animal is a kind of organism

an organism is a living thing

a frog is a kind of aquatic animala frog is a kind of amphibian

an amphibian is a kind of animala leaf is a part of a tree

a leaf is a part of a,green planta tree is a kind of plant

Table 7: Single-hop and Multi-hop Errors in top 30

Gold Explanationto reduce means to decreasePredicted Explanationto lower means to decreaseless means a reduced amount

Table 8: Errors due to Sentence Vectors in top 30

direct lexical or semantic overlap with thequestion and correct answer.

2. XLNET focuses on explanations mainly onthe words whose word vectors are closely re-lated to the question and answer, and per-forms poorly for explanations which are oneor two hop away. The dataset with context,improves the performance for both, enablingcapturing deeper chains to some extent.

3. The way the datasets are prepared introducesbias against some explanation facts. For ex-ample, the Lexical Glue facts are of the type“X means Y” and the Grounding facts are ofthe type “X is kind of Y”. Using sentencevectors for identifying similar but irrelevantexplanations leads to a set of explanations be-ing particularly tagged as irrelevant. Theseare ranked low even if they are relevant forthe validation set. This leads to poor perfor-mance compared to Central facts.

4. The Iterative Re-ranking algorithm improvesthe performance irrespective of the Rele-vance Learner model. The algorithm givesimportance to the relevance score, vector

Gold Explanationtemperature rise means become warmerPredicted Explanationwarm up means increase temperaturewarmer means greater; higher in temperature

Table 9: Model unable to understand ordering in Lexi-cal Glue in top 30

similarity with previously selected explana-tions and vector similarity with the questionanswer pair. This introduces a bias againstLexical Glue explanations, as they only havea word common with the entire question, an-swer and previously selected facts. The op-timal depth of the re-ranking correlates withthe maximum length of explanations.

5. The model is able to rank Central explana-tions with a high precision. Central factspossess a significant overlap with the ques-tion and correct answer. The re-ranking al-gorithm improves the precision even further.From Figure 1, it can be inferred, the expla-nations with length 1 only contain Central ex-planations. For explanations with length 2,the model precision drops considerably. Thisoccurs because the model is poor in rankingLexical Glue and Grounding explanations.

6. For Lexical Glue and Grounding explana-tions, which have the form “X means Y” and“X is kind of Y”, the model is not able to un-derstand the order between X and Y required

82

for the explanation, i.e, instead of “X meansY”, it ranks “Y means X” higher. Table 9 isone such instance. This contributes to the lowMAP for these types of explanations.

7. The use of sentence vectors for similarityintroduces errors shown in Table 8, wherethe correct explanation contains “reduce”, butthe predicted explanations which have simi-lar words like “lower”, “less” and “reducedamount”, are ranked higher.

8. Out of the total 226 questions in the vali-dation set, there were only 13 questions forwhich the system could not predict any of thegold explanation facts in the top 30. The factspredicted for these had a high word over-lap, both symbolic and word-vector wise, butwere not relevant to the explanation set.

6.2 Dataset Analysis

1. We further looked at the gold annotations andtop 30 model predictions and identified fewpredictions having similar semantic meaningbeing present. For example in Table 6, thepredicted explanations were present in top30. This shows there can be alternate expla-nations other than the provided gold set.

2. In Table 7 we can see the model makes bothkinds of errors. For few gold explanations, itbrings two alternate explanation facts and forsome explanations it combines the facts andranks a single explanation in the top 30. Thisalso shows there can be several such combi-nations possible.

7 Discussion

From the analysis, we can observe that multiplealternate explanations are possible. This is analo-gous to multiple paths being available for the ex-planation of a phenomenon. Our model precisionshould improve with availability of such alternateexplanations. We recommend enriching the goldannotations with possible alternatives for futurerounds of the task.

It is promising to see a language model based onstacked attention layers is able to perform multi-hop inference with a reasonable precision, with-out much feature engineering. The use of neu-ral language models and sentence vector similar-ities bring errors, such as point 6 and 7 in the

Error Analysis section. We can introduce sym-bolic and graph based features to capture order-ing (Witschel, 2007). We can also compute graphfeature-enriched sentence vectors using principlesof textual and visual grounding (Cai et al., 2018;Guo et al., 2016; Yeh et al., 2018; Grover andLeskovec, 2016). In our system design, we didnot use the different explanation roles and the de-pendencies between them. Using such features theprecision is likely to improve further. Our Iterativere-ranking algorithm shows it can improve the pre-cision even more, given a reasonably precise Rel-evance Learner model. This is the first time thistask has been organized and there is lot of scopefor improvement in precision.

8 Conclusion

In this paper, we have presented a system that par-ticipated in the shared task on explanation regen-eration and ranked second out of 4 participatingteams. We designed a simple system using a neu-ral language model as a relevance learner and aniterative re-ranking algorithm. We have also pre-sented detailed error analysis of the system output,the possible enrichments in the gold annotations ofthe dataset and discussed possible directions forfuture work.

ReferencesPratyay Banerjee, Kuntal Kumar Pal, Arindam Mitra,

and Chitta Baral. 2019. Careful selection of knowl-edge to solve open book question answering. In Pro-ceedings of the 57th Conference of the Associationfor Computational Linguistics, pages 6120–6129.

Christopher Burges, Tal Shaked, Erin Renshaw, AriLazier, Matt Deeds, Nicole Hamilton, and Gre-gory N Hullender. 2005. Learning to rank using gra-dient descent. In Proceedings of the 22nd Interna-tional Conference on Machine learning (ICML-05),pages 89–96.

Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. 2018. A comprehensive survey ofgraph embedding: Problems, techniques, and appli-cations. IEEE Transactions on Knowledge and DataEngineering, 30(9):1616–1637.

Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao.2009. Clueweb09 data set.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. Bert: Pre-training ofdeep bidirectional transformers for language under-standing. In Proceedings of the 2019 Conference ofthe North American Chapter of the Association for

83

Computational Linguistics: Human Language Tech-nologies, Volume 1 (Long and Short Papers), pages4171–4186.

Aditya Grover and Jure Leskovec. 2016. node2vec:Scalable feature learning for networks. In Proceed-ings of the 22nd ACM SIGKDD international con-ference on Knowledge discovery and data mining,pages 855–864. ACM.

Shu Guo, Quan Wang, Lihong Wang, Bin Wang, andLi Guo. 2016. Jointly embedding knowledge graphsand logical rules. In Proceedings of the 2016 Con-ference on Empirical Methods in Natural LanguageProcessing, pages 192–202.

Matthew Honnibal and Ines Montani. 2017. spacy 2:Natural language understanding with bloom embed-dings, convolutional neural networks and incremen-tal parsing. To appear.

Peter Jansen. 2018. Multi-hop inference for sentence-level textgraphs: How challenging is meaningfullycombining information for science question answer-ing? In Proceedings of the Twelfth Workshop onGraph-Based Methods for Natural Language Pro-cessing (TextGraphs-12), pages 12–17.

Peter Jansen, Elizabeth Wainwright, Steven Mar-morstein, and Clayton T. Morrison. 2018.Worldtree: A corpus of explanation graphs forelementary science questions supporting multi-hopinference. In Proceedings of the 11th InternationalConference on Language Resources and Evaluation(LREC).

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth,Shyam Upadhyay, and Dan Roth. 2018. Lookingbeyond the surface: A challenge set for reading com-prehension over multiple sentences. In Proceed-ings of the 2018 Conference of the North AmericanChapter of the Association for Computational Lin-guistics: Human Language Technologies, Volume 1(Long Papers), volume 1, pages 252–262.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang,and Eduard Hovy. 2017. Race: Large-scale readingcomprehension dataset from examinations. arXivpreprint arXiv:1704.04683.

Todor Mihaylov, Peter Clark, Tushar Khot, and AshishSabharwal. 2018. Can a suit of armor conduct elec-tricity? a new dataset for open book question an-swering. In EMNLP.

Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and LukeZettlemoyer. 2018. Deep contextualized word rep-resentations. In Proceedings of the 2018 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics: Human Lan-guage Technologies, Volume 1 (Long Papers), pages2227–2237.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, andPercy Liang. 2016. Squad: 100,000+ questions formachine comprehension of text. In Proceedings ofthe 2016 Conference on Empirical Methods in Nat-ural Language Processing, pages 2383–2392. Asso-ciation for Computational Linguistics.

Johannes Welbl, Pontus Stenetorp, and SebastianRiedel. 2018. Constructing datasets for multi-hopreading comprehension across documents. Transac-tions of the Association for Computational Linguis-tics, 6:287–302.

Hans Friedrich Witschel. 2007. Multi-level associationgraphs - a new graph-based model for InformationRetrieval. In Proceedings of the Second Workshopon TextGraphs: Graph-Based Algorithms for Nat-ural Language Processing, pages 9–16, Rochester,NY, USA. Association for Computational Linguis-tics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-bonell, Ruslan Salakhutdinov, and Quoc V Le.2019. Xlnet: Generalized autoregressive pretrain-ing for language understanding. arXiv preprintarXiv:1906.08237.

Raymond A Yeh, Minh N Do, and Alexander GSchwing. 2018. Unsupervised textual grounding:Linking words to image concepts. In Proceedings ofthe IEEE Conference on Computer Vision and Pat-tern Recognition, pages 6125–6134.

84


Red Dragon AI at TextGraphs 2019 Shared Task:Language Model Assisted Explanation Generation

Yew Ken ChiaRed Dragon AI

[email protected]

Sam WitteveenRed Dragon AI

[email protected]

Martin AndrewsRed Dragon AI

[email protected]

Abstract

The TextGraphs-13 Shared Task on Explana-tion Regeneration (Jansen and Ustalov, 2019)asked participants to develop methods to re-construct gold explanations for elementary sci-ence questions. Red Dragon AI’s entries usedthe language of the questions and explanationtext directly, rather than a constructing a sep-arate graph-like representation. Our leader-board submission placed us 3rd in the compe-tition, but we present here three methods of in-creasing sophistication, each of which scoredsuccessively higher on the test set after thecompetition close.

1 Introduction

The Explanation Regeneration shared task askedparticipants to develop methods to reconstructgold explanations for elementary science ques-tions (Clark et al., 2018), using a new corpusof gold explanations (Jansen et al., 2018) thatprovides supervision and instrumentation for thismulti-hop inference task.

Each explanation is represented as an “explana-tion graph”, a set of atomic facts (between 1 and16 per explanation, drawn from a knowledge baseof 5,000 facts) that, together, form a detailed ex-planation for the reasoning required to answer andexplain the resoning behind a question.

Linking these facts to achieve strong perfor-mance at rebuilding the gold explanation graphsrequires methods to perform multi-hop inference -which has been shown to be far harder than infer-ence of smaller numbers of hops (Jansen, 2018),particularly for the case here, where there is con-siderable uncertainty (at a lexical level) of howindividual explanations logically link somewhat‘fuzzy’ graph nodes.

Data Python Scala Python Leaderboardsplit Baseline Baseline Baseline1e9 Submission

Train 0.0810 0.2214 0.4216

Dev 0.0544 0.2890 0.2140 0.4358

Test 0.4017

Table 1: Base MAP scoring - where the PythonBaseline1e9 is the same as the original Python Baseline,but with the evaluate.py code updated to assumemissing explanations have rank of 109

1.1 Dataset Review

The WorldTree corpus (Jansen et al., 2018) is anew dataset is a comprehensive collection of ele-mentary science exam questions and explanations.Each explanation sentence is a fact that is relatedto science or common sense, and is representedin a structured table that can be converted to free-text. For each question, the gold explanations havelexical overlap (i.e. having common words), andare denoted as having a specific explanation rolesuch as CENTRAL (core concepts); GROUNDING(linking core facts to the question); and LEXICALGLUE (linking facts which may not have lexicaloverlap).

1.2 Problem Review

As described in the introduction, the general taskbeing posed is one of multi-hop inference, wherea number of ‘atomic fact’ sentences must be com-bined to form a coherent chain of reasoning tosolve the elementary science problem being posed.

These explanatory facts must be retrieved froma semi-structured knowledge base - in which thesurface form of the explanation is represented as aseries of terms gathered by their functional role inthe explanation.

For instance, for the explanation “Grass snakeslive in grass” is encoded as “[Grass snakes] [livein] [grass]”, and this explanation is found in a

85

PROTO-HABITATS table. However, in the sametable there are also more elaborate explanations,for example : “Mice live in in holes in the groundin fields / in forests.” is expressed as : “[mice][live in] [in holes in the ground] [in fields OR inforests]”. And more logically complex : “Mostpredators live in/near the same environment astheir prey.” being expressed as : “[most] [preda-tors] [live in OR live near] [the same environmentas their prey]”.

So, whereas the simpler explanations fit in theusual Knowledge-Base triples paradigm, the morecomplex ones are much more nuanced about whatactually constitutes a node, and how reliable thearcs are between them. Indeed, there is also a col-lection of if/then explanations, including ex-amples such as : “[if] [something] [has a] [posi-tive impact on] [something else] [then] [increas-ing] [the] [amount of] [that something] [has a][positive impact on] [that something else]” - wherethe explanation has meta-effect on the graph itself,and includes ‘unbound variables’. 1

2 Preliminary Steps

In this work, we used the pure textual form of eachexplanation, problem and correct answer, ratherthan using the semi-structured form given in thecolumn-oriented files provided in the dataset. Foreach of these we performed Penn-Treebank to-kenisation, followed by lemmatisation using thelemmatisation files provided with the dataset, andthen stop-word removal.2

Concerned by the low performance of thePython Baseline method (compared to the ScalaBaseline, which seemed to operate using an al-gorithm of similar ‘strength’), we identified anissue in the organizer’s evaluation script wherepredicted explanations that were missing any ofthe gold explanations were assigned a MAP scoreof zero. This dramatically penalised the PythonBaseline, since it was restricted to only returning10 lines of explanation. It also effectively forcesall submissions to include a ranking over all ex-planations - a simple fix (with the Python Baselinerescored in Table 1) will be submitted via GitHub.This should also make the upload/scoring processfaster, since only the top ∼1000 explanation linesmeaningfully contribute to the rank scoring.

1The PROTO-IF-THEN explanation table should havebeen annotated with a big red warning sign

2PTB tokenisation and stopwords from the NLTK pack-age)

3 Model Architectures

Although more classic graph methods were ini-tially attempted, along the lines of Kwon et al.(2018), where the challenge of semantic drift inmulti-hop inference was analysed and the effec-tiveness of information extraction methods wasdemonstrated, the following 3 methods (whichnow easily surpass the score of our competitionsubmission) were ultimately pursued due to theirsimplicity/effectiveness.

Data Optimised Iterated BERT

split TF-IDF TF-IDF Re-ranking

Train 0.4525 0.4827 0.6677

Dev 0.4581 0.4966 0.5089

Test 0.4274 0.4576 0.4771

Time 0.02 46.97 92.96

Table 2: MAP scoring of new methods. The timingsare in seconds for the whole dev-set, and the BERTRe-ranking figure includes the initial Iterated TF-IDFstep.

3.1 Optimized TF-IDF

As mentioned above, the original TF-IDF imple-mentation of the provided Python baseline scriptdid not predict a full ranking, and was penalizedby the evaluation script. When this issue wasremedied, its MAP score rose to 0.2140.

However, there are three main steps that signif-icantly improve the performance of this baseline:

1. The original question text included all the an-swer choices, only one of which was correct(while the others are distractors). Removingthe distractors resulted in improvement;

2. The TF-IDF algorithm is very sensitive tokeywords. Using the provided lemmatisationset and NLTK for tokenisation helped to alignthe different forms of the same keyword andreduce the vocabulary size needed;

3. Stopword removal gave us approximately0.04 MAP improvement throughout - remov-ing noise in the texts that was evidently ‘dis-tracting’ for TF-IDF.

As shown in Table 2, these optimisation stepsincreased the Python Baseline score significantly,without introducing algorithmic complexity.

86

3.2 Iterated TF-IDF

While graph methods have shown to be effectivefor multi-hop question answering, the schema inthe textgraphs dataset is unconventional (as illus-trated earlier). To counter this, the previous TF-IDF method was extended to simulate jumps be-tween explanations, inspired by graph methods,but without forming any actual graphs:

1. TF-IDF vectors are pre-computed for allquestions and explanation candidates;

2. For each question, the closest explanationcandidate by cosine proximity is selected,and their TF-IDF vectors are aggregated bya max operation;

3. The next closest (unused) explanation is se-lected, and this process was then applied it-eratively up to maxlen=128 times3, withthe current TF-IDF comparison vector pro-gressively increasing in expressiveness. Ateach iteration, the current TF-IDF vector wasdown-scaled by an exponential factor of thelength of the current explanation set, as thiswas found to increase development set resultsby up to +0.0344.

By treating the TF-IDF vector as a representa-tion of the current chain of reasoning, each succes-sive iteration builds on the representation to accu-mulate a sequence of explanations.

The algorithm outlined above was additionallyenhanced by adding a weighting factor to eachsuccessive explanation as it is added to the cumu-lative TF-IDF vector. Without this factor, the ef-fectiveness was lower because the TF-IDF repre-sentation itself was prone to semantic drift awayfrom the original question. Hence, each succes-sive explanation’s weight was down-scaled, andthis was shown to work well.4

3.3 BERT Re-ranking

Large pretrained language models have beenproven effective on a wide range of downstreamtasks, including multi-hop question answering,such as in Liu et al. (2019) on the RACE dataset,

3 This maxlen value was chosen to minimise computa-tion time, noting that explanation ranks below approximately100 have negligible impact on the final score.

4Full, replicable code is available on GitHub for all3 methods described here, at https://github.com/mdda/worldtree_corpus/tree/textgraphs

and Xu et al. (2019) which showed that large fine-tuned language models can be beneficial for com-plex question answering domains (especially in adata-constrained context).

Inspired by this, we decided to adapt BERT(Devlin et al., 2018) - a popular language modelthat has produced competitive results on a varietyof NLP tasks - for the explanation generation task.

For our ‘BERT Re-ranking’ method, we attach aregression head to a BERT Language Model. Thisregression head is then trained to predict a rele-vance score for each pair of question and explana-tion candidate. The approach is as follows :

1. Calculate a TF-IDF relevance score for everytokenised explanation against the tokenised‘[Problem] [CorrectAnswer] [Gold explana-tions]’ in the training set. This will rate thetrue explanation sentences very highly, butalso provide a ‘soft tail’ of rankings acrossall explanations;

2. Use this relevance score as the predictiontarget of the BERT regression head, whereBERT makes its predictions from the original‘[Problem] [CorrectAnswer]’ text combinedwith each potential Explanation text in turn(over the training set);

3. At prediction time, the explanations areranked according to their relevance to ‘[Prob-lem] [CorrectAnswer]’ as predicted by theBERT model’s output.

We cast the problem as a regression task (ratherthan a classification task), since treating it as a taskto classify which explanations are relevant wouldresult in an imbalanced dataset because the goldexplanation sentences only comprise a small pro-portion of the total set. By using soft targets (givento us by the TF-IDF score against the gold answersin the training set), even explanations which arenot designated as “gold” but have some relevanceto the gold paragraph can provide learning signalfor the model.

Due to constraints in compute and time, themodel is only used to rerank the topn = 64 pre-dictions made by the TF-IDF methods.

The BERT model selected was of “Base” sizewith 110M parameters, which had been pretrainedon BooksCorpus and English Wikipedia. Wedid not further finetune it on texts similar to theTextGraphs dataset prior to regression training. In

87

1 2 3 4 5 6 7 8 9 10+

Gold explanation lengths

0.0

0.2

0.4

0.6

0.8

Mea

nM

AP

scor

e

Mean MAP score against Gold explanation lengths

OptimizedTFIDF

IterativeTFIDF

IterativeTFIDF + BERT

Figure 1: Mean MAP score against gold explanationlengths

other tests, we found that the “Large” size modeldid not help improve the final MAP score.

4 Discussion

The authors’ initial attempts at tackling the SharedTask focussed on graph-based methods. However,as identified in (Jansen, 2018), the uncertainty in-volved with interpreting each lexical representa-tion, combined with the number of hops required,meant that this line of enquiry was put to one side5.

While the graph-like approach is clearly attrac-tive from a reasoning point of view (and will be thefocus of future work), we found that using purelythe textual aspects of the explanation databasebore fruit more readily. Also. the complexity ofthe resulting systems could be minimised such thatthe description of each system could be as consiseas possible.

Specifically, we were able to optimise the TF-IDF baseline to such an extent that our ‘Opti-mised TF-IDF’ would now place 2nd in the sub-mission rankings, even though it used no specialtechniques at all.6

The Iterated TF-IDF method, while more algo-rithmically complex, also does not need any train-ing on the data before it is used. This shows howeffective traditional text processing methods canbe, when used strategically.

The BERT Re-ranking method, in contrast, doesrequire training, and also applies one of the moresophisticated Language Models available to ex-tract more meaning from the explanation texts.

Figure 1 illustrates how there is a clear trend to-

5Having only achieved 0.3946 on the test set6Indeed, our Optimized TF-IDF, scoring 0.4581 on the

dev set, and 0.4274 on the test set, could be considered a newbaseline for this corpus, given its simplicity.

Explanation Optimised Iterated BERT

role TF-IDF TF-IDF Re-ranking

GROUNDING 0.1373 0.1401 0.0880

LEX-GLUE 0.0655 0.0733 0.0830

CENTRAL 0.4597 0.5033 0.5579

BACKGROUND 0.0302 0.0285 0.0349

NEG 0.0026 0.0025 0.0022

ROLE 0.0401 0.0391 0.0439

Table 3: Contribution of Explanation Roles - Dev-SetMAP per role (computed by filtering explanations ofother roles out of the gold explanation list then com-puting the MAP as per normal)

wards being able to build longer explanations asour semantic relevance methods become more so-phisticated.

There are also clear trends across the data in Ta-ble 3 that show that the more sophisticated meth-ods are able to bring more CENTRAL explanationsinto the mix, even though they are more ‘textuallydistant’ from the original Question and Answerstatements. Surprisingly, this is at the expense ofsome of the GROUNDING statements.

Since these methods seem to focus on differentaspects of solving the ranking problem, we havealso explored averaging the ranks they assign tothe explanations (essentially ensembling their de-cisions). Empirically, this improves performance7

at the expense of making the model more obscure.

4.1 Further Work

Despite our apparent success with less sophis-ticated methods, it seems clear that more ex-plicit graph-based methods appears will be re-quired to tackle the tougher questions in thisdataset (for instance those that require logical de-ductions, as illustrated earlier, or hypothetical sit-uations such as some ‘predictor-prey equilibrium’problems). Even some simple statements (such as‘Most predators ...’) present obstacles to existingKnowledge-Base representations.

In terms of concrete next steps, we are ex-ploring the idea of creating intermediate forms ofrepresentation, where textual explanations can belinked using a graph to plan out the logical steps.However these grander schemes suffer from being

7The combination of ‘Iterated TF-IDF’ and ‘BERT Re-ranking’ scoring 0.5195 on the dev set, up from their scoresof 0.4966 and 0.5089 respectively

88

incrementally less effective than finding additional‘smart tricks’ for existing methods!

In preparation, we have begun to explore doingmore careful preprocessing, notably :

1. Exploiting the structure of the explanationtables individually, since some columns areknown to be relationship-types that would besuitable for labelling arcs between nodes in atypical Knowledge Graph setting;

2. Expanding out the conjunction elementswithin the explanation tables. For instancein explanations like “[coral] [lives in the][ocean OR warm water]”, the different sub-explanations “(Coral, LIVES-IN, Ocean)”and “(Coral, LIVES-IN, WarmWater)” canbe generated, which are far closer to a ‘graph-able’ representation;

3. Better lemmatisation : For instance ‘ice cube’covers both ‘ice’ and ‘ice cube’ nodes. Weneed some more ‘common sense’ to coverthese cases.

Clearly, it is early days for this kind of multi-hop inference over textual explanations. At thispoint, we have only scratched the surface of theproblem, and look forward to helping to advancethe state-of-the-art in the future.

Acknowledgments

The authors would like to thank Google for ac-cess to the TFRC TPU program which was usedin training and fine-tuning models during experi-mentation for this paper.

ReferencesPeter F. Clark, Isaac Cowhey, Oren Etzioni, Tushar

Khot, Ashish Sabharwal, Carissa Schoenick, andOyvind Tafjord. 2018. Think you have solved ques-tion answering? Try ARC, the AI2 Reasoning Chal-lenge. ArXiv, arXiv:1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2018. BERT: Pre-trainingof deep bidirectional transformers for languageunderstanding. Computing Research Repository,arXiv:1810.04805.

Peter Jansen. 2018. Multi-hop inference for sentence-level textgraphs: How challenging is meaningfullycombining information for science question answer-ing? In Proceedings of the Twelfth Workshop onGraph-Based Methods for Natural Language Pro-cessing (TextGraphs-12), pages 12–17.

Peter Jansen and Dmitry Ustalov. 2019. TextGraphs2019 Shared Task on Multi-Hop Inference for Ex-planation Regeneration. In Proceedings of the Thir-teenth Workshop on Graph-Based Methods for Nat-ural Language Processing (TextGraphs-13), HongKong. Association for Computational Linguistics.

Peter Jansen, Elizabeth Wainwright, Steven Mar-morstein, and Clayton Morrison. 2018. WorldTree:A Corpus of Explanation Graphs for ElementaryScience Questions supporting Multi-hop Inference.In Proceedings of the Eleventh International Confer-ence on Language Resources and Evaluation (LREC2018), Miyazaki, Japan. European Language Re-sources Association (ELRA).

Heeyoung Kwon, Harsh Trivedi, Peter Jansen, Mi-hai Surdeanu, and Niranjan Balasubramanian. 2018.Controlling information aggregation for complexquestion answering. In European Conference on In-formation Retrieval, pages 750–757. Springer.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,Luke Zettlemoyer, and Veselin Stoyanov. 2019.Roberta: A robustly optimized bert pretraining ap-proach. arXiv preprint arXiv:1907.11692.

Dongfang Xu, Peter Jansen, Jaycie Martin, Zheng-nan Xie, Vikas Yadav, Harish Tayyar Madabushi,Oyvind Tafjord, and Peter Clark. 2019. Multi-class hierarchical question classification for mul-tiple choice science exams. arXiv preprintarXiv:1908.05441.

89


Team SVMrank: Leveraging Feature-rich Support Vector Machines forRanking Explanations to Elementary Science Questions

Jennifer D’Souza1, Isaiah Onando Mulang’2, Soren Auer1

1TIB Leibniz Information Centre for Science and Technology, Hannover, Germanyjennifer.dsouza|[email protected]

2Fraunhofer IAIS, Sank Augustin, [email protected]

Abstract

The TextGraphs 2019 Shared Task on Multi-Hop Inference for Explanation Regeneration(MIER-19) tackles explanation generation foranswers to elementary science questions. Itbuilds on the AI2 Reasoning Challenge 2018(ARC-18) which was organized as an ad-vanced question answering task on a datasetof elementary science questions. The ARC-18 questions were shown to be hard to an-swer with systems focusing on surface-levelcues alone, instead requiring far more power-ful knowledge and reasoning.

To address MIER-19, we adopt a hybridpipelined architecture comprising a feature-rich learning-to-rank (LTR) machine learningmodel, followed by a rule-based system forreranking the LTR model predictions. Our sys-tem was ranked fourth in the official evalu-ation, scoring close to the second and thirdranked teams, achieving 39.4% MAP.

1 Introduction

The TextGraphs 2019 Shared Task on Multi-HopInference for Explanation Regeneration (Jansenand Ustalov, 2019) was organized for the semanticevaluation of systems for providing explanationsto elementary science question answers. The taskitself was formulated as a ranking task, where thegoal was to rerank the relevant explanation sen-tences in a given knowledge base of over 4,000candidate explanation sentences for a given pairof an elementary science question and its correctanswer. The QA part of the MIER-19 dataset,including the questions and their multiple-choiceanswers, had been released previously as the AI2Reasoning Challenge (Clark et al., 2018) datasetcalled ARC-18. Since answering science ques-tions necessitates reasoning over a sophisticatedunderstanding of both language and the world andover commonsense knowledge, ARC-18 specifi-

Question Granite is a hard material and forms fromcooling magma. Granite is a type ofAnswer igneous rockExplanation[rank 1] igneous rocks or minerals are formed frommagma or lava cooling[rank 2] igneous is a kind of rock[rank 3] a type is synonymous with a kind[not in gold expl] rock is hard[not in gold expl] to cause the formation of means to form[not in gold expl] metamorphic rock is a kind of rock[not in gold expl] cooling or colder means removing orreducing or decreasing heat or temperature

Table 1: Example depicting the Multi-Hop InferenceExplanation Regeneration Task. The multi-hop infer-ence task was formulated around the presence of alexical overlap (shown as underlined words) betweenexplanation sentences with the question or answer orother correct explanation sentences. Since the lexicaloverlap criteria was not strictly defined around onlythe correct explanation candidates (as depicted with thelast four explanation candidates), the task necessitateduse of additional domain and world knowledge to ruleout incorrect explanation candidates.

cally encouraged progress on advanced reason-ing QA where little progress was made as opposedto factoid-based QA. This was highlighted whensophisticated neural approaches for factoid-basedQA (Parikh et al., 2016; Seo et al., 2016; Khotet al., 2018) tested on the ARC-18 did not achievegood results. Now with the MIER-19 task, theARC-18 objective of testing QA systems for ad-vanced reasoning dives deeper into the reasoningaspect by focusing on reranking explanations forquestions and their correct answer choice.

In this article, we describe the version of oursystem that participated in MIER-19. Systemsparticipating in this task assume as input the ques-tion, its correct answer, and a knowledge base ofover 4,000 candidate explanation sentences. Thetask then is to return a ranked list of the expla-nation sentences where facts in the gold explana-

90

tion are expected to be ranked higher than facts notpresent in the gold explanation (cf. Table 1). Ourteam was ranked fourth in the official evaluation,scoring within a point gap to the second and thirdranked teams, achieving an overall 39.40% MAP.

Going beyond mere Information Retrieval suchas tf−idf for ranking relevant sentences to aquery, our system addresses the explanation sen-tence ranking task as follows: we (a) adopt lexical,grammatical, and semantic features for obtainingstronger matches between a question, its correctanswer, and candidate explanation sentences forthe answer in a pairwise learning-to-rank frame-work for ranking explanation sentences; and (b)perform a reranking of the returned ranked expla-nation sentences via a set of soft logical rules tocorrect for obvious errors made by the learning-based ranking system.1

The remainder of the article is organized as fol-lows. We first give a brief overview of the MIER-19 Shared Task and the corpus (Section 2). Afterthat, we describe related work (Section 3). Finally,we present our approach (Section 4), evaluationresults (Section 5), and conclusions (Section 6).

2 The MIER-19 Shared Task

2.1 Task Description

The MIER-19 task (Jansen and Ustalov, 2019) fo-cused on computing a ranked list of explanationsentences (as shown in Table 1) for a question andits correct answer (QA) from an unordered collec-tion of explanation sentences. Specifically, givena question, its known correct answer, and a list ofn explanation sentences, the goal was to (1) de-termine whether an explanation sentence is rele-vant as justification for the QA, and if so, (2) rankthe relevant explanation sentences in order by theirrole in forming a logical discourse fragment.2

2.2 The Task Corpus

To facilitate system development, 1,190 Elemen-tary Science (i.e. 3rd through 5th grades) ques-tions were released as part of the training data.

1Our code is released for facilitating future work https://bit.ly/2lZo9eW

2Each explanation sentence is also annotated with its ex-plicit discourse role in the explanation fragment for trainingand development data (i.e. as central if it involves core QAconcepts, or as lexical glue if it simply serves as a connectorin the explanation sentence sequence, or as background infor-mation, etc.). However, we do not consider this annotation aspart of the data since it is not available for the test set.

0

20

40

60

80

100

120

140

160

0 5 10 15 20 25

# Q

A In

stan

ces

# Explanation Sentences

Training Data

Development Data

Figure 1: Explanation sentences per question-answerpair in the training and development dataset.

Each question is a 4-way multiple choice ques-tion with the correct answer known. Further, everyquestion in the training data is accompanied by upto 21 explanation sentences picked from a human-authored tablestore of candidate explanation sen-tences (see Section 2.2.1 for more details). Simi-larly, development data was provided, containing264 multiple choice questions with a known cor-rect answer and their explanations. The distribu-tion of explanation sentences per QA in the train-ing and development datasets is depicted in Fig. 2.

This dataset of explanations for elementary sci-ence QA was originally released as the WorldTreecorpus (Jansen et al., 2018).

2.2.1 The Explanations TablestoreThe task corpus separately comprised a tablestoreof manually authored 4,789 candidate explanationsentences. Explanations for the QA instances wereobtained from one or more tables in the tablestore.

Total unique explanation sentences: 4,789Seen in training data: 2,694Seen in development data: 964Seen in training and development data: 589

The tablestore comprised 62 separate tableseach containing explanation sentences around aparticular relation predicate such as “kind of”(e.g., an acorn is a kind of seed), “part of” (e.g.,bark is a part of a tree), “cause” (e.g., drought maycause wildfires), etc., and a number of tables spec-ified around specific properties such as “actions”of organisms (e.g., some adult animals lay eggs),the “properties of things” (e.g., an acid is acidic),or “if-then” conditions (e.g., when an animal shedsits fur, its fur becomes less dense). Table 2 listsprominent explanation table types used in at least1% of the training and development explanations.

91

KINDOF 25.22SYNONYMY 14.27ACTION 6.48IF-THEN 5.31CAUSE 4.17USEDFOR 4.17PROPERTIES-THINGS 3.58

REQUIRES 2.87PARTOF 2.74COUPLEDRELATIONSHIP 2.67SOURCEOF 1.89CONTAINS 1.79AFFECT 1.73MADEOF 1.69

ATTRIBUTE-VALUE-RANGE 1.53CHANGE 1.53CHANGE-VEC 1.43EXAMPLES 1.43PROPERTIES-GENERIC 1.21TRANSFER 1.11AFFORDANCES 1.08

Table 2: Explanation table types (21 of 63 in total) sorted by the proportion of their occurrence for their respectivesentences participating in at least 1% of the training and development set QA explanations.

3 Background and Related Work

Elementary Science QA requiring diverse textrepresentations. In a study conducted on theNew York Regents standardized test QA, Clarket al. (2013) identified at least five QA cate-gories in elementary science, viz. taxonomic ques-tions, definition-based questions, questions basedon properties of things (e.g., parts of an object),and questions needing several steps of inference.With the Aristo system (Clark et al., 2016), theydemonstrated that a system operating at differentlevels of textual representation and reasoning wasneeded to address diverse QA types since it sub-stantially outperformed a singular information re-trieval approach.

From these prior insights about the benefit ofa heterogeneous system on a dataset with diverseQA types, our approach follows suit in using a setof features over diverse representations of the QAand its explanation.

Commonsense Knowledge for Explanations.The relevance of commonsense knowledge in

reasoning task settings was demonstrated by tworecent systems (Paul and Frank, 2019; Bauer et al.,2018). Paul and Frank (2019), in a sentimentanalysis task, specifically devise a novel methodto extract, rank, filter and select multi-hop rela-tion paths from ConceptNet (Liu and Singh, 2004)to interpret the expression of sentiment in termsof their underlying human needs, thereby obtain-ing boosted task performance for predicting hu-man needs. Bauer et al. (2018) for a narrative QAtask, that required the model to reason, gather, andsynthesize disjoint pieces of information withinthe context to generate an answer, employed Con-ceptNet for multi-step reasoning where they con-structed paths starting from concepts appearing inthe question to concepts appearing in the context,aiming to emulate multi-hop reasoning.

Relatedly, we employ commonsense knowledgeby tracing commonalities between conceptual cat-egories of QA and explanation sentence words.

Explanations for Elementary Science QA.One of the first attempts creating justifications foranswers to elementary science exam questions wasby Jansen et al. (2017) which jointly addressed an-swer extraction and justification creation. Sincein answering science exam questions, many ques-tions require inferences from external knowledgesources, they return the sentences traversed in in-ferring the correct answer as a result. After iden-tifying the question’s focus words, they generatejustifications by aggregating multiple sentencesfrom a number of textual knowledge bases (e.g.,study guides, science dictionaries) that preferen-tially (i.e. based on a number of measures de-signed to assess how well-integrated, relevant, andon-topic a given justification is) connect sentencestogether on the focus words, selecting the answercorresponding to the highest-ranked justification.By this method, they obtained a boost in QA per-formance and, further, the inference mechanism asan additional result justifying the QA process.

4 Our Approach

Unlike Jansen et al. (2017) who jointly performanswer extraction and answer explanation infer-ence, our approach only addresses the task of an-swer explanation inference assuming we are giventhe question, its correct answer, and a knowledgebase of candidate explanation sentences. 3

In the methodology for the manual authoring ofexplanations to create the explanation tablestorefor elementary science QA, it was followed thatan explanation sentence:

• overlaps lexically with the question or an-swer, or overlaps lexically with other expla-nation sentences to the QA which we call theoverlap criteria; and

3The MIER-19 shared task does not evaluate selecting thecorrect answer, hence the choice was up to the participantswhether to model an approach assuming the correct answeror to perform explanation extraction as a function of correctanswer selection.

92

• the sequence of explanation sentences form alogically coherent discourse fragment whichwe call the coherency criteria.

We model both criteria within a pairwiselearning-to-rank (LTR) approach.

Let (q, a, e) be a triplet consisting of a questionq, its correct answer a, and a candidate explanationsentence e that is a valid or invalid candidate fromthe given explanations tablestore.

4.1 Features for Learning-to-Rank

First, given a (q, a, e) triplet, we implement theoverlap criteria by invoking a selected set of fea-ture functions targeting lexical overlap betweenthe triplet elements. For this, each triplet is en-riched lexically by lemmatization and affixation toensure matches with word variant forms. How-ever, more often than not, a QA lexically matcheswith irrelevant candidate explanation sentences(consider the latter explanation sentences in theexample in Table 1) resulting in semantic drift.Therefore, we hypothesize that the semantic driftcan be controlled to some extent with matches atdifferent levels of grammatical and semantic ab-straction of the q, a, and e units, which we alsoencode as features.

Specifically, to compute the features, each q,a, and e unit are represented as bags of: words,lemmas, OpenIE (Angeli et al., 2015) relationtriples, concepts from ConceptNet (Liu and Singh,2004), ConceptNet relation triples, Wiktionarycategories, Wiktionary content search matchedpage titles, and Framenet (Fillmore, 1976) pred-icates and arguments. 4 These representations re-sulted in 76 feature categories shown in Table 3which are used to generate (q, a, e) triplet instanceone-hot encoded feature vectors.

Second, given as input the (q, a, e) triplet fea-ture vectors, we model the criteria of valid ver-sus invalid explanation sentences and the prece-dence between explanation sentences, i.e. the co-herency criteria, within the supervised pairwiseLTR framework.

4.2 Pairwise Learning-to-Rank

Pairwise LTR methods are designed to handleranking tasks as a binary classification problem for

4OpenIE relations and FrameNet structures are extractedonly for q and e since they need to be computed on sentencesand the answers a are often expressed as phrases.

pairs of resources by modeling instance ranks asrelative pairwise classification decisions.

We employ SVM rank (Joachims, 2006) asour pairwise LTR algorithm, which after trans-forming the ranking problem into a set of bi-nary classification tasks address the classifica-tion through the formalism of Support VectorMachines (SVM). Ranking SVMs in a non-factoid QA ranking problem formulation haveshowed similar performances to a Neural Percep-tron Ranking model (Surdeanu et al., 2011).

4.2.1 Training our MIER-19 Task Model

Next, we describe how an LTR model can betrained using a (q, a, e) triplet feature vector com-puted according to the 76 feature categories shownin Table 3.

The ranker aims to impose a ranking on thecandidate explanation sentences for each QA inthe test set, so that (1) the correct explanationsentences are ranked higher than the incorrectones and (2) the correct explanation sentences areranked in order of their precedence w.r.t. eachother. In LTR, this is modeled as an ordered pair(xqi,ai,ej , xqi,ai,ek ), where xqi,ai,ej is a feature vec-tor generated between a QA (qi, ai) and a correctcandidate explanation sentence ej , and xqi,ai,ekis a feature vector generated between (qi, ai) andan incorrect candidate explanation sentence ek.In addition, another kind of training instance inour dataset can occur between correct explanationsentences as an ordered pair (xqi,ai,ej , xqi,ai,em),where ej logically precedes em in the explana-tion sentence sequence. The goal of the ranker-learning algorithm, then, is to acquire a ranker thatminimizes the number of violations of pairwiserankings provided in the training set.

The ordered pairwise instances are createdabove based on the labels assigned to each traininginstance. One detail we left out earlier when dis-cussing our (q, a, e) triplet features for LTR, wasthat each triplet is also assigned a label indicat-ing a graded relevance between the QA and thecandidate explanation sentence. This is done asfollows. For each (q, a, e) triplet instance, if e isin the sequence of correct explanation sentences,then it is labeled in a descending rank order start-ing at ‘rank=number of explanation sentences+1’for the first sentence and ending at ‘rank= 2’ forthe last one in the sequence, otherwise, ‘rank= 1’for all incorrect explanation sentences.

93

1. Lexical (31 feature categories)

1. lemmas of q/a/e

2. lemmas shared by q and e, a and e, and q, a and e

3. 5-gram, 4-gram, and 3-gram prefixes and suffixes of q/a/e

4. 5-gram, 4-gram, and 3-gram prefixes and suffixes shared by q, a, and e

5. e’s table type from the provided annotated tablestore data

2. Grammatical (11 feature categories)

1. using OpenIE (Angeli et al., 2015) extracted relation triples from q, a, and e sentences, thefeatures are: the q/a/e lemmas in the relation subject role, shared q, a and e subject lemmas,q/a/e lemmas in the relation object role, shared q, a and e object lemmas, and q/a/e lemmaas the relation predicate

3. Semantic (34 feature categories)

1. top 50 conceptualizations of q/a/e words obtained from ConceptNet (Liu and Singh, 2004)

2. top 50 ConceptNet conceptualizations shared by q and e, a and e, and q, a and e words

3. words related to q/a/e words by any ConceptNet relation such as FormOf, IsA, HasContext,etc.

4. words in common related to q, a, and e words

5. Wiktionary5 categories of q/a/e words

6. Wiktionary categories shared by q, a, and e words

7. Wiktionary page titles for content matched with q/a/e words

8. Wiktionary page titles for content matched with q, a, and e words in common

9. FrameNet v1.7 frames and their frame-elements (Fillmore, 1976) using open-SESAME (Swayamdipta et al., 2017) were extracted from q and e sentences

Table 3: 76 feature categories for explanation ranking. Each training instance corresponds to a triplet (q, a, e),where q, a, and e are bags of question, answer, and explanation words/lemmas, respectively, with stopwordsfiltered out, where the data was sentence split and tokenized using the Stanford Parser 6.

4.3 Rules Solver

The application of the LTR system on develop-ment data revealed 11 classes of errors that wecall obvious error categories in the sense thatthey could be easily rectified via a set of logicalif − then − else chained rules where the out-put of one rule is the input of the next. We hy-pothesize that a rule-based approach can comple-ment a purely learning-based approach, since a hu-

man could alternatively encode the commonsenseknowledge that may not be accessible to a learn-ing algorithm given our features set. This inclu-sion of rules as a post-processing step resulted inour hybrid learning-based and rule-based systemto MIER-19 explanation sentence ranking.

We list four rules from our complete set of 11rules as examples next. 7

7We list all the rules in Appendix A.

94

E.g. Rule 1: Match with uni- or bigram an-swers.

if answer is a unigram or bigram thenrerank all explanation sentences containingthe answer to the top

end ifE.g. Rule 2: Match with named entities.

if explanation sentence contains named entitiesidentified by [A-Z][a-z]+( [A-Z][a-z]+)+ then

rerank the explanation sentence to the bottomif neither the question or answer contain theexplanation’s named entities

end ifE.g. Rule 3: Rerank explanation sentences withother color words than the answer

if an answer contains a color word thenrerank all explanation sentences about othercolors in the form “[other color] is a kind ofcolor” to the bottom of the list

end ifE.g. Rule 4: Rerank based on gerund or par-ticiple answer words

if answer contains gerund or participle words,i.e. “ing” words then

rerank all explanation sentences from theSYNONYMY table type containing gerundor participle words other than the answer“ing” word to the bottom of the list

end if

4.4 Testing the Hybrid System

The trained LTR model and the rules were thenapplied on the QA instances released as the testdataset. From test data, (q, a, e) triplets were cre-ated in the same manner as the development datawhere each test QA is given all 4,789 candidateexplanation sentences for ranking. Unlike devel-opment data, however, in testing the valid expla-nation sentences are unknown.

5 Evaluation

In this section, we evaluate our hybrid approach toexplanation sentence ranking for elementary sci-ence QA.

5.1 Experimental Setup

Dataset. We used the 1,190 and 264 elementaryscience QA pairs released as the MIER-19 chal-lenge training and development data, respectively,for developing our system. For testing, we used

the 1,247 QA instances released in the evaluationphase of the MIER-19 challenge. For explanationcandidate sentences, we used the tablestore of the4,789 sentences which remained the same in thecourse of the challenge.

Evaluation Metrics. Evaluation results are ob-tained using the official MHIER-19 challengescoring program. Results are expressed in termsof mAP computed by the following formula.

mAP =1

N

N∑

n=1

APn

whereN is the number of QA instances andAPis the average precision for a QA computed as fol-lows.

AP@k =1

GTP

k∑

i=1

TPseen@i

i

where AP@k, i.e. average precision at k, isthe standard formula used in information retrievaltasks. Given the MHIER-19 challenge data, foreach QA, GTP is the total ground truth explana-tion sentences, k is the total number of explana-tion sentences in the tablestore (i.e. 4,789), andTPseen@i are the total ground truth explanationsentences seen until rank i.

By the above metric, our results are evaluatedonly for the correct explanation sentences returnedas top-ranked, without considering their order.

Parameter Tuning. To optimize ranker perfor-mance, we tune the regularization parameter C(which establishes the balance between generaliz-ing and overfitting the ranker model to the trainingdata). However, we noticed that a ranker trainedon all provided explanation sentences is not ableto learn a meaningful discriminative model at allowing to the large bias in the negative examplesoutweighing the positive examples (consider thatvalid explanation sentences range between 1 to 21whereas there are 4,789 available candidate expla-nation sentences in the tablestore). To overcomethe class imbalance, we tune an additional param-eter: the number of negative explanation sentencesfor training. Every QA training instance is as-signed 1000 randomly selected negative explana-tion sentences. We then test tuning the numberof negative training data explanation sentences torange between 500 to 1,000 in increments of 100.

Both the cost factor and the number of nega-tive explanation sentences are tuned to maximizeperformance on development data. Note, however,that our development data is created to emulate the

95

Dev MAP Test MAP

SVMrank 37.1 34.1+Rules 44.4 39.4

Table 4: Mean Average Precision (mAP ) percentagescores for Elementary Science QA explanation sen-tence ranking from only pairwise LTR (row 1) and as ahybrid system with rules (row 2) on development andtest datasets, respectively.

testing scenario. So every QA instance during de-velopment is given all 4,789 candidate explanationsentences to obtain results for the ranking task.8

Our best LTR model when evaluated on devel-opment data was obtained with C = 0.9 and 700negative training instances.

5.2 Results and DiscussionTable 4 shows the elementary science QA expla-nation sentence ranking results from the officialMIER-19 scoring program in terms of mAP . Thefirst row corresponds to results from the feature-rich SVMrank LTR system and the second rowshows the reranked results using rules. Whileadding the rules gives us close to a 7 and 5 pointsimprovement on development and test sets, re-spectively, the LTR system results are nonethelesssignificantly better than an information retrievalTF-IDF baseline which gives 24.5% and 24.8%mAP on development and test data. Addition-ally, Figure 2 shows the impact of the features-only LTR system versus the features with ruleshybrid system on different length explanation sen-tences up to 12.9 It illustrates that the longer ex-planations are indeed harder to handle by both ap-proaches and on both development and test data.

To provide further insights on the impact ofadding different feature groups in our LTR system,we show ablation results in Table 5. We discussthe maximum impact feature groups (viz. affix,concepts, and relations) with examples, next, todemonstrate why they work.

Compared to all other features, adding affixes inthe LTR system resulted in the maximum perfor-mance gain of 6 points on the development data.In general, affixation enables lexical matches withvariant word forms, which for us, facilitated bet-

8For parameter tuning, C is chosen from the set 0.1, 0.9,1,10,50,100,500,800 and the number of negative training in-stances is chosen from the set 500,600,700,800,900,1000.

9We only consider explanation length up to 12 for thecomparison since the longer explanations are underrepre-sented in the data with up to 3 QA instances.

0

0.2

0.4

0.6

0.8

1

1.2

0 2 4 6 8 10 12 14

MA

P

# Explanation Sentences

Dev Features

Dev Features+Rules

Test Features

Test Features+Rules

Figure 2: Percentage mAP of the ‘Features’ ver-sus ‘Features+Rules’ systems on Development data (inblue) and Test data (in red), respectively, on differentlength explanation sentences.

Feature Type Dev MAP Test MAP1 lemma 28.40 24.852 +tablestore 28.05 24.953 +affix 34.01 31.014 +concepts 35.89 32.945 +relations (openIE &

conceptNet)36.79 33.69

6 +Wiktionary 37.14 33.657 +framenet 37.12 34.14

Table 5: Ablation results of the LTR SVMrank systemin terms of percentagemAP with feature groups (fromseven feature types considered) incrementally added.

ter matches between QA and candidate explana-tion sentences. Consider the following exampleshowing the top 3 returned explanation sentences.

Question What happens when the Sun’s energy warmsocean water?Answer The water evaporates.Before[not gold] an ocean is a kind of body of water[not gold] temperature or heat energy is a property of ob-jects or weather and includes ordered values of cold orcool or warm or hot[not gold] coral lives in the ocean or warm waterAfter[not gold] an ocean is a kind of body of water[rp 2 & rg 1] boiling or evaporation means change froma liquid into a gas by adding heat energy[rp 3 & rg 6] the sun transfers solar energy or light energyor heat energy from itself to the planets or Earth throughsunlight

Before affixation, none of the valid explanationsentences are retrieved among the top 3. After af-fixation, however, two valid explantion sentencesget reranked among the top 3 10 owing to enabledmatches with the QA words “Sun’s” and “evapo-rates” based on their trigram prefixes.

As hypothesized earlier, we found Concept-Net’s semantic knowledge preventing semantic

10rp and rg stand for predicted rank and gold rank, respec-tively.

96

drift in several instances. This is illustrated in theexample below.

Question In which part of a tree does photosynthesismost likely take place?Answer leavesBefore[rp 1 & rg 1] a leaf performs photosynthesis or gas ex-change[rp 5 & rg 2] a leaf is a part of a green plant[rp 10 & rg 3] a tree is a kind of plantAfter[rp 1 & rg 1] a leaf performs photosynthesis or gas ex-change[rp 3 & rg 2] a leaf is a part of a green plant[rp 7 & rg 3] a tree is a kind of plant

In the example, with additional knowledge suchas that “plant” has a ConceptNet conceptual class“photosynthetic organism” enables higher rerank-ing for the second and third explanation sentencessince one of the focus concepts in the question isphotosynthesis.

We find the ConceptNet relations as features en-able making connections between the question andanswer. These connections enable a more accuratereranking of those explanation sentences that relyon information from both the question and the cor-rect answer closer to the top. Consider the follow-ing example.

Question Cows are farm animals that eat only plants.Which of these kinds of living things is a cow?Answer HerbivoreBefore[rp 3 & rg 6] an animal is a kind of living thing[rp 5 & rg 1] herbivores only eat plantsAfter[rp 2 & rg 1] herbivores only eat plants[rp 5 & rg 6] an animal is a kind of living thing

For the above example, from ConceptNet weobtain the lexical relations “herbivore IsA ani-mal”, “cow RelatedTo animal”, and “an animalDesires eat” which lexically links the explanationsentence with the question and the correct answer.We attribute the application of such relations asthe reason for the correct reranking of the two sen-tences in terms of precedence and their proximityto the gold rank.

5.2.1 Negative ResultsApart from the features depicted above, we alsoconsidered WordNet (Miller, 1998) for additionallexical expansion to facilitate matches for the(q, a, e) triplets based on linguistic relations suchas synonymy, hypernymy, etc., but did not ob-tain improved system performance. Further, fea-tures computed from the word embeddings, viz.

Word2vec (Mikolov et al., 2013), Glove (Pen-nington et al., 2014), and ConceptNet Number-batch (Speer et al., 2017), as averaged vectors alsodid not improve our model scores.

Finally, while the hybrid system via the rerank-ing rules addresses lexical ordering between can-didate explanation sentences, they still cannothandle eliminating intermediate explanation sen-tences that may not be semantically meaningful tothe QA. This is illustrated in the representative ex-ample below.11

Question Jeannie put her soccer ball on the ground onthe side of a hill. What force acted on the soccer ball tomake it roll down the hill?Answer gravity[rf 3 & rh 1] gravity is a kind of force[rf 21 & rh 3] gravity or gravitational force causes ob-jects that have mass or substances to be pulled down or tofall on a planet[rf 4 & rh 14] a ball is a kind of object[rf 7 & rh 17] to cause means to make

The example shows that the reranked resultfrom the hybrid system follows the order in thegold data, however, not consecutively. Sentencesextraneous to the explanation such as “the groundis at the bottom of an area”, “a softball is a kindof ball”, etc., are still in between gold explanationsentences in the reranked results.

6 Conclusions

We employed a hybrid approach to explanationsentence ranking for Elementary Science QA con-sisting of a feature-rich LTR system followed by aseries of 11 rules. When evaluated on the MIER-19 official test data, our approach achieved anmAP of 39.4%.

An immediate extension to this system wouldbe to encode the dependencies between explana-tion sentences as features. While the pairwise LTRmodel tackles this dependency to some extent, wehypothesize that explicit modeling of features be-tween explanation sentences should produce sig-nificantly improved scores.

Acknowledgments

We would like to thank the MIER-19 organizersfor creating the corpus and organizing the sharedtask. We would also like to thank the anonymousreviewers for their helpful suggestions and com-ments.

11rf and rh stand for ranking by features and the hybridsystem, respectively.

97

ReferencesGabor Angeli, Melvin Jose Johnson Premkumar, and

Christopher D. Manning. 2015. Leveraging linguis-tic structure for open domain information extraction.In ACL.

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018.Commonsense for generative multi-hop question an-swering tasks. In Proceedings of the 2018 Con-ference on Empirical Methods in Natural LanguageProcessing, pages 4220–4230.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,Ashish Sabharwal, Carissa Schoenick, and OyvindTafjord. 2018. Think you have solved question an-swering? try arc, the ai2 reasoning challenge. ArXiv,abs/1803.05457.

Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sab-harwal, Oyvind Tafjord, Peter Turney, and DanielKhashabi. 2016. Combining retrieval, statistics, andinference to answer elementary science questions.In Thirtieth AAAI Conference on Artificial Intelli-gence.

Peter Clark, Philip Harrison, and Niranjan Balasubra-manian. 2013. A study of the knowledge base re-quirements for passing an elementary science test.In Proceedings of the 2013 workshop on Automatedknowledge base construction, pages 37–42. ACM.

Charles J Fillmore. 1976. Frame semantics and the na-ture of language. Annals of the New York Academyof Sciences, 280(1):20–32.

Peter Jansen, Rebecca Sharp, Mihai Surdeanu, and Pe-ter Clark. 2017. Framing qa as building and rankingintersentence answer justifications. ComputationalLinguistics, 43(2):407–449.

Peter Jansen and Dmitry Ustalov. 2019. TextGraphs2019 Shared Task on Multi-Hop Inference for Ex-planation Regeneration. In Proceedings of the Thir-teenth Workshop on Graph-Based Methods for Nat-ural Language Processing (TextGraphs-13), HongKong. Association for Computational Linguistics.

Peter Jansen, Elizabeth Wainwright, Steven Mar-morstein, and Clayton T. Morrison. 2018.Worldtree: A corpus of explanation graphs forelementary science questions supporting multi-hopinference. In Proceedings of the 11th InternationalConference on Language Resources and Evaluation(LREC).

Thorsten Joachims. 2006. Training linear svms in lin-ear time. In Proceedings of the 12th ACM SIGKDDinternational conference on Knowledge discoveryand data mining, pages 217–226. ACM.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018.Scitail: A textual entailment dataset from sciencequestion answering. In AAAI.

Hugo Liu and Push Singh. 2004. Conceptneta practi-cal commonsense reasoning tool-kit. BT technologyjournal, 22(4):211–226.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jef-frey Dean. 2013. Efficient estimation of wordrepresentations in vector space. arXiv preprintarXiv:1301.3781.

George A Miller. 1998. WordNet: An electronic lexicaldatabase. MIT press.

Ankur P. Parikh, Oscar Tackstrom, Dipanjan Das, andJakob Uszkoreit. 2016. A decomposable attentionmodel for natural language inference. In EMNLP.

Debjit Paul and Anette Frank. 2019. Ranking and se-lecting multi-hop knowledge paths to better predicthuman needs. In Proceedings of the 2019 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics: Human Lan-guage Technologies, Volume 1 (Long and Short Pa-pers), pages 3671–3681.


Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi,and Hannaneh Hajishirzi. 2016. Bidirectional at-tention flow for machine comprehension. ArXiv,abs/1611.01603.

Robyn Speer, Joshua Chin, and Catherine Havasi.2017. ConceptNet 5.5: An open multilingual graphof general knowledge. pages 4444–4451.

Mihai Surdeanu, Massimiliano Ciaramita, and HugoZaragoza. 2011. Learning to rank answers to non-factoid questions from web collections. Computa-tional linguistics, 37(2):351–383.

Swabha Swayamdipta, Sam Thomson, Chris Dyer, andNoah A. Smith. 2017. Frame-Semantic Parsing withSoftmax-Margin Segmental RNNs and a SyntacticScaffold. arXiv preprint arXiv:1706.09528.

A Appendix

Rule 1: Match with uni- or bigram answers.if answer is a unigram or bigram then

rerank all explanation sentences containingthe answer to the top

end ifRule 2: Match with named entities.

if explanation sentence contains named entitiesidentified by [A-Z][a-z]+( [A-Z][a-z]+)+ then

rerank the explanation sentence to the bottomif neither the question or answer contain theexplanation’s named entities

98

end ifRule 3: Match with energy insulator common-sense knowledge.

if explanation contains wax or rubber or brickor down feathers as “energy insulator” then

rerank the explanation to the bottom if nei-ther the question or answer contain “energyinsulator”

end ifRule 4: Match with “increase” or “decrease”commonsense knowledge.

if explanation contains “increase” or “decrease”then

rerank the explanation to the bottom if neitherthe question or answer contains “increase” or“decrease”

end ifRule 5: Rerank explanation sentences with un-related entities to the QA.

By this rule, we identify pairs of entities thatare unrelated and rerank all explanation sentencescontaining an unrelated entity mention to the bot-tom of the list. For instance, if the QA is about“planets”, then all explanation sentences about“fern” can be reranked to the bottom as it unlikelyfor any discussion to exist that relates “planets”and the “fern” plant. Similarly, if a QA is about“puppies”, then all explanation sentences about“peas” can be reranked to the bottom.

To form pairs of unrelated entities, first, we cre-ate lists of living and nonliving entities using theKINDOF explanation table type (one among 61explanation tables with a few shown in Table 2),where sentences are of the pattern “[LHS] is akind of [RHS]”. These lists as created in a recur-sive manner. For instance, to create the list of liv-ing entities, we begin with all sentences where theRHS=“living thing” and add the LHS value to thelist. In the next step, we substitute in the RHS thenew found living entities extracted in the previousstep. Again, we add the new LHS values to the listof living entities. This process, i.e. substitutingnew entities in the RHS and extracting the entity inthe LHS, continues until the list of living entitiesno longer changes. As a simple example, given “aplant is a kind of living thing”, in step 1, we addplant to the list of living entities. In step 2, given“peas are a kind of plant”, we add peas to the listof living entities. The lists for non-living entitiesare created in a similar manner, with the startingpattern using RHS=“nonliving thing”.

Once weve obtained these lists, we obtain allpairwise combinations of living, non-living, andliving and non-living entities. We identify thepairs of unrelated entities by filtering out all pairsof entities that have appeared in training and de-velopment data. We also manually filter out ad-ditional entities that we recognize could be re-lated. Using this list, we rerank the explanationsentences as follows.

if question or answer contain an entity in the listof unrelated entity pairs then

rerank all explanation sentences containingthe second element of the pair to the bottomof the list

end ifRule 6: Singular form matched in explanationsentences with plural unigram answers.

if a unigram answer is in plural thenrerank all explanation sentences containingits singular form to the top of the list follow-ing explanation sentences containing the ex-act plural match

end ifRule 7: Rerank explanation sentences withother color words than the answer.

if an answer contains a color word thenrerank all explanation sentences about othercolors in the form “[other color] is a kind ofcolor” to the bottom of the list

end ifRule 8: Rerank KINDOF explanation sen-tences with entities not present in the QA.

if QA does not contain generic living entitytypes such as “plants”, “animal”, “organism” or“human” then

rerank all KINDOF explanation sentences tothe bottom relating entities not expressed inthe either the question or answer

end ifRule 9: Rerank explanations from table typesnot considered in training and developmentdata based on six QA types and overall.

We identify six QA types: “Which”, “When”,and “What” questions; questions beginning with“Some”; questions beginning with indefinite arti-cle “A”; and those beginning with definite article“The”. For each of these six QA types, we iden-tify the table types that are never used to gener-ate explanations in training and development data.Additionally, we identify table types never used to

99

generate explanations in the training and develop-ment set overall.

if QA is in one of the six types or overall thenrerank all explanation sentences from thenever used table types for the particular typeof QA to the bottom of the list

end ifRule 10: Match based on alternative sense ofthe word “makes”

if QA contains the word “makes” (e.g., “Whatmakes up most of a human skeleton?”) then

rerank all explanation sentences from theSYNONYMY table type of “make” with al-ternative word senses to the bottom of thelist (e.g., “to make something easier meansto help”)

end ifRule 11: Rerank based on gerund or participleanswer words.

if answer contains gerund or participle words,i.e. “ing” words then

rerank all explanation sentences from theSYNONYMY table type containing gerundor participle words other than the answer“ing” word to the bottom of the list

end if

100


Chains-of-Reasoning at TextGraphs 2019 Shared Task: Reasoning overChains of Facts for Explainable Multi-hop Inference

Ameya Godbole1∗, Rajarshi Das1∗,Manzil Zaheer2, Shehzaad Dhuliawala3, Andrew McCallum1

1University of Massachusetts, Amherst,2Google Research, 3Microsoft Research, Montrealagodbole, rajarshi, [email protected]

[email protected], [email protected]

Abstract

This paper describes our submission to theshared task1 on “Multi-hop Inference Expla-nation Regeneration” in TextGraphs workshopat EMNLP 2019 (Jansen and Ustalov, 2019).Our system identifies chains of facts relevantto explain an answer to an elementary scienceexamination question. To counter the prob-lem of ‘spurious chains’ leading to ‘semanticdrifts’, we train a ranker that uses contextual-ized representation of facts to score its rele-vance for explaining an answer to a question.Our system2 was ranked first w.r.t the mean av-erage precision (MAP) metric outperformingthe second best system by 14.95 points.

1 Introduction

Machine reading comprehension (MRC), the abil-ity of computing systems to read and understandtext, has been a long-standing goal of natural lan-guage understanding. Question answering (QA)provides a natural way of testing a models capa-bility to comprehend text — by probing it withqueries about information present in the text. Re-search in MRC has seen rapid progress in recentyears, with systems matching or out-performinghuman performance in many QA datasets (Chenet al., 2016; Devlin et al., 2018; Yang et al., 2019;Liu et al., 2019). However, a series of recent workhas demonstrated the brittleness of these systemsshowing that they perform shallow pattern match-ing (Jia and Liang, 2017; Kaushik and Lipton,2018; Chen and Durrett, 2019).

QA models that can provide explanations be-hind its answers can give us better insights intothis brittleness. More importantly, it has also beenshown that providing explanations for an an-swer also increases how much a user trusts the

1https://github.com/umanlp/tg2019task2https://github.com/ameyagodbole/multihop_

inference_explanation_regeneration

A balance is used for measuring mass;weight of an object; of a substance

A graduated cylinder is used to measure volume of a liquid; of an object

A graduated cylinder is a kind of instrument for measuring volume of liquids or objects

Comparing requires measuring

Marble is a kind of object; material

A student wants to compare the masses and volumes of three marbles. Which two instruments should be used? Answer: Balance and graduated cylinder

Figure 1: A subgraph of facts from the WorldTree cor-pus that explains the answer to the question.

model (Herlocker et al., 2000; Dzindolet et al.,2003; Ribeiro et al., 2016). Therefore, as more ofthese models are deployed into real-world appli-cations, providing explanations should be a neces-sary component of every QA system.

In this paper, we describe the system that wesubmitted to the shared task on “Multi-hop In-ference Explanation Regeneration”. Our modelscores chains of facts (sentences) relevant to ex-plain an answer to an elementary science examquestion. Given the question and its answer, weuse a standard information retrieval (IR) systemto retrieve a set of starting facts from the givencorpus. We adopt the “explanation graphs” rep-resentation proposed by Jansen et al. (2018), inwhich sentences can be viewed as nodes in a graphand two nodes are connected if they have a lexicaloverlap between them. Therefore, finding the ex-planation for a question can be viewed as findingthe ‘relevant subgraph’ within the bigger graph offacts and the explanation corresponds to the nodes(facts) comprising the subgraph. For example, fig-ure 1 shows a subgraph of facts explaining the an-swer to the given question.

We decompose the problem of finding the sub-graph into finding the chain of relevant nodes thatare important to answering the question and then

101

A student wants to compare the masses and volumes of three marbles. Which two instruments should be used? (A) Balance and graduated cylinder (B) Centimeter ruler and thermometer (C) Graduated cylinder and centimeter ruler (D) Thermometer and balance

Question

WorldTree corpus

TF-IDF- centimeter is a unit of measurement- temperature is a measure of heat energy- tape measure is a kind of tool…...…….and 41 others

- eating food that contains pesticides can have a negative impact on humans- a human can pedal a bicycle- humans discarding waste in an environment causes harm to that environment…….and 92 others

- state of matter has no impact on mass- gravitational forces causes objects that have mass to be pulled down- gram is a kind of unit for measuring mass….and 35 others

- each of the moon’s phases usually occurs once per month- butterfly can live for one month- months are a unit for measuring time…...and 8 others

a. Get top-K sentences via TF-IDF

b. Get all outgoing edges from the retrieved sentences in the graph. Each sentence is connected to all the facts in the corresponding boxes (via lexical overlap).

a graduated cylinder is used to measure volume of a liquid; of an object

a rain gauge is a kind of graduated cylinder

a student is a kind of human

a balance is used for measuring mass;weight of an object; of a substance

a new season occurs once per three months

…… and K-5 more initial set of sentences

Outgoing facts

Figure 2: Overview of our approach. Given a question and its answer, we find a set of initial relevant facts via asimple TF-IDF retriever. Next, from each of this relevant fact, we look at other outgoing facts (shown in the coloredboxes). Note that, there are a lot of outgoing facts and most of them are spurious w.r.t the given question. Usingthe annotations provided, we train a supervised ranker to identify the relevant chains needed to explain an answer.

combining the top-ranked chains to reconstruct theexplanation subgraph. Starting with the initial setof nodes (facts) returned by the IR system, we lookat all the facts that can be hopped from them. How-ever as can be seen from figure 2, most of thesechains of facts are not relevant for answering thequestion and are hence ‘spurious’ in nature. Forexample, combining the two facts (that are con-nected via lexical overlap) “a graduated cylinderis used to measure volume of a liquid” and “cen-timeter is a unit of measurement” does not providea valid explanation for the question. Models, at-tempting to do reasoning over such spurious chainof facts often fall into the phenomenon of ‘se-mantic drift’ (Fried et al., 2015). To counter suchspurious chains, we use the annotations providedin the WorldTree corpus (Jansen et al., 2018) totrain a re-ranker that scores whether a pair of factsis relevant to answer a question. Our re-rankerencodes a pair of facts with the question andobtains question-aware contextualized representa-tion from a pre-trained BERT (Devlin et al., 2018)language model. We also find that an even sim-pler model that scores each facts independently(instead of a chain) is also very competitive and acombination of both the models performs the best.Overall our simple technique achieves a score of56.25% MAP score outperforming the second bestentry by 14.95 points.

2 Task

Given a question and the correct answer, the taskis to find a set of sentences that explain the an-swer to the question. The task can easily be trans-formed into a ranking setup where the goal is torank the relevant facts over all other facts presentin the corpus. The evaluation metric used for thetask is the widely used and robust mean averageprecision (MAP) metric.Data: The data in the shared task comes from theWorldTree corpus that contains elementary sci-ence questions from the ARC corpus (Clark et al.,2018). A key feature of the WorldTree corpus isthat it contains detailed annotation stating whethera fact is a part of the explanation for each ques-tion. Following our prior graphical representation,these sentences form an ‘explanation subgraph’.Although, we do not use this, but the corpus alsocontains annotations about whether a fact is ‘cen-tral’, or provides ‘grounding’ or serves as a ‘lexi-cal glue’ for explaining the answer to the question.

3 Model

Our model consists of two major components —(a) a simple IR system that retrieves the initial setof evidence facts from the corpus and serves asa starting point for further exploration and (b) aBERT based ranker that scores a pair of facts w.r.ta question (and the correct answer). For our cur-

102

rent system, the IR component is a simple tf-idfbased retriever that takes in the concatenation ofquestion and the correct answer string and returnsthe top-k ranked facts in the corpus. Although wefind that this simple retriever is effective, our de-sign is agnostic to the choice of the retriever andcan be replaced with any sophisticated IR system.

The retrieved facts will often miss key factsthat are required for explaining an answer. Thiscould be because there is no lexical overlap be-tween the fact and the question, or because the rel-evant facts are just ranked low. For example, asshown in figure 2, two important explanatory factswere missed by the initial IR system (“Computingrequires measuring” (‘central’) and “Marble is akind of object; material” (‘grounding’)). However,we note that if we also consider the outgoing factsfrom all the retrieved sentences, then the recallof the system increases significantly. For exam-ple, on considering all the 1-hop neighbors fromthe top 25 initial facts retrieved by tf-idf, 94.48%of ground truth facts were covered for a question,on average3. This motivated us to consider reason-ing over a chain of facts together. Reasoning overchains of facts has also been shown to be an effec-tive technique for reasoning over knowledge bases(Lao et al., 2011; Neelakantan et al., 2015; Daset al., 2017, 2018, inter-alia).

In our graphical representation, edges are asa result of lexical overlap between sentences. Inmaking edges between facts, we ignore the stopwords and “filler” words that were added in the an-notation process. Despite this the resulting graphis very dense, and hence, each node is associatedwith many neighbors and not all of the facts areimportant for answering the question. These kindof spurious chain of facts is responsible for leadingto wrong inference and often goes out of contextw.r.t the question leading to semantic drifts (Friedet al., 2015). Analogous problems due to spuri-ous facts can be seen in learning semantic parserfrom denotations (Guu et al., 2017) and in rein-forcement learning (Sutton, 1984; Agarwal et al.,2019). The density of the graph also makes it in-tractable to use more than 1-hop chains and a bet-ter (sparser) graph representation will be neces-sary to make full use of the path ranker.

To counter the problem of spurious chains, weuse the annotations present in the WorldTree cor-pus. Specifically, a chain of facts is a positive train-

3For 75.66% of questions, all the ground truth facts werepresent in the set of 1-hop neighbors of top 25 tf-idf facts.

[CLS] a student wants to compare the masses and volumes of three marbles. which two instruments should be used? balance and graduated cylinder [SEP] a graduated cylinder is used to measure volume of a liquid; of an object. [SEP] centimeter is a unit of measurement

BE

RT E

ncoder

CLS

Linear (768X768)

Linear (768X2)

Softm

ax

Figure 3: Architecture of the fact encoder. The ques-tion, answer and the chain of facts are concatenated to-gether and fed to a BERT encoder to form query-awarecontextualized representation. The representation cor-responding to the [CLS] token is then fed to anotherfeed-forward network.

ing example, if all the facts present in the chain arerelevant to the question. For example, the chain(“A balance is used for measuring mass, weightof an object”, “Marble is kind of an object”) is apositive training example, since both the individ-ual facts are relevant for the question in figure 1.However, the chain (“A balance is used for mea-suring mass, weight of an object”, “A graduatedcylinder is a object”) is considered as a spuriouschain for the same question.

To encode the chain of facts, we use thepower of contextualized representation from apre-trained BERT language model (Devlin et al.,2018). As shown in figure 3, we form the query-dependent representation of a chain of facts byconcatenating the question, correct answer and thefacts in the chain. After encoding with a BERT

model, we use the representation corresponding tothe [CLS] token as the intermediate representationwhich is then fed to a 2-layer feed-forward net-work to output a score. We use a simple binarycross entropy loss to train the network.

Our model scores chain of facts and thereforewe need a way of propagating the scores to indi-vidual facts to create a ranked list. We follow thesimple strategy of assigning the same score to eachindividual fact in the chain. The same fact can alsooccur in multiple chains. In that case, a fact is as-signed the maximum score it gets from any chain.

We also tried a variant of our model that doesnot consider a chain of facts but treats and scoresfact independently. Such models have been foundto be effective for information retrieval (Nogueiraand Cho, 2019). In this model, we concatenate thequery and an individual fact and compute the rel-evance score for each fact in the corpus. Althoughsimple, a drawback of this model is that it will notscale to a large corpus of facts. However, sincethe WorldTree corpus contains only around 5000

103

facts, we were able to exhaustively evaluate it. Aswe will show in the next section, this model is ex-tremely competitive.

4 Experiments

For all experiments, the BERT encoder is initial-ized with the pre-trained BERT-BASE-UNCASED

model available in the Python package PYTORCH-TRANSFORMERS4. For training, learning rate wasinitialized to 3e−5, batch size of 60 and maximumsequence length of 90 (for batching). During eval-uation, we increased the sequence length to 140.The entire model was fine-tuned for 1 epoch. Weused Adam optimizer (Kingma and Ba, 2014) withlinear learning rate decay. For generating the pathsduring training, the top 25 facts obtained by tf-idfbased ranker were used. We used the implementa-tion of TfidfVectorizer present in SCIKIT-LEARN

(Pedregosa et al., 2011). For all our experiments,we consider chains up to length 2.

For evaluation, we report results using top 25and top 50 tf-idf ranked facts. Using 50 startingfacts improves the coverage of our method5 butthe this results in about 10,000+ chains per ques-tion making it difficult to scale. We concatenatethe correct answer choice with the question. Wedid this because we observed that the score of tf-idf based ranker is best when given only the cor-rect choice. This makes intuitive sense since theranker is distracted by the wrong choices and inmost examples, the remaining choices are not nec-essary to answer the question. Our approach doesnot guarantee ranking of all possible facts presentin the corpus. To obtain a ranking of all facts, weappended the facts that were missed in the orderthat they appear in the tf-idf ranking.

4.1 BaselinesWe report the tf-idf based ranking scores as base-lines. We experimented with three variants basedon which answer choices were present in thequery. We report the score of the ‘no choices’ vari-ant because in a realistic setting, the correct an-swer will not be present at test time.

We consider a variant of our model that ranksindividual facts directly rather than paths (BERT

Re-ranker in the Table 1). This is the model pro-posed in (Nogueira and Cho, 2019) but instead of

4https://github.com/huggingface/pytorch-transformers

5For 88% (75.66%) of questions, all the ground truth factswere present in the set of 1-hop neighbors from top 50 (top25) tf-idf facts.

Model MAP

TF-IDF (no choices) 0.1927TF-IDF (all choices) 0.2440TF-IDF (correct choice) 0.3012

BERT Re-ranker 0.5611

Path ranker (k=25) 0.5352Path ranker (k=50) 0.5529

Ensemble ranking (k=25) 0.5810+ Postprocessing 0.5827Ensemble ranking (k=50) 0.5829+ Postprocessing 0.5846

Table 1: MAP score on the development set. Perfor-mance of our model (Path ranker) and ensemble are re-ported for two variants; using top-25 and top-50 tf-idfretrieved facts as starting points

re-ranking a subset of facts, we use it to rank all thefacts. The Re-ranker was initialized and trained inthe same way as our model. We found that a modelfine-tuned for 3 epochs worked best. This is anextremely competitive model and outperforms ourPath-ranker model. As noted before, this is not ascalable approach and can not be applied to a largecorpus of facts.

We also perform a type of ensemble ranking.For every question, the ranking of Path rankeris used until the probability output of the modeldrops below 0.5. At this point, we append the rank-ing of the BERT Re-ranker. The intuition behindthis scheme is that we consider the ranking of thePath ranker only when it is confident (prob ≥ 0.5)and then fall back to the BERT re-ranker. Empiri-cally, as seen in table 1, this heuristic is very effec-tive. In addition, we perform a post processing stepto move duplicate facts to the end of the ranking6.This post-processing also gave a slight improve-ment 7.

4.2 Performance on the hidden test setOur submission during the test phase of the sharedtask corresponds to a constrained version of thefinal model. Also, the initial set of facts are Thetest set performance is reported in Table 2.

4.3 Error AnalysisWe perform extensive error analysis (in AppendixI) on the 23 questions in the development set

6There are few duplicate facts present in the dataset (withunique fact-ids)

7In the appendix, we refer to ranking generated by theensemble followed by post-processing as ‘Mix’ for brevity

104

Model/Participant MAP

Our model 0.5625

pbanerj6 0.4130redken 0.4017

Table 2: MAP score on the hidden test set.

where our model performed poorly (MAP≤ 0.25).We find that 2 of the 23 errors resulted due

to the pre-processing step of removing the wrongchoices. These questions required reasoning byeliminating answer candidates. In 5 of the 23 er-rors, (in our opinion) our model brings up alterna-tive sentences (to the top) that provide sufficientexplanation. In all other cases, when the modeloutputs a poor ranking, it usually outputs seman-tically similar facts but these facts are not neces-sary/helpful for completing the reasoning.

We would like to emphasize question 23 in Ap-pendix I which proves the need for scoring chainsrather than individual facts (“The snowshoe haresheds its fur twice a year. In the summer, the furof the hare is brown. In the winter, the fur is white.Which of these statements best explains the advan-tage of shedding fur? (A) Shedding fur keeps thehare clean. (B) Shedding fur helps the hare movequickly. (C) Shedding fur keeps the hares homewarm. (D) Shedding fur helps the hare blend intoits habitat”). In this question, understanding thatbrown fur is in fact an adaptation to hide frompredators and not meant for warmth is dependenton making a link that the hare lives in forestswhich have brown bark. Although our model per-forms better than tf-idf on this question, since thereasoning chain is longer than one hop, the perfor-mance is poor overall.

In appendix II, we also note examples wherehopping over multiple facts is necessary. BERT-re-ranker scores relevant facts low when they havelow lexical overlap with the question. However,our model uses chain of connected facts to iden-tify that there are relevant to the question.

5 Conclusion

We describe our entry to the shared task on‘Multi-hop Inference Explanation Regeneration’.We present a system that reasons over chains offacts to reconstruct a subgraph of facts that ex-plains an answer to the question. Our system is thewinning entry to the shared task outperforming thesecond best system by 14.95 points in MAP score.

Acknowledgements

This work is funded in part by the Center forData Science and the Center for Intelligent In-formation Retrieval, and in part by the NationalScience Foundation under Grant No. IIS-1514053and in part by the International Business Ma-chines Corporation Cognitive Horizons Networkagreement number W1668553 and in part by theChan Zuckerberg Initiative under the project Sci-entific Knowledge Base Construction. Any opin-ions, findings and conclusions or recommenda-tions expressed in this material are those of theauthors and do not necessarily reflect those of thesponsor.

ReferencesRishabh Agarwal, Chen Liang, Dale Schuurmans, and

Mohammad Norouzi. 2019. Learning to generalizefrom sparse and underspecified rewards. In ICML.

Danqi Chen, Jason Bolton, and Christopher D Man-ning. 2016. A thorough examination of thecnn/daily mail reading comprehension task. In ACL.

Jifan Chen and Greg Durrett. 2019. Understandingdataset design choices for multi-hop reasoning. InNAACL.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,Ashish Sabharwal, Carissa Schoenick, and OyvindTafjord. 2018. Think you have solved question an-swering? try arc, the ai2 reasoning challenge. arXivpreprint arXiv:1803.05457.

Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer,Luke Vilnis, Ishan Durugkar, Akshay Krishna-murthy, Alex Smola, and Andrew McCallum. 2018.Go for a walk and arrive at the answer: Reasoningover paths in knowledge bases using reinforcementlearning. In ICLR.

Rajarshi Das, Arvind Neelakantan, David Belanger,and Andrew McCallum. 2017. Chains of reasoningover entities, relations, and text using recurrent neu-ral networks. In EACL.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2018. Bert: Pre-training of deepbidirectional transformers for language understand-ing. In NAACL.

Mary T Dzindolet, Scott A Peterson, Regina A Pom-ranky, Linda G Pierce, and Hall P Beck. 2003. Therole of trust in automation reliance. Internationaljournal of human-computer studies.

Daniel Fried, Peter Jansen, Gustave Hahn-Powell, Mi-hai Surdeanu, and Peter Clark. 2015. Higher-order lexical semantic models for non-factoid an-swer reranking. TACL.

105

Kelvin Guu, Panupong Pasupat, Evan Zheran Liu,and Percy Liang. 2017. From language to pro-grams: Bridging reinforcement learning and maxi-mum marginal likelihood. In ACL.

Jonathan L Herlocker, Joseph A Konstan, and JohnRiedl. 2000. Explaining collaborative filtering rec-ommendations. In CSCW.

Peter Jansen and Dmitry Ustalov. 2019. TextGraphs2019 Shared Task on Multi-Hop Inference for Ex-planation Regeneration. In TextGraphs-13, EMNLP.

Peter A Jansen, Elizabeth Wainwright, Steven Mar-morstein, and Clayton T Morrison. 2018. Worldtree:A corpus of explanation graphs for elementary sci-ence questions supporting multi-hop inference. InLREC.

Robin Jia and Percy Liang. 2017. Adversarial exam-ples for evaluating reading comprehension systems.In EMNLP.

Divyansh Kaushik and Zachary C Lipton. 2018. Howmuch reading does reading comprehension require?a critical investigation of popular benchmarks. InEMNLP.

Diederik P Kingma and Jimmy Ba. 2014. Adam: Amethod for stochastic optimization. arXiv preprintarXiv:1412.6980.

Ni Lao, Tom Mitchell, and William Cohen. 2011. Ran-dom walk inference and learning in a large scaleknowledge base. In EMNLP.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,Luke Zettlemoyer, and Veselin Stoyanov. 2019.Roberta: A robustly optimized bert pretraining ap-proach. arXiv preprint arXiv:1907.11692.

Arvind Neelakantan, Benjamin Roth, and Andrew Mc-Callum. 2015. Compositional vector space modelsfor knowledge base completion. In ACL.

Rodrigo Nogueira and Kyunghyun Cho. 2019. Pas-sage re-ranking with bert. arXiv preprintarXiv:1901.04085.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-esnay. 2011. Scikit-learn: Machine learning inPython. Journal of Machine Learning Research,12:2825–2830.

Marco Tulio Ribeiro, Sameer Singh, and CarlosGuestrin. 2016. Why should i trust you?: Explainingthe predictions of any classifier. In KDD.

Richard Stuart Sutton. 1984. Temporal Credit Assign-ment in Reinforcement Learning. Ph.D. thesis, Uni-versity of Massachusetts Amherst. AAI8410337.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-bonell, Ruslan Salakhutdinov, and Quoc V Le.2019. Xlnet: Generalized autoregressive pretrain-ing for language understanding. arXiv preprintarXiv:1906.08237.

106

6 Appendix I: Model error analysis (AP ≤ 0.25)This section contains examples from the development set where the final model performs poorly.1 mercury sc 401652Question: Before large trees could grow on Earth, what had to happen first? (A) Rocks were eroded toform soil. (B) Molten rock warmed Earth’s interior. (C) Earth’s gravity accumulated to modern levels.(D) Volcanoes exploded to form mountaintop lakes.Correct Answer: AGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixsoil is formed by rocks eroding as CENTRAL 23 19 2 19a plant requires soil for survival; to grow CENTRAL 3 64 6 61a tree is a kind of plant GROUNDING 1501 10 32 10

MAP 0.1407 0.0840 0.3090 0.0840

Analysis:

• TF-IDF does better because the first relevant fact is ranked higher. However, the low TFIDF fact isranked much lower than in the learned models

• Reranker output consists of facts related to soil erosion, which although have similar context, dontanswer the question

• Path ranker by itself has a much better MAP but due to low confidence, its ranking is disregarded inthe mixture result. The top results are lexical glue sentences connected to the questions and mostlyrelated to erosion again

2 mercury sc 400058Question: The small stone plant has leaves that look like pebbles or stones. This characteristic helpsthe plant (A) attract insects for pollination. (B) produce a large number of seeds. (C) absorb water andnutrients. (D) avoid being eaten by animals.Correct Answer: DGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixlooking like is similar to camouflaging as LEXGLUE 135 32 552 30camouflage is a kind of adaptation for hiding in anenvironment

CENTRAL 4026 233 4358 229

helping something has a positive impact on thatsomething

LEXGLUE 317 1 257 2

an adaptation; an ability has a positive impact onan animal’s; living thing’s survival; health; abilityto reproduce

CENTRAL 1279 88 24 86

camouflage is a kind of protection against preda-tors; from predators; against consumers

CENTRAL 4027 67 4375 65

consumers eat other organisms CENTRAL 3866 1043 235 1032a plant is a kind of organism GROUNDING 23 57 2 55In the food chain process an animal has the role ofconsumer which eats producers;other animals forfood

CENTRAL 205 145 148 142

avoiding predators; escaping predators; avoidingconsumers is a kind of protection

CENTRAL 4100 276 4137 272

an adaptation is a kind of characteristic GROUNDING 110 115 1723 112An example of camouflage is when an organismlooks like its environment

CENTRAL 214 173 47 170

rock means stone LEXGLUE 7 87 18 85a pebble is a kind of small rock GROUNDING 208 33 560 31an ecosystem contains nonliving things CENTRAL 2622 590 3412 581rock is a kind of nonliving thing GROUNDING 4870 538 135 531

MAP 0.0283 0.1193 0.0676 0.0876

Analysis:

• The annotation for the question marks too many facts as relevant. It is possible to generate a smallerset of facts as sufficient to answer then question

107

– Establish rock is non-living: pebble is a kind of small rock; rock means stone; rock is a kind ofnonliving thing; nonliving, non-living, die is the opposite of living, alive, live

– Establish consumer eat living things: consumers eat other organisms; an organism is a livingthing; a plant is a kind of organism

– Establish that the plant is trying to camouflage: looking like is similar to camouflaging as;camouflage is a kind of protection against predators; from predators; against consumers

• Final ranking gathers facts related to adaptation and food chains near the top

3 mcas 2011 5 17673Question: Jose has two bar magnets. He pushes the ends of the two magnets together and then he letsgo. The magnets move quickly apart. Which of the following statements best explains why this happens?(A) The north poles of the two magnets are facing each other. (B) One magnet is a north pole and onemagnet is a south pole. (C) The ends of magnets repel each other but the centers attract. (D) One magnetis storing energy and one magnet is releasing energy.Correct Answer: AGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixa magnet contains a north pole; south pole CENTRAL 1118 4 14 6if a pole is facing a pole of the same direction thenthose two poles will repel each other

CENTRAL 5 41 10 41

repel means move away LEXGLUE 971 1740 56 1719apart means away LEXGLUE 137 11 79 11north is a kind of direction GROUNDING 117 84 28 84

MAP 0.0495 0.1111 0.0969 0.0944

Analysis:

• The mixed results ranks facts related to magnetism and separated object as top choices. They aresimilar in content but not necessary to answer the question

4 mercury sc 411306Question: Some different types of plants have characteristics in common. Which characteristic do mostplants share? (A) the size of their roots (B) the shape of their leaves (C) the color of their flowers (D) thestructure of their cellsCorrect Answer: DGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixa plant is made of plant cells CENTRAL 2554 8 15 8a plant cell is box-like in shape CENTRAL 172 41 29 40structure is similar to shape LEXGLUE 53 31 136 30shape is a kind of characteristic CENTRAL 51 29 64 28

MAP 0.0191 0.0971 0.0530 0.0991

Analysis:

• None of the valid ground truth facts in top 25 TF-IDF ranking meaning that the path ranker has badstarting points

• The mixed results ranks facts related to cells, structure and characteristics as top choices. They havesimilar context but are not necessary to answer the question

5 vasol 2011 5 3Question: To make an electromagnet, a conductor should be coiled around - (A) a glass tube (B) an ironnail (C) a roll of paper (D) a wooden stickCorrect Answer: BGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixan electromagnet contains a wire; cylindrical fer-rous metal

ROLE 22 8 11 8

ferrous metals contain iron ROLE 51 18 23 18MAP 0.0423 0.1181 0.0889 0.1181108

Analysis:

• The top 2 facts ranked by the model are “iron nails are made of iron” and “an electromagnet isformed by attaching an iron nail wrapped in a copper wire to a circuit”. These are sufficient toanswer the question but not marked as ground truth.

6 mercury sc lbs10789Question: All of the following can become fossils except (A) bones. (B) shells. (C) teeth. (D) rocks.Correct Answer: DGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixfossil means preserved remains of organisms CENTRAL 2671 1 3 1skeletal system is made of bones NE 2429 1540 3418 1503a skeletal system is a part of an animal NE 2256 1506 919 1469an animal is a kind of organism CENTRAL 1221 43 39 41a shell is a part of some animals NE 2133 4328 285 4249a tooth is a kind of bone NE 2308 3819 3308 3742rock is a kind of nonliving thing CENTRAL 2098 25 24 23nonliving; non-living; die is the opposite of living;alive; live

LEXGLUE 1117 2242 130 2387

an organism is a living thing CENTRAL 3294 178 64 175MAP 0.0021 0.1318 0.0698 0.1331

Analysis:

• This question is a case where having all the choices is necessary to gather relevant facts. This isa known weakness in the preprocessing step which only keeps the correct choices. Because therankers only see the query “All of the following can become fossils except rocks” they cannot gatherfacts related to bones, teeth and shells. This is reflected by very poor TF-IDF ranks

• Top ranked facts in the final results relate to fossil formation and rock formation. Again, because ofthe same reason mentioned before, the ranker never sees the other choice terms

7 mdsa 2009 5 16Question: Students visited the Morris W. Offit telescope located at the Maryland Space Grant Observa-tory in Baltimore. They learned about the stars, planets, and moon. The students recorded the informationbelow. Star patterns stay the same, but their locations in the sky seem to change. The sun, planets, andmoon appear to move in the sky. Proxima Centauri is the nearest star to our solar system. Polaris is astar that is part of a pattern of stars called the Little Dipper. Which statement best explains why the sunappears to move across the sky each day? (A) The sun revolves around Earth. (B) Earth rotates aroundthe sun. (C) The sun revolves on its axis. (D) Earth rotates on its axis.Correct Answer: DGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixthe Earth rotating on its axis causes the sun toappear to move across the sky during the day

CENTRAL 1 18 25 15

the sun rises in the east BACKGROUND 39 23 87 20the sun sets in the west BACKGROUND 48 7 21 7

MAP 0.3713 0.1281 0.0540 0.1421

Analysis:

• Just one fact (”the Earth rotating on its axis causes the sun to appear to move across the sky duringthe day”) is sufficient to actually answer the question

• Top results are “if a human is on a rotating planet then other celestial bodies will appear to movefrom that human’s perspective”, “a star is a kind of celestial object; celestial body”, “stay the samemeans not changing”, “a telescope is used for observing stars;planets;moons;distant objects; thesky; celestial objects”, “the Earth rotates on its axis on its axis”, “the Sun is the star that is closestto Earth”. This looks like a reasonable explanation set

109

8 mdsa 2010 5 18 Question: Wind is a natural resource that benefits the southeastern shore of theChesapeake Bay. How could these winds best benefit humans? (A) The winds could blow oil spills intothe bay. (B) The winds could be converted to fossil fuel. (C) The winds could blow air pollution towardland. (D) The winds could be converted to electrical energy.Correct Answer: DGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixa windmill converts wind energy into electricity CENTRAL 80 75 192 74electricity means electrical energy LEXGLUE 32 3 7 3electricity is used as an energy source by electricaldevices

CENTRAL 125 33 5 33

electrical devices are used for industrial purposes;household purposes by humans

CENTRAL 372 142 79 140

to help; to benefit means to be of use LEXGLUE 2 10 15 10MAP 0.1291 0.1425 0.1525 0.1428

Analysis:

• Model spuriously ranks “a human is a kind of animal” at 2 and “person is synonymous with human”at 3. All other top facts relate to energy, wind and shore. Again the model has captured context butnot found facts that answer the question

9 mercury sc 400862Question: A wire is wrapped around a metal nail and connected to a battery. If the battery is active, thenail will (A) vibrate. (B) create sound. (C) produce heat. (D) become magnetic.Correct Answer: DGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixan electromagnet is a kind of electric magnet GROUNDING 3328 18 26 19a electromagnet is formed by attaching an iron nailwrapped in a copper wire to a circuit

CENTRAL 4 9 52 10

iron is a kind of metal GROUNDING 22 86 30 86creating a simple circuit requires a wire; battery CENTRAL 7 30 35 30if battery in an electromagnet is active then the nailin the electromagnet will become magnetic

CENTRAL 1 1 14 6

MAP 0.4224 0.3161 0.0917 0.1432

Analysis:

• Top ranked fact is “metal is sometimes magnetic”. The model pushes useful sentences away byranking unnecessary synonymy sentences higher

10 vasol 2009 5 10Question: A student is hiking through a forest taking pictures for science class. Which picture wouldmost likely be used as an example of human impact on Earth? (A) A trail built by cutting down trees(B) A river eroding away the riverbank (C) A bird nest made of dead branches (D) A group of butterflieslanding on flowersCorrect Answer: AGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixcutting down trees has a negative impact on anecosystem; organisms living in an ecosystem

CENTRAL 10 3 5 3

clearing a forest means humans cutting down thetrees

CENTRAL 7 38 38 37

building usually requires cutting down trees BACKGROUND 11 27 9 27the Earth contains many ecosystems CENTRAL 530 41 52 40

MAP 0.1558 0.1460 0.1445 0.1471

Analysis:

• Model output shows that it has captured context but again does not mark facts required to explainanswer higher

110

11 mcas 2002 5 6Question: Hummingbirds can hover in the air and fly very quickly. This benefits the hummingbird in allof the following except (A) quickly escaping predators. (B) easily reaching flowers. (C) staying in oneplace to drink nectar. (D) keeping eggs warm.Correct Answer: DGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixsome animals move quickly to escape predators NEG 160 2294 241 2275flying is a kind of motion air GROUNDING 410 12 5 12motion; movement means moving; to move LEXGLUE 4541 155 425 150avoiding predators; escaping predators; avoidingconsumers is a kind of protection

LEXGLUE 3423 3832 621 3770

protecting something means preventing harm tothat something

LEXGLUE 4077 1033 1271 1025

harming something has a negative impact on; effecton that something

LEXGLUE 4325 390 4488 384

negative impact is the opposite of positive impact LEXGLUE 1125 76 2346 73a positive impact is a benefit LEXGLUE 3862 2 3968 5a hummingbird can reach flowers by hovering inthe air

NEG 2 37 26 36

hummingbirds eat nectar CENTRAL 9 25 63 24a flower is a source of nectar CENTRAL 2769 89 182 86an animal needs to eat food for nutrients CENTRAL 3545 365 85 359an animal; living thing requires nutrients for sur-vival

CENTRAL 3946 151 71 148

requiring is similar to needing help LEXGLUE 4740 24 4786 23to help; to benefit means to be of use LEXGLUE 3281 9 3830 10to hover means to stay in place in the air NEG 1 13 2 2

MAP 0.1503 0.1470 0.0847 0.1651

Analysis:

• Since this is a question requiring elimination of choices, the preprocessing step of removing thewrong answer choices harms the model. Subsequently all relevant facts corresponding to incorrectchoices are pushed to the end

12 nceoga 2013 5 15Question: Which body system sends electrical signals to all other body systems? (A) circulatory system(B) digestive system (C) muscular system (D) nervous systemCorrect Answer: DGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixthe nervous system sends observations in the formof electrical signals to the rest of the body

CENTRAL 1 6 6 6

MAP 1.0000 0.1667 0.1667 0.1667

Analysis:

• The model output ranks “nervous system is an electrical; electric conductor”; “the nervous systemis a kind of body system”; “the nervous system contains nerves”; “the nervous system is a part of thebody of an animal”; “the nervous system is the vehicle for controlling the body”. This is surprisingsince it ignores the near-perfect lexical overlap between question and ground truth

13 mercury sc 401598Question: Which action does a kitten learn from its mother? (A) how to grow (B) how to meow (C) howto hunt mice (D) how to nurse milkCorrect Answer: CGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixhunting; scavenging is a kind of skill GROUNDING 3883 46 28 45skills are learned characteristics CENTRAL 4507 11 41 12hunting is a kind of action CENTRAL 8 5 3 3

MAP 0.0421 0.1490 0.1593 0.1889

111

Analysis:

• The model output contains a lot of synonymy-like facts but not facts about mice and hunting

14 mcas 2016 5 7 Question: Which of the following happens only during the adult stage of the lifecycle of a frog? (A) A frog lays eggs. (B) A frog swims in water. (C) A frog begins to lose its tail. (D) Afrog begins to develop lungs.Correct Answer: AGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixsome adult animals lay eggs CENTRAL 28 4 4 4a frog is a kind of animal GROUNDING 4 3 3 3An example of reproduction is laying eggs CENTRAL 85 32 16 31reproduction occurs during adulthood CENTRAL 267 182 84 179adulthood is a stage in the life cycle process CENTRAL 12 246 97 242

MAP 0.1179 0.1938 0.2240 0.1946

Analysis:

• Top ranked facts are “an egg is a stage in the life cycle process of some animals”; “frogs lay eggs”;“a frog is a kind of animal”; “some adult animals lay eggs”

• Spurious sentences like “a female insect lays eggs during the adult stage of an insect’s life cycle”appear in the ranking

15 mercury sc 415366Question: Trees need oxygen. Roots close to the surface of the ground take in the oxygen the tree needs.Which organisms help trees get oxygen? (A) woodpeckers making holes in the tree (B) earthwormsmaking holes in the ground near the tree (C) mushrooms growing at the base of the tree (D) squirrelseating walnuts on the ground near the treeCorrect Answer: BGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixearthworms create tunnels in soil CENTRAL 144 30 22 28to create means to make LEXGLUE 1475 197 2444 193a tunnel is a kind of hole LEXGLUE 3265 110 147 107soil is a part of the ground outside LEXGLUE 34 278 170 274tunnels in soil loosen that soil CENTRAL 1924 83 11 80the looseness of soil increases the amount of oxy-gen in that soil

CENTRAL 22 297 5 292

plants absorb nutrients; water; oxygen from soilinto themselves through their roots

CENTRAL 63 9 7 9

taking in something means receiving; absorbing;getting something

LEXGLUE 1956 587 33 581

a tree is a kind of plant GROUNDING 46 3 168 3an earthworm is a kind of animal CENTRAL 2760 1 3319 1an animal is a kind of organism GROUNDING 2354 21 3051 20to make something easier means to help LEXGLUE 330 450 1715 444getting something increases the amount of thatsomething

LEXGLUE 1024 1040 51 1033

MAP 0.0247 0.2044 0.1058 0.2066

Analysis:

• Model ranks sentences related to roots, plants and earthworms near the top. It gathers sentenceswith low TF-IDF overlap but misses facts required to complete the reasoning process

16 mdsa 2007 5 39Question: Students are learning about the natural resources in Maryland. One group of students re-searches information about renewable natural resources in the state. The other group researches infor-mation about nonrenewable natural resources in the state. The resources the students investigate includeplants, animals, soil, minerals, water, coal, and oil. Which of the following human activities negativelyaffects a natural resource? (A) fishing in a lake (B) using water to produce electricity (C) planting native

112

plants along a lakeshore (D) directing runoff from cropland into a lakeCorrect Answer: DGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixwater is a kind of natural resource GROUNDING 6 11 1 11a lake is a kind of body of water GROUNDING 1448 370 7 364runoff contains chemicals; fertilizer; pollutants;pesticides from cropland

CENTRAL 194 2 19 2

to impact means to affect LEXGLUE 4482 7 7 7cropland is used for farming by humans BACKGROUND 744 35 93 35pollution is when humans pollute the environ-ment with pollutants

CENTRAL 1590 47 525 46

pollution has a negative impact on the environ-ment; air quality

CENTRAL 1504 65 239 64

a body of water is a kind of environment GROUNDING 1186 18 22 18MAP 0.0247 0.2101 0.2149 0.2107

Analysis:

• TF-IDF is distracted by the initial chunk of text in the question. This text does not add any value tothe actual question

• Top 5 outputs of our model are “runoff is when cropland water enters; runs off into bodies of water”;“runoff contains chemicals; fertilizer; pollutants; pesticides from cropland”; “runoff is a kind ofwater”; “harming something has a negative impact on; effect on that something”; “waste has anegative impact on the environment”. This shows that our model has in fact ignored the distractors

17 mercury sc 400612Question: The surface of Earth changes constantly from weathering and erosion. Compared to Earth,there is little weathering and erosion on the Moon because of (A) the lack of gravity on the Moon. (B)the thin atmosphere on the Moon. (C) the lack of air and water on the Moon. (D) the lack of livingcreatures on the Moon.Correct Answer: CGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixthe moon does not contain air; water CENTRAL 36 4 17 5if something does not contain something else thenthat something lacks that something else

LEXGLUE 4201 15 23 14

rocks; soil; materials interacting with wind; mov-ing water over long periods of time causes weath-ering

CENTRAL 192 51 94 50

wind means moving air LEXGLUE 811 173 64 170soil erosion means soil loss through wind;water;animals

CENTRAL 153 23 23 22

lack is similar to low; little LEXGLUE 38 1 3 3MAP 0.0214 0.3344 0.1164 0.2108

Analysis:

• TF-IDF is distracted by facts relating to the definition of weathering, and the relation between theEarth and the moon

• Our model actually ranks the top 2 facts as “the Moon has less water; air than Earth”; “weatheringis a kind of erosion” which are both correct and useful for answering the question. It also ranks“soil erosion is when wind; moving water; gravity move soil from fields; environments” at rank 9.These would be sufficient facts to answer the question. This example shows an occurrence wherethe annotation/ranking process actually misses valid supporting facts

18 mcas 1998 4 8Question: Where would it be MOST dangerous to work with electric tools? (A) in a garage (B) beside aswimming pool (C) near a television or computer (D) in a cool basement

113

Correct Answer: BGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixa swimming pool contains water CENTRAL 1 3 1 1water is an electrical; electric energy; thermal; ther-mal energy conductor

CENTRAL 93 15 5 5

sending electricity through a conductor causeselectricity; electric current to flow through thatconductor

CENTRAL 102 313 1313 310

nervous system is an electrical; electric conductor CENTRAL 42 714 1482 708if one electrical conductor contacts another elec-trical conductor then electricity will flow throughboth conductors

CENTRAL 1817 975 1247 969

the nervous system is a part of the body of an ani-mal

CENTRAL 2809 3828 1014 3777

if electricity flows through; is transferred throughthe body of an animal then that animal is electro-cuted

CENTRAL 1784 25 11 24

electrocution causes harm to an organism CENTRAL 519 26 45 25an animal is a kind of organism GROUNDING 4833 59 20 57harm means danger LEXGLUE 3695 12 10 13electric devices require electrical energy to func-tion

CENTRAL 35 19 57 18

device means tool LEXGLUE 3709 115 225 113MAP 0.1043 0.1322 0.2199 0.2127

Analysis:

• The fact that the nervous system conducts electricity is needed to show that the animal/humanbody will conduct electricity. The facts ranked low by our model are the ones related to the nervoussystem. Most likely the model implicitly assumes this since it has observed the fact that “if electricityflows through; is transferred through the body of an animal then that animal is electrocuted”

19 mercury sc 401294 Question: The shape of plants’ leaves that survive well in a rainy climate aremost often (A) red and shiny. (B) wide and flat. (C) thick and waxy. (D) sharp and narrow. CorrectAnswer: B Ground truth facts:

Text Role TF-IDF Reranker Path Ranker Mixas the size of a leaf increases , the amount of sun-light absorbed by that leaf will increase

CENTRAL 506 99 53 97

width is a property of size; shape and includes or-dered values of narrow; wide

CENTRAL 5 78 2 76

as flatness of a leaf increases , the amount of sun-light that leaf can absorb will increase

CENTRAL 954 11 17 12

flatness is a property of a surface; the shape of anobject and includes ordered values of uneven; flat

CENTRAL 6 9 5 10

a leaf is a kind of object GROUNDING 2220 19 11 18a surface is a part of an object GROUNDING 2240 520 31 516a leaf absorbs sunlight to perform photosynthesis CENTRAL 4734 207 494 204a leaf is a part of a green plant GROUNDING 3033 6 98 7a plant requires photosynthesis to grow; survive CENTRAL 74 237 77 234rainy means often raining LEXGLUE 3 10 4 11as the amount of rain increases in an environment ,available sunlight will decrease in that environment

CENTRAL 245 150 91 147

a climate is synonymous with an environment LEXGLUE 20 1 3 2the decrease of something required by an organismhas a negative impact on that organism’s survival

CENTRAL 953 482 352 478

a plant is a kind of organism GROUNDING 2125 15 201 15large leaves are a kind of adaptation for absorbingsunlight

CENTRAL 36 47 242 46

larger means greater; higher; more in size LEXGLUE 2089 83 66 81an adaptation; an ability has a positive impact onan animal’s; living thing’s survival; health; abilityto reproduce

CENTRAL 4088 41 115 40

negative impact is the opposite of positive impact LEXGLUE 1598 115 3030 234MAP 0.0979 0.2486 0.2634 0.2147

114

Analysis:

• This question has too many supporting facts and a smaller subset of these facts is sufficient toanswer the question i.e. establish that flat and wide leaves absorb more sunlight, rainy climate meansshortage of sunlight and hence flat and wide leaves is a useful characteristic.

• The ground truth facts ranked lowest by our model are “a surface is a part of an object“, “thedecrease of something required by an organism has a negative impact on that organism’s survival”which we feel are extraneous. The top ranked facts relate to leaves and their shapes.

20 csz 2007 5 csz10148Question: Above a continent, a warm air mass slowly passes over a cold air mass. As the warm air beginsto cool, clouds form. What will most likely happen next? (A) Rain will fall. (B) Hurricanes will form.(C) Lightning will strike. (D) Hail will form.Correct Answer: AGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixa warm front is when warm air mass rises andpasses over a cold air mass

CENTRAL 1 1 3 3

a warm front causes cloudy and rainy weather CENTRAL 214 143 9 142cloudy means the presence of clouds in the sky BACKGROUND 242 26 130 26clouds are formed by water vapor rising and con-densing

CENTRAL 361 10 7 11

water vapor cooling causes that water vapor tocondense

CENTRAL 1413 317 28 316

precipitation is when rain;snow;hail fall fromclouds to the Earth;ground

GROUNDING 33 3 4 4

MAP 0.1849 0.3624 0.3218 0.2190

Analysis:

• This is a case where mixing actually leading to worse performance than the individual models

• The fact “a warm front causes cloudy and rainy weather” is necessary to eliminate hail from con-sideration and this is missed by the model. The actual model outputs relate to cloud formation andprecipitation but this one fact is ranked very low

21 mdsa 2011 5 6Question: Planets in our solar system have different solar years. Which statement explains the cause ofan Earth solar year? (A) Earth rotates around the sun. (B) Earth revolves around the sun. (C) The sunrotates around Earth. (D) The sun revolves around Earth.Correct Answer: BGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixa complete revolution of the Earth around the suntakes 1; one year; solar year; Earth year

CENTRAL 1 6 4 5

a revolution is when something revolves aroundsomething else

LEXGLUE 41 8 10 7

MAP 0.5244 0.2083 0.2250 0.2428

Analysis:

• The top results for our model are “the Earth revolves around the sun”; “Earth is a kind of planet”;”revolving around something means orbiting that something”; “the solar system contains the moon”;“a complete revolution of the Earth around the sun takes 1; one year; solar year; Earth year”. Thesesufficiently answer the question.

• It should be noted that TF-IDF had a better score than the other models

22 timss 2007 4 pg90Question: Sue measured how much sugar would dissolve in a cup of cold water, a cup of warm water,and a cup of hot water. What did she most likely observe? (A) The cold water dissolved the most sugar.

115

(B) The warm water dissolved the most sugar. (C) The hot water dissolved the most sugar. (D) The coldwater, warm water and hot water all dissolved the same amount of sugar.Correct Answer: CGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixgreatest means largest; highest; most ROLE 136 63 1940 62temperature; heat energy is a property of objects;weather and includes ordered values of cold; cool;warm; hot

CENTRAL 40 6 37 6

hot means high in heat energy; temperature LEXGLUE 125 10 52 10as temperature increases , the ability of that liquidto dissolve solids will increase

CENTRAL 177 60 22 60

high is similar to increase LEXGLUE 4268 54 1174 54water is a kind of liquid GROUNDING 21 2 2 2sugar is a kind of solid GROUNDING 15 15 20 15

MAP 0.0487 0.2434 0.1356 0.2436

Analysis:

• The central fact “as temperature increases , the ability of that liquid to dissolve solids will increase”has very low TF-IDF overlap. The path ranker actually does a good job of bringing this fact and the2 GROUNDING facts to the top 25. However, its output is disregarded due to the mixing threshold.

• The reranker model is distracted by the facts relating to temperature, heat and the process of dis-solving

23 mdsa 2011 5 20Question: The snowshoe hare sheds its fur twice a year. In the summer, the fur of the hare is brown. Inthe winter, the fur is white. Which of these statements best explains the advantage of shedding fur? (A)Shedding fur keeps the hare clean. (B) Shedding fur helps the hare move quickly. (C) Shedding fur keepsthe hare’s home warm. (D) Shedding fur helps the hare blend into its habitat.Correct Answer: DGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixshedding is when an animal loses hair;fur;skin CENTRAL 4 24 3 3a hare is similar to a rabbit LEXGLUE 16 8 4 4some rabbits live in forests CENTRAL 2761 278 103 276a rabbit is a kind of animal GROUNDING 1566 10 7 12a forest contains plants; trees GROUNDING 4040 288 4195 286bark is a protective covering around the trunk of;branches of a tree

GROUNDING 1002 1336 732 1330

bark is usually brown in color GROUNDING 240 522 478 518snow is white in color GROUNDING 239 15 14 16An example of camouflage is when somethingchanges color in order to have the same color asits environment

CENTRAL 210 21 53 22

coloration is a kind of adaptation for hid-ing;camouflage

CENTRAL 2806 96 257 95

coloration means a thing’s color LEXGLUE 4537 137 28 136help means advantage LEXGLUE 236 9 1492 11blending in is synonymous with hiding LEXGLUE 2207 1 2795 6habitat is similar to environment LEXGLUE 586 19 1625 20an environment means an area LEXGLUE 4491 31 4935 31the color of an environment means the color of thethings that environment contains

LEXGLUE 73 116 194 115

MAP 0.0352 0.2507 0.1279 0.2449

Analysis:

• Answering this question properly requires connecting the following facts: “a hare is similar to arabbit”, “some rabbits live in forests”, “a forest contains plants; trees”, “bark is a protective coveringaround the trunk of; branches of a tree”, “bark is usually brown in color” to the question text. Notethat the intermediate facts have little to no term overlap with the question.

116

• Our model is able to get “a hare is similar to a rabbit” but misses the rest of the facts. This is not alimitation in the model itself but in the actual inference time computation (we only hop out one factduring evaluation).

7 Appendix II: Positive examples

This sections contains examples of questions where Path Ranker performs better than all other models.1 timss 2011 4 pg90

Question: Water that has its salt removed before it can be used as drinking water is most likely to havecome from (A) underground (B) a river (C) a lake (D) a seaCorrect Answer: DGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixthe ocean contains large amounts of salt water CENTRAL 23 5 2 2sea means ocean LEXGLUE 4090 93 3 3animals usually require removing salt from wa-ter for drinking

BACKGROUND 1 1 1 1

MAP 0.3626 0.4774 1.0000 1.0000

2 mercury sc 401244Question: Which rock type is most useful in studying the history of living organisms? (A) basalt (B)marble (C) granite (D) limestoneCorrect Answer: DGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixlimestone is a kind of sedimentary rock GROUNDING 81 298 1 1nearly all fossils are found in sedimentary rock CENTRAL 150 306 12 297fossils are formed when layers of sediment coverthe remains of organisms over time

CENTRAL 156 107 131 103

sedimentary rocks are formed from sediment com-pacting; cementing together

CENTRAL 4357 1662 32 1647

history occurred a long time ago CENTRAL 22 5 3 3a type is synonymous with a kind LEXGLUE 14 2 2 2something from long ago can be used for studyinghistory

CENTRAL 1 3 4 4

useful means good to use LEXGLUE 5 6 9 7MAP 0.2431 0.3160 0.6669 0.6001

Analysis: Notice that the facts “nearly all fossils are found in sedimentary rock” and “sedimentaryrocks are formed from sediment compacting; cementing together” only have the word rock in commonwith the question and correct answer. Yet the Path ranker ranks them higher than the other models. Itreaches them via the paths [‘limestone is a kind of sedimentary rock’, ‘nearly all fossils are found insedimentary rock’] and [‘limestone is a kind of sedimentary rock’, ‘sedimentary rocks are formed fromsediment compacting; cementing together’]

3 mercury sc 415541Question: Which of these objects will most likely float in water? (A) glass marble (B) steel ball (C) hardrubber ball (D) table tennis ballCorrect Answer: DGround truth facts:

Text Role TF-IDF Reranker Path Ranker Mixa ball is a kind of object CENTRAL 1476 4 1 1an tennis ball contains air CENTRAL 4 1 2 2something that contains air is usually buoyant CENTRAL 4562 688 11 677buoyant means able to float in a liquid or gas CENTRAL 11 8 19 8water is a kind of liquid GROUNDING 55 31 94 29

MAP 0.0980 0.4023 0.5073 0.5041

Analysis: Notice that the fact “something that contains air is usually buoyant” has no word in commonwith the question and correct answer. Yet the Path ranker ranks it higher than the other models. It reachesthe fact via the path [‘an tennis ball contains air’, ‘something that contains air is usually buoyant’].

117


Joint Semantic and Distributional Word Representationswith Multi-Graph Embeddings

Pierre-Daix Moreux∗Ubisoft Entertainment SA

[email protected]

Matthias GalleNaver Labs Europe

[email protected]

Abstract

Word embeddings continue to be of great usefor NLP researchers and practitioners due totheir training speed and easiness of use anddistribution. Prior work has shown that therepresentation of those words can be improvedby the use of semantic knowledge-bases. Inthis paper we propose a novel way of com-bining those knowledge-bases while the lexi-cal information of co-occurrences of words re-mains. It is conceptually clear, as it consistsin mapping both distributional and semanticinformation into a multi-graph and modifyingexisting node embeddings techniques to com-pute word representations. Our experimentsshow improved results compared to vanillaword embeddings, retrofitting and concatena-tion techniques using the same information, ona variety of data-sets of word similarities.

1 Motivation

Word embeddings revolutionized NLP through theuse of lookup dictionaries that provided contin-uous representations of words. While surpassedrecently by token-based (contextual) embeddings,word embeddings continue to be popular becausethey are faster to train, can be used plug-and-play by a multitude of machine learning sys-tems, with storing a database of embeddings forsole requirement. This makes them particularlyattractive for domain-specific embeddings (e.g.,privacy policies (Harkous et al., 2018), oil andgas (Nooralahzadeh et al., 2018) or sentimentanalysis (Sarma et al., 2018).

The most popular word embeddings are trainedpurely with a distributional prior: words occurringin a similar context should have a similar repre-sentation. It is well known that the quality of wordembeddings can be improved by injecting seman-tic knowledge in the form of curated databases

∗Work done while at Naver Labs Europe

of relationships between words. However, it isless clear how these two types of knowledge canbe mixed. Existing approaches work mostly byfine-tuning them afterwards (Mrksic et al., 2016;Faruqui et al., 2014) through additional semanticconstraints from lexical databases, such as Word-Net (Miller, 1995). Although some joint learningapproaches have been proposed, the way that thesemantic knowledge is injected is not straightfor-ward as the original data-structure is very different(sequences and graphs).

In this paper we propose to represent the co-occurrence relationship of words as a graph. Sucha representation opens up natural ways of merg-ing this lexical graph with the semantic graphincorporating new edges with different types.Our experiments show that graph embeddings ofthe resulting nodes (words) outperform not onlypure distributional-based embeddings, but alsoretrofitted and concatenated ones, on standardword similarity tasks.

The main contributions of this paper are:

• Combining two types of knowledge into onestructure represented as a multi-graph.

• Tailoring optimization methods from graphembeddings to include edge types.

• Experimental results showing that they out-perform existing methods on a standard wordsimilarity task. The gap is higher when lesslexical training data is available.

2 Related Work

Continuous embeddings of words rely on two rep-resentations per word w: one considering it as to-ken (~vw), and another one considering it as contextof another token (~v ′c). When using pointwise mu-tual information of the co-ocurrences matrix (Levyand Goldberg, 2014), this happens implicitly as

118

the final matrix is square. Alternatively, when us-ing a matrix reduction technique (eg: SVD), thesecond non-diagonal matrix (which is discardedmost of the time) can be considered as a contex-tual view of the tokens.

In more modern word embeddings, likeword2vec (Mikolov et al., 2013) or GloVe (Pen-nington et al., 2014), such a representation is ex-plicit, and the final representation is either oneof the two or a combination of them (for a dis-cussion on the impact of different combinationssee (Duong et al., 2016)).

2.1 Node embeddings

This dual representation becomes even more im-portant when considering graph embeddings. Tofind a self-supervised optimization function thatinduces a representation of nodes, two differentgoals are formalized (Tang et al., 2015): ho-mophily, stating that close nodes should have asimilar representation (McPherson et al., 2001),and structural similarity, aiming to have simi-lar representations for words that have a similarneighbourhood (Fortunato, 2010). The LINE al-gorithm (Tang et al., 2015) creates node embed-dings optimized for either of those two. When op-timizing for homophily, the loss function consistsin maximising their first order similarity:

sim(~vi, ~vj) (1)

As in previous work, we will define the similar-ity as the logistic function, work in log-space anddefine for simplicity :

sim(~vi, ~vj) = log1

1 + e−~vi·~vj(2)

Optimizing for structural similarity is achievedby focusing on the second order similarity, goingthrough the contextual embedding, by maximis-ing:

sim(~vi, ~v′j) (3)

For two nodes with shared neighbourhoods, op-timizing this will force their representations tobe similar. While LINE uses the alias tablemethod (Li et al., 2014) to optimize either Eq. 1 orEq. 3, word2vec uses a context window of fixed-size c to maximize Eq. 3.

2.2 Incorporating semantic knowledgeCombining lexical and semantic information froma knowledge-graph – for word embeddings – isnot straightforward as they consist in two differ-ent representations.

One line of research uses knowledge-graphsto modify word embeddings obtained throughpure distributional, lexical approaches afterwards.Faruqui et al. (2014) do so by maximising first or-der similarity (Eq. 1) of two words marked as syn-onyms in a semantic graph. In order to not disruptthe embedding space, this is regularized with aterm insisting that the new embeddings should notbe too far apart from the original ones. On top ofthis Mrksic et al. (2016) add also antonyms, push-ing the representations of two antonyms apart.

Another line of research, closer to our proposi-tion, is to incorporate semantic knowledge at train-ing time. Liu et al. (2015) do so through using or-dinal constraints (similarity of synonyms shouldbe higher than of non-synonyms). Many otherworks trained co-occurrences together with syn-onyms (Yu and Dredze, 2014; Bian et al., 2014;Kiela et al., 2015) or even other terms. However,those methods treat synonyms equally to contextwords and do not modify the similarity betweenthem. They are therefore optimized through sec-ond order similarity as well.

Our main contribution consists in (i) defin-ing this problem through a conceptually simplemulti-graph data structure and (ii) treating differ-ent edges types with different similarity (first orsecond order).

3 Joint Learning of Semantic andLexical Embeddings

Our proposal is to construct a graph that containsboth the lexical information of co-occurrence ofwords, as well as the semantic information con-tained in knowledge graphs. We construct a multi-edge graph, where each edge belongs to one ofa predefined class. Here, we report results usingthree classes:

• lexical: we add an edge or increment itsweight between node vi and vj every timeword i occurs in the same window (of pre-defined size c) than word j

• synonym: words i and j are connected when-ever any of their senses belongs to the samesynset from WordNet.

119

• antonym: words i and j are connected when-ever any of their senses are antonyms accord-ing to WordNet.

In this paper, we will uniformly sample a syn-onym from the set of synonyms of a word. Wedefine Si and Ti to respectively be the set of allsynonyms and antonyms of word i.

For a node (word) vi, we will model its relationwith one of its synonyms using first order proxim-ity:

sim(~vs, ~vi) (4)

where s ∈ Si.Using the multi-edge graph setting, we run an

experiment where, when possible, we also includean antonym as an additional first order negative ex-ample, uniformly sampled from the antonyms setTi of a word. For training, we use negative sam-pling (N(v)) and end up with the following ob-jective which we train with stochastic gradient de-scent:

sim(~v ′j · ~vi) +k∑

n=1

Evn∼N(v)

(sim

(−~v ′n · ~vi

))

+ sim(~vs · ~vi) +k∑

n=1

Evn∼N(v)

(sim (−~vn · ~vi))

+ sim(−~va · ~vi) +k∑

n=1

Evn∼N(v)

(sim (−~vn · ~vi))

where s ∈ Si and a ∈ Ti.This objective function accounts for both types

of similarity during learning.

4 Results

We trained our embeddings on 25 millions oflines from the english One Billion Word Cor-pus (Chelba et al., 2013). For any word, weused WordNet to include its set of synonyms andantonyms when needed. As usual, we computethe cosine similarity between the embeddings foreach word and compare the Spearman correlationof that similarity with human scores evaluating theextent to which those two words are similar (syn-onyms) or related.

Datasets compiling scores for explicitly evalu-ating similarity include SimLex-999 (Hill et al.,2015), RG-65 (Rubenstein and Goodenough,1965) and MC-30 (Miller and Charles, 1991).For comparison purposes, we include also EN-MTURK-771 (Guy Halawi, 2012), which rather

deals with evaluating word pairs’ relatedness.WordSimilarity-353 (Finkelstein et al., 2001)(EN-WS-353-SIM) is less clear on whether it eval-uates similarity or relatedness, as in contrast to itstitle, human participants were asked “to estimatethe relatedness of the words”. Lofi (2015) or Asret al. (2018) provide good introductions to the dif-ference between evaluating similarity versus relat-edness.

Table 1 summarizes the different results ob-tained with our joint learning approach (with andwithout antonyms), separate results for first orderand second order representations, and word2vec(skip-gram with negative sampling) with and with-out retrofitting. It also includes results obtainedwith concatenated representations learned fromoptimizing first order proximity of synonyms withsecond order embeddings (as proposed in Tanget al. (2015)).

Concatenation surpasses the retrofitting tech-nique in terms of Spearman correlation scores. Itrequires however much more training time, as anadditional embedding is required for each new se-mantic relationship.

In any case, the joint learning approach wepropose outperforms any kind of method ondatasets evaluating similarity. For instance, theSpearman correlations we obtain on SimLex-999,when solely including synonyms, improve onthe word2vec baseline by over 11%. Includingantonyms increases this difference in performanceup to 17% on this particular dataset.

Our attempt at shifting our vector space towardsa similarity nudged one seems confirmed by ourperformance on EN-MTURK-771. Indeed, purelydistributional vectors obtain here better Spearmancorrelation scores.

We also benchmarked the learning curve withan increasing amount of lexical data. In Figure 1,we plot the Spearman correlations obtained whentraining with an increasing chunk of the 1 Bil-lion Word Corpus, comparing jointly-learned vec-tors with concatenated, LINE second order andword2vec (vanilla and retrofitted) embeddings.

Figure 1 illustrates that provided with Word-Net, the joint learning approach is better equippedto learn representations when less lexical trainingdata is available. Indeed, higher Spearman cor-relation scores are obtained from the beginning.In addition to this, including antonyms further in-creases the observed gap.

120

Method en-simlex-999 en-rg-65 en-mc-30 en-mturk-771 en-ws-353-sim

w2v 0.402 0.603 0.605 0.629 0.713w2v retrofitted 0.444 0.667 0.653 0.620 0.689LINE second order 0.378 0.562 0.604 0.571 0.685LINE first order 0.304 0.533 0.516 0.536 0.642Concat. Syn. 0.471 0.643 0.673 0.551 0.703Joint Learning (Syn.) 0.516 0.726 0.771 0.573 0.742Joint Learning (Syn. + Ant.) 0.580 0.705 0.749 0.570 0.675

Table 1: Spearman correlations between human-based judgements and similarity obtained using different embed-dings learned on more than 634 millions of tokens. “Concat. Syn.” stands for results obtained when concatenatingto second order representations, first order embeddings of synonyms. “Syn.” and “Syn. + Ant.” stand for theinclusion of synonyms and antonyms during joint learning

Word pair - relation GloVe w2v Joint Learning (Syn. + Ant.) LINE second order

(coffee, cup) - relatedness 0.335 0.333 0.133 0.207(cheap, expensive) - antonymy 0.545 0.512 -0.504 0.556(period, epoch) - meronymy 0.199 0.274 0.272 0.348(torso, trunk) - synonymy 0.257 0.481 0.766 0.447

Table 2: Cosine similarity measures for different word pairs.

Figure 1: Spearman correlation scores over SimLex-999 obtained for different embeddings on an increasingamount of lexical data. “J.-L.” stands for joint-learning.

To illustrate the impact the joint learning ap-proach has on the embeddings space, Table 2 pro-vides examples showing the impact our approachhas on the cosine similarity of different kinds ofword pairs.

We observe that while the cosine similarity ofthe related word pair decreases, the similarity oftwo synonyms greatly increases in comparison tothe antonym pair’s similarity, which turns nega-tive. Interestingly, the similarity of the two pro-vided meronyms does not show any great differ-

ence with respect to the one provided by distribu-tional methods.

5 Conclusion

We proposed a novel way of combining the lexi-cal information of co-occurrences with that of se-mantic knowledge bases. Our method maps allthose sources of information into a multi-graphand modifies existing node embeddings techniqueso that they treat edges of different types differ-ently. We claim that our proposal is conceptu-ally simpler than existing proposals which eitheruse the information from the semantic graph tofinetune the word embeddings obtained throughlexical information, or combine the informationin some other indirect way. Instead of this, ourmethod formalizes an objective to include noveledge types. In our experiments we presented re-sults using three types of edges. In addition toobtaining better results when measured on a stan-dard word similarity task, our method is less data-greedy: it obtains better results with much lesstraining data than other methods. This can beinteresting in particular for the creation of in-domain word embeddings, where curated knowl-edge graph exists. Thus, using a multi-graph al-lows for an easy way of incorporating additionaltypes of information. In particular, we are consid-

121

ering multi-lingual embeddings through the inclu-sion of bilingual dictionaries which connect nodes(words) of different languages.

ReferencesFatemeh Torabi Asr, Robert Zinkov, and Michael

Jones. 2018. Querying word embeddings for sim-ilarity and relatedness. In Proceedings of the 2018Conference of the North American Chapter of theAssociation for Computational Linguistics: HumanLanguage Technologies, Volume 1 (Long Papers),pages 675–684.

Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014.Knowledge-powered deep learning for word embed-ding. In Joint European conference on machinelearning and knowledge discovery in databases,pages 132–148. Springer.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge,Thorsten Brants, and Phillipp Koehn. 2013. One bil-lion word benchmark for measuring progress in sta-tistical language modeling. CoRR, abs/1312.3005.

Long Duong, Hiroshi Kanayama, Tengfei Ma, StevenBird, and Trevor Cohn. 2016. Learning crosslingualword embeddings without bilingual corpora. arXivpreprint arXiv:1606.09403.

Manaal Faruqui, Jesse Dodge, Sujay K Jauhar, ChrisDyer, Eduard Hovy, and Noah A Smith. 2014.Retrofitting word vectors to semantic lexicons.arXiv preprint arXiv:1411.4166.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias,Ehud Rivlin, Zach Solan, Gadi Wolfman, and Ey-tan Ruppin. 2001. Placing search in context: Theconcept revisited. In Proceedings of the 10th inter-national conference on World Wide Web, pages 406–414. ACM.

Santo Fortunato. 2010. Community detection ingraphs. Physics reports, 486(3-5):75–174.

Evgeniy Gabrilovich Yehuda Koren Guy Halawi,Gideon Dror. 2012. Large-scale learning of word re-latedness with constraints. KDD, pages 1406–1414.

Hamza Harkous, Kassem Fawaz, Remi Lebret, FlorianSchaub, Kang G Shin, and Karl Aberer. 2018. Poli-sis: Automated analysis and presentation of privacypolicies using deep learning. In 27th USENIX Se-curity Symposium (USENIX Security 18), pages531–548.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015.Simlex-999: Evaluating semantic models with (gen-uine) similarity estimation. Computational Linguis-tics, 41(4):665–695.

Douwe Kiela, Felix Hill, and Stephen Clark. 2015.Specializing word embeddings for similarity or re-latedness. In Proceedings of the 2015 Conference on

Empirical Methods in Natural Language Process-ing, pages 2044–2048.

Omer Levy and Yoav Goldberg. 2014. Neural wordembedding as implicit matrix factorization. In Ad-vances in neural information processing systems,pages 2177–2185.

Aaron Q Li, Amr Ahmed, Sujith Ravi, and Alexan-der J Smola. 2014. Reducing the sampling com-plexity of topic models. In Proceedings of the 20thACM SIGKDD international conference on Knowl-edge discovery and data mining, pages 891–900.ACM.

Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, andYu Hu. 2015. Learning semantic word embeddingsbased on ordinal knowledge constraints. In Pro-ceedings of the 53rd Annual Meeting of the Associ-ation for Computational Linguistics and the 7th In-ternational Joint Conference on Natural LanguageProcessing (Volume 1: Long Papers), pages 1501–1511.

Christoph Lofi. 2015. Measuring semantic similarityand relatedness with distributional and knowledge-based approaches. Information and Media Tech-nologies, 10(3):493–501.

Miller McPherson, Lynn Smith-Lovin, and James MCook. 2001. Birds of a feather: Homophily in socialnetworks. Annual review of sociology, 27(1):415–444.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-rado, and Jeff Dean. 2013. Distributed representa-tions of words and phrases and their compositional-ity. In Advances in neural information processingsystems, pages 3111–3119.

George A Miller. 1995. Wordnet: a lexical database forenglish. Communications of the ACM, 38(11):39–41.

George A Miller and Walter G Charles. 1991. Contex-tual correlates of semantic similarity. Language andcognitive processes, 6(1):1–28.

Nikola Mrksic, Diarmuid OSeaghdha, Blaise Thom-son, Milica Gasic, Lina Rojas-Barahona, Pei-HaoSu, David Vandyke, Tsung-Hsien Wen, and SteveYoung. 2016. Counter-fitting word vectors to lin-guistic constraints. In Proceedings of NAACL-HLT,pages 142–148.

Farhad Nooralahzadeh, Lilja Øvrelid, and Jan ToreLønning. 2018. Evaluation of domain-specific wordembeddings using knowledge resources. In Pro-ceedings of the Eleventh International Conferenceon Language Resources and Evaluation (LREC-2018).


122

Herbert Rubenstein and John B Goodenough. 1965.Contextual correlates of synonymy. Communica-tions of the ACM, 8(10):627–633.

Prathusha K Sarma, Yingyu Liang, and William ASethares. 2018. Domain adapted word embeddingsfor improved sentiment classification. In ACL.

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, JunYan, and Qiaozhu Mei. 2015. Line: Large-scale in-formation network embedding. In Proceedings ofthe 24th International Conference on World WideWeb, pages 1067–1077. International World WideWeb Conferences Steering Committee.

Mo Yu and Mark Dredze. 2014. Improving lexicalembeddings with semantic knowledge. In Proceed-ings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 2: Short Pa-pers), pages 545–550.

123


Evaluating Research Novelty Detection: Counterfactual Approaches

Reinald Kim Amplayo and Seung-won Hwang and Min SongYonsei University

Seoul, South Korearktamplayo, seungwonh, [email protected]

Abstract

In this paper, we explore strategies to evaluatemodels for the task research paper novelty de-tection: Given all papers released at a givendate, which of the papers discuss new ideasand influence future research? We find thenovelty is not a singular concept, and thus in-herently lacks of ground truth annotations withcross-annotator agreement, which is a majorobstacle in evaluating these models. Test-of-time award is closest to such annotation, whichcan only be made retrospectively and is ex-tremely scarce. We thus propose to compareand evaluate models using counterfactual sim-ulations. First, we ask models if they can dif-ferentiate papers at time t and counterfactualpaper from future time t + d. Second, we askmodels if they can predict test-of-time awardat t + d. These are proxies that can be agreedby human annotators and easily augmentedby correlated signals, using which evaluationcan be done through four tasks: classification,ranking, correlation and feature selection. Weshow these proxy evaluation methods comple-ment each other regarding error handling, cov-erage, interpretability, and scope, and thus al-together contribute to the observation of therelative strength of existing models.

1 Introduction

Research paper novelty detection can be definedas follows: Given the full-text content of the pa-per, determine if the paper is novel or not. Whencomparing the novelty of two papers, we assumethat only the texts (i.e., abstract, body, and ref-erence sections) are shown and that both papersare published at the same time, and the venue isnot known. This task is essential because whileprevious works on plagiarism detection (Harris,2002; Lukashenko et al., 2007), citation recom-mendation (He et al., 2010; Ji et al., 2017), andreviewer assignment (Long et al., 2013; Liu et al.,

2014) help in the administrative part of the re-view process, automatically detecting paper nov-elty can speed up the paper reviewing. Also, fromthe viewpoint of paper readers, it helps in filter-ing out non-novel papers from the large number ofpapers being published every day.

Despite its importance, this direction of re-search has not been explored as much. We arguethat this is because it is hard to evaluate these mod-els. An obvious solution is to create an evalua-tion dataset which contains papers that are labeledwith their novelty. However, acquisition of thisdataset is practically impossible, because of sev-eral aspects. First, novelty is not a singular con-cept, and captures diverse aspects, being yet to beseen with respect to previous knowledge and in-novative to have impact on future publication. Asa result, collecting novelty judgment from review-ers (e.g., from reviewing papers) as ground-truth,would have low cross-annotator agreement, espe-cially when the qualification and background ofreviewers are diverse. Second, one ideal solutionis to create a dataset of test-of-time awarded pa-pers, which are selected by widely-known highlyqualified experts in the field, based on its impactafter more than ten years since its publication.However, this process takes too long since we needto wait for ten years to determine if a paper standsthe test of time or not.

In this paper, we explore evaluation methodswhich do not use gold labels for comparison, assimilarly done in other NLP tasks: such as auto-mated essay scoring (Burstein et al., 2004) andrepresentation learning (Schnabel et al., 2015).Specifically, we consider the following counter-factual simulations, in place of human annota-tions: First, we ask models if they can differentiatepapers at time t and counterfactual paper from fu-ture time t + d, where d is a large time gap. Wefound human annotators agree counterfactual pa-

124

Classification

Ranking

…

2011 2012 2015

<nov( ) nov( )

nov( μ2011 ) nov( μ2012 ) nov( μ2015 ) nov( μ2016 )< < <

2016

(a) Evaluation through time

April 2011 papers

cites: 0 cites: 5 cites: 73 cites: 621

Regression

Feature selection

nov( μ2011 ) nov( μ2012 ) nov( μ2015 ) nov( μ2016 )< < <

acccitation_prediction( μ20) acccitation_prediction( μ2011 )< + nov( )

(b) Evaluation through impact

Figure 1: Intuition behind our proposed evaluationmetrics for research paper novelty detection, wherenov(·) is the novelty detection model.

per from future is more novel, which suggest thistime proxy simulation can capture one aspect ofnovelty. Second, we ask models if they can predicttest-of-time award at t+ d. As test-of-time awardis a sparse signal, we augment with future cita-tions, which we empirically observe to correlatewith the award signal. Using this impact proxy,we evaluate whether the model can differentiatepapers with high citations and those with fewer ci-tations in the future. Through this, novelty detec-tion can be treated as four NLP tasks: classifica-tion, ranking, correlation, and feature selection, asshown in Figure 1. Throughout the paper, we ex-plain why and in what conditions we can evaluatemodels through these proxies.

2 Novelty Detection in Texts

Text novelty detection is a task to identify theset of relevant and noel texts from an ordered setof document swithin a certain topic (Voorhees,2004). Text novelty is defined as providing newinformation that has not yet been found in anyof the previously seen sentences. Most systems(Blott et al., 2004; Zhang et al., 2007) used TF-IDF metric as an importance value. Other nov-elty detection systems used entities such as namedentities as features (Jaleel et al., 2004; Tsai and

Zhang, 2011).Unlike text novelty, research paper novelty is

rather complex. While there is no clear and precisedefinition, Kaufer and Geisler (1989) attempted todescribe research paper novelty into the points be-low:

1. Static: Novelty in a research paper is less aproperty of ideas than a relationship amongresearch communities and ideas. It is less anindividual trait than the consistency of the pa-per to the research community and structure.

2. Dynamic: Novelty in a research paper is de-fined by how much the introduction of thispaper “changes” the overall relationship.

• Research papers are novel if they mas-tered a set of conventions which en-ables the research community to usepast ideas and go beyond in the future.• Novelty in a research paper is a short-

hand for the standards used to contributeto the growth of a specific disciplinaryresearch community.• Novelty in a research paper is a balance

between the inertia of past ideas and thedrive to contribute to the ideas.

The above description tells us that novelty innormal texts and research papers are not the same.While in normal texts, it is necessary that thereis new information to be considered novel, in re-search papers, the relationship among various re-search entities (e.g., authors, ideas, contents, etc.)are more important, as also shown in the literature(Ding et al., 2013, 2014; Song and Ding, 2014;Amplayo and Song, 2016). We thus argue, andshow in our experiments, that novelty detectionmodels that use entity-based features are more ef-fective for research paper novelty detection.

3 Novelty Detection Models

Most models for novelty detection can be de-scribed in two parts, First, a feature extractionmodule is used to select useful features. Second,a novelty scoring module is used to output a scorethat represents how novel the text is.

3.1 Feature Extraction ModulesWe summarize the feature extraction methods inFigure 2. There are two types of features: Nor-mal text features are features that are commonly

125

used in novelty detection for text which are not re-search papers. Citation features are features thatmake use of citation information (i.e., the refer-ence section) that are usually available in researchpapers.

Normal Text Features Feature extraction fromnormal text include TF-IDF features (Blott et al.,2004) and word co-occurrence features (Gamon,2006). Since these methods are primarily used todetect novelty in normal texts, they do not con-sider the existence of the relationship between cit-ing and cited papers.

• tfidf (Blott et al., 2004; Zhang et al.,2007): product of the term frequency and thelogarithm of the inverse of the relative docu-ment frequency.

• cooccur (Gamon, 2006): transforms textinto a graph using the fact that two wordswithin a window size are connected with anedge and extracts twenty-one features basedon the node and edge statistics (e.g., numberof new edges, ratio of node to edge, etc.).

Citation Features Feature extraction from cita-tion information uses the idea that there is a directrelationship between the cited paper and the cit-ing paper (Amplayo et al., 2018). Instead of usingco-occurrence graphs, as in cooccur, these mod-els use citation graphs to extract the features fromthe paper, where information from cited and citingpapers are connected using a directed edge. Thereare two kinds of citation features: (a) Metadata-based citation features, in which we create an edgebetween two metadata information such as authorsand papers, and (b) Entity-based citation features(Amplayo and Song, 2016), in which we createan edge between two entities extracted from thetext content, such as keywords, latent topics, andwords. Features are then obtained using the samemethod as in (Gamon, 2006).

• paper: simple citation graph where thenodes are papers and the edges are citationrelationships between the papers: If paper acites paper b then an edge from b to a exists.

• author: simple citation graph where thenodes are authors and the edges are citationrelationships from the authors of the cited pa-per to the authors of the citing paper.

• keyword: a citation graph where the nodesare RAKE-generated keywords (Rose et al.,2010) extracted from the citing and cited in-formation1, and the edges are connected fromthe cited keywords to the citing keywords.

• topic: a citation graph where the nodesare LDA topic vectors (Blei et al., 2003) ex-tracted from both the citing and cited infor-mation, and the edges are connections fromthe cited to citing topics.

• word: a citation graph where the nodes arelowercased and lemmatized nouns from bothciting information and cited information.

3.2 Novelty Scoring Module

The features extracted by the models describedabove can be compared over a common noveltydetector. Majority of previous work use autoen-coders, neural networks that are used to learn ef-ficient codings in an unsupervised manner. Whenused as a novelty detector (Japkowicz et al., 1995),autoencoders encode the input features x into en-codings and try to decode an output x′ such that xand x′ are equal.

This idea can also be leveraged to detect re-search paper novelty as well. First, we train theautoencoder using features extracted from papersof the training data. The papers in the training dataare papers from the past and represent the currentknown research ideas and communities. Then attest time, for each unseen paper p, we extract itsinput feature xp using the feature extraction mod-ule. The autoencoder accepts xp as input, andoutputs x′p. The novelty score of the paper is thecloseness of xp and x′p, which is calculated as theroot of the sum of the squared difference of xpand x′p. If xp and x′p are nearly identical, thenthe features extracted from paper p have alreadybeen expected by the model; thus p is not novel.Otherwise, p contains new information and henceis considered novel. The autoencoder is then re-trained to include the current paper p for the nextunseen paper.

4 Dataset of Research Papers

We gathered computer science research papers asan evaluation dataset from arXiv, an online repos-

1Hereon, the citing information refers to the abstract ofthe paper, while the cited information refers to the snippetcontaining the corresponding in-text citation

126

Input Paper:Normal text features Citation features

tfidf

cooccur

…

w1w2

w3 …

linguistictypology

studies

current graph paper graph

+

featureextraction

• number of new edges• number of added background edges• number of background edges• number of background vertices• number of connecting edges• sum of weights on new edges

…

metadata-based (author, paper)

…

a1a2

a3 …

RCottorellJEisner

PLadefoged


+

featureextraction


…

IMaddieson

entity-based (keyword, topic, word)

…

k1k2

k3 …

human language

phonological …

phonemes


+

featureextraction


…

acoustic space

Citations:

…

current papers

TF-IDF

linguistic 0.32

typology 0.54

studies 0.12

the 0.01

…

Figure 2: The feature extraction methods used by different novelty detection models. We only show author andkeyword as examples for metadata-based and entity-based citation features, respectively, for conciseness.

itory that covers a wide range of computer sci-ence subfields and has a total of forty subcate-gories, ranging from Artificial Intelligence to Sys-tems and Control. This repository is a good sourceof a variety of papers that are accepted or re-jected by conferences and journals, and may con-tain ideas already presented in the past. Each pa-per has its information such as the title, authornames, submission date and abstract, and its fulltext in PDF format. We additionally collected thepublished date information and citation count in-formation from the same tool. The data gatheredconsists of papers from the year 2000 to 2016. Wedivided the data into seed data (year 2000-2005)2,training data (year 2006-2010) and test data (year2011-2016).

5 Counterfactual Simulation on Time

We first evaluate research paper novelty detectionmodels using counterfactual simulations on time.The intuition behind this idea is that, when papersfrom t and t+d are presented to human annotators,they would agree on the latter to be more novel(Section 5.1). Once this holds, we can evaluate asclassification between t and t + d (Section 5.2),

2The seed data is needed by models using citation fea-tures (Amplayo et al., 2018). Other models use both seed andtraining data for training.

or ranking (Section 5.3) among clusters of papersground by a specific time period.

5.1 Preliminary analysis

To test our intuition, we performed a prelimi-nary experiment as follows. We collected pairs ofsimilar papers from ACL 2011 and ACL 2016; weused similarity scores from averaged pretrainedword2vec word vectors (Mikolov et al., 2013) topair the papers. We asked four senior graduate stu-dents who major in natural language processing tojudge which among the pair is more novel or not.To perform a transparent experiment, we did notreveal the data collection process and only showthe title and the abstract of the papers. We gaveannotators three choices: (A) choose paper A, (B)choose paper B, and (C) no idea.

The results show that 42.5% of the time thegraduate students selected the 2016 papers, whilethe 2011 papers were selected only 8.5% of thetime. This indicates that time can be used asa proxy for novelty detection. In the remaining49.0% of the time however, the students had noidea which paper is more novel. We posit thatthis is because they are only experts on a smallsub-area of NLP, which additionally shows the dif-ficulty of creating an expert-annotated evaluationset for the problem.

127

Model Accuracytfidf 55.56cooccur 56.49author 59.46paper 59.48word 59.90keyword 61.11topic 78.19

Table 1: Accuracies of different models on the timeclassification task.

5.2 As a Classification Task

Intuition As mentioned above, the first condi-tion is that the difference between publicationtimes of non-novel and novel papers should belarge. Using this, we can reduce the research pa-per novelty detection task into a classification task,where the task is to classify if a paper is novel ornot. Given a large time difference d, the paperspublished in the past time t are given a “not novel”label while papers published in time t+d are givena “novel” label. If a classification model trainedusing the novelty scores extracted from a noveltydetection model can easily distinguish papers fromdifferent groups, then the model performs well.

Setup We use the 2011 test data as papers pub-lished in time t and the 2016 test data as paperspublished in time t+ d, where the time differenced is five years. Moreover, we extract features us-ing novelty detection models trained only until theyear 2010 training data. That is, we do not retrainusing 2011 papers to extract features from 2016papers. We then train a logistic classifier using theextracted features as input and the novelty label asoutput. We evaluate the classifier by calculating itsaccuracy using 10-fold cross validation. We reportthe average accuracy of the ten subsamples. Notethat since the classifier does not have informationregarding the publication date, it tries to classifypapers knowing that all of them are published atthe same time (i.e., end of 2010).

Results The classification accuracies are re-ported in Table 1. Results show that amongthe competing models, topic performs the bestin terms of accuracy. All entity-based mod-els perform better than other competing mod-els, and among these models, word performsthe worst. This is because some words may notcarry novel information specific to the researchfield. The metadata-based models, author andpaper, perform similarly but perform worse than

Model 1 year 6 months 3 months 1 monthtfidf -0.029 -0.100 -0.099 -0.001cooccur 0.771 0.682 0.549 0.384author 0.543 0.536 0.362 0.177paper 0.771 0.882 0.859 0.821word 1.000 0.982 0.911 0.835keyword 1.000 0.945 0.930 0.863topic 1.000 0.991 0.986 0.979

Table 2: Ranking results. Spearman correlation coeffi-cient ρ for each period each novelty detection modelsin comparison. Bold-faced numbers are the best scores.

the entity-based models. Finally, tfidf andcooccur perform the worst among all models.Note that the lower bound accuracy where we as-sume all papers are not novel is 55.56%. tfidfmodel performs eqaully with the lower bound,which shows that the classifier trained using TF-IDF features is not able to distinguish novel andnon-novel papers.

5.3 As a Ranking Task

Intuition The second condition mentionedabove is that evaluation through time should bedone in clusters of papers, instead of individualpapers, which are grouped by publication time.Using this condition, we can reformulate thenovelty detection task into a ranking task, wherewe are tasked to rank the clusters by their pub-lication time, using the average novelty score ofpapers in the cluster. Formally, let P1, P2, ..., Pn

be the sequential sets of papers grouped by givenmultiple publication time period t1, t2, ..., tn.Then, the average novelty scores of the papersin the groups should be sequentially increasing,i.e., µ(P1) < µ(P2) < ... < µ(Pn), where µ(P )stands for the average novelty scores of researchpapers in set P . We can then use a correlationfunction to measure the monotonicity betweenthe relationship of the period and the mean of thepapers in that period.

Setup In this setup, we use four types of publica-tion time periods, i.e., 1 year, 6 months, 3 months,and 1 month. We extract features using modelstrained until the year 2010 training data to viewall papers as papers published at the same time.Through this setup, the model does not have in-formation regarding the publication date. Henceit ranks papers knowing that all of them are pub-lished at the same time (i.e., end of 2010). We useall the test data from years 2011 to 2016 and di-vide them according to the different periods. We

128

Scenario BEST #cites ave. #cites % belowA 210 76.8 92.0%B 34 25.0 86.3%

Table 3: Results of preliminary analysis for the impactproxy. The BEST refers to the best paper.

calculate the Spearman rank-order correlation co-efficient ρ as the rank-order correlation function.

Results The results are shown in Table 2. Over-all, the entity-based models perform the bestamong all competing models. Interestingly, allentity-based models show a perfect correlation co-efficient when using a 1-year time period. Amongthe entity-based models, topic performs the beston all periods. One interesting result is that asthe period gets smaller (i.e., from annually tomonthly), the correlation coefficient gets smaller.We argue that this is because as the period getssmaller, the evaluation condition above gets weak-ened and thus evaluation gets unreliable. Thetfidf produces a negative correlation, whichmeans the novelty scores are decreasing and thuscontradicts to the intuition described above.

6 Counterfactual Simulation on Impact

Another way to evaluate research paper nov-elty detection models is performing counterfactualtest-of-time prediction at time t. However, awardannotation at t+d is sparsely annotated to very fewpapers. We observe the number of citation countsthe paper is correlated with award prediction, toaugment citation for impact proxy. As a specificexample, if both papers a and b are published atthe same time, and paper a received more citationsthan paper b, then paper a is more novel than b.Through this, we can evaluate models both intrin-sically by assuming the number of citation countcorrelates with the novelty of the paper (Section6.2) and extrinsically by using the novelty scoresas features for the citation count prediction task(Section 6.3).

6.1 Preliminary AnalysisTo test our idea, we performed a preliminary ex-periment to check if novelty correlates with cita-tions with the condition that they are publishedat the same time. We considered papers that re-ceived best paper awards as a rough estimate of apaper that is more novel than papers published atthe same time. We looked at two different scenar-ios: (A) ACL 2011 best paper versus other ACL

Model 1-Month 3-Month 5-Month AVGtfidf -0.043 -0.013 0.014 -0.014cooccur 0.066 0.098 0.018 0.060author 0.070 0.128 0.023 0.074paper 0.079 0.097 0.034 0.070word 0.123 0.076 0.045 0.081keyword 0.332 0.461 0.271 0.355topic 0.137 0.204 0.093 0.145

Table 4: Correlation results. Pearson correlation coef-ficient between different novelty detection models andcitation counts. Bold-faced values are top three values.

2011 papers (total of 163 papers), and (B) ICML2011 best paper versus 300 arXiv papers3 with up-load dates closest to the ICML best paper.

We compared the best paper and other papersusing two methods. First, we calculated the aver-age number of citations of other papers and com-pared it with the number of citations of the bestpapers. Second, we also calculated the percentageof papers with citations below the number of cita-tions of the best paper. Results are shown in Table3. Results show that in both scenarios, both papershave higher citations than average. Moreover, atleast 86.3% of the papers have a lower number ofcitations than the best paper. This shows that cita-tions can be used as a proxy for novelty detection.

6.2 As a Correlation TaskIntuition We use citation counts as labels to per-form evaluation, assuming that the novelty scorecorrelates with the citation count of the paper. Onecondition must be considered: Papers should al-ready be mature, that is, enough time should havebeen given to the paper to be exposed to the re-search community. Using these assumptions, wecan simplify the task into a correlation task wherewe check the relationship between the noveltyscores and the citation counts.

Setup To assure paper maturity, papers pub-lished recently are not used for evaluation. In-stead, we use papers that are published approxi-mately five years before the time data gatheringwas done, i.e., since we gathered the data on April2016, we use papers that are published near theApril 2011 date. We consider three sizes of win-dows: 1-month window where we consider onlyApril 2011 papers, and 3/5-month windows wherewe can consider March to May 2011 papers and

3We use arXiv papers for comparison not only for the di-versity of scenarios but also since most ICML 2011 papers(including the best paper) were uploaded to arXiv before theconference started.

129

February to June 2011 papers, respectively. Weuse the Pearson correlation coefficient as the eval-uation metric. We also report the average correla-tion.

Results We show the correlation coefficientscores of all competing models in Table 4. The ta-ble shows that the entity-based models outperformall the other models. Among them, keywordperforms the best on all window sizes. tfidfperforms the worst among the models, garneringnegative correlation when the window size is 1/3-month, which means that the novelty scores pro-duced by the model do not correspond to the cita-tion count.

6.3 As a Feature Selection Task

Intuition We evaluate the novelty detectionmodels by measuring their contribution to the ci-tation count prediction task, where we are givena paper and its information, and we are tasked topredict the number of citations the paper receiveafter a particular given time. Previous works haveattempted to use content features (Yan et al., 2011;Chakraborty et al., 2014) to solve the task. We ar-gue that the novelty scores can also be treated ascontent features, and the novelty detection modelthat produces the most useful content feature canbe regarded as the best model for the citation countprediction task. Using this intuition, we can refor-mulate the task as a feature selection task.

Setup Following Yan et al. (2011), we perform afeature selection study on the novelty scores pro-duced by all competing models. Specifically, welook at the performance of a citation count predic-tion model, both (a) when only one novelty scoreis isolated as a single feature, and (b) when thesame novelty score is dropped and all other nov-elty scores are used as features. We use five typ-ical models for citation count prediction: linearregression (LR), k-nearest neighbors (KNN), deci-sion trees (DT), support vector machines (SVM),and multilayer perceptron (MLP). To train thesemodels, for each year from 2011 to 2015, we usethe first five months as training data, and the sixthmonth as test data, obtaining five training and testdatasets. We use the coefficient of determinationR2 as the evaluation metric. Finally, we also re-port the average of results on the five datasets.

Results The results are shown in Table 5.keyword performs the best among all models,

having scores included in the best three scoresfor all cases. topic also performs well, havingnine out of ten scores in the top three. Interest-ingly, both author and word perform compara-bly, having six and five out of ten scores in the topthree, respectively. We posit that this is due to pre-vious findings (Yan et al., 2011; Chakraborty et al.,2014) that author-based features (e.g., h-index, au-thor rank) are informative when predicting citationcounts. authormay have learned author-specificbiases in its novelty scores. Finally, tfidf per-forms the worst, consistently having a zero scoreon all classifiers when used in isolation. Thismeans that tfidf does not carry any importantsignal to predict citation counts.

7 Discussions

The evaluation metrics are complementary toeach other We compared existing models usingfour evaluation methods. These evaluation meth-ods have different characteristics that complementeach other. For one thing, evaluation throughtime prefers topically new papers, thus evaluationwould be problematic on published papers that areoff-topic or published at a low-tier conference orjournal. However, this problem would not existwhen evaluating through impact, because off-topicand low-tier papers normally do not have lots ofcitations. Moreover, evaluation through citationprefers papers that have many citations, hence itcan be problematic on less impactful yet highlycited papers, such as survey papers. This is nota problem when evaluating through time, becausethe topics discussed in survey papers are not new.

The individual tasks also complement eachother. For example, the classification task providesa direct comparison of individual paper noveltysince evaluation is done at paper-level, while theranking task is not very interpretable since evalua-tion is done in groups. However, the classificationtask is not able to evaluate papers between two pe-riods t and t+d, while the ranking task makes useof all data. It is therefore recommended that allfour evaluation methods are used for evaluation.

Novelty in normal texts and research papersare different The results from different evalu-ation methods presented in the paper consistentlyshow that models which are originally used for de-tecting novelty in normal text, especially tfidf,do not perform well in the task. This contradictsto previously reported strong results (Voorhees,

130

Classifier tfidf cooccur author paper word keyword topicwhen feature is isolated

LR 0.000 0.007 0.091 0.082 0.042 0.104 0.086KNN 0.000 0.007 0.010 0.002 0.003 0.027 0.015DT 0.000 0.002 0.000 0.000 0.038 0.023 0.003SVM 0.000 0.007 0.084 0.082 0.042 0.104 0.086MLP 0.000 0.006 0.100 0.083 0.075 0.091 0.090

when feature is droppedLR 0.110 0.110 0.111 0.120 0.109 0.083 0.108KNN 0.038 0.044 0.045 0.038 0.036 0.035 0.026DT 0.067 0.067 0.039 0.067 0.000 0.055 0.060SVM 0.114 0.115 0.107 0.119 0.113 0.098 0.104MLP 0.107 0.105 0.107 0.105 0.103 0.104 0.098

Table 5: Feature selection results. R2 score of each feature (novelty score) when in isolation and when droppedfrom an all-features model. When in isolation, the larger R2 is, the better. When dropped, the smaller R2 is, thebetter. Bold-faced values are top three values per row.

2004) on normal text novelty detection task, wherethe model obtained state of the art performance.This supports the fact that the novelty in normaltexts and in research papers is different. Since in-formation in a research paper is often not com-pletely new, the tfidf model always gives asmall novelty score to newer papers, which ex-plains the negative results.

Entity-based citation graphs are better featureextractors On all evaluation metrics, modelswith features extracted from entity-based citationgraphs, i.e. keyword and topic obtain the bestperformance. One possible explanation to this isthat features extracted from entity-based citationgraphs best reflect the changes brought upon bythe paper on the relationship between entities inthe background knowledge (Ding et al., 2013).This is defined as the static and dynamic nature ofresearch paper novelty in Section 2. On the otherhand, normal text novelty detection models cap-ture features that are different from research papernovelty, and metadata-based citation graph mod-els capture only the static nature of research papernovelty.

Limitations in models and evaluation One ma-jor limitation of the existing methods is that theydo not consider the possible existence of biasesfrom other factors. For example, an applicationof a basic machine learning model on other non-computing fields such as philosophy or psychol-ogy may have different intensities of novelty de-pending on the field of the venue where the pa-per is published. Also, papers from multidisci-plinary fields can receive more citations becausetwo or more research communities are readingthem. These factors may affect the evaluation pre-

sented in this paper.Regarding evaluation, expert-annotated labels,

though difficult to acquire in scale, are still moredesirable than our suggested estimates. The idealway to create this dataset is to gather papers eval-uated for the test of time award, an award givento papers published in the past 10 or more yearsfor their significant contribution by a committeeof influential researchers of the community. Thisensures that the annotators are experts and the pa-pers have shown a noticeable impact. However,this method is not efficient as it will need at leastten years to annotate a small number of papers.Moreover, this does not include annotations of pa-pers with less intensity of novelty.

8 Conclusion

We described counterfactual simulations to evalu-ate the research paper novelty detection task by us-ing time and impact as proxies. Using these meth-ods, we evaluate features used in existing models,to find entity-based citation features, compared tonormal text features, are stronger signals to pre-dict and explain novelty. We finally provided dis-cussions on the advantages and disadvantages ofusing these evaluation metrics and the future di-rection of this research.

Acknowledgement

This work is supported by Micrsoft Research Asia.

References

Reinald Kim Amplayo, SuLyn Hong, and Min Song.2018. Network-based approach to detect novelty ofscholarly literature. Inf. Sci., 422:542–557.

131

Reinald Kim Amplayo and Min Song. 2016. Build-ing content-driven entity networks for scarce scien-tific literature using content information. In Pro-ceedings of the Fifth Workshop on Building andEvaluating Resources for Biomedical Text Mining,BioTxtM@COLING 2016, Osaka, Japan, December12, 2016, pages 20–29.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan.2003. Latent dirichlet allocation. J. Mach. Learn.Res., 3:993–1022.

Stephen Blott, Fabrice Camous, Paul Ferguson,Georgina Gaughan, Cathal Gurrin, Gareth J. F.Jones, Noel Murphy, Noel E. O’Connor, Alan F.Smeaton, Peter Wilkins, Oisın Boydell, and BarrySmyth. 2004. Experiments in terabyte search-ing, genomic retrieval and novelty detection forTREC 2004. In Proceedings of the Thirteenth TextREtrieval Conference, TREC 2004, Gaithersburg,Maryland, USA, November 16-19, 2004.

Jill Burstein, Martin Chodorow, and Claudia Leacock.2004. Automated essay evaluation: The criteriononline writing service. AI Magazine, 25(3):27–36.

Tanmoy Chakraborty, Suhansanu Kumar, PawanGoyal, Niloy Ganguly, and Animesh Mukherjee.2014. Towards a stratified learning approach to pre-dict future citation counts. In IEEE/ACM Joint Con-ference on Digital Libraries, JCDL 2014, London,United Kingdom, September 8-12, 2014, pages 351–360.

Ying Ding, Min Song, Jia Han, Qi Yu, Erjia Yan,Lili Lin, and Tamy Chambers. 2013. Entitymet-rics: Measuring the impact of entities. CoRR,abs/1309.2486.

Ying Ding, Guo Zhang, Tamy Chambers, Min Song,Xiaolong Wang, and Chengxiang Zhai. 2014.Content-based citation analysis: The next generationof citation analysis. JASIST, 65(9):1820–1833.

Michael Gamon. 2006. Graph-based text representa-tion for novelty detection. In Proceedings of theFirst Workshop on Graph Based Methods for Natu-ral Language Processing, TextGraphs-1, pages 17–24, Stroudsburg, PA, USA. Association for Compu-tational Linguistics.

Robert Harris. 2002. Anti-plagiarism strategies for re-search papers. Virtual salt, 7.

Qi He, Jian Pei, Daniel Kifer, Prasenjit Mitra, andC. Lee Giles. 2010. Context-aware citation rec-ommendation. In Proceedings of the 19th Interna-tional Conference on World Wide Web, WWW 2010,Raleigh, North Carolina, USA, April 26-30, 2010,pages 421–430.

Nasreen Abdul Jaleel, James Allan, W. Bruce Croft,Fernando Diaz, Leah S. Larkey, Xiaoyan Li,Mark D. Smucker, and Courtney Wade. 2004.Umass at TREC 2004: Novelty and HARD. In

Proceedings of the Thirteenth Text REtrieval Con-ference, TREC 2004, Gaithersburg, Maryland, USA,November 16-19, 2004.

Nathalie Japkowicz, Catherine Myers, and Mark A.Gluck. 1995. A novelty detection approach to clas-sification. In Proceedings of the Fourteenth Inter-national Joint Conference on Artificial Intelligence,IJCAI 95, Montreal Quebec, Canada, August 20-251995, 2 Volumes, pages 518–523.

Xiaonan Ji, Alan Ritter, and Po-Yin Yen. 2017. Us-ing ontology-based semantic similarity to facilitatethe article screening process for systematic reviews.Journal of Biomedical Informatics, 69:33–42.

David S Kaufer and Cheryl Geisler. 1989. Nov-elty in academic writing. Written Communication,6(3):286–311.

Xiang Liu, Torsten Suel, and Nasir D. Memon. 2014.A robust model for paper reviewer assignment. InEighth ACM Conference on Recommender Systems,RecSys ’14, Foster City, Silicon Valley, CA, USA -October 06 - 10, 2014, pages 25–32.

Cheng Long, Raymond Chi-Wing Wong, Yu Peng,and Liangliang Ye. 2013. On good and fair paper-reviewer assignment. In 2013 IEEE 13th Interna-tional Conference on Data Mining, Dallas, TX, USA,December 7-10, 2013, pages 1145–1150.

Romans Lukashenko, Vita Graudina, and Janis Grund-spenkis. 2007. Computer-based plagiarism detec-tion methods and tools: an overview. In Pro-ceedings of the 2007 International Conference onComputer Systems and Technologies, CompSysTech2007, Rousse, Bulgaria, June 14-15, 2007, page 40.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S.Corrado, and Jeffrey Dean. 2013. Distributed rep-resentations of words and phrases and their com-positionality. In Advances in Neural InformationProcessing Systems 26: 27th Annual Conference onNeural Information Processing Systems 2013. Pro-ceedings of a meeting held December 5-8, 2013,Lake Tahoe, Nevada, United States., pages 3111–3119.

Stuart Rose, Dave Engel, Nick Cramer, and WendyCowley. 2010. Automatic keyword extraction fromindividual documents. pages 1–20.

Tobias Schnabel, Igor Labutov, David M. Mimno, andThorsten Joachims. 2015. Evaluation methods forunsupervised word embeddings. In Proceedings ofthe 2015 Conference on Empirical Methods in Nat-ural Language Processing, EMNLP 2015, Lisbon,Portugal, September 17-21, 2015, pages 298–307.

Min Song and Ying Ding. 2014. Topic Modeling: Mea-suring Scholarly Impact Using a Topical Lens, pages235–257. Springer International Publishing, Cham.

Flora S. Tsai and Yi Zhang. 2011. D2S: document-to-sentence framework for novelty detection. Knowl.Inf. Syst., 29(2):419–433.

132

Ellen M. Voorhees. 2004. Overview of TREC 2004. InProceedings of the Thirteenth Text REtrieval Con-ference, TREC 2004, Gaithersburg, Maryland, USA,November 16-19, 2004.

Rui Yan, Jie Tang, Xiaobing Liu, Dongdong Shan,and Xiaoming Li. 2011. Citation count prediction:learning to estimate future citations for literature. InProceedings of the 20th ACM Conference on Infor-mation and Knowledge Management, CIKM 2011,Glasgow, United Kingdom, October 24-28, 2011,pages 1247–1252.

Kuo Zhang, Juan Zi, and Li Gang Wu. 2007. Newevent detection based on indexing-tree and namedentity. In SIGIR 2007: Proceedings of the 30th An-nual International ACM SIGIR Conference on Re-search and Development in Information Retrieval,Amsterdam, The Netherlands, July 23-27, 2007,pages 215–222.

133


Do Sentence Interactions Matter? Leveraging Sentence LevelRepresentations for Fake News Classification

Vaibhav, Raghuram Mandyam Annasamy, Eduard HovyLanguage Technologies Institute

School of Computer ScienceCarnegie Mellon University

[email protected], [email protected], [email protected]

Abstract

The rising growth of fake news and misleadinginformation through online media outlets de-mands an automatic method for detecting suchnews articles. Of the few limited works whichdifferentiate between trusted vs other types ofnews article (satire, propaganda, hoax), noneof them model sentence interactions within adocument. We observe an interesting patternin the way sentences interact with each otheracross different kind of news articles. Tocapture this kind of information for long newsarticles, we propose a graph neural network-based model which does away with the needof feature engineering for fine grained fakenews classification. Through experiments, weshow that our proposed method beats strongneural baselines and achieves state-of-the-artaccuracy on existing datasets. Moreover, weestablish the generalizability of our model byevaluating its performance in out-of-domainscenarios. Code is available at https://github.com/MysteryVaibhav/fake_news_semantics.

1 Introduction

In today’s day and age of social media, thereare ample opportunities for fake news production,dissemination and consumption. Rashkin et al.(2017) break down fake news into three categories,hoax, propaganda and satire. A hoax article typi-cally tries to convince the reader about a cooked-up story while propaganda ones usually misleadthe reader into believing a false political or so-cial agenda. Burfoot and Baldwin (2009) definesa satirical article as the one which deliberatelyexposes real-world individuals, organisations andevents to ridicule.

Previous works (Rubin et al., 2016; Rashkinet al., 2017) rely on various linguistic and hand-crafted semantic features for differentiating be-tween news articles. However, none of them try

(a) Trusted (b) Satirical

Figure 1: TSNE visualization (Van Der Maaten, 2014)of sentence embeddings obtained using BERT (Devlinet al., 2019) for two kind of news articles from SLN.A point denotes a sentence and the number indicateswhich paragraph it belonged to in the article.

to model the interaction of sentences within thedocument. We observed a pattern in the way sen-tences cluster in different kind of news articles.Specifically, satirical articles had a more coherentstory and thus all the sentences in the documentseemed similar to each other. On the other hand,the trusted news articles were also coherent but thesimilarity between sentences from different partsof the document was not that strong, as depictedin Figure 1. We believe that the reason for suchkind of behaviour is the presence of factual jumpsacross sections in a trusted document.

In this work, we propose a graph neuralnetwork-based model to classify news articleswhile capturing the interaction of sentences acrossthe document. We present a series of experi-ments on News Corpus with Varying Reliabil-ity dataset (Rashkin et al., 2017) and SatiricalLegitimate News dataset (Rubin et al., 2016).Our results demonstrate that the proposed modelachieves state-of-the-art performance on thesedatasets and provides interesting insights. Experi-ments performed in out-of-domain settings estab-lish the generalizability of our proposed method.

134

Dataset Trusted (# Docs) Satire (# Docs) Hoax (# Docs) Propaganda (# Docs)

LUN-train GN except ‘APW’ and ‘WPB’ (9,995) The Onion (14,047) American News (6,942) Activist Report (17,870)LUN-test GN only ‘APW’ and ‘WPB’ (750) The Borowitz Report, Clickhole (750) DC Gazette (750) The Natural News (750)

SLN The Toronto Star, The NY Times (180) The Onion, The Beaverton (180) - -RPN WSJ, NBC, etc (75) The Onion, The Beaverton, etc (75) - -

Table 1: Statistics about different dataset sources. GN refers to Gigaword News.

Figure 2: Proposed semantic graph neural network based model for fake news classification.

2 Related Work

Satire, according to Simpson (2003), is compli-cated because it occupies more than one place inthe framework for humor, proposed by Ziv (1988):it clearly has an aggressive and social function,and often expresses an intellectual aspect as well.Rubin et al. (2016) defines news satire as a genreof satire that mimics the format and style of jour-nalistic reporting. Datasets created for the task ofidentifying satirical news articles from the trustedones are often constructed by collecting docu-ments from different online sources (Rubin et al.,2016). McHardy et al. (2019) hypothesized thatthis encourages the models to learn characteristicsfor different publication sources rather than char-acteristics of satire. In this work, we show thatour proposed model generalizes to articles fromunseen publication sources.

Rashkin et al. (2017) extends Rubin et al.(2016)’s work by offering a quantitative study oflinguistic differences found in articles of differ-ent types of fake news such as hoax, propagandaand satire. They also proposed predictive mod-els for graded deception across multiple domains.Rashkin et al. (2017) found that neural methodsdidn’t perform well for this task and proposed touse a Max-Entropy classifier. We show that ourproposed neural network based on graph convo-lutional layers can outperform this model. Re-cent works by Yang et al. (2017); De Sarkar et al.(2018) show that sophisticated neural models canbe used for satirical news detection. To the bestof our knowledge, none of the previous works rep-resent individual documents as graphs where thenodes represent the sentences for performing clas-

sification using a graph neural network.

3 Dataset and Baseline

We use SLN: Satirical and Legitimate NewsDatabase (Rubin et al., 2016), RPN: Random Po-litical News Dataset (Horne and Adali, 2017) andLUN: Labeled Unreliable News Dataset Rashkinet al. (2017) for our experiments. Table 1 showsthe statistics. Since all of the previous methodson the aforementioned datasets are non-neural, weimplement the following neural baselines,

• CNN: In this model, we apply a 1-d CNN(Convolutional Neural Network) layer (Kim,2014) with filter size 3 over the word em-beddings of the sentences within a document.This is followed by a max-pooling layer toget a single document vector which is passedto a fully connected projection layer to get thelogits over output classes.

• LSTM: In this model, we encode the doc-ument using a LSTM (Long Short-TermMemory) layer (Hochreiter and Schmidhu-ber, 1997). We use the hidden state at thelast time step as the document vector whichis passed to a fully connected projection layerto get the logits over output classes.

• BERT: In this model, we extract the sen-tence vector (representation correspondingto [CLS] token) using BERT (BidirectionalEncoder Representations from Transform-ers) (Devlin et al., 2019) for each sentencein the document. We then apply a LSTMlayer on the sentence embeddings, followed

135

by a projection layer to make the predictionfor each document.

4 Proposed Model

Capturing sentence interactions in long documentsis not feasible using a recurrent network becauseof the vanishing gradient problem (Pascanu et al.,2013). Thus, we propose a novel way of encod-ing documents as described in the next subsection.Figure 2 shows the overall framework of our graphbased neural network.

4.1 Input RepresentationEach document in the corpus is represented asa graph. The nodes of the graph represent thesentences of a document while the edges repre-sent the semantic similarity between a pair of sen-tences. Representing a document as a fully con-nected graph allows the model to directly capturethe interaction of each sentence with every othersentence in the document. Formally,

eij = Similarity(si, sj) (1)

We initialize the edge scores using BERT (De-vlin et al., 2019) finetuned on the semantic textualsimilarity task1 for computing the semantic sim-ilarity (SS) between two sentences. Refer to theSupplementary Material for more details regard-ing the SS model. Note that this representationdrops the sentence order information but is betterable to capture the interaction between far off sen-tences within a document.

4.2 Graph based Neural NetworksWe reformulate the fake news classification prob-lem as a graph classification task, where a graphrepresents a document. Given a graphG = (E,S)where E is the adjacency matrix and S is the sen-tence feature matrix. We randomly initialize theword embeddings and use the last hidden state of aLSTM layer as the sentence embedding, shown inFigure 2. We experiment with two kinds of graphneural networks,

4.2.1 Graph Convolution Network (GCN)The graph convolutional network (Kipf andWelling, 2017) is a spectral convolutional opera-tion denoted by f(Z l, E|W l),

Z l+1 = f(Z l, E|W l) (2)

f(Z l, E|W l) = σ(EZ lW l) (3)1Task 1 of SemEval-2017

Here, Z l is the output feature corresponding to thenodes after lth convolution. W l is the parameterassociated with the lth layer. We set Z0 = S.Based on the above operation, we can define ar-bitrarily deep networks. For our experiments, wejust use a single layer unless stated otherwise. Bydefault, the adjacency matrix (E) is fully con-nected i.e. all the elements are 1 except the di-agonal elements which are all set to 0. We set Ebased on semantic similarity model in our GCN +SS model. For the GCN + Attn model, we just adda self attention layer (Vaswani et al., 2017) afterthe GCN layer and before the pooling layer.

4.2.2 Graph Attention Network (GAT)Velickovic et al. (2018) introduced graph atten-tion networks to address various shortcomings ofGCNs. Most importantly, they enable nodes toattend over their neighborhoods’ features withoutdepending on the graph structure upfront. The keyidea is to compute the hidden representations ofeach node in the graph, by attending over its neigh-bors, following a self-attention (Vaswani et al.,2017) strategy. By default, there is one attentionhead in the GAT model. For our GAT + 2 AttnHeads model, we use two attention heads and con-catenate the node embeddings obtained from dif-ferent heads before passing it to the pooling layer.For a fully connected graph, the GAT model al-lows every node to attend on every other node andlearn the edge weights. Thus, initializing the edgeweights using the SS model is useless as they arebeing learned. Mathematical details are providedin the Supplementary Material.

4.3 HyperparametersWe use a randomly initialized embedding matrixwith 100 dimensions. We use a single layer LSTMto encode the sentences prior to the graph neuralnetworks. All the hidden dimensions used in ournetworks are set to 100. The node embedding di-mension is 32. For GCN and GAT, we set σ asLeakyRelU with slope 0.2. We train the models fora maximum of 10 epochs and use Adam optimizerwith learning rate 0.001. For all the models, weuse max-pool for pooling, which is followed by afully connected projection layer with output nodesequal to the number of classes for classification.

5 Experimental Setting

We conduct experiments across various settingsand datasets. We report macro-averaged scores in

136

Figure 3: Attention heatmaps generated by GAT for 2-way classification. Left: Trusted, Right: Satire.

Model Precision Recall

CNN 67.5 67.5LSTM 82.2 81.4BERT 78.1 78.1SoTA* (Rubin et al., 2016) 88.0 82.0

Our Models

GCN 85.9 85.0GCN + SS 86.4 86.3GCN + Attn 87.1 86.9GCN + Attn + SS 87.8 87.8GAT 86.2 86.1GAT + 2 Attn Heads 89.1 88.9

Table 2: 2-way classification results on SLN. *n-foldcross validation (precision, recall) as reported in SoTA.

all the settings.

2-way classification b/w satire and trustedarticles: We use the satirical and trusted newsarticles from LUN-train for training, and fromLUN-test as the development set. We evaluateour model on the entire SLN dataset. This isdone to emulate a real-world scenario where wewant to see the performance of our classifier onan out of domain dataset. We don’t use SLN fortraining purposes because it just contains 360examples which is too little for training our modeland we want to have an unseen test set. The bestperforming model on SLN is used to evaluate theperformance on RPN.

4-way classification b/w satire, propaganda,hoax and trusted articles: We split the LUN-train into a 80:20 split to create our training anddevelopment set. We use the LUN-test as our outof domain test set.

6 Results

Table 2 shows the quantitative results for the twoway classification between satirical and trustednews articles. Our proposed GAT method with 2attention heads outperforms SoTA. The semantic

Model LUN-dev LUN-test

CNN 96.48 54.04LSTM 88.75 55.05BERT 95.07 54.87SoTA* (Rashkin et al., 2017) 91.0 65.0

Our Models

GCN 96.76 65.0GCN + Attn 97.57 67.08GAT 97.28 65.51GAT + 2 Attn Heads 97.82 66.95

Table 3: 4-way classification results for different mod-els. We only report F1-score following the SoTA paper.

similarity model does not seem to have much im-pact on the GCN model, and considering the com-puting cost, we don’t experiment with it for the4-way classification scenario. Given that we useSLN as an out of domain test set (just one over-lapping source, no overlap in articles), whereasthe SoTA paper (Rubin et al., 2016) reports a 10-fold cross validation number on SLN. We believethat our results are quite strong, the GAT + 2 AttnHeads model achieves an accuracy of 87% on theentire RPN dataset when used as an out-of-domaintest set. The SoTA paper (Horne and Adali, 2017)on RPN reports a 5-fold cross validation accuracyof 91%. These results indicate the generalizabilityof our proposed model across datasets. We alsopresent results of four way classification in Table3. All of our proposed methods outperform SoTAon both the in-domain and out of domain test set.

To further understand the working of our pro-posed model, we closely inspect the attentionmaps generated by the GAT model for satirical andtrusted news articles for the SLN dataset. FromFigure 3, we can see that the attention map gen-erated for the trusted news article only focuseson two specific sentence whereas the attentionweights are much more distributed in case of asatirical article. Interestingly enough the high-lighted sentences in case of the trusted news ar-ticle were the starting sentence of two different

137

paragraphs in the article indicating the presence ofsimilar sentence clusters within a document. Thisopens a new avenue for understanding the differ-ences between different kind of text articles for fu-ture research.

7 Conclusion

This paper introduces a novel way of encoding ar-ticles for fake news classification. The intuitionbehind representing documents as a graph is moti-vated by the fact that sentences interact differentlywith each other across different kinds of article.Recurrent networks are unable to maintain longterm dependencies in large documents, whereasa fully connected graph captures the interactionbetween sentences at unit distance. The quantita-tive result shows the effectiveness of our proposedmodel and the qualitative results validate our hy-pothesis about difference in sentence interactionacross different articles. Further, we show that ourproposed model generalizes to unseen datasets.

Acknowledgement

We would like to thank the AWS Educate programfor donating computational GPU resources used inthis work. We also appreciate the anonymous re-viewers for their insightful comments and sugges-tions to improve the paper.

Supplementary Material

The supplementary material is available2 alongwith the code which provides mathematical detailsof the GAT model and few additional qualitativeresults.

ReferencesClint Burfoot and Timothy Baldwin. 2009. Automatic

satire detection: Are you having a laugh? In Pro-ceedings of the ACL-IJCNLP 2009 conference shortpapers, pages 161–164. Association for Computa-tional Linguistics.

Sohan De Sarkar, Fan Yang, and Arjun Mukherjee.2018. Attending sentences to detect satirical fakenews. In Proceedings of the 27th International Con-ference on Computational Linguistics, pages 3371–3380.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. BERT: Pre-training of2https://github.com/MysteryVaibhav/

fake_news_semantics/blob/master/EMNLP2019_TextGraphs_Supplementary.pdf

deep bidirectional transformers for language under-standing. In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers),pages 4171–4186, Minneapolis, Minnesota. Associ-ation for Computational Linguistics.

Sepp Hochreiter and Jurgen Schmidhuber. 1997.Long short-term memory. Neural computation,9(8):1735–1780.

Benjamin D Horne and Sibel Adali. 2017. This just in:fake news packs a lot in title, uses simpler, repetitivecontent in text body, more similar to satire than realnews. In Eleventh International AAAI Conferenceon Web and Social Media.

Yoon Kim. 2014. Convolutional neural networks forsentence classification. In Proceedings of the 2014Conference on Empirical Methods in Natural Lan-guage Processing (EMNLP), pages 1746–1751.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutionalnetworks. In International Conference on LearningRepresentations (ICLR).

Robert McHardy, Heike Adel, and Roman Klinger.2019. Adversarial training for satire detection: Con-trolling for confounding variables. In Proceed-ings of the 2019 Conference of the North AmericanChapter of the Association for Computational Lin-guistics: Human Language Technologies, Volume1 (Long and Short Papers), pages 660–665, Min-neapolis, Minnesota. Association for ComputationalLinguistics.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio.2013. On the difficulty of training recurrent neuralnetworks. In International conference on machinelearning, pages 1310–1318.

Hannah Rashkin, Eunsol Choi, Jin Yea Jang, SvitlanaVolkova, and Yejin Choi. 2017. Truth of varyingshades: Analyzing language in fake news and polit-ical fact-checking. In Proceedings of the 2017 Con-ference on Empirical Methods in Natural LanguageProcessing, pages 2931–2937.

Victoria Rubin, Niall Conroy, Yimin Chen, and SarahCornwell. 2016. Fake news or truth? using satiricalcues to detect potentially misleading news. In Pro-ceedings of the Second Workshop on ComputationalApproaches to Deception Detection, pages 7–17.

Paul Simpson. 2003. On the discourse of satire: To-wards a stylistic model of satirical humour, vol-ume 2. John Benjamins Publishing.

Laurens Van Der Maaten. 2014. Accelerating t-sne us-ing tree-based algorithms. The Journal of MachineLearning Research, 15(1):3221–3245.

138

Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In Advances in neural information pro-cessing systems, pages 5998–6008.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova,Adriana Romero, Pietro Lio, and Yoshua Bengio.2018. Graph attention networks. In InternationalConference on Learning Representations.

Fan Yang, Arjun Mukherjee, and Eduard Dragut. 2017.Satirical news detection and analysis using atten-tion mechanism and linguistic features. In Proceed-ings of the 2017 Conference on Empirical Methodsin Natural Language Processing, pages 1979–1989,Copenhagen, Denmark. Association for Computa-tional Linguistics.

Avner Ziv. 1988. Teaching and learning with humor:Experiment and replication. The Journal of Experi-mental Education, 57(1):4–15.

139


Faceted Hierarchy: A New Graph Type to Organize Scientific Conceptsand a Construction Method

Qingkai Zeng†, Mengxia Yu†, Wenhao Yu†, Jinjun Xiong‡, Yiyu Shi†, Meng Jiang††Department of Computer Science and Engineering,

University of Notre Dame, Notre Dame, Indiana, USA‡IBM T. J. Watson Research Center, Yorktown Heights, NY USA

†qzeng, myu2, wyu1, yshi4, [email protected],‡[email protected]

AbstractOn a scientific concept hierarchy, a parent con-cept may have a few attributes, each of whichhas multiple values being a group of child con-cepts. We call these attributes facets: clas-sification has a few facets such as applica-tion (e.g., face recognition), model (e.g., svm,knn), and metric (e.g., precision). In thiswork, we aim at building faceted concept hi-erarchies from scientific literature. Hierar-chy construction methods heavily rely on hy-pernym detection, however, the faceted rela-tions are parent-to-child links but the hyper-nym relation is a multi-hop, i.e., ancestor-to-descendent link with a specific facet “type-of”.We use information extraction techniques tofind synonyms, sibling concepts, and ancestor-descendent relations from a data science cor-pus. And we propose a hierarchy growth algo-rithm to infer the parent-child links from thethree types of relationships. It resolves con-flicts by maintaining the acyclic structure of ahierarchy.

1 Introduction

Concept hierarchies play an important role inknowledge discovery from scientific literature.Concepts are expected to be organized in a hi-erarchical structure like chapters-to-sections-to-subsections in textbooks. In this work, we proposea new representation of scientific concept hierar-chy, called faceted concept hierarchy. Under thishierarchy, the links should not only carry parent-to-child relations but also the semantic relations(facets) between the concepts. Figure 1 presents apart of the faceted hierarchy in Data Science. Theparent node is “classification” and the child con-cepts of it are excepted to be grouped into threefacets, each of which has three child-concepts.One example of the faceted relation we define isas follows:

parent(“classification”, “svm”): “models”,

support_vector_machines; svm

decision_trees

fraud_detection

text_categorization

k_nearest_neighbor; knn classification face

_recognition

precision recallf1_measure;f1_score

models

metrics

applications

machine_learning

python spark tensorflow regression clustering ……

tools problems

Figure 1: The idea of faceted concept hierarchyfrom Data Science publications: For student learn-ing, concepts are expected to be organized in a hi-erarchical structure. For example, here the ninechild-concepts of “classification” (in dashed lineblocks) should be grouped into three facets (“mod-els”, “applications”, and “metrics”).

which is more complete than “type-of” relations inthe works that focused on taxonomy or ontologyinduction (Liang et al., 2017; Gupta et al., 2017;Zhang et al., 2018; Liu et al., 2018) like this:

type-of (“svm”, “classification model”).

The basic units of the hierarchy include conceptnodes and their parent-to-child relations. Threetypes of essential structural relations are expressedin paper texts and can be used to infer the parent-to-child relations. The relation types include (1)synonym (concept names on the same node), (2)sibling (concept nodes having the same parent),and (3) ancestor-to-descendant (nodes on the di-rect descending line). The task of hierarchy con-struction has three challenges. First, there is nosufficient human annotated data or available dis-tant supervisory sources to feed into (deep) learn-ing models. It is necessary to extract the conceptsand relations in an unsupervised manner. Second,the extracted relations could be noisy at the longtail of the frequency distribution. When inferringthe parent-to-child relations, the algorithm shouldconsider the trustworthiness of the synonym, sib-

140

ling, and ancestor relations. Also, it is important todetect redundant or conflicting relations (links) onthe hierarchy. Third, it requires a joint process ofclustering child-concepts into the parent concept’sfacets and identifying words as facet indicators.

We propose a novel framework HiGrowth thatgrows faceted hierarchies from literature data. Theframework has five modules: (M1) scientific con-cept extraction, (M2) concept typing, (M3) hierar-chical relation extraction, (M4) hierarchy growth,and (M5) facet discovery. The M1–M3 NLP mod-ules were implemented in an unsupervised man-ner. First, we use two complementary keyphrasemining tools to extract concepts: one is rule-basedand the other is a statistical learning method. Sec-ond, we use a KNN-based method, simple and ef-fective, to assign types (e.g., $Problem, $Method)to the concepts. Third, we use textual patternsto extract the hierarchical relations (i.e., synonym,sibling, and ancestor). To address the second chal-lenge, we design an efficient algorithm that growsa concept hierarchy by scanning the set of relationtuples (sorted by their frequency from the high-est to the lowest) just once and inferring parent-to-child relations. This algorithm will be able toidentify unnecessary, invalid, and redundant linksduring the process of hierarchy growth in spite ofserious noise at the long tail. Finally, we use aword clustering method to discover the facets ofevery parent concept and assign child concepts toeach of the facets.

Thirty-two junior/senior students who took theData Science course in Spring 2018 were askedto manually label the parent-child concept pairs.We finalize a set as ground-truth if the pair was la-belled by more than half of the participants. TheF1 score of building the parent-to-child links is0.73. The F1 score of 2-hop paths is 0.69. Bothprecision values are above 0.99, showing that thelinks in the hierarchy are precise because of thecareful design of the growth algorithm, but thepattern-based methods have limitations of findingall possible relations.

2 HiGrowth – Part I: InformationExtraction Components (M1 – M3)

2.1 Data Description

We collected full text, all sections including ab-stract, introduction, and experiments, of 5,850 pa-pers in the proceedings of ACM SIGKDD 1994–2015, IEEE ICDM 2001–2015, The Web Confer-

Table 1: Neighboring words for concept typing.Type Triggers on the left Triggers on the right$Problem problem, problems task, tasks, demands$Method methods method, model, algorithm$Object number function$Metric measure, terms measures, scores, values

support_vector_machine;support_vector_machines;

SVM; SVMs

synonym(“support_vector_machines”,“SVMs”)

sibling(“precision”,“recall”)

ancestor(“machine_learning”, “SVM”)

precision recall

machine_learning

SVM

……

…

descendant

childchild

Figure 2: Three types of hierarchical relations.

ence 2001–2015 and ACM WSDM 2008–2015.

2.2 M1: Scientific Concept Extraction

We use phrase mining tools, AutoPhrase (Shanget al., 2018) & SCHBase (Adar and Datta, 2015),to extract scientific concepts from data sciencepapers. AutoPhrase adopted distant supervisionand large-scale statistical techniques; SCHBasefocused on a tendency to expand keyphrases byadding terms, coupled with a pressure to abbrevi-ate to retain succinctness in academic writing.

2.3 M2: Concept Typing

We use a simple but effective method to clas-sify the concepts into four types: $Problem (e.g.,“fraud detection”), $Method (e.g., “svm”), $Ob-ject (e.g., “frequent patterns”), and $Metric (e.g.,“accuracy”). We assume that the neighboring non-stop word indicates the concept’s type, for exam-ple, the trigger word “problem” in the sentence“. . . the problem of fraud detection” suggests that“fraud detection” is a $Problem. We manually se-lect a set of 20 trigger words that indicate concepttypes when they appear left/right next to the con-cepts. Table 1 shows a few examples. If in the textone concept has a left/right neighboring word inthe set, the corresponding type gets one vote. Foreach concept, we count the votes on every type anduse the strategy of majority voting (MajVot) to de-termine the predicted type (i.e., the most voted).

2.4 M3: Hierarchical Relation Extraction

In order to find the relations in an unsupervisedmanner on the scientific text, we use textual pat-terns, mainly Hearst Patterns (Hearst, 1992), to

141

accurately extract three types of hierarchical rela-tions, where X and Y are two concept names:• synonym(X , Y ), if X and Y will be included

in the same concept node on the hierarchy;• sibling(X , Y ), if the concept nodes of X andY will have the parent node;• ancestor(X , Y ), if there will be a path from

the concept node of X to the node of Y .Note that synonym and sibling relations are sym-metric, while ancestor-to-descendant is asymmet-ric (see Figure 2).Find synonym(X , Y ). Two ideas to find synonymconcepts: First, the plural form of a noun or noun-phrase concept can be considered as a synonym,for example, we have synonym(“SVM”, “SVMs”)and synonym(“decision tree”, “decision trees”).Second, the abbreviation inside of parenthesescan be considered as a synonym of the fullname preceding the parenthesis. We have syn-onym(“support vector machines”, “SVMs”) fromtext “. . . Support Vector Machines ( SVMs ). . . ”.Find ancestor(X , Y ). Hearst patterns such as• X such as Y1, Y2, . . . , (and|or) Yn,• X , including Y1, Y2, . . . , (and|or) Yn,

have been often used to find “isA” relation orcalled hypernym for taxonomy construction: Yi(e.g., “dog”) is a kind of X (e.g., “mammal”).However, we expect to extract faceted hierarchi-cal relations such as• ancestor(“machine learning”, “SVM”): models;• ancestor(“machine learning”, “classification”): tasks;• ancestor(“classification”, “SVM”): models;

instead of• isA(“machine learning models”, “SVM”);• isA(“machine learning tasks”, “classification”);• isA(“classification models”, “SVM”),

if the text contains• . . .machine learning models such as SVM. . . ;• . . .machine learning tasks such as classification. . . ;• . . . classification models such as SVM. . . ,

especially when “machine learning” has been ex-tracted as a concept. Note that we are not confi-dent to say every relation given by pattern match-ing is parent-to-child. We denote the relation asancestor. We expect that “machine learning” con-nects to “SVM” through “classification” on the hi-erarchy instead of a direct connection.

Therefore, we modify the patterns as below:• X <pl> such as Y1, . . . , (and|or) Yn,• X <pl> including Y1, . . . , (and|or) Yn,

where <pl> is the plural form of a noun or nounphrase, e.g., “models” and “tasks”. We extract an-cestor(X , Yi) from the above patterns. We will

XAX DX

YAY DY

PX

PY

CX

CY

childchild

descendant descendant

childchild

descendant descendant

synonym(X, Y) sibling(X, Y) ancestor(X, Y)

— PX ∩PY≠ ∅ Y ∈DXó X ∈AY

PX ∩PY≠ ∅OR Y ∈DXOR X ∈DY

Y ∈DXOR X ∈DY

PX ∩PY≠ ∅ORCX ∩CY≠ ∅OR X ∈DY

If matched none of the above conditions, then

the hierarchy H with the relation between X and Y.UPDATE

PASS(no need toadd)

STOP(should notadd)

Figure 3: Check if the relation is unnecessary(“PASS”) or invalid (“STOP”) for updatingH.

infer concrete parent-to-child relations and parentconcept’s facets in the next section.Find sibling(X , Y ). Shorter patterns in which theancestor concept names are missing occur morefrequently in the text, for example:• (<pl>) such as Y1, Y2, . . . , (and|or) Yn,• (<pl>) including Y1, Y2, . . . , (and|or) Yn,

and other patterns like• Y1, Y2, . . . , and|or (other) <pl>.We extract sibling(Yi, Yj) from these patterns.

The number of sibling relations is more than thenumber of the ancestor relations, and the siblingrelations, e.g., sibling(“precision”, “recall”), bringuseful information to hierarchy induction, say, Yiand Yj have the same parent concept node.

We use the strategy of majority voting to chooseone relation type for each pair of concepts. We as-sume that a pair of concepts can have no or onlyone relation among synonym, sibling, and ances-tor. However, the relational extractions may stillbe noisy due to the long tail. Next we discuss howto construct a high-quality concept hierarchy froma set of the three types of relations with noise.

3 HiGrowth – Part II: Hierarchy Growth(M4) and Facet Discovery (M5)

3.1 M4: The Hierarchy Growth Algorithm

Given a set of relations rel(X , Y ) and their sup-port (i.e., frequency), construct a hierarchy H inwhich the links are directional indicating parent-

142

X, Y ∉ H

Y ∉ HXAX DX

XAX DX YAY DY

X, YAX DX

X, YAX∪AY

DX∪DY

X, Y

(a) GrowH with symmetric synonym(X , Y ).

X, Y ∉ H

Y ∉ HXPX

XNIL

Ychild

child

child

descendant XPX chil

d

descendant

Ychild

XPX chil

d

descendant

YPY chil

d

descendant XPX∪PY

child

descendant

Ychild

descendan

teliminate NILs

(b) GrowH with symmetric sibling(X , Y ).X, Y ∉ H

Y ∉ HX

X Ydescendant

X Ydescendant

YX ∉ H YX descendant

X Y YX descendant

remove redundant links

(c) GrowH with asymmetric ancestor(X , Y ).

Figure 4: Hierarchy growth with a new relation.

X

A

child/descendant

Y

child/descendant

descendant

A

X

child/descendant

Y

descendant

X

Y

descendant

A

child/descendant

descendant

descendant

Figure 5: Remove redundancy when adding ances-tor(X , Y ) to the hierarchy.

to-child relations between concepts, where rel ∈synonym, sibling, ancestor. H should have nounnamed nodes, and have no unnecessary or in-valid or redundant links. Specifically, the unnec-essary means that the relation is correct but it doesnot contain extra information for the hierarchy.We will define these characteristics when we in-troduce each step of it in details. An overview ofthe algorithm comes as below.• InitializeH as empty;• Scan the relations from the highest support to

the lowest:– Check if the relation is unnecessary or

invalid to be added into H (see Fig-ure 3). Skip it if yes.

ANIL

Xchild

child

YC

Bchild

child

ANIL

Xchild

YC

Bchild

A

Xchild

YC

B

childsibling(X, Y)

eliminate NILsif non-NIL exists

XNIL

Ychild

child

A descendant

XNIL

Ychild

A child

XA

Ychild

child

specializedescendantto child eliminate NILs

When adding a new sibling relation into the hierarchy:

When post-processing descendant relations in the hierarchy:

descendant

Figure 6: Two scenarios that NIL nodes can beeliminated when finalizing the hierarchy.

– Grow the hierarchy H with this relation(see Figure 4).

– Remove redundant links when the rela-tion is ancestor (see Figure 5).

• Narrow down ancestor relations to parent-to-child when the scan completes (see Figure 6).

We denote different sets of connected nodesgiven a concept node X as below (see Figure 3):• PX is the set of parent nodes of X: there is

at least one direct link from ∀Z ∈ PX to X;• CX is the set of child nodes of X: there is at

least one direct link from X to ∀Z ∈ CX ;• AX is the set of ancestor nodes of X: there

is at least one path but no direct link from∀Z ∈ AX to X;• DX is the set of descendant nodes ofX: there

is at least one path but no direct link from Xto ∀Z ∈ DX .

Check if a relation is invalid (Figure 3). Given anew relation synonym(X , Y ), if there has been anyother relation between X and Y such as ancestor(i.e., X ∈ DY or Y ∈ DX ) or sibling (i.e., PX ∩PY 6= ∅), this new relation is invalid to be addedto the H. Given sibling(X , Y ), if X and Y haveat least one parent, we skip; if there has been anancestor relation between X and Y , the siblingrelation is invalid. Given ancestor(X , Y ), if therehas been path from X to Y (i.e., Y ∈ DX ), weskip it; if there has been a sibling relation (i.e.,PX ∩ PY 6= ∅) or a descendant relation (i.e., X ∈DY ), the ancestor relation is invalid.Grow the hierarchyH with a new relation (Fig-ure 4). We sort valid relations by their frequen-cies. For synonym(X , Y ), we merge node X andY in H: if neither was in H, we create a new iso-lated node named “X , Y ”; if one of them existedin H, we update the name of the existing node as“X , Y ”; if both existed, we merge their ancestornodes as the new ancestor nodeAX ∪AY , and we

143

Table 2: Data science concept examples extractedby two complementary phrase mining tools.

Keyword Count Keyphrase Countclustering 22,607 data mining 8,120classification 19,488 machine learning 4,195accuracy 18,108 feature selection 3,320

(a) AutoPhrase finds quality keywords/phrases of goodstatistical features (e.g., frequency, concordance).

Keyword Count Keyphrase CountSVM 5,774 least squares support vector machine 4LDA 3,548 root-mean-square error 3AUC 2,542 block coordinate gradient descent 1

(b) Some typical examples of acronyms and abbreviationexpansions found by SCHBase.

merge their descendant nodes as the new descen-dant node DX ∪ DY .

For sibling(X , Y ), if neither of the concepts ex-isted, we create a “NIL” node as the parent nodeto each concept node; if one of them existed, foreach parent node in PX , we add Y as a child nodeof it; if both existed, we merge their parent nodesas the parent node of each and eliminate the NILs.

For ancestor(X , Y ), we add a descendant linkfrom X to Y . When X and Y are in H, we elimi-nate the NILs and remove the redundant links.

When adding a new relation sibling(X , Y ), wemerge their parent nodes. If there has been at leastone non-NIL node in the set of parent nodes, weremove the NILs. When adding an ancestor nodeof either X or Y , if they share a NIL parent node,we remove the NIL node.Remove redundant links when growing withancestor(X , Y ) (Figure 5). On the concept hi-erarchy, we allow only one path from an ancestornode to a descendant node. Therefore, when weadd a new ancestor(X , Y ), there are three situa-tions of having a redundant link. First, if there hasbeen a path from X to Y , the new relation is re-dundant. For example, suppose on H, A (“svm”)is a descendant node of X (“classification”) andY (“ls-svm”) is a descendant node of A (“svm”).Then a new relation ancestor(“classification”, “ls-svm”) is actually inferable so it is redundant. Wedo not add it to the hierarchy. For the other twosituations, we also remove the existing, redundantlink in the hierarchy.

3.2 M5: Facet Word Discovery using ContextWord Clustering

For each parent node, we decompose a 3-ordertensor, child node by type of child node by con-text word, in which each entry is the count ofthe context word (e.g., “models”) used in the pat-

Table 3: Performance of concept typing.Accuracy (True/False)

MajVot 0.874 (188/27)MajVot+Grouping 0.963 (207/8)

Table 4: False type predictions in red.Concept Prediction Ground Truthtopic model $Method $Methodtopic models (synonym) $Problem $Methodmean absolute error $Metric $Metricarea under the curve (sibling) $Method $Metric

(a) MajVot: 27 false predictions.

Concept Prediction Ground Truthfrequent patterns $Object $Objectprincipal components $Method $Objectinformation gain $Metric $Objectcluster analysis $Method $Problem

(b) MajVot+Grouping: 8 false predictions.

terns (e.g., “$Problem:classification models suchas $Method:naıve bayes and $Method:svm”) thatindicate the semantic relation between the parentnode (e.g., “classification”) and child node (e.g.,“svm”). The decomposition assigns a cluster ofcontext words to each group of child nodes. Weconsider the most frequent context word in thecluster as the facet of the child nodes group. Wefind three groups of context words:• “algorithm(s)”,“model(s)”,“method(s)”,

“approach(es)”, “technique(s)”. . . ;• “application(s)”,“problem(s)”,“task(s)”. . . ;• “measure(s)”,“metric(s)”. . . .

4 Experiments

We conduct experiments to answer three ques-tions: (1) Are the three NIP modules effective inextracting hierarchical relations? (2) Does the hi-erarchy growth algorithm generate a hierarchy ofbetter quality than existing methods? Are NILnodes and redundant link removal necessary? (3)What does the result hierarchy look like?

4.1 Results on Three IE Components

M1: Scientific concept extraction. Table 2shows examples of data science concepts the toolsextracted. The learning module in AutoPhrase cansegment words and phrases of good statistical fea-tures like high frequency. There is often no am-biguity when we lowercase them but the phraselengths tend to be short. SchBase has a differentphilosophy: it looks for abbreviation expansionsthat could be long and of very low frequency. Weshow some case studies in Table 2. For result of

144

Table 5: Number of concepts of each type.$Problem $Method $Object $Metric

Count (predicted) 52 104 9 50Count (ground truth) 53 100 13 49

Table 6: Number of relations for each type.synonym sibling ancestor

# unique concept pairs 41 234 138# extractions 1,966 1,379 381

AutoPhrase, some 1-gram and n-gram high qual-ity phrase are in Table 2a. For results of SchBase,some acronyms and typical abbreviation expan-sions we selected are in Table 2b. With these twocomplementary tools, we harvest a collection of215 data science concepts.M2: Concept typing. Table 3 shows that theaccuracy of concept typing (a 4-class classifi-cation task) is 0.874. Table 4a gives two ofthe 27 MajVot’s false predictions. We observethat some synonym/sibling concept names like“topic model” and “topic models” have inconsis-tent predicted types due to the sparsity of theirneighboring words. Therefore, we leverage thesynonym/sibling relations discovered in the nextsubsection to group the related concept namestogether and determine their type based on theneighboring words of all the concepts in the group(called MajVot+Grouping). The accuracy is im-proved significantly to 0.963. Table 4b showsthree of the 8 false cases among 215 predictions.Table 5 shows the number of concepts of each typewe have for hierarchy induction.M3: Hierarchical relation extraction. Table 6gives the number of relation tuples we extractedfor each type. The relation synonym has the high-est number of extractions while sibling gives themost unique concept pairs.

4.2 Results on Hierarchy Quality Evaluation

Evaluation metrics. Based on the manually la-belled parent-to-child relations, we evaluate thequality of the resulting hierarchy with three stan-dard IR metrics, precision, recall, and F1 score,on extracting concept pairs that have a 0-hop path(i.e., synonyms), a 1-hop path (i.e., “parent-to-child” relation), and a 2-hop path (i.e., ancestorrelation as parent’s parent). A higher score meansbetter performance.Baseline method. It is not fair to compare withtaxonomy construction methods because we aretargeting a different problem, that is to generatea concept hierarchy of facets with three kinds of

Table 7: Comparing HiGrowth with baselines onbuilding hierarchy from data science literature.

Method Path Precision Recall F10-hop 1.0 .5034 .6697

TAXI 1-hop 1.0 .4004 .57192-hop 1.0 .1831 .3095

HiGrowth 0-hop 1.0 .5294 .6923w/o “NIL” 1-hop .9482 .4583 .6179

2-hop .9499 .3038 .46030-hop 1.0 .5294 .6923

HiGrowth 1-hop .9926 .5781 .73072-hop .9987 .5253 .6885

hierarchical relations. Therefore, we choose tocompare with a hierarchy induction method, calledTAXI (Panchenko et al., 2016), and we feed it withall the relations we mined so that we only com-pare on the performance of hierarchy induction al-gorithms. However, TAXI has no module to con-sider the sibling relations but we have the “NIL”mechanism. TAXI goes through all the relationsseveral times, removes cycles, and links discon-nected components to the root, while we considerthe relation weights and generate the hierarchy in agrowth manner for one scan. Therefore, comparedwith TAXI, HiGrowth is a more efficient algorithmon generating a facet concept hierarchy.

Quality analysis. As shown in Table 7, Hi-Growth consistently outperforms TAXI on all threekinds of paths: it improves synonym detection by3.4%, parent relation extraction by 27.8%, and 2-hop ancestor relation extraction by from 0.31 to0.69. Actually, the HiGrowth variant that disabledthe generation and removal of “NIL” node canstill outperform TAXI because the hierarchy growswith relations from the most confident to the leastconfident. With the “NIL” nodes, HiGrowth im-proves the 1-hop relation by 18.3% and 2-hop re-lation by 49.6%. This shows that it is important tocarefully consider the sibling relations.

4.3 Results on Removing Redundant Links

Figure 7 presents redundant links that HiGrowthskipped or removed when adding a new relationancestor(X , Y ) for each of the three situations, re-spectively. The most common situation is that, wehave ancestor(A,X) and ancestor(A, Y ) in the hi-erarchy, and now we have a new link to specify therelation between X and Y , two descendants of A.IfX is an ancestor of Y , we remove the redundantlink ancestor(A, Y ). We can see a few examplesof the 93 redundant relations. A is a more gen-

145

X

A

child/descendant

Y

child/descendant

descendant

A

X

child/descendant

Y

descendant

X

Y

descendant

A

child/descendant

descendant

descendant

X A Ydata_mining classification naïve_bayesdata_mining classification decision_treesdata_mining classification support_vector_

machinesdata_mining dimensionality_

reductionprincipal_component_analysis

… … …ensemble boosting adaboost

A X Ydata_science data_mining clusteringdata_science machine_

learninglogistic_regression

data_science data_mining outlier_detection

data_science machine_learning

random_forests

… … …machine_learning

supervised_learning

neural_networks

classification decision_tree CARTclassification decision_tree C4.5

X Y Adata_mining frequent_pattern

_miningfp-growth

data_mining frequent_pattern_mining

apriori

7

93

2

Figure 7: The redundant links that the HiGrowth algorithm removed during hierarchy construction.

eral (ancestor-level) concept. The frequency of Ais often higher than the frequency of X or Y . Theweights of ancestor(A,X) and ancestor(A, Y ) arebigger than the weight of ancestor(X , Y ). So thelatter relation will be added to the hierarchy whenthe other two have been on the hierarchy.

4.4 Visualizing the Concept Hierarchy

Figure 8 presents the concept hierarchy that Hi-Growth extracted from the Data Science publica-tions. The hierarchy is not very large but stillnot visible in one page, so we enlarge three partsof the hierarchy, including (1) a set of conceptsas the “measures” facet of “binary classification,”(2) the “applications” and “algorithms” facetsof the concept “classification,” and (3) the “al-gorithms” of “community detection,” the “tech-niques” of “matrix factorization,” and the “meth-ods” of “feature extraction” and “dimensionalityreduction.” We represent the relations of syn-onyms by adding different surface names for same

entities in one node. For example, “topic models”and “topic model” are merged into one node inFigure 8 because they have the same semanticmeaning.

5 Related Work

5.1 Scientific Concept Extraction

Scientific concept extraction is a fundamentaltask (Yu et al., 2019; Jiang et al., 2019). It has beenwidely studied on multiple kinds of text sourcessuch as web documents (Parameswaran et al.,2010), business documents (Menard and Ratte,2016), clinical documents (Jonnalagadda et al.,2012), material science documents (Kim et al.,2017), and computer science publications (Upad-hyay et al., 2018). The phrase mining technolo-gies have been evolving from noun phrase anal-ysis (Evans and Zhai, 1996) to recently popularrepresentation learning methods (Mikolov et al.,2013; Pennington et al., 2014). Here we combined

146

two methodologies that have been demonstrated tobe effective in Science IE (Gabor et al., 2018).

5.2 Hierarchical Relation Extraction

There has been unsupervised methods on hyper-nym discovery and synonym detection (Weedset al., 2014): In this work, we combine precise tex-tual patterns, not only the syntactic patterns (Snowet al., 2005) but also the typed patterns (Nakasholeet al., 2012; Li et al., 2018; Wang et al., 2019)to find synonyms and hypernyms. We considerhypernyms carefully as ancestor-to-descendant in-stead of parent-to-child relations. Synonyms areon the same node, and hypernyms are connectedvia one- or multi-hop path. Moreover, we extractthe sibling relations which precisely describe thenodes on the same level. All the three types ofrelation tuples are important for inferring concepthierarchies.

5.3 Hierarchy Construction and Population

There are two kinds of hierarchy constructionmethods: one is taxonomy or ontology induc-tion that infers “isA” relations by machine learn-ing models (Kozareva and Hovy, 2010; Wu et al.,2012; Yang et al., 2015; Cimiano and Staab,2005), and the other is topical hierarchy discoverythat organizes phrases into topical groups and theninfers hierarchical connections between the topi-cal groups (Wang et al., 2015; Jiang et al., 2017).For the first kind of approaches, researchers usedsyntactic contextual evidence (Anh et al., 2014),belief propagation for population (Bansal et al.,2014), and embedding-based inference (Fu et al.,2014; Nguyen et al., 2014). For the secondpart, poincare embedding and ontology embed-ding methods have been proposed to learn noderepresentations from existing hierarchies (Nickeland Kiela, 2017; Chen et al., 2018).

None of the existing approaches aimed at infer-ring parent-to-child relations based on the threetypes of hierarchical relations (i.e., synonym,ancestor-to-descendant, and sibling). We proposea novel hierarchy growth algorithm that addressesthe issues of noisy, redundant, and invalid links.

6 Conclusions

This paper presented the HiGrowth method thatconstructs a faceted concept hierarchy from litera-ture data. The major focus is on growing a hierar-chy from three kinds of hierarchical relations that

were extracted by pattern-based IE and weightedby their frequency. The hierarchy growth algo-rithm handles unnecessary, invalid and redundantlinks, even the relation set is noisy at the long tail.

Acknowledgements

This work was supported by Natural ScienceFoundation Grant CCF-1901059.

147

data_science

recommendation,recommender_systems

child

data_mining

child

machine_learning

child

knowledge_discovery

child

visualization

child

artificial_intelligence

child

DCG,discounted_cumulative_gain,discounted_cumulative_gain_measure

MAP,mean_average_precision

child

MRR,mean_reciprocal_rank

child

NDCG,normalized_discounted_cumulative_gain,normalized_discounted_cumulative_gain_measure

child

neural_networks

perceptron

child

artificial_neural_networks

child

N/A

closegraph

child

gspan

child

F1,f-measure,f-score,f1-measure,f1-score,f1_measure

macro-f1,macro_f1

child

micro-f1,micro_f1

child

tensor_decomposition,tensor_factorization

PARAFAC

methods

regression,regression_analysis

child

decision_tree,decision_trees

bagging

child

topic_model,topic_models

LDA,latent_dirichlet_allocation,logistic_regression

child

link_prediction

precision

child

text_categorization

support_vector_machine,support_vector_machines,svm,svms

child

bootstrapping

child

classification

algorithms

algorithms

applications

binary_classification

applications

ensemble,ensembles

algorithms

random_forest,random_forests

algorithms

NB,naive_bayes

algorithms

C4.5

algorithms

k_nearest_neighbor,knn

algorithms

CART

algorithms

bayesian_networks

algorithms

linear_svm

algorithms

image_classification

applications

ID3

algorithms

algorithms

algorithms

measures

measures

sensitivity

measures

specificity

measures

recall

measures

NMI,normalized_mutual_information

measures

AUC,area_under_curve,area_under_the_curve

measures

accuracy

measures

purity

measures

f1_score

measures

measures

clustering

density-based_clustering

algorithms

k-means,k-means_clustering

algorithms

SVD,singular_value_decomposition

algorithms

hierarchical_clustering

algorithms

spectral_clustering

algorithms

dbscan

algorithms

k-medoids

algorithms

NMF,non-negative_matrix_factorization,non_negative_matrix_factorization,nonnegative_matrix_factorization

algorithms

algorithms

algorithms

frequent_itemset_mining

apriori

algorithms

fp-growth

algorithms

eclat

algorithms

closet

algorithms

matrix_factorization

techniques

PMF,probabilistic_matrix_factorization

techniques

PCA,principal_component_analysis,principal_components_analysis

techniques

coordinate_descent

techniques

techniques

methods

boosting

methods

methods

semi-supervised,semi-supervised_learning,semi_supervised_learning

supervised,supervised_learning

techniques

techniques

measures

unsupervised,unsupervised_learning

algorithms

algorithms

algorithms

MAE,mean_absolute_error

MLE,maximum_likelihood_estimation

SVR,support_vector_regression

adaboost

methods

metrics

RMSE,root-mean-square_error,root_mean_square_error,root_mean_squared_error

metrics

association_rule_mining

community_detection

algorithms

document_clustering

techniques

outlier_detection

anomaly_detection

GSP

prefixspan

dimensionality_reduction

methods

feature_selection

information_gain

measures

gini_index

measures

sequential_pattern_mining

algorithms

algorithms

SPADE

algorithms

link_recommendation

feature_extraction

methods

methods

similarity_search

linear_regression

spam_detection

fraud_detection

association_mining

approaches

mining_association_rules

multi-class_classification

multi-label_classification

kernel_k-means

entity_recognition

multiclass_classification

TPR,true_positive_rate,truepositive_rate

FPR,false-positive_rate,false_positive_rate

SGD,stochastic_gradient_descent

child

least_squares_support_vector_machine,ls-svm

child

tasks

tasks

tasks

tasks

techniques

tasks

tasks

tasks

tasks

tasks

tasks

tasks

tasks

tasks

tasks

tasks

tasks

tasks

tasks

tasks

tasks

tasks

tasks

tasks

algorithms

algorithms

algorithms

algorithms

algorithms

algorithms

algorithms

CSGD,constrained_stochastic_gradient_descent

child

Figure 8: The resulting faceted concept hierarchy we extracted from Data Science publications, nodesmean the entities with different surface names (synonyms).

148

ReferencesEytan Adar and Srayan Datta. 2015. Building a scien-

tific concept hierarchy database (schbase). In Pro-ceedings of the 53rd Annual Meeting of the Associ-ation for Computational Linguistics and the 7th In-ternational Joint Conference on Natural LanguageProcessing (Volume 1: Long Papers), pages 606–615.

Tuan Luu Anh, Jung-jae Kim, and See Kiong Ng. 2014.Taxonomy construction using syntactic contextualevidence. In Proceedings of the 2014 Conference onEmpirical Methods in Natural Language Processing(EMNLP), pages 810–819.

Mohit Bansal, David Burkett, Gerard De Melo, andDan Klein. 2014. Structured learning for taxon-omy induction with belief propagation. In Proceed-ings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1: Long Pa-pers), pages 1041–1051.

Muhao Chen, Yingtao Tian, Xuelu Chen, Zijun Xue,and Carlo Zaniolo. 2018. On2vec: Embedding-based relation prediction for ontology population. InProceedings of the 2018 SIAM International Confer-ence on Data Mining, pages 315–323. SIAM.

Philipp Cimiano and Steffen Staab. 2005. Learningconcept hierarchies from text with a guided agglom-erative clustering algorithm. In Proceedings of theICML 2005 Workshop on Learning and ExtendingLexical Ontologies with Machine Learning Meth-ods.

David A Evans and Chengxiang Zhai. 1996. Noun-phrase analysis in unrestricted text for informationretrieval. In Proceedings of the 34th annual meetingon Association for Computational Linguistics, pages17–24. Association for Computational Linguistics.

Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, HaifengWang, and Ting Liu. 2014. Learning semantic hier-archies via word embeddings. In Proceedings of the52nd Annual Meeting of the Association for Compu-tational Linguistics (Volume 1: Long Papers), pages1199–1209.

Kata Gabor, Davide Buscaldi, Anne-Kathrin Schu-mann, Behrang QasemiZadeh, Haifa Zargayouna,and Thierry Charnois. 2018. Semeval-2018 task 7:Semantic relation extraction and classification in sci-entific papers. In Proceedings of The 12th Inter-national Workshop on Semantic Evaluation, pages679–688.

Amit Gupta, Remi Lebret, Hamza Harkous, and KarlAberer. 2017. Taxonomy induction using hypernymsubsequences. In Proceedings of the 2017 ACMon Conference on Information and Knowledge Man-agement, pages 1329–1338. ACM.

Marti A Hearst. 1992. Automatic acquisition of hy-ponyms from large text corpora. In Proceedings of

the 14th conference on Computational linguistics-Volume 2, pages 539–545. Association for Compu-tational Linguistics.

Tianwen Jiang, Ming Liu, Bing Qin, and Ting Liu.2017. Constructing semantic hierarchies via fusionlearning architecture. In China Conference on Infor-mation Retrieval, pages 136–148.

Tianwen Jiang, Tong Zhao, Bing Qin, Ting Liu,Nitesh V Chawla, and Meng Jiang. 2019. The roleof “condition”: A novel scientific knowledge graphrepresentation and construction model. In Proceed-ings of the 25th ACM SIGKDD International Con-ference on Knowledge Discovery & Data Mining,pages 1634–1642. ACM.

Siddhartha Jonnalagadda, Trevor Cohen, Stephen Wu,and Graciela Gonzalez. 2012. Enhancing clini-cal concept extraction with distributional semantics.Journal of biomedical informatics, 45(1):129–140.

Edward Kim, Kevin Huang, Adam Saunders, AndrewMcCallum, Gerbrand Ceder, and Elsa Olivetti. 2017.Materials synthesis insights from scientific literaturevia text extraction and machine learning. Chemistryof Materials, 29(21):9436–9444.

Zornitsa Kozareva and Eduard Hovy. 2010. Asemi-supervised method to learn and construct tax-onomies using the web. In Proceedings of the2010 conference on empirical methods in naturallanguage processing, pages 1110–1118. Associationfor Computational Linguistics.

Qi Li, Meng Jiang, Xikun Zhang, Meng Qu, Timothy PHanratty, Jing Gao, and Jiawei Han. 2018. Truepie:Discovering reliable patterns in pattern-based infor-mation extraction. In Proceedings of the 24th ACMSIGKDD International Conference on KnowledgeDiscovery & Data Mining, pages 1675–1684. ACM.

Jiaqing Liang, Yi Zhang, Yanghua Xiao, Haixun Wang,Wei Wang, and Pinpin Zhu. 2017. On the transitivityof hypernym-hyponym relations in data-driven lexi-cal taxonomies. In Thirty-First AAAI Conference onArtificial Intelligence.

Ninghao Liu, Xiao Huang, Jundong Li, and Xia Hu.2018. On interpretation of network embedding viataxonomy induction. In Proceedings of the 24thACM SIGKDD International Conference on Knowl-edge Discovery & Data Mining, pages 1812–1820.ACM.

Pierre Andre Menard and Sylvie Ratte. 2016. Conceptextraction from business documents for software en-gineering projects. Automated Software Engineer-ing, 23(4):649–686.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-rado, and Jeff Dean. 2013. Distributed representa-tions of words and phrases and their compositional-ity. In Advances in neural information processingsystems, pages 3111–3119.

149

Ndapandula Nakashole, Gerhard Weikum, and FabianSuchanek. 2012. Patty: a taxonomy of relationalpatterns with semantic types. In Proceedings ofthe 2012 Joint Conference on Empirical Methodsin Natural Language Processing and ComputationalNatural Language Learning, pages 1135–1145. As-sociation for Computational Linguistics.

Viet-An Nguyen, Jordan L Ying, Philip Resnik, andJonathan Chang. 2014. Learning a concept hier-archy from multi-labeled documents. In Advancesin Neural Information Processing Systems, pages3671–3679.

Maximillian Nickel and Douwe Kiela. 2017. Poincareembeddings for learning hierarchical representa-tions. In Advances in neural information processingsystems, pages 6338–6347.

Alexander Panchenko, Stefano Faralli, Eugen Ruppert,Steffen Remus, Hubert Naets, Cedrick Fairon, Si-mone Paolo Ponzetto, and Chris Biemann. 2016.Taxi at semeval-2016 task 13: a taxonomy induc-tion method based on lexico-syntactic patterns, sub-strings and focused crawling. In SemEval-2016,pages 1320–1327.

Aditya Parameswaran, Hector Garcia-Molina, andAnand Rajaraman. 2010. Towards the web of con-cepts: Extracting concepts from large datasets. Pro-ceedings of the VLDB Endowment, 3(1-2):566–577.


Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren,Clare R Voss, and Jiawei Han. 2018. Automatedphrase mining from massive text corpora. IEEETransactions on Knowledge and Data Engineering,30(10):1825–1837.

Rion Snow, Daniel Jurafsky, and Andrew Y Ng. 2005.Learning syntactic patterns for automatic hypernymdiscovery. In Advances in neural information pro-cessing systems, pages 1297–1304.

Prajna Upadhyay, Ashutosh Bindal, Manjeet Kumar,and Maya Ramanath. 2018. Construction and ap-

plications of teknowbase: a knowledge base of com-puter science concepts. In Companion Proceedingsof the The Web Conference 2018, pages 1023–1030.International World Wide Web Conferences SteeringCommittee.

Chi Wang, Jialu Liu, Nihit Desai, Marina Danilevsky,and Jiawei Han. 2015. Constructing topical hi-erarchies in heterogeneous information networks.Knowledge and Information Systems, 44(3):529–558.

Xueying Wang, Haiqiao Zhang, Qi Li, Yiyu Shi, andMeng Jiang. 2019. A novel unsupervised approachfor precise temporal slot filling from incomplete andnoisy temporal contexts. In The World Wide WebConference, pages 3328–3334. ACM.

Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir,and Bill Keller. 2014. Learning to distinguish hyper-nyms and co-hyponyms. In Proceedings of COL-ING 2014, the 25th International Conference onComputational Linguistics: Technical Papers, pages2249–2259. Dublin City University and Associationfor Computational Linguistics.

Wentao Wu, Hongsong Li, Haixun Wang, and Kenny QZhu. 2012. Probase: A probabilistic taxonomy fortext understanding. In Proceedings of the 2012 ACMSIGMOD International Conference on Managementof Data, pages 481–492. ACM.

Shuo Yang, Yansong Feng, Lei Zou, Aixia Jia,and Dongyan Zhao. 2015. Taxonomy inductionand taxonomy-based recommendations for onlinecourses. In Proceedings of the 15th ACM/IEEE-CSJoint Conference on Digital Libraries, pages 267–268. ACM.

Wenhao Yu, Zongze Li, Qingkai Zeng, and MengJiang. 2019. Tablepedia: Automating pdf table read-ing in an experimental evidence exploration and an-alytic system. In The World Wide Web Conference,pages 3615–3619. ACM.

Chao Zhang, Fangbo Tao, Xiusi Chen, Jiaming Shen,Meng Jiang, Brian Sadler, Michelle Vanni, and Ji-awei Han. 2018. Taxogen: Constructing topicalconcept taxonomy by adaptive term embedding andclustering. In KDD.

150


Graph-Based Semi-Supervised Learning forNatural Language Understanding

Zimeng Qiu†,‡, Eunah Cho‡, Xiaochun Ma‡, William M. Campbell‡† Electrical & Computer Engineering Department, Carnegie Mellon University

‡ Amazon Alexa [email protected], eunahch,mxiaochu,[email protected]

Abstract

Semi-supervised learning is an efficientmethod to augment training data automati-cally from unlabeled data. Development ofmany natural language understanding (NLU)applications has a challenge where unlabeleddata is relatively abundant while labeled datais rather limited.

In this work, we propose transductive graph-based semi-supervised learning models as wellas their inductive variants for NLU. We evalu-ate the approach’s applicability using publiclyavailable NLU data and models. In order tofind similar utterances and construct a graph,we use a paraphrase detection model. Resultsshow that applying the inductive graph-basedsemi-supervised learning can improve the er-ror rate of the NLU model by 5%.

1 Introduction

Natural language understanding (NLU) technol-ogy is an important component for a dialog sys-tem and is commonly used in voice assistants (e.g.,Amazon Alexa, Google Home, Siri). An NLUsystem takes recognized speech input and pro-duces intent, domain, and slots for the utterance tosupport the user request (Tur and De Mori, 2011).For example, for a user request “turn off the lightsin living room” the NLU system might generatedomain Device, intent Light-Control, andslot values of “off” for OffTrigger and “livingroom” for Location.

It is crucial for an NLU system to be able to addfurther support and improve performance in an in-cremental manner. An efficient method for this issemi-supervised learning (SSL), especially whenonly small amount of labeled data is available.In contrast with supervised learning algorithms,SSL algorithms can improve their performance byleveraging information in unlabeled data. Somerecent results (Laine and Aila, 2017; Miyato et al.,

2019; Tarvainen and Valpola, 2017) have shownthat semi-supervised learning could reach perfor-mance of purely supervised learning in certain sce-narios.

Currently, most NLU models rely on the utter-ance text and its annotation to learn domain, in-tent, and slots of the utterance. However, this doesnot scale to unlabeled data. In this work, we aim tofind and represent the relationship between labeledand unlabeled data in a non-Euclidean space, agraph, for SSL. We show that graph-based SSLis a high-performant method which improves anNLU model by leveraging unlabeled data.

In order to represent the labeled and unlabeleddata in a graph, we used a paraphrase detectionmodel. Nodes and edges in the graph representutterances and paraphrase relations respectively.Given the constructed graph, a transductive graphmodel was applied for node classification, whichin our case is intent classification (IC) for eachutterance. We used an NLU Slot Gated Model(SGM) (Goo et al., 2018) to obtain slot labels. Ex-periments on the SNIPS data set show that we canachieve 5% error reduction on the slot error rate.

The rest of the paper is structured as follows:Section 2 reviews work related to our approach.Section 3 describes the graph-based SSL methodswe propose in this paper followed by a descrip-tion of the paraphrase measures used to construct agraph. Section 5 describes the experimental setup.We share results and analysis in Section 6. Section7 shows conclusions.

2 Related Work

Over the past few years, many deep learning ap-proaches have been extended to NLU tasks—e.g.,intent classification and slot filling (Liu and Lane,2016). Manual annotation is costly. Thus, recentwork has turned to SSL in order to achieve similar

151

Figure 1: TGCN model architecture.

performance with much less manually annotateddata compared to purely supervised learning.

Aliannejadi et al. (2017) applies graph-basedsupervised learning of Conditional Random Fields(CRF) for Spoken Language Understanding(SLU) on unaligned data.

Lan et al. (2018) proposes an adversarial multi-task learning method by merging a bidirectionallanguage model (BLM) and a slot tagging model(STM). As a secondary objective, the BLM is usedto learn generalized and unsupervised knowledgewith abundant unlabeled data and improve the per-formance of STM on unseen data samples.

Cho et al. (2019a) generates paraphrases anduses them to enhance the training set in a semi-supervised learning setting for NLU. The aug-mented data is used jointly for domain classifica-tion, intent classification and slot filling.

The recent rise of neural networks has broughtsignificant advances in a large number of machinelearning tasks. While deep learning techniqueshave achieved huge success, their performance onnon-Euclidean data is not as good as on Euclideandata. The complexity of graph structures is a sig-nificant challenge to most of existing deep learn-ing algorithms and this complexity has drawn theattention of community to extend deep learning al-gorithms to graph data which in turn inspired var-ious methods for Graph Neural Networks (GNN)(Kipf and Welling, 2017; Velickovic et al., 2018;Yang et al., 2018; Zhang et al., 2018a; Tran, 2018;Xinyi and Chen, 2019).

GNNs can be applied in a supervised, semi-supervised, or purely unsupervised manner for dif-ferent tasks. For instance, graph convolutionalnetworks (GCN) (Kipf and Welling, 2017) couldbe used in a semi-supervised way for node-levelclassification (Kipf and Welling, 2017), in a su-pervised way for graph-level classification (Zhanget al., 2018b; Ying et al., 2018; Pan et al., 2016,

2017), and in an unsupervised way for graph em-bedding (Hamilton et al., 2017; Kipf and Welling,2016; Pan et al., 2018; Yang et al., 2018).

To the best of our knowledge, our work is thefirst approach to apply a text-based graph struc-ture for SSL for NLU. We evaluate our method ona publicly available data set in order to show itsapplicability.

3 Graph Methods

We propose two transductive graph models forsemi-supervised learning NLU tasks, Text GraphConvolutional Network (TGCN) and Text GraphBeam Search (TeGrabS), as well as their induc-tive versions, Pseudo labeling with TGCN (PL-TGCN) and Pseudo labeling with TeGrabS (PL-TeGrabS).

3.1 Transductive ModelsIn a semi-supervised learning setting,we have the following data sets, D =Xtrain,Xunlabeled,Xtest. In inductive sce-narios, labels of Xtest and Xunlabeled are unknownto model, and the model sees Xunlabeled duringtraining but Xtest is unseen. In transductivescenarios, the model sees Xtest and Xunlabeled attraining time.

In our task, transductive models learn para-phrase patterns among utterances from a givengraph, then they are applied as auxiliary modelsin the NLU model pseudo-labeling pipeline. Bydoing this, we make the determination of input ut-terances labels become a parametric function ofthe features and thus obtain inductive variants oftransductive models.

3.1.1 TeGrabSSimilar to beam search, TeGrabS is a heuristicmethod. For a starting node n, the algorithm keepstrack of k separate transitions; for each transition

152

Auxiliarymodel

Unlablleddata

LablleddataUnlablleddata

predictions

NLUmodel

trainingsetX2

X4

X1

X3

Figure 2: Inductive semi-supervised learning pipeline.

at each time step, random sample a node from thecurrent node’s neighbors as the next node. Thesampling process can be regarded as a Markovchain: the transition is represented by hop fromone node to another with the weight of the edgebetween two nodes is the transition probabilities.

p(n′|n) =Wn,n′ (1)

where n′ is a candidate for next node, Wn,n′ isweight of edge between n and n′. Probability of awhole transition is modelled as equation below:

p(n0, n1, · · · , nm) =

m−1∏

i=0

Wi,i+1 (2)

If a transition does not have available next nodecandidates (i.e., the current node does not have anyneighbors), this transition will be stopped and thebeam width is reduced by 1. The beam searchwill be terminated when all transitions either meetthe maximum number of hops limit or are stoppeddue to the current node not having any neighbors.Pseudocode of TeGrabS is shown as Algorithm ??in the Appendix.

3.1.2 TGCNGCN has been commonly applied on graph datain recent years (Kipf and Welling, 2017; Hamiltonet al., 2017; Zhang et al., 2018b; Ying et al., 2018;Pan et al., 2016, 2017; Kipf and Welling, 2016;Pan et al., 2018; Yang et al., 2018). However, tothe best of our knowledge, few researchers haveattempted to use it for text graphs.

Here we propose our GCN-based transduc-tive text-graph semi-supervised learning model,TGCN. The model architecture is shown in Fig-ure 1. The input is a text graph including nodesX1, X2, · · · , Xi and edges where each noderepresents a unique utterance. We use an embed-ding layer followed by LSTM cells as a featureextractor; Xij , Eij and Hij are token, word em-bedding, and LSTM hidden state for j-th word in

i-th utterance, respectively.

Ei =1

n

∑

w∈Xi

Ew (3)

where n is the length of utteranceXi. We computethe average sum of each token’s hidden state inutterance as the node feature.

Hi =1

n

n∑

j=0

Hij (4)

Inspired by the original GCN architecture de-sign (Kipf and Welling, 2017), the features are fedinto a two-layer graph convolution network. Thefirst graph convolution layer is followed by ReLUunits,

F (l+1) = σ(D−12 AD−

12F (l)W (l)) (5)

where A = A + IN is the adjacency matrix ofthe undirected graph with self-connections added,IN is the identity matrix, Dii =

∑j Aij and W (l)

is a layer-specific trainable weight matrix. σ(·)is an activation function, which is ReLU in ourcase. F (l) is the matrix of activations in l-th layer,F (0) = H .

And the output of second graph convolutionlayer is passed through a softmax layer to get dis-tribution over all classes per node.

Y = Softmax(F (2)) (6)

3.2 Inductive ModelsUsage of transductive models is limited to cer-tain test cases that have been seen by model dur-ing training. For an NLU system to support userqueries, it is crucial to be able to generalize to un-seen data. Thus, we use our proposed transduc-tive models as an auxiliary model in an inductivesemi-supervised learning pipeline, which is shownas Figure 2.

The input of pipeline is the combination of afew labeled utterances Xtrain and a large amount

153

of unlabeled data Xunlabeled as mentioned in pre-vious section. We apply paraphrase detectionmodel on both Xtrain and Xunlabeled to find pair-wise paraphrase relations between utterances. De-tails of the paraphrase detection model is given inSection 4.

For each utterance, we first find all its para-phrases as adjacency lists. We then build graphbased on the adjacency lists. Transductive mod-els (TeGrabS, TGCN, etc.) are applied as auxil-iary model as shown in the graph, to predict labelsfor unlabeled data Xunlabeled from both labeleddata and graph structure. Finally, predictions ofXunlabeled are fed into NLU model training set tore-train the NLU model and test on the unseen testset.

Based on the transductive TeGrabS andTGCN, here we propose their inductive variants,named Pseudo-Labeling with Text-Graph BeamSearch (PL-TeGrabS) and Pseudo-Labelingwith TextGCN (PL-TGCN) where TeGrabS andTGCN are used as auxiliary models in the afore-mentioned inductive semi-supervised learningpipeline, respectively.

4 Paraphrase Detection for GraphConstruction

In this work, we leveraged paraphrase learning tofind potential paraphrases in the data set and con-struct a graph. In real-world applications, thiscould be obtained from analyzing usage pattern,such as repetition or rephrase of user requests. Inthis work, we apply the paraphrase classificationmodel on the NLU utterances to retrieve the para-phrase pairs within the data. We then construct thegraph where paraphrases are connected.

In this section, we explain how the paraphrasemodel is trained as well as the construction of thegraph.

4.1 Paraphrase Embedding Learning

In order to obtain embedding for paraphrases, weused a word averaging model. In this approach,once a word embedding matrix is learned, we av-erage them over a sequence:

g(x) =1

n

n∑

i

W xiw (7)

where Ww is a word embedding matrix. Parame-ters are learned by minimizing an objective func-

tion with a margin, as described in Wieting et al.(2016a).

For embedding learning, we used the PPDB-Sdata set (Pavlick et al., 2015), which comprises 1.5million paraphrase pairs.

4.2 Paraphrase ClassificationUsing the embeddings, we trained a model thatoutputs a score as an indicative for the pair to beparaphrases of each other. In the model, we usedthe embedding approach described in Section 4.1and obtain an embedding e for each utterance. Fora pair of utterances u1 and u2, we combine theirembeddings in the following way:

h = [eu1 , eu2 , |eu1 − eu2 |, eu1 × eu2 ] (8)

where we concatenate each utterance’s embed-ding, element-wise difference and product be-tween the two.

We then used a fully-connected network to out-put the probability for two utterances being para-phrases. We used two 100-dimension hidden lay-ers with ReLU activation (Nair and Hinton, 2010)for the task. Further details of the embeddinglearning and classification model can be found inCho et al. (2019b).

To train the paraphrase classification model, weused a back-translated paraphrase corpus (Wiet-ing and Gimpel, 2017). For positive examples, werandomly selected 1.4M paraphrase pairs from thecorpus. For negative examples, we randomly pairup utterances within the corpus so that the utter-ances in the pair are not paraphrases of each other.In the end, we obtained 2.8M pairs of data, withbalanced positive and negative labels.

Using the method above leads to an F-score of98.39 on a test set with balanced 20K pairs. Notethat the performance of classification model is ex-pected to regress when applied on the target taskdata, due to the domain mismatch between thesizable, publicly available paraphrase corpus (Wi-eting and Gimpel, 2017) and the NLU task data(Coucke et al., 2018).

In this work, we consider paraphrase pairswhose score returned by the model is higher thana threshold θ = 0.99. Detailed statistics on theconstructed graph can be found in Section 6.

5 Experimental Setup

In this section, we discuss the experimental setupused in this work. First, we describe the data sets

154

Intent Utterances with slot labelssearchFlight find me a flight from [origin](Paris) to [destination](New York)searchFlight I need a flight leaving [date](this weekend) to [destination](Berlin)searchFlight show me flights to go to [destination](new york) leaving [date](this evening)

Table 1: Examples of the SNIPS Dataset.

Figure 3: Intent distribution in SNIPS Train, SSL, Dev and Test sets.

No. UtterancesTrain 1,310SSL candidate (unlablled) 11,774Dev 700Test 700

Table 2: SNIPS data statistics.

used for training and evaluating the suggested SSLapproach. We also describe the NLU model usedin this work, followed by description on the com-parative systems.

5.1 Data

To evaluate the proposed model, experiments wereperformed on SNIPS dataset (Coucke et al., 2018),which is collected from the SNIPS personal voiceassistant. This data comes with a pre-cut train,dev, and test sets, which contain 13,084, 700 and700 utterances respectively. There are 72 slot la-bels and 7 intent types for the training set. Ex-ample utterances from the SNIPS data is shown inTable 1.

Designing a real-world application often faceswith a challenge where there is an abundantamount of unlabeled data, but only a limitedamount of labeled data. In order to simulate thisscenario, we split the training data portion further,so that only 10% of the labeled training data isused for model training. The rest 90% of the la-beled training data would be considered as candi-dates for SSL. Thus, we did not rely on the an-notated labels in the SSL portion of the trainingdata, but consider this as an unlabeled data and try

to learn them from SSL process. Overall datasetstatistics is given in Table 2, intent distributions inTrain, SSL, Dev and Test set are shown in Figure3.

5.2 NLU System Description

The NLU model we used is a Slot-Gated attention-based bidirectional long short-term memoryModel (SGM) (Goo et al., 2018). In our setting,the full-attention setup was used, which achievesthe best performance in the paper. We used thedefault hyper-parameters from the code base. ForTGCN and TeGrabS, graph structure was used todetermine intent labels of unlabelled SSL candi-dates. Then, since Slot-Gated model leverages in-tent classification results to make slot predictions,graph SSL actually also helps in filling slots.

5.3 Comparative systems

In order to explore the effectiveness of the SSL ap-proaches we discuss in this work, we rely on twocomparative systems. First system (“Baseline”)is trained only on training data in Table 2, with-out applying SSL. In the second system PseudoLabelling Baseline (“PL-Baseline”), we appliedSSL, more specifically, pseudo labelling, but with-out leveraging the graph structure. For this sys-tem, we first trained the Baseline then infer labelsfor unlabeled data Xunlabeled with Baseline. Af-ter giving Xunlabeled pseudo labels, we put it intoNLU model training set, along with Xtrain, to re-train NLU model.

155

Whatistheweatherforecast

tomorrowinfrench

Whatstheweatherforecast

forseychelles

Whatsthetemperature

todayingriffin

Whatstheweatherlikeinserbia

Whatsthehumidityright

nowinaguila

Figure 4: Excerpt of constructed graph using SNIPS data. Graph is constructed to represent utterance similarityusing a paraphrase measure.

Model IC Acc. IC F1 Slot F1 SER

Baseline* 92.57 92.52 59.30 67.43PL-Baseline 92.86 92.03 59.61 68.71PL-TGCN 93.14 92.48 63.95 63.86PL-TeGrabS 73.43 72.95 58.02 73.71

Table 3: Transductive results on SNIPS data. *Baseline for Snips dataset is Slot Gated Modeling (Goo et al., 2018).

5.4 EvaluationWe report IC accuracy, F1 score, slot F1 and SlotError Rate (SER) as metrics to measure the perfor-mance of the models.

SER is a metric used to combine intent classifi-cation accuracy and the slot classification accuracyin a single score. It is defined as:

SER =S + I +D

S +D + C

where S is number of substitution errors for intentsor slots, I is the number of insertion errors for in-tents or slots, D is the number of deletion errorsfor intents or slots, and C is the number of correctslots and intents.

6 Results

In this section, we first discuss the constructedgraph using paraphrase measures. We then reportthe inductive SSL performance with graph meth-ods as auxiliary models.

6.1 Constructed GraphThe paraphrase graph is built based on the trainand SSL candidate set as we discussed in previ-ous sections. Figure 4 is a part of the constructedgraph, from which we can observe paraphrase pat-terns among connected components. We have alsoconfirmed that neighbors in this excerpt share the

same intent (GetWeather) as well as similarslots.

The whole graph contains 12,895 nodes (whichindicates that there are duplicates in SNIPSdataset), 52,876 edges when we set the paraphrasethreshold θ = 0.99.

6.2 Inductive ResultsWe evaluated baselines and our proposed modelson the SNIPS dataset. Intent classification and slotfilling experiment results are shown in Table 3.

We can observe that PL-TGCN outperformedother models on intent classification accuracy.However, this model is slightly defeated by base-line on intent classification F1-score. Our anal-ysis revealed that PL-TGCN tends to predictmore utterances into AddToPlaylist insteadof PlayMusic, compared to baseline. SinceAddToPlaylist is the biggest intent class intest set (124/700), more predictions in this classwill certainly raise accuracy, but will do little harmto F1-score, given that we are reporting F1-scoreaveraged from all classes. However, though Base-line did good job in not assigning more false pos-itives to AddToPlaylist, it is more likely toassign utterances in AddToPlaylist to otherclasses, which is actually not good. Therefore,we can conclude that PL-TGCN achieved best per-formance on intent classification in general. Itboosted the performance of slot filling through slot

156

gate in SGM, leading to a great reduction on SER.

7 Conclusion and Future Work

In this work, we proposed transductive graph-based semi-supervised learning models as well astheir inductive variants for NLU. In order to findsimilar utterances and construct a graph, we usea paraphrase detection model. To the best of ourknowledge, our work is the first approach to ap-ply text based graph structure for an SSL of NLU.We evaluate our method’s applicability on publiclyavailable data and model. Results show that ap-plying the inductive graph-based semi-supervisedlearning can reduce the error rate of the NLUmodel by 5%.

In the future, we will extend our work on otherpublic datasets, explore methods to directly pre-dict slots with transductive graph models. We willalso make further research on applying other SSLtechniques, e.g. iterative bootstrapping, instead ofpseudo-labelling.

ReferencesMohammad Aliannejadi, Masoud Kiaeeha, Shahram

Khadivi, and Saeed Shiry Ghidary. 2017. Graph-based semi-supervised conditional random fieldsfor spoken language understanding using unaligneddata. CoRR, abs/1701.08533.

Eunah Cho, He Xie, and William M Campbell. 2019a.Paraphrase generation for semi-supervised learningin nlu. In Proceedings of the Workshop on Meth-ods for Optimizing and Evaluating Neural LanguageGeneration, pages 45–54.

Eunah Cho, He Xie, John Lalor, Varun Kumar, andWilliam M Campbell. 2019b. Efficient semi-supervised learning for natural language understand-ing by optimizing diversity.

Alice Coucke, Alaa Saade, Adrien Ball, TheodoreBluche, Alexandre Caulier, David Leroy, ClementDoumouro, Thibault Gisselbrecht, Francesco Calt-agirone, Thibaut Lavril, Mael Primet, and JosephDureau. 2018. Snips voice platform: an embeddedspoken language understanding system for private-by-design voice interfaces. CoRR, abs/1805.10190.

Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-LiHuo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-gated modeling for joint slotfilling and intent prediction. In Proceedings of the2018 Conference of the North American Chapter ofthe Association for Computational Linguistics: Hu-man Language Technologies, NAACL-HLT, New Or-leans, Louisiana, USA, June 1-6, 2018, Volume 2(Short Papers), pages 753–757.

William L. Hamilton, Zhitao Ying, and Jure Leskovec.2017. Inductive representation learning on largegraphs. In Advances in Neural Information Process-ing Systems 30: Annual Conference on Neural In-formation Processing Systems 2017, 4-9 December2017, Long Beach, CA, USA, pages 1024–1034.

Thomas N. Kipf and Max Welling. 2016. Variationalgraph auto-encoders. CoRR, abs/1611.07308.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutionalnetworks. In 5th International Conference onLearning Representations, ICLR 2017, Toulon,France, April 24-26, 2017, Conference Track Pro-ceedings.

Samuli Laine and Timo Aila. 2017. Temporal ensem-bling for semi-supervised learning. In 5th Inter-national Conference on Learning Representations,ICLR 2017, Toulon, France, April 24-26, 2017, Con-ference Track Proceedings.

Ouyu Lan, Su Zhu, and Kai Yu. 2018. Semi-supervisedtraining using adversarial multi-task learning forspoken language understanding. In 2018 IEEE In-ternational Conference on Acoustics, Speech andSignal Processing, ICASSP 2018, Calgary, AB,Canada, April 15-20, 2018, pages 6049–6053.

Bing Liu and Ian Lane. 2016. Attention-based recur-rent neural network models for joint intent detectionand slot filling. In Interspeech 2016, 17th AnnualConference of the International Speech Communica-tion Association, San Francisco, CA, USA, Septem-ber 8-12, 2016, pages 685–689.

Takeru Miyato, Shin-ichi Maeda, Masanori Koyama,and Shin Ishii. 2019. Virtual adversarial training:A regularization method for supervised and semi-supervised learning. IEEE Trans. Pattern Anal.Mach. Intell., 41(8):1979–1993.

Vinod Nair and Geoffrey E Hinton. 2010. Rectifiedlinear units improve restricted boltzmann machines.In Proceedings of the 27th international conferenceon machine learning (ICML-10), pages 807–814.

Shirui Pan, Ruiqi Hu, Guodong Long, Jing Jiang, LinaYao, and Chengqi Zhang. 2018. Adversarially reg-ularized graph autoencoder for graph embedding.In Proceedings of the Twenty-Seventh InternationalJoint Conference on Artificial Intelligence, IJCAI2018, July 13-19, 2018, Stockholm, Sweden., pages2609–2615.

Shirui Pan, Jia Wu, Xingquan Zhu, Guodong Long,and Chengqi Zhang. 2017. Task sensitive featureexploration and learning for multitask graph classi-fication. IEEE Trans. Cybernetics, 47(3):744–758.

Shirui Pan, Jia Wu, Xingquan Zhu, Chengqi Zhang,and Philip S. Yu. 2016. Joint structure feature explo-ration and regularization for multi-task graph classi-fication. IEEE Trans. Knowl. Data Eng., 28(3):715–728.

157

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch,Benjamin Van Durme, and Chris Callison-Burch.2015. Ppdb 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, andstyle classification. In Proceedings of the 53rd An-nual Meeting of the Association for ComputationalLinguistics and the 7th International Joint Confer-ence on Natural Language Processing (Volume 2:Short Papers), volume 2, pages 425–430.

Antti Tarvainen and Harri Valpola. 2017. Mean teach-ers are better role models: Weight-averaged consis-tency targets improve semi-supervised deep learningresults. In Advances in Neural Information Process-ing Systems 30: Annual Conference on Neural In-formation Processing Systems 2017, 4-9 December2017, Long Beach, CA, USA, pages 1195–1204.

Phi Vu Tran. 2018. Learning to make predictions ongraphs with autoencoders. In 5th IEEE Interna-tional Conference on Data Science and AdvancedAnalytics, DSAA 2018, Turin, Italy, October 1-3,2018, pages 237–245.

Gokhan Tur and Renato De Mori. 2011. Spoken lan-guage understanding: Systems for extracting seman-tic information from speech. John Wiley & Sons.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova,Adriana Romero, Pietro Lio, and Yoshua Bengio.2018. Graph attention networks. In 6th Inter-national Conference on Learning Representations,ICLR 2018, Vancouver, BC, Canada, April 30 - May3, 2018, Conference Track Proceedings.

John Wieting, Mohit Bansal, Kevin Gimpel, and KarenLivescu. 2016a. Towards universal paraphrastic sen-tence embeddings. International Conference onLearning Representations (ICLR).

John Wieting and Kevin Gimpel. 2017. Paranmt-50m:Pushing the limits of paraphrastic sentence embed-dings with millions of machine translations. arXivpreprint arXiv:1711.05732.

Zhang Xinyi and Lihui Chen. 2019. Capsule graphneural network. In 7th International Conferenceon Learning Representations, ICLR 2019, New Or-leans, LA, USA, May 6-9, 2019.

Zhilin Yang, Junbo Jake Zhao, Bhuwan Dhingra,Kaiming He, William W. Cohen, Ruslan Salakhut-dinov, and Yann LeCun. 2018. Glomo: Unsuper-vised learning of transferable relational graphs. InAdvances in Neural Information Processing Systems31: Annual Conference on Neural Information Pro-cessing Systems 2018, NeurIPS 2018, 3-8 December2018, Montreal, Canada., pages 8964–8975.

Zhitao Ying, Jiaxuan You, Christopher Morris, Xi-ang Ren, William L. Hamilton, and Jure Leskovec.2018. Hierarchical graph representation learningwith differentiable pooling. In Advances in NeuralInformation Processing Systems 31: Annual Con-ference on Neural Information Processing Systems

2018, NeurIPS 2018, 3-8 December 2018, Montreal,Canada., pages 4805–4815.

Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Ir-win King, and Dit-Yan Yeung. 2018a. Gaan: Gatedattention networks for learning on large and spa-tiotemporal graphs. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intel-ligence, UAI 2018, Monterey, California, USA, Au-gust 6-10, 2018, pages 339–349.

Muhan Zhang, Zhicheng Cui, Marion Neumann, andYixin Chen. 2018b. An end-to-end deep learningarchitecture for graph classification. In Proceed-ings of the Thirty-Second AAAI Conference on Ar-tificial Intelligence, (AAAI-18), the 30th innovativeApplications of Artificial Intelligence (IAAI-18), andthe 8th AAAI Symposium on Educational Advancesin Artificial Intelligence (EAAI-18), New Orleans,Louisiana, USA, February 2-7, 2018, pages 4438–4445.

158


Graph Enhanced Cross-Domain Text-to-SQL Generation

Siyu HuoIBM Research, [email protected]

Tengfei MaIBM Research AI, [email protected]

Jie ChenMIT-IBM Watson AI Lab, USA

[email protected]

Maria ChangIBM Research AI, [email protected]

Lingfei WuIBM Research AI, USA

[email protected]

Michael WitbrockUniversity of Auckland

[email protected]

Abstract

Semantic parsing is a fundamental problemin natural language understanding, as it in-volves the mapping of natural language tostructured forms such as executable queriesor logic-like knowledge representations. Ex-isting deep learning approaches for seman-tic parsing have shown promise on a varietyof benchmark data sets, particularly on text-to-SQL parsing. However, most text-to-SQLparsers do not generalize to unseen data sets indifferent domains. In this paper, we proposea new cross-domain learning scheme to per-form text-to-SQL translation and demonstrateits use on Spider, a large-scale cross-domaintext-to-SQL data set. We improve upon astate-of-the-art Spider model, SyntaxSQLNet,by constructing a graph of column names forall databases and using graph neural networksto compute their embeddings. The resultingembeddings offer better cross-domain repre-sentations and SQL queries, as evidenced bysubstantial improvement on the Spider data setcompared to SyntaxSQLNet.

1 Introduction

Text-to-SQL translation is currently one of themost important tasks in semantic parsing. Itinvolves mapping natural language sentences toSQL queries that can be executed on associateddatabase tables. Most text-to-SQL data sets havetwo very important limitations: (1) they mostlycontain only very simple SQL queries, and (2)they use the same databases (and often identicalSQL queries) for training and testing. Finegan-Dollak et al. (2018) demonstrated that text-to-SQLparsers that perform well on existing benchmarksdo not generalize well to unseen SQL queries, asmeasured by experiments on modified data setswhere no two identical SQL queries appear in boththe training and testing sets. The Spider data set(Yu et al., 2018c) was developed to address these

limitations by including a large number of com-plex programs, databases with multiple tables, andby ensuring that the SQL queries and databasesthat appear in the training set do not also appear inthe testing set. In this way, Spider can be used as abetter measure of a model’s ability to produce un-seen complex programs and to generalize to newdomains.

The Spider data set has led to the developmentof complex, cross-domain semantic parsers, suchas SyntaxSQLNet (Yu et al., 2018b). In priorwork, cross-domain semantic parsing referred tothe ability to generalize among different logi-cal forms (Su and Yan, 2017). For the text-to-SQL task, however, cross-domain semantic pars-ing refers to the ability to generalize across dif-ferent queries and databases. Very recently, Syn-taxSQLNet (Yu et al., 2018b) was proposed tosolve this task by introducing a SQL-specific syn-tax tree-based decoder, with a SQL generationpath history and table-aware column attention en-coder. For better cross-domain performance, itemploys a data augmentation method to generatemore diverse training examples across databases.However, creating this augmented data is expen-sive and time-consuming. It requires grouping theSQL patterns and then manually editing the se-lected SQL patterns and their corresponding listsof questions.

In this paper, we propose a new approach to en-hance cross-domain learning in text-to-SQL sys-tems. In most existing systems, selecting entriesfrom available databases is one of the most im-portant components. When experimenting withbaselines, Yu et al. (2018c) observed that most ofthe component matching errors (that is, compo-nents of the predicted SQL query do not matchthose of the gold query) were caused by errorsin column prediction. A good representation ofthe columns in the databases should lead to accu-

159

rate matching between queries and columns. Ouridea is to connect all tables and databases acrossdomains via shared column names, so that thecolumn name representation can use informationacross domains for better generalization. We im-plement this idea by constructing a database graphand using a graph neural network to encode thecolumns. In this way, we obtain higher qualitycolumn embeddings, which lead to more accuratecolumn matching and better SQL generation.

2 Related Work

Many semantic parsers have been developed totranslate natural language text into structured,symbolic forms, including abstract meaning rep-resentation (Lyu and Titov, 2018), executable pro-grams (e.g. Python, Lisp, Bash) (Allamanis et al.,2015; Rabinovich et al., 2017; Yin and Neubig,2017; Liang et al., 2017; Lin et al., 2018), andSQL queries (Dong and Lapata, 2018; Yu et al.,2018b,a; Xu et al., 2017).

For text-to-SQL parsing, the work most closelyrelated to ours is SyntaxSQLNet (Yu et al., 2018b),which is the state-of-the-art approach for the Spi-der data set (Yu et al., 2018c). SyntaxSQLNetextends prior text-to-SQL models, such as SQL-Net (Xu et al., 2017) and TypeSQL (Yu et al.,2018a), by encoding both local information fromcolumn names and global information from ta-ble names. The primary difference between Syn-taxSQLNet and our work is that we use a novelcolumn embedding technique that additionally in-cludes a graph of the tables, connected throughshared column names.

3 Problem Formulation

We focus on the cross-domain SQL generationtask within the Spider data set (Yu et al., 2018c).Spider consists of 10,181 questions and 5,693unique complex SQL queries on 200 databaseswith multiple tables, covering 138 different do-mains.

Specifically, in the data set each natural lan-guage query Q is associated with a correspondingSQL query S and a database DB. The databasecontains multiple tables T , and each table containsmultiple columns C. The task of SQL genera-tion is to generate S given only Q and DB. Forcross-domain learning, the databases for trainingand testing are separate, so that one can test the

generalization ability of the models to unseen do-mains.

4 Method

Our model is based on the framework of Syn-taxSQLNet (Yu et al., 2018b) but extends it withour approach for generating column embeddings.We first briefly introduce the SyntaxSQLNet sys-tem, and then present the proposed graph-basedcolumn embedding method.

4.1 Background of SyntaxSQLNet

For the SQL generation task, one is given a naturallanguage sentence and a database and is asked togenerate the corresponding SQL query. Therefore,one may use an encoder to encode the sentenceand table columns in the database and use a de-coder to generate the SQL query. In SyntaxSQL-Net, the sentence is encoded by a bi-directionalLSTM (BiLSTM), while each column in the tableis encoded by using a simple scheme called table-aware column representation. The scheme takesthe list of words in the table name and the columnname, as well as the type information of the col-umn, as input to a BiLSTM and outputs the finalstate as the column representation. This approachincorporates both the global table information andthe local column information.

To better leverage SQL structures, SyntaxSQL-Net uses an SQL specific tree-based decoderwith SQL path histories. It decomposes the de-coder into a set of recursive modules: IUEN,KW, COL, OP, AGG, Root/Terminal, AND/OR,DESC/ASC/LIMIT, and HAVING. Each moduledeals with different SQL components. For ex-ample, the KW module predicts keywords fromWHERE, GROUP BY and ORDER BY, and theCOL module predicts columns. For details of theother modules, see (Yu et al., 2018b). In this work,we focus on improving the COL module.

The recursive modules in SyntaxSQLNet arecombined to form the whole generation process.Specifically, when the decoder generates the SQLquery, it first determines which module to invoke,and then uses that module to predict the next SQLtoken. For each module, the input encoding goesthrough an attention computation, before beingused to make the prediction of the next token. Forexample, in the COL module, the prediction iscomputed by the using following equations:

160

Table 1: Example Databases

Database Table Name Column Names

department storecustomers customer id customer name customer address

customer orders order id customer id order datecoffee shop shop shop id address num of staff

P numCOL = P

(Wnum

1 HnumQ/COL

> +Wnum2 Hnum

HS/COL>)

P valCOL = P

(Wval

1 HvalQ/COL

>+Wval

2 HvalHS/COL

>

+ Wval3 HCOL

>),

where “num” means the number of columns, “val”means the index(es) of the column(s), “Q” meansquestion, “HS” means path history, “COL” meanscolumn, P(U) = softmax(V tanh(U)) is aprobability distribution given score U and param-eter V, the W’s are learnable parameters, andthe H1/2’s are conditional embeddings defined asH1/2 = softmax(H1WH>2 )H1.

4.2 Database Graph Construction4.2.1 MotivationFor a good accuracy of the final SQL genera-tion, every module needs to perform well. How-ever, the COL module is a significant bottleneckof the system. When reproducing the results fromSyntaxSQLNet, we find that its accuracy is onlyslightly above 50%, while other modules gener-ally achieve 90%. SyntaxSQLNet gains gener-alizability across domains through encoding bothglobal and local information by simply using allthe words from the table name, column name, andcolumn data type. There are two problems withthis approach. First, no explicit ordering existsfor these words; hence, the use of BiLSTM to en-code them seems less justified. Second, although itadds the global information from table names forbetter column name embedding, the table namesare completely independent of other tables anddatabases, and therefore it incorporates little infor-mation from other domains.

Our idea is to use a graph to connect the columnnames across all tables and databases and computerepresentations of them by using a graph neuralnetwork. In this way, the information of differentdomains is passed to each other, so that one canlearn a better column embedding that generalizesacross domains.

Our approach is relevant to those that use neuralnetworks for domain adaptation; see, e.g., Parejaet al. (2019) who adapt the graph neural networkmodel for data at different time steps. However,our approach is conceptually different from thesemethods, because we do not map the tasks to anew domain but rather, learn a better representa-tion of the table elements through leveraging alldomain information. Another related work thatalso uses GNN for sementic parsing is the very re-cently published Bogin et al. (2019), but the graphtherein is an abstraction of the database schema,as opposed to being a tool for producing columnrepresentations as in our work.

4.2.2 Graph ConstructionWe construct a graph to connect all column namesacross tables and databases. To encode more infor-mation, we also include the table names and col-umn data types as additional nodes in the graph.In order to obtain more connections and reduceunseen phrases, we split the names into words,so that different tables and databases share morenodes. Specifically, our database graph is con-structed as follows:

1. Every column name is separated into a set ofwords and the words are used as nodes in thegraph.

2. Within each table, we connect all word nodesto each other.

3. For each table, we include the words of tablename as additional nodes, and connect themwith all column name words in that table.

4. Different tables and databases are thus con-nected through shared words.

5. We also add column data types as additionalnodes and connect them with correspondingcolumn name words.

In Table 1 and Figure 1 respectively, we showan example of the databases and the graph con-

161

Figure 1: An example of the constructed graph fordatabases in Table 1. Orange nodes are words in thetable names and blue nodes are words in the columnnames. For simplicity, column data types are not in-cluded in this example.

structed for them. Clearly, in the graph, two differ-ent databases are connected through shared words,such as id and address.

4.2.3 Graph EncodingWe encode the nodes in the graph by using a 2-layer graph convolutional network (GCN) (Kipfand Welling, 2016). After we obtain the nodeembeddings for all nodes, for every table col-umn we combine the embeddings of the wordsof the corresponding table name, column name,and column data type to get the overall repre-sentation of the column. We call this approachsubgraph pooling. In details, assume that wehave a table name TN , column name wordsC1, C2, C3, and column data type TP . We firstuse the pre-trained GloVe vectors (Penningtonet al., 2014) to initialize the embedding of eachnode: ETN , EC1 , EC2 , EC3 , ETP . With the graphconvolutional network, we obtain correspondingembedding vectors GTN , GC1 , GC2 , GC3 , GTP .For subgraph pooling, we first concatenate theoriginal GloVe vectors and the newly learnedembedding vectors, then do averaging over allnodes for this column, and finally map it toanother lower-dimensional space: H(∗) =Wh[Meanω(E(ω)||G(ω))], where || denotes con-catenation, ω ∈ TN,C1, C2, C3, TP and Wh isa learnable parameter. Afterwards, we replace theBiLSTM encoding in SyntaxSQLNet byH(∗) andcontinue the SyntaxSQLNet decoding process.

5 Experiments

5.1 Experimental Setting

We use the Spider data set and follow the set-ting of Yu et al. (2018c). The data set is split

into 7,000/1,034/2,147 train/development/test ex-amples, and the databases are correspondinglysplit into 146/40/20 for train/development/test.The models are evaluated by exact matching accu-racy with the provided test script Yu et al. (2018c).Since we made no changes to the modules otherthan COL, we do not compare the componentmatching accuracy as in Yu et al. (2018b). Instead,we include the accuracy of the COL module as anadditional metric to better interpret the effects ofthe proposed graph encoding method.

For GCN (Kipf and Welling, 2016), we usedthe inductive version. That is, in the trainingphase we only construct the graph from the train-ing databases, but for testing we connect thetest databases with the training ones and use thelearned GCN parameters for embedding computa-tion.

5.2 Results

Method Matching acc. COL acc.Seq2seq+attention 1.8% NA

SQLNet 10.9% NASyntaxSQLNet 18.9% 53.5%

SyntaxSQLNet +Data Augmentation 24.8% 61.9%

GNN-SQL 22.0% 57.0%

Table 2: Comparison of different methods with respectto final exact matching accuracy and COL module ac-curacy on development set. The results of SyntaxSQL-Net were run by ourselves using the provided system.

From the table we see that the COL moduleaccuracy is 3.5% higher than that in SyntaxSQL-Net, without performing any modification to othermodules. The final accuracy is also 3.1% higherthan SyntaxSQLNet without data augmentation.Although our final matching accuracy is lowerthan the data augmented SyntaxSQLNet (22.0%vs 24.8%), we have demonstrated the usefulnessof the improved COL module and the potential forreducing the need and effort of labeling additionalaugmented data.

6 Conclusion

In this paper we propose a new graph-basedmethod for cross-domain text-to-SQL genera-tion. As opposed to data augmentation in Syn-taxSQLNet, we construct a graph to connect all

162

columns across tables and databases, yielding bet-ter generalizablity of the column representationand SQL generation. Experimental results on Spi-der demonstrates the superiority of our methodover SyntaxSQLNet without data augmentation.

ReferencesMiltos Allamanis, Daniel Tarlow, Andrew Gordon, and

Yi Wei. 2015. Bimodal modelling of source codeand natural language. In International Conferenceon Machine Learning, pages 2123–2132.

Ben Bogin, Matt Gardner, and Jonathan Berant. 2019.Representing schema structure with graph neuralnetworks for text-to-sql parsing. In ACL.

Li Dong and Mirella Lapata. 2018. Coarse-to-fine de-coding for neural semantic parsing. In Proceed-ings of the 56th Annual Meeting of the Associationfor Computational Linguistics (Volume 1: Long Pa-pers), pages 731–742.

Catherine Finegan-Dollak, Jonathan K Kummerfeld,Li Zhang, Karthik Ramanathan, Sesh Sadasivam,Rui Zhang, and Dragomir Radev. 2018. Improv-ing text-to-sql evaluation methodology. In Proceed-ings of the 56th Annual Meeting of the Associationfor Computational Linguistics (Volume 1: Long Pa-pers), pages 351–360.

Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutionalnetworks. arXiv preprint arXiv:1609.02907.

Chen Liang, Jonathan Berant, Quoc Le, Kenneth DForbus, and Ni Lao. 2017. Neural symbolic ma-chines: Learning semantic parsers on freebase withweak supervision. In 55th Annual Meeting of the As-sociation for Computational Linguistics, ACL 2017,pages 23–33. Association for Computational Lin-guistics (ACL).

Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer,and Michael D Ernst. 2018. Nl2bash: A corpusand semantic parser for natural language interfaceto the linux operating system. In Proceedings of theEleventh International Conference on Language Re-sources and Evaluation (LREC-2018).

Chunchuan Lyu and Ivan Titov. 2018. Amr parsing asgraph prediction with latent alignment. In Proceed-ings of the 56th Annual Meeting of the Associationfor Computational Linguistics (Volume 1: Long Pa-pers), pages 397–407.

Aldo Pareja, Giacomo Domeniconi, Jie Chen, TengfeiMa, Toyotaro Suzumura, Hiroki Kanezashi, TimKaler, and Charles E. Leiserson. 2019. EvolveGCN:Evolving graph convolutional networks for dynamicgraphs. Preprint arXiv:1902.10191.


Maxim Rabinovich, Mitchell Stern, and Dan Klein.2017. Abstract syntax networks for code genera-tion and semantic parsing. In Proceedings of the55th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1: Long Papers), pages1139–1149.

Yu Su and Xifeng Yan. 2017. Cross-domain seman-tic parsing via paraphrasing. In Proceedings of the2017 Conference on Empirical Methods in NaturalLanguage Processing, pages 1235–1246.

Xiaojun Xu, Chang Liu, and Dawn Song. 2017. Sqlnet:Generating structured queries from natural languagewithout reinforcement learning. arXiv preprintarXiv:1711.04436.

Pengcheng Yin and Graham Neubig. 2017. A syntacticneural model for general-purpose code generation.In Proceedings of the 55th Annual Meeting of theAssociation for Computational Linguistics (Volume1: Long Papers), pages 440–450.

Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, andDragomir Radev. 2018a. Typesql: Knowledge-based type-aware neural text-to-sql generation. InProceedings of the 2018 Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics: Human Language Technologies,Volume 2 (Short Papers), pages 588–594.

Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang,Dongxu Wang, Zifan Li, and Dragomir Radev.2018b. Syntaxsqlnet: Syntax tree networks for com-plex and cross-domain text-to-sql task. In Proceed-ings of the 2018 Conference on Empirical Methodsin Natural Language Processing, pages 1653–1663.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga,Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingn-ing Yao, Shanelle Roman, et al. 2018c. Spider: Alarge-scale human-labeled dataset for complex andcross-domain semantic parsing and text-to-sql task.In Proceedings of the 2018 Conference on Empiri-cal Methods in Natural Language Processing, pages3911–3921.

163


Reasoning Over Paths via Knowledge Base Completion

Saatviga Sudhahar, Ian Roberts and Andrea PierleoniHealx, Cambridge, UK

saatviga.sudhahar, ian.roberts, [email protected]

AbstractReasoning over paths in large scale knowledgegraphs is an important problem for many ap-plications. In this paper we discuss a sim-ple approach to automatically build and rankpaths between a source and target entity pairwith learned embeddings using a knowledgebase completion model (KBC). We assembleda knowledge graph by mining the availablebiomedical scientific literature and extracteda set of high frequency paths to use for val-idation. We demonstrate that our method isable to effectively rank a list of known pathsbetween a pair of entities and also come upwith plausible paths that are not present inthe knowledge graph. For a given entity pairwe are able to reconstruct the highest rankingpath 60% of the time within the top 10 rankedpaths and achieve 49% mean average preci-sion. Our approach is compositional since anyKBC model that can produce vector represen-tations of entities can be used.

1 Introduction

A large amount of work has been dedicated in thepast on Knowledge base completion (KBC) whichis an automated process that adds missing factsthat can be predicted from existing ones alreadyin the Knowledge base. This is crucial for the useof large Knowledge bases in many downstreamapplications. However explaining the predictionsgiven by a KBC algorithm is quite important forseveral real world use cases. For example in rec-ommender systems, a knowledge graph of users,items and their interactions are used to recommendan item to a user based on the user’s interactionson several items. The ability to explain and rea-son on the decision is of critical importance to addknowledge to recommender systems. Similarly ina knowledge graph consisting human biologicaldata such as genes, drugs, symptoms and diseases,it is crucial to know which gene and symptoms

were involved in predicting a drug for a disease.This requires automatic extraction and ranking ofmulti-hop paths between a given source and a tar-get entity from a knowledge graph.

Previous work has focused on using path in-formation in knowledge graphs for KBC knownas path-based inference (Lao et al., 2011; Gard-ner et al., 2014; Neelakantan et al., 2015; Daset al., 2017b), in which a model is trained to pre-dict missing links between a given pair of enti-ties taking as input several paths that existed be-tween them. Paths are ranked according to a scor-ing method and used as features to train the model.Embedding-based inference models (Bordes et al.,2013; Lin et al., 2015; Nickel et al., 2011; Socheret al., 2013; Trouillon et al., 2016) for KBC learnentity and relation embeddings by solving an op-timization problem that maximises the plausibilityof known facts in the knowledge graph. A thirdset of models bridge path-based and embedding-based inference with deep-reinforcement learningfor reasoning in knowledge graphs (Xiong et al.,2017; Das et al., 2017a; Lin et al., 2018; Songet al., 2019). All these models address variousobjectives. They try to predict missing links in agraph by ranking the target entities given a queryentity, infer new relations between entity pairsgiven a set of multi-hop input paths between thosepairs or infer target entities while also reason-ing on paths identified. Our work is inline withthe third objective in which we propose a simpleapproach that combines embedding-based mod-els with a path building and ranking strategy tocome up with the most probable explanations fora given prediction from a source to target. Theproblem can be formulated as follows: Given asource entity e1 and target entity e2 the goal isto first come up with a set of meaningful pathsP (e1, e2) = p1, p2....pn connecting e1 and e2.P (e1, e2) can contain known and predicted edges

164

as given by the embedding-based model. Rankingof the paths is given by,

Re1e2 = fΘ(e1, e2|P (e1, e2)) (1)

where fΘ denotes the underlying model with pa-rameters Θ, and Re1e2 presents the ranking of thepaths.

Table 1 shows two examples of ranked 1-hoppaths for query types, ‘Gene-Phenotype-Disease’and ‘Gene-Drug-Disease’. In the first example,given a predicted fact that Gene ‘Carbonic An-hydrase 1 (CA1)’ is associated with Disease ‘Is-chemia’, the most probable explanations can begenerated by building such paths between the en-tities. Gene ‘CA1’ is linked to Phenotypes ‘Hy-pothermia’, ‘Neuronal loss’ and ‘Hyperglycemia’and these Phenotypes are linked with Disease ‘Is-chemia’ and therefore reasoning on the fact thatGene ‘CA1’ is associated with Disease ‘Ischemia’.Similarly in second example Gene ‘Bruton ty-rosine kinase (BTK)’ is associated with Disease‘Chronic lymphocytic leukemia’ since the Drugs‘Ibrutinib’ and ‘Acalabrutinib’ are found to belinked with Gene ‘BTK’ and the Disease. Ourmethod is able to extract such paths and rank themproviding the ability to reason on predictions.

From a knowledge graph data set, we initiallytrain an embedding-based KBC model that canpredict target entities given a source, relation pair.The KBC model is then used to build a set of 1-hop paths between a given source and target entityand paths are ranked according to a scoring func-tion. Any embedding based KBC model capableof producing separate vector representations of en-tities can be used for training in our method. Ourapproach is quite suitable for downstream applica-

Gene Phenotype DiseaseCA1 Hypothermia IschemiaCA1 Neuronal loss IschemiaCA1 Hyperglycemia IschemiaGene Drug DiseaseBTK Ibrutinib Chronic

lymphocytic leukemiaBTK Acalabrutinib Chronic

lymphocytic leukemia

Table 1: Example paths for query types ‘Gene-Phenotype-Disease’, ‘Gene-Drug-Disease’ showingthe ability to reason over 1-hop.

tions since it only requires training the model oncewith the whole data set and does not involve anyadditional training overhead. We discuss two mainexperiments in the paper, Experiment 1 showingthe model’s ability to reconstruct the highest rank-ing path in the set of retrieved ranked paths in spiteof not seeing it during training and Experiment2 that proves the ranking capability alone of thetrained model with known paths. We show thatour method is able to reconstruct the highest rank-ing path 60% of the time in the top 10 ranked pathsfor a given set of entity pairs also achieving 49%mean average precision. There is almost a 3-foldincrease in the ranking correlation between pre-dicted ranks and ground truth ranks of a longer listof known paths when compared to random. To ourknowledge this is the first paper that is focused ontrying to use path ranking to identify relevant enti-ties bridging a pair of known entities and thereforenot directly comparable with other approaches.

The paper is organised as follows: Section 2 dis-cusses related work in this area; Section 3 showsdetails of how the data sets of triples and paths be-tween entity pairs were generated and elaborateson model path construction and path scoring forranking the paths; Section 4 discusses the statis-tics of the data sets; Section 5 shows and discussesexperimental results on the model’s ability to re-cover paths and ranking in experiment 1 and rank-ing alone of known paths in experiment 2. Section6 concludes the work and discusses possible futuredirections.

2 Related Work

One of the first path building approach in literatureis the Path-Ranking Algorithm (PRA) (Lao andCohen, 2010; Gardner et al., 2013, 2014) whichuses random walk with restarts to rank entities ina graph relative to a source entity. Random walksconstrained to follow a set of edge types are per-formed in the graph to produce relational featureweights. The weights are combined to predict theprobability that a relation between a source, targetentity pair using a per-target relation binary classi-fier. The bottleneck of this method is the explosionof the feature space when there are millions of dis-tinct paths obtained from random exploration andthe fact that it relies on a binary classifier per targetrelation which is not scalable.

Guu et al. (2015) proposed a method to buildrelation paths between entity pairs using composi-

165

tional techniques to address knowledge base com-pletion and path query answering problems. Theylearn representations of nodes and relations byadapting the scoring functions from pre-existingKBC models such as TransE (Bordes et al., 2013)and RESCAL (Nickel et al., 2011). They cannotuse models in which the scoring function cannotbe decomposed in a way to produce separate vec-tors for a source/relation without the target. Incontrast our method can use any embedding basedKBC model for training. Another limitation intheir work is that they model only a single path be-tween an entity pair but we model multiple paths.Toutanova et al. (2015) propose a similar frame-work and also model intermediate entities in thepath, training a convolutional neural network torank a set of target entities given a source and re-lation pair. Path-RNN (Neelakantan et al., 2015)is also a compositional model that takes paths be-tween entity pairs as input and infers new rela-tions between them. However it uses only the re-lation information in the path and ignores mod-elling intermediate entities, and it requires traininga model for each relation type making it not usablein downstream applications. Das et al. (2017b)train a high capcity RNN model to infer relationsbetween entity pairs by taking several multi-hoppaths between them as input and score pool themusing techniques such as averaging across all pathsor the top-k paths.

Recent work in explainable reasoning is used inrecommender systems. Wang et al. (2019) pro-pose a knowledge-aware path recurrent network togenerate representation for each path by compos-ing semantics of entities and relations and reasonon paths to infer user preferences for items. Pre-defined meta-paths and LSTMs have been usedto model the paths between users and items viabreadth-first-search (BFS). Practically this is inef-ficient and can miss many meaningful paths. Toaddress these challenges, Song et al. (2019) pro-posed a deep reinforcement learning based methodthat can automatically generate meaningful pathsbetween a user and item via policy gradient meth-ods. What we present in this paper is a very simpleapproach to come up with a set of most relevantpaths consisting of known and predicted links be-tween a given source and target entity in a knowl-edge graph using embedding based models. Itsalso possible to extend this approach to use con-textualised knowledge graph embedding models

(Wang et al., 2018) which learn representationsbased on an entire path in the graph rather thana triple. We believe using this in conjunction withour method in turn could produce more meaning-ful paths in the path building process. This is leftfor future work.

3 Methodology

3.1 Generating the Data

In this section we discuss how the data set onwhich we perform the experiments was gener-ated. This includes extraction of triples from thePubmed abstracts data set using a NLP pipelineand extraction of 1-hop paths from the triple dataset. The 1-hop path data set is used to construct aranked list of paths for 100 entity pairs to generateground truth data for the path ranking experiments.

3.1.1 Extracting triplesWe used the output of the Open Targets LibraryNLP pipeline1 to access all the subject-verb-object(SVO) triples from the biomedical abstracts re-leased by PubMed and updated to March 2019, fora total of more than 29M documents and 540MSVO triples. From this set we selected only theSVO triples between entities linked to a specificbiomedical ID, discarding the ones with ‘concept’entity type and stop words. From the filtered listwe built an undirected weighted knowledge graphlinking each pair of entities identified by a triplewith a weight representing the number of docu-ments in which the triple was found.

3.1.2 Extracting 1-hop pathsWhile it is possible to access many knowledgebases like the one described above in graph format,to our knowledge there is currently none focusedon building relations spanning more than two enti-ties. We leveraged the ability to track SVO triplesoccurring in the same abstract and built a data setof paths spanning three entities by making the as-sumption that if A is related to B, and B is relatedto C in the same abstract, then its safe to assumethatA is related to C viaB. By counting the num-ber of occurrences of such connected triples acrossdifferent documents we are also able to weight thepaths, and thus can generate a ranked lists of pathsstarting from source entity A to target entity C byscoring higher the most frequent paths.

1https://github.com/opentargets/library-beam

166

By combining several 1-hop paths, it is possibleto also build multi-hop paths described within thesame document however, in this data set, even the2-hop paths are too rare to be used for a test ofstatistical significance.

3.2 Model Path ConstructionThe primary objective of our method is to au-tomatically build a set of paths between a givensource and target entity pair in a knowledge graphand rank them using a scoring method to come upwith the most relevant paths connecting the pair.We first train a 200 dimensional node, relation em-bedding model with the data set using ProjE. ProjEis a neural network based KBC model that fillsthe missing information in a knowledge graph bylearning joint embeddings of the graph’s entitiesand links. Given an input triple 〈h, r, ?〉 the modelcan find the optimal ordered list of tail entities andvice versa.

Let E be the set of all entities and R the set ofall relation types in the data set. For a given entitypair (e1, e2), we choose the set of suitable rela-tion types rt ⊂ R to be used and query the trainedmodel with (e1, rt) and (e2, rt) pairs. The setof suitable relations types are the ones that agreewith the query type. For entity pair with types,‘Drug:Disease’, the corresponding query typesare ‘Drug-Gene-Disease ’ or ‘Drug-Phenotype-Disease’ in our data set. Therefore the method willchoose ‘Drug Gene’, ‘Drug Phenotype’, ‘Dis-ease Gene’, ‘Disease Phenotype’ relations to per-form the predictions. This results in a set of topn predicted tail and head entities et, eh ⊂ E foreach relation type. If the same entity was predictedagain for e1 or e2 but with another relation typefrom previous, we retain the relation type produc-ing the higher prediction rank in the graph. Wecreate an undirected graph G out of e1, e2, andthe predicted sets of entities et, eh with their rela-tions. From this graph we extract all 1-hop pathsconnecting e1 and e2.

The same approach can be used iteratively tobuild multi-hop paths between a pair of entities.Given the limits of the available dataset, in thispaper we limit the discussion to the 1-hop case.

3.3 Path ScoringOnce the paths are extracted a scoring function isused to score them for ranking. A path p consistsof a source, target and an intermediate predictedentity connected by relevant relations. For 1-hop

paths there are two 〈h, t, r〉 pairs along the path.The score Sp for each path p in the extracted set ofpaths P = p1, p2...pn is given by,

1. Embed+predrank: sum of the predictionranks (R) of the predicted entity in p. Paths withlowest score are the high ranking ones.

Sp =∑

h,t,r∈pR (2)

2. Embed+cosine: sum of the cosine similar-ities between entity embeddings in p. The mostrelevant paths receive higher scores.

Sp =∑

h,t,r∈pcos(eh, et) (3)

Embed+predrank score only considers the rankof the predicted entities as given by the KBCmodel while Embed+cosine score uses the actuallearned embedding to score the paths which ismore informative than the previous. In the nextsection we show in detail the data set used and theexperiments performed proving the significance ofthe path ranking method in two different settings.

4 Data

The data set used for the experiments are con-structed by extracting relations using a NLPpipeline from the Pubmed repository as shown inSection 3.1.1. It consists of 3,025,541 head, rel,tail triples. Details of this base data set is shownin Table 2. From this data set we created a set of100 entity pairs with a ranked list of several one-hop paths P connecting entity pairs (e1, e2) in theknowledge graph.

We use two different path data sets for ex-periments 1 and 2. The details of this dataset is shown in Table 3. Paths extracted arefrom query types: ‘Gene-Drug-Disease’, ‘Gene-Phenotype-Disease’, ‘Drug-Phenotype-Disease’,‘Anatomy-Phenotype-Disease’, ‘Disease-Drug-Anatomy’ and ‘Anatomy-Gene-Organism’.Selection of query types are purely based onsimple and meaningful queries. There are at least2 paths for each entity pair for experiment 1 and10 for experiment 2 and they were observed twiceor more in the data. Ranking of paths for eachentity pair are based on how frequent they wereobserved in the underlying data set. We assumemost frequently seen paths are reliable and shouldrank higher than others. We use this ranking asground truth. An archive containing the base data

167

#triples 3,025,541entities 38,772entity types 10relation types 541-hop paths 5.5M

Table 2: Triple Data set statistics.

Exp 1 Exp 2query types 4 6avg no of paths 6 33min no of paths 2 10total paths 573 3356

Table 3: Path data set statistics used in experiment 1and 2.

sets with train/test split and ground truth pathranking data sets for experiment 1 and 2 presentedare available here.2

5 Experimental Results

5.1 Path recovery and ranking

In this experiment we show that the trained KBCmodel is able to recover the highest ranking pathin the ground truth among the ranked set of pathsretrieved by our ranking method for entity pairsin spite of not seeing the links belonging to thepath in the training set. We also report on themean average precision (MAP) obtained for topn number of paths retrieved by the model. Fromthe ground truth data set we choose one triple inthe top ranking path p1 for each entity pair andremove that triple from the training set. For exam-ple, if the highest ranking path is, ‘Drug1-Gene1-Disease1’ for entity pair ‘Drug1:Disease1’, weremove one triple from the path which could be‘Gene1,Disease1,Gene1 Disease1’ from the train-ing set. We train an embedding model with ProjEin the resulting data set using the following param-eters: d 200, epochs 200, lr 0.0001, batchsize 128,dropout 0.5, lossweight 0.0001. We use a negativesampling rate of 0.25 and minimise the loss usingadam optimiser.

The trained model is then used to build a setof ranked paths between the source and target en-tity pairs as described in section 3.2. What we test

2https://storage.googleapis.com/pubmed-path-kg/pubmed exp data.zip

here is whether the model is able to recover thehighest ranking path p1 from the ground truth foreach entity pair in the ranked list of paths. Table4 shows Hits@n (1,10,25,100) which is the num-ber of times p1 was ranked in the top n rankedpaths out of all entity pairs for scoring func-tions Embed+predrank and Embed+cosine and re-ports MAP for the n ranked paths retrieved. Wecompare the result with a baseline in which webuild paths by choosing the most similar entity tothe source and target entities by cosine similarityand rank them based on the total score obtained.Though it still uses the learned embedding to com-pute similarities it does not consider any predic-tions made by the model. Column 1 refers to thenumber of top n predictions (100,300 and 500)that we request from the KBC model.

According to results in Table 4, Embed+cosinescoring function yields better Hits@n andMAP@n scores than Embed+pred rank in gen-eral. When more predictions are requested fromthe KBC model, we observe an increase inthe Embed+cosine scores for both Hits@n andMAP@n. Increasing the number of predictionsin the KBC model, allows for more entities tobe predicted which in turn increases the depth ofpath search, therefore leading to a better chancethat p1 could be retrieved for Hits@n and morecorrect paths for MAP@n. When using the top500 predictions, the model is able to recover p1 ontop 26% of the time, within the top 10, 60% of thetime and within the top 25, 67% of the time. Thescores are even higher when more ranked pathsare extracted according to Hits@100. Within 100and 300 predictions the model is able to achieve34% and 49% MAP@10 and 31% and 45%MAP@25 using Embed+cosine scoring. Whencompared to the baseline we see the scores forboth Hits@n and MAP@n obtained by our pathbuilding and ranking method is quite significant.Note that the baseline does not depend on thenumber of predictions requested by the model andhence it remains same for all three settings.

5.2 Path ranking

In this experiment we want to test how good themodel can rank known paths, that consists ofknown links between entities in a graph. In orderto do this we retrieve the rank for known paths inthe ground truth as given by the model. We traina KBC model with ProjE again, but this time the

168

# Pred Scoring Hits MAP Hits MAP Hits MAP Hits MAP@100 @100 @25 @25 @10 @10 @1 @1

100 Embed+predrank 0.36 0.13 0.22 0.14 0.1 0.15 0 0.07Embed+cosine 0.36 0.27 0.27 0.31 0.23 0.34 0.07 0.26Baseline 0.01 0 0 0 0 0 0 0



Table 4: Hits@n and MAP@n scores for Embed+predrank, Embed+cosine scoring functions comparing to baselinefor top 100, 300, 500 predictions given by KBC model.

model is trained on the whole data set. For eachentity pair in the same test set, the model producesa set of ranked paths from which we compute therelative ranking of the known paths in the groundtruth. For example if entity pair (e1, e2) has 15ranked paths, and our model produced 25 rankedpaths in total, we compute the relative ranking ofonly the 15 paths for this test. Our test statisticis spearman ranking correlation between the ranksretrieved by the model and the ground truth. Thesame test is repeated after randomly permutingthe paths in the ground truth. We also computethe ranking correlation between the randomly per-muted set and the actual ground truth. The groundtruth data set used for this experiment has at least10 ranked paths for each entity pair and on aver-age has 33 paths per pair as shown in Table 3. Weincreased the number of predictions given by theKBC model to 1500 in this experiment because itincreases the depth of path search and retrieves allknown paths in the ground truth. This was purelydone for evaluation purposes since its importantto compare the rankings of all paths in the groundtruth.

Table 5 shows the Spearman ranking correla-tion (rs) for Rp-Rt ranks of paths in ground truth(Rt) and their relative ranking in predicted paths(Rp), Rp-Rr ranks of randomly permuted paths(Rr) and their relative ranking in predicted paths(Rp), Rt-Rr ranks of paths in ground truth (Rt)and their relative ranking in randomly permutedpaths (Rr). We show rs observed for all entitypairs (100) in row 1 and for the ones with at least20 or more paths (67) in row 2. rs decreases withranking complexity as we can see but in both casesrs with the ground truth is better compared to rs

# Paths Coverage Rank pair rs≥10 100 Rp-Rt 0.45

Rp-Rr 0.24Rt-Rr 0.26

≥ 20 67 Rp-Rt 0.37Rp-Rr 0.13Rt-Rr 0.12

Table 5: Spearman ranking correlation (rs) for Rp-Rt

ranks of paths in ground truth and their relative rankingin predicted paths, Rp-Rr ranks of randomly permutedpaths and their relative ranking in predicted paths, Rt-Rr ranks of paths in ground truth and their relativeranking in randomly permuted paths.

with the permuted set showing almost 3-fold (2.8)increase in the most difficult case. This proves thesignificance of the model’s ranking capability.

The ground truth data set used for this experi-ment is quite scarce and not very well distributedin terms of ranking. Since ranks are generatedbased on frequency of the paths observed, pathswith same frequency get similar rankings. This isa limitation in the data set. We expect the methodto perform better in cases where there exists largernumber of paths with an evenly distributed rank-ing. This is left for future work.

Figure 1a plots the ranks of known paths inground truth vs their relative rank in predictedpaths and Figure 1b plots the ranks of randomlypermuted paths in the ground truth vs their rela-tive rank in predicted paths for all entity pairs withat least 20 or more paths reported in row2 of Table5. Points in Figure 1a are mostly correlated andshows a huge tail towards the bottom left indicat-ing the highest ranking paths are well correlatedbetween known and predicted ranks. Figure 1b is

169

(a)

(b)

Figure 1: (a) Ranks of known paths in ground truthvs their relative rank in predicted paths, (b) Ranks ofrandomly permuted paths in the ground truth vs theirrelative rank in predicted paths for all entity pairs withat least 20 or more paths.

quite scattered compared to 1a as indicated by re-sults in Table 5.

6 Conclusion & Future Work

In this paper we proposed a simple method to au-tomatically build and rank paths between a sourceand target entity pair using embedding basedknowledge base completion (KBC) models. Toour knowledge this is the first paper that is fo-cused on trying to use path ranking to identifyrelevant entities bridging a pair of known entitiesand hence not directly comparable with other ap-proaches. To this purpose we built a data set thatallow us to test our hypothesis. We demonstratedthat our method is able to effectively rank knownpaths if available and also infer important miss-ing links between entities during the path buildingprocess, ranking them high when they are signif-icant. Any embedding based KBC model can beused after initial training allowing for more flexi-bility and also less overhead. The number of pathsbuilt between a source and target is dependent onthe number of predictions requested from the KBCmodel as the search space for paths increase withthis parameter.

As future work we plan to build a benchmark

data set for multi-hop paths with a smooth dis-tribution of ranked paths, evaluate the same ap-proach and also perform an extensive evaluationwith other state-of-the-art KBC models based onconvolutional networks and matrix factorisationfor path construction and ranking. We intend toalso work on extending our method to use contex-tualised embedding representations to make bet-ter use of path information in the graph which webelieve will have a positive impact on the currentpath building and ranking methodology.

ReferencesAntoine Bordes, Nicolas Usunier, Alberto Garcia-

Duran, Jason Weston, and Oksana Yakhnenko.2013. Translating embeddings for modeling multi-relational data. In Advances in neural informationprocessing systems, pages 2787–2795.

Rajarshi Das, Shehzaad Dhuliawala, Manzil Za-heer, Luke Vilnis, Ishan Durugkar, Akshay Kr-ishnamurthy, Alex Smola, and Andrew McCal-lum. 2017a. Go for a walk and arrive at the an-swer: Reasoning over paths in knowledge basesusing reinforcement learning. arXiv preprintarXiv:1711.05851.

Rajarshi Das, Arvind Neelakantan, David Belanger,and Andrew McCallum. 2017b. Chains of reason-ing over entities, relations, and text using recurrentneural networks. In Proceedings of the 15th Con-ference of the European Chapter of the Associationfor Computational Linguistics: Volume 1, Long Pa-pers, pages 132–141, Valencia, Spain. Associationfor Computational Linguistics.

Matt Gardner, Partha Talukdar, Jayant Krishnamurthy,and Tom Mitchell. 2014. Incorporating vector spacesimilarity in random walk inference over knowledgebases. In Proceedings of the 2014 Conference onEmpirical Methods in Natural Language Processing(EMNLP), pages 397–406, Doha, Qatar. Associationfor Computational Linguistics.

Matt Gardner, Partha Pratim Talukdar, Bryan Kisiel,and Tom Mitchell. 2013. Improving learning and in-ference in a large knowledge-base using latent syn-tactic cues. In Proceedings of the 2013 Conferenceon Empirical Methods in Natural Language Pro-cessing, pages 833–838, Seattle, Washington, USA.Association for Computational Linguistics.

Kelvin Guu, John Miller, and Percy Liang. 2015.Traversing knowledge graphs in vector space. InProceedings of the 2015 Conference on EmpiricalMethods in Natural Language Processing, pages318–327, Lisbon, Portugal. Association for Compu-tational Linguistics.

Ni Lao and William W Cohen. 2010. Fast query execu-tion for retrieval models based on path-constrained

170

random walks. In Proceedings of the 16th ACMSIGKDD international conference on Knowledgediscovery and data mining, pages 881–888. ACM.

Ni Lao, Tom Mitchell, and William W. Cohen. 2011.Random walk inference and learning in a large scaleknowledge base. In Proceedings of the Conferenceon Empirical Methods in Natural Language Pro-cessing, EMNLP ’11, pages 529–539, Stroudsburg,PA, USA. Association for Computational Linguis-tics.

Xi Victoria Lin, Richard Socher, and Caiming Xiong.2018. Multi-hop knowledge graph reasoning withreward shaping. arXiv preprint arXiv:1808.10568.

Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu,and Xuan Zhu. 2015. Learning entity and relationembeddings for knowledge graph completion. InTwenty-ninth AAAI conference on artificial intelli-gence.

Arvind Neelakantan, Benjamin Roth, and Andrew Mc-Callum. 2015. Compositional vector space mod-els for knowledge base completion. In Proceedingsof the 53rd Annual Meeting of the Association forComputational Linguistics and the 7th InternationalJoint Conference on Natural Language Processing(Volume 1: Long Papers), pages 156–166, Beijing,China. Association for Computational Linguistics.

Maximilian Nickel, Volker Tresp, and Hans-PeterKriegel. 2011. A three-way model for collectivelearning on multi-relational data. In ICML, vol-ume 11, pages 809–816.

Richard Socher, Danqi Chen, Christopher D Manning,and Andrew Ng. 2013. Reasoning with neural ten-sor networks for knowledge base completion. InAdvances in neural information processing systems,pages 926–934.

Weiping Song, Zhijian Duan, Ziqing Yang, HaoZhu, Ming Zhang, and Jian Tang. 2019. Ex-plainable knowledge graph-based recommendationvia deep reinforcement learning. arXiv preprintarXiv:1906.09506.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoi-fung Poon, Pallavi Choudhury, and Michael Gamon.2015. Representing text for joint embedding of textand knowledge bases. In Proceedings of the 2015Conference on Empirical Methods in Natural Lan-guage Processing, pages 1499–1509.

Theo Trouillon, Johannes Welbl, Sebastian Riedel, EricGaussier, and Guillaume Bouchard. 2016. Com-plex embeddings for simple link prediction. In In-ternational Conference on Machine Learning, pages2071–2080.

Haoyu Wang, Vivek Kulkarni, and William YangWang. 2018. Dolores: Deep contextualized knowl-edge graph embeddings. CoRR, abs/1811.00147.

Xiang Wang, Dingxian Wang, Canran Xu, XiangnanHe, Yixin Cao, and Tat-Seng Chua. 2019. Explain-able reasoning over knowledge graphs for recom-mendation. In Proceedings of the AAAI Conferenceon Artificial Intelligence, volume 33, pages 5329–5336.

Wenhan Xiong, Thien Hoang, and William YangWang. 2017. Deeppath: A reinforcement learn-ing method for knowledge graph reasoning. arXivpreprint arXiv:1707.06690.

171


Node Embeddings for Graph Merging: Case of Knowledge GraphConstruction

Ida Szubert Mark SteedmanSchool of Informatics

University of EdinburghEdinburgh, Scotland, UK

[email protected], [email protected]

Abstract

Combining two graphs requires merging thenodes which are counterparts of each other.In this process errors occur, resulting in incor-rect merging or incorrect failure to merge. Wefind a high prevalence of such errors when us-ing AskNET, an algorithm for building Knowl-edge Graphs from text corpora. AskNET nodematching method uses string similarity, whichwe propose to replace with vector embeddingsimilarity. We explore graph-based and word-based embedding models and show an overallerror reduction of from 56% to 23.6%, with areduction of over a half in both types of incor-rect node matching.

1 Introduction

The work we present here is an extension of thegraph building algorithm of AskNET (Harringtonand Clark, 2008). The overall task we consideris to automatically extract information from textand integrate it into one resource, a knowledgegraph. In this paper we focus on one aspect of theAskNET graph building algorithm – the process ofmerging document-level graphs into a corpus levelone. We propose improvements which reduce therates of two common and non-trivial errors whicharise when matching nodes between two graphs.

Merging two graphs, with an assumption thatthey potentially represent overlapping informa-tion, requires node matching. We need to decidewhich nodes should be merged, and this can re-sult in two types of errors: merging two nodeseven if they do not represent the same thing, or notmerging nodes which do. As an example, Figure1 shows a sample graph representing informationfrom a news article and a new sentence togetherwith its own graph. In merging the two graphswe could correctly decide to match and mergethe Williams node to the Rowan Williams one. Ifwe matched to Justin Welby we would merge the

Figure 1: A graph representing a news article about Arch-bishop appointment and a sentence to be integrated into thegraph. Green dotted arrow shows correct node matching, thered match leads to spurious merging, and the blue arrow andnode represent no match found and a spurious addition.

nodes incorrectly, resulting in conflation of infor-mation about two different people. If we did notfind a match at all the result would be the pres-ence of two nodes representing the same person.We call the first error spurious merging, and thesecond spurious addition.

Both of those problems occur in graphs builtwith the AskNET algorithm, which uses stringsimilarity as a tool for node matching. We proposeto modify the algorithm with using vector embed-dings instead. We explore word-based and graph-based ways of obtaining graph node embeddingsand show that both lead to reduction in mergingand addition errors.

2 Methods

In this paper we explore a modification of theAskNET algorithm. We provide an overviewof the KG-building approach of (Harrington andClark, 2008), and a more detailed explanation ofthe step which we improve upon.

172

2.1 AskNET

The AskNET system uses a semantic parser(Boxer (Bos, 2005)) to obtain graph representationof sentential content in a large corpus and incre-mentally combines them, first into document levelgraphs (DG), and then into one semantic network,which we are calling a knowledge graph (KG)1.The central challenge in the process of graphmerging is, as described before, node matching.In AskNET the matching problem is approachedwith an iterative message passing (aka activationspreading) procedure, the intuition behind whichis to find, for a node nDG from DG, a node nKG

in the KG such that the neighbourhoods of nDG

and nKG in their respective graphs are similar.Given a node nDG, the algorithm selects a can-

didate set CKG of KG nodes such that it likelycontains the correct matching nKG. Afterwardsin an iterative procedure for each of the DG nodesthe scores of each element of CKG are updateduntil convergence. The pre-selection step is cru-cial for ensuring time and space efficiency of thealgorithm.However it also means that if the qual-ity of CKG is a limiting factor for the correctnessof the output. If the correct match is not presentin CKG, either spurious merging or addition willresult. Moreover, errors will also arise when CKG

does contain the correct match, but it is easily con-fused with the rest of the set members.

As Harrington (2009) recognizes, spurious ad-dition and merging are challenging errors to fix,and the AskNET algorithm does not attempt it. Inour implementation of AskNET, only about 44%of nodes are correctly matched, with the rest di-vided between spurious addition (24%) and merg-ing(32%). We approach this issue by improvingthe candidate node sets with the aim of minimiz-ing the problems. The task we consider is thus:given a KG, a DG, and node nDG, produce a setCKG such that it (1) contains the true match and(2) minimizes merging and addition errors. In theapproach of (Harrington and Clark, 2008) the se-lection is based on approximate string matchingbetween the name associated with nDG and thenames associated with the KG nodes. This sim-ilarity measure is not reliable in contexts where

1Semantic parses used in AskNET follow a version ofDiscourse Representation Theory as implemented by theBoxer parser; in our replication we represent the informationcontained by a sentence with a set of entity-relation triples,which express all the binary relations involving at least onenamed entity

one entity can be called by various names and ti-tles, as is especially common in news text in whichrepetitions are avoided. In our evaluation, onlyabout 55% of candidate sets generated with stringsimilarity actually contain a correct match. Ourcontribution is investigating the use of vector em-beddings for candidate node selection. We showthat both word-based and graph-based embeddingmodels provide a better notion of similarity, andthat combining them brings optimal benefits.

2.2 Dataset

For our experiments we use a subsection of theNewsSpike corpus (Zhang and Weld, 2013). Forpurposes of efficiency and ease of evaluation wedecided to build a KG of around 15k nodes (sizein line with the FB15k dataset (Bordes et al., 2013)popular in graph embedding literature), which weachieved by selecting 127,221 documents fromthe NewsSpike corpus. We preprocessed theNewsSpike corpus using the DBPedia entity linker(Nguyen et al., 2014) which enabled us to identifythe most frequently occurring named entities. Wechose top n named entities such that when a KGis created of all the documents mentioning those nentities, the graph has approximately 15k nodes.2

Documents in our dataset are relatively short, av-eraging 338 words. The average number of namedentities per document is 10.3. We held out 10 ran-domly selected documents each for developmentand test sets, and the rest forms the training set.

2.3 Base KG

The KG used in our experiments is built out of thetraining set documents using the original AskNETalgorithm. To obtain graphs representing indi-vidual sentences we use the semantic parser of(Reddy et al., 2014) and extract binary relationsfrom its output. We only use relations which in-volve at least one named entity. The nodes inthe graph are labeled with a set containing allstrings from the original documents which havebeen linked to that node (e.g. a set containingRowan Williams, Williams, Rev. Williams, Arch-bishop of Canterbury). The edges are labeled withrelation names, where the relation set is open andcan include any two place predicate used in thesource documents.

2The graph also includes nodes representing non-namedentities.

173

2.4 Graph-based embeddings

The first way of generating DG and KG node em-beddings is using a graph embedding method. Wedecided to use the GraphSAGE model of (Hamil-ton et al., 2017) which generates node representa-tions by taking into account the features of eachnode and the structure of it’s neighborhood. Thismodel, as compared to numerous recent graph em-bedding models, is particularly well suited to ouruse case. It can be trained with an unsupervisedobjective, once trained can produce embeddingsfor unseen nodes and even nodes in unseen graphs,and leverages node features, e.g. text attributes.We train GraphSAGE, in an unsupervised setting3

and mean aggregators, on the base KG. The initialnode features are node degree and mean of one-hot encoding of the node labels. The model learnsa set of aggregator functions, which not only gen-erate final embeddings for the KG nodes, but alsofor DG nodes in the development and test sets.

2.5 Word-based embeddings

Another approach to obtaining node embeddingsis through word embeddings. We make use ofof the ELMo model for deep contextualized wordrepresentation (Peters et al., 2018). We representa node in a KG as an average of the word em-beddings of every entity that has been resolved tothat node during graph building. In other words,given a document we generate ELMo embeddingsfor every mention of every entity4, and in the DGa node representing an entity is assigned an em-bedding which is an average of all the mentionsof that entity. When nodes are merged during in-tegration of the DG into the KG, the embeddingof the KG node is updated so that it is always anaverage of the embeddings of all document-levelnodes resolved to that KG node. For the purposesof our experiments we perform this embedding ag-gregation when building the base KG so that em-beddings for all of its nodes are available. How-ever, we do not use those embeddings to aid in thebuilding process.

We use the original large pre-trained ELMomodel (ELMo large), which has been trained on

3with the objective of maximizing the similarity of closeby nodes and minimizing that of distant nodes, where thecloseness is determined by co-occurrence on fixed-lengthrandom walks

4Because of pre-processing the documents in our corpuswith an entity linker we know which word sequences consti-tute entity names and we treat them as single lexical item.

the same genre as our corpus. For the purposesof meaningful comparison with the graph-basedembeddings, which cannot be pre-trained, we alsotrain an ELMo model on our corpus only (ELMosmall and exeriment with the resulting embed-dings.

2.6 Hybrid embeddings

We expect that the two methods might providecomplementary benefits: GraphSAGE make useof the structure of the KG, and ELMo accumulateinformation from the original texts, including in-formation which is not present in the KG. We pro-pose to combine them by using the GraphSAGEmodel with initial node features being node degreeand the ELMo large embedding of the node.

3 Experiments

We propose to find the set CKG by evaluatingembedding similarity, rather than string similar-ity, between embeddings of the nDG and all KGnodes. We treat the string-similarity method as abaseline. We expect relying on embedding simi-larity to result in better candidate sets and in re-duced number of merging and addition errors.

Regardless of the embedding method, candidateselection requires us to pick k closest neighboursfor a given node from all of the nodes of the KG.To do that we use random projection-based ap-proximate nearest neighbour search algorithm im-plemented in the Annoy library5. For candidate se-lection in the original string-based method we usethe SimString library which performs approximatestring matching according to the method proposedin (Okazaki and Tsujii, 2010), and we define thesimilarity between nd and nKG as the edit distancedivided by the length of the shorter of the names.We use the development set to set k to 7 and stringsimilarity threshold to 0.36.

For each test document, we build a DG using theoriginal AskNET method. Each node in that graphis associated with a set of names and ELMo largeand ELMo small embeddings. Using the Graph-SAGE and hybrid models trained on the base KGwe assign each node in DG a GraphSAGE andhybrid embeddings. Then, for each DG node wefind five CKG sets, one for each method, and runthe rest of the AskNEt algorithm for each set to

5https://github.com/spotify/annoy6We also use the development data to set various AskNET

hyperparameters not discussed in this work

174

errorsgood CKG correct merging addition

baseline 54.5 44.2 24.2 31.6GraphSAGE 67.3 61.8 18.2 20.0ELMo small 63.6 55.1 19.4 25.5ELMo large 71.5 64.2 13.9 21.9

hybrid 82.4 76.4 10.9 12.7

Table 1: Node matching results: percentage of CKG con-taining a correct match (higher is better); percentage of testnodes correctly resolved (higher is better), spuriously mergedwith some KG node and spuriously added to KG as a newnode (lower is better)

recover the one (or none) final match for eachmethod.

3.1 Evaluation

Evaluation of information graph building methodsis inherently challenging in the absence of gold-standard KGs. In our experimental setting the baseKG is automatically constructed and as such isnoisy. We then ask how well are different meth-ods capable of matching DG nodes to the nodesin the noisy KG. There is no ground truth, andso our evaluation relies on a human annotator in-specting the neighbourhoods of the candidate KGnodes and making a judgment about how likely isit that they are a match to the DG node. When theannotator finds no likely matches in the set, theysearch the KG (by keywords, names, etc.) to ascer-tain whether likely matches are present in the KGat all. The annotator has access to the test set docu-ments, document-level graphs, and can inspect theKG.

For each node in a DG in the test set, we man-ually evaluate the candidate sets produced by thefive methods (baseline, GraphSAGE, ELMo large,ELMo small, hybrid) and the one (or none) nodereturned by the AskNET algorithm as the finalmatch. There are 165 nodes in the 10 test set doc-uments.

We evaluate the candidate sets by manually as-sessing whether any of the candidates can be rea-sonably considered to be a match, and if so the setis deemed to be correct. For the final match de-cision we perform a 3-way classification: correctKG node selected or correct in choosing no KGnode; incorrect KG node selected and leading tospurious merging; or incorrectly selecting no KGnode and leading to spurious addition.

4 Results

Overall, both graph- and word-based embeddingsimilarity outperform string similarity as a candi-

date node selection tool, and the most benefit is tobe had by using a hybrid model. Both the percent-age of good candidate sets and the percentage ofcorrect matches are significantly increased. More-over, the difference between the two measures issmaller for our proposed methods than for base-line, indicating that our candidate sets are less con-founding for the AskNET algorithm.

As can be seen in table 1 word- and graph-based embedding differ in what kind of confound-ing candidates they introduce to the sets alongsidethe correct match. This is reflected in the differentrates of spurious merging and addition they resultin. GraphSAGE reduces addition more then merg-ing. Given a node nDG, using GraphSAGE weselect KG nodes whose neighbourhood is similar,in terms of features and structure, to that of nDG.This makes it more likely that the correct matchis in CKG, thus reducing spurious addition. How-ever, the subsequent steps of the AskNET algo-rithm also rely on estimating the similarity of theneighborhoods which means that the candidatesselected using this method are easy to confuse.

An opposite tendency can be observed forELMo embeddings, where CKG includes entitiesmentioned in similar textual context as nDG. Ifthe correct match is present in the set, it is eas-ier for the AskNET algorithm to identify it usingthe neighbourhood information, because the can-didates are likely to have diverse neighbourhoods.Therefore merging mistakes are significantly re-duced.

5 Conclusions

We show that modern context-aware word embed-dings and graph-based embeddings can both beused, separately or in conjunction, to improve KGbuilding from text. Our experiments show that wecan reduce both of the two difficult problems withresolving entities to KG nodes: spurious mergingof separate entities into one node and spurious ad-dition of nodes for entities which are already rep-resented in the KG. We explore how different em-bedding methods are better suited to solving oneof the problems than the other, and how combin-ing them provides synergistic effects. While weexplore the proposed methods from the point ofview of KG creation, the same methods could beused for other KG-based tasks, e.g. question an-swering.

175

ReferencesAntoine Bordes, Nicolas Usunier, Alberto Garcia-

Duran, Jason Weston, and Oksana Yakhnenko.2013. Translating embeddings for modeling multi-relational data. In Advances in neural informationprocessing systems, pages 2787–2795.

Johan Bos. 2005. Towards wide-coverage semantic in-terpretation. In Proceedings of Sixth InternationalWorkshop on Computational Semantics IWCS, vol-ume 6, pages 42–53.

Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017.Inductive representation learning on large graphs.In Advances in Neural Information Processing Sys-tems, pages 1024–1034.

Brian Harrington. 2009. ASKNet: automatically creat-ing semantic knowledge networks from natural lan-guage Text. Ph.D. thesis, Oxford University.

Brian Harrington and Stephen Clark. 2008. Asknet:Creating and evaluating large scale integrated se-mantic networks. International Journal of SemanticComputing, 2(03):343–364.

Dat Ba Nguyen, Johannes Hoffart, Martin Theobald,and Gerhard Weikum. 2014. AIDA-light: High-Throughput Named-Entity Disambiguation. InWorkshop on Linked Data on the Web, pages 1–10,Seoul, Korea.

Naoaki Okazaki and Jun’ichi Tsujii. 2010. Sim-ple and efficient algorithm for approximate dictio-nary matching. In Proceedings of the 23rd Inter-national Conference on Computational Linguistics,pages 851–859. Association for Computational Lin-guistics.

Matthew E Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and LukeZettlemoyer. 2018. Deep contextualized word rep-resentations. arXiv preprint arXiv:1802.05365.

Siva Reddy, Mirella Lapata, and Mark Steedman. 2014.Large-scale semantic parsing without question-answer pairs. Transactions of the Association forComputational Linguistics, 2:377–392.

Congle Zhang and Daniel S Weld. 2013. Harvest-ing parallel news streams to generate paraphrases ofevent relations. In Proceedings of the 2013 Con-ference on Empirical Methods in Natural LanguageProcessing, pages 1776–1786.

176


DBee: A Database for Creating and Managing Knowledge Graphs andEmbeddings

Viktor Schlegel, Andre FreitasDepartment of Computer Science

University of Manchesterviktor.schlegel,[email protected]

Abstract

This paper describes DBee, a database tosupport the construction of data-intensive AIapplications. DBee provides a unique datamodel which operates jointly over large-scaleknowledge graphs (KGs) and embedding vec-tor spaces (VSs). This model supports querieswhich exploit the semantic properties of bothtypes of representations (KGs and VSs). Addi-tionally, DBee aims to facilitate the construc-tion of KGs and VSs, by providing a library ofgenerators, which can be used to create, inte-grate and transform data into KGs and VSs.

1 Introduction

Many AI tasks can be summarised into the cy-cle of collecting data, overlaying a representation(schema) on the top of the data and performinglearning and inference algorithms, which will even-tually produce new data or extend the representa-tion. While in many cases learning and inferenceare put at the centre of the stage, managing the dataand the supporting representations are fundamentalparts of the design and delivery of an AI system.

Currently, the prevalence of workflow architec-tures for many types of AI systems reflects theemphasis on learning and inference, where datamanagement becomes a secondary concern. How-ever, complex AI tasks such as Question Answer-ing (QA) (Kumar et al., 2016), Text Entailment(Hashimoto et al., 2016) or Natural Language Infer-ence are either directly dependent on or can benefitfrom the construction of supporting KnowledgeBases.

Recently, latent and explicit semantic represen-tations are emerging as fundamental elements forsupporting those tasks, due to their dependencyon commonsense and domain specific knowledge.Moreover, the recent rise of successful approachesoperating at the neuro-symbolic representationlevel (Parisotto et al., 2016; Liang et al., 2016),demands for a closer dialogue between explicit and

latent models. Word embeddings (Mikolov et al.,2013) and Knowledge Graphs (lexico-semanticgraphs) (Bollacker et al., 2008) are becoming thede-facto representation models within different AItasks. Moreover, they have complementary proper-ties, where word embeddings provide more coarse-grained semantics which are complemented by thefine-grained semantics of KGs being commonlyused in coordination (Silva et al., 2018; Xie et al.,2017).

This paper describes DBee, a database for cre-ating, querying and consuming embeddings andknowledge graphs. DBee aims to be a databasedesigned for satisfying recurring demands from AIapplications. DBee provides a seamless layer tojointly query knowledge graphs and embeddings,simultaneously exploiting the semantic propertiesof both resources, taking into account performanceand scalability aspects. At the centre of the pro-posed database is the goal of bridging the gap be-tween data, representation, learning and inferencealgorithms, where classifiers and extractors directlyinterface with the schema. By design, DBee pro-vides a declarative layer for data and representationmanagement in AI systems. Finally, DBee also sup-ports the combination of different models and rep-resentations (cross-model and cross-representationqueries) and their customisation.

In the following sections of the paper we mo-tivate our approach with an initial scenario, dis-cuss background and and related work, describethe proposed framework, present the implementedsystem by instantiating it for archetypal use casesand conclude with a discussion outlining the ex-pected performance, hardware requirements andcurrent limitations of the system.

2 Motivational Scenario

An AI application engineer wants to build a QA sys-tem to support investors in NASDAQ companies.Most of the data relevant for this task such as finan-

177

Figure 1: Architecture and Workflow overview: the overall architecture (blue) supports the implementation of themotivational scenario (green).

cial reports, blog articles and recent news only existin the form of unstructured text. The engineer alsoanticipates the benefits of integrating structuredKnowledge Graphs such as DBpedia (Auer et al.,2007), integrating KGs to the textual data sources.Realizing the importance of his application to beable to generate traceable and explainable answers,he decides to use an explicit internal representa-tion, such as the graph-based RDF-NL (Cetto et al.,2018). With the associated chain of classifiers andextractors available at DBee, he performs open-information extraction (OIE), Coreference Reso-lution (CR), Entity Linking (EL) and RhetoricalStructure Classification (RSC) to obtain the graphfrom a chosen set of documents. After the extrac-tion, the graph is indexed to ensure efficiency fordifferent types of queries over the graph represen-tation. In order to support semantic approximationduring the queries, he associates two pre-trainedword embedding models to the KG using the DBeeAPI and uses the provided set of query primitives toquery the knowledge graph. Deciding to use triplesfrom the KG as features for a neural stock predictormodel, he uses the DBee API to create input andanswer sets (for a set of pre-defined queries) readyto be consumed by the automatic differentiationframework of his choice.

3 Background & Related Work

Current machine learning systems such as Keras(Chollet et al., 2015), and PyTorch (Paszke et al.,2017) focus mostly on exposing their user to thedefinition of neural architectures, abstracting awaythe computation details of automatic differentia-tion - or trying to learn even those (Jin et al., 2018)- with TensorFlow (Abadi et al., 2016) being themost complete suite providing assistance from datastreaming, over training to model serving. Ourapproach can be seen as complementary to theseefforts since we aim to provide the infrastructure toextract, represent and query structured and unstruc-tured data (with an emphasis on KGs from text andassociated embeddings).

Early efforts in a similar direction include (Saleset al., 2018), that present a uniform service-basedAPI for storing, querying and comparing word em-beddings, pre-computed with varying models andon different datasets. Another information manage-ment tool for unstructured data is Apache UIMA1.

Contemporary machine comprehension systemsbased on neural architectures have targeted evalu-ation settings which have limited document scale(e.g. SQUAD (Rajpurkar et al., 2016)).

Different works explored the connection be-

1http://uima.apache.org

178

tween distributional semantics and structuredKnowledge Graph representations in the contextof semantic parsing over large-scale RDF graphs(Freitas and Curry, 2014; Freitas, 2015; Sales et al.,2016) and approximative abductive reasoning overcommonsense KBs (Freitas et al., 2014, 2013).Comparatively, DBee focuses on explicit seman-tic representation models (Knowledge Graphs) ex-tracted from text.

4 Proposed Framework

To satisfy the emerging need to work with unstruc-tured text representations, as depicted in the intro-ductory part of this work, we propose a frameworkthat supports the extraction and management ofboth explicit and latent text representation modelsand facilitates the integration with downstream ma-chine learning based models. DBee was designedto deliver the following features:

1. Bridging the gap between unstructureddata and semantic representations: Con-forming data into latent and explicit text repre-sentations is a primary requirement for manyAI applications. DBee allows users to cre-ate, reuse and compose a library of text ex-tractors and classifiers which will be used tostructure and integrate existing unstructureddata. The library includes standard represen-tation generators such as syntactic and lexicalparsers, open information extractors, namedentity recognisers and linkers and discourse-level extractors.

2. Multi-representation model: DBee sup-ports users in experimenting with differenttypes of explicit and latent semantic represen-tations and models. Different tasks will re-quire different types of representation. Usersshould be able to query across multiple repre-sentations.

3. Expressive structured queries and ML in-tegration: To give its users fine-grained con-trol over the data and to overlay their ownmachine learning algorithms, DBee featuresan intuitive query language and seamless inte-gration with existing machine learning algo-rithms.

4. Extensibility: Representation schemas andtheir supporting generators are extensible andcustomisable.

5. Scalability: Operating over large-scale datasources, large knowledge graphs and em-beddings require principled query processingstrategies. DBee inherits indexing strategiesfrom databases and kNN embedding queriesin order to support scaling to large datasets,memory footprints and storage space require-ments.

Figure 2: Initial data model

5 The DBee Model

5.1 Data Model

DBee operates over two types of representation:knowledge graphs and word embeddings.

The underlying knowledge graph data modeluses RDF-NL, an extension of the RDF (Lassilaand Swick, 1999) data model suitable to representtext as a lexico-semantic Knowledge Graph. RDF-NL is built upon a sentence representation modelproposed by (Niklaus et al., 2017, 2019, 2018)which splits complex sentences into simpler linkedclausal and phrasal elements, later splitting theseelements into predicate-argument structures.

The graph data model (Figure 2 (a)) is defined bya subject-predicate-object (SPO) triple which canhave contextual relations (C) as reifications or canbe linked to other SPO triples. Contextual links canbe named. This data model supports the creationof versatile sparse graph representations. For ex-ample, the data model smoothly captures linguisticpredicate-argument structures and phrasal (e.g. ap-positive), clausal (coordination and subordination),rhetorical and argumentation relations. Figure 3shows an example of a concrete knowledge graphextracted from a sentence.

All SPO nodes are defined by their lexical real-isation (typically a text chunk) and can be linkedto a canonical identifier in the entity componentof the data model (Figure 2 (b)), which allows an

179

Figure 3: Example generator chain output of the sentence “Asian stocks fell anew and the yen rose to sessionhighs in the afternoon as worries about North Korea simmered, after a senior Pyongyang official said the U.S. isbecoming ”more vicious and more aggressive” under President Donald Trump .”

entity-centric data integration, such as it is per-formed by co-reference resolution, entity linkingor word-vector clustering.

The data model is materialised into differenttypes of supporting indexes in order to enable ef-ficient and scalable query processing. There aretwo main types of indexes associated with the datamodel:

• Embedding Indexing (EI): Supportsthe efficient querying of embedding spaces (k-NN similarity queries). By default it uses therandom projections of the locality-sensitivehashing method proposed by (Charikar,2002).

• Knowledge Graph LexicalIndexing (T I): Supports information-retrieval style keyword search queries overthe KG structure using inverted indexes andassociated weighting schemes (by default,TF-IDF is used).

5.2 Operations

At the centre of the DBee data model is the abilityto build, transform and combine KGs and VectorSpaces/Embeddings (VSs).

Different KGs and VSs can be combined us-ing a view allowing support of querying specificcompositions. Projections (~π) are operators whichbuild VSs (embeddings) from KGs and unstruc-tured datasets.

On the top of the views, query operators are de-fined. View, projections and queries can be chainedtogether. If called in the middle of a function chain,

these functions serve the purpose of a join in asense similar to relational databases. This meansviews and projections later in the operation chainwill only operate on the subset of results satisfyingthe query defined in the chain so far. This behaviouris visualized in Figure 5.

The domain-specific query language (DSL) asso-ciated with DBee includes the following functions:

• query(term, n): This operation re-trieves up to n best matching candidates fromthe view/projection it is being executed uponwith respect to its type. For a projection, forexample, it retrieves n nearest neighbours re-garding their embeddings, after embeddingthe query term using the projection space’scorresponding embedding function.

• filter(attribute=value | bgp| conditional statement): Thisoperation filters are selection operators(σ) for predicates defined as the function’sparameters. They can be defined as anattribute=value form or with the help of basicgraph patterns).

• rank(wrt): This function can be used torank a set of results with respect to a giventerm by their distance to it in the correspond-ing vector space.

• top(n), count(): Aggregation operatorswill retrieve the top results in up to a givenlimit or count them up, respectively. Provideda name, the result set will consist of attributesof this name.

180

• create view(name): Creates a new viewwith a given name from the current result set.

• create projection(name, using,features): Similarly, creates a newprojection using a given embedding functionby extracting the features from every resultin the set. Features might be defined simplyby providing a list of attribute names to useor any callable operation to extract customfeatures.

5.3 Representation Generators (g) & Chains(c)

Representation elements have associated genera-tors g, which are classifiers, extractors and linkerswhich operate over Data, KGs or VSs.

The generators are stored into libraries, typed ac-cording to their representation function and taggedwith the model metadata (such as training corpusand evaluation score, architecture and hyperparam-eter configuration). Generators can be composedusing generator chains. For example, generatinga KG from textual data would typically employthe chain: gCR gEL gOIERDF−NL. As with thegenerators, chains can be named and persisted intolibraries.

Generators can also be associated with vectorrepresentations, e.g. gV SW2V . The set of pre-definedgenerators currently present at DBee are describedin Table 1.

Figure 1 summarises the main primitives of thesystem depicting a schematic high-level compo-nents diagram of DBee.

Concretely, we propose a pipeline with the fol-lowing steps: First, using contextualised open in-formation extraction (Cetto et al., 2018), structuredinformation is extracted from the unstructured text,in the form of a set of inter-linked subject-predicate-object triples, thus yielding a graph. With coref-erence resolution, the graph is further enriched se-mantically, linking nodes that refer to the same

Table 1: Library of pre-defined Generators

Symbol DescriptiongCR Coreference Resolution generatorgELX Entity Linker to the resource X

gNER Named Entity RecognizergOIE Open Information Extractorgπ provider of an embedding function

pi

entity in the text. In a final entity linking step,recognised entities are connected to existing re-sources, contextualising them further regarding ex-isting background knowledge.

The extracted knowledge graph is then serialisedand indexed, while still retaining its logical graphrepresentation. In particular, we use full-text searchcapable databases and nearest neighbour indicesto enable querying and approximation of storeddata using string-based as well as embedding-basedmethods.

The API layer features a chainable IDSL to allowintuitive interaction with the data. Concretely itis designed to support expressive recurring querypatterns while reducing impedance mismatch.

6 Usage

6.1 Implementation

DBee was conceptualised and implemented as anextensible Python library. We use the HOCON2

format to enable for easy generator chain defini-tion and persistence. We provide a pre-definedchain featuring the contextualised open informa-tion extraction tool Graphene (Cetto et al., 2018),the Stanford CoreNLP coreference resolution sys-tem (Manning et al., 2014) and the entity linkerSpotlight (Mendes et al., 2011) that links recog-nised entities to DBpedia resources. Furthermore,we use ElasticSearch3 as the full-text search engineand Annoy4 to index the embeddings for the kNNqueries.

Note, that following the design goals of exten-sibility and scalability, the software is not con-strained to use those specific tools. Even the choiceof generators and storage types is not fixed, as it re-quires low effort to add a new generator or storagetype, such as a relational database to perform joinsmore efficiently, for instance.

6.2 DBee in Code

Listed below are example instantiations illustrat-ing the usage of the framework exercised on fourexemplar use cases.

6.2.1 ExtractionCode 1 shows the boilerplate code required to in-stantiate DBee. From a list of Wikipedia articletitles one can query the Wikipedia API and apply

2https://github.com/chimpler/pyhocon3https://www.elastic.co4https://github.com/spotify/annoy

181

Figure 4: Example output of a DBee fact query.

TFin KGFinRDF−NL(1)

(2) (2)

(3)

(4)library

V SGNw2v

library

V SWSJELMo

TGN TWSJ

πRDF -NL→w2v πRDF -NL→ELMo πRDF -NL→tf -idf

V SFinw2v V SFin

ELMo

DF⊂FinR EI

FinELMo EI

Finw2v TI

Fintf -idf

qdistsim qapprox qFTS

./ qσ

cRDF -NL = gCR gEL gOIE

πELMo→R

Figure 5: Conceptual overview of the code snippets. From unstructured text, using a chain of generators (1) aknowledge graph is extracted and stored (2) in different indices, which are later queried jointly by the correspond-ing query types they offer (3) or used to create a data set (4) to train a neural network.

the selected generator chain. The snippet further

main = DBee()docs = main.get_pipeline("extended").assemble().load("wikipedia-nasdaq-100.txt")

'()': semantic.OIEGeneratorcoref_provider: class: graphene.FromRESTProviderserver_address = localhost

Code Snippet 1: Extraction Example

highlights the definition of a generator as one stepof the chain, using HOCON syntax, with semanticssimilar python’s logging module configuration.

6.2.2 Index CreationFrom the KG extracted in the previous step, thestorage indices can be populated. The DSL-level user does not necessarily need to know

kb('nasdaq-100').using(docs.get_iterator('fact')).create_view('fact')

kb('nasdaq-100').view('spo').create_projection('po',pi=IndraEmbedder,features=['predicate', 'object'])

Code Snippet 2: Example of data Storage

the actual name of the data view (’fact’ inthis case) but can obtain it by querying theclass of the corresponding generator type (i.e.OIEGenerator.provides). Similarly, addi-tional indices can be constructed and stored fromthe representation generated by an already definedchain or indices - as shown in the example, by util-ising one of the pre-trained embedding generatorsprovided by DBee.

182

6.2.3 QueryingCode 3 shows the query equivalent to the naturallanguage query Which companies have offices inChina?. The query describes the process of filter-ing the list of initial entities to retain only those ofthe type ”company”, switch the data view to facts(performing a join implicitly), further filtering outfacts, and finally projecting the remaining entitiesinto the previously created vector space and rank-ing them by distance to the computed projectionof a given term. An implicit join back to the tex-tual view is made to retrieve the subjects of there-ordered remaining facts.

kb('nasdaq-100')view('entity').filter(type='dbo:Company').view('fact').filter(label=Spatial, context='China').view('spo').projection('po').rank(wrt="have offices").get('subject')

Code Snippet 3: DSL Querying example

Note that the query uses alreadyresolved filter predicates for brevity,one could likewise use the operationview(’types’).query(’companies’)to query for the concrete type URI using theexpression obtained from the text - given the viewwas constructed beforehand.

6.2.4 ML Integration

kb('nasdaq-100')view('fact').as_classification_dataset(features=bow(["spo", "context.spo"])labels=onehot("context.label")

)

Code Snippet 4: Dataset Creation Example

Finally, the example in Code 4 shows the cre-ation of a toy dataset for link type prediction be-tween two interlinked facts. The user-defined de-fined bow and onehot functions serve as featureextractors.

7 Discussion

7.1 Analysis of TFin

While this is not meant to be a thorough empiricalanalysis, the following section gives insights into

100 101 102 103 104 105 10620

25

210

215

220

225

# of facts

Inde

xsi

ze(k

b)

AnnoyElastic

Figure 6: Index sizes for Annoy and ElasticSearch in-dices for varying number of facts

100 101 102 103 104 105 106

10−2

10−1

100

101

102

103

# of facts

Inde

xing

time

(s)

AnnoyElastic

Figure 7: Indexing times for Annoy and ElasticSearchindices

the performance of the system regarding time andspace requirements.

All of the following measurements were carriedout on a notebook featuring an SSD, a dual-corei5-6200U CPU performing at 2.3GHz and 16 GBof RAM.

We provide the vector space indexing times andsizes for a varying number of indexed vectors,averaged over ten runs. In particular, we index10n, n ∈ 0..6 embedding vectors using our vec-tor space storage implementation based on Annoy.The dimension of the embeddings is 300. For full-

183

text search indices, we performed the same proce-dure. We populated ElasticSearch indices with10n, n ∈ 0..6 facts, denormalised according tothe data model, using a single local node. Fig-ures 6 and 7 shows the result, revealing that index-ing times and index sizes scale linearly with thesize of the dataset. Axes within the plots are atlogarithmic scale.

The creation of a small dataset from 100Wikipedia articles yielded 2292 recognised entities,22633 distinct subject-predicate-object structuresand 39456 contextual links. The total requiredstorage space was 8.159 MB for the denormalisedtextual data stored in ElasticSearch and 148.37 MBfor the stored 300 dimensional word2vec em-beddings. Running all steps in sequence - fromobtaining the documents up to storing them in co-responding indices - took approximately 3.5 hours.

It is worth noting that the open information steptakes up most of the processing time. However,since by design the extraction process does notrequire to explore any dependencies between dif-ferent documents, a speedup factor of up to n canbe assumed for n parallel instantiations of the ex-traction pipeline.

7.2 Current Limitations & Future WorkIn its current version, the software does not supportdata insertion or updates due to an implementationdetail of choosing the annoy implementation fornearest neighbour approximation, in favour of itsspeed. There are, however, recent approaches fornearest neighbour estimation that support dynamicindex updates (Li and Malik, 2017).

Furthermore, the current approach requires dif-ferent tools to store different data representationtypes such as views and projections. One futuredirection is to investigate how to build low-levelvector space index support into existing DBMS.

Finally, a rigorous analysis regarding scalabil-ity, complexity, performance and usability will becarried out in the future.

7.3 ConclusionIn this paper, we formalised an approach to createand manage knowledge graphs and embeddingsand to query them jointly and introduced DBee, asystem implementing this approach. We hope toprovide the community with a tool that facilitatesthe management, storage and querying of latentand explicit text representation facilitating its inte-gration to downstreal ML/AI applications.

Acknowledgements

The authors would like to express their gratitudetowards members of the AI Systems lab at the Uni-versity of Manchester for many fruitful discussions.

ReferencesMartın Abadi, Paul Barham, Jianmin Chen, Zhifeng

Chen, Andy Davis, Jeffrey Dean, Matthieu Devin,Sanjay Ghemawat, Geoffrey Irving, Michael Isard,et al. 2016. Tensorflow: a system for large-scale ma-chine learning.. In OSDI, Vol. 16. 265–283.

Soren Auer, Christian Bizer, Georgi Kobilarov, JensLehmann, Richard Cyganiak, and Zachary Ives.2007. Dbpedia: A nucleus for a web of open data.In The semantic web. Springer, 722–735.

Kurt Bollacker, Colin Evans, Praveen Paritosh, TimSturge, and Jamie Taylor. 2008. Freebase: a col-laboratively created graph database for structuringhuman knowledge. In Proceedings of the 2008 ACMSIGMOD international conference on Managementof data. AcM, 1247–1250.

Matthias Cetto, Christina Niklaus, Andre Freitas,and Siegfried Handschuh. 2018. Graphene:Semantically-Linked Propositions in Open Informa-tion Extraction. In Proceedings of the 27th Inter-national Conference on Computational Linguistics.Association for Computational Linguistics, 2300–2311.

Moses S Charikar. 2002. Similarity estimation tech-niques from rounding algorithms. In Proceedings ofthe thiry-fourth annual ACM symposium on Theoryof computing. ACM, 380–388.

Francois Chollet et al. 2015. Keras. (2015).

Andre Freitas. 2015. Schema-agnositc queries overlarge-schema databases: a distributional semanticsapproach. Ph.D. Dissertation. Digital Enterprise Re-search Institute (DERI), National University of Ire-land, Galway.

Andre Freitas and Edward Curry. 2014. Naturallanguage queries over heterogeneous linked datagraphs: a distributional-compositional semantics ap-proach. In 19th International Conference on Intelli-gent User Interfaces, IUI 2014, Haifa, Israel, Febru-ary 24-27, 2014. 279–288.

Andre Freitas, Joao Carlos Pereira da Silva, EdwardCurry, and Paul Buitelaar. 2014. A Distributional Se-mantics Approach for Selective Reasoning on Com-monsense Graph Knowledge Bases. In Natural Lan-guage Processing and Information Systems - 19thInternational Conference on Applications of Natu-ral Language to Information Systems, NLDB 2014,Montpellier, France, June 18-20, 2014. Proceedings.21–32.

184

Andre Freitas, Joao C. P. da Silva, Sean ORiain, andEdward Curry. 2013. Distributional Relational Net-works. In AAAI 2013 Fall Symposium on Semanticsfor Big Data.

Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsu-ruoka, and Richard Socher. 2016. A joint many-taskmodel: Growing a neural network for multiple NLPtasks. arXiv preprint arXiv:1611.01587 (2016).

Haifeng Jin, Qingquan Song, and Xia Hu. 2018. Effi-cient Neural Architecture Search with Network Mor-phism. arXiv preprint arXiv:1806.10282 (2018).

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer,James Bradbury, Ishaan Gulrajani, Victor Zhong,Romain Paulus, and Richard Socher. 2016. Ask meanything: Dynamic memory networks for naturallanguage processing. In International Conference onMachine Learning. 1378–1387.

Ora Lassila and Ralph R Swick. 1999. Resource de-scription framework (RDF) model and syntax speci-fication. (1999).

Ke Li and Jitendra Malik. 2017. Fast k-nearest neigh-bour search via prioritized dci. arXiv preprintarXiv:1703.00440 (2017).

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D For-bus, and Ni Lao. 2016. Neural symbolic machines:Learning semantic parsers on freebase with weak su-pervision. arXiv preprint arXiv:1611.00020 (2016).

Christopher Manning, Mihai Surdeanu, John Bauer,Jenny Finkel, Steven Bethard, and David McClosky.2014. The Stanford CoreNLP natural language pro-cessing toolkit. In Proceedings of 52nd annual meet-ing of the association for computational linguistics:system demonstrations. 55–60.

Pablo N Mendes, Max Jakob, Andres Garcıa-Silva, andChristian Bizer. 2011. DBpedia spotlight: sheddinglight on the web of documents. In Proceedings of the7th international conference on semantic systems.ACM, 1–8.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-rado, and Jeff Dean. 2013. Distributed representa-tions of words and phrases and their composition-ality. In Advances in neural information processingsystems. 3111–3119.

Christina Niklaus, Bernhard Bermeitinger, SiegfriedHandschuh, and Andr Freitas. 2017. A SentenceSimplification System for Improving Relation Ex-traction. (2017). arXiv:cs.CL/1703.09013

Christina Niklaus, Matthias Cetto, Andre Freitas, andSiegfried Handschuh. 2018. A Survey on OpenInformation Extraction. In Proceedings of the 27thInternational Conference on Computational Lin-guistics. Association for Computational Linguistics,Santa Fe, New Mexico, USA, 3866–3878.

Christina Niklaus, Matthias Cetto, Andre Freitas, andSiegfried Handschuh. 2019. Transforming ComplexSentences into a Semantic Hierarchy. In Proceed-ings of the 57th Annual Meeting of the Associationfor Computational Linguistics. Association for Com-putational Linguistics, Florence, Italy, 3415–3427.

Emilio Parisotto, Abdel-rahman Mohamed, RishabhSingh, Lihong Li, Dengyong Zhou, and PushmeetKohli. 2016. Neuro-symbolic program synthesis.arXiv preprint arXiv:1611.01855 (2016).

Adam Paszke, Sam Gross, Soumith Chintala, GregoryChanan, Edward Yang, Zachary DeVito, ZemingLin, Alban Desmaison, Luca Antiga, and AdamLerer. 2017. Automatic differentiation in pytorch.(2017).

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, andPercy Liang. 2016. Squad: 100,000+ questionsfor machine comprehension of text. arXiv preprintarXiv:1606.05250 (2016).

Juliano Efson Sales, Andre Freitas, Brian Davis, andSiegfried Handschuh. 2016. A Compositional-Distributional Semantic Model for Searching Com-plex Entity Categories. In Proceedings of the FifthJoint Conference on Lexical and Computational Se-mantics. Association for Computational Linguistics,Berlin, Germany, 199–208.

Juliano Efson Sales, Leonardo Souza, Siamak Barze-gar, Brian Davis, Andre Freitas, and SiegfriedHandschuh. 2018. Indra: A Word Embeddingand Semantic Relatedness Server. In Proceedingsof the Eleventh International Conference on Lan-guage Resources and Evaluation (LREC 2018). Eu-ropean Language Resources Association (ELRA),Miyazaki, Japan.

Vivian Dos Santos Silva, Siegfried Handschuh, andAndre Freitas. 2018. Recognizing and JustifyingText Entailment Through Distributional Navigationon Definition Graphs.. In AAAI.

Qizhe Xie, Xuezhe Ma, Zihang Dai, and Eduard Hovy.2017. An interpretable knowledge transfer modelfor knowledge base completion. arXiv preprintarXiv:1704.05908 (2017).

185


A Constituency Parsing Tree based Method for Relation Extraction fromAbstracts of Scholarly Publications

Ming Jiang, Jana DiesnerUniversity of Illinois at Urbana-Champaignmjiang17,[email protected]

Abstract

We present a simple, rule-based method forextracting entity networks from the abstractsof scientific literature. By taking advantageof selected syntactic features of constituentparsing trees, our method automatically ex-tracts and constructs graphs in which nodesrepresent text-based entities (in this case, nounphrases) and their relationships (in this case,verb phrases or preposition phrases). We usetwo benchmark datasets for evaluation andcompare with previously presented results forthese data . Our evaluation results show thatthe proposed method leads to accuracy ratesthat are comparable to or exceed the resultsachieved with state-of-the-art, learning-basedmethods in several cases.

1 Introduction

As a public and formal record of original contribu-tions to knowledge, scientific literature is a criticalresource that promotes the progress of science andtechnology in society. To help researchers to com-prehend the large and growing amount of scien-tific literature, automated methods can be used toextract and organize information from corpora ofpublications, e.g., in terms of scientific key con-cepts and their relationships. Leveraging priorwork that has achieved high accuracy for entityrecognition (Lample et al., 2016; Habibi et al.,2017), in this paper, we focus on identifying ex-tracting relationships between entities.

Overall, prior studies consider the task of rela-tion extraction from two perspectives: 1) identi-fying if a relationship exists between a given pairof identified entities (Gabor et al., 2018), whichis also the goal with this paper, and 2) further la-beling or classifying the identified relationships(Luan et al., 2018a; Mintz et al., 2009).

Prior work has used different methods for re-lation extraction. Rule-based algorithms primar-ily rely on lexical patterns, such as word co-occurrences (Jenssen et al., 2001) and depen-dency templates (Fundel et al., 2006; Kilicogluand Bergler, 2009; Romano et al., 2006). Su-pervised learning-based methods mainly use ei-ther feature engineering (Kambhatla, 2004; Chanand Roth, 2011) or kernel functions (Culotta andSorensen, 2004; Zelenko et al., 2003; Bunescuand Mooney, 2005). Using supervised learningto detect the existence (and type) of relationshipsbetween concepts in scholarly publications mayfurther require domain expertise for annotation.Accounting for the fact that humans tend to usetheir background knowledge to identify relations,Chan and Roth (2010) showed that using externalknowledge, such as Wikipedia, improves the accu-racy of relation extraction. Recently, deep learn-ing models have also been used for supervisedlearning (Luan et al., 2018a, 2019). For example,Luan et al. (2019) developed a dynamic span graphframework based on sentence-level BiLSTM formulti-task information extraction. Since data an-notation by humans is expensive, semi-supervisedlearning methods, such as the snowball system(Agichtein and Gravano, 2000), as well as un-supervised learning, such as clustering methods(Daelemans and Van den Bosch, 2005; Davidovand Rappoport, 2008), have also been explored inprior studies. Though remarkable contributionshave been made, one remaining limitation withprior machine-learning based work is that theseapproaches may involve time and related costs forparameter optimization and labeling. Moreover,data-driven training methods may be limited bythe domain specificity of learned models. Finally,algorithms trained on deep learning models lackinterpretability.

To address the above-mentioned limitations, we

186

Eachcomponentusesadecisiontreearchitecturetocombinemul0pletypesofevidenceabouttheunknownword.

Eachcomponent

NP VP

uses adecisiontreearchitecture

VBZ NP S

VP

to

TO VP

combine

VB NP

mul0pletypes

NP PP

IN

of evidence

NP

PP

IN

about theunknownword

NP

Uncle/Ancestor’sSibling

Sibling

⇒ (Eachcomponent,uses,adecisiontreearchitecture)⇒ (adecisiontreearchitecture,tocombine,mul:pletypesofevidence)⇒ (mul:pletypesofevidence,about,theunknownword)

component A00-1024.9decisiontreearchitecture A00-1024.10… …

⇒ (A00-1024.9,A00-1024.10)⇒ (A00-1024.10,A00-1024.11)⇒ (A00-1024.11,A00-1024.12)

Eachcomponentusesadecisiontreearchitecturetocombinemul:pletypesofevidenceabouttheunknownword.

Figure 1: An illustrative example of building a constituency-based concept network (CTN) for a sentence.

propose a rule-based concept network construc-tion method. In the resulting networks, nodesrepresent annotated entities per document, andedges are constructed based on constituency pars-ing. Our approach is motivated by the observationthat scientific literature typically use formal lan-guage with clear syntactic structure and compar-atively fixed word order (e.g., term phrases). In-spired by prior work that defines a relationship aseither an interaction or an association between twoentities (Jurafsky, 2000), we take advantage of thestructured information provided by constituencyparse trees built for each sentence. More specif-ically, We capture two types of tuples: 1) (nounphrase, verb-based connection, noun phrase), and2) (noun phrase, preposition-based connection,noun phrase). To avoid over-fitting on the givendomain, our rules are generated on the basis of acontext-free grammar.

We evaluate our method against two benchmarkdatasets that have been previously labeled for en-tities and relations. Our experimental results showthat the proposed constituency-based concept net-works achieve comparable accuracy to or can evenoutperform state-of-the-art, learning-based meth-ods for identifying entity networks. We find thatrelationships of the type used-for and part-of arebetter captured by our approach than other typesof relationships. Finally, we describe differencesbetween domain-level concept networks.

2 Method

The construction of constituency-based conceptnetwork (CTN) has three stages: 1) data pre-prossessing, 2) the identification of nodes (weused entities given in annotated data), and of edgesbased on a constituency parsing tree, and 3) map-

ping entities from the constituency parsing tree tolabeled entities. Figure 1 provides an illustrativeexample of the process of building a CTN for asentence.

Data Pre-processing In this stage, we focus ontwo aspects. First, we segment each documentinto a set of sentences that are the input to con-stituency parsing. Since scientific texts may havesome long sentences with complex sentence struc-tures, we further segment sentences by using regu-lar expressions. Second, we identify the annotatedentities per sentence by extracting the labeled en-tity id and corresponding entity phrase, as well asthe entities’ index in the sentence.

Node and Edge Identification To generate aparsing tree for each sentence, we use the Al-lenNLP constituency parsing (Joshi et al., 2018)As shown in Figure 1, we extract noun phrasesat the lowest layer (i.e., children are noun-basedend-nodes) as candidates of CTN nodes. We madethis decision to capture unique and specific (to thesub-field of science) noun phrases from scientificliterature. For linking nodes, we capture keywordsthat occur between two adjacent node candidatesin the parse tree. According to the parse tree struc-ture, these keywords usually occur in two typesof positions: 1) the sibling of a CTN node candi-date, and 2) the uncle or the ancestor’s sibling of apotential CTN node. Based on node candidatesand connecting keywords, we generate an edgeif two node candidates are linked by a connect-ing keyword. One specific issue in this process isthe of-phrase: according to our empirical observa-tion, we believe that of-phrases (e.g., “computa-tional model of discourse”) often represent a sin-gle concept. Based on this assumption, we merge

187

SCIERC SemEval18

#entities 8089 7482#relations 4716 1595#relations/doc 9.4 3.2cross-sentence relations yes no

Table 1: Data statistics.

edge candidates connected by the keyword “of” asa single CTN node, and remove the original edges.

Entity Mapping We remove nodes that hadbeen identified by the constituency parsing treeprocess described above, but are not labeled as en-tities in the ground truth data. As shown in Fig-ure 1, the final output of CTN is a set of nodeswith ids, and edges.

3 Experiments

3.1 Experimental SetupData We perform experiments on two publiclyavailable datasets where humans annotated scien-tific entities and relations from abstracts of scien-tific publication. Table 1 provides a brief sum-mary of both datasets. SemEval18 Dataset has500 abstracts prepared by Gabor et al. (2018) forthe shared task 7 (subtask 2) of SemEval 2018. Allabstracts in this dataset are from published papersin the field of computational linguistics. The an-notated relations are divided into six types of se-mantic relationships between scientific concepts.The SCIERC Dataset was provided by Luan et al.(2018a). This dataset has 500 scientific abstractsfrom 12 AI conference/workshop proceedings thatcover five research areas: 1) artificial intelligence(AI), 2) natural language processing (NLP), 3)speech, 4) machine learning (ML), and 5) com-puter vision (CV).

Baselines For the SemEval data, we comparedthe results from our network construction method(CTN) with the official baseline, which was gen-erated by using a memory-based k-nearest neigh-bor (k-nn) search (Gabor et al., 2018). The topthree reported submissions in the SemEval leader-board were: UWNLP (Luan et al., 2018b), ETH-DS3Labl (Rotsztejn et al., 2018), and SIRIUS-LTG-UiO (Nooralahzadeh et al., 2018). ForSCIERC, we compared our method with two state-of-the-art (SOTA) systems: SciIE (Luan et al.,2018a) and DyGIE (Luan et al., 2019). The origi-nal outputs from SciIE and DyGIE also identified

Dev Test

P R F1 P R F1

SciIE 62.0 47.7 53.9 66.4 46.7 54.9DyGIE 55.0 48.6 51.6 63.3 52.2 57.2

CTN (ours) 73.4 47.3 57.5 75.4 46.5 57.5

Table 2: Comparison with previous methods for rela-tion extraction on SCIERC dataset.

and categorized the type and direction of relations,and the boundary of entities. Since we do not iden-tify these elements, we also do not consider themfor comparison.

3.2 ResultsExtraction Performance Table 2 and Table 3compare our concept network CTN with baselinesfor SCIERC and SemEval18, respectively. Com-pared to SOTA on SCIERC, our approach outper-formed the best previously reported results on boththe development and testing data. Specifically,CTN achieved a noticeable improvement in preci-sion (we achieved∼74%) compared to prior meth-ods, which benefits our F1 value. Further com-paring each method’s performance on the devel-opment data versus testing data, we observe thatour rule-based CTN produces more stable or con-sistent results than the considered, prior, learning-based methods.

For SemEval18, we find that CTN outper-formed the baseline by ∼15% (our F1 score was41.6%), and the third best reported prior result, butcould not come close to the top two prior results.From the presented results, we conclude that ourrule-based concept network approach can serve asa strong baseline method for identifying relatedentity pairs in scientific texts.

Ablation Study In order to explore the isolatedcontribution of each rule considered for addingedges in the CTN, we conduct an ablation analysis

P R F1

UWNLP - - 50.0ETH-DS3Lab - - 48.8SIRIUS-LTG-UiO - - 37.4SemEval Baseline - - 26.8

CTN (ours) 33.6 54.8 41.6

Table 3: Comparison with previous methods for rela-tion extraction on SemEval18 dataset.

188

Dev Test

P R F1 P R F1

CTN (ours) 73.4 47.3 57.5 75.4 46.5 57.5

− long sentence segmentation 73.4 47.3 57.5 75.4 46.5 57.5− of-phrase (merge+remove) 75.1 44.4 55.8 76.0 43.8 55.6− of-phrase (merge) 74.6 40.7 52.6 76.6 41.3 53.6− entity id positioning 64.0 31.2 42.0 72.9 32.6 45.1− all above 64.1 29.0 39.9 73.1 31.0 43.5

Table 4: Ablation study of isolated contribution of eachrule.

SCIERC SemEval18

TP% Total TP% Total

USED-FOR 57.22% 533 PART WHOLE 69.51% 82PART-OF 50.79% 63 USAGE 59.77% 174FEATURE-OF 49.15% 59 RESULT 50.00% 16EVALUATE-FOR 43.96% 91 COMPARE 36.83% 19HYPONYM-OF 40.30% 67 MODEL-FEATURE 34.25% 73COMPARE 36.84% 38 TOPIC 0.00% 3CONJUNCTION 4.88% 123 - - -

Table 5: Accuracy (TP = true positives) per relationshiptype on testing data.

by removing each rule from the network construc-tion process and measuring the overlap betweenpredicted connections and the ground truth. Ta-ble 4 shows the results. We observe that the rulefor mapping entity ids for a potentially connectedentity pair has the highest isolated impact. By fur-ther looking into the ground truth data, we find thatan entity phrase can be labeled by multiple differ-ent ids, e.g., when the related phrase repeatedlyappears in a document. Therefore, without a map-ping step, CTN would be prone to considering andlinking such entities as different nodes. Comingback to of-phrases, we find that not consideringedges between the preposition “of” (i.e., (node 1,“of”, node 2)) leads to a decrease in recall, whichfurther indicates our initial recommendation that

of phrases should be merged to represent a single(scientific) concept.

Relation Type Sensitivity Table 5 shows theability of CTN to identify each type of relation-ship that is labeled in the ground truth data. CTNdoes not actually predict relationship type, weonly retroactively compute the accuracy rate perlink type. We observe that usage (used-for) andpart whole (part-of) are identified with the highestaccuracy. Note that these two categories are alsothe most frequently occurring ones in the groundtruth data. On the other hand, we find that rela-tionship type of conjunction and topic associationresult in the lowest accuracy. This might be be-cause we do not consider conjunction words askeywords that indicate relationships. To under-stand the comparatively low performance for topicrelationships, we further looked into the context ofthe actual entity pairs. Doing so, we found that allthree instance of this link type were expressed byan of-phrase, where we consider the whole phraseas a single concept. For example, in the expression“qualitative analysis of results”, the entity “quali-tative analysis” is annotated with a topic relation-ship with the entity “results”.

Network Analysis To understand the character-istics of the network data that were constructedwith the proposed method in more depth, we fur-ther built a corpus-level CTN for all texts per do-main in the SCIERC testing data. Figure 3 showstwo illustrative examples of the networks for theCV and ML domain, respectively. The nodename represents the extracted entity phrases, nodesize represents the weighted degree centrality, andnode colors denote membership in components.

Figure 2: CTN of abstracts from the CV domain. Figure 3: CTN of abstracts from the ML domain.

189

We find that the most central (in terms of degree)nodes from the CV corpus mainly represent gen-eral scientific concepts, such as “method”, “algo-rithm” and “approach”, while in the ML corpus,key nodes represent domain-specific terms such as“robust PCA” and “side information”. Compara-tively, the size of the CTN from the CV domain islarger than that from the ML domain, which mightbe due to different numbers of abstracts in eachdomain.

4 Conclusions

In this paper, we have proposed and evaluateda rule-based network construction method thatleverages constituency parsing to extract rela-tions between entities in scientific texts. Ourmethod does not require machine learning or do-main knowledge. Experiments on two bench-mark datasets show that the proposed CTNachieve comparable performance with state-of-the-art learning-based methods in multiple cases.Even though our method could not outperformthe two best performing systems built for one ofthe considered datasets, our results suggest thatthe demonstrated approach can work as a base-line method for relation extraction. In addition,we find that entities with a relationship of used-for and part-of are more likely to be connectedin our network. Based on the corpus-level CTN,we further saw that key nodes in the networksbased on the CV corpus are mainly general scien-tific terms, while for the abstracts from the ML do-main, key nodes represent domain-specific terms.To improve the construction of CTN in the future,we plan to consider cross-sentence link formationand link label detection. Finally, the rules used tobuild CTN can further support the development oflearning-based algorithms for relation extraction.

Acknowledgments

This material is based upon work supported bythe U.S. Department of Homeland Security underGrant Award Number, 2015-ST-061-CIRC01. Theviews and conclusions contained in this documentare those of the authors and should not be inter-preted as necessarily representing the official poli-cies, either expressed or implied, of the U.S. De-partment of Homeland Security.

ReferencesEugene Agichtein and Luis Gravano. 2000. Snow-

ball: Extracting relations from large plain-text col-lections. In Proceedings of the fifth ACM Confer-ence on Digital Libraries, pages 85–94. ACM.

Razvan C Bunescu and Raymond J Mooney. 2005.A shortest path dependency kernel for relation ex-traction. In Proceedings of the Conference on Hu-man Language Technology and Empirical Methodsin Natural Language Processing, pages 724–731.Association for Computational Linguistics.

Yee Seng Chan and Dan Roth. 2010. Exploiting back-ground knowledge for relation extraction. In Pro-ceedings of the 23rd International Conference onComputational Linguistics, pages 152–160. Associ-ation for Computational Linguistics.

Yee Seng Chan and Dan Roth. 2011. Exploitingsyntactico-semantic structures for relation extrac-tion. In Proceedings of the 49th Annual Meeting ofthe Association for Computational Linguistics: Hu-man Language Technologies-Volume 1, pages 551–560. Association for Computational Linguistics.

Aron Culotta and Jeffrey Sorensen. 2004. Dependencytree kernels for relation extraction. In Proceed-ings of the 42nd Annual Meeting of the Associa-tion for Computational Linguistics, pages 423–429,Barcelona, Spain.

Walter Daelemans and Antal Van den Bosch. 2005.Memory-based Language Processing. CambridgeUniversity Press.

Dmitry Davidov and Ari Rappoport. 2008. Classifi-cation of semantic relationships between nominalsusing pattern clusters. In Proceedings of ACL-08:HLT, pages 227–235, Columbus, Ohio. Associationfor Computational Linguistics.

Katrin Fundel, Robert Kuffner, and Ralf Zimmer.2006. Relexrelation extraction using dependencyparse trees. Bioinformatics, 23(3):365–371.

Kata Gabor, Davide Buscaldi, Anne-Kathrin Schu-mann, Behrang QasemiZadeh, Haifa Zargayouna,and Thierry Charnois. 2018. Semeval-2018 task 7:Semantic relation extraction and classification in sci-entific papers. In Proceedings of The 12th Inter-national Workshop on Semantic Evaluation, pages679–688.

Maryam Habibi, Leon Weber, Mariana Neves,David Luis Wiegandt, and Ulf Leser. 2017. Deeplearning with word embeddings improves biomed-ical named entity recognition. Bioinformatics,33(14):i37–i48.

Tor-Kristian Jenssen, Astrid Lægreid, Jan Ko-morowski, and Eivind Hovig. 2001. A literature net-work of human genes for high-throughput analysisof gene expression. Nature Genetics, 28(1):21.

190

Vidur Joshi, Matthew Peters, and Mark Hopkins. 2018.Extending a parser to distant domains using a fewdozen partially annotated examples. In Proceed-ings of the 56th Annual Meeting of the Associationfor Computational Linguistics (Volume 1: Long Pa-pers), pages 1190–1199, Melbourne, Australia. As-sociation for Computational Linguistics.

Dan Jurafsky. 2000. Speech & language processing.Pearson Education India.

Nanda Kambhatla. 2004. Combining lexical, syntactic,and semantic features with maximum entropy mod-els for information extraction. In Proceedings of theACL Interactive Poster and Demonstration Sessions,pages 178–181, Barcelona, Spain. Association forComputational Linguistics.

Halil Kilicoglu and Sabine Bergler. 2009. Syntacticdependency based heuristics for biological event ex-traction. In Proceedings of the BioNLP 2009 Work-shop Companion Volume for Shared Task, pages119–127.

Guillaume Lample, Miguel Ballesteros, Sandeep Sub-ramanian, Kazuya Kawakami, and Chris Dyer. 2016.Neural architectures for named entity recognition.In Proceedings of the 2016 Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics: Human Language Technologies,pages 260–270, San Diego, California. Associationfor Computational Linguistics.

Yi Luan, Luheng He, Mari Ostendorf, and HannanehHajishirzi. 2018a. Multi-task identification of enti-ties, relations, and coreference for scientific knowl-edge graph construction. In Proceedings of the 2018Conference on Empirical Methods in Natural Lan-guage Processing, pages 3219–3232, Brussels, Bel-gium. Association for Computational Linguistics.

Yi Luan, Mari Ostendorf, and Hannaneh Hajishirzi.2018b. The uwnlp system at semeval-2018 task 7:Neural relation extraction model with selectively in-corporated concept embeddings. In Proceedings ofThe 12th International Workshop on Semantic Eval-uation, pages 788–792.

Yi Luan, Dave Wadden, Luheng He, Amy Shah, MariOstendorf, and Hannaneh Hajishirzi. 2019. A gen-eral framework for information extraction using dy-namic span graphs. In Proceedings of the 2019 Con-ference of the North American Chapter of the Asso-ciation for Computational Linguistics: Human Lan-guage Technologies, Volume 1 (Long and Short Pa-pers), pages 3036–3046, Minneapolis, Minnesota.Association for Computational Linguistics.

Mike Mintz, Steven Bills, Rion Snow, and Dan Ju-rafsky. 2009. Distant supervision for relation ex-traction without labeled data. In Proceedings ofthe Joint Conference of the 47th Annual Meeting ofthe ACL and the 4th International Joint Conferenceon Natural Language: Volume 2-Volume 2, pages1003–1011. Association for Computational Linguis-tics.

Farhad Nooralahzadeh, Lilja Øvrelid, and Jan ToreLønning. 2018. Sirius-ltg-uio at semeval-2018 task7: Convolutional neural networks with shortest de-pendency paths for semantic relation extraction andclassification in scientific papers. In Proceedings ofThe 12th International Workshop on Semantic Eval-uation, pages 805–810.

Lorenza Romano, Milen Kouylekov, Idan Szpektor,Ido Dagan, and Alberto Lavelli. 2006. Investigatinga generic paraphrase-based approach for relation ex-traction. In Proceedings of the 11th Conference ofthe European Chapter of the Association for Com-putational Linguistics.

Jonathan Rotsztejn, Nora Hollenstein, and Ce Zhang.2018. Eth-ds3lab at semeval-2018 task 7: Effec-tively combining recurrent and convolutional neuralnetworks for relation classification and extraction.In Proceedings of The 12th International Workshopon Semantic Evaluation, pages 689–696.

Dmitry Zelenko, Chinatsu Aone, and AnthonyRichardella. 2003. Kernel methods for relation ex-traction. Journal of Machine Learning Research,3(Feb):1083–1106.

191

Author Index

Alt, Christoph, 58Ammanabrolu, Prithviraj, 1Amplayo, Reinald Kim, 124Andrews, Martin, 85Auer, Sören, 90

Banerjee, Pratyay, 78Beck, Daniel, 26

Campbell, William, 151Chang, Maria, 159Chen, Chen, 52Chen, Jie, 159Chia, Yew Ken, 85Cho, Eunah, 151Cohn, Trevor, 26

Daix-Moreux, Pierre, 118Das, Rajarshi, 101Dhuliawala, Shehzaad, 101Diesner, Jana, 186D’Souza, Jennifer, 90

Freitas, André, 42, 177

Gallé, Matthias, 118Godbole, Ameya, 101Golshan, Behzad, 52

Haffari, Gholamreza, 26Harbecke, David, 58Hennig, Leonhard, 58Hovy, Eduard, 134Hübner, Marc, 58Huo, Siyu, 159Hwang, Seung-won, 124

Jansen, Peter, 63Jiang, Meng, 140Jiang, Ming, 186

Khalife, Sammy, 17

Last, Mark, 32

Ma, Danni, 52

Ma, Tengfei, 159Ma, Xiaochun, 151Mandyam, Raghuram, 134McCallum, Andrew, 101Miura, Yasuhide, 11Mulang’, Isaiah Onando, 90

Nemoto, Keiichi, 11

Ohkuma, Tomoko, 11

Pierleoni, Andrea, 164

Qiu, Zimeng, 151

Riedl, Mark, 1Roberts, Ian, 164

Schlegel, Viktor, 42, 177Schwarzenberg, Robert, 58Shi, Yiyu, 140Song, Min, 124Steedman, Mark, 172Sudhahar, Saatviga, 164Szubert, Ida, 172

Tagawa, Yuki, 11Tan, Wang-Chiew, 52Taniguchi, Motoki, 11Taniguchi, Tomoki, 11Thayaparan, Mokanarangan, 42

Ustalov, Dmitry, 63

Vaibhav, Vaibhav, 134Valentino, Marco, 42Vazirgiannis, Michalis, 17

Witbrock, Michael, 159Witteveen, Sam, 85Wu, Lingfei, 159

Xiong, JinJun, 140

Yamamoto, Takayuki, 11Yu, Mengxia, 140Yu, Wenhao, 140

193

Zaheer, Manzil, 101Zeng, Qingkai, 140Zuckerman, Matan, 32