NAACL HLT 2019 Extraction of Structured Knowledge from Scientific Publications ESSP Proceedings of the Workshop June 6th, 2019 Minneapolis, USA


  • NAACL HLT 2019

    Extraction of Structured Knowledge from Scientific Publications

    ESSP

    Proceedings of the Workshop

    June 6th, 2019
    Minneapolis, USA

  • ©2019 The Association for Computational Linguistics

    Order copies of this and other ACL proceedings from:

    Association for Computational Linguistics (ACL)
    209 N. Eighth Street
    Stroudsburg, PA 18360
    USA
    Tel: +1-570-476-8006
    Fax: [email protected]

    ISBN 978-1-948087-99-5


  • Introduction

    Scientific knowledge is one of the greatest assets of humankind. This knowledge is recorded and disseminated in scientific publications, and the body of scientific literature is growing at an enormous rate. Automatic methods of processing and cataloguing that information are necessary for assisting scientists to navigate this vast amount of information, and for facilitating automated reasoning, discovery and decision making on that data.

    Structured information can be extracted at different levels of granularity. Previous and ongoing work has focused on bibliographic information (segmentation and linking of referenced literature), keyword extraction and categorization (e.g., what are tasks, materials and processes central to a publication), and cataloguing research findings. Scientific discoveries can often be represented as pairwise relationships, e.g., protein-protein, drug-drug, and chemical-disease interactions, or as more complicated networks such as action graphs describing scientific procedures (e.g., synthesis recipes in material sciences). Information extracted with such methods can be enriched with time-stamps and other meta-information, such as indicators of uncertainty or limitations of the discovered facts.

    Structured representations, such as knowledge graphs, summarize information from a variety of sources in a convenient and machine-readable format. Graph representations that link the information of a large body of publications can reveal patterns and lead to the discovery of new information that would not be apparent from the analysis of just one publication, or from extracted isolated pieces of information. This kind of aggregation can lead to new scientific insights, and it can also help to detect trends or find experts for a particular scientific area.

    While various workshops have focused separately on several aspects – extraction of information from scientific articles, building and using knowledge graphs, the analysis of bibliographical information, graph algorithms for text analysis – the aim of the ESSP workshop is to elicit and stimulate work that targets the extraction and aggregation of structured information, and to ultimately lead to finding novel information and scientific discoveries.

    We received 15 submissions, of which we accepted 10: 5 for oral presentation, 4 as posters and one demo. The topics covered the biomedical domain, mathematics, computer science and general science, with approaches focusing on various aspects of extraction, learning, and knowledge processing.

    To complement the accepted papers, we welcomed four invited speakers from industry, state institutions and academia, to provide insights into knowledge requirements and the state of the art in specific fields (medicine, social sciences) and contexts:

    Michael Cafarella
    University of Michigan
    Extraction-Intensive Systems for the Social Sciences

    Dina Demner-Fushman
    National Library of Medicine
    Extracting structured knowledge from biomedical publications

    Hoifung Poon
    Director, Precision Health NLP @ Microsoft
    Machine Reading for Precision Medicine

    Chris Welty
    Google Research
    Just when I thought I was out, they pull me back in – The role of KG in AKBC


  • We thank our authors, speakers and program committee members for helping us assemble an exciting program on this timely topic. We are grateful to our sponsors – BASF SE Ludwigshafen, the Leibniz Science Campus "Empirical Linguistics and Computational Language Modeling" (LiMo), the German Research Foundation (DFG grant RO5127/2-1) – for making such a diverse and speaker-rich program possible.

    Vivi Nastase, Benjamin Roth, Laura Dietz, Andrew McCallum


  • Organizers:

    Vivi Nastase, University of Heidelberg
    Benjamin Roth, Ludwig Maximilian University of Munich
    Laura Dietz, University of New Hampshire
    Andrew McCallum, University of Massachusetts Amherst

    Program Committee:

    Rabah Al-Zaidy, KAUST, Saudi Arabia
    Sergio Baranzini, UCSF
    Ken Barker, IBM
    Chaitan Baru, UCSD
    Chandra Bhagavatula, Allen Institute for AI
    Volha Bryl, Springer Nature
    Trevor Cohen, MBChB
    Anette Frank, University of Heidelberg
    Ingo Frommholz, University of Bedfordshire
    Daniel Garijo, ISI
    Hannaneh Hajishirzi, University of Washington
    Keith Hall, Google
    Marcel Karnstedt Hulpus, Springer Semantic Web
    Bhushan Kotnis, NEC Labs
    Anne Lauscher, Mannheim University
    Yi Luan, University of Washington
    Sebastian Martschat, BASF
    Philipp Mayr-Schlegel, GESIS
    Arunav Mishra, BASF
    Mathias Niepert, NEC Labs
    Adam Roegiest, Kira Systems
    Martin Schmitt, LMU Munich
    Isabel Segura-Bedmar, University Carlos III of Madrid
    Mihai Surdeanu, University of Arizona
    Niket Tandon, Allen Institute for AI
    Karin Verspoor, University of Melbourne
    Gerhard Weikum, MPII Saarbruecken
    Robert West, EPFL
    Guido Zuccon, Queensland University

    Invited Speakers:

    Michael Cafarella, University of Michigan
    Dina Demner-Fushman, National Library of Medicine
    Hoifung Poon, Director, Precision Health NLP @ Microsoft
    Chris Welty, Google AI


  • Table of Contents

    Distantly Supervised Biomedical Knowledge Acquisition via Knowledge Graph Based Attention
    Qin Dai, Naoya Inoue, Paul Reisert, Ryo Takahashi and Kentaro Inui . . . . . . . . . . . . 1

    Scalable, Semi-Supervised Extraction of Structured Information from Scientific Literature
    Kritika Agrawal, Aakash Mittal and Vikram Pudi . . . . . . . . . . . . 11

    Understanding the Polarity of Events in the Biomedical Literature: Deep Learning vs. Linguistically-informed Methods
    Enrique Noriega-Atala, Zhengzhong Liang, John Bachman, Clayton Morrison and Mihai Surdeanu . . . . . . . . . . . . 21

    Dataset Mention Extraction and Classification
    Animesh Prasad, Chenglei Si and Min-Yen Kan . . . . . . . . . . . . 31

    Annotating with Pros and Cons of Technologies in Computer Science Papers
    Hono Shirai, Naoya Inoue, Jun Suzuki and Kentaro Inui . . . . . . . . . . . . 37

    Browsing Health: Information Extraction to Support New Interfaces for Accessing Medical Evidence
    Soham Parikh, Elizabeth Conrad, Oshin Agarwal, Iain Marshall, Byron Wallace and Ani Nenkova . . . . . . . . . . . . 43

    An Analysis of Deep Contextual Word Embeddings and Neural Architectures for Toponym Mention Detection in Scientific Publications
    Matthew Magnusson and Laura Dietz . . . . . . . . . . . . 48

    STAC: Science Toolkit Based on Chinese Idiom Knowledge Graph
    Meiling Wang, Min Xiao, Changliang Li, Yu Guo, Zhixin Zhao and Xiaonan Liu . . . . . . . . . . . . 57

    Playing by the Book: An Interactive Game Approach for Action Graph Extraction from Text
    Ronen Tamari, Hiroyuki Shindo, Dafna Shahaf and Yuji Matsumoto . . . . . . . . . . . . 62

    Textual and Visual Characteristics of Mathematical Expressions in Scholar Documents
    Vidas Daudaravicius . . . . . . . . . . . . 72


  • Workshop Program

    Thursday, June 6, 2019

    9:00–10:30 Session 1

    9:00–9:15 Welcome

    9:15–10:10 INVITED TALK: Machine Reading for Precision Medicine
    Hoifung Poon

    10:10–10:30 Distantly Supervised Biomedical Knowledge Acquisition via Knowledge Graph Based Attention
    Qin Dai, Naoya Inoue, Paul Reisert, Ryo Takahashi and Kentaro Inui

    10:30–11:00 Coffee break

    11:00–12:30 Session 2

    11:00–11:50 INVITED TALK: Extraction-Intensive Systems for the Social Sciences
    Michael Cafarella

    11:50–12:10 Scalable, Semi-Supervised Extraction of Structured Information from Scientific Literature
    Kritika Agrawal, Aakash Mittal and Vikram Pudi

    12:10–12:30 Understanding the Polarity of Events in the Biomedical Literature: Deep Learning vs. Linguistically-informed Methods
    Enrique Noriega-Atala, Zhengzhong Liang, John Bachman, Clayton Morrison and Mihai Surdeanu

    12:30–14:00 Lunch break

    14:00–15:15 Session 3

    14:00–14:50 INVITED TALK: Extracting Structured Knowledge from Biomedical Publications
    Dina Demner-Fushman

    14:50–14:55 Dataset Mention Extraction and Classification
    Animesh Prasad, Chenglei Si and Min-Yen Kan


  • Thursday, June 6, 2019 (continued)

    14:55–15:00 Annotating with Pros and Cons of Technologies in Computer Science Papers
    Hono Shirai, Naoya Inoue, Jun Suzuki and Kentaro Inui

    15:00–15:05 Browsing Health: Information Extraction to Support New Interfaces for Accessing Medical Evidence
    Soham Parikh, Elizabeth Conrad, Oshin Agarwal, Iain Marshall, Byron Wallace and Ani Nenkova

    15:05–15:10 An Analysis of Deep Contextual Word Embeddings and Neural Architectures for Toponym Mention Detection in Scientific Publications
    Matthew Magnusson and Laura Dietz

    15:10–15:15 STAC: Science Toolkit Based on Chinese Idiom Knowledge Graph
    Meiling Wang, Min Xiao, Changliang Li, Yu Guo, Zhixin Zhao and Xiaonan Liu

    15:15–16:00 Coffee break and Poster session

    16:00–17:30 Session 4

    16:00–16:50 INVITED TALK: Just When I Thought I Was Out, They Pull Me Back In: The Role of Knowledge Representation in Automatic Knowledge Base Construction
    Chris Welty

    16:50–17:10 Playing by the Book: An Interactive Game Approach for Action Graph Extraction from Text
    Ronen Tamari, Hiroyuki Shindo, Dafna Shahaf and Yuji Matsumoto

    17:10–17:30 Textual and Visual Characteristics of Mathematical Expressions in Scholar Documents
    Vidas Daudaravicius


  • Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications, pages 1–10, Minneapolis, USA, June 6, 2019. ©2019 Association for Computational Linguistics

    Distantly Supervised Biomedical Knowledge Acquisition via Knowledge Graph Based Attention

    Qin Dai1, Naoya Inoue1,2, Paul Reisert2, Ryo Takahashi1 and Kentaro Inui1,2
    1Tohoku University, Japan

    2RIKEN Center for Advanced Intelligence Project, Japan
    {daiqin, naoya-i, preisert, ryo.t, inui}@ecei.tohoku.ac.jp

    Abstract

    The increased demand for structured scientific knowledge has attracted considerable attention in extracting scientific relations from the ever-growing scientific publications. Distant supervision is a widely applied approach to automatically generate large amounts of labelled data for Relation Extraction (RE). However, distant supervision inevitably accompanies the wrong labelling problem, which will negatively affect RE performance. To address this issue, (Han et al., 2018) proposes a novel framework for jointly training an RE model and a Knowledge Graph Completion (KGC) model to extract structured knowledge from a non-scientific dataset. In this work, we firstly investigate the feasibility of this framework on a scientific dataset, specifically a biomedical dataset. Secondly, to achieve better performance on the biomedical dataset, we extend the framework with other competitive KGC models. Moreover, we propose a new end-to-end KGC model to extend the framework. Experimental results not only show the feasibility of the framework on the biomedical dataset, but also indicate the effectiveness of our extensions: our extended model achieves significant and consistent improvements on distantly supervised RE as compared with baselines.

    1 Introduction

    Scientific Knowledge Graphs (KGs), such as the Unified Medical Language System (UMLS)1, are extremely crucial for many scientific Natural Language Processing (NLP) tasks such as Question Answering (QA), Information Retrieval (IR), and Relation Extraction (RE). A scientific KG provides large collections of relations between entities, typically stored as (h, r, t) triplets, where h = head entity, r = relation and t = tail entity, e.g., (acetaminophen, may treat, pain). However, as with general KGs such as Freebase (Bollacker et al., 2008) and DBpedia (Lehmann et al., 2015), scientific KGs are far from complete, and this would impede their usefulness in real-world applications. Scientific KGs, on the one hand, face the data sparsity problem. On the other hand, scientific publications have become the largest repository ever for scientific KGs and continue to increase at an unprecedented rate (Munroe, 2013). Therefore, it is an essential and fundamental task to turn unstructured scientific publications into a well-organized KG, and this belongs to the task of RE.

    1https://www.nlm.nih.gov/research/umls/

    In RE, one obstacle encountered when building an RE system is the generation of training instances. To cope with this difficulty, (Mintz et al., 2009) proposes distant supervision to automatically generate training samples by leveraging the alignment between KGs and texts. They assume that if two entities are connected by a relation in a KG, then all sentences that contain this entity pair will express the relation. For instance, (aspirin, may treat, pain) is a fact triplet in UMLS. Distant supervision will automatically label all sentences, such as Example 1, Example 2 and Example 3, as positive instances for the relation may treat. Although distant supervision can provide a large amount of training data at low cost, it always suffers from the wrong labelling problem. For instance, in contrast to Example 1, Example 2 and Example 3 should not be seen as evidence supporting the may treat relationship between aspirin and pain, but will still be annotated as positive instances by distant supervision.

    (1) The clinical manifestations are generally typical nocturnal pain that prevents sleep and that is alleviated with aspirin.


  • (2) The tumor was remarkably large in size, and pain unrelieved by aspirin.

    (3) The level of pain did not change significantly with either aspirin or pentoxifylline, but the walking distance was farther with the pentoxifylline group.
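The labelling scheme described above can be sketched in a few lines; this is a minimal illustration of the distant-supervision assumption (the triplet and sentences are toy examples drawn from the discussion, not the authors' pipeline, and real systems match entity mentions rather than raw substrings):

```python
# Minimal sketch of distant supervision: every sentence mentioning both
# entities of a KG triplet is (possibly wrongly) labelled with that relation.
def distant_label(kg_triplets, sentences):
    """kg_triplets: iterable of (head, relation, tail); sentences: list of str.
    Returns a list of (sentence, relation) training instances."""
    labelled = []
    for head, rel, tail in kg_triplets:
        for sent in sentences:
            low = sent.lower()
            if head.lower() in low and tail.lower() in low:
                labelled.append((sent, rel))
    return labelled

kg = [("aspirin", "may_treat", "pain")]
sents = [
    "Nocturnal pain that is alleviated with aspirin.",       # informative
    "The tumor was large, and pain unrelieved by aspirin.",  # noisy, yet labelled
]
print(distant_label(kg, sents))
```

Both sentences receive the may_treat label, which is exactly the wrong-labelling problem the paper sets out to mitigate.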

    To automatically alleviate the wrong labelling problem, (Riedel et al., 2010; Hoffmann et al., 2011) apply multi-instance learning. In order to avoid handcrafted features and errors propagated from NLP tools, (Zeng et al., 2015) proposes a Convolutional Neural Network (CNN), which incorporates multi-instance learning into a neural network model, and achieves significant improvement in distantly supervised RE. Despite this impressive achievement, the model still has the limitation that it only selects the most informative sentence and ignores the rest, thereby losing the rich information stored in the neglected sentences. For instance, among Example 1, Example 2 and Example 3, Example 1 is undoubtedly the most informative one for detecting the relation may treat, but that does not necessarily mean other sentences such as Example 3 could not contribute to the relation detection. In Example 3, the entity aspirin and the entity pentoxifylline stand in an alternative relation, and the latter is a drug to treat muscle pain; therefore the former is also likely to be a pain-killing drug. To address this issue, attention mechanisms have recently been applied to extract features from all collected sentences. (Lin et al., 2016) proposes a relation vector based attention mechanism for distantly supervised RE. (Han et al., 2018) proposes a novel joint model that leverages a KG-based attention mechanism and achieves better performance than (Lin et al., 2016) on distantly supervised RE from the New York Times (NYT) corpus.

    The success that the joint model (Han et al., 2018) has attained in the newswire (or non-scientific) domain inspires us to choose this strong model as our base model and assess its feasibility on the biomedical domain. Specifically, the first question of this research is how the joint model behaves when the system is trained on a biomedical KG (e.g., UMLS) and a biomedical corpus (e.g., the Medline corpus). (Han et al., 2018) indicates that the performance of the base model can be affected by the representation ability of the KGC model. The representation ability of a KGC model also varies with the dataset (Wang et al., 2017).

    Therefore, given a new dataset (e.g., a biomedical dataset), it is necessary to extend the base model with other competitive KGC models, and to choose the best fit for the given dataset. However, the base model only implements two KGC models, which are based on TransE (Bordes et al., 2013) and TransD (Ji et al., 2015) respectively. Thus, the second question of this work is how other competitive KGC models such as ComplEx (Trouillon et al., 2016) and SimplE (Kazemi and Poole, 2018) influence the performance of the base model on the biomedical dataset. Last but not least, in a biomedical KG, a relation is scientifically restricted by entity type (ET). For instance, in the relation (h, may treat, t), the ET of t should be Disease or Syndrome. Therefore, ET information is an important feature for biomedical RE and KGC. For leveraging the ET information, which the base model lacks, in this work we propose an end-to-end KGC model to enhance the base model. The proposed KGC model is capable of identifying the ET via the word embedding of the target entity and incorporating the predicted ET into a state-of-the-art KGC model to evaluate the plausibility of potential fact triplets.

    We conduct evaluation on biomedical datasets in which the KG is collected from UMLS and the textual data is extracted from the Medline corpus. The experimental results not only show the feasibility of the base model in the biomedical domain, but also prove the effectiveness of our proposed extensions to the base model.

    2 Related Work

    RE is a fundamental task in the NLP community. In recent years, Neural Network (NN)-based models have been the dominant approaches for non-scientific RE, including Convolutional Neural Network (CNN)-based frameworks (Zeng et al., 2014; Xu et al., 2015; Santos et al., 2015) and Recurrent Neural Network (RNN)-based frameworks (Zhang and Wang, 2015; Miwa and Bansal, 2016; Zhou et al., 2016). NN-based approaches are also used in scientific RE. For instance, (Gu et al., 2017) utilizes a CNN-based model for identifying chemical-disease relations from the Medline corpus. (Hahn-Powell et al., 2016) proposes an LSTM-based model for identifying causal precedence relationships between two event mentions in biomedical papers. (Ammar et al., 2017) applies (Miwa and Bansal, 2016)'s model for scientific RE.

    Although remarkably good performances are achieved by the models mentioned above, they still train and extract relations on the sentence level and thus need a large amount of annotated data, which is expensive and time-consuming to obtain. To address this issue, distant supervision is proposed by (Mintz et al., 2009). To alleviate the noisy data from distant supervision, many studies model distant supervision for RE as a Multiple Instance Learning (MIL) problem (Riedel et al., 2010; Hoffmann et al., 2011; Zeng et al., 2015), in which all sentences containing a target entity pair (e.g., aspirin and pain) are seen as a bag to be classified. To make full use of all the sentences in the bag, rather than just the most informative one, (Lin et al., 2016) proposes a relation vector based attention mechanism to extract features from the entire bag, and outperforms the prior approaches. (Han et al., 2018) proposes a joint model that adopts a KG-based attention mechanism and achieves better performance than (Lin et al., 2016) on distantly supervised RE from the NYT corpus.

    In this work, we are primarily interested in applying distant supervision techniques to extract biomedical fact triplets from scientific publications. To validate and enhance the efficacy of the previous techniques in the biomedical domain, we choose the strong joint model proposed by (Han et al., 2018) as the base model and make some necessary extensions for our scientific RE task. Of the two main groups of KGC models (Wang et al., 2017), translational distance models and semantic matching models, the base model only implements the translational distance models TransE (Bordes et al., 2013) and TransD (Ji et al., 2015); we thus extend the base model with the semantic matching models ComplEx (Trouillon et al., 2016) and SimplE (Kazemi and Poole, 2018), so as to select the best fit for our task. In addition, the base model does not incorporate ET information, which we assume is crucial for scientific RE. Therefore, we propose an end-to-end KGC model to enhance the base model. Different from the work of (Xie et al., 2016), which utilizes an ET look-up dictionary to obtain the ET, the end-to-end KGC model is capable of identifying the ET via the word embedding of a target entity and is thus free of any dependence on an incomplete ET look-up dictionary.

    Figure 1: Overview of the base model (RE part and KGC part). Legend: C&P = Convolution & Pooling; ATS = Attention Scoring; RC = Relation Classification; KGS = Knowledge Graph Scoring.

    3 Base Model

    The architecture of the base model is illustrated in Figure 1. In this section, we introduce the base model proposed by (Han et al., 2018) in two main parts: the KGC part and the RE part.

    3.1 KGC Part

    Suppose we have a KG containing a set of fact triplets O = {(e1, r, e2)}, where each fact triplet consists of two entities e1, e2 ∈ E and their relation r ∈ R. Here E and R stand for the sets of entities and relations respectively. The KGC model then encodes e1, e2 ∈ E and their relation r ∈ R into low-dimensional vectors h, t ∈ R^d and r ∈ R^d respectively, where d is the dimensionality of the embedding space. As mentioned above, the base model adopts two representative translational distance models, Prob-TransE and Prob-TransD, which are based on TransE (Bordes et al., 2013) and TransD (Ji et al., 2015) respectively, to score a fact triplet. Specifically, given an entity pair (e1, e2), Prob-TransE defines its latent relation embedding r_ht via Equation 1.

    r_ht = t − h    (1)

    Prob-TransD is an extension of Prob-TransE and introduces additional mapping vectors h_p, t_p ∈ R^d and r_p ∈ R^d for e1, e2 and r respectively. Prob-TransD encodes the latent relation embedding via Equation 2, where M_rh and M_rt are projection matrices for mapping entity embeddings into relation spaces.

    r_ht = t_r − h_r,    (2)
    h_r = M_rh h,
    t_r = M_rt t,
    M_rh = r_p h_p^⊤ + I^{d×d},
    M_rt = r_p t_p^⊤ + I^{d×d}

    The conditional probability can be formalized over all fact triplets O via Equations 3 and 4, where f_r(e1, e2) is the KG scoring function, which is used to evaluate the plausibility of a given fact triplet. For instance, the score for (aspirin, may treat, pain) would be higher than the one for (aspirin, has ingredient, pain), because the former is more plausible than the latter. θ_E and θ_R are the parameters for entities and relations respectively, and b is a bias constant.

    P(r | (e1, e2), θ_E, θ_R) = exp(f_r(e1, e2)) / Σ_{r′∈R} exp(f_{r′}(e1, e2))    (3)

    f_r(e1, e2) = b − ‖r_ht − r‖    (4)
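Equations 1, 3 and 4 can be sketched numerically as follows. This is a toy illustration with random embeddings and an assumed bias b, not the trained Prob-TransE model:

```python
import numpy as np

# Toy sketch of Prob-TransE scoring: f_r(e1, e2) = b - ||(t - h) - r||
# (Equations 1 and 4), normalized over relations with a softmax (Equation 3).
d, b = 8, 7.0                       # toy embedding size and bias constant
rng = np.random.default_rng(0)

def score(h, t, r, bias=b):
    return bias - np.linalg.norm((t - h) - r)

def relation_probs(h, t, relations):
    scores = np.array([score(h, t, r) for r in relations])
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum()

h, t = rng.normal(size=d), rng.normal(size=d)
relations = [t - h, rng.normal(size=d)]  # first relation fits (h, r, t) exactly
p = relation_probs(h, t, relations)
print(p)
```

The relation vector that matches t − h gets the maximal score b, so it receives the higher probability, mirroring the plausibility argument above.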

    3.2 RE Part

    Sentence Representation Learning. Given a sentence s with n words s = {w1, ..., wn} including a target entity pair (e1, e2), a CNN is used to generate a distributed representation s for the sentence. Specifically, the vector representation v_t for each word w_t is calculated via Equation 5, where W^w_emb is a word embedding projection matrix (Mikolov et al., 2013), W^wp_emb is a word position embedding projection matrix, x^w_t is a one-hot word representation and x^wp_t is a one-hot word position representation. The word position describes the relative distance between the current word and the target entity pair (Zeng et al., 2014). For instance, in the sentence "Patients recorded pain_{e2} and aspirin_{e1} consumption in a daily diary", the relative distance of the word "and" is [1, −1].

    v_t = [v^w_t ; v^{wp1}_t ; v^{wp2}_t],    (5)
    v^w_t = W^w_emb x^w_t,
    v^{wp1}_t = W^wp_emb x^{wp1}_t,
    v^{wp2}_t = W^wp_emb x^{wp2}_t

    The distributed representation s is formulated via Equation 6, where [s]_i and [h_t]_i are the i-th values of s and h_t, M is the dimensionality of s, W is the convolution kernel, b is a bias vector, and k is the convolutional window size.

    [s]_i = max_t {[h_t]_i}, ∀i = 1, ..., M    (6)
    h_t = tanh(W z_t + b),
    z_t = [v_{t−(k−1)/2}; ...; v_{t+(k−1)/2}]

    KG-based Attention. Suppose that for each fact triplet (e1, r, e2) there are multiple sentences S_r = {s1, ..., sm}, in which each sentence contains the entity pair (e1, e2) and is assumed to imply the relation r; m is the size of S_r. As discussed before, distant supervision inevitably collects noisy sentences, so the base model adopts a KG-based attention mechanism to discriminate the informative sentences from the noisy ones. Specifically, the base model uses the latent relation embedding r_ht from Equation 1 (or Equation 2) as the attention over S_r to generate its final representation s_final. s_final is calculated via Equation 7, where W_s is a weight matrix, b_s is a bias vector, and a_i is the weight for s_i, the distributed representation of the i-th sentence in S_r.

    s_final = Σ_{i=1}^{m} a_i s_i,    (7)
    a_i = exp(⟨r_ht, x_i⟩) / Σ_{k=1}^{m} exp(⟨r_ht, x_k⟩),
    x_i = tanh(W_s s_i + b_s)

    Finally, the conditional probability P(r | S_r, θ) is formulated via Equations 8 and 9, where θ is the set of parameters for RE, which includes {W^w_emb, W^wp_emb, W, b, W_s, b_s, M, d}, M is the representation matrix of relations, d is a bias vector, o is the output vector containing the prediction probabilities of all target relations for the input sentence set S_r, and n_r is the total number of relations.

    P(r | S_r, θ) = exp(o_r) / Σ_{c=1}^{n_r} exp(o_c)    (8)

    o = M s_final + d    (9)

    4 Extensions

    The base model opens the possibility to jointly train RE models with KGC models for distantly supervised RE. The empirical results of the base model on the NYT corpus indicate that the performance of distantly supervised RE varies with the KGC model (Han et al., 2018). In addition, the performance of KGC models depends on the given dataset (Wang et al., 2017). Therefore, we assume that it is necessary to try multiple competitive KGC models within the joint framework so as to find the optimal combination for our biomedical dataset. However, the base model only implements translational distance models, TransE and TransD, and not the semantic matching models; this, we assume, might hinder its performance on the new dataset. To address this, we select two representative semantic matching models, ComplEx (Trouillon et al., 2016) and SimplE (Kazemi and Poole, 2018), as alternative KGC parts.

    As discussed in Section 1, in scientific KGs a fact triplet is severely restricted by ET information (e.g., the ET of e2 should be Disease or Syndrome in the fact triplet (e1, may treat, e2)). Therefore, for leveraging ET information, which the base model lacks, we also propose an end-to-end KGC model to extend the base model. Since the proposed KGC model is built on SimplE and is capable of Named Entity Recognition (NER), we call it SimplE NER.

    4.1 ComplEx based Attention

    Given a fact triplet (e1, r, e2), ComplEx encodes the entities e1, e2 and relation r into complex-valued vectors e1 ∈ C^d, e2 ∈ C^d and r ∈ C^d respectively, where d is the dimensionality of the embedding space. Since entities and relations are represented as complex-valued vectors, each x ∈ C^d consists of a real vector component Re(x) and an imaginary vector component Im(x), namely x = Re(x) + i Im(x). The KG scoring function of ComplEx for a fact triplet (e1, r, e2) is calculated via Equation 10, where ē2 is the conjugate of e2, and Re(·) (or Im(·)) means taking the real (or imaginary) part of a complex value. ⟨u, v, w⟩ is defined via Equation 11, where [·]_n is the n-th entry of a vector.

    f_r(e1, e2) = Re(⟨e1, r, ē2⟩)
                = ⟨Re(r), Re(e1), Re(e2)⟩
                + ⟨Re(r), Im(e1), Im(e2)⟩
                + ⟨Im(r), Re(e1), Im(e2)⟩
                − ⟨Im(r), Im(e1), Re(e2)⟩    (10)

    ⟨u, v, w⟩ = Σ_{n=1}^{d} [u]_n [v]_n [w]_n    (11)

    Owing to the asymmetry of this scoring function, namely f_r(e1, e2) ≠ f_r(e2, e1), ComplEx can effectively encode asymmetric relations (Trouillon et al., 2016). For calculating the attention, the r_ht in Equation 7 is defined via Equation 12, where ⊙ represents element-wise multiplication.

    r_ht = Re(e1) ⊙ Re(e2) + Im(e1) ⊙ Im(e2)    (12)
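Numpy's native complex dtype makes Equations 10–12 very compact; a toy sketch with random embeddings:

```python
import numpy as np

# Toy sketch of the ComplEx score (Equations 10-11) and the attention
# vector of Equation 12, using complex-valued numpy arrays.
rng = np.random.default_rng(3)
d = 8

def cvec():
    return rng.normal(size=d) + 1j * rng.normal(size=d)

e1, e2, r = cvec(), cvec(), cvec()

def complex_score(e1, r, e2):
    """f_r(e1, e2) = Re(<e1, r, conj(e2)>)."""
    return float(np.real(np.sum(e1 * r * np.conj(e2))))

# Asymmetry: swapping e1 and e2 generally changes the score,
# so asymmetric relations can be represented.
print(complex_score(e1, r, e2), complex_score(e2, r, e1))

r_ht = e1.real * e2.real + e1.imag * e2.imag   # Equation 12
```

Expanding Re(e1 · r · ē2) component-wise reproduces exactly the four trilinear terms of Equation 10, signs included, which is why the one-line complex product suffices.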

    4.2 SimplE based Attention

    Given a fact triplet (e1, r, e2), SimplE encodes each entity e ∈ E into two vectors h_e, t_e ∈ R^d and each relation r ∈ R into two vectors v_r, v_{r−1} ∈ R^d, where d is the dimensionality of the embedding space. h_e captures the entity e's behaviour as the head entity of a fact triplet and t_e captures e's behaviour as the tail entity. v_r represents r in a fact triplet (e1, r, e2), while v_{r−1} represents its inverse relation r−1 in the triplet (e2, r−1, e1). The KG scoring function of SimplE for a fact triplet (e1, r, e2) is defined via Equation 13.

    f_r(e1, e2) = 1/2 (⟨h_{e1}, v_r, t_{e2}⟩ + ⟨h_{e2}, v_{r−1}, t_{e1}⟩)    (13)

    Similar to the attention from ComplEx, the r_ht in Equation 7 is defined via Equation 14.

    r_ht = 1/2 (h_{e1} ⊙ h_{e2} + t_{e1} ⊙ t_{e2})    (14)

    4.3 SimplE NER based Attention

    The proposed end-to-end KGC model is based on SimplE, because SimplE outperforms several state-of-the-art models including ComplEx (Kazemi and Poole, 2018). The proposed model is illustrated in Figure 2. It includes an ET Classification part (below) and a KG Scoring part (above). In the ET Classification part, a multi-layer perceptron (MLP) with two hidden layers is applied to identify the ET based on the word embedding of the target entity. In the KG Scoring part, the head entity and tail entity, along with their predicted ETs and their relation, are projected into the corresponding KG embeddings, which are then fed to a KG scoring function.

    ET Classification Part. In this work, we use an MLP network to classify the ET for the head entity and the tail entity. The architecture of our MLP network is as follows:

    h_w = tanh(W^w_emb x^w),
    h_1 = sigmoid(W_1 h_w + b_1),
    h_2 = sigmoid(W_2 h_1 + b_2),
    y = sigmoid(W_ET h_2 + b_ET)    (15)

    where W^w_emb is a word embedding projection matrix, which is initialized with the pre-trained word embeddings trained on the Medline corpus via the Gensim word2vec tool, x^w is a one-hot entity representation, and y is the output vector containing the prediction probabilities of all target ETs. W_1, b_1, W_2, b_2, W_ET and b_ET are parameters to optimize.

    Figure 2: Overview of the proposed end-to-end KGC model. (The Entity Type Classification part maps the word embeddings of the head and tail entities, e.g., "dopamine" and "hypotension" with the relation "may be treated by", to ETs such as Disease or Syndrome and Biologically Active Substance, which feed the Knowledge Graph Scoring part.)

    KG Scoring Part. Given a fact triplet and the predicted ET pair ET1 (for e1) and ET2 (for e2), the proposed model projects them into their corresponding KG embeddings, namely h_{e1}, t_{e1}, v_r, v_{r−1}, h_{e2}, t_{e2}, h_{ET1}, t_{ET1}, h_{ET2} and t_{ET2} respectively, where h_{ET1} (or t_{ET1}) represents the KG embedding of the ET of e1 when e1 acts as the head entity (or tail entity) in a fact triplet. The KG scoring function is defined via Equation 16. Since the proposed KGC model is built on SimplE, we apply Equation 14 to calculate r_ht.

f_r(e1, e2) = 1/4 (⟨h_e1, v_r, t_e2⟩ + ⟨h_e2, v_r^-1, t_e1⟩ + ⟨h_ET1, v_r, t_ET2⟩ + ⟨h_ET2, v_r^-1, t_ET1⟩)    (16)
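Equation 16 can be sketched as follows (a minimal illustration, assuming entity, entity-type and relation embeddings are NumPy vectors of equal dimension; the SimplE three-way product ⟨a, b, c⟩ is the elementwise product summed):

```python
import numpy as np

def triple_product(a, b, c):
    # SimplE three-way product: <a, b, c> = sum_i a_i * b_i * c_i
    return float(np.sum(a * b * c))

def score_fact(h_e1, t_e1, h_e2, t_e2,        # entity embeddings
               h_ET1, t_ET1, h_ET2, t_ET2,    # entity-type embeddings
               v_r, v_r_inv):                 # relation and inverse-relation embeddings
    """Equation 16: average of four SimplE terms over the entities and their ETs."""
    return 0.25 * (triple_product(h_e1, v_r, t_e2)
                   + triple_product(h_e2, v_r_inv, t_e1)
                   + triple_product(h_ET1, v_r, t_ET2)
                   + triple_product(h_ET2, v_r_inv, t_ET1))
```

Because v_r and v_r_inv are distinct vectors, f_r(e1, e2) is in general not equal to f_r(e2, e1), which is what allows asymmetric relations to be modeled.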

    5 Experiments

Our experiments aim to demonstrate that (1) the base model proposed by (Han et al., 2018) is feasible for biomedical datasets, such as UMLS and the Medline corpus; (2) in order to improve performance on the given biomedical dataset, it is necessary to extend the base model with other competitive KGC models, such as ComplEx and SimplE; and (3) the proposed end-to-end KGC model is effective for distantly supervised RE on biomedical data.

#Entity    #Relation    #Train    #Test
25,080     360          53,036    11,810

Table 1: Statistics of the KG used in this work.

    5.1 Data

The biomedical datasets used for evaluation consist of a biomedical knowledge graph and biomedical textual data, which are detailed as follows.

Knowledge Graph. We choose UMLS as the KG. UMLS is a large biomedical knowledge base developed at the U.S. National Library of Medicine. UMLS contains millions of biomedical concepts and relations between them. We follow (Wang et al., 2014) and only collect fact triplets with the RO relation category (RO stands for "has Relationship Other than synonymous, narrower, or broader"), which covers interesting relations like may treat, may prevent, etc. From the UMLS 2018 release, we extract about 60 thousand such RO fact triplets (i.e., (e1, r, e2)) under the restriction that their entity pairs (i.e., e1 and e2) must co-occur within a sentence in the Medline corpus. They are then randomly divided into training and testing sets for KGC. Following (Weston et al., 2013), we keep high entity overlap between the training and testing sets, but zero fact triplet overlap. The statistics of the extracted KG are shown in Table 1. For training the ET Classification Part in Section 4.3, we also collect about 35 thousand entity-ET pairs (e.g., heart rates-Clinical Attribute) from the UMLS 2018 release.

Textual Data. The Medline corpus is a collection of biomedical abstracts maintained by the National Library of Medicine. From the Medline corpus, by applying a string matching model2, we extract 732,771 sentences that contain the entity pairs (i.e., e1 and e2) in the KG mentioned above as our textual data, of which 592,605 sentences are for training and 140,166 sentences for testing. For identifying the NA relation, besides the "related" sentences, we also extract "unrelated" sentences based on a closed world assumption: pairs of entities not listed in the KG are regarded as having the NA relation, and sentences containing them are considered "unrelated" sentences. In this way, we extract 1,738,801 "unrelated" sentences for the training data, and 431,212 "unrelated" sentences for the testing data. Table 2 presents some

2 We adopt the NER model available at https://github.com/mpuig/spacy-lookup.


[Figure 3: Aggregate precision/recall curves for different RE models (CNN+ATT, CNN+AVE, JointD+KATT, JointE+KATT, JointComplEx+KATT, JointSimplE+KATT, JointSimplE_NER+KATT), plotted as Precision vs. Recall over [0, 1].]

    sample sentences in the training data.
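The closed-world NA labeling described in the Textual Data paragraph can be sketched as below. This is a toy illustration under stated assumptions: sentences are assumed to arrive already paired with their matched entities, and the function name and data shapes are hypothetical.

```python
def label_sentences(sentences, kg_triplets):
    """Closed world assumption: entity pairs absent from the KG get the NA relation.

    sentences:   list of (sentence_text, e1, e2) tuples
    kg_triplets: set of (e1, relation, e2) fact triplets from the KG
    """
    related_pairs = {(e1, e2) for e1, _, e2 in kg_triplets}
    related, unrelated = [], []
    for text, e1, e2 in sentences:
        if (e1, e2) in related_pairs:
            related.append((text, e1, e2))          # "related" sentence
        else:
            unrelated.append((text, e1, e2, "NA"))  # "unrelated" sentence
    return related, unrelated
```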

    5.2 Parameter Settings

We base our work on (Han et al., 2018) and extend their implementation available at https://github.com/thunlp/JointNRE, and thus adopt an identical optimization process. We use the default parameter settings3 provided by the base model. Since we address distantly supervised RE in the biomedical domain, we use the Medline corpus to train the domain-specific word embedding projection matrix W_wemb.

    5.3 Result and Discussion

(Han et al., 2018) evaluate the base model on a non-scientific dataset. In this work, we first assess its feasibility on a scientific dataset, and second, investigate the effectiveness of our extensions, discussed in Section 4, with respect to enhancing distantly supervised RE from scientific data.

Relation Extraction. We follow (Mintz et al., 2009; Weston et al., 2013; Lin et al., 2016; Han et al., 2018) and conduct the held-out evaluation, in which the model for distantly supervised RE is evaluated by comparing the fact triplets identified from the textual data (i.e., the bag of sentences containing the target entity pairs) with those in the

3 As a preliminary study, we only adopt the default hyperparameters, but we will tune them in the future.

KG. We report precision-recall curves as well as Precision@N (P@N) in our evaluation.
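Precision@N over the ranked extracted triplets can be computed as follows (a minimal sketch; the predictions are assumed to be sorted by model confidence, and the function name is hypothetical):

```python
def precision_at_n(ranked_predictions, gold_triplets, n):
    """Fraction of the top-n ranked predicted triplets that appear in the KG."""
    top_n = ranked_predictions[:n]
    hits = sum(1 for triplet in top_n if triplet in gold_triplets)
    return hits / n
```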

The precision-recall curves are shown in Figure 3, where "JointD+KATT" and "JointE+KATT" represent the RE model with the KG-based attention obtained from Prob-TransD and Prob-TransE respectively; these are our base models, trained on both the KG and the textual data. Similarly, "JointComplEx+KATT", "JointSimplE+KATT" and "JointSimplE_NER+KATT" represent the RE model with the KG-based attention obtained from ComplEx, SimplE and SimplE_NER respectively, which are our extensions. "CNN+AVE" and "CNN+ATT" represent the RE model with average attention and relation-vector-based attention (Lin et al., 2016) respectively; these are not joint models and are trained only on the textual data. The results show that:

(1) All RE models with KG-based attention, such as "JointE+KATT", outperform those without it, such as "CNN+ATT". This observation is in line with (Han et al., 2018), and demonstrates that jointly training a KGC model with an RE model is an effective approach to improving the performance of distantly supervised RE not only for non-scientific datasets but also for biomedical datasets. In other words, this result confirms the feasibility of the base model proposed by (Han et al., 2018) on biomedical data. The comparison between the results of (Han et al., 2018) on a non-scientific dataset and ours on a scientific dataset also indicates that the performance of the base model can differ across datasets: on the scientific dataset, "JointE+KATT" performs better than "JointD+KATT", whereas on the non-scientific dataset the latter outperforms the former.

(2) Our extended models, "JointComplEx+KATT", "JointSimplE+KATT" and "JointSimplE_NER+KATT", achieve better precision than the base model over the major range of recall. This can be attributed to their better capability of modeling asymmetric relations (e.g., may treat and may prevent), because their KG scoring functions are asymmetric (i.e., f_r(e1, e2) ≠ f_r(e2, e1)). The superior performance indicates the necessity of our extensions to the base model. Specifically, on the frequently used biomedical datasets, UMLS and the Medline corpus, it is effective to replace the translational distance models, such as TransE and TransD, with the semantic matching models,


Fact Triplet: (insulin, gene plays role in process, lipid metabolism)
  s1: It is unknown whether short-term angiotensin receptor blocker therapy can improve glucose and lipid metabolism[e2] in insulin[e1]-resistant subjects.
  s2: Adipocyte lipid metabolism[e2] is primarily regulated by insulin[e1] and the catecholamines norepinephrine and epinephrine.
  s3: ...

Fact Triplet: (insulin, NA, TPA)
  s1: M wortmannin resulted in 80% and 20% decreases of glucose uptake stimulated by insulin[e1] and TPA[e2], respectively.
  s2: The effects of insulin[e1], IGF1 and TPA[e2] were also observed in the presence of cycloheximide.
  s3: ...

Table 2: Examples of textual data extracted from the Medline corpus.

such as ComplEx and SimplE, to increase the performance of distantly supervised RE. The effect of different KGC models on distantly supervised RE is discussed later.

(3) The model enhanced by our proposed KGC model, "JointSimplE_NER+KATT", achieves the highest precision over almost the entire range of recall compared with the models that apply existing KGC models. This proves the effectiveness of our proposed KGC model for distantly supervised RE. Additionally, unlike the existing KGC models, the proposed end-to-end KGC model is capable of identifying ET information from the word embedding of the target entity. This indicates that incorporating semantic information about entities, such as the ET, is a promising approach for enhancing the base model.

Effect of KGC on RE. (Han et al., 2018) indicate that KGC models can affect the performance of distantly supervised RE. To investigate the influence of KGC models on our specific RE task, we compare their link prediction results on our KG with their corresponding Precision@N (P@N) results on our RE task. Link prediction is the task of predicting the tail entity t given the head entity h and relation r, i.e., (h, r, *), or predicting the head entity h given (*, r, t). We report the mean reciprocal rank (MRR) and Hit@N scores for evaluating the KGC models. MRR is defined as:

MRR = (1 / (2 |tt|)) Σ_{(h,r,t) ∈ tt} (1/rank_h + 1/rank_t),

    supervised RE task. This observation also instructus how to select the best KGC model for the basemodel. In addition, Table 3 and Table 4 indicatethat ET is not only effective for distantly super-vised RE task, but also for KGC task, and this ob-servation will inspire us to explore other useful se-mantic feature of entity, such as the definition ofentity, for our task.

Model                    P@2k    P@4k    P@6k    Mean
JointE+KATT              0.876   0.786   0.698   0.786
JointD+KATT              0.848   0.725   0.528   0.700
JointComplEx+KATT        0.892   0.819   0.741   0.817
JointSimplE+KATT         0.900   0.808   0.721   0.809
JointSimplE_NER+KATT     0.913   0.829   0.753   0.831

Table 3: P@N for different RE models, where k = 1000.

                 MRR              Hit@
Model        Raw     Filter    1       3       10
TransE       0.156   0.200     0.113   0.244   0.356
TransD       0.138   0.149     0.098   0.160   0.245
ComplEx      0.278   0.457     0.380   0.507   0.587
SimplE       0.273   0.455     0.368   0.516   0.598
SimplE_NER   0.339   0.538     0.473   0.578   0.651

Table 4: Link prediction results for different KGC models.

    6 Conclusion and Future Work

In this work, we tackle the task of distantly supervised RE from biomedical publications. To this end, we apply the strong joint framework proposed by (Han et al., 2018) as our base model. To enhance its performance on our specific task, we extend the base model with other competitive KGC models. Furthermore, we propose a new end-to-end KGC model, which incorporates word-embedding-based entity type information into a state-of-the-art KGC model. Experimental results not only show the feasibility of the base


model in the biomedical domain, but also indicate the effectiveness of our extensions. Our extended model achieves significant and consistent improvements on the biomedical dataset compared with the baselines. Since semantic information about the target entity, such as ET information, is effective for our task, in the future we will explore other useful semantic features, such as the definition of the target entity and fact triplet chains between entities (e.g., cancer → disease has associated gene → Ku86 → gene plays role in process → NHEJ), for our task.

    Acknowledgement

This work was supported by JST CREST Grant Number JPMJCR1513, Japan, and KAKENHI Grant Number 16H06614.

References

Waleed Ammar, Matthew Peters, Chandra Bhagavatula, and Russell Power. 2017. The AI2 system at SemEval-2017 Task 10 (ScienceIE): Semi-supervised end-to-end entity and relation extraction. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 592–596.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.

Jinghang Gu, Fuqing Sun, Longhua Qian, and Guodong Zhou. 2017. Chemical-induced disease relation extraction via convolutional neural network. Database, 2017.

Gus Hahn-Powell, Dane Bell, Marco A. Valenzuela-Escárcega, and Mihai Surdeanu. 2016. This before that: Causal precedence in the biomedical domain. arXiv preprint arXiv:1606.08089.

Xu Han, Zhiyuan Liu, and Maosong Sun. 2018. Neural knowledge acquisition via mutual attention between knowledge graph and text. In Thirty-Second AAAI Conference on Artificial Intelligence.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 541–550. Association for Computational Linguistics.

Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 687–696.

Seyed Mehran Kazemi and David Poole. 2018. SimplE embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems, pages 4289–4300.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. 2015. DBpedia: A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195.

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2124–2133.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1003–1011. Association for Computational Linguistics.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. arXiv preprint arXiv:1601.00770.

Randall Munroe. 2013. The rise of open access. Science, 342(6154):58–59.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 148–163. Springer.

Cicero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. arXiv preprint arXiv:1504.06580.


Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071–2080.

Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12):2724–2743.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In AAAI, volume 14, pages 1112–1119.

Jason Weston, Antoine Bordes, Oksana Yakhnenko, and Nicolas Usunier. 2013. Connecting language and knowledge bases with embedding models for relation extraction. arXiv preprint arXiv:1307.7973.

Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2016. Representation learning of knowledge graphs with hierarchical types. In IJCAI, pages 2965–2971.

Kun Xu, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2015. Semantic relation classification via convolutional neural networks with simple negative sampling. arXiv preprint arXiv:1506.07650.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1753–1762.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, Jun Zhao, et al. 2014. Relation classification via convolutional deep neural network. In COLING, pages 2335–2344.

Dongxu Zhang and Dong Wang. 2015. Relation classification via recurrent neural network. arXiv preprint arXiv:1508.01006.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 207–212.


Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications, pages 11–20, Minneapolis, USA, June 6, 2019. ©2019 Association for Computational Linguistics

Scalable, Semi-Supervised Extraction of Structured Information from Scientific Literature

Kritika Agrawal, Aakash Mittal, Vikram Pudi
Data Sciences and Analytics Center, Kohli Center on Intelligent Systems
IIIT, Hyderabad, India
{kritika.agrawal@research.,aakash.mittal@students.,vikram@}iiit.ac.in

    Abstract

As scientific communities grow and evolve, there is a high demand for improved methods for finding relevant papers, comparing papers on similar topics and studying trends in the research community. All these tasks involve the common problem of extracting structured information from scientific articles. In this paper, we propose a novel, scalable, semi-supervised method for extracting relevant structured information from the vast available raw scientific literature. We extract the fundamental concepts of aim, method and result from scientific articles and use them to construct a knowledge graph. Our algorithm makes use of domain-based word embeddings and the bootstrap framework. Our experiments show the domain independence of our algorithm and that our system achieves precision and recall comparable to the state of the art. We also show the research trends of two distinct communities: computational linguistics and computer vision.

    1 Introduction

With the tremendous amount of research publications available online, there is an increasing demand to automatically process this information to facilitate easy navigation through this enormous literature for researchers. Whenever researchers start working on a problem, they are interested to know whether the problem has been solved previously, the methods used to solve it, the importance of the problem and its applications. This leads to the requirement of finding automatic ways of extracting such structured information from the vast available raw scientific literature, which can help summarize a research paper as well as the research community, and can help in finding relevant papers. Organizing scientific information into structured knowledge bases requires information extraction (IE) about scientific entities and their relationships. However, the

challenges associated with scientific information extraction are greater than for a general domain. General methods of information extraction cannot be applied to research papers due to their semi-structured nature and the new and unique terminologies used in them. Secondly, annotation of scientific text requires domain expertise, which makes annotation costly and limits resources.

There is a considerable amount of previous and ongoing work in this direction, starting from keyword extraction (Kim et al., 2010; Gollapalli and Caragea, 2014) and textual summarization (Jaidka et al., 2018). Other research has focused on unsupervised approaches such as bootstrapping (Tsai et al., 2013; Gupta and Manning, 2011), where hand-designed templates are introduced to extract scientific keyphrases and categorize them into different concepts, and more templates are then added automatically through bootstrapping. Hand-designed templates limit generalization to the different domains present within the scientific literature. A recent challenge on Scientific Information Extraction (ScienceIE) (Augenstein et al., 2017) provided a dataset consisting of 500 scientific paragraphs with keyphrase annotations for three categories (TASK, PROCESS, MATERIAL) across three scientific domains: Computer Science, Material Science, and Physics. This invited many supervised and semi-supervised techniques in this field. Although all these techniques can help extract the important concepts of a research paper in a particular domain, we need more general and scalable methods which can summarize the complete research community.

In this work, we propose a new technique to extract key concepts from research publications. Our main insight is that a paper cites another paper for its aim, its method, or its result. Therefore, the key contribution of a paper to the research community can be best summarized by its aim, the method used to solve the problem and


the final result. We define these concepts as:

Aim: The target or primary focus of the paper.
Method: The techniques used to achieve the aim.
Result: A well-defined output of the experiments, or a contribution which can be directly used by the research community.

Example: "The support-vector network (Result) is a new learning machine for two-group classification (Aim) problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space (Method). In this feature space, a linear decision surface is constructed."

We extract these concepts from the Title, Abstract and Citation Contexts of a research paper. These sections can be accurately and automatically extracted from research papers. The Title and Abstract work as a short and to-the-point summary of the work done in the paper. They are an essential place to find the exact phrases for these concepts without the introduction of too much noise. A citation context is the text around a citation marker. This text serves as a "micro summary" of the cited paper, and phrases in this text are important candidates for the aim, method or result of the cited paper. We combine data mining and natural language processing techniques to solve the problem scalably in a semi-supervised manner. Graph representations like knowledge graphs that link the information of a large body of publications can reveal patterns and lead to the discovery of new information that would not be apparent from the analysis of just one publication. Analysis on top of these representations can lead to new scientific insights and the discovery of trends in a research area. They can also facilitate other tasks like assigning reviewers, recommending relevant papers or improving scientific search engines. Therefore, we propose to build a graphical representation by extracting phrases representing the concepts Aim, Method and Result from scientific publications. We introduce these phrases as additional nodes and connect them to their corresponding paper nodes in the citation graph. We argue that the citation network is an integral part of a scientific knowledge graph and that the proposed representation can adequately summarize the research community. The proposed graph is shown in Figure 1.

Figure 1: Structure of the proposed representation.

Contributions: Our key contributions are:
(i) We propose a novel, scalable, semi-supervised and domain-independent method for extracting the concepts aim, method and result from the vast available raw scientific literature by using domain-based word embeddings and data mining techniques. Our approach also takes the Citation Context into account, apart from the Title and Abstract on which most prior work has relied.
(ii) We experimentally validate our approach and show statistically significant improvements over existing state-of-the-art models.
(iii) We show how the extracted concepts and the available citation graph can be used to represent the research community as a knowledge graph.
(iv) We demonstrate our method on a large multi-domain dataset built with the help of the DBLP citation network. Our dataset consists of 332,793 papers and 1,508,560 links between them.
(v) We present a case study on the computational linguistics and computer vision communities using the three concepts extracted from their articles, to verify the results of our system and to show the domain independence of our approach.

Our research background, hypothesis, and motivation were presented in this section. In the following section, we describe the proposed approach in detail. Finally, we present our datasets, experiments, and results, and briefly summarize state-of-the-art approaches before concluding the paper.

    2 Approach

    2.1 Concept Extraction

Problem Definition: Given a target document d, the objective of the concept extraction task is to extract a list of words or phrases which best represent the aim, method and result of document d.

Prior work has treated the problem of extracting keyphrases and the relations between them as a sequence labelling task. However, the non-availability of large annotated data for this purpose


limits this approach. Also, this approach does not take advantage of the fact that more than 96 percent of the phrases that form aim, method and result are noun phrases (Augenstein et al., 2017). Since we already have a defined set of candidates for the key phrases, we attack this problem as a multi-class classification problem: given a document, we classify its phrases as Aim, Method or Result. Our approach is built on the observation that the semantics of the sentence of document d containing a phrase belonging to any of the concept types are similar across research papers. To capture this semantic similarity, we use a k-nearest-neighbour classifier on top of state-of-the-art (Devlin et al., 2018) domain-based word embeddings. We start by extracting features from a small set of annotated examples and use bootstrapping (Gupta and Manning, 2014) to extract new features from the unlabeled dataset. Figure 2 shows our pipeline.

    Figure 2: Proposed Method

The following terminology will be used throughout the rest of the paper:

• Candidate phrases: Phrases present in the target document d which will be considered for labeling.
• Concept mention: A phrase labeled as either aim, method or result in the labeled dataset.
• Parent sentence of a phrase p: The original sentence in the target document to which the candidate phrase/concept mention p belongs.
• Left context phrase(S, p): The part of the parent sentence S before the occurrence of the candidate phrase or concept mention p.
• Right context phrase(S, p): The part of the parent sentence S after the occurrence of the candidate phrase or concept mention p.
• Left context vector(p): The vector representation of the left context phrase of p.
• Right context vector(p): The vector representation of the right context phrase of p.
• Feature vectors: Tuples of left and right context vectors which are used as features to label candidate phrases.
• Feature score: Each feature vector has an associated feature score between 0 and 1 that represents the confidence of it being a representative of its class. Seed features have a feature score of 1.
• Support score of candidate phrase p for class c: Every phrase is assigned a support score for each class, representing the confidence that the phrase belongs to that class.

Seed Feature Extraction: In this step, we extract features for each concept type using the small set of annotated examples. For each concept mention in the annotated examples, we construct a left context vector lcv and a right context vector rcv. These lcv and rcv then form part of the features for the class to which the concept mention belongs. Phrase embeddings are generated using a pre-trained BERT model (Devlin et al., 2018) fine-tuned on the DBLP research papers dataset. Details of BERT training and the datasets used for seed feature extraction are given in the Experiments section.

Candidate Phrase Extraction: To limit the search space of phrases, we propose to use the noun phrases present in the Title and Abstract of document d as candidate phrases. For citation contexts, named entities form a better set of candidates, as shown by (Ganguly and Pudi, 2016). However, different named entities can be linked to different papers cited in the same citation context. So it becomes essential to first identify which entity e corresponds to which cited paper cp, and then use the proposed algorithm to classify e as aim/method/result for the corresponding paper cp. For this purpose, we use the entity-citation linking algorithm of (Ganguly and Pudi, 2016). The matching function iterates over entities and citations to get their closeness scores. After the scoring step, a two-step pruning is performed: it first takes all the citations and keeps a list of the closest entity per citation; it then takes the remaining entities and keeps only the closest citation per entity. Finally, we get a list of tuples where each element contains a unique entity matched with its citation. Only the entities present in this list of tuples are considered as candidate phrases.

Labeling Candidate Phrases: For labeling candidates in iteration i, we use k-NN. The algorithm for labeling candidate phrases is presented in Algorithm 1.

    13

Algorithm 1: Label Candidate Phrases
1. For each sentence s in document d in the dataset, let p be an unlabeled phrase in sentence s.
2. Let lcv be the left context vector and rcv the right context vector corresponding to phrase p in sentence s.
3. Find the nearest neighbours of lcv and rcv among the feature vectors that are at most distance r away. Let the nearest neighbours corresponding to lcv be lnn (left nearest neighbours) and those corresponding to rcv be rnn (right nearest neighbours).
4. If the sizes of lnn and rnn are both less than k, the minimum number of neighbours required for classification, then the phrase cannot be labeled in this iteration and we move to the next phrase.
5. Otherwise, we take the k nearest neighbours of both lcv and rcv, and the support score of the phrase for class c is calculated as follows:

   N = {n | n ∈ top k neighbours of lcv or rcv, and label(n) = c}

   supportScore(p, c) = Σ_{n ∈ N} featureScore(n)

6. The predicted class for phrase p is then argmax_c supportScore(p, c).

Finally, after T iterations, unlabeled candidate phrases are discarded.

Extraction of New Features: For each phrase p assigned class c in any of the iterations, we generate the context vectors lcv and rcv. We define the feature score corresponding to the context vectors of a phrase p labeled as class c as:

featureScore(p) = supportScore(p, c) / Σ_{c'} supportScore(p, c')

For each class, the context vectors are sorted by their feature score and the top 5000 are taken as feature vectors.

Final Selection: For each document, we take the top t phrases (based on their supportScore) for each class as the final output of our system.
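The labeling steps of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions: cosine distance over NumPy vectors is used, the radius r and neighbour count k are toy values, and the function names are hypothetical (the paper classifies the left and right context vectors jointly; here each context contributes its top-k neighbours' feature scores to the support score).

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def label_phrase(lcv, rcv, features, k=3, r=0.5):
    """features: list of (context_vector, class_label, feature_score).
    Returns the argmax class by support score, or None if there are
    too few neighbours within radius r (step 4 of Algorithm 1)."""
    support = {}
    for ctx in (lcv, rcv):
        # Step 3: neighbours within max distance r, sorted by distance.
        nn = sorted((cosine_distance(ctx, v), c, s) for v, c, s in features
                    if cosine_distance(ctx, v) <= r)
        if len(nn) < k:
            return None              # Step 4: cannot label in this iteration.
        for _, c, s in nn[:k]:       # Step 5: sum feature scores of top-k neighbours.
            support[c] = support.get(c, 0.0) + s
    # Step 6: predicted class is the one with the highest support score.
    return max(support, key=support.get)
```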

    2.2 Graph Construction

Graph Definition: We build a graphical representation using the extracted concepts and the citation graph. Our graph has the following types of nodes and edges:

Paper nodes: The original paper nodes in the citation graph. Each paper node has metadata related to the paper, such as its DBLP id, title, authors, conference and year of publication.
Entity nodes: Nodes for the phrases extracted in the concept extraction step.
Cited-by relation: A cited-by relation is defined between paper nodes pi and pj if paper pi has cited pj.
Aim relation: An aim relation is defined between a paper node pi and an entity node ei if ei was extracted as an aim concept for pi.
Method relation: A method relation is defined between a paper node pi and an entity node ei if ei was extracted as a method concept for pi.
Result relation: A result relation is defined between a paper node pi and an entity node ei if ei was extracted as a result concept for pi.

Construction of Graph: A major challenge in constructing the graph from the phrases extracted in the concept extraction step is merging phrases with the same meaning. For entity node merging, we do the following:

1. We group the papers according to the conference in which they were published. Then, for all papers in the same group, we cluster their extracted phrases by running DBSCAN (Ester et al., 1996) over vector space representations of these phrases. The clusters are created based on lexical similarity, captured by the cosine distance between phrase embeddings. The intuition behind clustering phrases conference-wise is that the research papers in a conference share a domain, and thus phrases with high lexical similarity within a particular conference are much more likely to mean the same thing than phrases across conferences. This helps avoid errors such as the following: 'real time intrusion detection' in the security domain and 'real time object detection' in computer vision are very different from each other, but they might be clustered together on the basis of lexical similarity if DBSCAN were run on all the papers in the dataset at once.

2. Cluster merging across conferences: A cluster i belonging to conference c1 and a cluster j belonging to conference c2 are merged if they have any common phrase. This captures the fact that there can be more than one conference in the same domain, and hence some of their clusters should be merged if they correspond to the same term or phrase. For example, both NAACL and ACL have papers on machine translation, and therefore the individual clusters of these conferences corresponding to machine translation should be merged.

Finally, we get clusters such that the phrases in each cluster have the same meaning. We add only one entity node to the graph for each cluster. We define the relation type between a paper node and an entity node based on the label of the entity (the phrase inside the entity node) for the corresponding paper, as identified in the Concept Extraction step.
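Assuming the per-conference clusters have already been produced (step 1, DBSCAN over cosine distance between phrase embeddings), the cross-conference merge in step 2 reduces to a union-find over clusters that share a phrase. A minimal sketch with illustrative cluster contents:

```python
def merge_clusters(clusters):
    """clusters: dict mapping (conference, cluster_id) -> set of phrases.
    Merges any pair of clusters that share at least one phrase, across
    conferences, using union-find; returns the merged phrase sets."""
    parent = {k: k for k in clusters}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    keys = list(clusters)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if clusters[a] & clusters[b]:      # common phrase -> merge
                parent[find(a)] = find(b)
    merged = {}
    for k in keys:
        merged.setdefault(find(k), set()).update(clusters[k])
    return list(merged.values())

clusters = {
    ("NAACL", 0): {"machine translation", "mt systems"},
    ("ACL", 0): {"machine translation", "statistical mt"},
    ("CVPR", 0): {"image segmentation"},
}
groups = merge_clusters(clusters)
```

Pairwise unions suffice here: if cluster A overlaps B and B overlaps C, the union-find chain transitively merges all three even when A and C share no phrase.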

    3 Experimental Setup

Dataset Creation: All experiments were conducted on the DBLP Citation Network (version 7) dataset, an extensive collection of computer science papers. DBLP only provides citation-link information, abstracts, and paper titles. For the full text of these papers, we use the same dataset as Ganguly and Pudi (2017). This dataset is partly noisy, with some duplicate paper information, and there is no unique one-to-one mapping from DBLP paper ids to the actual text of each paper. During the creation of our final dataset, we either pruned ambiguous papers or manually resolved the conflicts. We arrived at a final set of 465,355 papers from the DBLP corpus for which full text is available. Since we need papers that are connected via citation relations, we further prune the dataset by taking only the largest connected component of the citation network, treating the links as bidirectional. This yields 332,793 papers with 1,508,560 citation links. For extraction of citation contexts, we used ParsCit (Prasad et al., 2018). For papers whose abstract was not available in the DBLP dataset, we use the one extracted by ParsCit.

Phrase embeddings: For the vector representation of a phrase, we use BERT (Devlin et al., 2018). We use the publicly available pre-trained model BERT-Base, Uncased (12 layers, 768 hidden units, 12 heads, 110M parameters) and fine-tune it on our DBLP research paper dataset, using the complete cleaned text of the papers. The model is fine-tuned on a total of 20,970,300 sentences with a maximum sequence length of 128 and a learning rate of 2 × 10⁻⁵. For generating phrase embeddings, we use the second-to-last layer as the pooling layer, with reduced-mean as the pooling strategy.

Concept Extraction: (a) For seed feature generation we use the following two publicly available datasets:
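The pooling strategy (second-to-last layer, reduced mean over the phrase's tokens) can be sketched over a dummy hidden-state tensor. Shapes here are illustrative and far smaller than BERT's actual (layers × seq_len × 768) outputs:

```python
import numpy as np

def phrase_embedding(hidden_states, token_mask):
    """hidden_states: (num_layers, seq_len, hidden_size) array of per-layer
    encoder outputs; token_mask: boolean (seq_len,) marking the phrase tokens.
    Pools the second-to-last layer with mean reduction, as described above."""
    layer = hidden_states[-2]              # second-to-last layer
    return layer[token_mask].mean(axis=0)  # reduced-mean pooling

# dummy activations: 3 layers, 4 tokens, hidden size 2
hs = np.arange(24, dtype=float).reshape(3, 4, 2)
mask = np.array([False, True, True, False])
vec = phrase_embedding(hs, mask)  # mean of token rows 1 and 2 of layer -2
```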

(i) SemEval 2017 Task 10 dataset (Augenstein et al., 2017): It contains 500 scientific paragraphs from the physics, material science and computer science domains, each marked with keyphrases, and each keyphrase is labelled as TASK, PROCESS or MATERIAL. The TASK and PROCESS concepts in this dataset closely relate to our definitions of AIM and METHOD. This complete dataset is used for seed feature extraction.

(ii) Gupta and Manning (2011) introduced a dataset of titles and abstracts of 474 research publications from the ACL Anthology, annotated with phrases corresponding to FOCUS, TECHNIQUE and DOMAIN. Their definitions of FOCUS and TECHNIQUE closely relate to our definitions of AIM and METHOD, respectively. We divided this data into two parts: one with 277 papers used as training data for seed feature extraction, and another with 197 papers used as test data for evaluation.

These two datasets helped build the seed features for the AIM and METHOD categories. We removed the papers from the SemEval dataset that overlapped with Gupta and Manning (2011). For RESULT, we manually annotated the titles and abstracts of 100 research publications in the computer science domain.

(b) While generating vector encodings for context phrases, we limit the length of the context phrase to 25 in order to handle very long sentences. We use cosine distance to measure the distance between vector representations of the phrases.

(c) There may be more than one concept mention in a sentence. To nullify the effect of the other concept mentions, we generated the seed feature list in two ways:

• Take the left context phrase and right context phrase and generate their vector representations. This is called the unmasked feature list.

• Mask the other candidate phrases C in the left and right context phrases of candidate ci before generating their embeddings. This is called the masked feature list.

Experiments were done for the masked and unmasked feature lists separately.

k    r     t   f1 score  precision  recall
30   0.65  3   40.66     46.04      36.41
60   0.65  3   40.47     52.60      32.88
40   0.65  3   40.38     48.65      34.51
40   0.60  4   40.06     47.12      34.84
30   0.75  4   38.38     41.95      35.37

Table 1: F1, precision and recall for the AIM concept.

k    r     t   f1 score  precision  recall
40   0.85  20  32.58     22.65      58.1
30   0.75  17  30.81     21.12      56.89
30   0.90  14  30.87     23.78      44
30   0.80  25  31.16     20.72      62.77
30   0.65  15  30.69     21.35      54.6

Table 2: F1, precision and recall for the METHOD concept.

(d) As the number of phrases added per iteration decreased substantially after iteration 5, we ran only 5 iterations of the bootstrapping algorithm for all experiments.

(e) We experimented with different values of the distance r and of k. We observed that, in general, precision increases as k increases and recall increases as r decreases.

Evaluation: For evaluating our results, we use the labeled dataset made available by Gupta and Manning (2011), using 197 of its 474 papers for evaluation. We calculate precision, recall and F1 score for each class. However, as Result phrases were not annotated in that dataset, we could evaluate only Aim and Method. We compare our proposed approach with Tsai et al. (2013), who ran a bootstrapping algorithm for a similar problem but used n-gram based features. They reported results on the ACL Anthology Network (AAN) corpus (Radev et al., 2013). We ran their algorithm on our dataset with parameter tuning as described by them.

    4 Results and Discussion

    4.1 Concept Extraction

We got the best results for the parameter values r = 0.65 and k = 60. Our bootstrapping algorithm gave output for 332,242 out of 332,793 papers. In Table 1, we report the top five scores for Aim under different parameter settings. The top ten scores for both the aim and method concepts were obtained with the unmasked feature list; results for the masked feature list are therefore not shown. In Table 2, we report the top five scores for Method under different parameter settings. Tables 3 and 4 compare our scores with those of Gupta and Manning (2011) and Tsai et al. (2013). Table 5 compares our scores with the score computed for the approach of Tsai et al. (2013) on our dataset.

Approach             f1 score  precision  recall
GM (2011)            30.5      46.7       36.9
(Tsai et al., 2013)  48.2      48.8       48.5
Our Approach         32.58     22.65      58.1

Table 3: Comparison with the state-of-the-art for the METHOD concept.

Approach             f1 score  precision  recall
(Tsai et al., 2013)  8.26      31.37      4.761
Our Approach         40.66     46.04      36.41

Table 4: Comparison with the state-of-the-art for the AIM concept on the DBLP dataset.

Approach             f1 score  precision  recall
(Tsai et al., 2013)  18.0      50.70      10.94
Our Approach         32.58     22.65      58.1

Table 5: Comparison with the state-of-the-art for the METHOD concept on the DBLP dataset.

Our proposed algorithm was able to extract phrases from scientific articles in a large dataset in a semi-supervised manner, with an F1 score comparable to the state-of-the-art. Our F1 score was lower than those of Gupta and Manning (2011) and Tsai et al. (2013); however, our recall was consistently higher. Our precision was perhaps low because we considered only noun phrases, whereas no such limitation applied when the test corpus was annotated. Gupta and Manning (2011) and Tsai et al. (2013) used hand-crafted features for the AAN corpus, whereas our features were extracted algorithmically, starting from a small annotated dataset spanning multiple domains (physics, material science and computer science). Table 5 shows the scalability of our approach: the bootstrapping algorithm of Tsai et al. (2013) could not achieve a decent score when run on our multi-domain dataset because phrases could not be extracted for most of the papers.

    4.2 Graph Construction

The total number of unique phrases produced by the proposed algorithm is 565,031. Using DBSCAN we form 63,638 clusters containing 266,015 phrases. Our final graph contains 332,242 paper nodes, 362,654 entity nodes, 483,899 aim relations, 982,396 method relations and 661 result relations. We store our graph in a Neo4j database (Webber and Robinson, 2018). A small sample from our constructed graph is shown in Figure 3. We can see that result relations are quite few compared to method and aim relations, mainly because of the smaller number of seed features for Result, which in turn is due to less annotated data compared to Aim and Method.

The constructed graph can summarize the research community in the following ways:
(i) All the papers on a particular topic can be accessed by simply finding the entity node corresponding to the topic in the graph. The associated papers can also be differentiated on the basis of whether the topic appears as an aim, method or result in the paper. This can also help in academic search and recommendation.
(ii) A field can be summarized by finding all the methods used in the field, and the applications of a field by finding all the aims where the field has been used as a method.
(iii) Trend analysis, conference proceedings summarization, or summarization of a particular author's work can be done using the metadata in the paper nodes.
Neo4j provides an interface for all the kinds of queries required for the above applications. The queries themselves are out of scope for this paper.
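The topic-centric lookups in (i) and (ii) can be illustrated over a toy edge list. In the actual system these are queries over Neo4j; the helper names and triples below are illustrative stand-ins:

```python
# (paper_id, relation, entity) triples, mirroring aim/method/result edges
edges = [
    ("p1", "aim", "machine translation"),
    ("p2", "method", "machine translation"),
    ("p2", "aim", "speech recognition"),
    ("p3", "result", "bleu improvement"),
]

def papers_on_topic(edges, topic, relation=None):
    """All papers linked to an entity node, optionally restricted to papers
    where the topic appears as a specific concept (aim/method/result)."""
    return sorted({p for p, rel, e in edges
                   if e == topic and (relation is None or rel == relation)})

def methods_used_for(edges, aim_topic):
    """Sketch of (ii): methods used in papers whose aim is aim_topic."""
    papers = set(papers_on_topic(edges, aim_topic, relation="aim"))
    return sorted({e for p, rel, e in edges if p in papers and rel == "method"})
```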

    5 Trend Analysis

We studied the fields of computational linguistics and computer vision.

Computational Linguistics: We studied the growth and decline of the following topics on the basis of the relative number of papers published on each topic over the years: summarization, word sense disambiguation and machine translation. Papers are included from the NAACL and ACL conferences from 1990 to 2012. Figures 4 and 6 show examples of trends extracted from our constructed knowledge graph. Figure 6 shows the transition of a topic from an aim concept to a method concept in the domain.

Computer Vision: We studied the growth and decline of the following topics on the basis of the relative number of papers published on each topic over the years: human pose detection, image segmentation and 3D reconstruction. Papers are included from the CVPR, ECCV, ICCV and ICPR conferences from 1990 to 2012. Figures 5 and 7 show examples of trends extracted from our constructed knowledge graph. Figure 7 shows the transition of a topic from an aim concept to a method concept in the domain.

Meaningful results in the analysis for both communities show the scalability and domain independence of our approach.
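The trend curves above are relative frequencies per year: the fraction of a venue's papers in each year linked to a given topic. A minimal sketch with toy paper records and a hypothetical helper name:

```python
from collections import Counter

def relative_trend(papers, topic):
    """papers: iterable of (year, topics) records drawn from the graph.
    Returns year -> fraction of that year's papers linked to the topic."""
    totals = Counter(year for year, _ in papers)
    hits = Counter(year for year, topics in papers if topic in topics)
    return {year: hits[year] / totals[year] for year in sorted(totals)}

papers = [
    (1990, {"machine translation"}),
    (1990, {"summarization"}),
    (2000, {"machine translation"}),
    (2000, {"machine translation", "summarization"}),
]
trend = relative_trend(papers, "machine translation")
```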

    6 Related Work

[Figure 3: Sample from our constructed graph. Green nodes correspond to research papers and brown nodes correspond to extracted phrase entities.]
[Figure 4: Growth and decline of research in different topics in computational linguistics.]
[Figure 5: Growth and decline of research in different topics in computer vision.]
[Figure 6: Transition from aim to method for 1. summarization, 2. machine translation, 3. word sense disambiguation.]
[Figure 7: Transition from aim to method for 1. 3D reconstruction, 2. human pose detection, 3. image segmentation.]

There has been growing interest in automatic methods of information extraction from scientific articles. Our work maps mainly to two types of problems: extracting keyphrases, concepts, and relations between them; and extracting structured information in the form of a knowledge graph from the scientific literature.

Keyphrase extraction specifically from scientific articles started with SemEval 2010 Task 5 (Kim et al., 2010), which focused on automatic keyphrase extraction from scientific articles and prepared a dataset of 284 articles marked with keyphrases. Gollapalli and Caragea (2014) studied the keyphrase extraction problem in an unsupervised setting; they extracted candidates from titles, abstracts and citation contexts and used PageRank (Page, 1998) to score the candidates. Gupta and Manning (2011) first proposed a task that classifies scientific terms for 474 abstracts from the ACL Anthology (Radev et al., 2013) into three aspects: domain, technique, and focus. They applied template-based bootstrapping on the titles and abstracts of articles, using hand-crafted dependency-based features. Building on this study, Tsai et al. (2013) improved the performance by introducing hand-designed features into the bootstrapping framework. Our system beats their systems in terms of recall for both the aim and method concepts, and we work on a larger, multi-domain dataset. SemEval 2017 Task 10 (Augenstein et al., 2017) focused on mention-level keyphrase identification and classification into three categories: TASK, PROCESS, and MATERIAL. They prepared an annotated dataset comprising 500 papers from material science and computer science journals. Many systems (Ammar et al., 2017; Tsujimura et al., 2017) solved the problem in a supervised manner: the top system (Ammar et al., 2017) modeled the problem as sequence labeling, and Tsujimura et al. (2017) trained LSTM-ER on the dataset. However, these supervised systems require a large amount of training data, in the absence of which they tend to overfit. Our semi-supervised method can work with a small set of annotated documents for the initial features.

There is also ongoing work on constructing knowledge graphs from the scientific literature. Sinha et al. (2015) build a heterogeneous graph consisting of six types of entities: field of study, author, institution (the affiliation of the author), paper, venue (journal and conference series) and event. Ammar et al. (2018) focus on constructing a literature graph consisting of paper, author and entity nodes and various interactions between them (e.g., authorship, citations, entity mentions). Luan et al. (2018) developed a unified framework for identifying entities, relations, and coreference clusters in scientific articles with shared span representations; they used supervised methods, creating a dataset with annotations for scientific entities, their relations, and coreference clusters for 500 scientific abstracts from AI conference proceedings. Our knowledge graph is more straightforward to build. It is also built on top of the citation graph, so it retains the vital citation information that is an integral part of the research community.

    Conclusion

This work proposes a semi-supervised technique for identifying Aim, Method and Result concepts in scientific articles. We show how these concepts can be introduced into the citation graph to graphically summarize the research community, and we describe various applications of the resulting graphical representation. We show the domain independence of our approach in two ways: (a) seed features from one domain (physics and material science from the SemEval dataset) were used to extract concepts from another domain (computer science papers from the DBLP dataset), and (b) meaningful results were obtained for two distinct communities, as shown in Section 5. We also experimentally show the scalability of our approach and compare the results with the state-of-the-art.

References

Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew E. Peters, Joanna Power, Sam Skjonsberg, Lucy Lu Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. 2018. Construction of the literature graph in Semantic Scholar. CoRR, abs/1805.02262.

Waleed Ammar, Matthew Peters, Chandra Bhagavatula, and Russell Power. 2017. The AI2 system at SemEval-2017 Task 10 (ScienceIE): Semi-supervised end-to-end entity and relation extraction. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 592–596, Vancouver, Canada. Association for Computational Linguistics.

Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew McCallum. 2017. SemEval 2017 Task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 546–555. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96, pages 226–231. AAAI Press.

Soumyajit Ganguly and Vikram Pudi. 2016. Competing algorithm detection from research papers. In Proceedings of the 3rd IKDD Conference on Data Science, CODS '16, pages 23:1–23:2, New York, NY, USA. ACM.

Soumyajit Ganguly and Vikram Pudi. 2017. Paper2vec: Combining graph and text information for scientific paper representation. In Advances in Information Retrieval, pages 383–395, Cham. Springer International Publishing.

Sujatha Das Gollapalli and Cornelia Caragea. 2014. Extracting keyphrases from research papers using citation networks. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, AAAI'14, pages 1629–1635. AAAI Press.

Sonal Gupta and Christopher Manning. 2011. Analyzing the dynamics of research by extracting key aspects of scientific papers. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1–9. Asian Federation of Natural Language Processing.

Sonal Gupta and Christopher D. Manning. 2014. Improved pattern learning for bootstrapped entity extraction. In CoNLL.

Kokil Jaidka, Muthu Kumar Chandrasekaran, Sajal Rustagi, and Min-Yen Kan. 2018. Insights from CL-SciSumm 2016: The faceted scientific document summarization shared task. International Journal on Digital Libraries, 19(2):163–171.

Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. 2010. SemEval-2010 Task 5: Automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval '10, pages 21–26, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. CoRR, abs/1808.09602.

L. Page. 1998. The PageRank citation ranking: Bringing order to the web. http://www-db.stanford.edu/backrub/pageranksub.ps.

Animesh Prasad, Manpreet Kaur, and Min-Yen Kan. 2018. Neural ParsCit: A deep learning-based reference string parser. International Journal on Digital Libraries, 19(4):323–337.

Dragomir R. Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. 2013. The ACL Anthology Network corpus. Language Resources and Evaluation, pages 1–26.

Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An overview of Microsoft Academic Service (MAS) and applications. In Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, pages 243–246, New York, NY, USA. ACM.

Chen-Tse Tsai, Gourab Kundu, and Dan Roth. 2013. Concept-based analysis of scientific literature. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, pages 1733–1738, New York, NY, USA. ACM.

Tomoki Tsujimura, Makoto Miwa, and Yutaka Sasaki. 2017. TTI-COIN at SemEval-2017 Task 10: Investigating embeddings for end-to-end relation extraction from scientific papers. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 985–989, Vancouver, Canada. Association for Computational Linguistics.

Jim Webber and Ian Robinson. 2018. A Programmatic Introduction to Neo4j, 1st edition. Addison-Wesley Professional.


Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications, pages 21–30, Minneapolis, USA, June 6, 2019. ©2019 Association for Computational Linguistics

Understanding the Polarity of Events in the Biomedical Literature: Deep Learning vs. Linguistically-informed Methods

Enrique Noriega-Atala, Zhengzhong Liang, John A. Bachman†, Clayton T. Morrison, Mihai Surdeanu

University of Arizona, Tucson, Arizona, USA
†Harvard Medical School, Boston, Massachusetts, USA

{enoriega,zhengzhongliang,claytonm,msurdeanu}@email.arizona.edu
john [email protected]

    Abstract

An important task in the machine reading of biochemical events expressed in biomedical texts is correctly reading the polarity, i.e., attributing whether the biochemical event is a promotion or an inhibition. Here we present a novel dataset for studying polarity attribution accuracy. We use this dataset to train and evaluate several deep learning models for polarity identification, and compare these to a linguistically-informed model. The best performing deep learning architecture achieves 0.968 average F1 performance in a five-fold cross-validation study, a considerable improvement over the linguistically informed model's average F1 of 0.862.

    1 Introduction

Recent advances in information extraction (IE) have resulted in high-precision, high-throughput systems tailored to the reading of biomedical scientific publications (Valenzuela-Escárcega et al., 2018; Peng et al., 2017; Quirk and Poon, 2016; Kim et al., 2013; Björne and Salakoski, 2