

Proceedings of NAACL-HLT 2019, pages 3235–3245, Minneapolis, Minnesota, June 2 - June 7, 2019. ©2019 Association for Computational Linguistics


Learning Relational Representations by Analogy using Hierarchical Siamese Networks

Gaetano Rossiello†, Alfio Gliozzo‡, Robert Farrell‡, Nicolas Fauceglia‡, Michael Glass‡
†Department of Computer Science, University of Bari, Italy
‡IBM Research AI, Yorktown Heights, NY, US
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

We address relation extraction as an analogy problem by proposing a novel approach to learn representations of relations expressed by their textual mentions. In our assumption, if two pairs of entities belong to the same relation, then those two pairs are analogous. Following this idea, we collect a large set of analogous pairs by matching triples in knowledge bases with web-scale corpora through distant supervision. We leverage this dataset to train a hierarchical siamese network in order to learn entity-entity embeddings which encode relational information through the different linguistic paraphrases expressing the same relation. We evaluate our model in a one-shot learning task, showing a promising generalization capability in classifying unseen relation types, which makes this approach suitable for performing automatic knowledge base population with minimal supervision. Moreover, the model can be used to generate pre-trained embeddings which provide a valuable signal when integrated into an existing neural-based model, outperforming the state-of-the-art methods on a downstream relation extraction task.

1 Introduction

The task of identifying semantic relationships between entities in unstructured textual corpora, namely Relation Extraction (RE), is often a prerequisite for many other natural language understanding tasks, e.g. automatic knowledge base population, question answering, etc. RE is commonly addressed as a classification task (Bunescu et al., 2005), where a model is trained to classify relation mentions in text among a predefined set of relation types. For instance, given the sentence “Robert Plant is the singer of the band Led Zeppelin”, an effective RE system might extract the triple memberOf(ROBERT PLANT, LED ZEPPELIN), where memberOf is a relation label expressed by the linguistic context “is the singer of the band”.

Since a given relation can be expressed using different textual patterns surrounding entities, the state-of-the-art RE models which follow this approach need a considerable amount of examples for each relation to reach satisfactory performance. Distant supervision (Mintz et al., 2009) instead uses training examples from a knowledge base, guaranteeing a large amount of (popular) relation examples without human intervention, which can be used effectively by neural networks (Lin et al., 2016; Glass et al., 2018). However, even with this technique, approaching RE as a classification task presents several limitations: (1) distant supervision models are not accurate in extracting relations with a long-tailed distribution, because they typically have a small set of instances in knowledge bases; (2) in most domains, relation types are very specific and only a few examples of each relation are available; (3) these models cannot be applied to recognize new relation types not observed during training.

In this paper, we address RE from a different perspective by reducing it to an analogy problem. Our assumption states that if two pairs of entities, (A, B) and (C, D), have at least one relation r in common, then those two pairs are analogous. Vice versa, solving proportional analogies, such as A : B = C : D, consists of identifying the implicit relations shared between two pairs of entities. For example, ROME:ITALY = PARIS:FRANCE is a valid analogy because capitalOf is a relation in common.

Based on this idea, we propose an end-to-end neural model able to measure the degree of analogical similarity between two entity pairs, instead of predicting a confidence score for each relation type. An entity pair is represented through its mentions in a textual corpus, i.e. the sequences of sentences where the entities in the pair co-occur. If a mention represents a specific relation type, then this relationship is expressed by the linguistic context surrounding the two entities, e.g. “Rome is the capital of Italy” or “The capital of France is Paris” referring to the example above. Thus, given two analogous entity pairs represented by their textual mentions sets as input, the model is trained to minimize the difference between the representations of relations having the same linguistic patterns. In other words, the model learns the different paraphrases expressing the same relation. In our research hypothesis, a model trained in such a way is able to recognize analogies between unseen entity pairs belonging to new unseen relation types by: (1) generalizing over the sequence of words in the mentions; (2) projecting the sequence of words in the mentions into a vector space representing relational semantics. This approach poses several research questions: (RQ1) How to collect and organize a dataset for training? (RQ2) What kinds of models are effective for this task? (RQ3) How should the model be evaluated?

Knowledge bases, such as Wikidata or DBpedia, consist of large relational data sources organized in the form of triples, predicate(SUBJECT, OBJECT). We exploit this information to build a reliable set of analogous facts used as ground truth. Then, we adopt distant supervision to retrieve relation mentions in web-scale textual corpora by matching the subject-object entities which co-occur in the same sentences (Riedel et al., 2010; ElSahar et al., 2018; Glass and Gliozzo, 2018a). Through this technique we can train our model on millions of analogy examples without human supervision.
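
As a concrete illustration of this distant supervision step, the sketch below matches knowledge-base triples against raw sentences by simple surface co-occurrence of the two entity labels. It is only a toy version of the procedure described above: the function name, the data layout and the naive string matching are assumptions made for illustration, not the paper's actual pipeline (which links entities via a named entity recognizer or preferred labels over web-scale corpora).

```python
from collections import defaultdict

def collect_mention_sets(triples, sentences):
    """Toy distant-supervision matcher: for every KB triple r(subj, obj),
    gather the sentences in which both entity labels co-occur.
    `triples` is an iterable of (relation, subject, object) strings and
    `sentences` a list of raw sentence strings (hypothetical inputs)."""
    mention_sets = defaultdict(list)   # (subj, obj) -> list of mention sentences
    relations = defaultdict(set)       # (subj, obj) -> set of KB relations
    for relation, subj, obj in triples:
        relations[(subj, obj)].add(relation)
    for sent in sentences:
        for (subj, obj) in relations:
            if subj in sent and obj in sent:  # naive surface match by label
                mention_sets[(subj, obj)].append(sent)
    return mention_sets, relations

triples = [("memberOf", "Robert Plant", "Led Zeppelin"),
           ("capitalOf", "Rome", "Italy")]
sentences = ["Robert Plant is the singer of the band Led Zeppelin.",
             "Rome is the capital of Italy."]
mentions, rels = collect_mention_sets(triples, sentences)
```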

Since our goal is to train a model able to compute the relational similarity given two sets of textual mentions, we use siamese networks to learn discriminative features between those two instances (Hadsell et al., 2006). This kind of neural network has been used in both computer vision (Koch et al., 2015) and natural language processing (Mueller and Thyagarajan, 2016; Neculoiu et al., 2016) in order to map two similar instances close in a feature space. However, in our setting each instance consists of a set of mentions, therefore it is inherently a multi-instance learning task¹. We propose a hierarchical siamese network with an attention mechanism at both the word level (Yang et al., 2016) and the set level (Ilse et al., 2018) in order to select the textual mention which best describes the relation. To the best of our knowledge, this is the first application of a siamese network by pairing sets of instances, so it can be considered a novelty of this work.

¹Due to the weak supervision, the whole set of mentions is labeled, but each individual mention in the set is unlabeled.

We evaluate the generalization capability of our model in recognizing unseen relation types through a one-shot relational classification task introduced in this paper. We train the parameters of the model on a subset of the most frequent relations of one of the three different distantly supervised datasets used in our experiments. Then, we evaluate it on the long-tailed relations of each dataset. During the test phase, only a single example for each unseen relation is provided. This example is not used to update the parameters of the model as in a classification task, but rather to produce the vector representation of the relation itself. Entity pairs having mention sets close to this representation are more likely to be analogous. The experiments show promising results of our approach on this task, compared with recent deep models commonly used for encoding textual representations (Conneau et al., 2017). However, when the number of unseen relation types increases, the performance of our model falls short of the results obtained in one-shot image classification (Koch et al., 2015), opening an interesting challenge for future work.

Finally, our model shows a transfer capability in other tasks through the use of its pre-trained vectors. Indeed, a branch of the hierarchical siamese network can be used to generate entity-entity representations given sets of mentions as input, which we call analogy embeddings. In our experiments, we integrate those representations into an existing end-to-end model based on convolutional networks (Glass and Gliozzo, 2018b), outperforming the state-of-the-art systems on two shared datasets commonly used for distantly supervised relation extraction.

2 Related Work

Relation Extraction  Several approaches have been proposed in the literature to address the problem of extracting relations from text with minimal supervision.

The bootstrapping method (Agichtein and Gravano, 2000) iteratively collects the textual patterns between a few example pairs of entities, and uses them to retrieve other pairs of entities from a corpus. This method is limited by the semantic drift issue, since wrong patterns might be collected.

OpenIE (Mausam et al., 2012) is an unsupervised method for extracting triples from text, where the relations are linguistic phrases. The lack of a canonical form for the extracted relations makes this approach not suitable to populate knowledge bases with a fixed schema.

Universal schema (Riedel et al., 2013) addresses RE by combining the OpenIE and knowledge base relations through a matrix factorization technique typically adopted in the collaborative filtering approach of recommendation systems. The column-less (Toutanova et al., 2015) and row-less (Verga and McCallum, 2016) extensions of this method can handle unseen entity pairs and textual relations when combined (Verga et al., 2017).

One-shot RE has been addressed by Yuan et al. (2017), who adopt a siamese network to extract fine-grained relations which typically have few training examples. This model has two main limitations. Firstly, it works only by pairing two single mentions and is not able to handle a whole set of mentions referring to a relation instance. Our hierarchical siamese network overcomes this issue by using an attention mechanism at both the word and mention level. Moreover, their one-shot evaluation mainly focuses on extracting the same relation types seen during training. Instead, the goal of our one-shot task is to evaluate the transfer capability in extracting unseen relation types across domains using a single pre-trained model.

Recently, Levy et al. (2017) proposed to reduce RE slot-filling to a question answering problem. The main idea is to build a set of question-answer pairs for the relations in knowledge bases and train a reading comprehension model using this dataset. This approach shows promising zero-shot capability in extracting unseen relation types. However, the schema querification phase requires a crowdsourcing effort. Our method uses distant supervision, so it does not need any kind of manual annotation.

Word Analogy  The analogy problem, from a computational linguistics perspective, was originally addressed by Turney (2006), who investigates several similarity measures for solving word analogy questions in the Scholastic Aptitude Test dataset. The author provides an interesting argument regarding the different types of similarities, attributional and relational, and their use in solving word analogies. Attributional similarity, typical of the word vector space models, is useful for synonym detection, word sense disambiguation and so on. Instead, relational similarity is suitable for understanding analogies between two pairs of words. Our neural-based analogy approach is inspired by this finding.

Recently, word analogies, namely the proportional analogy between two word pairs such as a : b = c : d, have been used by (Mikolov et al., 2013; Pennington et al., 2014) to show the capability of word embeddings to discover linguistic regularities in word contexts using vector offsets (e.g. king − man + woman = queen). The works in (Gladkova et al., 2016; Vylomova et al., 2016) explore the use of word vectors to model semantic relations. The proportional analogy is also adopted by (Liu et al., 2017) as analogical inference in order to learn multi-relational embeddings which are evaluated on knowledge base completion benchmarks.

However, in order to apply word embedding models to proportional analogy, the model must have seen the words during training. This approach is unsuitable for computing the analogy between out-of-vocabulary words. Our approach overcomes this limitation, since it works by considering the contexts where the entities occur.

3 Learning Relations by Analogy

Given two pairs of entities, (A, B) and (C, D), their semantic relations can be expressed by their mentions in text, (A, B) = {Si} and (C, D) = {Sj}. Specifically, {Si} and {Sj} are the sets of sentences where the entities of (A, B) and of (C, D), respectively, co-occur. Two pairs of entities are analogous, A : B = C : D, iff their mentions sets, or part of them, express the same relation r. Knowledge bases, such as Wikidata, contain millions of trusted facts in the form of triples, r(A, B), namely pairs of entities in known relationships. We leverage these relational data sources as ground truth in order to collect a set of proportional analogy statements. Then, we build a dataset through the distant supervision technique by retrieving the mentions sets from web-scale corpora.

Our idea is to train a neural network to solve the analogy problem between any two entity pairs in this dataset, as long as they are described by the textual contexts where they co-occur. In other words, this task is reduced to a binary classification of determining whether the relational similarity between the representations of two sets of mentions exceeds a threshold. The network is trained by feeding two sets of mentions related to two different pairs, and it is optimized to return a positive label if the two entity pairs are analogous, namely they share at least one relation, or a negative label otherwise.

Figure 1: Learning relations by analogy through matching the facts from knowledge bases with textual corpora.

Figure 1 provides an example of this process. For the relation memberOf, the entity pairs (ROBERT PLANT, LED ZEPPELIN) and (DAVID GILMOUR, PINK FLOYD) are sampled. The two entity pairs are converted into their respective mentions sets gathered from a textual corpus, such as Wikipedia. Since these two entity pairs are analogous, the network is optimized to learn representations of the two textual contexts that are close in the feature space. In fact, the first sentences of both pairs represent the concept of membership of a band, even if they are expressed using different words. Based on our assumption, the aim is to learn how to encode the relational representations through the different paraphrases of the same relation. However, the model also needs negative examples during the training phase. We randomly select an entity pair from a different relation for each positive example, such as (ROBERT PLANT, LED ZEPPELIN) and (PARIS, FRANCE). Since we cast the problem as a binary classification task, we create a balanced dataset of positive and negative examples.

Siamese neural networks (Bromley et al., 1993) are well suited to this task because they are specifically designed to compute the similarity between two instances. A siamese network has symmetric twin sub-networks which share the same parameters, but are joined by an energy function at the head. Weight sharing forces two similar instances to be mapped to very close locations in feature space, because both of the sub-networks are optimized using the same function. In computer vision, siamese architectures based on convolutional neural networks (Hadsell et al., 2006; Koch et al., 2015; Vinyals et al., 2016) have shown promising performance in learning highly discriminative features by pairing images that belong to the same class. Likewise, our hypothesis is that a siamese network trained by matching two distinct mentions sets that share the same relation is able to learn how to map patterns of words across the sentences containing the two pairs of entities so as to capture the semantics of the relation. For instance, given the example in Figure 1, an effective siamese network should determine that the patterns “is the lead singer” and “was the guitarist” express the same relation, memberOf.

To train a siamese network based on our approach, we have to face the following challenges: (1) language is highly variable and the same relation may be expressed in a multitude of different ways; (2) the mentions set of an entity pair consists of several sentences, each of which might express different relations, hence this is a multi-instance learning problem; (3) distant supervision could provide wrong labeling, namely sentences which do not express any specific relation.


Figure 2: Hierarchical siamese network.

4 Hierarchical Siamese Network

To face these challenges, we propose a Hierarchical Siamese Network (HSN) architecture, as shown in Figure 2. In the following paragraphs, we describe the details of each component.

Input Representation  The HSN takes as input two entity pairs represented by their mentions sets. Since the twin sub-networks of the HSN are identical, we focus on only one of them. Given a triple r(A, B) from a knowledge base, the relation r can be expressed through the set of sentences in a textual corpus where the two entities co-occur: r(A, B) = {S1, S2, . . . , Sn}, with Si = {wi1, wi2, . . . , wik}, where wij represents the j-th word in the sentence Si, for 1 ≤ i ≤ n and 1 ≤ j ≤ k. The purpose of a sub-network is to learn a low-rank vector representation rA,B for the relation r expressed by the pair (A, B). This is done by hierarchically composing the word and sentence representations.

Gated Recurrent Unit for Sentence Encoder  Given a sentence Si = {wi1, wi2, . . . , wik}, we map each one-hot word representation of wij into its word embedding $x_{ij} = E w_{ij}$, where $E_{d,|V|}$ is a matrix of real-valued vectors of size d, and V is a (fixed) vocabulary. Word embeddings are designed to encode syntactic and semantic features of words and can be randomly initialized or pre-trained on large corpora. We use pre-trained GloVe (Pennington et al., 2014) embeddings for our purposes. We encode the whole sentence Si into a low-rank representation by composing its constituent word embeddings. An effective way to perform such an encoding is using recurrent neural networks (RNN), which are able to compose word embeddings by taking into account their positions in the sentence conditioned on the previous words. For our model, this capability is critical in order to detect sequences of words which express a particular relation, such as “is the capital of”. We use a bidirectional GRU (Bahdanau et al., 2014) to gather the information from both directions for words. Formally, given $\overrightarrow{h}_{ij} = \overrightarrow{\mathrm{GRU}}(x_{ij})$ and $\overleftarrow{h}_{ij} = \overleftarrow{\mathrm{GRU}}(x_{ij})$, the hidden state $h_{ij} = [\overrightarrow{h}_{ij}, \overleftarrow{h}_{ij}]$ is a new dense representation of wij which also encodes the information of the whole sentence.
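
The following sketch shows how such a bidirectional GRU sentence encoder could look in PyTorch; the class name, dimensions and hyper-parameters are illustrative assumptions, not the paper's configuration. The embedding layer stands in for the pre-trained GloVe lookup, and the output contains one hidden state per word obtained by concatenating the forward and backward GRU states.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Minimal bidirectional-GRU sentence encoder sketch."""

    def __init__(self, vocab_size=10000, emb_dim=50, hidden_dim=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)  # could be initialized from GloVe
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, word_ids):                 # word_ids: (batch, seq_len)
        x = self.embedding(word_ids)             # (batch, seq_len, emb_dim)
        h, _ = self.gru(x)                       # (batch, seq_len, 2 * hidden_dim)
        return h                                 # one h_ij per word, both directions concatenated
```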

Word Attention with Context Vector  However, only certain words in a sentence express the semantics of a relation, therefore we need a strategy to automatically identify them during training. For example, the words “singer” and “guitarist” at both sides of Figure 1 are good candidates to express the relation member. We use the attention mechanism with a context vector proposed in (Yang et al., 2016) to reward such words which are important to the meaning of a relation and then aggregate their information in the sentence representation. In detail, $s_i = \sum_k \alpha_{ik} h_{ik}$, where $\alpha_{ij} = \frac{\exp(u_{ij}^\top u_w)}{\sum_k \exp(u_{ik}^\top u_w)}$ and $u_{ij} = \tanh(W_w h_{ij} + b_w)$. The vector $s_i$ represents the sentence Si and is computed as the weighted sum of the GRU-based word vectors $h_{ij}$ using the normalized attention weights $\alpha_{ij}$. The parameters for the attention mechanism are the weights and biases $W_w$, $b_w$ and $u_w$, the context vector: a global fixed vector which, independently from a specific word, represents a kind of query which helps to inform what is the most informative word for each analogy. The context vector $u_w$ essentially works like a memory mechanism, as described in (Sukhbaatar et al., 2015; Kumar et al., 2016).
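
A minimal sketch of this word-level attention, assuming a PyTorch implementation with illustrative dimensions, could look as follows; the `WordAttention` class and its attribute names are hypothetical.

```python
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    """Sketch of word attention with a global context vector u_w:
    u_ij = tanh(W_w h_ij + b_w), alpha_ij = softmax_j(u_ij^T u_w),
    s_i = sum_j alpha_ij h_ij."""

    def __init__(self, hidden_dim=200):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)         # W_w and b_w
        self.context = nn.Parameter(torch.randn(hidden_dim))  # u_w

    def forward(self, h):                          # h: (batch, seq_len, hidden_dim)
        u = torch.tanh(self.proj(h))               # (batch, seq_len, hidden_dim)
        alpha = torch.softmax(u @ self.context, dim=-1)   # (batch, seq_len) word weights
        s = (alpha.unsqueeze(-1) * h).sum(dim=1)   # (batch, hidden_dim) sentence vector
        return s, alpha
```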

Attention for Multi-instance Relation Representation  Once all sentences in the mentions set are encoded, the aim of the last layer of a sub-network is to produce the vector $r_{A,B}$ which represents the pair (A, B). However, while weak supervision guarantees a large amount of training data without any human intervention, wrongly labeled sentences inevitably occur. For instance, sentence S2 of the pair on the right side of Figure 1 does not express the relation member precisely, therefore a wrong bias could propagate during the training phase. This issue is typically addressed through a multi-instance setting, where a model should identify the correct instance(s) from a bag. Recently, end-to-end neural architectures have been proposed to address this multi-instance classification problem (Wang et al., 2018; Feng and Zhou, 2017) by proposing several ways to aggregate the unlabeled instances, such as taking their average. Our goal is to have a model which is able to properly select the most relevant sentences by ascribing different weights to the encoded sentences. For this purpose, we adopt an attention mechanism at the sentence level. It is important to point out that the sentences in the mentions set do not have any temporal relationship, therefore we adapt the standard attention strategies as described in (Ilse et al., 2018). In detail, $r_{A,B} = \sum_i \alpha_i s_i$ is the embedding of the relation r given the pair (A, B), with $\alpha_i = \frac{\exp(u_i^\top u_s)}{\sum_k \exp(u_k^\top u_s)}$ and $u_i = \tanh(W_s s_i + b_s)$, where $W_s$ and $u_s$ are parameters.
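
The set-level attention mirrors the word-level mechanism, but pools a bag of encoded mentions rather than a sequence of words. A possible sketch, again with hypothetical names and with an optional padding mask added here (our assumption) to handle entity pairs with fewer mentions than the bag size, is:

```python
import torch
import torch.nn as nn

class MentionSetAttention(nn.Module):
    """Sketch of order-invariant attention over a bag of encoded mentions,
    in the spirit of Ilse et al. (2018): u_i = tanh(W_s s_i + b_s),
    alpha_i = softmax_i(u_i^T u_s), r_AB = sum_i alpha_i s_i."""

    def __init__(self, sent_dim=200):
        super().__init__()
        self.proj = nn.Linear(sent_dim, sent_dim)            # W_s and b_s
        self.context = nn.Parameter(torch.randn(sent_dim))   # u_s

    def forward(self, sents, mask=None):       # sents: (batch, n_mentions, sent_dim)
        scores = torch.tanh(self.proj(sents)) @ self.context # (batch, n_mentions)
        if mask is not None:                                  # mask: 1 = real mention, 0 = padding
            scores = scores.masked_fill(mask == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=-1)                 # mention weights
        r = (alpha.unsqueeze(-1) * sents).sum(dim=1)          # (batch, sent_dim) pair embedding
        return r, alpha
```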

Merging Layer and Training Strategy  There are several ways to merge the output of the two sub-networks in order to learn the analogical similarity between them. For instance, (Hadsell et al., 2006) propose a contrastive loss with the aim of decreasing the distance between two instance representations. However, we adopt the strategy proposed in (Koch et al., 2015), in which the metric distance is induced by a fully-connected layer with a sigmoidal output unit applied to the absolute difference between the representations output by the twin networks. Thus, given $r_{A,B}$ and $r_{C,D}$, the two relation embeddings which encode the whole mentions sets of the two entity pairs, we can compute the degree of analogy between them with $p = \sigma(W_r |r_{A,B} - r_{C,D}|)$, where the parameters $W_r$ measure the importance of each element of the difference vector and are learned in an end-to-end fashion, together with the relation representations. We build a training set by pairing the mentions sets of the entity-entity pairs from a KB, following the idea discussed in the next section. Thus, we can reduce the analogy task to a binary classification problem, so that $p = P((A, B), (C, D); \Theta)$ is equal to 1 if A : B = C : D, and 0 otherwise. We learn $\Theta$ (all the parameters of the HSN) using a gradient-based method which minimizes a cross-entropy loss function with L2 regularization.
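
Putting the two branch outputs together, the merging layer and training objective could be sketched as below. The module and variable names are hypothetical, L2 regularization is approximated here through the optimizer's weight decay, and the random tensors merely stand in for the branch outputs $r_{A,B}$ and $r_{C,D}$.

```python
import torch
import torch.nn as nn

class SiameseMerge(nn.Module):
    """Sketch of the merging layer: p = sigmoid(W_r |r_AB - r_CD|)."""

    def __init__(self, rel_dim=200):
        super().__init__()
        self.score = nn.Linear(rel_dim, 1)        # W_r (plus a bias term)

    def forward(self, r_ab, r_cd):                # both: (batch, rel_dim)
        diff = torch.abs(r_ab - r_cd)             # element-wise |r_AB - r_CD|
        return torch.sigmoid(self.score(diff)).squeeze(-1)   # (batch,) analogy score

# Hypothetical training step: labels are 1 for analogous pairs, 0 otherwise.
model = SiameseMerge()
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

r_ab, r_cd = torch.randn(8, 200), torch.randn(8, 200)   # stand-ins for branch outputs
labels = torch.randint(0, 2, (8,)).float()
loss = criterion(model(r_ab, r_cd), labels)
loss.backward()
optimizer.step()
```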

                     NYT-FB           CC-DBP          T-REX
Knowledge Base       Freebase         DBpedia         Wikidata
Corpus               New York Times   Common Crawl    Wikipedia
# words              239,877          8,445,417       4,062,498
# entity pairs       375,846          6,876,913       6,413,452
# relations          57               298             685
avg. mentions        1.9              3.8             3.2
avg. sent. length    41               37              25

Table 1: Statistics of the distantly supervised datasets.

5 Experiments

Once the analogy model is trained, it has two different capabilities. First, the whole HSN architecture can be used as a binary classifier in order to infer if two entity pairs, expressed by their mentions sets, are analogous. Second, we can use its sub-network before the merge layer as a feature extractor to generate entity-entity vectors, given sets of sentences as input, which can be used as pre-trained analogy embeddings in other tasks. We evaluate our model on the one-shot relational classification and distantly supervised relation extraction benchmarks.

5.1 Datasets

In the entire experimental protocol we exploit three different datasets (see the supplemental material for details). T-REX (ElSahar et al., 2018) is a large scale alignment dataset between Wikipedia abstracts and Wikidata triples, having 685 unique relations. NYT-FB (Riedel et al., 2010) is a standard benchmark for distantly supervised relation extraction. The text of the New York Times was processed with a named entity recognizer and the identified entities linked by name to Freebase. CC-DBP (Glass and Gliozzo, 2018a) is a web-scale KB population benchmark. It combines the text of Common Crawl with the entity-relation-entity triples from 298 frequent relations in DBpedia. Mentions of DBpedia entities are located in text by matching the preferred label. This task is similar to NYT-FB, but it has a much larger number of relations, triples and textual contexts. The statistics of the three datasets are summarized in Table 1. Aside from the difference in size and KB adopted, it is worth noting also the difference in corpus style among these datasets. For instance, T-REX has well-written textual mentions, because the sentences are extracted from Wikipedia. Conversely, CC-DBP and NYT-FB contain noisier sentences, which means a higher probability of incurring wrong labeling.


5.2 Training and Implementation Details

For both benchmarks, we use the same analogy model trained only once on a subset of the relations in T-REX. In detail, we discard all relations having fewer than 20 entity pairs, collecting 482 relations. We sort the relations by the number of instances, and we take the most frequent 60% of them to train the HSN. We use the next 20% of the relations for validation and the least frequent 20% as a corpus to implement one of the three one-shot classification tasks. For the validation and test partitions we randomly select only 20 entity pairs for each relation. This becomes a useful test set for the one-shot validation. To train the HSN, we select a balanced number of positive and negative examples out of the training split based on these rules: (1) for each relation, we randomly extract a set of 20 entity pairs; (2) out of this set, we generate all possible combinations, $\binom{20}{2} = 190$, as positive pairing examples; (3) for each combination, we create a negative example by randomly selecting an entity pair from another relation. After these steps, we collect a bucket of 109,820 proportional analogy training examples.
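
A toy version of this bucket construction is sketched below; the input structure `pairs_by_relation` (a mapping from each relation to its distantly supervised entity pairs) and the function name are assumptions made for illustration.

```python
import random
from itertools import combinations

def build_analogy_examples(pairs_by_relation, pairs_per_relation=20, seed=0):
    """Toy construction of one training bucket: for every relation take up to 20
    entity pairs, emit all C(20, 2) = 190 combinations as positive (analogous)
    examples, and for each positive sample one entity pair from a different
    relation as a negative example."""
    rng = random.Random(seed)
    examples = []
    relations = list(pairs_by_relation)
    for rel in relations:
        sample = rng.sample(pairs_by_relation[rel],
                            min(pairs_per_relation, len(pairs_by_relation[rel])))
        for left, right in combinations(sample, 2):
            examples.append((left, right, 1))                    # positive: same relation
            other_rel = rng.choice([r for r in relations if r != rel])
            negative = rng.choice(pairs_by_relation[other_rel])
            examples.append((left, negative, 0))                 # negative: different relation
    rng.shuffle(examples)
    return examples
```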

We iterate this process throughout the training phase by selecting a different bucket at each iteration to prevent overfitting. The training is monitored by computing the binary accuracy over a fixed validation set, consisting of 36,480 analogy examples, built by adopting the same criteria described above. We initialize our word embedding layer with the pre-trained GloVe vectors consisting of 6B tokens with 50 dimensions. The word embedding weights are not updated during training. The number of mentions for each entity pair is fixed to 3, based on their average on T-REX (see Table 1).

5.3 One-shot Relational Classification

Task  Given an unseen entity pair (At, Bt) and its mentions set 〈At, Bt〉, the one-shot relation classification task is to categorize this test pair (At, Bt) into one of N relation types, with the restriction that for each relation type $r_i$, $i \in \{1, \dots, N\}$, we are given only one entity pair (Ai, Bi) together with its mentions set 〈Ai, Bi〉 as training. We can cast the one-shot classification in terms of a relational similarity as follows:

$$r_i = \arg\max_i \; sim_M(\langle A_t, B_t \rangle, \langle A_i, B_i \rangle) \qquad (1)$$

where $sim_M$ is a similarity score, using the method M, which measures the analogy between the train and test entity pairs through their mentions sets. We implemented $sim_{HSN}$ using the HSN trained as described above. The two mentions sets are given as input to the network and their similarity is computed using the sigmoidal output of the last layer.
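
In code, Eq. (1) amounts to an argmax over the support examples. The sketch below assumes a generic `sim` callable (e.g. the trained HSN's sigmoidal output applied to two mention sets) and a hypothetical `support` dictionary mapping each candidate relation to the mention set of its single training pair.

```python
def one_shot_classify(test_mentions, support, sim):
    """Sketch of Eq. (1): return the relation whose single support example is
    most analogous to the test pair's mention set under the similarity `sim`."""
    return max(support, key=lambda rel: sim(test_mentions, support[rel]))
```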

Baselines  A method M should be robust when facing the unseen entity pairs used for testing. Training a standard RE model using just one example cannot provide a suitable baseline. Furthermore, since the two new entities that we want to classify might not be present in an existing knowledge graph, we could not apply relational embeddings (Bordes et al., 2013) either. Thus, using the contexts surrounding the two entities in the mentions sets to compute the relational similarity score is needed. In other words, we cast the task of one-shot relational classification as a problem of measuring textual (i.e. mention) similarity, with the aim of proving that our pre-trained siamese model is able to grasp the semantics of relations better than other pre-trained text representation models.

We implemented five baselines commonly used to encode textual representations. First, we use the pre-trained Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) embeddings. The score is given by the cosine similarity between the bag-of-means for the two entity pairs, averaging the word vectors in the mentions sets. We also adopt Doc2Vec (Le and Mikolov, 2014) to derive entity pair vectors, comparing them using cosine similarity. For each entity pair, a pseudo-document embedding is created by concatenating its mentions sets. Finally, we compare HSN with the pre-trained Skip-Thought (Kiros et al., 2015) and InferSent (Conneau et al., 2017) sentence encoders, which are the state-of-the-art in computing textual similarity. An entity pair vector is obtained by averaging the embeddings of each sentence in the mentions set.
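
For instance, the bag-of-means baselines reduce to averaging word vectors over the mention set and comparing the two averages with cosine similarity, roughly as in the following sketch; the helper names are hypothetical and `word_vectors` stands for a loaded Word2Vec or GloVe lookup table.

```python
import numpy as np

def bag_of_means(mention_set, word_vectors, dim=50):
    """Average the word vectors of all tokens in an entity pair's mention set.
    `word_vectors` maps token -> np.ndarray of size `dim`."""
    tokens = [t for sent in mention_set for t in sent.lower().split()]
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```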

Figure 3: One-shot relational classification results for N-way unseen relation types on (a) T-REX, (b) CC-DBP and (c) NYT-FB.

One-shot trials  We follow the experimental setup described in (Koch et al., 2015) to create our one-shot benchmark. For each dataset, we select the 20% least frequent relations, sorted by the number of entity pairs and having at least 20 instances. Therefore, we collect three different one-shot test sets of 92, 55 and 11 unseen relation types for T-REX, CC-DBP and NYT-FB, respectively. The reason behind this criterion is to prevent the overlap of the semantic relation types between the training data and the different one-shot test sets. In fact, frequent relations, such as location or birthPlace, are common in all three datasets, and they are used to train our HSN. Moreover, by using the long-tailed relations, such as portOfRegistry, we emulate a scenario where only a small set of relation examples is available, making the task more challenging. To evaluate the one-shot capabilities on N-way classes: (1) N different relation types are selected; (2) we sample one (shot) entity pair example for each of the selected N relation types; (3) we choose another entity pair from one of the N relation types to be used as the test example. All the selections in these three steps are random. If the relation type returned by Eq. 1 is equal to the relation type of the selected test example, then the one-shot trial is correct, otherwise it is incorrect. We repeat this operation k times for N from 2 to 10, for each of the three datasets. We choose k equal to 10,000, so that the random baseline converges to 100/N, in order to create an unbiased testbed.
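
The trial protocol can be summarized by the following sketch; the data layout (`test_pairs_by_relation`, mapping each unseen relation to a list of (entity_pair, mention_set) tuples) and the function name are illustrative assumptions, and `sim` is any mention-set similarity such as the HSN score or one of the baselines.

```python
import random

def run_one_shot_trials(test_pairs_by_relation, sim, n_way=5, k=10000, seed=0):
    """Sketch of the N-way one-shot protocol: sample N relation types, one support
    entity pair per relation, and one test pair from one of the N relations;
    a trial is correct if the argmax of Eq. (1) recovers the test pair's relation."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(k):
        rels = rng.sample(list(test_pairs_by_relation), n_way)
        support = {r: rng.choice(test_pairs_by_relation[r])[1] for r in rels}
        true_rel = rng.choice(rels)
        _, test_mentions = rng.choice(test_pairs_by_relation[true_rel])
        predicted = max(support, key=lambda r: sim(test_mentions, support[r]))
        correct += (predicted == true_rel)
    return correct / k   # accuracy; a random guesser converges to 1 / n_way
```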

Results and Discussion  The results are reported in Figure 3. Our model outperforms all the baselines on the test split of T-REX, reaching an accuracy ranging from 95.87% to 65.33% across the N-way one-shot trials. This behavior remains constant also for the other two datasets, showing the solidity of HSN even though it has been trained on a different corpus using relations from another ontology. This result confirms that our model is able to generalize over the linguistic contexts expressing relations, as well as its capability to learn how to transfer this information to other relations not observed before. The supplemental file reports some one-shot trial examples.

The lower accuracy on CC-DBP and NYT-FB might be caused by the different style of the corpora (Wikipedia vs. Web pages). Indeed, the test set of T-REX is built using the same corpus on which HSN is trained. The Wikipedia abstracts consist of well-written content, typically the definition of one of the two entities in the pair. Thus, T-REX can be considered an easier dataset compared with the other two.

Surprisingly, the average vectors using Word2Vec and GloVe obtain remarkable performance compared to the state-of-the-art sentence encoders. This might be due to the way these sentence models are trained. For instance, InferSent is trained using a natural language inference dataset, which might not be suitable for learning representations of relations in text. Instead, HSN is trained and optimized to learn and encode relational representations. However, this aspect deserves to be investigated more deeply, as does the comparison of HSN on the shared textual similarity benchmarks; we think this is a clear path for future research.

5.4 Transfer Learning in Relation Extraction

We also evaluate the ability of the analogy model to provide low-rank representations for entity pairs which are useful for more traditional relation extraction tasks, where a corpus of text has to be processed and relevant relations in a predefined schema have to be recognized.

Figure 4: Precision-Recall curves on NYT-FB.


doctoralStudent (score 0.95)
  VICTOR WEISSKOPF : MURRAY GELL-MANN
    “Murray Gell-Mann, one of the principal discoverers of the quarks, is one of the distinguished pupils of Victor Weisskopf.”
  JOHN BARDEEN : NICK HOLONYAK
    “Professor Nick Holonyak jr. was the first PhD student of Nobel Prize winner John Bardeen.”

approvedBy (score 0.83)
  HUNDRED HORSE CHESTNUT : GUINNESS WORLD RECORDS
    “Guinness World Records has listed Hundred Horse Chestnut for the record of ‘greatest tree girth ever’.”
  NCSA OPEN SOURCE LICENSE : OPEN SOURCE INITIATIVE
    “NCSA was formally certified as an open-source license during a March 28, 2002 board meeting of the Open Source Initiative.”

architecturalStyle (score 0.63)
  ROCKEFELLER CENTER : ART DECO
    “Art Deco mural ‘wisdom’ hangs over the entrance to the Rockefeller Center and was designed and sculpted by artist Lee Lawrie.”
  ST. MARK BASILICA : BYZANTINE ARCHITECTURE
    “St. Mark’s Basilica, the cathedral of Venice, is one of the best known examples of Byzantine architecture.”

Table 2: Three examples of relational similarity between two pairs of entities computed by our HSN. For each example, we report the unseen relation type, the mentions related to each pair, and the similarity score. We report only the mention having the highest attention weight. The examples show the ability of the analogy model in providing a high score to two mentions which represent the same relation, even if they are expressed using different textual contexts.

To this aim, we use the sub-network of our HSN before the merge layer, and we feed it the mentions set of each entity pair found in the corpus to generate an analogy embedding as a vector of features. In detail, given a set of mentions referring to an entity pair (A, B) as input, the pre-trained HSN generates an embedding $r_{A,B}$ (see Figure 2) which represents the relation between those two entities. Then, we concatenate these embeddings to the penultimate layer of a relation extraction model, PCNN-KI (Glass and Gliozzo, 2018b), based on a convolutional neural network, which is the state-of-the-art for this benchmark. The final fully-connected layer uses the representation from HSN in combination with its own learned multi-instance vector representation to predict a confidence score for each relation. During the training of this joint model, PCNN-KI+ANALOGY, we freeze our analogy embeddings in order to avoid losing the knowledge transfer capability.
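
A minimal sketch of this feature concatenation, assuming hypothetical dimensions and without reproducing PCNN-KI itself, is shown below; detaching the analogy embedding keeps it frozen while the downstream classifier is trained.

```python
import torch
import torch.nn as nn

class RelationClassifierWithAnalogy(nn.Module):
    """Sketch: concatenate a frozen analogy embedding to the penultimate
    representation of an existing relation extraction model before its
    final fully-connected layer. `base_dim`, `analogy_dim` and
    `n_relations` are illustrative, not the paper's values."""

    def __init__(self, base_dim=230, analogy_dim=200, n_relations=57):
        super().__init__()
        self.classifier = nn.Linear(base_dim + analogy_dim, n_relations)

    def forward(self, base_features, analogy_embedding):
        analogy_embedding = analogy_embedding.detach()   # keep the HSN features frozen
        joint = torch.cat([base_features, analogy_embedding], dim=-1)
        return self.classifier(joint)                    # one confidence score per relation
```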

As in the one-shot setting described before, we use the same HSN pre-trained on T-REX and use it as a feature extractor for entity pairs in both the standard train and test splits of NYT-FB, as used in (Zeng et al., 2015; Lin et al., 2016). Figure 4 reports the results of our evaluation. The model which uses the features generated by HSN largely improves the performance of PCNN-KI, despite the fact that HSN is trained on a different corpus and using a different KB. In the same chart, we also report a comparative evaluation of other approaches proposed in the literature for the NYT-FB benchmark: PCNN+ATT (Lin et al., 2016), CNN+ATT (Zeng et al., 2015), MIML-RE (Surdeanu et al., 2012), HOFFMANN (Hoffmann et al., 2011), MINTZ (Mintz et al., 2009).

We also run the evaluation on CC-DBP, a larger dataset for distantly supervised RE, using the same train-test setting adopted in (Glass and Gliozzo, 2018b). As done for the NYT-FB dataset, we incorporate the analogy embeddings generated by the same HSN trained on T-REX. The results confirm the improvements obtained by the PCNN-KI model when it integrates our pre-trained embeddings (Table 3).

                     AUC     F1
PCNN-KI              0.437   0.468
PCNN-KI+ANALOGY      0.500   0.504

Table 3: AUC and F1 results on CC-DBP.

6 Conclusion and Future Work

In this paper, we proposed a novel approach to learn representations of relations in text. Alignments between knowledge bases and textual corpora are used as ground truth in order to collect a set of analogies between entity pairs. We designed a hierarchical siamese network trained to recognize those analogies. The experiments showed the two main advantages of our approach. First, the model can generalize to new unseen relation types, obtaining promising results in one-shot learning compared with the state-of-the-art sentence encoders. Second, the model can generate low-rank representations that can help existing neural-based models designed for other tasks. As future work, we plan to continue our investigation by extending the method with other ideas. For instance, the use of positional embeddings, as well as the use of placeholders replacing the entities in the textual mentions, are promising future directions. Finally, we also plan to explore the use of analogy embeddings in other tasks, such as question answering and knowledge base population.


References

Eugene Agichtein and Luis Gravano. 2000. Snowball: extracting relations from large plain-text collections. In ACM DL.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In NIPS.

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1993. Signature verification using a "siamese" time delay neural network. In NIPS.

Razvan C. Bunescu, Ruifang Ge, Rohit J. Kate, Edward M. Marcotte, Raymond J. Mooney, Arun K. Ramani, and Yuk Wah Wong. 2005. Comparative experiments on learning information extractors for proteins and their interactions. Artificial Intelligence in Medicine, 33(2):139–155.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP.

Hady ElSahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon S. Hare, Frederique Laforest, and Elena Simperl. 2018. T-REx: A large scale alignment of natural language with knowledge base triples. In LREC.

Ji Feng and Zhi-Hua Zhou. 2017. Deep MIML network. In AAAI.

Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't. In SRW@HLT-NAACL.

Michael Glass and Alfio Gliozzo. 2018a. A dataset for web-scale knowledge base population. In ESWC.

Michael Glass and Alfio Gliozzo. 2018b. Discovering implicit knowledge with unary relations. In ACL.

Michael Glass, Alfio Gliozzo, Oktie Hassanzadeh, Nandana Mihindukulasooriya, and Gaetano Rossiello. 2018. Inducing implicit relations from text using distantly supervised deep nets. In ISWC.

Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In CVPR.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In HLT.

Maximilian Ilse, Jakub M. Tomczak, and Max Welling. 2018. Attention-based deep multiple instance learning. CoRR, abs/1802.04712.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In NIPS.

Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In ICML.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In ICML.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In CoNLL.

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In ACL.

Hanxiao Liu, Yuexin Wu, and Yiming Yang. 2017. Analogical inference for multi-relational embeddings. In ICML.

Mausam, Michael Schmitz, Stephen Soderland, Robert Bart, and Oren Etzioni. 2012. Open language learning for information extraction. In EMNLP-CoNLL.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL.

Jonas Mueller and Aditya Thyagarajan. 2016. Siamese recurrent architectures for learning sentence similarity. In AAAI.

Paul Neculoiu, Maarten Versteegh, and Mihai Rotaru. 2016. Learning text similarity with siamese recurrent networks. In Rep4NLP@ACL.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In ECML PKDD.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In NAACL-HLT.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In NIPS.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In EMNLP-CoNLL.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In EMNLP.

Peter D. Turney. 2006. Similarity of semantic relations. Computational Linguistics, 32(3).

Patrick Verga and Andrew McCallum. 2016. Row-less universal schema. In AKBC@NAACL-HLT.

Patrick Verga, Andrew McCallum, and Arvind Neelakantan. 2017. Generalizing to unseen entities and entity pairs with row-less universal schema. In EACL.

Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching networks for one shot learning. In NIPS.

Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Timothy Baldwin. 2016. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning. In ACL.

Xinggang Wang, Yongluan Yan, Peng Tang, Xiang Bai, and Wenyu Liu. 2018. Revisiting multiple instance neural networks. Pattern Recognition, 74.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In NAACL-HLT.

Jianbo Yuan, Han Guo, Zhiwei Jin, Hongxia Jin, Xianchao Zhang, and Jiebo Luo. 2017. One-shot learning for fine-grained relation extraction via convolutional siamese neural network. In BigData.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In EMNLP.