
Weakly-supervised Relation Extraction by Pattern-enhanced Embedding Learning

Meng Qu¹, Xiang Ren², Yu Zhang¹, Jiawei Han¹
¹University of Illinois at Urbana-Champaign, IL, USA
²University of Southern California, CA, USA
¹{mengqu2, yuz9, hanj}@illinois.edu, ²[email protected]

ABSTRACT

Extracting relations from text corpora is an important task in text mining. It becomes particularly challenging when focusing on weakly-supervised relation extraction, that is, utilizing a few relation instances (i.e., a pair of entities and their relation) as seeds to extract more instances from corpora. Existing distributional approaches leverage the corpus-level co-occurrence statistics of entities to predict their relations, and require a large number of labeled instances to learn effective relation classifiers. Alternatively, pattern-based approaches perform bootstrapping or apply neural networks to model the local contexts, but still rely on a large number of labeled instances to build reliable models. In this paper, we study integrating the distributional and pattern-based methods in a weakly-supervised setting, such that the two types of methods can provide complementary supervision for each other to build an effective, unified model. We propose a novel co-training framework with a distributional module and a pattern module. During training, the distributional module helps the pattern module discriminate between the informative patterns and other patterns, and the pattern module generates some highly-confident instances to improve the distributional module. The whole framework can be effectively optimized by iterating between improving the pattern module and updating the distributional module. We conduct experiments on two tasks: knowledge base completion with text corpora and corpus-level relation extraction. Experimental results prove the effectiveness of our framework in the weakly-supervised setting.

ACM Reference format:
Meng Qu, Xiang Ren, Yu Zhang, Jiawei Han. 2016. Weakly-supervised Relation Extraction by Pattern-enhanced Embedding Learning. In Proceedings of , , , 10 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Relation extraction is an important task in data mining and natural language processing. Given a text corpus, relation extraction aims at extracting a set of relation instances (i.e., a pair of entities and their relation) based on some given examples. Many efforts [7, 22, 26] have been devoted to sentence-level relation extraction, where the goal is to predict the relation for a pair of entities mentioned in a sentence (e.g., predicting the relation between "Beijing" and "China" in sentence 1 of Fig. 1).


[Figure 1: Illustration of weakly-supervised relation extraction. Given a text corpus and a few relation instances as seeds, the goal is to extract more instances from the corpus.]

Despite their wide applications, these studies usually require a large number of human-annotated sentences as training data, which are expensive to obtain. In many cases (e.g., knowledge base completion [39]), it is also desirable to extract a set of relation instances by consolidating evidence from multiple sentences in corpora, which cannot be directly achieved by these studies. Instead of looking at individual sentences, corpus-level relation extraction [2, 12, 21, 27, 43] identifies relation instances from text corpora using evidence from multiple sentences. This also makes it possible to apply weakly-supervised methods based on corpus-level statistics [1, 8]. Such weakly-supervised approaches usually take a few relation instances as seeds, and extract more instances by consolidating redundant information collected from large corpora. The extracted instances can serve as extra knowledge in various downstream applications, including knowledge base completion [27, 34], corpus-level relation extraction [16, 43], hypernym discovery [30, 31] and synonym discovery [25, 36].

In this paper, we focus on corpus-level relation extraction in the weakly-supervised setting. There are broadly two types of weakly-supervised approaches for corpus-level relation extraction. Among them, pattern-based approaches predict the relation of an entity pair from multiple sentences mentioning both entities. To do that, traditional approaches [23, 28, 41] extract textual patterns (e.g., tokens between a pair of entities) and new relation instances in a bootstrapping manner. However, many relations can be expressed in a variety of ways. Due to such diversity, these approaches often have difficulty matching the learned patterns to unseen contexts, leading to the problem of semantic drift [8] and inferior performance.


For example, with the given instance "(Beijing, Capital of, China)" in Fig. 1, "[Head] , the capital of [Tail]" will be extracted as a textual pattern from sentence 1, but we have difficulty matching the pattern to sentence 2 even though both sentences refer to the same relation "Capital of". Recent approaches [17, 40] try to overcome the sparsity issue of textual patterns by encoding textual patterns with neural networks, so that pattern matching can be replaced by similarity measurement between vector representations. However, these approaches typically rely on a large amount of labeled instances to train effective models [30], making it hard to deal with the weakly-supervised setting.

Alternatively, distributional approaches resort to the corpus-level co-occurrence statistics of entities. The basic idea is to learn low-dimensional representations of entities to preserve such statistics, so that entities with similar semantic meanings tend to have similar representations. With entity representations, a relation classifier can be learned using the labeled relation instances, which takes entity representations as features and predicts the relation of a pair of entities. To learn entity representations, some approaches [19, 24, 33] only consider the given text corpus. Despite their unsupervised property, their performance is usually limited due to the lack of supervision [39]. To learn more effective representations for relation extraction, some other approaches [37, 39] jointly learn entity representations and relation classifiers using the labeled instances. However, similar to pattern-based approaches, distributional approaches also require a considerable amount of relation instances to achieve good performance [39], which are usually hard to obtain in the weakly-supervised setting.

The pattern-based and the distributional approaches extract relations from different perspectives, and are naturally complementary to each other. Ideally, we would wish to integrate both approaches, so that they can mutually enhance each other and reduce the reliance on the given relation instances. Towards integrating both approaches, several existing studies [25, 30, 34] try to jointly train a distributional model and a pattern model using the labeled instances. However, the supervision of their frameworks still comes entirely from the given relation instances, which is insufficient in the weakly-supervised setting. Therefore, their performance is still far from satisfactory, and we seek an approach that is more robust to the scarcity of seed instances.

In this paper, we propose such an approach called REPEL (Relation Extraction with Pattern-enhanced Embedding Learning) for weakly-supervised relation extraction. Our approach consists of a pattern module and a distributional module (see Fig. 2). The pattern module aims at learning a set of reliable textual patterns for relation extraction, while the distributional module tries to learn a relation classifier on entity representations for prediction. Different from existing studies, we follow the co-training [3] strategy and encourage both modules to provide extra supervision for each other, which is expected to complement the limited supervision from the given seed instances (see Fig. 3). Specifically, the pattern module acts as a generator, as it can extract some candidate instances based on the discovered reliable patterns; whereas the distributional module is treated as a discriminator to evaluate the quality of each generated instance, that is, whether an instance is reasonable. To encourage the collaboration of both modules, we formulate a joint optimization process, in which we iterate between two sub-processes.

[Figure 2: Illustration of the modules. The pattern module aims to learn reliable textual patterns for each relation. The distributional module tries to learn entity representations and a score function to estimate the quality of each instance.]

[Figure 3: Comparison with existing integration frameworks. Existing frameworks rely entirely on the seed instances to provide supervision. Our framework encourages both modules to provide extra supervision for each other.]

In the first sub-process, the discriminator (distributional module) evaluates the instances generated by the generator (pattern module), and the results serve as extra signals to adjust the generator. In the second sub-process, the generator (pattern module) in turn generates a set of highly confident instances, which serve as extra training seeds to improve the discriminator (distributional module). During training, we keep iterating between the two sub-processes, so that both modules can be consistently improved. Once the training converges, both modules can be applied to relation extraction, extracting new relation instances from different perspectives.

In summary, in this paper we make the following contributions:

• We propose a principled framework to integrate the distributional and pattern-based methods for weakly-supervised relation extraction, which is effective in overcoming the scarcity of seeds.
• We develop a joint optimization algorithm for solving the unified objective, alternating between adjusting the pattern module and improving the distributional module.
• We conduct experiments on two downstream applications over two real-world datasets. Experimental results prove the effectiveness of our framework in the weakly-supervised setting.

2 PROBLEM DEFINITION

In this section, we formally define our problem.

Entity Name. An entity name is a string referring to a real-world entity, which usually appears in multiple sentences of a corpus. For example in Fig. 1, all strings in purple (e.g., Beijing, Bill Gates) are valid entity names.

To extract relations between different entities, a prerequisite is to detect those entity names in text corpora. In this paper, for simplicity, we do not focus on entity name detection; instead, we use existing tools. Specifically, we first apply named entity recognition tools [18] to the corpus, which are able to detect entity names in text. In practice, many detected entity names can refer to the same entity. For example in Fig. 1, "Microsoft" in sentence 3 and "MS" in sentence 4 both refer to Microsoft Corporation.


Since entity names representing the same entity have exactly the same meaning, we want to treat them equally instead of treating them independently. Therefore, we further leverage entity linking tools [9], which can link synonymous entity names to the same entity in an external knowledge base (e.g., Freebase). After entity linking, for each entity, we use a unified id to replace all entity names referring to that entity. For example, we can use the Freebase id of the entity Microsoft Corporation to replace "Microsoft" and "MS" in Fig. 1.

Relation Instance. A relation instance describes the relation between a pair of entities. Formally, a relation instance is composed of an entity pair $(e_h, e_t)$ and a relation $r$, meaning that entity $e_h$ and entity $e_t$ have the relation $r$.

Relation instances are ubiquitous. For example in Fig. 1, (Beijing, China) with capital of and (Bill Gates, Microsoft) with founder of are both valid relation instances. Extracting such instances from text corpora is an essential task with wide applications.

Problem Definition. In this paper, we study weakly-supervised relation extraction. Specifically, given a text corpus $D$ and some target relations $R$, with each target relation $r$ specified by a set of relation instances $\{(e_h^{r(k)}, e_t^{r(k)}, r)\}_{k=1}^{N_r}$, our goal is to leverage the given instances as seeds and extract more instances from the corpus (Fig. 1). Formally, we define our problem as follows:

Definition 2.1. (Problem Definition) Given a text corpus $D$ and some target relations $R$, where each target relation $r$ is characterized by a few seed instances $\{(e_h^{r(k)}, e_t^{r(k)}, r)\}_{k=1}^{N_r}$, or in other words a few seed entity pairs $\{(e_h^{r(k)}, e_t^{r(k)})\}_{k=1}^{N_r}$, the weakly-supervised relation extraction task aims to extract more instances $\{(e_h^{(i)}, e_t^{(i)}, r^{(i)})\}_{i=1}^{M}$ from the corpus. In other words, we aim at discovering more entity pairs $\{(e_h^{r(i)}, e_t^{r(i)})\}_{i=1}^{M_r}$ under each target relation $r \in R$.

3 THE REPEL FRAMEWORK

3.1 Framework Overview

In this section, we introduce our approach to weakly-supervised relation extraction. The major challenge comes from the deficiency of supervision, since we only have a few relation instances as seeds. Therefore, the performance of existing approaches, including the pattern-based [1, 16, 44] and the distributional approaches [4, 20, 39], is not satisfactory. Although some studies [25, 30, 34] try to reduce the reliance on seeds by integrating both approaches, they simply employ a joint training framework, which still requires considerable relation instances to train effective models.

To better overcome the challenge of seed scarcity, in this paper we propose a framework called REPEL based on the co-training strategy [3]. Our framework consists of two modules, a pattern module and a distributional module (see Fig. 2), which extract relations from different perspectives. The pattern module aims at finding a set of reliable textual patterns for relation extraction. Meanwhile, the distributional module tries to learn entity representations and train a score function, which measures the quality of a relation instance. Different from existing studies, both modules are encouraged to provide extra supervision to each other, which is expected to complement the limited supervision from seed instances (see Fig. 3).

[Figure 4: Illustration of the pattern module. We consider both the path-based pattern and the meta pattern. We infer pattern reliability using the seed entity pairs.]

Specifically, the pattern module is treated as a generator since it can extract some candidate relation instances, and meanwhile the distributional module acts as a discriminator to evaluate each instance. During training, the discriminator evaluates the instances generated by the generator, and the results serve as extra signals to adjust the generator. On the other hand, the generator in turn generates some highly confident instances, which act as extra seeds to improve the discriminator. We keep iterating between adjusting the pattern module and improving the distributional module. Once the training process converges, both modules can be utilized to discover more instances.

The overall objective is summarized below:

$$\max_{P, D} O = \max_{P, D} \{O_p + O_d + \lambda O_i\}. \qquad (1)$$

In the objective, $P$ represents the parameters of the pattern module, that is, a given number of reliable patterns for each target relation. $D$ denotes the parameters of the distributional module, that is, entity representations and a score function. The objective function consists of three terms. Among them, $O_p$ is the objective of the pattern module, in which we leverage the given seed instances for pattern selection. $O_d$ is the objective of the distributional module, which learns the relevant parameters under the guidance of seed instances. Finally, $O_i$ models the interactions of both modules.

Next, we introduce the model details. Note that for simplicity, we only consider one relation when introducing the model. To deal with multiple relations, we can simply combine their objectives.

3.2 Pattern Module

In the pattern module, our goal is to select a given number of the most reliable patterns $P$ for the target relation, and further leverage them to discover more relation instances from corpora.

Following previous studies on pattern-based approaches, we leverage both path-based patterns [5, 23, 40] and meta patterns [14]. For a pair of entities in a sentence, the path-based pattern is defined as the tokens along the shortest dependency path between the two entities, whereas the meta pattern is defined as a sequence of context words around the entities. Fig. 4 presents an example of both patterns. Given the definition of patterns, we can go back to the corpus and extract patterns for every pair of entities in a sentence, forming a set of candidate patterns and many entity pairs linked to each pattern.

Among all the candidate patterns, we hope to extract the most reliable ones for the target relation. Towards this goal, we leverage the seed relation instances as guidance, and estimate the reliability


of a pattern $\pi$ with the following measurement $R(\pi)$:

$$R(\pi) = \frac{|G(\pi) \cap S_{pair}|}{|G(\pi)|}, \qquad (2)$$

where $G(\pi)$ represents all the entity pairs extracted by the pattern $\pi$, and $S_{pair}$ is the set of seed entity pairs under the target relation. The numerator of $R(\pi)$ is the number of seed entity pairs which can be discovered by the pattern $\pi$, and the denominator counts all extracted entity pairs. For example in the right part of Fig. 4, we focus on the relation capital of, and the pattern [Head] city [Tail] extracts three entity pairs. Among them, one pair, (Beijing, China), is in the seed set, and therefore the reliability is 1/3. Such a definition of $R(\pi)$ is quite intuitive: if a pattern extracts many seed entity pairs under the target relation, it is considered reliable.
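To make Eqn. 2 concrete, here is a minimal sketch in Python; the function and variable names are illustrative rather than taken from the authors' implementation:

```python
def reliability(extracted_pairs, seed_pairs):
    """R(pi) from Eqn. 2: the fraction of entity pairs extracted by a
    pattern that are known seed pairs of the target relation."""
    g = set(extracted_pairs)  # G(pi)
    return len(g & set(seed_pairs)) / len(g) if g else 0.0

# The Fig. 4 example: [Head] city [Tail] extracts three pairs,
# one of which, (Beijing, China), is a seed pair, so R = 1/3.
extracted = [("Beijing", "China"), ("Shanghai", "China"), ("Chicago", "USA")]
seeds = [("Beijing", "China")]
print(reliability(extracted, seeds))  # 0.3333...
```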

Based on this measurement, we select the top $K$ reliable patterns, where $K$ is a given number. This can be achieved by optimizing the following objective function with respect to $P$:

$$O_p = \sum_{\pi \in P} R(\pi), \qquad (3)$$

where $P$ is the pattern set with size $K$.

Once the most reliable patterns $P$ are learned for the target relation, we can leverage them to extract new entity pairs under the target relation. Formally, we denote the set of entity pairs extracted by the pattern set $P$ as $G(P)$, which is calculated as follows:

$$G(P) = \bigcup_{\pi \in P} G(\pi), \qquad (4)$$

where $G(\pi)$ is the set of entity pairs extracted by pattern $\pi$.

3.3 Distributional Module

The distributional module of our approach focuses on the global distributional information of entities. Specifically, it aims at learning distributed entity representations from corpora, so that similar entities are likely to have similar representations. Meanwhile, we utilize the given relation instances as seeds to train a score function, which takes entity representations as features to estimate whether a relation instance is reasonable.

To learn entity representations from text corpora, we follow [32] and build a bipartite network between all the entities and words. The weight between an entity and a word is defined as the number of sentences in which they co-occur. Then for an entity $e$ and a word $w$, we infer the conditional probability $P(w|e)$ as follows:

$$P(w|e) = \frac{\exp(x_e \cdot c_w)}{Z}, \qquad (5)$$

where $x_e$ is the vector representation of entity $e$, $c_w$ is the embedding vector of word $w$, and $Z$ is a normalization term.

Given the estimated conditional probability $P(\cdot|e)$, we try to minimize its KL divergence from the empirical distribution $P'(\cdot|e)$ for every entity $e$, so that the distributional information can be preserved in the learned entity representations. Specifically, the empirical distribution is defined as $P'(w|e) \propto n_{w,e}$, where $n_{w,e}$ is the weight of the edge between word $w$ and entity $e$. After some simplification, we obtain the following objective function:

$$O_{text} = \sum_{w,e} n_{w,e} \log P(w|e). \qquad (6)$$

The above objective function can be efficiently optimized with the negative sampling [20] and edge sampling [33] techniques. In each epoch, a positive edge and several negative edges are sampled for optimization. For details, readers may refer to [32, 33].
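As a rough illustration of one such update, the sketch below performs a single negative-sampling step on a sampled (entity, word) edge. The learning rate, the number of negatives, and the noise distribution word_noise are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_edge(x_e, c_w, word_vecs, word_noise, lr=0.01, k=5):
    """One stochastic update on a sampled (entity, word) edge, raising
    P(w|e) for the observed word and lowering it for k negative words
    drawn from a noise distribution over the vocabulary."""
    g = 1.0 - sigmoid(x_e @ c_w)          # gradient weight, positive word
    grad_e = g * c_w
    c_w += lr * g * x_e
    for j in rng.choice(len(word_vecs), size=k, p=word_noise):
        g = -sigmoid(x_e @ word_vecs[j])  # gradient weight, negative word
        grad_e += g * word_vecs[j]
        word_vecs[j] += lr * g * x_e
    x_e += lr * grad_e                    # update the entity vector in place
```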

Meanwhile, we also leverage the given seed instances to learn a score function, which estimates the quality of an instance, that is, how likely an entity pair has the target relation. Following the previous work [4], for an entity pair $f = (e_h, e_t)$, its score under the target relation is defined as follows:

$$L_D(f|r) = 1 - \|x_{e_h} + y_r - x_{e_t}\|_2^2, \qquad (7)$$

where $\|\cdot\|_2$ is the Euclidean norm of a vector, $x_e$ is the representation of entity $e$, $r$ is the target relation and $y_r$ is a parameter vector for the target relation.

Intuitively, we expect a seed entity pair to have larger scores than some randomly sampled pairs under the target relation. Therefore, we adopt the following ranking-based objective for training:

$$O_{seed} = \sum_{f \in S_{pair}} \sum_{f' = (e'_h, e'_t)} \min\{1, L_D(f|r) - L_D(f'|r)\}, \qquad (8)$$

where $S_{pair}$ is the set of all seed pairs, and $e'_h$ and $e'_t$ are randomly sampled entities.

Finally, we integrate Eqn. 6 and Eqn. 8 as the objective of the distributional module, and we optimize it with respect to $D$:

$$O_d = O_{text} + \eta O_{seed}, \qquad (9)$$

where $\eta$ is used to control the weights of the two parts, and $D$ represents all parameters of the distributional module, including the entity representations $x_e$ and the parameter vector $y_r$ of the relation.

Once the representations are learned, we can use the score function $L_D$ to measure the score of each entity pair under the target relation, and thus discover some highly confident relation instances.
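A minimal sketch of the score of Eqn. 7 and one capped ranking term of Eqn. 8, assuming entity vectors are stored in a dict keyed by entity id:

```python
import numpy as np

def score(x_h, x_t, y_r):
    """L_D(f|r) = 1 - ||x_h + y_r - x_t||_2^2 (Eqn. 7)."""
    diff = x_h + y_r - x_t
    return 1.0 - float(diff @ diff)

def seed_ranking_term(ent_vecs, y_r, pos, neg):
    """One summand of O_seed (Eqn. 8): the margin between a seed pair
    and a corrupted pair, capped at 1 so easy pairs stop contributing."""
    (h, t), (h2, t2) = pos, neg
    s_pos = score(ent_vecs[h], ent_vecs[t], y_r)
    s_neg = score(ent_vecs[h2], ent_vecs[t2], y_r)
    return min(1.0, s_pos - s_neg)
```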

3.4 Modeling the Module Interaction

So far, the supervision of both modules comes entirely from the given relation instances, which is insufficient in the weakly-supervised setting. To solve this problem, we follow the co-training strategy [3], and encourage both modules to provide extra supervision for each other.

Specifically, we introduce the following objective function, and try to maximize it with respect to both $P$ and $D$:

$$O_i = E_{f \in G(P)}[L_D(f|r)], \qquad (10)$$

where $f \in G(P)$ is an entity pair extracted by the reliable pattern set $P$, with $G(P)$ defined in Eqn. 4, and $L_D(f|r)$ is the score of pair $f$ under the target relation. From the objective function, we see that the selected pattern set $P$ acts as a generator, since it generates some candidate entity pairs under the target relation; whereas the distributional module serves as a discriminator, trying to score the generated entity pairs under the target relation. The goal of the objective function is to encourage the agreement of the pattern module and the distributional module. More specifically, we hope that the entity pairs generated by the pattern module are considered reasonable by the distributional module. The intuition behind the objective comes from the co-training algorithms [3], where it has been proved that the error rate of two predictive models can be decreased by minimizing their disagreement [6].
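With the helpers above, the interaction term of Eqn. 10 is just the mean distributional score over the pairs generated by the pattern set; the extracted_pairs attribute is an assumed representation of $G(\pi)$, not the authors' data structure:

```python
import numpy as np

def interaction_term(P, ent_vecs, y_r):
    """O_i (Eqn. 10): expected score L_D(f|r) over the pairs f in G(P)."""
    g_of_p = {pair for pi in P for pair in pi.extracted_pairs}  # Eqn. 4
    if not g_of_p:
        return 0.0
    return float(np.mean([score(ent_vecs[h], ent_vecs[t], y_r)
                          for h, t in g_of_p]))
```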

To intuitively understand how this objective function improves both modules, let us consider how to optimize it with respect to each module. For the pattern module, to maximize the above


objective, the pattern set $P$ should include patterns which are considered reliable by the distributional module; that is, the entity pairs generated by those patterns should obtain large scores from the distributional score function $L_D$. In this way, the distributional module provides extra supervision to estimate the pattern reliability. Meanwhile, for the distributional module, to maximize the objective function, it should assign larger scores to the entity pairs generated by the pattern module. Therefore, the highly confident entity pairs generated by the pattern module serve as extra seeds to help improve the distributional module.

With the above objective function, both modules can tightly interact with each other, and provide extra supervision to overcome the challenge of seed scarcity.

4 THE JOINT OPTIMIZATION PROBLEM

To optimize the overall objective function (Eqn. 1), we leverage the coordinate gradient descent algorithm [38], iterating between two sub-processes. In the first sub-process, we fix the pattern module, and update the distributional module under the guidance of the given seeds and the highly confident instances generated by the pattern module. In the second sub-process, the distributional module is fixed, and we update the selected patterns with the given seed instances and the supervision provided by the distributional module. During training, we keep iterating between the two sub-processes, so that both modules can be consistently improved.

1. Optimizing the Distributional Module. In this step, we fix the selected pattern set $P$ to update the parameters $D$ of the distributional module. Formally, maximizing the objective function with respect to $D$ can be transformed into the following problem:

$$\max_D \{O_d + \lambda O_i\} = \max_D \{O_d + \lambda E_{f \in G(P)}[L_D(f|r)]\}, \qquad (11)$$

which is a continuous optimization problem. We use the stochastic gradient descent algorithm for optimization. On the one hand, we adjust all parameters $D$ to maximize the $O_d$ part. On the other hand, some entity pairs $f$ are sampled based on the selected patterns $P$ and treated as extra instances to update $D$.

2. Optimizing the Pattern Module. In this step, we fix the parameters $D$ of the distributional module and adjust the reliable pattern set $P$. Formally, maximizing the objective function with respect to $P$ is equivalent to the following optimization problem:

$$\max_P \{O_p + \lambda O_i\} = \max_P \left\{\sum_{\pi \in P} \left(R(\pi) + \lambda E_{f \in G(\pi)}[L_D(f|r)]\right)\right\}, \qquad (12)$$

which is a discrete optimization problem, with the goal of selecting a given number of patterns $P$ with the largest reliability. The reliability of a pattern $\pi$ is calculated from two sources: $O_p$ and $O_i$. In the $O_p$ part, the reliability is measured with $R(\pi)$ defined in Eqn. 2, which leverages the given seeds for reliability estimation. In the $O_i$ part, we utilize the score function $L_D$ to score each entity pair $f$ extracted by pattern $\pi$, and further average the scores to obtain another reliability estimation $E_{f \in G(\pi)}[L_D(f|r)]$. Finally, the two estimations are weighted to form the overall reliability. In practice, we can first calculate the overall reliability of each pattern, and then select the top $K$ patterns to form the reliable pattern set $P$.
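A minimal sketch of this selection step, reusing the reliability and score helpers from the earlier snippets (the pattern objects with an extracted_pairs field are an assumption for illustration):

```python
import heapq
import numpy as np

def select_patterns(candidates, seed_pairs, ent_vecs, y_r, lam=1.0, K=100):
    """Rank candidate patterns by R(pi) + lambda * E_{f in G(pi)}[L_D(f|r)]
    (Eqn. 12) and keep the top K as the reliable pattern set P."""
    def overall_reliability(pi):
        pairs = pi.extracted_pairs  # G(pi): the pairs this pattern extracts
        mean_score = np.mean([score(ent_vecs[h], ent_vecs[t], y_r)
                              for h, t in pairs])
        return reliability(pairs, seed_pairs) + lam * mean_score
    return heapq.nlargest(K, candidates, key=overall_reliability)
```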

Finally, we summarize the optimization algorithm in Alg. 1. Once the training converges, our approach returns a set of discovered reliable patterns from the pattern module and a distributional score function from the distributional module. Both the learned patterns and the score function can be leveraged for relation extraction, extracting new instances from different perspectives. Specifically, the learned reliable patterns extract relations from local contexts by matching the contexts with the patterns, which usually yields high precision but low recall. This is because for a pair of entities, the local contexts mentioning both entities are usually reliable for predicting their relations, leading to high precision; however, many pairs of entities never co-occur in any local context, so relying on local contexts results in low recall. In practice, the learned reliable patterns can be applied to applications such as corpus-level relation extraction (see the details in Sec. 5.1.3 (2)). On the other hand, the learned distributional score function predicts entity relations from corpus-level statistics, leading to relatively low precision but high recall, and is more suitable for tasks like knowledge base completion with text corpora (see the details in Sec. 5.1.3 (1)).

Algorithm 1 Optimization algorithm of REPEL.

Input: A text corpus, a few seed relation instances, the number of reliable patterns K, the parameter λ, the parameter η.
Output: A set of reliable patterns P from the pattern module, a score function L_D from the distributional module, extracted relation instances.
1: Generate patterns and the entity pairs extracted by each pattern.
2: Build the bipartite network between entities and words.
3: while not converged do
4:   ▷ Update the distributional module:
5:   Extract some instances using the set of reliable patterns P.
6:   Optimize D with both the seeds and the extracted instances (Eqn. 11).
7:   ▷ Update the pattern module:
8:   Calculate pattern reliability with the seeds and L_D (Eqn. 12).
9:   Select the top K most reliable patterns to form the pattern set P.
10: end while
11: ▷ Extract relation instances:
12: Utilize the reliable patterns P to extract instances from local contexts.
13: Utilize the distributional score function L_D to extract instances.
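In code form, the outer loop of Alg. 1 might look like the sketch below; optimize_distributional is a hypothetical stand-in for the SGD step of Eqn. 11, and the data layout follows the earlier snippets rather than the authors' implementation:

```python
def train_repel(candidates, seed_pairs, ent_vecs, y_r, n_iters=10,
                lam=1.0, K=100):
    """Co-training loop of Alg. 1: alternate between updating the
    distributional parameters D and re-selecting the pattern set P."""
    P = []
    for _ in range(n_iters):
        # Distributional step: train on seeds plus pairs generated by P.
        extra = {pair for pi in P for pair in pi.extracted_pairs}
        optimize_distributional(set(seed_pairs) | extra, ent_vecs, y_r)  # hypothetical helper
        # Pattern step: re-score all candidates and keep the top K.
        P = select_patterns(candidates, seed_pairs, ent_vecs, y_r, lam, K)
    return P, ent_vecs, y_r
```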

5 EXPERIMENT

In this section, we evaluate our approach on two downstream applications: knowledge base completion with text corpora (KBC) and corpus-level relation extraction (RE).

In knowledge base completion with text corpora, the key task is to predict the missing relationships between pairs of entities in knowledge bases. Since some pairs of entities may not co-occur in any sentence in the given corpus, the learned pattern module cannot provide information for predicting their relations. Therefore, for KBC we only use the entity representations and score function learned by the distributional module for extraction, and we expect to show that the pattern module can provide extra seeds during training, yielding a more effective distributional module. Corpus-level RE aims at predicting the relation of a pair of entities from several sentences mentioning both of them. In this case, the reliable patterns learned by the pattern module can capture the local context information from the sentences. Therefore, we focus on utilizing the learned pattern module for prediction in RE, and we expect to show that the distributional module can enhance the pattern module by providing extra supervision to select reliable patterns.


5.1 Experiment Setup

1. Datasets. In the experiments, we leverage an existing NER tool [18] for entity detection. Since the NER tool can only detect entities of several major types such as location, person and organization, we sample 10 common relations¹ related to persons, locations and organizations from Freebase² as our target relations. Two datasets are then constructed based on the selected relations.

Table 1: Statistics of the Datasets.

                                 Wiki + Freebase   NYT + Freebase
# Documents                      150,000           118,664
# Entities                       92,443            23,120
# Candidate Patterns             621,782           232,892
# Seed Instances per Relation    50                50
# Relations in KBC               10                10
# Test Instances in KBC          10,734            6,094
# Relations in RE                5                 6
# Test Entity Pairs in RE        131               222

(1) Wiki: The first 150K articles in Wikipedia³ are used as the corpus. For each target relation, we randomly sample 50 relation instances from Freebase as seeds. In the knowledge base completion task, we select all of the above 10 relations as the target relations, and we sample 10,734 extra instances from Freebase for prediction. In the corpus-level relation extraction task, the manually annotated sentences from [10] are used for evaluation. Among all relations in the annotated sentences, 5 relations⁴ can be mapped to our selected 10 Freebase relations, and thus we only focus on these 5 relations. There are in total 194 manually annotated sentences and 131 entity pairs related to these relations.

(2) NYT: The 118,664 documents from 2013 New York Times news articles are used as the corpus. Similar to the Wiki dataset, for each target relation we randomly sample 50 relation instances from Freebase as seeds. In the knowledge base completion task, we select all of the above 10 relations as the target relations, and in total 6,094 extra instances are sampled for evaluation. In the corpus-level relation extraction task, we leverage the manually annotated sentences from [11] for evaluation. Among all relations in the sentences, 6 relations⁵ can be mapped to the selected 10 Freebase relations, so we focus on these 6 relations. There are in total 322 manually annotated sentences and 222 entity pairs related to these relations.

For each text corpus, we adopt the Stanford CoreNLP package [18]⁶ for preprocessing. Then we leverage DBpedia Spotlight [9]⁷ to link the detected entity names to Freebase.

¹ location.country.capital, people.person.parents, people.person.children, location.administrative_division.country, people.person.place_of_birth, location.neighborhood.neighborhood_of, people.person.nationality, people.deceased_person.place_of_death, location.location.contains, organization.organization.founders.
² https://developers.google.com/freebase/
³ https://www.wikipedia.org/
⁴ people.person.children, people.person.place_of_birth, people.person.nationality, people.deceased_person.place_of_death, organization.organization.founders.
⁵ people.person.children, people.person.nationality, location.location.contains, people.deceased_person.place_of_death, organization.organization.founders, location.administrative_division.country.
⁶ http://stanfordnlp.github.io/CoreNLP/
⁷ https://github.com/dbpedia-spotlight/dbpedia-spotlight

2. Compared Algorithms. In the knowledge base completion task, we compare against the following baselines:

(1) word2vec [20]: A distributional approach for word embedding learning, which can learn entity representations from text corpora. Once the representations are learned, we utilize the seed instances to train a relation classifier (Eqn. 7) for extraction.
(2) TransE [4]: A distributional approach for knowledge base completion, which only uses the given seed instances for training.
(3) RK [36]: A distributional approach for knowledge base completion, which leverages both the text corpus and the given relation instances to learn entity representations.
(4) DPE [25]: An approach that integrates the distributional and pattern-based methods. It jointly models the distributional information in text corpora, the given relation instances and the textual patterns.
(5) CONV [34]: A knowledge base completion approach, which integrates the distributional and pattern-based methods by jointly optimizing over the given seed instances and the instances extracted by textual patterns.

In the corpus-level relation extraction task, the following approaches are compared:
(1) Snowball [1]: A pattern-based approach for relation extraction, which discovers reliable patterns from the seed instances in a bootstrapping way.
(2) PATTY [23]: A pattern-based approach that can be applied to relation extraction. We leverage the seed instances to select relevant patterns in a bootstrapping way [1].
(3) CNN-ATT [16]: A pattern-based approach for corpus-level relation extraction. It leverages convolutional neural networks to encode and classify each sentence, and then consolidates the results of different sentences using an attention mechanism.
(4) PCNN-ATT [16]: A pattern-based approach for corpus-level relation extraction. Compared with CNN-ATT, it also introduces a position embedding for each word and entity.
(5) PathCNN [45]: A pattern-based approach for corpus-level relation extraction. For each entity pair, besides sentences mentioning both entities, it also considers some other sentences mentioning only one of them.
(6) LexNET [29, 30]: An approach combining the distributional and pattern-based methods for relation extraction. It uses a recurrent layer to encode local textual patterns, and then uses the encoding vector together with entity representations for prediction.

For our proposed approach, we consider the following variants:
(1) REPEL-P: A variant of our approach with only the pattern module ($O_p$).
(2) REPEL-D: A variant of our approach with only the distributional module ($O_d$).
(3) REPEL: Our proposed approach, which encourages the collaboration of both modules during training. Once the training converges, we leverage the entity representations and score function learned by the distributional module for the KBC task, whereas the reliable patterns discovered by the pattern module are used for the RE task.

3. Evaluation Setup. (1) Knowledge Base Completion: For each compared algorithm, we first learn entity representations and relation classifiers (or the score function for our approach) using the given text corpus and relation instances. Then the learned representations and classifiers are leveraged for evaluation. Specifically, for each test instance $(e_h, e_t, r)$, we remove its head entity or tail entity, obtaining two incomplete instances, $(e_h, ?, r)$ and $(?, e_t, r)$, and our goal is to select the correct entity from the entity set to fill each incomplete instance. To do that, for each candidate entity in the entity set, we calculate its score by measuring the quality of the formed instance with the relation classifier. Then we


sort the candidate entities in descending order of their scores, and calculate the rank of the correct entity. Finally, we report the mean of those ranks (MR) and the proportion of correct entities ranked within the top 10 (Hits@10).

(2) Corpus-level Relation Extraction: For each compared algorithm, we first use it to predict the relation expressed in each test sentence. Specifically, for the neural network based approaches (PathCNN, CNN-ATT, PCNN-ATT, LexNET), the test sentences can be directly classified with the learned neural classifiers. For approaches based on textual patterns (PATTY, Snowball, REPEL, REPEL-P), we first match the local context of the test sentence to a discovered reliable pattern $\pi^*$, then we classify the sentence based on the relation expressed by pattern $\pi^*$. To do such matching, we represent each learned reliable pattern and the local patterns of the test sentences with a low-dimensional vector. The pattern vector is calculated by averaging the embeddings of tokens in each pattern, with the token embeddings learned by our approach in Eqn. 5. Once the pattern vectors are obtained, each local pattern in the test sentences is matched to its most similar reliable pattern, where similarity is measured as the cosine similarity between the pattern vectors. After all test sentences are classified, for each test entity pair, we consolidate the prediction results from the test sentences mentioning both entities, and return the predicted relation together with a confidence score. During consolidation, we either average the prediction results of all test sentences (LexNET, PathCNN, PATTY, Snowball, REPEL, REPEL-P), or leverage the learned attention mechanism (CNN-ATT, PCNN-ATT). Finally, we sort all test entity pairs in descending order of their confidence scores, and compare the ranked list with the ground truth. Based on the results, we report the precision at position K (P@K), recall at position K (R@K), F1 score at position K (F1@K), and the precision-recall curve.

4. Parameter Settings. For all knowledge base completion methods and the distributional module of our approach, we set the dimension of all representations to 100. The numbers of iterations for TransE, word2vec, RK and DPE are set to 1000, 20, 20 and 3B respectively to ensure convergence. Other parameters are set to the default values suggested in the original papers. For the neural approaches to corpus-level relation extraction, the dimensions of the embedding layer and the hidden layer are set to 100. Other parameters are set to the default values suggested in the original papers. For our proposed approach, the parameter λ controlling the weight of the interaction term is set to 1 by default. For the distributional module, the learning rate is set to 0.01, the parameter η is set to 0.005, and the number of training edges in each iteration is set to 3B. For the pattern module, we set the number of reliable patterns K for each relation to 100.
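The KBC ranking protocol can be summarized with a short sketch, reusing the score function from an earlier snippet; the list-based entity set and dict-based vectors are assumptions made for illustration:

```python
import numpy as np

def kbc_metrics(test_instances, entity_ids, ent_vecs, rel_vecs):
    """Mean rank (MR) and Hits@10 for tail prediction (e_h, ?, r);
    head prediction is handled symmetrically."""
    ranks = []
    for h, t, r in test_instances:
        scores = np.array([score(ent_vecs[h], ent_vecs[c], rel_vecs[r])
                           for c in entity_ids])
        target = scores[entity_ids.index(t)]
        ranks.append(1 + int((scores > target).sum()))  # 1-based rank
    ranks = np.array(ranks)
    return float(ranks.mean()), float((ranks <= 10).mean())
```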

5.2 Performance Comparison

1. Knowledge Base Completion with Text Corpora (KBC). We present the quantitative results in Table 2 and the hits curves in Fig. 5. For the approach that only considers the given seed instances (TransE), the performance is very limited due to the scarcity of seeds. Along the other line, the approach that considers text corpora (word2vec) achieves relatively better results, but is still far from satisfactory, since it ignores the supervision from the seed instances. If we consider both the text corpus and seed instances for entity representation learning (RK), we obtain much better results. Moreover, by further jointly training a pattern model (DPE, CONV), the hits ratio can be significantly improved.

Table 2: Quantitative results on the KBC task.

Algorithm        Wiki + Freebase       NYT + Freebase
                 Hits@10    MR         Hits@10    MR
TransE [4]       7.13       4328.40    15.94      3833.47
word2vec [20]    32.12      203.53     15.56      913.04
RK [36]          41.49      72.87      29.01      307.89
DPE [25]         45.45      78.87      32.47      279.99
CONV [34]        46.84      139.81     31.51      903.38
REPEL-D          47.49      67.28      35.79      234.23
REPEL            51.18      62.18      38.98      199.44

[Figure 5: Hits curves on the KBC task. (a) Wiki; (b) NYT. Curves are shown for TransE, word2vec, RK, DPE, CONV, REPEL-D and REPEL.]

For our proposed approach, the variant with only the distributional module (REPEL-D) already outperforms all the baseline approaches. Compared with DPE, the performance gain of REPEL-D mainly comes from the use of the score function in Eqn. 7, which can better model different relations. Compared with CONV, REPEL-D achieves better results, as the distributional information in text corpora can be better captured with Eqn. 6. Moreover, by encouraging the collaboration of both modules (REPEL), the results are further significantly improved. This observation demonstrates that the pattern module can indeed help improve the distributional module by providing some highly confident instances.

Overall, our approach achieves quite impressive results on the knowledge base completion task compared with several strong baseline approaches, and the pattern module can indeed enhance the distributional module within our co-training framework.

2. Corpus-level Relation Extraction (RE). Next, we show the results on the corpus-level relation extraction task. We present the quantitative results in Table 3 and the precision-recall curves in Fig. 6. For the approaches using textual patterns (PATTY, Snowball), the results are quite limited, especially on the NYT dataset. This is because they discover informative patterns in a bootstrapping way, which can lead to the semantic drift problem [8] and thus harm the performance. For the other neural network based pattern approaches (PathCNN, CNN-ATT, PCNN-ATT), although they are proved to be very effective when the given instances are abundant, their performance in the weakly-supervised setting is not satisfactory.


Table 3: Quantitative results on the RE task.

                   Wiki + Freebase                              NYT + Freebase
Algorithm          P@50   R@50   F1@50   P@100  R@100  F1@100   P@50   R@50   F1@50   P@100  R@100  F1@100
Snowball [1]       58.00  22.14  32.05   65.00  49.62  56.28    20.00  4.50   7.35    21.00  9.46   13.04
PATTY [23]         60.00  22.90  33.15   61.00  46.56  52.81    28.00  6.31   10.30   20.00  9.01   12.42
CNN-ATT [16]       26.00  9.92   14.36   22.00  16.79  19.05    24.00  5.41   8.83    29.00  13.06  18.01
PCNN-ATT [16]      58.00  22.14  32.05   36.00  27.48  31.17    46.00  10.36  16.91   26.00  11.71  16.15
PathCNN [45]       36.00  13.74  19.89   38.00  29.01  32.90    42.00  9.46   15.44   26.00  11.71  16.15
LexNET [29, 30]    74.00  28.24  40.88   61.00  46.56  52.81    32.00  7.21   11.77   26.00  11.71  16.15
REPEL-D            14.00  5.34   7.73    17.00  12.98  14.72    6.00   1.35   2.20    7.00   3.15   4.34
REPEL-P            64.00  24.43  35.36   70.00  53.44  60.61    32.00  7.21   11.77   33.00  14.86  20.49
REPEL              78.00  29.77  43.09   76.00  58.02  65.80    48.00  10.81  17.65   43.00  19.37  26.71

The reason is that they typically deploy complicated convolutional layers or recurrent layers, which rely on massive numbers of relation instances to tune. However, in our setting, the instances are very limited, leading to their poor performance. For the integration approach (LexNET), although it incorporates the distributional information, the performance is still quite limited, especially on the NYT dataset. This is because the joint training framework of LexNET also requires considerable training instances.

[Figure 6: Precision-Recall curves on the RE task. (a) Wiki; (b) NYT. Curves are shown for Snowball, PATTY, CNN-ATT, PCNN-ATT, PathCNN, LexNET, REPEL-P and REPEL.]

For our proposed approach, the performance of the distributional module alone (REPEL-D) is very poor. This is because each test entity is mentioned in only a few test sentences, so the learned entity representations are not effective due to the sparsity of the distributional information. On the other hand, the pattern module (REPEL-P) achieves surprisingly good results, comparable to the neural models. This is because we represent each pattern by the average embedding of its tokens for pattern matching, where the token embeddings are learned from the given text corpus. Although this strategy is naive compared with neural encoders, it introduces no extra parameters to learn. In the weakly-supervised setting, the neural methods are hard to train due to their large number of parameters, leading to inferior results, whereas our approach performs well precisely because of its simplicity. Furthermore, comparing the pattern module (REPEL-P) with the complete framework (REPEL), we see that the complete framework further outperforms the pattern module, which demonstrates that the distributional module can in turn enhance the pattern module by helping estimate pattern reliability.
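As a concrete illustration of this parameter-free matching strategy, the sketch below (our own; `word_vec` is a hypothetical token-embedding lookup trained on the corpus) averages token embeddings and scores candidate contexts by cosine similarity.

import numpy as np

def embed_pattern(tokens, word_vec, dim=100):
    # A pattern (or context) is represented as the mean of its token embeddings.
    vecs = [word_vec[t] for t in tokens if t in word_vec]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def match_score(pattern_tokens, context_tokens, word_vec):
    # Cosine similarity between the averaged pattern and context embeddings.
    p = embed_pattern(pattern_tokens, word_vec)
    c = embed_pattern(context_tokens, word_vec)
    denom = np.linalg.norm(p) * np.linalg.norm(c)
    return float(p @ c / denom) if denom > 0 else 0.0

# e.g. match_score(["[Head]", "capital", "of", "[Tail]"],
#                  ["[Head]", "is", "the", "capital", "city", "of", "[Tail]"],
#                  word_vec)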

Overall, in the weakly-supervised setting, our approach achieves results comparable to the neural methods, and the distributional module can indeed improve the pattern module under our co-training framework.

5.3 Performance Analysis
1. Performance w.r.t. the Number of Seed Instances. To overcome the challenge of seed scarcity, our approach encourages both modules to provide extra supervision for each other. In this section, we study whether our framework is indeed robust to the scarcity of seed instances. We take the Wiki dataset as an example, and report the performance of different methods under different numbers of seed instances.

Figure 7: Performance w.r.t. the number of seed instances: (a) KBC, Hits@10 for CONV, REPEL-D, and REPEL; (b) RE, P@50 for LexNET, REPEL-P, and REPEL. Our approach consistently outperforms the compared algorithms, especially when the given seeds are very limited.

Fig. 7 presents the results on the KBC and RE tasks. Our approach (REPEL) consistently outperforms the other approaches that integrate the distributional and pattern-based methods (CONV, LexNET). Besides, REPEL also achieves better results than its variants (REPEL-P, REPEL-D), which deploy only one module. Moreover, when the given seed instances are sufficient, the results of the different approaches are quite close, whereas under very limited seed instances, REPEL significantly outperforms both its variants (REPEL-P, REPEL-D) and the baseline approaches (CONV, LexNET). Based on these observations, we conclude that with the co-training framework, our approach is more robust to seed scarcity than existing integration approaches (CONV, LexNET).

2. Convergence Analysis. In our approach, we leverage a coordinate gradient descent algorithm for optimization, alternating between updating the distributional module and improving the pattern module. Next, we examine whether the optimization algorithm converges during training. We take the Wiki dataset as an example, and present the performance of our approach at each iteration.
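The loop below sketches this alternation at a high level; the module interfaces (`fit`, `reweight`, `extract_confident`) are hypothetical stand-ins for the paper's actual components, not its API.

def train_repel(dist_module, pattern_module, corpus, seeds, n_iters=6):
    """Alternate between the two modules, in the spirit of coordinate descent [38]."""
    instances = set(seeds)
    for _ in range(n_iters):
        # (1) Update the distributional module on the current instance set.
        dist_module.fit(corpus, instances)
        # (2) Use distributional scores to re-estimate pattern reliability.
        pattern_module.reweight(dist_module.score)
        # (3) Reliable patterns promote highly confident instances, which
        #     serve as extra seeds for the next iteration.
        instances |= pattern_module.extract_confident(corpus)
    return dist_module, pattern_module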


Figure 8: Convergence curves of our approach: (a) KBC, Hits@10 per iteration for REPEL-D and REPEL; (b) RE, P@100 per iteration for REPEL-P and REPEL. Our approach quickly converges after several iterations.

Fig. 8 presents the results. On both tasks, the performance of our approach improves consistently over the first several iterations, which shows that the two modules keep improving each other in our framework. Moreover, our approach converges quickly, after only a few (3∼4) iterations, which demonstrates the efficiency of the optimization algorithm.

3. Performance w.r.t. λ. In our framework, the parameter λ controls the weight of the interaction term O_I (Eqn. 10). A large λ encourages strong interactions between the modules, whereas a small λ corresponds to weak interactions. In this part, we study the performance of our approach under different values of λ. We take the Wiki dataset as an example, and report the results on both tasks.
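Schematically, the trade-off can be written as below; this is an illustrative decomposition consistent with the description above, not a verbatim restatement of Eqn. 10.

% Illustrative sketch (assumed form): the joint objective balances fitting the
% given seed instances against the cross-module interaction term O_I.
\[
  O \;=\; O_{\mathrm{seed}} \;+\; \lambda \, O_{I}, \qquad \lambda \ge 0,
\]
% where a larger lambda strengthens the mutual supervision between the pattern
% and distributional modules, and lambda = 0 decouples them entirely.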

Figure 9: Performance w.r.t. λ ∈ {0, 0.25, 0.5, 1, 2, 4}: (a) KBC, Hits@10; (b) RE, P@50. Encouraging the interaction of the two modules (λ > 0) improves the performance.

Fig. 9 presents the results. When λ is set to 0, meaning there is no interaction between the modules, the results are quite limited. The results then improve quickly as we gradually increase λ, and remain stable in the range (0.5, 1). If we increase λ further, the results begin to drop on the knowledge base completion task, as a large λ puts too much emphasis on the supervision the modules provide for each other, and thus neglects the supervision from the given seed instances.

4. Case Study. In our co-training framework, the two modules collaborate to overcome the seed scarcity problem. Specifically, the distributional module provides extra signals for selecting reliable patterns, whereas the pattern module discovers highly confident instances to improve the distributional module. Next, we show some case-study results to illustrate that the two modules can indeed mutually enhance each other.

Table 4: The most reliable patterns discovered by our approach. Patterns in blue are those judged incorrect by human annotators.

Relation: people.person.place-of-birth
REPEL-P                      | REPEL
[Tail] [Head] birthplace     | [Head] born place [Tail]
[Head] born place [Tail]     | [Head] born city [Tail]
[Head] father move [Tail]    | [Tail] [Head] birthplace
[Head] born city [Tail]      | [Head] born June [Tail]
[Head] born January [Tail]   | [Head] live return [Tail]

Relation: people.person.parents
REPEL-P                         | REPEL
[Tail] die succeed son [Head]   | [Tail] die succeed son [Head]
Babur [Tail] [Head]             | [Head] daughter [Tail]
[Tail] give boy [Head]          | [Head] son [Tail]
[Head] son [Tail]               | descendant [Tail] [Head]
have relationship [Head] [Tail] | [Tail] marry have son [Head]

We first present, in Table 4, the most reliable path-based patterns (i.e., tokens along the shortest dependency path between two entities) discovered by our approach and its variant on the Wiki dataset; patterns in blue are those judged unreliable by human annotators. Comparing our approach (REPEL) with its variant (REPEL-P), we see that by incorporating the supervision signals from the distributional module, some unreliable patterns are filtered out of the pattern list, and the patterns discovered by REPEL are more reliable. Therefore, the distributional module can indeed help the pattern module select reliable patterns. A sketch of how such path-based patterns can be extracted is given below.
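The sketch assumes spaCy for dependency parsing and networkx for the shortest path, and single-token entity mentions for simplicity; it is our own illustration, not the paper's implementation.

import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def path_pattern(sentence, head, tail):
    doc = nlp(sentence)
    # Undirected graph over token indices, one edge per dependency arc.
    graph = nx.Graph([(tok.i, child.i) for tok in doc for child in tok.children])
    h_i = next(tok.i for tok in doc if tok.text == head)
    t_i = next(tok.i for tok in doc if tok.text == tail)
    path = nx.shortest_path(graph, source=h_i, target=t_i)
    # Replace the entity tokens with placeholders, as in the patterns of Table 4.
    return ["[Head]" if i == h_i else "[Tail]" if i == t_i else doc[i].text
            for i in path]

# e.g. path_pattern("Beijing is the capital of China", "Beijing", "China")
# returns the tokens on the dependency path, something like
# ['[Head]', 'is', 'capital', 'of', '[Tail]'] depending on the parse.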

Table 5: Top-ranked instances extracted by the reliable patterns. Instances in blue are those judged incorrect by human annotators.

Relation: people.person.nationality
(Charles IV of France, France)   (Benjamin Franklin, USA)
(Adolf Hitler, German)           (Thomas Jefferson, USA)
(Pol Pot, Thailand)

Relation: location.country.capital
(Denmark, Copenhagen)   (Vietnam, Ho Chi Minh City)
(Norway, Oslo)          (Yugoslavia, Belgrade)
(Guinea, Conakry)

Meanwhile, we also randomly sample some instances extracted by the discovered reliable patterns and show them in Table 5, where instances in blue are those judged incorrect by human annotators. From the results, we see that most instances extracted by the reliable patterns are correct and reasonable. Therefore, the pattern module can in turn benefit the distributional module by providing reasonable relation instances.

6 RELATED WORK
Our work is related to pattern-based approaches for relation extraction. Given two entities, pattern-based approaches predict their relation from sentences mentioning both entities. Traditional approaches [1, 14, 23, 28, 41] try to find informative textual patterns using the given instances, and utilize these patterns for extraction. However, such approaches ignore the semantic correlations between patterns, and thus suffer from semantic drift [8]. Recent approaches [16, 17, 30, 34, 40, 45] address the problem by encoding textual patterns with neural networks. Despite their success, these approaches rely on considerable labeled instances to train effective models, and therefore suffer from the seed scarcity problem in the weakly-supervised setting. Our approach addresses this problem by letting the distributional module provide extra supervision.


Our work is also related to distributional approaches. Typically, these approaches learn entity representations from corpus-level statistics, and meanwhile train a relation classifier on the given relation instances, which takes entity representations as features for relation prediction. Some approaches learn entity representations from text corpora alone [20, 24, 33], but their performance is usually limited due to the lack of supervision. Other approaches [4, 13, 15, 36, 37, 39, 42] learn more predictive entity representations by using the given relation instances as supervision, achieving superior results. However, they also require abundant relation instances to learn effective relation classifiers, which are hard to obtain in the weakly-supervised setting. Our approach alleviates this problem by letting the pattern module generate highly confident instances as extra seeds.
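As a generic illustration of this distributional recipe (not any specific cited system), one can train a relation classifier on concatenated entity embeddings; the input names below are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_relation_classifier(emb, seed_instances):
    # `emb` maps an entity to its learned representation; `seed_instances`
    # is a list of ((head, tail), relation_id) pairs used as supervision.
    X = np.array([np.concatenate([emb[h], emb[t]])
                  for (h, t), _ in seed_instances])
    y = np.array([r for _, r in seed_instances])
    # The classifier predicts a relation from the pair's entity embeddings.
    return LogisticRegression(max_iter=1000).fit(X, y)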

There are also a handful of studies [25, 27, 30, 34, 35] that integrate the distributional and pattern-based approaches, typically by jointly training a distributional model and a pattern model. However, the supervision for each model comes entirely from the given relation instances, which is insufficient in the weakly-supervised setting. Our approach addresses the seed scarcity problem with a co-training framework, which encourages both models to provide extra supervision for each other.

7 CONCLUSIONS
In this paper, we studied corpus-level relation extraction in the weakly-supervised setting. We proposed a novel co-training framework called REPEL that integrates a pattern module and a distributional module. Our framework encourages both modules to provide extra supervision for each other, so that they can collaborate to overcome the scarcity of seeds. Experimental results demonstrated the effectiveness of our framework. In the future, we plan to enhance the pattern module by using neural models for pattern encoding.

REFERENCES
[1] E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. In ACM DL'00, pages 85–94, 2000.
[2] L. Bing, S. Chaudhari, R. Wang, and W. Cohen. Improving distant supervision for information extraction using label propagation through lists. In EMNLP'15, pages 524–529, 2015.
[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT'98, pages 92–100, 1998.
[4] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS'13, pages 2787–2795, 2013.
[5] R. C. Bunescu and R. J. Mooney. A shortest path dependency kernel for relation extraction. In HLT-EMNLP'05, pages 724–731, 2005.
[6] O. Chapelle, B. Scholkopf, and A. Zien. Semi-Supervised Learning. MIT Press, 2006.
[7] A. Culotta and J. Sorensen. Dependency tree kernels for relation extraction. In ACL'04, pages 423–429, 2004.
[8] J. R. Curran, T. Murphy, and B. Scholz. Minimising semantic drift with mutual exclusion bootstrapping. In PACLING'07, pages 172–180, 2007.
[9] J. Daiber, M. Jakob, C. Hokamp, and P. N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. In I-SEMANTICS'13, pages 121–124, 2013.
[10] J. Ellis, J. Getman, J. Mott, X. Li, K. Griffitt, S. Strassel, and J. Wright. Linguistic resources for 2013 knowledge base population evaluations. In TAC'13, 2013.
[11] O. Etzioni, A. Fader, J. Christensen, S. Soderland, and M. Mausam. Open information extraction: The second generation. In IJCAI'11, pages 3–10, 2011.
[12] R. Hoffmann, C. Zhang, X. Ling, L. Zettlemoyer, and D. S. Weld. Knowledge-based weak supervision for information extraction of overlapping relations. In ACL'11, pages 541–550, 2011.
[13] G. Ji, S. He, L. Xu, K. Liu, and J. Zhao. Knowledge graph embedding via dynamic mapping matrix. In ACL'15, pages 687–696, 2015.
[14] M. Jiang, J. Shang, T. Cassidy, X. Ren, L. M. Kaplan, T. P. Hanratty, and J. Han. MetaPAD: Meta pattern discovery from massive text corpora. In KDD'17, pages 877–886, 2017.
[15] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI'15, pages 2181–2187, 2015.
[16] Y. Lin, S. Shen, Z. Liu, H. Luan, and M. Sun. Neural relation extraction with selective attention over instances. In ACL'16, pages 2124–2133, 2016.
[17] Y. Liu, F. Wei, S. Li, H. Ji, M. Zhou, and H. Wang. A dependency-based neural network for relation classification. In ACL'15, pages 285–290, 2015.
[18] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In ACL'14 (System Demonstrations), pages 55–60, 2014.
[19] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS'13, pages 3111–3119, 2013.
[21] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In ACL-IJCNLP'09, pages 1003–1011, 2009.
[22] R. J. Mooney and R. C. Bunescu. Subsequence kernels for relation extraction. In NIPS'06, pages 171–178, 2006.
[23] N. Nakashole, G. Weikum, and F. Suchanek. PATTY: A taxonomy of relational patterns with semantic types. In EMNLP'12, pages 1135–1145, 2012.
[24] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP'14, pages 1532–1543, 2014.
[25] M. Qu, X. Ren, and J. Han. Automatic synonym discovery with knowledge bases. In KDD'17, pages 997–1005, 2017.
[26] X. Ren, Z. Wu, W. He, M. Qu, C. R. Voss, H. Ji, T. F. Abdelzaher, and J. Han. CoType: Joint extraction of typed entities and relations with knowledge bases. In WWW'17, pages 1015–1024, 2017.
[27] S. Riedel, L. Yao, A. McCallum, and B. M. Marlin. Relation extraction with matrix factorization and universal schemas. In HLT-NAACL'13, pages 74–84, 2013.
[28] M. Schmitz, R. Bart, S. Soderland, O. Etzioni, et al. Open language learning for information extraction. In EMNLP-CoNLL'12, pages 523–534, 2012.
[29] V. Shwartz and I. Dagan. Path-based vs. distributional information in recognizing lexical semantic relations. arXiv preprint arXiv:1608.05014, 2016.
[30] V. Shwartz, Y. Goldberg, and I. Dagan. Improving hypernymy detection with an integrated path-based and distributional method. In ACL'16, pages 2389–2398, 2016.
[31] R. Snow, D. Jurafsky, and A. Y. Ng. Learning syntactic patterns for automatic hypernym discovery. In NIPS'05, pages 1297–1304, 2005.
[32] J. Tang, M. Qu, and Q. Mei. PTE: Predictive text embedding through large-scale heterogeneous text networks. In KDD'15, pages 1165–1174, 2015.
[33] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In WWW'15, pages 1067–1077, 2015.
[34] K. Toutanova, D. Chen, P. Pantel, H. Poon, P. Choudhury, and M. Gamon. Representing text for joint embedding of text and knowledge bases. In EMNLP'15, pages 1499–1509, 2015.
[35] P. Verga, A. Neelakantan, and A. McCallum. Generalizing to unseen entities and entity pairs with row-less universal schema. In EACL'17, pages 613–622, 2017.
[36] H. Wang, F. Tian, B. Gao, C. Zhu, J. Bian, and T.-Y. Liu. Solving verbal questions in IQ test by knowledge-powered word embedding. In EMNLP'16, pages 541–550, 2016.
[37] Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph and text jointly embedding. In EMNLP'14, pages 1591–1601, 2014.
[38] S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
[39] C. Xu, Y. Bai, J. Bian, B. Gao, G. Wang, X. Liu, and T.-Y. Liu. RC-NET: A general framework for incorporating knowledge into word representations. In CIKM'14, pages 1219–1228, 2014.
[40] Y. Xu, L. Mou, G. Li, Y. Chen, H. Peng, and Z. Jin. Classifying relations via long short term memory networks along shortest dependency paths. In EMNLP'15, pages 1785–1794, 2015.
[41] M. Yahya, S. Whang, R. Gupta, and A. Y. Halevy. ReNoun: Fact extraction for nominal attributes. In EMNLP'14, pages 325–335, 2014.
[42] B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng. Embedding entities and relations for learning and inference in knowledge bases. In ICLR'15, 2015.
[43] D. Zeng, K. Liu, Y. Chen, and J. Zhao. Distant supervision for relation extraction via piecewise convolutional neural networks. In EMNLP'15, pages 1753–1762, 2015.
[44] D. Zeng, K. Liu, S. Lai, G. Zhou, and J. Zhao. Relation classification via convolutional deep neural network. In COLING'14, pages 2335–2344, 2014.
[45] W. Zeng, Y. Lin, Z. Liu, and M. Sun. Incorporating relation paths in neural relation extraction. In EMNLP'17, pages 1769–1778, 2017.