
Learning Hierarchical Relationships among Partially Ordered Objects with Heterogeneous Attributes and Links

Chi Wang, Jiawei Han
University of Illinois at Urbana-Champaign
{chiwang1,hanj}@illinois.edu

Qi Li, Xiang Li, Wen-Pin Lin, Heng Ji
City University of New York
{hengji}@cs.qc.cuny.edu

Abstract

Objects linking with many other objects in an information network may imply various semantic relationships. Uncovering such knowledge is essential for role discovery, data cleaning, and better organization of information networks, especially when the semantically meaningful relationships are hidden or mingled with noisy links and attributes. In this paper we study a generic form of relationship along which objects can form a tree-like structure, a pervasive structure in various domains. We formalize the problem of uncovering hierarchical relationships in a supervised setting. In general, local features of object attributes, their interaction patterns, as well as rules and constraints for knowledge propagation can be used to infer such relationships. Existing approaches, designed for specific applications, either cannot handle dependency rules together with local features, or cannot leverage labeled data to differentiate their importance. In this study, we propose a discriminative undirected graphical model. It integrates a wide range of features and rules by defining potential functions with simple forms. These functions are also summarized and categorized. Our experiments on three quite different domains demonstrate how to apply the method to encode domain knowledge. The efficacy is measured with both traditional and our newly designed metrics in the evaluation of discovered tree structures.

1 Introduction

In an information network, linked objects have different roles defined by their relationships with other objects. Many relationships are not explicitly specified, but hidden in the observed network or mingled with noisy links. Uncovering such knowledge has great importance in cleaning and reorganizing the information network for better utility. When the linked objects with such a relationship can be organized into a tree-like structure, we name it a hierarchical relationship. There are many instances of this kind of relationship in the real world. Parent-child, manager-subordinate, and advisor-advisee are examples of such relationships among people in social networks and organizations. Many recent studies reported applications aided by certain hierarchical relationships in various domains, such as news dynamics tracking (Leskovec et al. [19]), information retrieval from online discussions (Seo et al. [26]), inference of search intent (Yin and Shah [33]), and information cascade discovery in social networks (Gomez-Rodriguez et al. [23]). However, no one has formalized the problem in a domain-independent, generic setting, where the relationship is not detectable from single clear patterns but needs to be learned from multiple features and constraints in the presence of labeled data. In this paper, we cast the task of hierarchical relationship reconstruction as a learning problem, and attempt a solution that is generally applicable to partially ordered objects.

The supervised scenario is worth studying because it is often the case that we need to discover the instantiated hierarchical relationship among a set of data entities in a noisy network, given attributes on the entities and interaction patterns. Also, we can use commonsense assumptions to constrain the prediction or propagate knowledge in the network according to the dependencies existing among linked objects. Without labeled instances as training data, we are unable to determine their relative importance and make correct predictions in a generic case. On the other hand, it is challenging to handle both local features and dependency rules simultaneously in a learning framework, specifically for tree-like structure prediction. Traditional machine learning methods designed for individual prediction cannot capture the dependencies naturally, while generic learning methods designed for structured output do not specify how to encode the dependencies particular to tree-like structure prediction and are hard to apply directly. Thus it is necessary to study the properties of tree-like structure prediction and develop a method that can be easily applied.

Contributions

First, the problem is novel because no one has studied the generic supervised tree construction problem with the same kind of input and output.

Second, our method can reconcile complicated interactions and local features in a unified model and learn their importance jointly, rather than arbitrarily assigning weights or using hard constraints for postprocessing. For example, when predicting the family relationship of two named entities, not only the context in which they occur in a document, but also their ages, residence locations, etc., provide clues. We can define different features from these clues. "Local" features are indicative of whether two objects have a certain relationship and are independent of the relationships among other objects. The last-name equivalence of two people is an example of a local feature for filiation. We can also define features or rules to capture the correlation between the relationships of different pairs of objects. There are simple propagation rules, such as siblings sharing the same parent. There are also more complicated ones, such as the constraint that one must be born (and grow old enough) before s/he can give birth to others, which may involve complex interactions between unknown variables because one's parent and children both need to be inferred.

Third, we study generally useful features and rules that offer clues for solving real problems. They are categorized into two major classes and eight minor types. We show that many complex dependency rules can be factorized into potential functions that depend on only two variables. Thus one only needs to materialize these potential functions, in the form of either singleton or pairwise potentials, to apply our approach. The generality of our approach is validated by experimenting with different applications. We reimplemented several state-of-the-art approaches and show that our method consistently beats them in these applications.

2 Related Work

In a broad view of relationship identification, there are studies in different domains. One category of such work is relation mining from text data. The commonly studied entity relationship extraction belongs to this category, such as the work developed around the Knowledge Base Population task [15]. While most of these explore low-level textual features [1, 13], some also exploit relational patterns for reasoning [24, 7]. Another line of studies focuses primarily on processing the interaction events in a social network, and tries to discover relationships from traffic or content patterns of the interaction, e.g., from email communication data [22, 9]. The central problem for most of these studies is judging whether a pair of objects has a certain relationship. They do not require the discovered relationships to form particular structures.

The particular relationship considered in this paper is asymmetric among a set of linked objects, and the objects can be organized in a tree-like structure along this relationship. In a few recent studies, finding such a relationship is an essential task. Leskovec et al. [19] define the DAG partitioning problem, which is NP-hard, and propose a class of heuristic methods that find a parent for each non-rooted node, which can be regarded as finding hierarchical relationships among the phrase quotations in news articles. Kemp and Tenenbaum [17] propose a generative model to find structure in data, and Maiya and Berger-Wolf [21] apply a similar idea to inferring the social network hierarchy with the maximum likelihood of observing the interactions among the people. In the Web search domain, Yin and Shah [33] study the problem of building taxonomies of search intents for entity queries based on inferring the "belonging" relationships between them with unsupervised approaches. There is other related work on building taxonomies from Web tags with similar methodology [8]. NLP researchers have also studied entity hyponymy (or is-a) relations from web documents; among these, Zhang et al.'s unsupervised approach [34] claimed state-of-the-art performance, though the tree structures along the relation are not exploited. In [30], advisor-advisee relationships are mined and academic family trees are built from coauthor networks in an unsupervised way. All of these unsupervised approaches rely on one kind of observed data, either links or attributes, and a clear standard of tree construction. We handle heterogeneous data with multiple attributes and links where no single factor determines the hierarchy, and we learn the important factors from training data. In the Information Retrieval community, researchers developed supervised methods to discover the "reply" relationship in online conversations, which is also a hierarchical relationship [26, 31]. Those methods use domain-specific features and cannot be directly applied in more general scenarios. To the best of our knowledge, this work is the first attempt to formalize and solve the general hierarchical relationship learning problem.

The terms "social hierarchy" and "organizational hierarchy" refer to stratified node-ranking results in some work [11, 25], and "hierarchical structure" refers to hierarchical grouping/clustering for many researchers, including physicists and biologists [6]. These are essentially distinct concepts from what we study. Our output is a network where each node is an object from the input data and each link represents the existence of the target relationship between a pair of objects.


[Figure omitted: a candidate DAG over nodes v1-v4 and two possible resulting tree structures.]

Figure 1: An illustration of the problem definition

3 Problem Formulation

We first give two real-world examples.

Example 1. Given a set of named people and knowledge of their ages, nationalities, and mentions in text, we want to find the family relation among them. Besides text features, such as a coreferential link between two names with family keywords, we have constraints like: (1) for two persons to be siblings, they should have the same parent; and (2) if A is B's parent, then A is unlikely to be C's parent if B and C have different nationalities. The output should be a family tree or forest of these people.

Example 2. In an online forum, we want to predict the replying relationship among the posts within the same thread, with knowledge of the content, author, and posting time. Every post replies to one earlier post in the same thread, except for the first one. One intuition is that a post will have content similar to the one it replies to; yet another possibility is that two similar posts may reply to a common post. The output is a tree structure for each threaded discussion.

As an abstraction of many such problems, we give the following formalization. Given a set of objects V = {v1, . . . , vn} and their directed links E ⊆ V × V, a relation R ⊆ E is a hierarchical relationship if (1) for every object u ∈ V, there exists exactly one object v ∈ V such that (u, v) ∈ R; and (2) there does not exist a cycle along R, i.e., no sequence (u1, . . . , ut), t > 1, such that u1Ru2, . . . , ut−1Rut, utRu1. In the relation instance viRvj, we name vi a child and vj a parent. For convenience we denote (vi, vj) by eij. We define the out-neighbors of a node as Yi = {vj | (vi, vj) ∈ E}.
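To make the two conditions concrete, here is a minimal sketch (ours, not the paper's code) that checks whether a given relation satisfies this definition, using the self-loop convention for roots introduced with Figure 1 below:

```python
# Minimal sketch (not the paper's code): test whether a relation R, given as
# a set of (child, parent) pairs over objects V, is a hierarchical
# relationship. Roots are represented by self-loops, as in Figure 1.

def is_hierarchical(V, R):
    parent = {}
    for u, v in R:
        if u in parent:                 # condition (1): one parent per object
            return False
        parent[u] = v
    if set(parent) != set(V):           # every object needs exactly one parent
        return False
    for start in V:                     # condition (2): no cycle of length > 1
        u, seen = start, {start}
        while parent[u] != u:           # stop at a self-loop (a root)
            u = parent[u]
            if u in seen:
                return False
            seen.add(u)
    return True

print(is_hierarchical({1, 2, 3}, {(1, 1), (2, 1), (3, 2)}))  # True: a tree
print(is_hierarchical({1, 2, 3}, {(1, 2), (2, 3), (3, 1)}))  # False: a cycle
```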

In a general setting, there might exist multiple hierarchical relationships among the same set of nodes. In this paper we study the task of uncovering one user-specified hierarchical relationship according to labeled instance pairs with this relationship. Let the labeled pairs be L ⊂ R; we aim to uncover the remaining set R \ L. In other words, we need to predict, for every pair of directly linked objects (vi, vj) ∈ E, whether the statement "(vi, vj) ∈ R" is true. This defines the generic hierarchical relationship learning problem. Along each learned relation, the objects form a tree or a forest, so we can also call the task tree-like structure prediction.

Figure 1 gives one example. If we instantiate it as a family tree prediction problem, each node vi represents a person, and each link points from one person to a potential parent. v1 and v2 each have a link to themselves, implying they can be the root of a tree. We call G = (V, E) a candidate graph. Suppose we know the ages of these people; then we can make sure the candidate graph has no directed cycles by always linking a younger person to an older one. Many other real problems also have this property.

Assumption 1. The candidate graph is a directed acyclic graph (DAG).

With this assumption, given a set of objects, a tree-like structure can be learned in two steps. First, we extract a partial order for the objects and build a DAG, as sketched below. Next, we learn a model with given labels on some links in the DAG, and conduct prediction for the remaining links.
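The following sketch of the first step uses an assumption of ours for illustration (ages are known, and only age-maximal objects may be roots); it is not the paper's code:

```python
# Sketch of step 1: build a candidate DAG by linking each person to every
# strictly older person, so the age order guarantees acyclicity
# (Assumption 1). Here only nodes with no older candidate get a self-loop;
# domain knowledge may admit more roots, as in Figure 1.

def build_candidate_dag(age):
    E = set()
    for u in age:
        older = [v for v in age if age[v] > age[u]]
        E.update((u, v) for v in older)
        if not older:
            E.add((u, u))               # no older candidate: a possible root
    return E

ages = {"v1": 80, "v2": 78, "v3": 50, "v4": 52}
print(sorted(build_candidate_dag(ages)))
```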

When the candidate graph is not a DAG in the original data, additional effort is needed. In fact, it is NP-hard to remove as few edges as possible to break the cycles of a directed graph, a problem known as minimum feedback arc set [16]. In this paper, instead of examining that issue, we simply assume such an order can be constructed, and we focus on how to design features and learn a joint model to handle the dependencies and propagate knowledge. We will demonstrate that it is nontrivial to learn the tree structure even when the candidate graph is a DAG.

4 Our Approach

The relation of each pair of nodes can be determined one by one, or determined jointly by taking their interactions into consideration. We argue that a joint model should be more powerful. For instance, in Example 1 (Figure 2(a)), parents and siblings are inter-related and mutually constrained in a network. If we decompose the inference task into a classification problem for every individual pair of nodes, it is difficult to leverage the intra-dependency among nodes, because one prediction depends on other predictions that are also unknown. On the other hand, the internal dependencies within the structure provide regularization over each individual prediction, which helps the overall predictions in the network. When we have high confidence in discovering some parts of the structure, prediction on the less certain parts becomes easier under the assumption that knowledge can be propagated. One remedy is postprocessing.


[Figure omitted: illustrations of the two examples.]

Figure 2: Examples of dependency among the inferences. (a) Two soft dependency rules on a family tree: their relative importance needs to be learned. (b) Conflicting rules on forum reply structure: similar posts may be filiation or siblings.

For example, if after the prediction we find conflicts such as two siblings having different parents, we change the prediction for one of them to avoid mistakes. However, the correct parent is already missing, and we do not even know which prediction is wrong.

The second challenge of our problem in general cases is that the heterogeneous information and its interplay can impose different constraints: some harder, some softer, and some even inconsistent with each other. In Example 2 (Figure 2(b)), the two rules do not agree with each other if both are applied to making predictions. Moreover, rules like "the more similar two objects are, the more probable they share the same parent" cannot be expressed as hard constraints. To propagate knowledge in a systematic way, we need a model that can unify these different kinds of features and rules.

In this paper, we resort to a probabilistic graphical model to handle the uncertain dependency rules. Specifically, we develop a discriminative model, CRF-Hier, to solve the hierarchical relationship learning problem. As a conditionally trained, undirected graphical model, it is able to accommodate multiple, overlapping, non-independent features without having to generate the observed variables or model their dependencies. Compared to the commonly used linear-chain CRFs [18] or their extension, 2D CRFs [35], our model is designed to handle the more complicated dependencies that arise when inferring tree structure.

4.1 Conditional Random Field for Hierarchical Relationship

We model the joint probability of every possible relationship (vi, vj) ∈ E being a truly existent relationship in R. We use an indicator variable xij for the event "(vi, vj) ∈ R", i.e., xij = 1 if (vi, vj) ∈ R, and 0 if not.

[Figure omitted: (a) the four cases for vaRvb and vcRvd to be directly dependent; (b) the Markov network derived from the example candidate DAG.]

Figure 3: Illustration of the Markov assumption

As we have analyzed, the inferences of the relationships for some pairs are not independent. Suppose we have evidence that two people v2 and v4 are not siblings; then we may not expect the two events "(v2, v1) ∈ R" and "(v4, v1) ∈ R" to happen together. We formalize this kind of intuition as a Markov assumption that events involving common objects are dependent.

Assumption 2. Two events "(va, vb) ∈ R" and "(vc, vd) ∈ R" correspond to connected random variables in a Markov network if and only if they share a common node, i.e., one of the following is true: va = vc, vb = vc, va = vd, or vb = vd. (Figure 3)

It immediately follows that the Markov network can be derived from the candidate graph G by creating a node for every edge in G and connecting nodes that represent two adjacent edges in G; this is the line graph of G.
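A small sketch (assumed representation, not the paper's code) of this construction:

```python
# Derive the Markov network of Assumption 2 as the line graph of the
# candidate graph G: one random variable per candidate edge, with two
# variables connected iff their edges share a common node.

from itertools import combinations

def markov_network(E):
    """E: directed edges (child, parent). Returns unordered pairs of edges
    whose indicator variables are connected in the Markov network."""
    return {frozenset([e, f]) for e, f in combinations(E, 2)
            if set(e) & set(f)}

# The candidate DAG of Figure 1 (self-loops mark possible roots).
E = [("v1", "v1"), ("v2", "v1"), ("v2", "v2"),
     ("v3", "v2"), ("v4", "v1"), ("v4", "v3")]
for pair in sorted(sorted(p) for p in markov_network(E)):
    print(pair)
```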

The conditional joint probability is formulated as

(4.1)  $p(X \mid G, \Theta) \propto \exp\Big( \sum_{k=1}^{K} \theta_k F_k(X, G) + H(X, G) \Big)$

where $\{F_k(X,G)\}_{k=1}^{K}$ is a set of features defined on the given candidate graph G and the indicator variables X = {xij}; $\{\theta_k\}_{k=1}^{K}$ are the weights of the corresponding features. H is a special feature function that enforces the hierarchy constraint $\sum_{j:(v_i,v_j)\in E} x_{ij} \le 1$ for every i:

(4.2)  $H(X,G) = \begin{cases} -\infty & \exists i,\ \sum_{j:(v_i,v_j)\in E} x_{ij} > 1 \\ 0 & \text{otherwise} \end{cases}$

Any other hard constraints can be encoded in the samemanner.

Thus, once we learn the weights $\{\theta_k\}_{k=1}^{K}$ from training data, the relation inference task can be formulated as a maximum a posteriori (MAP) inference problem: for each given candidate DAG G, we seek the optimal configuration of the relationship indicators $X^*$, such that

(4.3)  $X^* = \arg\max_{X \in \mathcal{X}} p(X \mid G, \Theta)$


where $\mathcal{X}$ is the set of all possible configurations, i.e., the search space. Since every $x_{ij}$ can take value 0 or 1, the size of the space is $2^{|E|}$.

Such a formulation keeps maximal generality, but it poses a great challenge in solving the combinatorial optimization problem. We improve it in two ways.

First, we explore the form of the feature functions Fk. Each of them can be defined on all variables, in which case computing every function value relies on enumerating X in $\mathcal{X}$. However, since we have made the Markov assumption, the dependency can be represented by a factor graph, and each feature function can be decomposed into potential functions over cliques of the graph (Hammersley-Clifford theorem [14]). Moreover, we can restrict the potential functions to interactions between at most a pair of random variables xij and xst. In fact, we have the following claim: any factor graph over discrete variables can be converted to a factor graph restricted to pairwise interactions by introducing auxiliary random variables. This property was shown by Weiss and Freeman [32]. Although the generic conversion procedure may introduce additional variables, we will show in the next section that quite a broad range of features can be materialized in forms as simple as pairwise potentials without the help of auxiliary variables. Here we factorize the constraint H to exemplify the idea: $H(X,G) = \sum_{e_{is}, e_{it} \in E} h(x_{is}, x_{it})$, where $h(x_{is}, x_{it}) = -\infty$ if $x_{is} = x_{it} = 1$, and 0 otherwise.

Next, we try to reduce the number of variables and constraints.

To leverage the constraint that one node has at most one parent and the assumption that the candidate graph G is a DAG, we introduce a variable yi to represent the parent of vi, i.e., yi = j if and only if eij = (vi, vj) is an instance of the hierarchical relationship R. Given Assumption 1, the problem is equivalent to predicting each yi's value from Yi.

Assumption 2 implies the existence of two kinds of dependencies: between two variables yi and yj where vj is a candidate parent of vi; and between two variables ys and yt such that vs and vt share a common candidate parent vm. With this formulation, the constraint H is no longer needed, and the objective function has the following form:

(4.4)  $p(Y \mid G, \Theta) \propto \exp\Big( \sum_{k \in I_1} \sum_{y_s \in S_k} \theta_k\, g_k(y_s \mid Y_s) + \sum_{k \in I_2} \sum_{(y_i, y_j) \in P_k} \theta_k\, f_k(y_i, y_j \mid Y_i, Y_j) \Big)$

where $I_1$ and $I_2$ are the index sets of features that can be decomposed into singleton and pairwise potential functions, respectively, and $S_k$ and $P_k$ are the decomposed singleton and pairwise cliques for the k-th feature.
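As a concrete reading of Eq. (4.4), the following sketch (assumed interfaces, ours rather than the paper's) computes the unnormalized log-probability of one assignment Y:

```python
# Score an assignment Y = {i: chosen parent} up to the constant -log Z.
# g_fns[k] and f_fns[k] are user-defined potentials; singleton_cliques[k]
# and pairwise_cliques[k] list the nodes / node pairs in S_k and P_k.

def log_score(Y, theta_g, g_fns, singleton_cliques,
              theta_f, f_fns, pairwise_cliques):
    s = 0.0
    for k, g in enumerate(g_fns):                       # singleton terms
        s += theta_g[k] * sum(g(i, Y[i]) for i in singleton_cliques[k])
    for k, f in enumerate(f_fns):                       # pairwise terms
        s += theta_f[k] * sum(f(i, Y[i], j, Y[j])
                              for i, j in pairwise_cliques[k])
    return s
```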

[Figure omitted: the candidate DAG G and its factor graph with singleton potentials g1, g2, g3 and pairwise potentials f4, f5.]

Figure 4: The graphical representation of the proposed CRF-Hier model for the example in Figure 1. The singleton potentials defined on each variable are listed in tables connected to the related variables by dotted lines; pairwise potentials are represented by solid rectangles, with tables listing the pairwise function values for the different configurations of each variable pair.

As an example, for the candidate graph G in Figure 4 we build a factor graph like the one depicted there. The four nodes v1, v2, v3, and v4 have four parent variables y1 to y4, whose ranges are Y1 = {v1}, Y2 = {v1, v2}, Y3 = {v2}, and Y4 = {v1, v3}. For each variable there can be one or more singleton potential functions. For each pair of directly linked nodes, we have a pairwise potential function, f4 in this example (we omit the one between y1 and y4). v2 and v4 have a common parent candidate, so there is one pairwise potential f5 defined on y2 and y4.

We name our model CRF-Hier, as it is a conditional random field specifically designed for the hierarchical relationship learning problem.

We give several examples of specific potential definitions.

Example 3. The names of a parent and a child should appear in the same sentence (a support sentence) together with family relationship words like "son" or "father":

g(yi|Yi) = #support sentences of (vi, vyi).

Example 4. Siblings have a common parent:

f(yi, yj |Yi, Yj) = [yi = yj] p(vi and vj are siblings),

where we use the Iverson bracket to denote a number that is 1 if the condition in square brackets is satisfied, and 0 otherwise; in this case, [yi = yj] = 1 if yi = yj and 0 otherwise.

Example 5. Two people of the same age are unlikely to be siblings:

f(yi, yj |Yi, Yj) = [yi = yj][vi's age = vj's age].

Note that these statements are not always true; e.g., twins are siblings of the same age. But an advantage of our framework is that it does not force them to be true. The indicator in Example 5 takes value 0 or 1, and the learned weight of the corresponding feature controls how much we trust the rule.
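The three example potentials might be sketched as follows; hedged, since `support_sents`, `sib_prob`, and `age` are assumed lookup tables rather than the paper's data structures:

```python
# Hedged sketches of Examples 3-5; the lookup tables are assumed inputs.

def g_support(i, yi, support_sents):
    # Example 3: count of sentences where v_i and its chosen parent co-occur
    # with family keywords such as "son" or "father".
    return support_sents.get((i, yi), 0)

def f_siblings(i, yi, j, yj, sib_prob):
    # Example 4: Iverson bracket [yi = yj] scaled by the estimated
    # probability that v_i and v_j are siblings.
    return (yi == yj) * sib_prob.get((i, j), 0.0)

def f_same_age(i, yi, j, yj, age):
    # Example 5: fires when two same-age people choose the same parent;
    # the learned (typically negative) weight decides how much to trust it.
    return (yi == yj) * (age[i] == age[j])
```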


Table 1: Potential categorization and illustration

• homophily. Parent and child are similar. Potential: g(yi) = sim(vi, vyi). Examples: textual similarity, interaction intensity, spatial and temporal locality.
• polarity. Parent is superior to child. Potential: g(yi) = asim(vi → vyi). Examples: authority, interaction tendency, conceptual extension and inclusion.
• support pattern. Patterns frequently occurring with child-parent pairs. Potential: g(yi) = |SP(vi, vyi)|. Examples: contextual pattern, interaction pattern, preference for certain attributes.
• forbidden pattern. Patterns rarely occurring with child-parent pairs. Potential: g(yi) = −|FP(vi, vyi)|. Examples: forbidden attribute to share, forbidden distinction.
• attribute augment. Use inherited attributes from parents or children. Potential: f(yi, yj) = [yi = j] sim(vi, vyj). Examples: content propagation for documents, authority propagation for entities.
• label propagate. Similar nodes share similar parents (children). Potential: f(yi, yj) = sim(vyi, vyj) sim(vi, vj). Examples: siblings share common parents, colleagues share similar supervisors.
• reciprocity. Patterns alternating in child-parent and parent-child pairs. Potential: f(yi, yj) = sim(vi, vyj) sim(vj, vyi). Example: author reciprocity (back-and-forth) in online conversations.
• constraints. Restrict certain patterns. Potential: f(yi, yj) = −|CP(vi, vj, vyi, vyj)|. Example: consistency of transitive properties.

For conciseness we omit the conditional variables in g and f. SP = support pattern set, FP = forbidden pattern set, CP = constraint pattern set.

Eventually, compatibility with the rule contributes to the additive log-likelihood as the product of the potential function value and the feature weight.

We see that these different features, constraints, and propagation rules can be encoded in a unified form; we just need to define the potential for each of them. We discuss how to systematically design potentials in the next subsection, followed by the inference and learning algorithm.

4.2 Potential Function Design

We have restricted our focus to features that can be decomposed into either singleton or pairwise potential functions. We summarize the potential types with domain-independent cognitive meanings, so that one can design domain-specific potentials using this map.

We start from important singleton potentials.

• homophily. The first kind, and probably the most widely applicable, is a similarity measure between two objects. The assumption here is that filiation correlates with homophily, e.g., content similarity, interaction intensity (e.g., the frequency of telephone calls per unit time), location adjacency, or time proximity (e.g., whether two documents are published within a short period). This kind of similarity measure sim is symmetric, and there are numerous metrics we can use. The potential function has the form g(yi|Yi) = sim(vi, vyi).

• polarity. The second kind, almost equally important, is an asymmetric similarity measure. It is used to measure the dominance of certain attributes of the parent over the child, e.g., authority difference, bias of interaction tendency (e.g., whether A writes many more emails to B than B to A), or the degree of conceptual generalization/specialization. Such a measure asim quantifies the partial order in terms of polarity between linked nodes, in the form g(yi|Yi) = asim(vi → vyi).

• support pattern. The third kind of potential characterizes the preference for certain patterns involving a pair of nodes with filiation. We can define a potential based on the number of pattern occurrences: g(yi|Yi) = |SP(vi, vyi)|, where SP denotes the support pattern set.

The pairwise potentials are responsible for the knowledge propagation as well as the restrictive dependencies.

• attribute augmentation. One intuition for knowledge propagation is that a node can inherit attributes from its parent or child to augment its own. In our model this can be realized by defining a pairwise potential f(yi, yj |Yi, Yj) = [yi = j] sim(vi, vyj). It can be read in two ways: knowing that the parent of vi is vj, vj tends to choose a parent similar to vi; or, given that the parent of vj is vyj, this affects the decision of vj's child toward inheriting attributes from vyj. By replacing the boolean indicator [yi = j] with a weighting function of vi and vj, we can control the extent to which the attribute is propagated.


• label propagation. Sometimes we can measure how likely two nodes are to share the same parent, so the label of one node's parent can be propagated to similar nodes with a function like f(yi, yj |Yi, Yj) = [yi = yj] sim(vi, vj). Given vi's parent vyi, the more similar vj is to vi, the larger the contribution to the joint likelihood from setting vj's parent to be the same, i.e., yj = yi.

• reciprocity. This kind of potential can handle a more complicated pattern that alternates between child-parent and parent-child pairs. For example, "if author A often replies to author B, then author B is more likely to reply to author A." For such a rule, we seem to need a big factor function like $F(y_i \mid G, Y \setminus y_i) = \sum_{A,B} [a_i = B][a_{y_i} = A] \sum_{j: a_j = A} [a_{y_j} = B]$, where ai stands for vi's author. This requires all labels to be known. Fortunately, rules in this form have an equivalent decomposed representation: the above rule decomposes into pairwise potentials f(yi, yj |Yi, Yj) = [ai = ayj][ayi = aj], as the sketch below verifies on toy data.

For a specific application we can encode an arbitrary number of potentials of each type. In the experimental section we use three real-world examples to demonstrate this. In principle, one can apply frequent pattern mining or statistical methods to find distinguishing features in each category. When features from multiple categories are used, the model learns to reach a compromise among them.

4.3 Model Inference and Learning

Given a training network G = (V, E) with a set L of instances labeled with the hierarchical relationship, we need to find the optimal model parameters $\Theta = \{\theta_k\}_{k=1}^{K}$, which maximize the conditional likelihood defined in (4.4) over the training set. Let Y(o) and T be the variable set and the assignments corresponding to the labeled relation instances L, and let Y(u) be the unknown labels in the training data; their union is the full variable set Y in the training data. The log-likelihood of the labeled variables can be written as:

(4.5)  $L_\Theta = \log p(Y^{(o)} = T \mid G, \Theta) = \log \sum_{Y^{(u)}} \exp\big(\Theta^t \mathbf{F}(Y|_{Y^{(o)}=T}, G)\big) - \log Z(G, \Theta)$

where $\mathbf{F}(Y,G)$ is the vector representation of the feature functions and $Z(G,\Theta) = \sum_Y \exp(\Theta^t \mathbf{F}(Y,G))$.

To avoid overfitting, we penalize the likelihood with L2 regularization. Taking the derivative of this objective function, we get:

(4.6)  $\nabla L_\Theta = \mathbb{E}_{p(Y^{(u)} \mid G, \Theta, Y^{(o)}=T)}\, \mathbf{F}(Y|_{Y^{(o)}=T}, G) - \mathbb{E}_{p(Y \mid G, \Theta)}\, \mathbf{F}(Y, G) - \lambda \Theta$

When the training data are fully labeled, $Y^{(u)} = \emptyset$, and Equation (4.6) simplifies to:

$\nabla L_\Theta = \mathbf{F}(Y = T, G) - \mathbb{E}_{p(Y \mid G, \Theta)}\, \mathbf{F}(Y, G) - \lambda \Theta$

The first term is the empirical feature value in the training set, the second term is the expectation of the feature value under the model, and the last term comes from the L2 regularization. Given that the expectation of the feature value can be computed efficiently, the L-BFGS algorithm [4] can be employed to optimize the objective in Equation (4.5) using this gradient; although the objective is not convex with respect to Θ when some labels are missing, we can expect to find a local maximum. When there are multiple training networks, we train the model by optimizing the sum of their log-likelihoods.
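For intuition, here is a toy, brute-force sketch of the fully labeled case (our illustration, not the paper's implementation): on a small candidate DAG the partition function can be enumerated exactly, so the gradient of Eq. (4.6) can be handed to an off-the-shelf L-BFGS optimizer. The two features and the tiny DAG are invented for illustration; the paper itself relies on belief propagation rather than exact enumeration.

```python
# Toy training sketch: exact enumeration replaces belief propagation, and
# scipy's L-BFGS maximizes the L2-penalized log-likelihood of Eq. (4.5).

import numpy as np
from itertools import product
from scipy.optimize import minimize

cands = {1: [1], 2: [1, 2], 3: [2], 4: [1, 3]}   # Y_i, as in Figure 1's DAG
truth = {1: 1, 2: 1, 3: 2, 4: 3}                 # a fully labeled tree
lam = 2.0                                        # L2 penalty

def features(Y):
    # Two invented features: a singleton one (number of self-loop roots)
    # and a pairwise one (do v2 and v4 pick the same parent?).
    return np.array([sum(Y[i] == i for i in Y), float(Y[2] == Y[4])])

configs = [dict(zip(cands, ys)) for ys in product(*cands.values())]

def neg_ll(theta):
    scores = np.array([features(Y) @ theta for Y in configs])
    return -(features(truth) @ theta
             - np.logaddexp.reduce(scores)) + 0.5 * lam * theta @ theta

def neg_grad(theta):
    scores = np.array([features(Y) @ theta for Y in configs])
    p = np.exp(scores - np.logaddexp.reduce(scores))
    expected = sum(pi * features(Y) for pi, Y in zip(p, configs))
    return -(features(truth) - expected) + lam * theta

res = minimize(neg_ll, np.zeros(2), jac=neg_grad, method="L-BFGS-B")
print("learned weights:", res.x)
```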

At fixed feature weights, the learning process requires an efficient inference algorithm for the marginal probability of every clique (the singletons and edges on which we define our potentials). Prediction, however, requires MAP inference to find the maximal joint probability. Loopy belief propagation (LBP) [12] and its variants, such as tree-based reparameterization (TRP) [29] and residual belief propagation [10], have been shown to achieve empirically good inference and to be more scalable than general-purpose LP solvers (e.g., CPLEX).

In the toy model in Figure 4, there are five different potentials g1, g2, g3, f4, f5. We learn the weight vector Θ = {θ1, . . . , θ5} from training data, and then find the optimal value of Y = {y1, . . . , y4} to predict the structure.

5 Experimental Results

We perform the experimental study on different tasks, ranging from extraction of named entity relationships to discovery of document relationships.¹

Evaluation Measure. Few studies have examined how to evaluate the quality of hierarchical relationship prediction. Accuracy on the predicted parent yi (Apar) and accuracy on the predicted relation pairs xij (Apair) are the two most natural evaluation criteria; they or their variants (precision, recall, etc.) are employed by most previous studies [26, 33, 30]. However, such measures only evaluate the prediction variables on each node or each edge in isolation, missing some aspects of the overall goodness of the structure. We take an example to illustrate.

Figure 5 lists a ground-truth structure and several different reconstruction results. Both (b) and (c) have the same Apar and Apair because only one node has an incorrectly predicted parent.

¹ Our code and data will be available online from http://www.cs.illinois.edu/homes/chiwang1/


[Figure omitted: the gold-standard tree and three reconstructions.]

Figure 5: Comparison of inferred structures against the gold-standard structure. (a), (b), and (c) are three possible prediction results for the true structure in (d). Green (light) nodes have correct parent predictions, while red (dark) nodes have wrong ones.

Table 2: Measurements for the structures in Figure 5

Structure | Panc | Ranc | F1anc | Apath | Apar
(a) flat tree | 1.00 | 0.60 | 0.75 | 0.60 | 0.60
(b) chain | 0.60 | 1.00 | 0.75 | 0.40 | 0.80
(c) inferred | 0.80 | 0.80 | 0.80 | 0.80 | 0.80

However, the chain is quite different from the gold-standard tree with two branches, and result (c) should be regarded as closer to the ground truth. We can see that one mistake may not only affect the parent of one node, but also distort the shape statistics of other nodes (e.g., their degrees and numbers of ancestors and descendants). In other words, different edges have different importance in preserving the shape of the tree, which is not reflected by the unweighted judgment of each prediction. Tree similarity measures for ontologies, such as tree edit distance [2], are not desirable either, because the edit operations do not apply here and their computation has quartic complexity w.r.t. the tree size.

We define a set of novel measures for quantitative evaluation of the quality of hierarchical relationship prediction, all computable in linear time. Formally, let T be the ground-truth structure for n linked objects V = {v1, . . . , vn}, and Y the prediction. We evaluate how well the structure is preserved in the prediction by examining two additional aspects: the ancestors and the path to the root.

• Precision and recall of ancestors, $P_{anc}$ and $R_{anc}$:

(5.7)  $P_{anc} = \frac{1}{n} \sum_{i=1}^{n} \delta[Anc_i(Y) \subseteq Anc_i(T)]$

(5.8)  $R_{anc} = \frac{1}{n} \sum_{i=1}^{n} \delta[Anc_i(T) \subseteq Anc_i(Y)]$

where $Anc_i(Y)$ and $Anc_i(T)$ stand for the ancestor set of node $v_i$ in the prediction and in the ground truth, respectively.

• Accuracy of the path to root, $A_{path}$:

(5.9)  $A_{path} = \frac{1}{n} \sum_{i=1}^{n} \delta[path_i(Y) = path_i(T)]$

where $path_i(Y)$ and $path_i(T)$ are the paths from node $v_i$ to its root in the prediction and in the ground truth, respectively. $A_{path}$ measures whether we can trace from each particular node to the root without any mistake; thus it is the most strict measure.

As a commonly used compromise between precision and recall, we can also define an F-value for the proposed ancestor measures as their harmonic mean. For the example in Figure 5, the proposed metrics take the values shown in Table 2. We can see that although (b) and (c) have the same accuracy on parent prediction, three of the four proposed measures imply that the inferred structure in (c) has better quality than the chain. Also, we notice that Apath is the most strict measure, and under it (b) is even worse than (a). This implies one predicted structure may be good in some aspects but bad in others, which reaffirms the necessity of using multiple measures for tree structure evaluation.
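A simple sketch of these measures; the input format (parent maps in which roots point to themselves, predictions forming a forest) is our assumption, and the subset tests read as ⊆, consistent with the values in Table 2:

```python
# Compute Panc, Ranc, F1anc, and Apath from predicted and true parent maps.

def ancestors_and_path(parent, i):
    path = [i]
    while parent[path[-1]] != path[-1]:      # walk up until a self-loop root
        path.append(parent[path[-1]])
    return set(path[1:]), tuple(path)        # ancestors exclude i itself

def tree_metrics(pred, truth):
    n = len(truth)
    p = r = a = 0
    for i in truth:
        anc_y, path_y = ancestors_and_path(pred, i)
        anc_t, path_t = ancestors_and_path(truth, i)
        p += anc_y <= anc_t                  # all predicted ancestors correct
        r += anc_t <= anc_y                  # all true ancestors recovered
        a += path_y == path_t                # exact path to the root
    P, R = p / n, r / n
    f1 = 2 * P * R / (P + R) if P + R else 0.0
    return {"Panc": P, "Ranc": R, "F1anc": f1, "Apath": a / n}
```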

The last thing we want to point out is that the hardness of the inference problem depends highly on the number of candidate parents of each node. Random guessing performs far below 0.5 in most cases, even when each node has only two or three candidate parents.

Algorithm Setting. We observe no obvious difference among the several variants of belief propagation algorithms. The reported results are based on LBP, and the L2 penalty parameter is λ = 2 unless specified otherwise. All features are normalized into [−1, 1].

5.1 Uncovering Family Tree Structure

We apply our method to an entity relation discovery problem. In the Natural Language Processing (NLP) community, this is sometimes studied as a slot filling task [15], i.e., answering questions like "who are the top employees of IBM?" Some relation types satisfy our definition of hierarchical relationship, e.g., manager-subordinate, parent organization-subsidiary, and country-state-city. We take the family relation (parent-child) as the case to study, and try to answer the following two questions: 1) does the proposed method work better than state-of-the-art NLP approaches to generic entity relation mining; and 2) how good is a joint model compared to a model that does not handle dependency rules, or that uses the rules only for post-processing?


Table 3: Potentials used in the family tree construction task

• homophily: vi and vyi live in the same location; mutual information of the two names and family keywords from web snippets.
• polarity: suffix comparison (Junior, I, III, IV, etc.).
• support pattern: co-occurrence of the two names and family keywords in web snippets; parent-child implied by Wikipedia infoboxes.
• forbidden pattern: child's birth year later than parent's death year + 1; non-parent-child implied by Wikipedia.
• attribute augment: same residence location of a person and a grandparent.
• label propagate: people who are likely to be siblings share the same parent.
• constraints: people of the same age have a lower possibility of sharing the same parent.

For clear demonstration, we define the task as automatically assembling the family tree from a set of named person entities. These named entities were extracted from two famous American families, the Kennedy family and the Roosevelt family, as listed in Wikipedia and Freebase [3]. We design potentials according to the map in Section 4.2; Table 3 lists the potentials we used. Given any pair of named entities, we first collected all the Web context sentences (snippets) returned by the Yahoo! search engine. We then encoded features based on analysis of these snippets, such as co-occurrence statistics, along with additional features from various discovered attributes including residence, age, birth date, death date, and other family members. The results of random guessing reflect the hardness of the problem.

We compare our method with three baselines: (1) NLP, the general relation mining approach based on the NLP techniques described in [5]; specifically, we applied a pattern matching algorithm to discover parent-child relation links among entities by analyzing the Web snippets. (2) Ranking SVM, a robust machine learning technique developed from the support vector machine (SVM) for ranking problems, which can handle only singleton features. (3) Ranking SVM + post-processing (PP). For Ranking SVM, we treat each node as a query, its parent as the relevant "document", and all the other candidates as non-relevant "documents". For post-processing, we encode some pairwise potentials as global constraints (e.g., one person cannot have multiple parents; siblings should share the same parent) in an Integer Linear Program (ILP) that maximizes the sum of the confidence values from Ranking SVM subject to these constraints; the detailed implementation is described in [20].

Table 4: Prediction performance for family tree construction

Train on Kennedy, test on Roosevelt:
Method | F1anc | Apath | Apar
Random | < 0.01 | < 0.01 | 0.0943
NLP | 0.1146 | 0.0333 | 0.1833
RankingSVM | 0.0667 | 0.0500 | 0.5000
RankingSVM+PP | 0.0667 | 0.0500 | 0.5833
CRF-Hier | 0.3439 | 0.2167 | 0.7167

Train on Roosevelt, test on Kennedy:
Method | F1anc | Apath | Apar
Random | < 0.01 | < 0.01 | 0.1313
NLP | 0.2625 | 0.1750 | 0.2500
RankingSVM | 0.4371 | 0.1500 | 0.3250
RankingSVM+PP | 0.4750 | 0.1500 | 0.3500
CRF-Hier | 0.4846 | 0.3500 | 0.4000

Table 4 shows the results on the two families, with 60 and 40 members respectively. The model optimized for hierarchical relationship discovery performs two to three times better on most measures than the general-purpose relation miner. Compared with the two-stage method that uses post-processing rules, the joint model better integrates the rules into the learning and inference framework. The margin is largest on the most strict measure, path accuracy (333% and 133%). This implies our method makes fewer mistakes at the key positions of the tree structure, where the chance of absorbing knowledge and applying regularization is higher. We also find that RankingSVM and RankingSVM+PP beat NLP in Apar, but are no better than NLP in F1anc and Apath in the first test case. Also, post-processing does not help much, because the confidence estimation from the prediction is not always reliable due to noise in the prediction features. As one example, the prediction component correctly identified "William Emlen Roosevelt" as the parent of "Philip James Roosevelt", but with low confidence, while it mistakenly identified "Theodore Roosevelt" as the parent of "George Emlen Roosevelt" with much higher confidence, due to their high mutual information value in Web snippets. Therefore, in the post-processing stage, based on the fact that "Philip James Roosevelt" and "George Emlen Roosevelt" are siblings, the label propagation rule mistakenly changed the parent of "Philip James Roosevelt" to "Theodore Roosevelt". In our model, the weights of the local features and the propagation rules are learned from the data, and the global optimization prevents this mistake.

5.2 Uncovering Online Discussion Structure


Table 5: Potentials used in the post reply structure task

• homophily: tf-idf cosine similarity cos(vi, vyi); recency of posting time.
• polarity: whether vyi is the first post of the thread; whether vyi is the last post before vi.
• support pattern: whether ayi's name appears in post vi, where ayi is the author of vyi.
• forbidden pattern: an author does not reply to himself.
• attribute augment: the average content of one post's children is similar to its parent: [yi = j] cos(vi, vyj).
• label propagate: similar posts reply to the same post: [yi = yj] cos(vi, vj).
• reciprocity: author A replies to B, motivated by B replying to A: [ayj = ai][ayi = aj].
• constraints: one author does not repeatedly reply to the same post; if B replied to A's post, A does not reply to B's earlier post: −[tyi < tj][ayi = aj][ayj = ai].

Online conversation structure is a popular topic studied by researchers in information retrieval, since the structure benefits many tasks, including keyword-based retrieval, expert finding, and question-answer discovery. We study the problem of finding the reply relationships among posts in online forums. The data are crawled from the Apple Discussion forum (http://discussions.apple.com/) and Google Earth Community (http://bbs.keyhole.com/). From each forum we crawled around 8000 threads; each thread contains 6-7 posts on average, although some threads contain as many as 250 posts. The posts in each thread can be organized as a tree based on their reply relationships. The task is to reconstruct the reply structure for threads with no labels, given a few threaded posts with labeled reply relationships. Although we use data from forums where the reply relationship is actually recorded by the system, the real application scenario is predicting the structure of online conversations whose structure is unknown. We therefore study the following questions: 1) how much labeled data is needed to achieve good performance; and 2) how well does the model adapt when trained on one forum and tested on another?

We list the features of each type in Table 5. More intuition behind the features can be found in [31]. The competitor we choose is Ranking SVM, used in [26] for this task; again, it can handle only singleton features. We also compare with a naive baseline that always predicts a chain structure.

To answer the first question, we fix the test data at 2000 threads and vary the training data size in two different ways. First, we use all the labels from each thread, but vary the number of training threads from 50 to 2500. Second, we fix the number of training threads at 1000, and change the number of labels used from each thread from 3 to 11.

[Figure omitted: curves of F1anc and Apath for CRF-Hier and Ranking SVM. (a) Training on fully labeled threads; x-axis: number of threads in the training set (0-2500). (b) Training on partially labeled threads; x-axis: number of samples used from each training thread (0-20). Chain-structure baseline: F1anc = 0.7435, Apath = 0.5647.]

Figure 6: Performance with varying training data size (Apple)

From Figure 6(a), we find that with a small training set of 50 labeled threads, CRF-Hier already achieves encouraging performance. The margin is significant: even the naive baseline of predicting that every post replies to the previous post gives 0.74 in F1anc, so compared with Ranking SVM's 0.80 and CRF-Hier's 0.86, CRF-Hier doubles Ranking SVM's margin over the naive baseline. As more training data is added, the testing performance is also more stable than Ranking SVM's. Figure 6(b) shows that when the labels for each tree are very incomplete, CRF-Hier degrades to its competitor's performance because the pairwise features cannot be well exploited. When the labels become reasonably sufficient to characterize the structural dependency (5 posts per thread in this case), CRF-Hier shows its superiority (increasing the margin over the baseline by 32% in Apath). Although CRF-Hier has more feature weights to learn, the L2 regularization mitigates overfitting, and the model works well even when the training data size is small.

To answer the second question, we randomly select 2000 threads for training and 2000 threads for testing from each of the two datasets, and perform a cross-domain experiment with the four combinations of train/test sets. Apple Discussion is a technical computer forum, while Google Earth Community focuses on entertainment. From the comparative results in Table 6, we find that, with the help of the pairwise features, CRF-Hier generalizes better than Ranking SVM, which relies on the singleton features only.

In conclusion, CRF-Hier consistently outperforms the baseline as the size or the domain of the training data varies, and it adapts and generalizes well.


Table 6: Cross-domain evaluation (CRF-Hier / Ranking SVM)

(a) F1anc
Train \ Test | Apple | Google Earth
Apple | 0.8476 / 0.8136 | 0.8233 / 0.7855
Google Earth | 0.8383 / 0.8017 | 0.8186 / 0.8099

(b) Apath
Train \ Test | Apple | Google Earth
Apple | 0.7326 / 0.6722 | 0.6909 / 0.6325
Google Earth | 0.7143 / 0.6548 | 0.6797 / 0.6610

All improvements are statistically significant with p < 0.05.

5.3 Uncovering the Academic Family Tree

As a final experiment, we show an example in a domain involving no text data. We consider the task of academic family tree discovery according to the advisor-advisee relationship, which needs to be inferred from research publication networks. We use the DBLP Computer Science Bibliography database, which consists of 871,004 authors and 2,365,362 papers with time information (from 1970 to 2009). To test the accuracy of the discovered advisor-advisee relationships, we adopt labeled data from [30], crawled from the Mathematics Genealogy Project (http://www.genealogy.math.ndsu.nodak.edu) and the AI Genealogy Project (http://aigp.eecs.umich.edu).

We define the potentials according to the features used in [30]: the average Kulczynski and IR measures over the estimated advising period as homophily and polarity potentials, and the constraint that one cannot advise another before graduating as a pairwise potential. We compare with the unsupervised method TPFG [30], which is equivalent to giving every potential equal weight without learning from labeled data. We also compare with a CRF model with singleton features only. This allows a decomposed study of the importance of learning and of the constraint, respectively.

The results are shown in Table 7. Note that we now evaluate with stricter measures than the Apair used in [30]. The results show that learning helps when the same set of features is used, yet the constraint encoded by the pairwise potentials is critical: without it, the CRF performs even worse than the unsupervised model TPFG, which can handle the constraint. With the constraint added, CRF-Hier increases the Apath, Apar, and F1anc of TPFG by 42%, 27%, and 8%, respectively.

Table 7: Performance for the academic family tree

Method | F1anc | Apath | Apar
TPFG | 0.5241 | 0.2276 | 0.3548
CRF-Singleton | 0.4684 | 0.1648 | 0.3226
CRF-Hier | 0.5681 | 0.3226 | 0.4516

The overall performance on this dataset is lower than on the forum dataset. One reason is that we use only a small number of features, for fair comparison with the unsupervised method. Another reason is that the dataset is not fully labeled: only a scattered part of the forest is known, and we do not have even one fully labeled tree. The interaction of the variables is thus limited to a short range in the training data, and the trained model has less power in utilizing such interactions.

6 Conclusions and Future Work

We define and tackle the problem of hierarchical relationship discovery for linked objects, studying it in the supervised setting. We propose a discriminative probabilistic model optimized for tree structure learning and prediction, which can handle both local features and knowledge propagation rules. We categorize the common features and rules, which can be encoded in simple, unified forms. By demonstrating on various specific applications, we validate the effectiveness and generality of our approach.

While we have categorized the features for tree structures, it remains a promising open problem how to perform feature extraction and selection automatically for these different types of features. Second, one may explore other learning frameworks, such as max-margin Markov networks [27] and structured SVMs [28]. Finally, more research issues need to be studied when some assumption is violated, e.g., when the linked objects do not have a cycle-free order, or the relations do not form a strict tree structure.

7 Acknowledgements

This work was supported by the U.S. National Science Foundation grant IIS-0905215, the U.S. Army Research Laboratory under Cooperative Agreement No. W911NF-09-2-0053 (NS-CTA), the U.S. NSF CAREER Award under Grant IIS-0953149, the U.S. NSF EAGER Award under Grant No. IIS-1144111, and the U.S. DARPA Broad Operational Language Translation program. Chi Wang was supported by a Microsoft Ph.D. Fellowship. The authors thank Junfeng He for helpful discussions.


References

[1] E. Agichtein and L. Gravano. Snowball: extracting relations from large plain-text collections. In Proc. 2000 ACM Conf. on Digital Libraries (DL '00), 2000.

[2] P. Bille. A survey on tree edit distance and related problems. Theor. Comput. Sci., 337:217-239, June 2005.

[3] K. Bollacker, R. Cook, and P. Tufts. Freebase: A shared database of structured general human knowledge. In AAAI '08, 2008.

[4] R. H. Byrd, J. Nocedal, and R. B. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Math. Program., 63:129-156, 1994.

[5] Z. Chen, S. Tamang, A. Lee, X. Li, W.-P. Lin, J. Artiles, M. Snover, M. Passantino, and H. Ji. CUNY-BLENDER TAC-KBP2010 entity linking and slot filling system description. In Proc. 2010 NIST Text Analytics Conference (TAC '10), 2010.

[6] A. Clauset, C. Moore, and M. E. J. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453:98-101, May 2008.

[7] A. Culotta, A. McCallum, and J. Betz. Integrating probabilistic extraction models and data mining to discover relations and patterns in text. In Proc. HLT Conf. of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '06), 2006.

[8] L. Di Caro, K. S. Candan, and M. L. Sapino. Using tagFlake for condensing navigable tag hierarchies from tag clouds. In KDD '08, 2008.

[9] C. P. Diehl, G. Namata, and L. Getoor. Relationship identification for social network discovery. In AAAI '07, 2007.

[10] G. Elidan, I. McGraw, and D. Koller. Residual belief propagation: Informed scheduling for asynchronous message passing. In Proc. 2006 Conf. on Uncertainty in AI (UAI '06), 2006.

[11] L. C. Freeman. Uncovering organizational hierarchies. Comput. Math. Organ. Theory, 3:5-18, March 1997.

[12] B. J. Frey. Graphical Models for Machine Learning and Digital Communication. MIT Press, Cambridge, MA, USA, 1998.

[13] Z. GuoDong, S. Jian, Z. Jie, and Z. Min. Exploring various knowledge in relation extraction. In ACL '05, 2005.

[14] J. Hammersley and P. Clifford. Markov fields on finite graphs and lattices, 1971. Unpublished.

[15] H. Ji and R. Grishman. Knowledge base population: Successful approaches and challenges. In ACL '11, 2011.

[16] R. M. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85-103. 1972.

[17] C. Kemp and J. B. Tenenbaum. The discovery of structural form. Proceedings of the National Academy of Sciences of the United States of America, July 2008.

[18] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML '01, pages 282-289, 2001.

[19] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In KDD '09, 2009.

[20] Q. Li, S. Anzaroot, W. Lin, X. Li, and H. Ji. Joint inference for cross-document information extraction. In CIKM '11, 2011.

[21] A. S. Maiya and T. Y. Berger-Wolf. Inferring the maximum likelihood hierarchy in social networks. In Proc. 2009 Int. Conf. on Computational Science and Engineering, 2009.

[22] A. McCallum, X. Wang, and A. Corrada-Emmanuel. Topic and role discovery in social networks with experiments on Enron and academic email. Journal of Artificial Intelligence Research (JAIR), 30:249-272, 2007.

[23] M. G. Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. In KDD '10, 2010.

[24] D. Roth and W.-t. Yih. Probabilistic reasoning for entity & relation recognition. In Proc. 2002 ICCL Conf. on Computational Linguistics (COLING '02), 2002.

[25] R. Rowe and S. Hershkop. Automated social hierarchy detection through email network analysis. In Proc. 2007 ACM Workshop on Web Mining and Social Network Analysis (WebKDD/SNA-KDD '07), 2007.

[26] J. Seo, W. Croft, and D. Smith. Online community search using thread structure. In CIKM '09, 2009.

[27] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS '03, 2003.

[28] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML '04, 2004.

[29] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization for approximate estimation on loopy graphs. In NIPS '01, 2001.

[30] C. Wang, J. Han, Y. Jia, J. Tang, D. Zhang, Y. Yu, and J. Guo. Mining advisor-advisee relationships from research publication networks. In KDD '10, 2010.

[31] H. Wang, C. Wang, C. Zhai, and J. Han. Learning online discussion structures by conditional random fields. In SIGIR '11, 2011.

[32] Y. Weiss and W. T. Freeman. On the optimality of solutions of the max-product belief propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47:723-735, 2001.

[33] X. Yin and S. Shah. Building taxonomy of web search intents for name entity queries. In WWW '10, 2010.

[34] F. Zhang, S. Shi, J. Liu, S. Sun, and C.-Y. Lin. Nonlinear evidence fusion and propagation for hyponymy relation mining. In Proc. 2011 ACL Conf. on Human Language Technologies (ACL-HLT '11), 2011.

[35] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y. Ma. 2D conditional random fields for web information extraction. In ICML '05, 2005.