
Relational Machine Learning: Applications and Models

Bhushan Kotnis

Heidelberg University

Table of contents

1. Introduction

2. Models


Introduction

Networks and Graphs

• Social Networks: Link Prediction, Relevant Ads, Feed Recommendation.

• Biological Networks: Gene Ontology, Protein Interaction Networks, Cellular Networks.

• Financial Networks: Assessing risk and exposure, providing information, detecting fraud.

• Knowledge Graphs: Background knowledge for AI, intelligent search engines.


Social Networks

• Problem: rank ads/feeds, suggest relevant articles.

• Users are connected to one another and share interests, demographic data, and news preferences.

• Linked machine-learning problem: predict ads, article recommendations, feeds, etc. using a unified model.


Genetic Regulatory Network

• Gene Regulatory Network: a molecular interaction network; genes interact with proteins and other molecules.

• Problem: infer the family and function of a gene from its interactions; identify mutations leading to diseases.

• Link prediction problem: a linked ML problem, because each prediction depends on other predictions.


Financial Networks

• Interconnected banks, companies, commodities, products, events, people, and locations.

• Problem: infer missing connections to estimate exposure.

• Problem: reasoning using path correlations.


Knowledge Graphs


The KGC Problem

• Knowledge Graph: a set $G$ of triples $(s, r, t)$, where $s, t \in E$ (entities) and $r \in R$ (relations).

• Ranking Problem: given a query $(s, r, ?)$ and a target set $e_1, e_2, \ldots, e_n$, rank the targets by the plausibility of relation $r$ existing between $s$ and $e_i$.

• (Frankfurt, cityliesonriver, ?) Choices: Rhine, Mosel, Thames, Main, Hudson.

• (user_id_201345, user_prefers_genre, ?) Choices: Fiction, Non-Fiction, Horror, Romance, Fantasy.

• (TP53, disease, ?) Choices: none, Breast Cancer, Liver Cancer, Lung Cancer.

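To make the ranking setup concrete, here is a minimal Python sketch; the scoring function is a random stub standing in for any trained plausibility model $f(s, r, t)$, and the query and candidate names are just the slide's river example:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(s, r, t):
    # Stand-in for a trained plausibility score f(s, r, t); any of the
    # models discussed below (RESCAL, TransE, ...) could be plugged in here.
    return rng.random()

s, r = "Frankfurt", "cityliesonriver"
candidates = ["Rhine", "Mosel", "Thames", "Main", "Hudson"]

# Answer the query (s, r, ?) by ranking targets by descending plausibility.
ranked = sorted(candidates, key=lambda e: f(s, r, e), reverse=True)
print(ranked)
```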

Models

Recommendation Engines

• Recommend movies. $u_i$: vector representing user $i$; $v_j$: vector representing product $j$; $u_i, v_j \in \mathbb{R}^d$.

• Minimize $\sum_{i,j} \left( r_{i,j} - u_i^\top v_j \right)^2 + \text{regularizer}$.

• If the rating $r_{i,j}$ is very high, then we want high similarity (dot product) between the user and product vectors.

• These vectors are called latent factors. They are not interpretable; they could correspond to genres, topics, or themes. They help generalization.

• Initialize them randomly and learn them using SGD. They capture the structure of the rating matrix.

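A minimal NumPy sketch of this factorization with SGD, assuming a small toy set of observed ratings; the dimensionality, learning rate, and regularizer weight are illustrative values, not tuned ones:

```python
import numpy as np

# Toy observed ratings: (user i, item j) -> rating r_ij.
ratings = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 4.0, (2, 1): 2.0}
n_users, n_items, d = 3, 2, 16
lr, reg = 0.05, 0.01

rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((n_users, d))  # user latent factors u_i
V = 0.1 * rng.standard_normal((n_items, d))  # item latent factors v_j

for epoch in range(200):
    for (i, j), r_ij in ratings.items():
        u_i, v_j = U[i].copy(), V[j].copy()
        err = r_ij - u_i @ v_j  # residual r_ij - u_i^T v_j
        # SGD step on (r_ij - u_i^T v_j)^2 + reg * (|u_i|^2 + |v_j|^2)
        U[i] += lr * (err * v_j - reg * u_i)
        V[j] += lr * (err * u_i - reg * v_j)

print(round(U[0] @ V[0], 2))  # reconstruction of the observed rating 5.0
```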

RESCAL Model

• Capture graph structure. The graph has multiple relations: users × products, users × demographics, products × categories.

• Solution: one matrix factorization problem for every relation.

• $f(s, r, t) = x_s^\top W_r\, x_t$, where $x_s, x_t \in \mathbb{R}^d$ and $W_r \in \mathbb{R}^{d \times d}$.

• Max-margin loss: $\max\left[0,\; 1 - \left( f(s, r, t) - f(s, r, t') \right)\right]$. One can also use a softmax or an $\ell_2$ loss, as in collaborative filtering.

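As a sketch, the RESCAL score is a few lines of NumPy; the entity vectors and relation matrices here are random placeholders rather than trained factors:

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_relations, d = 5, 2, 4

X = rng.standard_normal((n_entities, d))      # one latent vector x_e per entity
W = rng.standard_normal((n_relations, d, d))  # one d x d matrix W_r per relation

def rescal_score(s, r, t):
    # Bilinear score f(s, r, t) = x_s^T W_r x_t.
    return X[s] @ W[r] @ X[t]

print(rescal_score(0, 1, 3))
```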

Interpretation

Figure 1: RESCAL as a neural network with three latent factors ($d = 3$); the output node is the score $f(s, r, t)$.

• $x_s \otimes x_t$ (the outer product $x_s x_t^\top$): all possible latent-factor interactions, a $d \times d$ matrix. The matrix $W_r$ acts like a mask, boosting or suppressing pairwise interactions.

• Entities appear in multiple relations as subjects or objects. Information sharing!

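The mask interpretation can be checked numerically: summing $W_r$ over the outer product of the entity vectors recovers the bilinear score. A small self-contained sketch with random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3  # three latent factors, as in Figure 1
x_s, x_t = rng.standard_normal(d), rng.standard_normal(d)
W_r = rng.standard_normal((d, d))

# The outer product x_s x_t^T enumerates all d*d pairwise latent-factor
# interactions; W_r re-weights (boosts or suppresses) each of them.
interactions = np.outer(x_s, x_t)
masked_score = np.sum(W_r * interactions)

assert np.isclose(masked_score, x_s @ W_r @ x_t)  # same bilinear score
```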

Bilinear Diag. and TransE Models

• RESCAL [2]: requires $O(N_e d + N_r d^2)$ parameters. Scalability issues for large $N_r$.

• Bilinear Diag [4]: force $W_r$ to be a diagonal matrix. This assumes symmetric relations. Why? Memory complexity: $O(N_e d + N_r d)$.

• TransE [1]: $f(s, r, t) = -\lVert (x_s + x_r) - x_t \rVert^2$.

• TransE: can it model all types of relations? Why?

• Takeaway: make sure parameters are shared, either through a shared representation or a shared layer.

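A sketch contrasting the two scoring functions, with random embeddings standing in for trained ones; the symmetry check makes the Bilinear Diag caveat explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_relations, d = 5, 2, 4
X = rng.standard_normal((n_entities, d))   # entity embeddings
R = rng.standard_normal((n_relations, d))  # relation embeddings / diagonals

def bilinear_diag_score(s, r, t):
    # Restricting W_r to a diagonal gives x_s^T diag(x_r) x_t, a three-way
    # elementwise product; note the score is symmetric in s and t.
    return np.sum(X[s] * R[r] * X[t])

def transe_score(s, r, t):
    # TransE: relations act as translations, f = -||(x_s + x_r) - x_t||^2.
    diff = X[s] + R[r] - X[t]
    return -np.dot(diff, diff)

s, r, t = 0, 1, 3
assert np.isclose(bilinear_diag_score(s, r, t), bilinear_diag_score(t, r, s))
print(transe_score(s, r, t))
```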

Negative Sampling

• How do we generate negative samples? Negatives may not be provided.

• Closed-World Assumption: if a triple is not a positive, then it must be a negative.

• Max-margin: $\max\left[0,\; 1 - \left( f(s, r, t) - f(s, r, t') \right)\right]$. Softer negatives: $(s, r, t')$ only has to be less plausible than $(s, r, t)$.

• Softmax loss: $\log\left(1 + \exp(-y_i\, f(s_i, r_i, t_i))\right)$. Negatives are treated as 'really' negative.

• The number of negative samples used during training affects performance; see [3].

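A sketch of target corruption under the closed-world assumption, together with the two losses from the slide; the triples and scores below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities = 100
positives = {(0, 0, 7), (3, 1, 42)}  # observed (s, r, t) triples

def corrupt(s, r, t, k=5):
    # Closed-world negatives: replace the target with random entities and
    # treat the corrupted triples (s, r, t') as negatives.
    negs = []
    while len(negs) < k:
        t_prime = int(rng.integers(n_entities))
        if (s, r, t_prime) not in positives:  # skip known positives
            negs.append((s, r, t_prime))
    return negs

def max_margin_loss(f_pos, f_neg):
    # max[0, 1 - (f(s,r,t) - f(s,r,t'))]: only asks that the positive
    # outscore the negative by a margin, i.e. a "softer" notion of negative.
    return max(0.0, 1.0 - (f_pos - f_neg))

def softmax_loss(f, y):
    # log(1 + exp(-y * f)): treats sampled negatives (y = -1) as truly false.
    return np.log1p(np.exp(-y * f))

for neg in corrupt(0, 0, 7):
    print(neg, max_margin_loss(f_pos=1.3, f_neg=0.2), softmax_loss(0.2, -1))
```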

Deep Learning

Figure 2: At each step, the RNN consumes both the entity and relation vectors of the path (here Microsoft -isBasedIn-> Seattle -locatedIn-> Washington -locatedIn-> USA, scored 0.94 via a similarity metric against the target relation countryofHQ). The entity representation can be obtained from its types. The path vector $y_\pi$ is the last hidden state. The parameters of the RNN and the relation embeddings are shared across all query relations. The dot product between the final representation of the path and the query relation gives a confidence score, with higher scores indicating that the query relation exists between the entity pair.

2 Background

In this section, we introduce the compositional model (Path-RNN) of Neelakantan et al. (2015). The Path-RNN model takes as input a path between two entities and infers new relations between them. Reasoning is performed non-atomically about conjunctions of relations in an arbitrary-length path by composing them with a recurrent neural network (RNN). The representation of the path is given by the last hidden state of the RNN, obtained after processing all the relations in the path.

Let $(e_s, e_t)$ be an entity pair and $S$ denote the set of paths between them. The set $S$ is obtained by doing random walks in the knowledge graph starting from $e_s$ until $e_t$. Let $\pi = \{e_s, r_1, e_1, r_2, \ldots, r_k, e_t\} \in S$ denote a path between $(e_s, e_t)$. The length of a path is the number of relations in it, hence $\mathrm{len}(\pi) = k$. Let $y_{r_t} \in \mathbb{R}^d$ denote the vector representation of $r_t$. The Path-RNN model combines all the relations in $\pi$ sequentially using an RNN, with an intermediate representation $h_t \in \mathbb{R}^h$ at step $t$ given by

$$h_t = f\left(W_{hh}^r h_{t-1} + W_{ih}^r y_{r_t}\right). \quad (1)$$

$W_{hh}^r \in \mathbb{R}^{h \times h}$ and $W_{ih}^r \in \mathbb{R}^{d \times h}$ are the parameters of the RNN. Here $r$ denotes the query relation. Path-RNN has a specialized model for predicting each query relation $r$, with separate parameters $(y_{r_t}^r, W_{hh}^r, W_{ih}^r)$ for each $r$. $f$ is the sigmoid function. The vector representation of path $\pi$ ($y_\pi$) is the last hidden state $h_k$. The similarity of $y_\pi$ with the query relation vector $y_r$ is computed as the dot product between them:

$$\mathrm{score}(\pi, r) = y_\pi \cdot y_r \quad (2)$$

Pairs of entities may have several paths connecting them in the knowledge graph (Figure 1b). Path-RNN computes the probability that the entity pair $(e_s, e_t)$ participates in the query relation $r$ by

$$P(r \mid e_s, e_t) = \max_{\pi \in S} \sigma\left(\mathrm{score}(\pi, r)\right), \quad (3)$$

where $\sigma$ is the sigmoid function.

Path-RNN and other models, such as the Path Ranking Algorithm (PRA) and its extensions (Lao et al., 2011; Lao et al., 2012; Gardner et al., 2013; Gardner et al., 2014), are impractical for downstream applications, since they require training and maintaining a model for each relation type. Moreover, parameters are not shared across multiple target relation types, leading to a large number of parameters to be learned from the training data.

In Equation (3), the Path-RNN model selects the maximum-scoring path between an entity pair to make a prediction, possibly ignoring evidence from other important paths. Not only is this a waste of computation (since we have to compute a forward pass for all the paths anyway), but the relations in all other paths also receive no gradient updates during training, as the max operation returns zero gradient for every path except the maximum-scoring one. This is especially ineffective during the initial stages of training, when the maximum-scoring path is essentially random.

Source: Das et al. (2016). The RNN generates a representation for the path; the similarity between the path representation and the query relation indicates whether the path supports the query.

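A minimal NumPy sketch of Path-RNN scoring (Equations 1-3), simplified to relation vectors only (entity vectors are omitted, and the paper's per-query-relation parameters are instantiated here for a single query relation); all embeddings are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d = h = 8          # relation-embedding size and RNN hidden size
n_relations = 10

Y = 0.1 * rng.standard_normal((n_relations, d))  # relation vectors y_r
W_hh = 0.1 * rng.standard_normal((h, h))         # RNN parameters; in the paper
W_ih = 0.1 * rng.standard_normal((d, h))         # these are per query relation

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def path_vector(path_relations):
    # Eq. (1): h_t = f(W_hh h_{t-1} + W_ih y_{r_t}); the path vector y_pi
    # is the last hidden state h_k.
    hidden = np.zeros(h)
    for rel in path_relations:
        hidden = sigmoid(W_hh.T @ hidden + W_ih.T @ Y[rel])
    return hidden

def predict(paths, query_rel):
    # Eq. (2): score each path by its dot product with the query relation;
    # Eq. (3): take the max over all paths between the entity pair.
    scores = [path_vector(p) @ Y[query_rel] for p in paths]
    return sigmoid(max(scores))

paths = [[1, 4, 2], [3, 7]]  # e.g. isBasedIn -> locatedIn -> locatedIn
print(predict(paths, query_rel=0))
```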

Questions

I am convinced that the crux of the problem of learning is recognizing relationships and being able to use them.

Christopher Strachey in a letter to Alan Turing, 1954.


References I

[1] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2787-2795. Curran Associates, Inc., 2013.

[2] M. Nickel, V. Tresp, and H.-P. Kriegel. A three-way model for collective learning on multi-relational data. In ICML, 2011.

References II

[3] T. Trouillon, C. R. Dance, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard. Knowledge graph completion via complex tensor factorization. arXiv preprint arXiv:1702.06879, 2017.

[4] B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.