TRANSCRIPT
Karsten Borgwardt, Christoph Lippert and Nino Shervashidze: Biological Network Analysis, Page 1
Link Prediction
Karsten Borgwardt, Christoph Lippert and Nino Shervashidze
Interdepartmental Bioinformatics Group, MPI for Biological Cybernetics, MPI for Developmental Biology
Link prediction
Definition: Given two nodes x and x′, should they be connected by an edge?
Unsupervised versus supervised
Supervised: We are given a training set of edges.
Unsupervised: No such training set is available.
Similarity-score versus cluster-based
Similarity-based: Nodes are connected if they are similar.
Cluster-based: Nodes from the same cluster show similar connectivity patterns.
Section 1:Similarity-score based link prediction
Similarity-based link prediction
Unsupervised link prediction: Direct method; Unsupervised link prediction using kernel methods
Supervised link prediction: Basic scheme; Protein interaction prediction
Unsupervised link prediction
Introduction to unsupervised network inference
Direct approach
Statistical interpretation
Network inference by kernel-based dependence maximization
NETHSIC
Experiments
Social network analysis
Conclusions
Unsupervised network inference
Given set of objects described by their attributes xi ∈ X
Find a set E of m edges e(i, j) that correspond to interactions
Example: social network
Objects are people
Attribute is the occupation
Target network:
Who is friends with whom?
A direct approach:
Measure the pairwise distances d(xi, xj)
Iteratively connect the least distant pair by an edge
Direct approach
Measure the pairwise distances d(xi, xj) induced by a kernel k̃(xi, xj) on the centered attributes
Iteratively connect the least distant pair by an edge:

argmin_{e′} Σ_{(i,j)∈E∪{e′}} d(xi, xj)
= argmin_{e′} Σ_{(i,j)∈E∪{e′}} k̃(i, i) + k̃(j, j) − 2 k̃(i, j)
= argmin_{e′} Σ_{i,j} K̃ .∗ (D − A_{E∪{e′}})
= argmin_{e′} tr(K̃ L_{E∪{e′}})
= argmin_{e′} tr(HKH L_{E∪{e′}})
= argmax_{e′} tr(HKH (aI − L_{E∪{e′}})¹)
= argmax_{e′} 1/(n−1)² · tr(HKH L^{1-step}_{E∪{e′}})
= argmax_{e′} HSIC(K, L^{1-step}_{E∪{e′}})

A: adjacency matrix
D: diagonal matrix holding the degree of each node; D(i, i) = Σ_j A(i, j)
L: graph Laplacian
L^{p-step}: p-step random walk kernel
So the direct approach iteratively maximizes the Hilbert-Schmidt independence criterion between a kernel on the attributes and a 1-step random walk kernel on the nodes in the network.
HSIC
Hilbert-Schmidt independence criterion (Gretton et al., 2005)
Let F and G be RKHSs on X and Y with mappings φ : X → F and ψ : Y → G
HSIC is a measure of dependence between F and G:

HSIC(F, G, Pr_xy) := ‖C_xy‖²_HS

For pairs of finite samples X, Y an empirical estimate of HSIC can be computed in terms of kernels:

HSIC(K, L) := 1/(n−1)² · tr(HKHL)

where K_ij = 〈φ(xi), φ(xj)〉, L_ij = 〈ψ(yi), ψ(yj)〉, H_ij = δ_ij − n⁻¹
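The empirical estimate HSIC(K, L) = tr(HKHL)/(n−1)² can be computed directly; the following is a minimal sketch in plain Python (helper names are our own, not from the slides), using nested lists for the kernel matrices.

```python
def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def hsic(K, L):
    """Empirical HSIC between two n x n kernel matrices K and L."""
    n = len(K)
    # Centering matrix H with H_ij = delta_ij - 1/n
    H = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]
    HKHL = matmul(matmul(matmul(H, K), H), L)
    trace = sum(HKHL[i][i] for i in range(n))
    return trace / (n - 1) ** 2
```

Note that for a constant kernel matrix L the estimate vanishes, since the centering matrix H annihilates constant matrices.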
⇒ The direct approach maximizes the dependence between representations of the objects in the spaces induced by a kernel on the attributes and a 1-step random walk kernel on the network.
NETHSIC
NETHSIC exploits the fact that HSIC can be estimated using kernels.
K(i, j) = k(xi, xj) attribute kernel
LE(i, j) = lE(xi, xj) node kernel
argmax_{E ⊂ (V×V) ∧ |E| = m} 1/(n−1)² · tr(HKH L_E)
O(n^m) possible choices ⇒ use greedy selection of m edges
real-world networks are often sparse ⇒ do greedy forward selection of edges
Input: the set of nodes V, the number of edges m, the attribute kernel k and the node kernel l
Output: a subset E of V × V of size m

E ← ∅
repeat
  e ← argmax_{e′ ∈ V×V} tr(HKH L_{E∪{e′}})
  E ← E ∪ {e}
until |E| = m

Algorithm 1: NETHSIC forward selection
NETHSIC – node kernels LE
Given set of objects described by their attributes xi ∈ X
Find a set E of m edges e(i, j) that correspond to interactions
Who is friends with whom?
What happens if we look at a different kind of relation between the objects?
Who has a trade relation with whom?
NETHSIC is kernel-based
Network topology is defined by the node kernel L_E; here the 1-step random walk does not fit
Define a node kernel L_E expressing prior knowledge about the network structure
NETHSIC – node kernels LE
The choice of node kernel L_E defines the topology of the network:

argmax_{E ⊂ (V×V) ∧ |E| = m} 1/(n−1)² · tr(HKH L_E)

1-step: (aI − L)¹ : similar xi are connected
Laplacian: L = D − A : dissimilar xi are connected
degree: 〈δ(i), δ(j)〉 : similar xi have similar degrees
A²: A² : similar xi share many neighbors
closeness: 〈C_C(i), C_C(j)〉 : similar xi have similar closeness centrality
betweenness: 〈C_B(i), C_B(j)〉 : similar xi have similar betweenness centrality

C_C(i) = (n−1)⁻¹ Σ_{t∈V\{i}} d_G(i, t): the average shortest path length d_G between i and all other nodes in G
C_B(i) = Σ_{s≠i≠t∈V, s≠t} σ_st(i)/σ_st: the fraction of shortest paths σ_st between s and t that pass through i
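The two centrality measures above can be computed by breadth-first search on an unweighted graph. The sketch below is our own illustrative implementation following the slide's definitions (the betweenness sum runs over ordered pairs s ≠ t, so each unordered pair is counted twice).

```python
from collections import deque

def bfs(adj, s):
    """Return (dist, sigma): shortest-path lengths and path counts from s."""
    n = len(adj)
    dist = [None] * n
    sigma = [0] * n
    dist[s], sigma[s] = 0, 1
    q = deque([s])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if dist[w] is None:
                dist[w] = dist[v] + 1
                q.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]   # every shortest path to v extends to w
    return dist, sigma

def closeness(adj, i):
    """Average shortest path length from i, as defined on the slide."""
    dist, _ = bfs(adj, i)
    return sum(d for d in dist if d) / (len(adj) - 1)

def betweenness(adj, i):
    """Sum over ordered pairs s != t of the fraction of shortest s-t paths through i."""
    n = len(adj)
    dists, sigmas = zip(*(bfs(adj, s) for s in range(n)))
    total = 0.0
    for s in range(n):
        for t in range(n):
            if s == t or s == i or t == i:
                continue
            if (dists[s][t] is not None and dists[s][i] is not None
                    and dists[i][t] is not None
                    and dists[s][i] + dists[i][t] == dists[s][t]):
                total += sigmas[s][i] * sigmas[i][t] / sigmas[s][t]
    return total
```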
Experiments
Countries trade data (Wassermann et al., 1994)
24 countries
3 attributes: population size, GNP per capita, energy usage
reference network: trade relations of basic manufactured goods
experimental setup:
linear kernel on each attribute
set m to the maximum number of edges
rank edges by order of insertion
compute the area under the ROC curve using the reference network
[Bar chart: area under the ROC curve (0% to 90%) per attribute (population size, GNP per capita, energy consumption) for the 1-step random walk, Laplacian, degree, and mutual information kernels, with < 5% and > 95% quantile markers.]
The degree kernel often shows the best results
Some results are below the 5% quantile
Often it is not desirable to connect the most similar nodes
Conclusions
Kernel method for unsupervised network inference (NETHSIC)
Statistically motivated
High flexibility through the choice of node kernel L_E, which can define complex network topologies
Allows for a statistical interpretation of direct approaches
In real-world networks it is not always desirable to connect the most similar objects
Future work: use NETHSIC for network completion
argmax_{E ⊂ (V×V) ∧ |E| = m} 1/(n−1)² · tr(HKH L_E)
Supervised approaches
Setting:
We are now given a training set of edges E_training
We try to infer a rule, a classifier, from this set E_training that allows us to predict edges on the test set E_test.

Ingredients:
a similarity measure or metric for two pairs of nodes
a set of negative examples of non-interacting nodes
a classifier that turns these similarity scores into predictions
Pairwise similarity measures
Tensor pairwise kernel (Ben-Hur and Noble, ISMB 2005)
Given two pairs of nodes (a, b) and (c, d):

k_tensor((a, b), (c, d)) = k_nodes(a, c) k_nodes(b, d) + k_nodes(a, d) k_nodes(b, c)    (1)

This kernel quantifies the similarity of the source and target nodes in both edges, for both directions.
k_nodes is a kernel that measures the similarity of two nodes, just like the ones that are used for unsupervised link prediction.
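Equation (1) translates directly into code. In this sketch k_nodes is an illustrative linear kernel on attribute vectors, an assumption of ours, not the kernel used by Ben-Hur and Noble.

```python
def k_nodes(x, y):
    """Linear node kernel on attribute vectors (illustrative assumption)."""
    return sum(a * b for a, b in zip(x, y))

def k_tensor(pair1, pair2):
    """Tensor pairwise kernel of Eq. (1): source/target similarity in both directions."""
    (a, b), (c, d) = pair1, pair2
    return k_nodes(a, c) * k_nodes(b, d) + k_nodes(a, d) * k_nodes(b, c)
```

By construction k_tensor is invariant to swapping the two nodes within a pair, which is what the "both directions" remark above refers to.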
Pairwise similarity measures
Method 1: Direct similarity-based prediction
Motivation: “connect similar genes”
Connect a and b if d(a, b) is below a threshold.
This is an unsupervised approach (no use of the known subnetwork).
J.-P. Vert (Ecole des Mines) Supervised network inference 5 / 19
Pairwise similarity measures
Method 2: metric learning
Metric learning
Motivation: use the known subnetwork to refine the distance measure before applying the similarity-based method
Based on kernel CCA (Yamanishi et al., 2004) or kernel metric learning (V. and Yamanishi, 2005).
J.-P. Vert (Ecole des Mines) Supervised network inference 7 / 19
Pairwise similarity measures
Metric learning pairwise kernel (Vert et al., 2007)
Given two pairs of nodes (a, b) and (c, d):

k_ml((a, b), (c, d)) = (k_nodes(a, c) − k_nodes(a, d) − k_nodes(b, c) + k_nodes(b, d))²
                     = [(φ(a) − φ(b))ᵀ(φ(c) − φ(d))]²    (2)

k_nodes is a kernel that measures the similarity of two nodes, just like the ones that are used for unsupervised link prediction.
A pair (a, b) is similar to a pair (c, d)
if a − b is similar to c − d, or
if a − b is similar to d − c.
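Equation (2) can be sketched the same way. The linear k_nodes is again our own illustrative assumption; under it, k_ml((a, b), (c, d)) equals 〈a − b, c − d〉².

```python
def k_nodes(x, y):
    """Linear node kernel on attribute vectors (illustrative assumption)."""
    return sum(u * v for u, v in zip(x, y))

def k_ml(pair1, pair2):
    """Metric learning pairwise kernel of Eq. (2)."""
    (a, b), (c, d) = pair1, pair2
    return (k_nodes(a, c) - k_nodes(a, d) - k_nodes(b, c) + k_nodes(b, d)) ** 2
```

Because of the square, swapping c and d flips the sign inside but leaves the kernel value unchanged, matching the "a − b similar to c − d, or to d − c" remark above.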
Protein interaction prediction
Setting:
Protein-protein interactions (PPI) from yeast two-hybrid screens and mass spectrometry measurements provide only a partial view of the interactome
The goal of protein interaction prediction is to complete the interactome by link prediction

Sequence-based PPI prediction:
domain- or motif-based interaction prediction (Sprinzak and Margalit, 2001; Deng et al., 2002; Gomez et al., 2003; Wang et al., 2004)
3-mer sequence kernel (Martin et al., 2005)
phylogenetic trees (Ramani and Marcotte, 2003), correlated mutations (Pazos and Valencia, 2002) derived from sequence
Protein interaction prediction
Negative examples:
Jansen et al., 2003: pairs of proteins from different cellular locations
Ben-Hur & Noble, 2005: select random pairs of non-interacting proteins

Ben-Hur & Noble, 2005:
Used a 3-mer kernel, a kernel based on sequence and domain motifs, a kernel based on GO annotation, interactions in other species, and common neighbours
PPI prediction on the BIND physical interaction dataset via SVM: AUC of 0.97, ROC50 of 0.58
Section 2:Cluster-based link prediction
Cluster-based link prediction
Approach:
Similar nodes form a cluster
Nodes from the same cluster exhibit a similar connectivity pattern

Problems to be solved:
How to find clusters on a graph? → graph-based clustering
How to define a connectivity pattern of a cluster?
Graph-based clustering I
Data representation:
the dataset D is given in terms of a graph G = (V, E)
a data object vi is a node in G; an edge e(i, j) from node vi to node vj has weight w(i, j)

Graph-based clustering:
Define a threshold θ
Remove all edges e(i, j) from G with weight w(i, j) > θ
Each connected component of the graph now corresponds to one cluster
Two nodes are in the same connected component if there is a path between them
Graph components can be found by depth-first search in a graph (O(|V| + |E|))
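The thresholding-plus-components procedure above can be sketched with a depth-first search; the function name and the edge-list format are our own choices.

```python
def graph_cluster(n, weighted_edges, theta):
    """Drop edges heavier than theta (weights read as distances, as on the
    slide), then label connected components by iterative depth-first search."""
    adj = [[] for _ in range(n)]
    for i, j, w in weighted_edges:
        if w <= theta:                 # keep only edges with weight <= theta
            adj[i].append(j)
            adj[j].append(i)
    cluster = [None] * n
    cid = 0
    for start in range(n):
        if cluster[start] is None:     # unvisited node: start a new component
            stack = [start]
            while stack:
                v = stack.pop()
                if cluster[v] is None:
                    cluster[v] = cid
                    stack.extend(adj[v])
            cid += 1
    return cluster
```

Each node and each surviving edge is touched a constant number of times, which gives the O(|V| + |E|) bound quoted above.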
Graph-based clustering II
Original graph
Graph-based clustering III
Thresholded graph (θ = 0.5)
Graph-based clustering IV
But how do we get the graph in the first place?
Think of the weights as a similarity measure.
If two nodes are not connected, then their similarity is 0.
Graph-based clustering creates clusters of similar objects:
for any object vi in a cluster, there is a second object vj such that similarity(vi, vj) is larger than θ.
DBScan I
Noise-robust graph-based clustering:
Graph-based clustering can suffer from the fact that one noisy edge connects two clusters
DBScan (Ester et al., 1996) is a noise-robust extension of graph-based clustering
DBScan is short for Density-Based Spatial Clustering of Applications with Noise

Core object:
Two objects vi and vj with distance d(vi, vj) < ε belong to the same cluster if either vi or vj is a core object.
vi is a core object iff there are MinPoints points within a distance of ε from vi.
A cluster is defined by iteratively checking this core object property.
DBScan II
DBSCAN(SetOfPoints, Eps, MinPts)
// SetOfPoints is UNCLASSIFIED
ClusterId := nextId(NOISE);
for i from 1 to SetOfPoints.size do
   Point := SetOfPoints.get(i);
   if Point.ClId = UNCLASSIFIED then
      if ExpandCluster(SetOfPoints, Point, ClusterId, Eps, MinPts) then
         ClusterId := nextId(ClusterId)
      end if
   end if
end for
DBScan III
Code: ExpandCluster

ExpandCluster(SetOfPoints, Point, ClId, Eps, MinPts) : Boolean;
seeds := SetOfPoints.regionQuery(Point, Eps);
if seeds.size < MinPts then
   SetOfPoints.changeClId(Point, NOISE);
   RETURN False;
else
   SetOfPoints.changeClIds(seeds, ClId);
   seeds.delete(Point);
   while seeds <> Empty do
      currentP := seeds.first();
      result := SetOfPoints.regionQuery(currentP, Eps);
      if result.size >= MinPts then
         for i from 1 to result.size do
            resultP := result.get(i);
            if resultP.ClId in (UNCLASSIFIED, NOISE) then
               if resultP.ClId = UNCLASSIFIED then
                  seeds.append(resultP);
               end if
               SetOfPoints.changeClId(resultP, ClId);
            end if // UNCLASSIFIED or NOISE
         end for;
      end if; // result.size >= MinPts
      seeds.delete(currentP);
   end while; // seeds <> Empty
   RETURN True;
end if
end // ExpandCluster
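The pseudocode above translates into the following runnable sketch. The label encoding, the Euclidean regionQuery, and the list-based seed queue are our own choices (None stands for UNCLASSIFIED, -1 for NOISE); regionQuery includes the query point itself, as in Ester et al.

```python
import math

def region_query(points, p, eps):
    """Indices of all points within distance eps of point p (p included)."""
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)          # None: UNCLASSIFIED, -1: NOISE
    cid = 0
    for p in range(len(points)):
        if labels[p] is not None:
            continue
        seeds = region_query(points, p, eps)
        if len(seeds) < min_pts:           # p is not a core object
            labels[p] = -1
            continue
        for q in seeds:                    # start a new cluster from core p
            labels[q] = cid
        seeds = [q for q in seeds if q != p]
        while seeds:
            current = seeds.pop(0)
            result = region_query(points, current, eps)
            if len(result) >= min_pts:     # current is a core object too
                for r in result:
                    if labels[r] is None or labels[r] == -1:
                        if labels[r] is None:
                            seeds.append(r)
                        labels[r] = cid
        cid += 1
    return labels
```

Points first marked as noise can later be absorbed into a cluster as border points, exactly as in ExpandCluster.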
DBScan V
Original graph
DBScan VI
DBScan-clustered graph (MinPts = 2, Eps = 0.5)
DBScan VII
Original graph
DBScan VIII
DBScan-clustered graph (MinPts = 3, Eps = 0.5)
DBScan IX
Properties:
Cluster assignment of border points is order-dependent
Unlike k-means, one does not have to specify the number of clusters a priori
But one has to set MinPts and Eps
Ester et al. report that for 2D examples MinPts = 4 is sufficient for good results
They determine Eps by visual inspection of a k-distance plot
Transfer question: How to kernelise DBScan?
Relational learning
Properties:
Represents the graph as a probability distribution, in terms of a graphical model.
A graphical model is a probabilistic model for which a graph denotes the conditional independence structure between the nodes, that is, the random variables.
A link r is a random variable in this model, typically a binary variable:
r = 1: the link exists
r = 0: the link does not exist
Link prediction based on node attributes
Link prediction based on cluster membership
Variants of cluster-based relational learning:

Links between all members of the same cluster, no links between members of different clusters:
P(r = 1 | z_a, z_b) = 1 if z_a = z_b
P(r = 1 | z_a, z_b) = 0 if z_a ≠ z_b

Links between all members of the same cluster, a fixed link probability between members of different clusters:
P(r = 1 | z_a, z_b) = 1 if z_a = z_b
P(r = 1 | z_a, z_b) = c if z_a ≠ z_b, with 0 ≤ c ≤ 1

A link probability η(a, b) between members of clusters a and b:
P(r = 1 | z_a, z_b) = η(a, b), with η(a, b) ∼ Beta(β, β)
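The three variants can be written down directly; this is an illustrative encoding with our own function names, using Python's random.betavariate for the Beta prior.

```python
import random

def link_prob_hard(za, zb):
    """Variant 1: links only within a cluster."""
    return 1.0 if za == zb else 0.0

def link_prob_fixed(za, zb, c=0.1):
    """Variant 2: fixed cross-cluster link probability c, 0 <= c <= 1."""
    return 1.0 if za == zb else c

def link_prob_learned(za, zb, eta):
    """Variant 3: per-cluster-pair probability eta(a, b) ~ Beta(beta, beta)."""
    return eta[(za, zb)]

# Draw a table of cluster-pair link probabilities from the Beta prior.
beta = 1.0
clusters = [0, 1]
eta = {(a, b): random.betavariate(beta, beta) for a in clusters for b in clusters}
```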
Infinite (Hidden) Relational Model (IRM, IHRM)
developed independently by Kemp et al. and Xu et al. in 2006
Cluster nodes via a Chinese restaurant process
Link probability η(a, b) between members of clusters a and b
Chinese restaurant process in a nutshell
P(z_i = a | z_1, …, z_{i−1}) = n_a / (i − 1 + γ) if n_a > 0, and γ / (i − 1 + γ) if a is a new cluster,
where z_1, …, z_{i−1} are the cluster assignments of objects 1, …, i − 1, n_a is the number of objects assigned to cluster a, and γ is a parameter.
The more objects there are in a cluster, the more likely it is that a new data point is also assigned to this cluster.
The creation of a new cluster is also possible.
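The Chinese restaurant process above can be simulated directly; this minimal sketch (our own implementation) assigns object i to an existing cluster a with probability n_a/(i−1+γ) and opens a new cluster with probability γ/(i−1+γ).

```python
import random

def crp(n_objects, gamma, seed=0):
    """Sample cluster assignments z_1..z_n from a Chinese restaurant process."""
    rng = random.Random(seed)
    assignments = []
    counts = []                            # counts[a] = n_a
    for i in range(1, n_objects + 1):
        # Total unnormalized mass is (i - 1) + gamma.
        r = rng.uniform(0, i - 1 + gamma)
        acc = 0.0
        for a, na in enumerate(counts):
            acc += na
            if r < acc:                    # join existing cluster a
                assignments.append(a)
                counts[a] += 1
                break
        else:                              # r fell in the gamma-sized slot
            assignments.append(len(counts))
            counts.append(1)
    return assignments
```

The rich-get-richer behaviour is visible in samples: large clusters attract new objects, while new clusters keep appearing at a rate controlled by γ.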