S.V.N. Vishwanathan: Graph Kernels, Page 1
Graph Kernels
S.V.N. Vishwanathan
[email protected]
http://www.stat.purdue.edu/~vishy
Purdue University
Joint work with Karsten Borgwardt, Nic Schraudolph, and Risi Kondor
Graphs are Everywhere
[Figures: two protein molecules; the Internet]
Comparing Graphs: How similar are two graphs?
Comparing Nodes: How similar are two nodes of a graph?
Adjacency Matrix
[Figure: example graph on eight vertices (nodes) labeled 1 to 8, joined by edges]
Undirected Graph G(V, E)
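\[
A = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 1 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 & 0 & 0 & 1 & 0
\end{pmatrix}
\]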
sub-matrix of A = a subgraph of G
Degree Matrix
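\[
D = \begin{pmatrix}
2 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 2 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 3 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 4 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 2 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 3 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 3 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 3
\end{pmatrix}
\]
Normalized adjacency matrix $\tilde{A} = D^{-1}A$ is a stochastic matrix (each row sums to one)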
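As a small illustration of the adjacency and degree matrices above, the following sketch rebuilds both from an edge list and checks the row-sum property of $D^{-1}A$. The edge list is read off the adjacency matrix shown earlier; the helper name `adjacency` is made up for this example.

```python
import numpy as np

# Edges of the 8-vertex example graph, read off the adjacency matrix (1-indexed).
EDGES = [(1, 2), (1, 7), (2, 3), (3, 4), (3, 8), (4, 5),
         (4, 6), (4, 8), (5, 6), (6, 7), (7, 8)]

def adjacency(edges, n):
    """Symmetric 0/1 adjacency matrix of an undirected graph (hypothetical helper)."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i - 1, j - 1] = A[j - 1, i - 1] = 1.0
    return A

A = adjacency(EDGES, 8)
degrees = A.sum(axis=1)          # row sums of A are the vertex degrees
D = np.diag(degrees)             # degree matrix
A_norm = A / degrees[:, None]    # D^{-1} A: divide row i by the degree of vertex i

print(degrees)                   # diagonal of D
print(A_norm.sum(axis=1))        # every row of D^{-1} A should sum to one
```

Dividing each row by its degree avoids forming an explicit inverse; it assumes no vertex is isolated.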
Graph Laplacian
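\[
L = \begin{pmatrix}
2 & -1 & 0 & 0 & 0 & 0 & -1 & 0 \\
-1 & 2 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 3 & -1 & 0 & 0 & 0 & -1 \\
0 & 0 & -1 & 4 & -1 & -1 & 0 & -1 \\
0 & 0 & 0 & -1 & 2 & -1 & 0 & 0 \\
0 & 0 & 0 & -1 & -1 & 3 & -1 & 0 \\
-1 & 0 & 0 & 0 & 0 & -1 & 3 & -1 \\
0 & 0 & -1 & -1 & 0 & 0 & -1 & 3
\end{pmatrix}
\]
$L = D - A$
Normalized version
\[
\tilde{L} = D^{-1/2} (D - A) D^{-1/2}
\]
Spectrum bounded between 0 and 2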
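A quick numerical check of the Laplacian slide, as a sketch on the same 8-vertex example: it forms $D - A$ and its normalized version and inspects the eigenvalue range. The variable names are illustrative only.

```python
import numpy as np

EDGES = [(1, 2), (1, 7), (2, 3), (3, 4), (3, 8), (4, 5),
         (4, 6), (4, 8), (5, 6), (6, 7), (7, 8)]
A = np.zeros((8, 8))
for i, j in EDGES:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1.0

d = A.sum(axis=1)                          # vertex degrees
L = np.diag(d) - A                         # L = D - A
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D^{-1/2}; needs every degree > 0
L_norm = D_inv_sqrt @ L @ D_inv_sqrt       # normalized Laplacian
spectrum = np.linalg.eigvalsh(L_norm)      # real eigenvalues in ascending order

print(spectrum.min(), spectrum.max())      # expected to lie in [0, 2] up to rounding
```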
Random Walk
From a vertex i randomly jump to any adjacent vertex j
Probability of jumping to j proportional to $A_{ij}$
Walks of Length 2
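\[
A^2 = \begin{pmatrix}
2 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \\
0 & 2 & 0 & 1 & 0 & 0 & 1 & 1 \\
1 & 0 & 3 & 1 & 1 & 1 & 1 & 1 \\
0 & 1 & 1 & 4 & 1 & 1 & 2 & 1 \\
0 & 0 & 1 & 1 & 2 & 1 & 1 & 1 \\
1 & 0 & 1 & 1 & 1 & 3 & 0 & 2 \\
0 & 1 & 1 & 2 & 1 & 0 & 3 & 0 \\
1 & 1 & 1 & 1 & 1 & 2 & 0 & 3
\end{pmatrix}
\]
Entries of $A^2$ = number of length-2 walks
Entries of $\tilde{A}^2$ (with $\tilde{A} = D^{-1}A$) = probability of length-2 walks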
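The two readings of the squared matrix can be checked directly. This sketch squares the 0/1 adjacency matrix of the example graph to count length-2 walks and squares the row-normalized matrix to get two-step transition probabilities; `P` is just a local name for $D^{-1}A$.

```python
import numpy as np

EDGES = [(1, 2), (1, 7), (2, 3), (3, 4), (3, 8), (4, 5),
         (4, 6), (4, 8), (5, 6), (6, 7), (7, 8)]
A = np.zeros((8, 8), dtype=int)
for i, j in EDGES:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1

A2 = A @ A                                # (A^2)[i, j] = number of length-2 walks from i to j
P = A / A.sum(axis=1, keepdims=True)      # D^{-1} A, one-step transition probabilities
P2 = P @ P                                # (P^2)[i, j] = probability of going from i to j in 2 steps

print(A2[0])                              # length-2 walk counts starting at vertex 1
print(P2.sum(axis=1))                     # each row is still a probability distribution
```

The diagonal of `A2` equals the degrees, since every edge gives one walk out and back.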
Idea!
Count walks?
Count number of walks between two nodes
Two nodes are similar if they are connected by many walks
Does not work :(
If graph has cycles then number of walks goes to ∞
A Better Idea!
Count walks?
Discount contribution of longer walks
Count number of walks between two nodes
Two nodes are similar if they are connected by many walks
Works if discounting factor chosen appropriately!
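To make "chosen appropriately" concrete, here is a sketch for one possible discount, a geometric factor $\lambda^k$ (this particular choice is an assumption at this point of the talk). The undiscounted walk counts keep growing, while the discounted sum converges to $(I - \lambda A)^{-1}$ once $\lambda$ is below the reciprocal of the spectral radius of $A$.

```python
import numpy as np

EDGES = [(1, 2), (1, 7), (2, 3), (3, 4), (3, 8), (4, 5),
         (4, 6), (4, 8), (5, 6), (6, 7), (7, 8)]
A = np.zeros((8, 8))
for i, j in EDGES:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1.0

# Total number of walks of length k (summed over all vertex pairs), k = 1..5.
walk_counts = [int(round(np.linalg.matrix_power(A, k).sum())) for k in range(1, 6)]

rho = np.abs(np.linalg.eigvalsh(A)).max()   # spectral radius of A
lam = 0.5 / rho                             # hypothetical choice satisfying lam < 1 / rho

# Geometrically discounted walk counts: sum_k lam^k A^k, truncated at 200 terms.
partial = sum(lam**k * np.linalg.matrix_power(A, k) for k in range(200))
closed_form = np.linalg.inv(np.eye(8) - lam * A)

print(walk_counts)                          # grows with k
print(np.abs(partial - closed_form).max())  # truncation error of the discounted sum
```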
Diffusion Kernels
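Discounting Factor: Discount a walk of length $k$ by $\lambda^k / k!$ for $0 \le \lambda \le 1$
Similarity: Similarity defined as
\[
k(i, j) = \left[ \sum_k \frac{\lambda^k}{k!} A^k \right]_{ij} = [\exp(\lambda A)]_{ij}
\]
Kondor and Lafferty: Work with diffusion and hence the graph Laplacian
\[
k(i, j) = \left[ \sum_k \frac{\lambda^k}{k!} L^k \right]_{ij} = [\exp(\lambda L)]_{ij}
\]
(In the diffusion setting $L$ is taken with the sign convention $A - D$, so that $\exp(\lambda L)$ is the heat kernel.)
They show that this is a valid p.s.d. kernel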
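The series and its closed form can be compared numerically. The sketch below uses the example graph, a made-up value of $\lambda$, and a small helper `expm_sym` (not from the talk) that exponentiates a symmetric matrix through its eigendecomposition. For the diffusion variant it assumes the negative-Laplacian sign convention $A - D$.

```python
import numpy as np
from math import factorial

EDGES = [(1, 2), (1, 7), (2, 3), (3, 4), (3, 8), (4, 5),
         (4, 6), (4, 8), (5, 6), (6, 7), (7, 8)]
A = np.zeros((8, 8))
for i, j in EDGES:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1.0
H = A - np.diag(A.sum(axis=1))      # negative Laplacian A - D (assumed sign convention)

def expm_sym(M):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.exp(w)) @ V.T    # V diag(exp(w)) V^T

lam = 0.5                           # hypothetical discount parameter
series = sum(lam**k / factorial(k) * np.linalg.matrix_power(A, k) for k in range(30))
K_walk = expm_sym(lam * A)          # [exp(lam A)]_ij: factorially discounted walk counts
K_heat = expm_sym(lam * H)          # diffusion (heat) kernel

print(np.abs(series - K_walk).max())        # truncated series vs. closed form
print(np.linalg.eigvalsh(K_heat).min())     # positive, so the kernel matrix is p.s.d.
```

Because the exponential of a symmetric matrix has eigenvalues $e^{w_i} > 0$, positive definiteness holds for any $\lambda$ in this construction.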
Extensions
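Laplacian as a regularizer: For any real-valued function $f$ on the vertices of a graph
\[
\langle f, L f \rangle = f^\top L f = \frac{1}{2} \sum_{i \sim j} (f_i - f_j)^2
\]
(with $L = D - A$ and the sum running over ordered pairs of adjacent vertices; under the convention $L = A - D$ the right-hand side changes sign)
Can regularize differently if we replace $L$ by
\[
r(L) := \sum_i r(\rho_i) \, l_i l_i^\top
\]
where $(\rho_i, l_i)$ are the eigenvalues and eigenvectors of $L$.
Any monotonically increasing function $r$ of $\rho$ is admissible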
Smola and Kondor
Other Kernels:
$r(\rho) = 1 + \sigma^2 \rho$, $K = (I + \sigma^2 L)^{-1}$: regularized Laplacian
$r(\rho) = (1 - \lambda \rho)^{-p}$, $K = (I - \lambda L)^{p}$: p-step random walk
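As an illustration of the spectral construction, this sketch builds $r(L)$ and the corresponding kernel for the regularized Laplacian case, $r(\rho) = 1 + \sigma^2 \rho$, on the example graph. It assumes the normalized Laplacian is meant and uses an arbitrary value for $\sigma^2$; the kernel is obtained by inverting $r$ on each eigenvalue.

```python
import numpy as np

EDGES = [(1, 2), (1, 7), (2, 3), (3, 4), (3, 8), (4, 5),
         (4, 6), (4, 8), (5, 6), (6, 7), (7, 8)]
A = np.zeros((8, 8))
for i, j in EDGES:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1.0

d = A.sum(axis=1)
L_norm = np.eye(8) - A / np.sqrt(np.outer(d, d))   # D^{-1/2} (D - A) D^{-1/2}

rho, ell = np.linalg.eigh(L_norm)   # eigenvalues rho_i and eigenvectors l_i (columns of ell)
sigma2 = 0.5                        # hypothetical value of sigma^2
r = 1.0 + sigma2 * rho              # r(rho_i) = 1 + sigma^2 rho_i

r_of_L = (ell * r) @ ell.T          # sum_i r(rho_i) l_i l_i^T
K = (ell / r) @ ell.T               # sum_i r(rho_i)^{-1} l_i l_i^T

print(np.abs(K - np.linalg.inv(np.eye(8) + sigma2 * L_norm)).max())
```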
Comparing Graphs
[Figure: two graphs to be compared, one with vertices 1 to 8 and one with vertices 1' to 5']
Count matching walks?
Count number of matching walks in two graphs
Discount contribution of longer walks
Two graphs are similar if many walks are matching
Three Questions:
How to formalize this intuition?
How to compute this efficiently?
How is this related to diffusion kernels?
Direct Product Graph
[Figure: graph G with vertices 1, 2, 3, graph G' with vertices 1', 2', 3', 4', and their direct product graph on the twelve vertex pairs 11', 21', 31', ..., 34']
Formal Definition
\[
\begin{aligned}
V_\times(G \times G') &= \{ (v, v') : v \in V,\ v' \in V' \} \\
E_\times(G \times G') &= \{ ((v, v'), (w, w')) : (v, w) \in E,\ (v', w') \in E' \}
\end{aligned}
\]
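The formal definition translates almost literally into code. In this sketch the two input graphs (a 3-vertex path and a 4-cycle) are made up for illustration and are not the graphs of the figure; edges are stored as ordered pairs, so an undirected edge appears in both orientations.

```python
from itertools import product

def symmetric(edges):
    """Store each undirected edge in both orientations (hypothetical helper)."""
    return set(edges) | {(w, v) for v, w in edges}

def direct_product(V1, E1, V2, E2):
    """Vertex list and edge set of the direct product graph, straight from the definition."""
    V = list(product(V1, V2))
    E = {((v, vp), (w, wp)) for (v, w) in E1 for (vp, wp) in E2}
    return V, E

# Hypothetical inputs: a path 1-2-3 and a cycle 1'-2'-3'-4'-1'.
V1, E1 = [1, 2, 3], symmetric([(1, 2), (2, 3)])
V2, E2 = ["1'", "2'", "3'", "4'"], symmetric(
    [("1'", "2'"), ("2'", "3'"), ("3'", "4'"), ("4'", "1'")])

V, E = direct_product(V1, E1, V2, E2)
print(len(V), len(E))    # |V| * |V'| vertices, |E| * |E'| ordered edges
```

Two product vertices are joined only when both coordinates move along an edge, which is why a step in the product graph is a simultaneous step in both inputs.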
Key Insight
Random Walk on Product Graph: Equivalent to simultaneous random walk on input graphs
Kernel Definition:
\[
k(G, G') = \frac{1}{|G|\,|G'|} \sum_k \frac{\lambda^k}{k!} \, e^\top A_\times^k \, e = \frac{1}{|G|\,|G'|} \, e^\top \exp(\lambda A_\times) \, e
\]
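A direct, unoptimized evaluation of this kernel might look as follows. The sketch forms $A_\times$ explicitly as a Kronecker product (the identity $A_\times = A \otimes A'$ appears later in the talk), reads $|G|$ as the number of vertices, which is an assumption, and uses two small made-up graphs.

```python
import numpy as np

def expm_sym(M):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.exp(w)) @ V.T

def rw_kernel_exp(A1, A2, lam):
    """e^T exp(lam * A_x) e / (|G| |G'|), with the product graph formed explicitly."""
    Ax = np.kron(A1, A2)                  # adjacency matrix of the direct product graph
    e = np.ones(Ax.shape[0])              # all-ones vector
    return e @ expm_sym(lam * Ax) @ e / (A1.shape[0] * A2.shape[0])

# Hypothetical inputs: a 3-vertex path and a triangle.
path3 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
tri = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)

k = rw_kernel_exp(path3, tri, lam=0.1)
print(k)
```

This is the naive route whose cost motivates the rest of the talk: the explicit product matrix has $|G||G'|$ rows.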
Extensions
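Different Decay Factor (Gärtner et al.): Using a $\lambda^k$ decay
\[
k(G, G') = \frac{1}{|G||G'|} \sum_k \lambda^k \, e^\top A_\times^k \, e = \frac{1}{|G||G'|} \, e^\top (I - \lambda A_\times)^{-1} \, e
\]
Taking expectations: Instead of summing, take expectations
\[
k(G, G') = \sum_k \lambda^k \, q_\times^\top A_\times^k \, p_\times = q_\times^\top (I - \lambda A_\times)^{-1} p_\times
\]
$p_\times$ and $q_\times$ are initial and stopping probabilities, respectively.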
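The closed form of the geometric series can be checked against a truncated sum. This sketch assumes uniform initial and stopping distributions when none are given (an illustrative default, not something the slide specifies) and picks $\lambda$ small enough for the series to converge.

```python
import numpy as np

def rw_kernel_geometric(A1, A2, lam, p=None, q=None):
    """q^T (I - lam A_x)^{-1} p; uniform p and q are an assumed default."""
    Ax = np.kron(A1, A2)
    N = Ax.shape[0]
    p = np.full(N, 1.0 / N) if p is None else p
    q = np.full(N, 1.0 / N) if q is None else q
    return q @ np.linalg.solve(np.eye(N) - lam * Ax, p)

# Hypothetical inputs: a 3-vertex path and a triangle.
path3 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
tri = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)

lam = 0.1                                  # below 1 / spectral radius of A_x (about 0.35 here)
k_closed = rw_kernel_geometric(path3, tri, lam)

# Truncated series sum_k lam^k q^T A_x^k p for comparison.
Ax = np.kron(path3, tri)
u = np.full(9, 1.0 / 9)
k_series = sum(lam**k * (u @ np.linalg.matrix_power(Ax, k) @ u) for k in range(100))

print(k_closed, k_series)
```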
Efficient Computation
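Product Graph is Huge: If $G$ and $G'$ have $n$ vertices then the product graph has $n^2$ vertices
Adjacency matrix $A_\times$ is of size $n^2 \times n^2$
Houston we have a problem: Kernel computation involves
\[
k(G, G') = q_\times^\top \underbrace{\exp(\lambda A_\times)}_{O(n^6)\,!} \, p_\times
\]
or
\[
k(G, G') = q_\times^\top \underbrace{(I - \lambda A_\times)^{-1}}_{O(n^6)\,!} \, p_\times
\]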
Kronecker Products
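Definition (by example):
\[
A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
\quad \text{and} \quad
B = \begin{pmatrix} 2 & 5 & 2 \\ 5 & 2 & 5 \\ 1 & 5 & 2 \end{pmatrix}
\]
then
\[
A \otimes B = \begin{pmatrix}
2 & 5 & 2 & 0 & 0 & 0 \\
5 & 2 & 5 & 0 & 0 & 0 \\
1 & 5 & 2 & 0 & 0 & 0 \\
0 & 0 & 0 & 2 & 5 & 2 \\
0 & 0 & 0 & 5 & 2 & 5 \\
0 & 0 & 0 & 1 & 5 & 2
\end{pmatrix}
\]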
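The example can be reproduced with NumPy's `np.kron`. The sketch also checks the mixed-product property $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$, which is a standard fact about Kronecker products rather than something stated on the slide; `C` and `Dm` are arbitrary test matrices.

```python
import numpy as np

A = np.array([[1, 0], [0, 1]])
B = np.array([[2, 5, 2], [5, 2, 5], [1, 5, 2]])
AB = np.kron(A, B)          # block matrix whose (i, j) block is A[i, j] * B
print(AB)

# Mixed-product property on arbitrary integer matrices of matching sizes.
C = np.array([[1, 2], [3, 4]])
Dm = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
lhs = np.kron(A, B) @ np.kron(C, Dm)
rhs = np.kron(A @ C, B @ Dm)
```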
Key Insight
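The adjacency matrix of the product graph
\[
A_\times = A \otimes A'
\]
Can compute $\exp(A_\times)$ from the eigendecompositions $A = P \Lambda P^\top$ and $A' = P' \Lambda' P'^\top$ of the symmetric matrices $A$ and $A'$, each costing $O(n^3)$:
\[
\exp(A_\times) = (P \otimes P') \, \exp(\Lambda \otimes \Lambda') \, (P \otimes P')^\top
\]
(The factorization $\exp(A) \otimes \exp(A')$ gives the exponential of the Kronecker sum $A \otimes I + I \otimes A'$, not of $A \otimes A'$.)
Computing $(I - \lambda A_\times)^{-1}$ involves a bit more work . . .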
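A numerical check on two small made-up graphs suggests some care with the exponential here: the Kronecker factorization $\exp(A) \otimes \exp(A')$ matches the exponential of the Kronecker sum, while for $A \otimes A'$ one workable route is through the two small eigendecompositions. The sketch below shows both; `expm_sym` is a helper written for this example.

```python
import numpy as np

def expm_sym(M):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.exp(w)) @ V.T

# Hypothetical inputs: a 3-vertex path and a triangle.
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A2 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)

# Eigenpairs of A1 (x) A2 are products of eigenvalues with Kronecker products of eigenvectors.
w1, P1 = np.linalg.eigh(A1)
w2, P2 = np.linalg.eigh(A2)
P = np.kron(P1, P2)
w = np.kron(w1, w2)
exp_product = (P * np.exp(w)) @ P.T              # exp(A1 (x) A2) from two small eigenproblems

I3 = np.eye(3)
kron_sum = np.kron(A1, I3) + np.kron(I3, A2)     # Kronecker sum of A1 and A2
exp_factored = np.kron(expm_sym(A1), expm_sym(A2))

print(np.abs(exp_product - expm_sym(np.kron(A1, A2))).max())
print(np.abs(exp_factored - expm_sym(kron_sum)).max())
```

In practice one would not form `P` explicitly for a kernel value; the point of the sketch is only to show which identity holds for which product.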
Sylvester Equations
Claim: Computing the Gärtner et al. kernel is no harder than solving
\[
X = A' X A^\top + P
\]
Sylvester Equations:
The above equation is called a Sylvester equation
Well studied in control theory
Efficiently solvable in $O(n^3)$ time
dlyap method in Matlab
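For symmetric $A$ and $A'$ there is a particularly simple $O(n^3)$ way to solve this equation: diagonalize both matrices, after which the unknowns decouple. The sketch below takes that route; it is not the general-purpose algorithm behind Matlab's solver, and the function name, the scaling factor 0.3 (standing in for the decay $\lambda$), and the input graphs are all made up for illustration.

```python
import numpy as np

def solve_sylvester_sym(A1, A2, P):
    """Solve X = A2 X A1^T + P for symmetric A1 (n x n), A2 (m x m) and P (m x n).

    Assumes no product of an eigenvalue of A2 with an eigenvalue of A1 equals 1.
    """
    nu, U = np.linalg.eigh(A1)                      # A1 = U diag(nu) U^T
    mu, V = np.linalg.eigh(A2)                      # A2 = V diag(mu) V^T
    Y = (V.T @ P @ U) / (1.0 - np.outer(mu, nu))    # Y_ij (1 - mu_i nu_j) = (V^T P U)_ij
    return V @ Y @ U.T

# Hypothetical inputs: scaled adjacency matrices of a 3-vertex path and a 4-cycle.
A1 = 0.3 * np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A2 = 0.3 * np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
P = np.full((4, 3), 1.0 / 12)

X = solve_sylvester_sym(A1, A2, P)
residual = X - (A2 @ X @ A1.T + P)
print(np.abs(residual).max())
```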
Before the proof . . .
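vec operator:
\[
B = \begin{pmatrix} 2 & 5 & 2 \\ 5 & 2 & 5 \\ 1 & 5 & 2 \end{pmatrix}
\quad \text{and} \quad
\operatorname{vec}(B) = \begin{pmatrix} 2 & 5 & 1 & 5 & 2 & 5 & 2 & 5 & 2 \end{pmatrix}^\top
\]
Key Equation:
\[
\operatorname{vec}(ABC) = (C^\top \otimes A) \operatorname{vec}(B)
\]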
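Both the column-stacking convention and the key equation are easy to check numerically. In this sketch `vec` is implemented with column-major flattening, and `A` and `C` are random rectangular matrices chosen only to exercise the identity.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one long vector (column-major order)."""
    return M.flatten(order="F")

B = np.array([[2, 5, 2], [5, 2, 5], [1, 5, 2]])
print(vec(B))                        # columns of B, one after another

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))      # arbitrary test matrices of compatible shapes
C = rng.standard_normal((3, 5))

lhs = vec(A @ B @ C)                 # left-hand side of the key equation
rhs = np.kron(C.T, A) @ vec(B)       # right-hand side: (C^T kron A) vec(B)
print(np.abs(lhs - rhs).max())
```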
The Proof
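Rewrite the Sylvester equation as
\[
\operatorname{vec}(X) = \operatorname{vec}(A' X A^\top) + \operatorname{vec}(P)
\]
Apply key equation
\[
\operatorname{vec}(X) = (A \otimes A') \operatorname{vec}(X) + \operatorname{vec}(P)
\]
Rearrange
\[
(I - A \otimes A') \operatorname{vec}(X) = \operatorname{vec}(P)
\]
or equivalently
\[
\operatorname{vec}(X) = (I - A \otimes A')^{-1} \operatorname{vec}(P)
\]
Let $p_\times = \operatorname{vec}(P)$ and multiply both sides by $q_\times^\top$
\[
q_\times^\top \operatorname{vec}(X) = q_\times^\top (I - A \otimes A')^{-1} p_\times = k(G, G')
\]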
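The chain of equalities can be verified in the reverse direction: solve the large linear system, reshape the solution into a matrix, and confirm that it satisfies the Sylvester equation. The graphs, the scaling factor 0.3 (which plays the role of the decay absorbed into $A$ and $A'$), and the uniform $P$ and $q_\times$ are assumptions made for this sketch.

```python
import numpy as np

def vec(M):
    return M.flatten(order="F")             # column stacking

def unvec(x, shape):
    return x.reshape(shape, order="F")      # inverse of vec

# Hypothetical inputs: scaled adjacency matrices of a 3-vertex path (A) and a 4-cycle (A').
A = 0.3 * np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
Ap = 0.3 * np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
n, m = A.shape[0], Ap.shape[0]

P = np.full((m, n), 1.0 / (m * n))          # p_x = vec(P), uniform here
q = np.full(m * n, 1.0 / (m * n))           # uniform stopping probabilities

# Direct route: solve (I - A kron A') vec(X) = vec(P).
x = np.linalg.solve(np.eye(m * n) - np.kron(A, Ap), vec(P))
X = unvec(x, (m, n))

residual = X - (Ap @ X @ A.T + P)           # X should satisfy the Sylvester equation
kernel = q @ x                              # q_x^T vec(X)
print(np.abs(residual).max(), kernel)
```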
Other Schemes
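Basic Idea:
\[
\underbrace{\operatorname{vec}(A' X A^\top)}_{O(n^3)} = \underbrace{(A \otimes A') \operatorname{vec}(X)}_{O(n^4)}
\]
Can exploit sparsity of $A$ and $A'$ to speed things up
Fixed Point Iteration: Solve for a fixed point (Kashima et al.) of the iteration $\operatorname{vec}(X_{t+1}) = \operatorname{vec}(P) + (A \otimes A') \operatorname{vec}(X_t)$:
\[
(I - A \otimes A') \operatorname{vec}(X_\infty) = \operatorname{vec}(P)
\]
Conjugate Gradient:
Fast matrix-vector multiplication to speed up CG solver
Convergence depends on spectrum of $A$ and $A'$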
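A fixed-point iteration that uses the vec trick never forms the Kronecker product: each step is two small matrix products. The sketch below assumes the iteration $X_{t+1} = P + A' X_t A^\top$, which converges when the spectral radius of $A \otimes A'$ is below one; the tolerance, iteration cap, and inputs are illustrative.

```python
import numpy as np

def fixed_point(A, Ap, P, tol=1e-12, max_iter=1000):
    """Iterate X <- P + A' X A^T until the update is smaller than tol."""
    X = P.copy()
    for _ in range(max_iter):
        X_next = P + Ap @ X @ A.T            # vec trick: O(n^3) instead of an n^2 x n^2 product
        if np.abs(X_next - X).max() < tol:
            return X_next
        X = X_next
    return X

# Hypothetical inputs: scaled adjacency matrices of a 3-vertex path (A) and a 4-cycle (A').
A = 0.3 * np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
Ap = 0.3 * np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
P = np.full((4, 3), 1.0 / 12)

X = fixed_point(A, Ap, P)

# Reference: the explicit n^2 x n^2 linear system, for comparison only.
x_direct = np.linalg.solve(np.eye(12) - np.kron(A, Ap), P.flatten(order="F"))
print(np.abs(X.flatten(order="F") - x_direct).max())
```

The same matrix-product form of the matrix-vector multiplication could be plugged into a conjugate gradient solver.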
Relation to Diffusion Kernels
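Laplacian of the Direct Product Graph: In general $L_\times \neq L_1 \otimes L_2$ :(
But there is a fix ...
Cartesian Product of Graphs:
\[
\begin{aligned}
V_\Box &= \{ (v, v') : v \in V,\ v' \in V' \} \\
E_\Box &= \{ ((v, v'), (w, w')) : (v, w) \in E \text{ and } v' = w', \text{ or } v = w \text{ and } (v', w') \in E' \}
\end{aligned}
\]
For Cartesian products
\[
A_\Box = A_1 \oplus A_2 := A_1 \otimes I + I \otimes A_2
\]
\[
L_\Box = L_1 \oplus L_2
\]
All our efficient computation tricks apply!
Is the kernel PSD?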
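Both Laplacian statements on this slide can be checked on small inputs. This sketch takes the Kronecker-sum formula for $A_\Box$ as the definition of the Cartesian product and compares Laplacians for the two kinds of product; the helper names and the two graphs are made up.

```python
import numpy as np

def kron_sum(M1, M2):
    """Kronecker sum M1 (+) M2 = M1 kron I + I kron M2."""
    return np.kron(M1, np.eye(len(M2))) + np.kron(np.eye(len(M1)), M2)

def laplacian(A):
    """Graph Laplacian D - A."""
    return np.diag(A.sum(axis=1)) - A

# Hypothetical inputs: a 3-vertex path and a triangle.
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A2 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)

A_box = kron_sum(A1, A2)            # adjacency matrix of the Cartesian product
L_box = laplacian(A_box)
A_times = np.kron(A1, A2)           # adjacency matrix of the direct product
L_times = laplacian(A_times)

print(np.allclose(L_box, kron_sum(laplacian(A1), laplacian(A2))))      # Cartesian product
print(np.allclose(L_times, np.kron(laplacian(A1), laplacian(A2))))     # direct product
```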
Experiments
Scaling Behavior - I:
Begin with empty graphs of size $2^k$ where $k = 1, \ldots, 10$
Randomly insert edges until avg. degree at least 2 or graph is full
Generate 10 random graphs and compute kernel matrix
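One way the graph generation step could be implemented is sketched below. The sampling scheme (uniformly random vertex pairs, rejecting pairs that are already connected), the seed handling, and the function name are assumptions; the slide does not specify them.

```python
import numpy as np

def random_graph(n, min_avg_degree=2.0, seed=0):
    """Insert random edges into an empty n-vertex graph until the average degree
    reaches min_avg_degree or the graph is complete (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    A = np.zeros((n, n), dtype=int)
    max_edges = n * (n - 1) // 2
    edges = 0
    while edges < max_edges and 2 * edges / n < min_avg_degree:
        i, j = rng.choice(n, size=2, replace=False)   # two distinct vertices
        if A[i, j] == 0:                              # skip pairs that already share an edge
            A[i, j] = A[j, i] = 1
            edges += 1
    return A

# Graph sizes 2^k; only the first few k are generated here to keep the example quick.
graphs = {k: random_graph(2**k, seed=k) for k in range(1, 6)}
for k, A in graphs.items():
    print(2**k, A.sum() / A.shape[0])                 # size and average degree
```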
Experiments
Scaling Behavior - II:
Begin with empty graphs of size 32
Randomly insert edges until avg. fill-in of adjacency matrix is 10% . . . 100% and graph is connected
Generate 10 random graphs and compute kernel matrix
Experiments
Impact of the vec-trick:
Same graphs as the runtime vs nodes experiment
Use the vec trick in the fixed point iteration
Compare to original fixed point iteration
Experiments
Unlabeled Graphs: We computed graph kernels on four datasets for molecular function prediction: Mutag and Ptc (chemical compounds), Enzyme and Protein (protein structures). We report runtimes for computing a 100 × 100 kernel matrix.
dataset        Mutag     Ptc       Enzyme    Protein
nodes/graph    17.7      26.7      32.6      38.6
edges/node     2.2       1.9       3.8       3.7
Direct         18'09"    142'53"   31h*      36d*
Sylvester      25.9"     73.8"     48.3"     69'15"
Conjugate      42.1"     58.4"     44.6"     55.3"
Fixed-Point    12.3"     32.4"     13.6"     31.1"
Experiments
Labeled Graphs: We repeated the above graph kernel computation, now using either a linear or delta kernel between node labels as well.
kernel         delta     delta     linear    linear
dataset        Mutag     Ptc       Enzyme    Protein
Direct         7.2h      1.4d*     2.4d*     5.3d*
Sylvester      3.9d*     2.7d*     89.8"     25'24"
Conjugate      2'35"     3'20"     124.4"    3'01"
Fixed-Point    1'05"     1'31"     50.1"     1'47"
What I did not talk about
Random walks on other semirings e.g. (min,+)
Why (min,+) does not yield p.s.d kernels
Differences between $A$, $\tilde{A}$, $L$, and $\tilde{L}$ (plain and normalized versions)
Kernels on vertices (yields marginal graph kernels of Kashima et al.)
Extensions to trajectories of ARMA models (joint work with René Vidal and Alex Smola)
General theory using Binet-Cauchy theorem (joint work with Alex Smola)
Connections to Rational kernels of Cortes et al.
Connections to R-Convolution kernels of Haussler
Overview of my Research
Structured Input:
Strings
Graphs
ARMA models
Structured Output:
Exponential families in feature space
Optimization for Machine Learning:
Bundle methods
subBFGS
Theory:
Fundamental limitations of kernels
Rates of convergence of boosting algorithms
Conclusion
First unifying view of
Diffusion kernels
Regularization on graphs
Geometric and random walk kernels
Marginal graph kernels
Efficient computation by exploiting Kronecker products
Papers at http://www.stat.purdue.edu/~vishy
Big Open Question
Comparing paths in two different graphs is polynomial
Subgraph isomorphism is known to be NP-hard
Computing the so-called universal graph kernel which counts all common subgraphs of two graphs is harder than subgraph isomorphism
When we compare any other subgraphs e.g.
simple paths (where vertices do not repeat)
cycles
trees
we seem to lose polynomial run-time
Are there other subgraphs for which efficient computationis possible?
References
Journal Papers
[1] S. V. N. Vishwanathan, Karsten Borgwardt, Nicol N. Schraudolph, and Imre Risi Kondor. On graph kernels. J. Mach. Learn. Res., 2008. submitted.
[2] S. V. N. Vishwanathan, A. J. Smola, and R. Vidal. Binet-Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. International Journal of Computer Vision, 73(1):95–119, 2007.
Conference Papers
[1] S. V. N. Vishwanathan, Karsten Borgwardt, and Nicol N. Schraudolph. Fast computation of graph kernels. Technical report, NICTA, 2006.
[2] S. V. N. Vishwanathan and A. J. Smola. Binet-Cauchy kernels. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1441–1448, Cambridge, MA, 2005. MIT Press.
Applications to Bioinformatics
[1] Karsten M. Borgwardt, H.-P. Kriegel, S. V. N. Vishwanathan, and N. Schraudolph. Graph kernels for disease outcome prediction from protein-protein interaction networks. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany Murray, and Teri E Klein, editors, Proceedings of the Pacific Symposium of Biocomputing 2007, Maui Hawaii, January 2007. World Scientific.
[2] Karsten M. Borgwardt, S. V. N. Vishwanathan, and H.-P. Kriegel. Class prediction from time series gene expression profiles using dynamical systems kernels. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany Murray, and Teri E Klein, editors, Proceedings of the Pacific Symposium of Biocomputing 2006, pages 547–558, Maui Hawaii, January 2006. World Scientific.
[3] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. N. Vishwanathan, A. J. Smola, and H.P. Kriegel. Protein function prediction via graph kernels. In Proceedings of Intelligent Systems in Molecular Biology (ISMB), Detroit, USA, 2005.