an attributed graph kernel from the jensen-shannon divergence · 2014-09-08 · contribution in...
TRANSCRIPT
An Attributed Graph Kernel from The Jensen-Shannon Divergence
Lu Bai*, Horst Bunke^, and Edwin R. Hancock*
*Department of Computer Science, University of York, UK ^Institute of Computer Science and Applied Mathematics,
University of Bern, Switzerland
Contribution
In the past we have reported a new graph kernel based on the Jensen-Shannon divergence between both graph von Neumann entropies and the Shannon entropies of random-walk probability distributions.
Drawback: estimating the overlap entropy is difficult.
Drawback: node label and attribute information are ignored.
Here we present an information-theoretic way to extend the Jensen-Shannon graph kernel using tree indexing and label strengthening.
Outline
Background and Motivation
Attributed Jensen-Shannon Diffusion Kernel
Jensen-Shannon divergence
Tree-index for label strengthening
Shannon label entropy
Attributed Jensen-Shannon diffusion kernel
Experiments
Conclusions
Background and Motivation (Graph Kernels)
Graph Kernels: Why use graph kernel?
Kernels offer an elegant solution to the cost of computation in a high-dimensional feature space [Riesen and Bunke, 2009, Pattern Recognition]
Existing Graph Kernels from the R-convolution [Haussler, 1999]
Random walk based kernels
Product graph kernels [Gartner et al., 2003, ICML]
Marginalized kernels on graphs [Kashima et al., 2003, ICML]
Path based kernels
Shortest path kernel [Borgwardt, 2005, ICDM]
Restricted subgraphs or subtrees based kernels
A Weisfeiler-Lehman subtree kernel [Shervashidze et al., 2010, JMLR]
A graphlet count kernel [Shervashidze et al., 2009, ICML]
A neighborhood subgraph kernel [Costa and De Grave, 2010, ICML]
Graph kernels from the classical and quantum Jensen-Shannon (JS) divergence: 1) the JS kernel [Bai and Hancock, 2013, JMIV], 2) the quantum JS kernel [Bai et al., 2014, Pattern Recognition], and 3) the fast JS subgraph kernel [Bai and Hancock, 2013, ICIAP].
Hypergraph kernels from the random walk [Wachman et al., 2009, ICML] and the JSD
Background and Motivation (Graph Kernels)
Drawbacks of the existing R-convolution kernels
1) Definition of R-convolution kernels: for a pair of graphs Gp and Gq, assume Sp and Sq are their substructure sets respectively; then the R-convolution kernel is
k(Gp, Gq) = Σ_{sp ∈ Sp} Σ_{sq ∈ Sq} δ(sp, sq),
where δ(sp, sq) = 1 if the substructures sp and sq are isomorphic, and 0 otherwise.
2) Neglects non-isomorphic but similar substructures.
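The R-convolution construction above can be illustrated with a minimal sketch. Here the substructures are simple "star" signatures (a vertex's degree plus its neighbours' degrees), and δ is tested by signature equality; this choice of substructure and all names are illustrative assumptions, not from the paper:

```python
from collections import Counter

def substructures(adj):
    """Toy substructure set: one 'star' signature per vertex, i.e. the
    vertex degree plus the sorted degrees of its neighbours. Two such
    stars are isomorphic iff their signatures are equal."""
    return [(len(nbrs), tuple(sorted(len(adj[u]) for u in nbrs)))
            for nbrs in adj]

def r_convolution_kernel(adj_p, adj_q):
    """k(Gp, Gq) = sum over substructure pairs of delta(isomorphic),
    computed with a Counter instead of the quadratic double sum."""
    cp, cq = Counter(substructures(adj_p)), Counter(substructures(adj_q))
    return sum(cp[s] * cq[s] for s in cp)

# Example: a triangle vs. a 3-vertex path (adjacency lists).
triangle = [[1, 2], [0, 2], [0, 1]]
path = [[1], [0, 2], [1]]
print(r_convolution_kernel(triangle, triangle))  # 9: all 3x3 star pairs match
print(r_convolution_kernel(triangle, path))      # 0: no isomorphic stars
```

Because δ only fires on exact isomorphism, the triangle-vs-path kernel value is 0 even though the graphs share structure, which is exactly drawback 2) above.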
Graph Kernels from Jensen-Shannon Divergence
Classical Jensen-Shannon Divergence (JSD)
Definition: The classical JSD is a non-extensive mutual information measure between probability distributions over structured data. It is related to the Shannon entropy, and is always well defined, symmetric, negative definite and bounded. If P and Q are two probability distributions, the JSD is
JSD(P, Q) = H((P + Q)/2) − (H(P) + H(Q))/2,
where H(·) denotes the Shannon entropy.
Jensen-Shannon graph kernel [Bai and Hancock, JMIV, 2013]
For a pair of graphs Gp(Vp, Ep) and Gq(Vq, Eq), the Jensen-Shannon divergence is
JSD(Gp, Gq) = H(Gpq) − (H(Gp) + H(Gq))/2,
where Gpq is a composite structure graph formed from the pair of (sub)graphs using the disjoint union or graph product, and H(·) is the entropy of a graph.
Advantages: more efficient than the R-convolution kernels.
Drawbacks: a) restricted to un-attributed graphs, b) cannot reflect the interior topology information of graphs, and c) lacks pairwise correspondence information between vertices.
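The classical JSD is easy to check numerically; a minimal sketch (function names are mine) confirming its symmetry and its bounds (0 for identical distributions, 1 bit for disjoint ones):

```python
import math

def shannon_entropy(p):
    """H(P) = -sum p log2 p, with the 0 log 0 := 0 convention."""
    return -sum(x * math.log(x, 2) for x in p if x > 0)

def jsd(p, q):
    """JSD(P, Q) = H((P+Q)/2) - (H(P) + H(Q)) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return shannon_entropy(m) - (shannon_entropy(p) + shannon_entropy(q)) / 2

print(jsd([1.0, 0.0], [0.0, 1.0]))  # 1.0: maximum, for disjoint P and Q
print(jsd([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical distributions
```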
Tree Index Strengthening
Tree-index (TI) label strengthening
Example: each strengthened label corresponds to a subtree of height h=2
Each strengthened vertex label corresponds to a subtree rooted at the vertex. A pair of strengthened vertex labels correspond if their subtrees are isomorphic.
Drawbacks: the TI method leads to a rapid explosion of the label length, and strengthening a vertex label by taking the union of the neighbouring label lists ignores the original label information of the vertex.
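The basic TI strengthening step can be sketched as follows (the representation details are my assumptions). Note how each iteration replaces a vertex label with only its neighbours' labels, discarding the vertex's own label, and how the raw labels grow nested with h:

```python
def ti_strengthen(adj, labels, h):
    """Tree-index strengthening: at each iteration, replace a vertex
    label with the sorted union of its neighbours' labels, so the label
    after h iterations encodes a subtree of height h rooted there.
    This basic variant drops the vertex's own label (the drawback noted
    above), and the labels grow rapidly with h."""
    for _ in range(h):
        labels = [tuple(sorted(labels[u] for u in adj[v]))
                  for v in range(len(adj))]
    return labels

# 4-vertex path with initial labels 'a', 'b', 'b', 'a'
adj = [[1], [0, 2], [1, 3], [2]]
print(ti_strengthen(adj, ['a', 'b', 'b', 'a'], 1))
# [('b',), ('a', 'b'), ('a', 'b'), ('b',)]
```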
Jensen-Shannon Diffusion Kernel (TI Method)
To overcome these problems, at each iteration h we strengthen a vertex label by taking the union of the original vertex label and its neighbouring vertex labels. Pseudocode:
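A minimal sketch of one such strengthening iteration follows. The dictionary-based compression of the raw labels to short integers is my assumption (in the spirit of Weisfeiler-Lehman label compression), added so that the label length does not explode:

```python
def strengthen(adj, labels):
    """One iteration of the modified tree-index strengthening: the new
    raw label is the union of the vertex's OWN label and its sorted
    neighbours' labels, then compressed to a short integer label.
    Identical raw labels receive identical compressed labels."""
    raw = [(labels[v], tuple(sorted(labels[u] for u in adj[v])))
           for v in range(len(adj))]
    table = {}  # raw label -> compressed integer label
    return [table.setdefault(r, len(table)) for r in raw]

# 3-vertex path with initial integer labels 0, 1, 0
adj = [[1], [0, 2], [1]]
print(strengthen(adj, [0, 1, 0]))  # endpoints get one label, centre another
```

In practice the compression table would be shared across all graphs being compared, so identical subtrees in different graphs receive the same compressed label.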
Tree Index Jensen-Shannon Divergence
Shannon label entropy: assume the label set L = {l1, l2, …, li, …, lI} contains all the possible labels of the two graphs. The label frequency probability distribution is P = {p1, …, pi, …, pI}, where pi is the fraction of vertices carrying label li.
The resulting Shannon label entropy is defined as
H(P) = −Σ_{i=1}^{I} pi log pi.
JSD between discrete probability distributions: assume the two discrete probability distributions are P = {p1, …, pm, …, pM} and Q = {q1, …, qm, …, qM}; then the JSD between P and Q is
JSD(P, Q) = H(M) − (H(P) + H(Q))/2, where M = {(pm + qm)/2}.
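The label distribution and Shannon label entropy can be computed as follows (a shared label set across the two graphs and natural logarithms are my assumptions):

```python
from collections import Counter
import math

def label_distribution(labels, label_set):
    """P = {p_i}: fraction of vertices carrying each label l_i in L.
    Using a shared label set L for both graphs aligns the two
    distributions component by component."""
    counts = Counter(labels)
    n = len(labels)
    return [counts[l] / n for l in label_set]

def label_entropy(p):
    """H(P) = -sum_i p_i log p_i, with 0 log 0 := 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Vertex labels of two graphs over the shared label set L = {a, b, c}
L = ['a', 'b', 'c']
P = label_distribution(['a', 'a', 'b', 'c'], L)  # [0.5, 0.25, 0.25]
Q = label_distribution(['b', 'b', 'c', 'c'], L)  # [0.0, 0.5, 0.5]
print(label_entropy(P))
```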
Jensen-Shannon Diffusion Kernel
The Jensen-Shannon diffusion kernel: for a pair of graphs G and G′, we have their label probability distributions P and P′. The JSD between G and G′ is
JSD(G, G′) = H((P + P′)/2) − (H(P) + H(P′))/2.
The Jensen-Shannon diffusion kernel is defined as
k(G, G′) = exp{−JSD(G, G′)}.
The Jensen-Shannon diffusion kernel is positive definite (pd):
because the JSD is symmetric, the diffusion kernel k = exp{−s(G, G′)} associated with such a dissimilarity measure s is pd.
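Putting the pieces together, a minimal sketch of the diffusion kernel k(G, G′) = exp{−JSD(G, G′)} over the two graphs' label frequency distributions (base-2 logarithms and names are my assumptions):

```python
import math

def js_diffusion_kernel(p, q):
    """k(G, G') = exp(-JSD(P, P')) over the aligned label frequency
    distributions P, P' of the two graphs."""
    def H(d):
        return -sum(x * math.log(x, 2) for x in d if x > 0)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    divergence = H(m) - (H(p) + H(q)) / 2
    return math.exp(-divergence)

print(js_diffusion_kernel([0.5, 0.5], [0.5, 0.5]))  # 1.0: identical label distributions
print(js_diffusion_kernel([1.0, 0.0], [0.0, 1.0]))  # exp(-1): disjoint label sets
```

Identical label distributions give the maximum kernel value 1; the value decays as the distributions diverge.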
Advantages
The new attributed diffusion kernel overcomes some shortcomings arising in the R-convolution kernels and our previous Jensen-Shannon kernel [Bai and Hancock, 2013, JMIV]
Correspondence between the discrete probabilities. No such correspondence information exists in our previous Jensen-Shannon kernel.
The Shannon label entropy represents the ambiguity of the compressed strengthened labels at iteration h. Each label corresponds to a subtree rooted at the vertex containing the label. All the subtrees are considered in the computation of the new Jensen-Shannon diffusion kernel.
Identical strengthened labels correspond to the same isomorphic subtrees. The correspondence between the probability distributions reflects the correspondence between pairs of isomorphic subtrees. The new kernel reflects more interior topological information of graphs.
The new kernel is not restricted to un-attributed graphs.
Experiments
Standard Graph Datasets: MUTAG, NCI1, NCI109, ENZYMES, PPIs, and PTC(MR)
Alternative state-of-the-art kernels for comparison: The Jensen-Shannon graph kernel (JSGK) [Bai and Hancock, JMIV, 2013]
The Weisfeiler-Lehman subtree kernel (WLSK) [Shervashidze et al., JMLR, 2010]
The shortest path kernel (SPGK) [Borgwardt and Kriegel, ICDM, 2005]
The graphlet count kernel with graphlet of size 3 (GCGK) [Shervashidze et al., ICML, 2009]
The backtrackless kernel using cycles identified by the Ihara zeta function (BRWK) [Aziz et al., TNNLS, 2013]
Experiments
Experimental results
Conclusion
Showed how to incorporate attributes into the Jensen-Shannon graph kernel.
Based on label strengthening via tree indexing.
Labels have information theoretic characterisation.
Kernel proves effective on bioinformatics datasets and outperforms a number of alternatives.
Future
Hypergraphs via oriented line graphs.
Directed graphs via directed-graph entropies [Cheng et al., Phys. Rev. E, 2014].
Acknowledgments
Prof. Edwin R. Hancock is supported by a Royal Society Wolfson Research Merit Award.
We thank Prof. Karsten Borgwardt and Dr. Nino Shervashidze for providing the Matlab implementation for the various graph kernel methods, and Dr. Geng Li for providing the graph datasets.
Thank you!