
Page 1

An Attributed Graph Kernel from The Jensen-Shannon Divergence

Lu Bai*, Horst Bunke^, and Edwin R. Hancock*

*Department of Computer Science, University of York, UK

^Institute of Computer Science and Applied Mathematics, University of Bern, Switzerland

Page 2

Contribution

In past work we reported new graph kernels based on the Jensen-Shannon divergence between both graph von Neumann entropies and the Shannon entropies of random-walk pdf's.

Estimating the entropy of the composite (overlap) structure is difficult.

These kernels ignore node label and attribute information.

Here we present an information-theoretic way to extend the Jensen-Shannon graph kernel using tree indexing and label strengthening.

Page 3

Outline

Background and Motivation

Attributed Jensen-Shannon Diffusion Kernel

Jensen-Shannon divergence

Tree-index for label strengthening

Shannon label entropy

Attributed Jensen-Shannon diffusion kernel

Experiments

Conclusions

Page 4

Background and Motivation (Graph Kernels)

Graph Kernels: why use a graph kernel?

Kernels offer an elegant solution to the cost of computation in a high-dimensional feature space [K. Riesen and H. Bunke, 2009, Pattern Recognition]

Existing Graph Kernels from the R-convolution [Haussler, 1999]

Random walk based kernels

Product graph kernels [Gartner et al., 2003, ICML]

Marginalized kernels on graphs [Kashima et al., 2003, ICML]

Path based kernels

Shortest path kernel [Borgwardt and Kriegel, 2005, ICDM]

Restricted subgraphs or subtrees based kernels

A Weisfeiler-Lehman subtree kernel [Shervashidze et al., 2010, JMLR]

A graphlet count kernel [Shervashidze et al., 2009, ICML]

A neighborhood subgraph kernel [Costa and De Grave, 2010, ICML]

Graph kernels from the classical and quantum Jensen-Shannon (JS) divergence: 1) the JS kernel [Bai and Hancock, 2013, JMIV], 2) the quantum JS kernel [Bai et al., 2014, Pattern Recognition], and 3) the fast JS subgraph kernel [Bai and Hancock, 2013, ICIAP].

Hypergraph kernels from the random walk [Wachman et al., 2009, ICML] and the JSD.

Page 8

Background and Motivation (Graph Kernels)

Drawbacks of the existing R-convolution kernels

1) Definition of R-convolution kernels: for a pair of graphs $G_p$ and $G_q$, assume $\{S_p\}$ and $\{S_q\}$ are their substructure sets respectively; then the R-convolution kernel is

$$k_R(G_p, G_q) = \sum_{S_p} \sum_{S_q} \delta(S_p, S_q),$$

where $\delta(S_p, S_q) = 1$ if $S_p$ and $S_q$ are isomorphic, and $0$ otherwise.

2) This neglects non-isomorphic but similar substructures.
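To see concretely why point 2) is a limitation, consider a toy sketch of the all-pairs delta comparison (illustrative only, not from the paper; substructures are represented simply as hashable tuples, and exact equality stands in for the isomorphism test):

```python
# Toy R-convolution kernel: sum delta(s_p, s_q) over all substructure pairs.
# Exact equality stands in for the isomorphism test delta.

def r_convolution(substructs_p, substructs_q):
    return sum(1 for sp in substructs_p for sq in substructs_q if sp == sq)

# Two graphs sharing one edge "substructure"; the near-miss pair
# ("b", "c") vs ("b", "d") contributes nothing, however similar.
print(r_convolution([("a", "b"), ("b", "c")], [("a", "b"), ("b", "d")]))  # 1
```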

Page 9

Graph Kernels from Jensen-Shannon Divergence

Classical Jensen-Shannon Divergence (JSD)

Definition: the classical JSD is a non-extensive mutual information measure between probability distributions over structured data. It is related to the Shannon entropy, and is always well defined, symmetric, negative definite and bounded. If $P$ and $Q$ are two probability distributions, the JSD is

$$JSD(P, Q) = H\!\left(\frac{P + Q}{2}\right) - \frac{H(P) + H(Q)}{2},$$

where $H(\cdot)$ denotes the Shannon entropy.

Jensen-Shannon graph kernel [Bai and Hancock, JMIV, 2013]

For a pair of graphs $G_p(V_p, E_p)$ and $G_q(V_q, E_q)$, the Jensen-Shannon divergence is

$$JSD(G_p, G_q) = H(G_{p \oplus q}) - \frac{H(G_p) + H(G_q)}{2},$$

where $G_{p \oplus q}$ is a composite structure graph formed from the pair of (sub)graphs using the disjoint union or graph product.

Advantages: more efficient than the R-convolution kernels.
Drawbacks: a) restricted to un-attributed graphs, b) cannot reflect interior topology information of graphs, and c) lacking pairwise correspondence information between vertices.

Page 12

Tree Index Strengthening

Tree-index (TI) label strengthening

Example: each strengthened label corresponds to a subtree of height h=2.

Each strengthened vertex label corresponds to a subtree rooted at the vertex. A pair of strengthened vertex labels correspond to each other if their subtrees are isomorphic.

Drawbacks: the TI method leads to a rapid explosion of the label length, and strengthening a vertex label by taking the union of only the neighbouring label lists ignores the original label information of the vertex.

Page 16

Jensen-Shannon Diffusion Kernel (TI Method)

To overcome this problem, at each iteration h we strengthen a vertex label by taking the union of the original vertex label and its neighbouring vertex labels. Pseudocode: a minimal sketch of the procedure follows.
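The original pseudocode figure is not reproduced here; the following is a minimal Python sketch under assumptions (graph as an adjacency list, integer initial labels, a dictionary compressing expanded labels to short ones; all names are illustrative, not from the paper):

```python
def ti_strengthen(adj, labels, h):
    """Tree-index label strengthening for h iterations (illustrative sketch).

    At each iteration a vertex's new label combines its current label with
    the sorted multiset of its neighbours' labels; the expanded label is
    compressed to a short integer, so identical expanded labels (and hence
    isomorphic subtrees of height h) coincide.
    """
    compress = {}
    for _ in range(h):
        new_labels = {}
        for v in adj:
            expanded = (labels[v], tuple(sorted(labels[u] for u in adj[v])))
            new_labels[v] = compress.setdefault(expanded, len(compress))
        labels = new_labels
    return labels

# Example: a path 0 - 1 - 2 with uniform initial labels; after two
# iterations the two end vertices share a label and the centre differs.
adj = {0: [1], 1: [0, 2], 2: [1]}
print(ti_strengthen(adj, {0: 0, 1: 0, 2: 0}, h=2))  # {0: 2, 1: 3, 2: 2}
```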

Page 17

Tree Index Jensen-Shannon Divergence

Shannon label entropy: assume the label set $L = \{l_1, l_2, \ldots, l_i, \ldots, l_I\}$ contains all the possible labels of two graphs. The label frequency probability distribution of a graph $G$ is

$$p_i = \frac{n(l_i)}{\sum_{j=1}^{I} n(l_j)},$$

where $n(l_i)$ is the number of vertices of $G$ carrying label $l_i$. The resulting Shannon label entropy is defined as

$$H_L(G) = -\sum_{i=1}^{I} p_i \log p_i.$$

JSD between discrete probability distributions: assume the two discrete probability distributions are $P = \{p_1, \ldots, p_m, \ldots, p_M\}$ and $Q = \{q_1, \ldots, q_m, \ldots, q_M\}$; then the JSD between $P$ and $Q$ is

$$JSD(P, Q) = H\!\left(\frac{P + Q}{2}\right) - \frac{H(P) + H(Q)}{2} = -\sum_{m=1}^{M} \frac{p_m + q_m}{2} \log \frac{p_m + q_m}{2} + \frac{1}{2}\sum_{m=1}^{M} p_m \log p_m + \frac{1}{2}\sum_{m=1}^{M} q_m \log q_m.$$
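As a concrete rendering of these formulas, here is a short Python sketch (function names are illustrative, not from the paper) computing the label-frequency distribution, the Shannon label entropy, and the JSD between two aligned discrete distributions:

```python
import math
from collections import Counter

def label_distribution(labels, label_set):
    """Label-frequency distribution of a graph over a fixed, shared label set."""
    counts = Counter(labels.values())
    return [counts.get(l, 0) / len(labels) for l in label_set]

def shannon_entropy(p):
    """H(P) = -sum p_i log p_i, skipping zero-probability labels."""
    return -sum(x * math.log(x) for x in p if x > 0)

def jsd(p, q):
    """JSD(P, Q) = H((P+Q)/2) - (H(P) + H(Q)) / 2 for aligned distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return shannon_entropy(m) - (shannon_entropy(p) + shannon_entropy(q)) / 2

# Distributions overlapping only on the middle label:
print(jsd([0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))  # ≈ 0.3466, i.e. (ln 2)/2
```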

Page 20

Jensen-Shannon Diffusion Kernel

The Jensen-Shannon diffusion kernel: for a pair of graphs $G$ and $G'$, we have their label probability distributions $P$ and $P'$. The JSD between $G$ and $G'$ is

$$JSD(G, G') = H\!\left(\frac{P + P'}{2}\right) - \frac{H(P) + H(P')}{2}.$$

The Jensen-Shannon diffusion kernel is defined as

$$k(G, G') = \exp\{-JSD(G, G')\}.$$

The Jensen-Shannon diffusion kernel is positive definite (pd):

because the JSD is symmetric and negative definite, the diffusion kernel $k = \exp\{-s(G, G')\}$ associated with this dissimilarity measure is pd.
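Putting the pieces together, one possible end-to-end sketch (reusing label_distribution and jsd from the earlier sketch, and assuming both graphs start from a common label alphabet; the compression dictionary is shared so that identical strengthened labels in either graph coincide):

```python
import math

def js_diffusion_kernel(adj_p, labels_p, adj_q, labels_q, h):
    """Illustrative sketch of the attributed JS diffusion kernel k = exp(-JSD)."""
    compress = {}  # shared between the two graphs to align their label sets

    def step(adj, labels):
        return {v: compress.setdefault(
                       (labels[v], tuple(sorted(labels[u] for u in adj[v]))),
                       len(compress))
                for v in adj}

    for _ in range(h):
        labels_p, labels_q = step(adj_p, labels_p), step(adj_q, labels_q)

    label_set = sorted(set(labels_p.values()) | set(labels_q.values()))
    p = label_distribution(labels_p, label_set)  # from the earlier sketch
    q = label_distribution(labels_q, label_set)
    return math.exp(-jsd(p, q))  # identical graphs give JSD = 0, hence k = 1
```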

Page 21

Advantages

The new attributed diffusion kernel overcomes some shortcomings arising in the R-convolution kernels and in our previous Jensen-Shannon kernel [Bai and Hancock, 2013, JMIV]:

Correspondence between the discrete probabilities; there is no such correspondence information in our previous Jensen-Shannon kernel.

The Shannon label entropy represents the ambiguity of the compressed strengthened labels at an iteration h. Each label corresponds to a subtree rooted at the vertex carrying that label, and all the subtrees are considered in the computation of the new Jensen-Shannon diffusion kernel.

Identical strengthened labels correspond to the same isomorphic subtrees, so the correspondence between the probability distributions reflects the correspondence between pairs of isomorphic subtrees. The new kernel thus reflects more interior topological information of the graphs.

The new kernel is not restricted to un-attributed graphs.

Page 26

Experiments

Standard Graph Datasets: MUTAG, NCI1, NCI109, ENZYMES, PPIs, and PTC(MR)

Alternative state-of-the-art kernels for comparison: The Jensen-Shannon graph kernel (JSGK) [Bai and Hancock, JMIV, 2013]

The Weisfeiler-Lehman subtree kernel (WLSK) [Shervashidze et al., JMLR, 2010]

The shortest path kernel (SPGK) [Borgwardt and Kriegel, ICDM, 2005]

The graphlet count kernel with graphlets of size 3 (GCGK) [Shervashidze et al., ICML, 2009]

The backtrackless kernel using cycles identified by the Ihara zeta function (BRWK) [Aziz et al., TNNLS, 2013]

Page 27

Experiments

Experimental results

Page 28

Conclusion

Showed how to incorporate attributes into the Jensen-Shannon graph kernel.

Based on label strengthening via tree indexing.

Labels have information theoretic characterisation.

Kernel proves effective on bioinformatics datasets and outperforms a number of alternatives.

Page 29

Future

Hypergraphs via oriented line graphs.

Directed graphs via directed graph entropies (Cheng et al., Phys. Rev. E, 2014).

Page 30

Acknowledgments

Prof. Edwin R. Hancock is supported by a Royal Society Wolfson Research Merit Award.

We thank Prof. Karsten Borgwardt and Dr. Nino Shervashidze for providing the Matlab implementations of the various graph kernel methods, and Dr. Geng Li for providing the graph datasets.

Page 31

Thank you!