Computational Bottlenecks in Graph Mining

Karsten Borgwardt

Machine Learning and Computational Biology Research Group
Max Planck Institute for Intelligent Systems & Max Planck Institute for Developmental Biology, Tübingen, Germany

MLG, San Diego
August 21, 2011


Graph mining is everywhere

I Graphs are everywhere
  I Bioinformatics
  I Social Network Analysis
  I Natural Language Processing

I Hot topics in databases/data mining
  I Frequent subgraph mining
  I Dense subgraph mining
  I Graph indexing and search

I Recent trends
  I Data: Growing size of graphs is a challenge for classic approaches
  I Methods: Kernel Machine Learning approaches to graph mining


Problem 1: Measure the similarity of two graphs

I How similar are two graphs?
  I How similar is their structure?
  I How similar are their node labels and edge labels?

I Applications
  I Function prediction for molecules and proteins
  I Change detection in networks of friendship
  I Comparison of semantic structures in NLP


Graph comparison

1. Graph isomorphism and subgraph isomorphism checking
  I Exact match
  I Exponential runtime

2. Graph edit distances
  I Involves definition of a cost function
  I Typically subgraph isomorphism as intermediate step

3. Topological descriptors
  I Lose some of the structural information represented by the graph or
  I Exponential runtime effort

4. Graph kernels (Gärtner et al., 2003; Kashima et al., 2003)
  I Goal 1: Polynomial runtime
  I Goal 2: Applicable to large graphs
  I Goal 3: Applicable to graphs with attributes


Speeding up Graph kernels

I Walks (NIPS 2006c, JMLR 2010)
  I Defined by Gärtner et al. and Kashima et al. in 2003
  I Slow: $O(n^6)$, where n is the number of nodes in G and G′
  I We use Sylvester equations and Kronecker products to compute the same kernel in $O(n^3)$

I Shortest paths (ICDM 2005)
  I Literature claimed there was no obvious way to define a graph kernel based on shortest paths.
  I We defined a graph kernel comparing the lengths of shortest paths in two graphs.
  I Wiener Index from chemoinformatics is an instance of this kernel.
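The idea of comparing shortest-path lengths can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ICDM 2005 implementation: graphs are unweighted, unlabeled and undirected, given as adjacency dictionaries; path lengths come from breadth-first search; and the function names and the two base kernels on lengths (`delta`, `product`) are chosen here for illustration only. With the product of lengths as base kernel, the value factorises into the product of the two Wiener indices, which is one way to read the remark about the Wiener Index.

```python
from collections import Counter, deque


def path_length_histogram(adj):
    """Count shortest-path lengths over all unordered, connected node pairs."""
    hist = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        hist.update(d for d in dist.values() if d > 0)
    # Every unordered pair was reached from both of its endpoints.
    return Counter({d: c // 2 for d, c in hist.items()})


def shortest_path_kernel(adj1, adj2, base_kernel):
    """Sum base_kernel(d, d') over all pairs of shortest paths, one per graph."""
    h1, h2 = path_length_histogram(adj1), path_length_histogram(adj2)
    return sum(c1 * c2 * base_kernel(d1, d2)
               for d1, c1 in h1.items() for d2, c2 in h2.items())


def wiener_index(adj):
    """Sum of shortest-path lengths over all unordered node pairs."""
    return sum(d * c for d, c in path_length_histogram(adj).items())


def delta(d1, d2):
    """Base kernel (assumption): 1 if the two lengths are equal, else 0."""
    return 1 if d1 == d2 else 0


def product(d1, d2):
    """Base kernel (assumption): product of the two lengths."""
    return d1 * d2


path3 = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}  # a - b - c
triangle = {"x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"]}

print(shortest_path_kernel(path3, triangle, delta))
print(shortest_path_kernel(path3, triangle, product))
print(wiener_index(path3) * wiener_index(triangle))
```

The last two printed values agree, since summing d * d' over all path pairs equals the product of the two sums of path lengths.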


Speeding up Graph kernels

I Subgraphs of limited size k (AISTATS 2009)
  I Suggested as 'graphlets' (k=4) by Pržulj (Bioinformatics, 2007)
  I Corresponding graph kernel scales as $O(n^8)$
  I We turn this into a fast kernel on unlabeled graphs, $O(n d^{k-1})$, where d is the maximum node degree.

I Results from group theory (ICML 2008, 2009)
  I Use concepts from group theory to derive feature vector representation of graphs
  I computable in $O(n^3)$

I Unresolved question:
  I How to compute kernels efficiently on large, labeled graphs?


Weisfeiler-Lehman kernel (Shervashidze and Borgwardt, NIPS 2009)

[Figure: first iteration of the Weisfeiler-Lehman subtree kernel on two labeled graphs G and G', panels a to e]

a. Given labeled graphs G and G'. Node labels of G: 1, 1, 2, 3, 4, 5. Node labels of G': 1, 2, 2, 3, 4, 5.

b. 1st iteration. Result of steps 1 and 2: multiset-label determination and sorting.
   G: 1,4; 1,4; 2,35; 3,245; 4,1135; 5,234
   G': 1,4; 2,3; 2,45; 3,245; 4,1235; 5,234

c. 1st iteration. Result of step 3: label compression.
   1,4 → 6; 2,3 → 7; 2,35 → 8; 2,45 → 9; 3,245 → 10; 4,1135 → 11; 4,1235 → 12; 5,234 → 13

d. 1st iteration. Result of step 4: relabeling.
   G: 6, 6, 8, 10, 11, 13
   G': 6, 7, 9, 10, 12, 13

e. End of the 1st iteration. Feature vector representations of G and G': counts of original node labels (1 to 5) followed by counts of compressed node labels (6 to 13).

$$\phi^{(1)}_{WLsubtree}(G) = (2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1)$$

$$\phi^{(1)}_{WLsubtree}(G') = (1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1)$$

$$k^{(1)}_{WLsubtree}(G, G') = \langle \phi^{(1)}_{WLsubtree}(G), \phi^{(1)}_{WLsubtree}(G') \rangle = 11.$$


Subtree-like patterns

[Figure: example graph on nodes 1 to 6 and a subtree-like pattern rooted at node 1]


Weisfeiler-Lehman kernel: Theoretical runtime properties

I Fast Weisfeiler-Lehman kernel (NIPS 2009 and JMLR 2011)
  I Algorithm: Repeat the following steps h times
    1. Sort: Represent each node v as sorted list $L_v$ of the labels of its neighbors ($O(m)$)
    2. Compress: Compress this list into a hash value $h(L_v)$ ($O(m)$)
    3. Relabel: Relabel v by the hash value $h(L_v)$ ($O(n)$)

I Runtime analysis (n nodes and m edges per graph)
  I per graph pair: Runtime $O(m h)$
  I for N graphs: Runtime $O(N m h + N^2 n h)$ (naively $O(N^2 m h)$)
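The sort, compress and relabel loop can be sketched as follows. This is an illustrative Python sketch rather than the authors' implementation: it assumes integer node labels, uses a dictionary shared across graphs in place of the hash function, and keeps the node's own label as part of the compressed signature. The two example graphs are an assumption as well: their edge lists were inferred from the multiset labels of the worked first iteration (G and G' with labels 1 to 5), so that the sketch can be checked against the kernel value 11 given there.

```python
from collections import Counter


def adjacency(edges):
    """Build an undirected adjacency dictionary from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def wl_subtree_features(graphs, h):
    """graphs: list of (labels, adj), where labels maps node -> integer label.
    Returns one Counter per graph with label counts accumulated over h iterations."""
    labels = [dict(lab) for lab, _ in graphs]
    features = [Counter(lab.values()) for lab in labels]
    next_label = max(l for lab in labels for l in lab.values()) + 1
    for _ in range(h):
        # Sort: own label plus the sorted labels of the neighbours.
        signatures = [
            {v: (lab[v], tuple(sorted(lab[u] for u in adj.get(v, ()))))
             for v in lab}
            for (_, adj), lab in zip(graphs, labels)
        ]
        # Compress: map each distinct signature to a fresh integer label.
        table = {}
        for sig in sorted({s for g in signatures for s in g.values()}):
            table[sig] = next_label
            next_label += 1
        # Relabel, then count the new labels.
        labels = [{v: table[s] for v, s in g.items()} for g in signatures]
        for feat, lab in zip(features, labels):
            feat.update(lab.values())
    return features


def wl_subtree_kernel(g1, g2, h):
    """Inner product of the two accumulated label-count vectors."""
    f1, f2 = wl_subtree_features([g1, g2], h)
    return sum(count * f2[label] for label, count in f1.items())


# Example graphs (edge lists inferred, see the note above).
G = ({"a": 1, "b": 1, "c": 2, "d": 3, "e": 4, "f": 5},
     adjacency([("a", "e"), ("b", "e"), ("c", "d"), ("c", "f"),
                ("d", "e"), ("d", "f"), ("e", "f")]))
G_prime = ({"a": 1, "b": 2, "c": 2, "d": 3, "e": 4, "f": 5},
           adjacency([("a", "e"), ("b", "d"), ("c", "e"), ("c", "f"),
                      ("d", "e"), ("d", "f"), ("e", "f")]))

print(wl_subtree_kernel(G, G_prime, h=1))
```

Computing the features for all N graphs in one call, with a single shared compression table, corresponds to the global scheme; calling `wl_subtree_kernel` on every pair corresponds to the naive pairwise scheme.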


Weisfeiler-Lehman kernel: Empirical Runtime properties

[Figure: four panels plotting runtime in seconds against the number of graphs N (log-log axes), graph size n, subtree height h, and graph density c. Legend: pairwise, global.]


Weisfeiler-Lehman kernel: Runtime and accuracy

[Figure: runtime (10 sec to 1000 days, log scale) and classification accuracy (50 % to 85 %) on the datasets MUTAG, NCI1, NCI109 and D&D, arranged along a graph size axis. Legend: WL, RG, 3 Graphlet, RW, SP.]


Problem 2: Find the most similar nodes in a graph

[Figure created by Social Graph]

The lightbulb approach (Paturi et al., COLT 1989)

Maximum correlation

I The lightbulb algorithm tackles the maximum correlation problem on an m × n matrix A with binary entries:

$$\operatorname*{argmax}_{i,j} |\rho_A(x_i, x_j)|. \qquad (1)$$

Quadratic runtime algorithm

I The problem can be solved by naive enumeration of all $n^2$ possible solutions.


The lightbulb approach

Lightbulb algorithm

1. Given a binary matrix A with m rows and n columns.

2. Repeat l times:
  I Sample k rows
  I Increase a counter for all pairs of columns that match on these k rows.

3. The counters divided by l give an estimate of the correlation $P(x_i = x_j)$.
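The three steps above can be sketched as follows. This is an illustrative Python sketch, not the algorithm as analysed by Paturi et al.: the function names, the bucketing of columns by their values on the sampled rows, sampling rows with replacement, and the fixed seed are all assumptions made here. Under this sampling scheme the counter ratio for k = 1 estimates the match probability directly, while for larger k it estimates roughly its k-th power, which preserves the ranking of pairs.

```python
import random
from collections import defaultdict
from itertools import combinations


def lightbulb_counts(A, k, l, rng):
    """Count, over l rounds, how often each column pair agrees on k sampled rows."""
    m, n = len(A), len(A[0])
    counts = defaultdict(int)
    for _ in range(l):
        rows = [rng.randrange(m) for _ in range(k)]  # sampled with replacement
        # Columns with identical values on the sampled rows share a bucket,
        # so only pairs inside a bucket need their counter increased.
        buckets = defaultdict(list)
        for j in range(n):
            buckets[tuple(A[r][j] for r in rows)].append(j)
        for cols in buckets.values():
            for pair in combinations(cols, 2):
                counts[pair] += 1
    return counts


def most_correlated_pair(A, k, l, seed=0):
    """Return the column pair with the highest counter, or None if no pair matched."""
    counts = lightbulb_counts(A, k, l, random.Random(seed))
    if not counts:
        return None
    return max(counts, key=counts.get)


# 8 rows, 4 columns; columns 0 and 3 are identical, all other pairs agree on half the rows.
A = [
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
]
print(most_correlated_pair(A, k=3, l=100))
```

Larger k makes buckets smaller, so fewer pairs are touched per round, at the price of needing more rounds before the best pair stands out.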

Subquadratic runtime

I With probability $1 - n^{-\alpha}$, the lightbulb algorithm retrieves the most correlated pair in $O\left(\alpha\, n^{1 + \frac{\ln p_1}{\ln q_2}} \ln^2 n\right) = O\left(n \left(\alpha\, n^{\frac{\ln p_1}{\ln q_2}} \ln^2 n\right)\right)$.


Difference between the node-pair and the lightbulb setting

Discrepancies

I Node attributes are non-binary in general

I Similarity is measured by Pearson's correlation coefficient rather than by the match probability $P(x_i = x_j)$


Locality sensitive hashing (Charikar, 2002)

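Given a collection of vectors in $\mathbb{R}^m$ we choose a random vector $\vec{r}$ from the m-dimensional Gaussian distribution. Corresponding to this vector $\vec{r}$, we define a hash function $h_{\vec{r}}$ as follows:

$$h_{\vec{r}}(\vec{u}) = \begin{cases} 1 & \text{if } \vec{r} \cdot \vec{u} \geq 0 \\ 0 & \text{if } \vec{r} \cdot \vec{u} < 0 \end{cases} \qquad (2)$$

Theorem

For vectors $\vec{v}, \vec{u}$, $\Pr[h_{\vec{r}}(\vec{u}) = h_{\vec{r}}(\vec{v})] = 1 - \frac{\theta(\vec{u}, \vec{v})}{\pi}$, where $\theta$ is the angle between the two vectors.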
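The hash function (2) and the collision probability in the theorem can be checked empirically with a short simulation. This is a sketch under assumptions: vectors are plain Python sequences, each hash draws a fresh standard Gaussian vector, and the helper names, trial count and seed are chosen here for illustration.

```python
import math
import random


def random_hyperplane_hash(dim, rng):
    """Draw r from the dim-dimensional standard Gaussian and return h_r."""
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]

    def h(u):
        return 1 if sum(ri * ui for ri, ui in zip(r, u)) >= 0 else 0

    return h


def collision_rate(u, v, trials, seed=0):
    """Fraction of independently drawn hash functions with h(u) == h(v)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = random_hyperplane_hash(len(u), rng)
        hits += h(u) == h(v)
    return hits / trials


def expected_collision_probability(u, v):
    """1 - theta / pi, with theta the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos_theta = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return 1.0 - math.acos(cos_theta) / math.pi


u, v = (1.0, 0.0), (0.0, 1.0)
print(expected_collision_probability(u, v))  # orthogonal vectors
print(collision_rate(u, v, trials=5000))
```

Concatenating several such bits per vector yields the binary matrix that a lightbulb-style search expects.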


Pearson’s correlation coefficient

Link between correlation and cosine

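Karl Pearson defined the correlation of 2 vectors $\vec{v}, \vec{u}$ in $\mathbb{R}^m$ as

$$\rho = \frac{\operatorname{cov}(\vec{v}, \vec{u})}{\sigma_v \sigma_u}, \qquad (3)$$

that is the covariance of the two vectors divided by their standard deviations. An equivalent geometric way to define it is:

$$\rho = \cos(\vec{v} - \bar{v}, \vec{u} - \bar{u}), \qquad (4)$$

where $\bar{v}$ and $\bar{u}$ are the mean values of $\vec{v}$ and $\vec{u}$, respectively.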
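The equivalence of (3) and (4) is easy to confirm numerically. The following Python sketch uses made-up example vectors and hypothetical helper names; it computes the correlation once from covariance and standard deviations and once as the cosine of the mean-centered vectors.

```python
from math import isclose, sqrt


def mean(x):
    return sum(x) / len(x)


def pearson(v, u):
    """Equation (3): covariance divided by the two standard deviations."""
    m = len(v)
    mv, mu = mean(v), mean(u)
    cov = sum((a - mv) * (b - mu) for a, b in zip(v, u)) / m
    sd_v = sqrt(sum((a - mv) ** 2 for a in v) / m)
    sd_u = sqrt(sum((b - mu) ** 2 for b in u) / m)
    return cov / (sd_v * sd_u)


def centered(x):
    """Subtract the mean value from every entry."""
    mx = mean(x)
    return [a - mx for a in x]


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


v = [1.0, 2.0, 4.0, 7.0]  # example data, chosen arbitrarily
u = [2.0, 1.0, 5.0, 6.0]

rho_statistical = pearson(v, u)                      # equation (3)
rho_geometric = cosine(centered(v), centered(u))     # equation (4)
print(isclose(rho_statistical, rho_geometric))
```

This is what lets a cosine-based hash be applied to correlation: centering the attribute vectors first turns an angle estimate into a correlation estimate.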


Genome-wide association mapping

by courtesy of D. Weigel


Challenges in two-locus mapping

Scale of the problem

I Typical datasets include order $10^5$ to $10^7$ SNPs.

I Hence we have to consider order $10^{10}$ to $10^{14}$ SNP pairs.

I Enormous multiple hypothesis testing problem.

I Enormous computational runtime problem.
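The step from SNP counts to pair counts is the number of unordered pairs, n(n-1)/2, which is roughly n²/2. A small Python check (the function name is chosen here for illustration):

```python
def snp_pairs(n):
    """Number of unordered SNP pairs among n SNPs."""
    return n * (n - 1) // 2


for n in (10**5, 10**6, 10**7):
    print(f"{n:>10,} SNPs -> {snp_pairs(n):>22,} pairs")
```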

Our contribution

I We assume binary phenotypes (cases and controls).

I Genotypes may be homozygous or heterozygous.

I We assume m individuals with n SNPs each.

I We define an algorithm that rapidly detects epistatic interactions in a runtime subquadratic in n (Achlioptas et al., KDD 2011).


Common approaches in the literature

Exhaustive enumeration

I Only with special hardware such as GPU implementations: EPIBLASTER (Kam-Thong et al., EJHG 2010)

Filtering approaches

I Statistical criterion, e.g. SNPs with large main effect (Zhang et al., 2007)

I Biological criterion, e.g. underlying PPI (Emily et al., 2009)

Index structure approaches

I fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008)

I TEAM, efficient updates of contingency tables (Zhang et al., 2010)


Difference in correlation for epistasis detection

I We phrase epistasis detection as a difference in correlation problem:

$$\operatorname*{argmax}_{i,j} |\rho_{\text{cases}}(x_i, x_j) - \rho_{\text{controls}}(x_i, x_j)|. \qquad (5)$$

I Different degree of linkage disequilibrium of two loci in cases and controls


Difference in correlation

Theorem

I Given a matrix of cases A and a matrix of controls B of identical size.

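I Finding the maximally correlated pair on

$$\begin{pmatrix} A & A \\ B & 1-B \end{pmatrix} \qquad (6)$$

I and on

$$\begin{pmatrix} A & 1-A \\ B & B \end{pmatrix} \qquad (7)$$

I is identical to

$$\operatorname*{argmax}_{i,j} |\rho_A(x_i, x_j) - \rho_B(x_i, x_j)|. \qquad (8)$$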


The lightbulb epistasis algorithm

Algorithm

1. Binarize original matrices $A_0$ and $B_0$ into A and B by locality sensitive hashing.

2. Compute maximally correlated pair $P_1$ on $\begin{pmatrix} A & A \\ B & 1-B \end{pmatrix}$ via lightbulb.

3. Compute maximally correlated pair $P_2$ on $\begin{pmatrix} A & 1-A \\ B & B \end{pmatrix}$ via lightbulb.

4. Report the maximum of $P_1$ and $P_2$.
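Steps 2 to 4 can be illustrated on a toy example. The Python sketch below is a simplification under several assumptions that go beyond the slides: the inputs are already binary (step 1 is skipped), "correlation" is taken to be the match fraction P(x_i = x_j), only pairs with one column in the left block and one in the right block are compared, and exhaustive search stands in for the lightbulb routine. All function names are hypothetical. Under these assumptions a match fraction s on either block matrix corresponds to a difference of 2s - 1 between cases and controls, so the result can be compared with direct enumeration.

```python
from math import isclose


def agreement(x, y):
    """Fraction of positions where two binary sequences match."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def flip(X):
    """Entrywise 1 - X for a binary matrix given as a list of rows."""
    return [[1 - v for v in row] for row in X]


def block_matrix(top_left, top_right, bottom_left, bottom_right):
    """Assemble a 2 x 2 block matrix from four equally sized matrices."""
    top = [l + r for l, r in zip(top_left, top_right)]
    bottom = [l + r for l, r in zip(bottom_left, bottom_right)]
    return top + bottom


def best_cross_pair(M, n):
    """Best-matching pair (i, j), i < j, with column i on the left and j on the right."""
    cols = list(zip(*M))
    return max((agreement(cols[i], cols[n + j]), (i, j))
               for i in range(n) for j in range(i + 1, n))


def max_difference_via_blocks(A, B):
    """Steps 2 to 4, with exhaustive search in place of lightbulb."""
    n = len(A[0])
    p1 = best_cross_pair(block_matrix(A, A, B, flip(B)), n)   # matrix (6)
    p2 = best_cross_pair(block_matrix(A, flip(A), B, B), n)   # matrix (7)
    score, pair = max(p1, p2)
    return 2 * score - 1, pair


def max_difference_brute_force(A, B):
    """Direct enumeration of |P_A(x_i = x_j) - P_B(x_i = x_j)| over i < j."""
    a_cols, b_cols = list(zip(*A)), list(zip(*B))
    n = len(a_cols)
    return max((abs(agreement(a_cols[i], a_cols[j])
                    - agreement(b_cols[i], b_cols[j])), (i, j))
               for i in range(n) for j in range(i + 1, n))


# Toy data: 4 individuals per group, 3 binary SNPs.
cases = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
controls = [[1, 0, 0], [0, 1, 1], [1, 0, 1], [0, 1, 0]]

print(max_difference_via_blocks(cases, controls))
print(max_difference_brute_force(cases, controls))
```

In the toy data SNPs 0 and 1 always agree in cases and never agree in controls, so both routes report the pair (0, 1).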


Experiments: Nordborg lab SNP dataset

Results on Nordborg SNP dataset

# SNPs   Measurements  Pairs       Exponent  Speedup  Top 10  Top 100  Top 500  Top 1K
100,000   8,255,645     8,186,657  1.38      611      1.00    0.86     0.82     0.80
100,000  52,762,001    51,732,700  1.54       97      1.00    1.00     0.99     0.98

Runtime

I Runtime is empirically $O(n^{1.5})$.

I Epistasis detection on the human genome would require 1 day of computation on a typical desktop PC.


First finding

by P. Samann

I 567 subjects

I 1,075,163 SNPs

I phenotype: Hippocampus volume

I genome-wide significant results ($p < 10^{-12}$)

I near genes involved in cell-cell signaling


Summary

Efficient graph comparison and node pair search

I We define kernels on graphs with discrete node labels, whose runtime is only linear in the number of edges m and the number of iterations h of the Weisfeiler-Lehman algorithm.

I We define a scheme to find the most correlated pair of nodes in a graph, which is subquadratic in the number of nodes n.

I A variant of this correlation search algorithm can be used to search for interacting genetic loci in subquadratic time.


Thank you

Group members:

I Nino Shervashidze

I Panagiotis Achlioptas

I Tony Kam-Thong

I Chloe-Agathe Azencott

I Barbara Rakitsch

I Limin Li

I Dominik Grimm

I Theofanis Karaletsos

I Christoph Lippert

I Oliver Stegle

I Hyokun Yun

Collaborators:

I F. Holsboer, MPI Psychiatry

I K. Mehlhorn, MPI Computer Science

I B. Müller-Myhsok, MPI Psychiatry

I B. Schölkopf, MPI-IS

I A. Smola, Yahoo! Research

I D. Weigel, MPI Dev. Biology

Sponsors:

I A.-v.-Humboldt (Chloe)

I DFG

I Microsoft Research Cambridge

I VW (Oliver)


Main references

I Nino Shervashidze and Karsten Borgwardt. Fast subtree kernels on graphs, NIPS 2009.

I Nino Shervashidze et al. Weisfeiler-Lehman graph kernels, JMLR 2011.

I Panagiotis Achlioptas et al. Two-locus association mapping in subquadratic runtime, KDD 2011.

I Tony Kam-Thong et al. Epistasis detection on quantitative phenotypes by exhaustive enumeration using GPUs, ISMB 2011.

I Tony Kam-Thong et al. EPIBLASTER: Fast exhaustive two-locus epistasis detection strategy using graphical processing units, European Journal of Human Genetics, 2011.

