Computational Bottlenecks in Graph Mining

Karsten Borgwardt

Machine Learning and Computational Biology Research Group

Max Planck Institute for Intelligent Systems & Max Planck Institute for Developmental Biology, Tübingen, Germany

MLG, San Diego, August 21, 2011

Karsten Borgwardt Computational Bottlenecks in Graph Mining August 21, 2011 1


Graph mining is everywhere

- Graphs are everywhere
  - Bioinformatics
  - Social Network Analysis
  - Natural Language Processing
- Hot topics in databases/data mining
  - Frequent subgraph mining
  - Dense subgraph mining
  - Graph indexing and search
- Recent trends
  - Data: Growing size of graphs is a challenge for classic approaches
  - Methods: Kernel Machine Learning approaches to graph mining


Problem 1: Measure the similarity of two graphs

- How similar are two graphs?
  - How similar is their structure?
  - How similar are their node labels and edge labels?
- Applications
  - Function prediction for molecules and proteins
  - Change detection in networks of friendship
  - Comparison of semantic structures in NLP


Graph comparison

1. Graph isomorphism and subgraph isomorphism checking
   - Exact match
   - Exponential runtime
2. Graph edit distances
   - Involves definition of a cost function
   - Typically subgraph isomorphism as intermediate step
3. Topological descriptors
   - Lose some of the structural information represented by the graph, or
   - Exponential runtime effort
4. Graph kernels (Gärtner et al., 2003; Kashima et al., 2003)
   - Goal 1: Polynomial runtime
   - Goal 2: Applicable to large graphs
   - Goal 3: Applicable to graphs with attributes


Speeding up Graph kernels

- Walks (NIPS 2006c, JMLR 2010)
  - Defined by Gärtner et al. and Kashima et al. in 2003
  - Slow: $O(n^6)$, where $n$ is the number of nodes in $G$ and $G'$
  - We use Sylvester equations and Kronecker products to compute the same kernel in $O(n^3)$
- Shortest paths (ICDM 2005)
  - Literature claimed there was no obvious way to define a graph kernel based on shortest paths.
  - We defined a graph kernel comparing the lengths of shortest paths in two graphs.
  - Wiener Index from chemoinformatics is an instance of this kernel.
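To make the shortest-path idea concrete, here is a minimal sketch in Python. It assumes unweighted, unlabeled graphs given as adjacency dictionaries and compares path lengths with a simple delta kernel (two paths count as a match only if their lengths are equal). These choices, and the function names, are illustrative assumptions and not necessarily the formulation of the ICDM 2005 paper.

```python
from collections import Counter, deque


def shortest_path_lengths(adjacency):
    """Histogram of shortest-path lengths over unordered node pairs (BFS per node)."""
    counts = Counter()
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for u in adjacency[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        # Count each unordered pair once (node ids must be comparable);
        # unreachable pairs never enter `dist` and are skipped.
        counts.update(d for target, d in dist.items() if source < target)
    return counts


def shortest_path_kernel(adj_g, adj_h):
    """Delta kernel on path lengths: number of pairs of equally long shortest paths."""
    cg, ch = shortest_path_lengths(adj_g), shortest_path_lengths(adj_h)
    return sum(cg[length] * ch[length] for length in cg)


path = {0: [1], 1: [0, 2], 2: [1]}            # path graph on 3 nodes
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}  # complete graph on 3 nodes
print(shortest_path_kernel(path, triangle))   # path has two length-1 pairs, triangle three
```

Running one breadth-first search per node keeps the sketch short; for weighted graphs an all-pairs shortest-path routine would take its place.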


Speeding up Graph kernels

- Subgraphs of limited size $k$ (AISTATS 2009)
  - Suggested as 'graphlets' ($k=4$) by Pržulj (Bioinformatics, 2007)
  - Corresponding graph kernel scales as $O(n^8)$
  - We turn this into a fast kernel on unlabeled graphs, $O(n d^{k-1})$.
- Results from group theory (ICML 2008, 2009)
  - Use concepts from group theory to derive feature vector representation of graphs
  - Computable in $O(n^3)$
- Unresolved question:
  - How to compute kernels efficiently on large, labeled graphs?
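As an illustration of what a graphlet feature vector looks like, the sketch below counts induced subgraphs of size $k=3$ by exhaustive enumeration. It is the slow baseline, not the fast $O(n d^{k-1})$ method from the slide; the choice of $k=3$, the unnormalized counts, and all function names are assumptions made for brevity. For three nodes, the number of edges (0 to 3) identifies the graphlet type.

```python
from collections import Counter
from itertools import combinations


def graphlet3_counts(nodes, edges):
    """Count induced 3-node subgraphs by their number of edges (index 0..3)."""
    edge_set = {frozenset(e) for e in edges}
    counts = Counter()
    for triple in combinations(nodes, 3):
        num_edges = sum(frozenset(pair) in edge_set
                        for pair in combinations(triple, 2))
        counts[num_edges] += 1
    return [counts[i] for i in range(4)]


def graphlet3_kernel(g, h):
    """Inner product of the two (unnormalized) graphlet count vectors."""
    return sum(a * b for a, b in zip(graphlet3_counts(*g), graphlet3_counts(*h)))


square = (range(4), [(0, 1), (1, 2), (2, 3), (3, 0)])
k4 = (range(4), [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)])
print(graphlet3_counts(*square), graphlet3_counts(*k4))
```

The loop over all triples costs $O(n^3)$ per graph, which is why exhaustive enumeration becomes impractical for larger $k$ and larger graphs.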


Weisfeiler-Lehman kernel (Shervashidze and Borgwardt, NIPS 2009)

[Figure: first iteration of the Weisfeiler-Lehman subtree kernel on two labeled graphs G and G', panels a to e.]

a. Given labeled graphs G and G' (node labels 1 to 5).

b. 1st iteration. Result of steps 1 and 2: multiset-label determination and sorting.
   G: 1,4; 1,4; 2,35; 3,245; 4,1135; 5,234
   G': 1,4; 2,3; 2,45; 3,245; 4,1235; 5,234

c. 1st iteration. Result of step 3: label compression.
   1,4 -> 6; 2,3 -> 7; 2,35 -> 8; 2,45 -> 9; 3,245 -> 10; 4,1135 -> 11; 4,1235 -> 12; 5,234 -> 13

d. 1st iteration. Result of step 4: relabeling.
   G: 6, 6, 8, 10, 11, 13
   G': 6, 7, 9, 10, 12, 13

e. End of the 1st iteration. Feature vector representations of G and G' (counts of original node labels 1 to 5, followed by counts of compressed node labels 6 to 13):

$$\phi^{(1)}_{\mathrm{WLsubtree}}(G) = (2, 1, 1, 1, 1,\; 2, 0, 1, 0, 1, 1, 0, 1)$$

$$\phi^{(1)}_{\mathrm{WLsubtree}}(G') = (1, 2, 1, 1, 1,\; 1, 1, 0, 1, 1, 0, 1, 1)$$

$$k^{(1)}_{\mathrm{WLsubtree}}(G, G') = \langle \phi^{(1)}_{\mathrm{WLsubtree}}(G), \phi^{(1)}_{\mathrm{WLsubtree}}(G') \rangle = 11$$


Subtree-like patterns

[Figure: example graph with nodes 1 to 6 and a subtree-like pattern derived from it.]


Weisfeiler-Lehman kernel: Theoretical runtime properties

- Fast Weisfeiler-Lehman kernel (NIPS 2009 and JMLR 2011)
- Algorithm: Repeat the following steps $h$ times
  1. Sort: Represent each node $v$ as sorted list $L_v$ of its neighbors ($O(m)$)
  2. Compress: Compress this list into a hash value $h(L_v)$ ($O(m)$)
  3. Relabel: Relabel $v$ by the hash value $h(L_v)$ ($O(n)$)
- Runtime analysis
  - per graph pair: Runtime $O(m h)$
  - for $N$ graphs: Runtime $O(N m h + N^2 n h)$ (naively $O(N^2 m h)$)
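The sort, compress and relabel steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the compression step uses a dictionary shared across all graphs in place of a hash function, and the function name, the `(labels, edges)` input format and the two example graphs are assumptions. The edge lists were reconstructed from the multiset labels of the first-iteration example, so that the sketch can be checked against the kernel value 11 reported there.

```python
from collections import Counter


def wl_subtree_kernel(graphs, h):
    """Gram matrix of the WL subtree kernel for a list of (labels, edges) graphs."""
    adjacency = []
    for labels, edges in graphs:
        adj = {v: [] for v in labels}
        for u, v in edges:
            adj[u].append(v)
            adj[v].append(u)
        adjacency.append(adj)

    current = [dict(labels) for labels, _ in graphs]
    # Iteration 0: counts of the original node labels.
    features = [Counter(("orig", lab) for lab in cur.values()) for cur in current]

    next_id = 0
    for _ in range(h):
        compress = {}  # shared by all graphs: equal signatures get equal new labels
        relabeled = []
        for adj, cur in zip(adjacency, current):
            new = {}
            for v, nbrs in adj.items():
                # Sort: own label plus the sorted multiset of neighbor labels.
                signature = (cur[v], tuple(sorted(cur[u] for u in nbrs)))
                # Compress: map the signature to a short new label.
                if signature not in compress:
                    compress[signature] = ("wl", next_id)
                    next_id += 1
                # Relabel: the node carries the compressed label from now on.
                new[v] = compress[signature]
            relabeled.append(new)
        current = relabeled
        for feat, cur in zip(features, current):
            feat.update(cur.values())

    # Kernel value = inner product of the label-count vectors.
    return [[sum(fa[key] * fb[key] for key in fa) for fb in features]
            for fa in features]


# Example graphs (node id -> label, undirected edges), reconstructed as described above.
G = ({0: 1, 1: 1, 2: 4, 3: 3, 4: 5, 5: 2},
     [(0, 2), (1, 2), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)])
G_prime = ({0: 1, 1: 4, 2: 3, 3: 5, 4: 2, 5: 2},
           [(0, 1), (1, 2), (1, 3), (1, 5), (2, 3), (2, 4), (3, 5)])

gram = wl_subtree_kernel([G, G_prime], h=1)
print(gram[0][1])  # kernel value between G and G' after one iteration
```

Relabeling all $N$ graphs once with a shared dictionary, and only then taking inner products of the count vectors, is what separates the global $O(N m h + N^2 n h)$ computation from the naive pairwise one.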


Weisfeiler-Lehman kernel: Empirical Runtime properties

[Figure: four panels showing runtime in seconds of the pairwise and the global kernel computation, as a function of the number of graphs N (log-log axes), graph size n, subtree height h, and graph density c.]


Weisfeiler-Lehman kernel: Runtime and accuracy

[Figure: runtime (log scale, 10 sec to 1000 days) and classification accuracy (50 % to 85 %) of the WL, RG, 3 Graphlet, RW and SP kernels on the datasets MUTAG, NCI1, NCI109 and D&D; axis annotation: graph size.]


Problem 2: Find the most similar nodes in a graph

[Figure: created by Social Graph]

The lightbulb approach (Paturi et al., COLT 1989)

Maximum correlation

- The lightbulb algorithm tackles the maximum correlation problem on an $m \times n$ matrix $A$ with binary entries:

$$\operatorname*{argmax}_{i,j} \; |\rho_A(x_i, x_j)| \tag{1}$$

Quadratic runtime algorithm

- The problem can be solved by naive enumeration of all $n^2$ possible solutions.
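For reference, the quadratic baseline can be written down directly. The sketch below assumes that the columns of $A$ are the variables $x_i$ and that $\rho_A$ is Pearson's correlation; the function names and the convention of returning 0 for constant columns are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def most_correlated_pair(A):
    """Enumerate all column pairs of A and return the one with maximal |rho|."""
    columns = list(zip(*A))
    return max(combinations(range(len(columns)), 2),
               key=lambda ij: abs(pearson(columns[ij[0]], columns[ij[1]])))


A = [[1, 0, 1],
     [0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
print(most_correlated_pair(A))  # columns 0 and 1 are exact complements
```

Each of the roughly $n^2/2$ pairs costs $O(m)$ to evaluate, which is the cost the lightbulb approach aims to avoid.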


The lightbulb approach

Lightbulb algorithm

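1. Given a binary matrix $A$ with $m$ rows and $n$ columns.

2. Repeat $l$ times:
   - Sample $k$ rows
   - Increase a counter for all pairs of columns that match on these $k$ rows.

3. The counters divided by $l$ estimate the probability that columns $x_i$ and $x_j$ agree on all $k$ sampled rows, which grows with the correlation $P(x_i = x_j)$.

Subquadratic runtime

- With probability $1 - n^{-\alpha}$, the lightbulb algorithm retrieves the most correlated pair in

$$O\!\left(\alpha\, n^{1 + \frac{\ln p_1}{\ln q_2}} \ln^2 n\right) = O\!\left(n \left(\alpha\, n^{\frac{\ln p_1}{\ln q_2}} \ln^2 n\right)\right).$$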
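The sampling loop above can be sketched as follows. This is an illustrative reading of the steps, not the original algorithm's code: columns are bucketed by their values on the sampled rows, so that only columns agreeing on all $k$ rows are paired. Sampling rows with replacement, the function name and the toy matrix are assumptions.

```python
import random
from collections import Counter
from itertools import combinations


def lightbulb(A, k, l, seed=0):
    """Return {(i, j): fraction of the l rounds in which columns i and j matched}."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    counts = Counter()
    for _ in range(l):
        rows = [rng.randrange(m) for _ in range(k)]  # sample k rows (with replacement)
        buckets = {}
        for j in range(n):
            key = tuple(A[r][j] for r in rows)       # column j restricted to the sample
            buckets.setdefault(key, []).append(j)
        for cols in buckets.values():                # columns in a bucket match on all k rows
            for pair in combinations(cols, 2):
                counts[pair] += 1
    return {pair: c / l for pair, c in counts.items()}


# Toy data: columns 0 and 1 are identical, column 2 is their complement.
A = [[0, 0, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 1],
     [0, 0, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 1]]
estimates = lightbulb(A, k=3, l=50, seed=1)
print(max(estimates, key=estimates.get))
```

Bucketing avoids touching pairs that disagree on the sample, so for larger $k$ most rounds only process a few candidate pairs.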


Difference between the node-pair and the lightbulb setting

Discrepancies

- Node attributes are non-binary in general

- Similarity is measured by Pearson's correlation coefficient, not by the match probability $P(x_i = x_j)$


Locality sensitive hashing (Charikar, 2002)

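Given a collection of vectors in $\mathbb{R}^m$, we choose a random vector $\vec{r}$ from the $m$-dimensional Gaussian distribution. Corresponding to this vector $\vec{r}$, we define a hash function $h_{\vec{r}}$ as follows:

$$h_{\vec{r}}(\vec{u}) = \begin{cases} 1 & \text{if } \vec{r} \cdot \vec{u} \ge 0 \\ 0 & \text{if } \vec{r} \cdot \vec{u} < 0 \end{cases} \tag{2}$$

Theorem

For vectors $\vec{v}, \vec{u}$, $\Pr[h_{\vec{r}}(\vec{u}) = h_{\vec{r}}(\vec{v})] = 1 - \frac{\theta(\vec{u}, \vec{v})}{\pi}$, where $\theta$ is the angle between the two vectors.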
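The theorem suggests a simple empirical check: hash two vectors with many independent random hyperplanes and turn the fraction of agreeing bits back into an angle. The sketch below does this with the standard library only; the function names, the number of hashes and the fixed seed are assumptions for illustration.

```python
import math
import random


def hyperplane_hash(u, r):
    """h_r(u) = 1 if r . u >= 0, else 0."""
    return 1 if sum(ri * ui for ri, ui in zip(r, u)) >= 0 else 0


def estimate_angle(u, v, num_hashes=2000, seed=0):
    """Estimate the angle between u and v from the fraction of agreeing hash bits."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(num_hashes):
        r = [rng.gauss(0.0, 1.0) for _ in u]  # random Gaussian direction
        agree += hyperplane_hash(u, r) == hyperplane_hash(v, r)
    # Invert Pr[h(u) = h(v)] = 1 - theta / pi.
    return math.pi * (1.0 - agree / num_hashes)


print(estimate_angle([1.0, 0.0], [0.0, 1.0]))  # should be close to pi / 2
```

Each hash contributes one bit per vector, which is how real-valued attributes can be turned into the binary matrix the lightbulb setting requires.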


Pearson’s correlation coefficient

Link between correlation and cosine

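Karl Pearson defined the correlation of two vectors $\vec{v}, \vec{u}$ in $\mathbb{R}^m$ as

$$\rho = \frac{\operatorname{cov}(\vec{v}, \vec{u})}{\sigma_v \sigma_u}, \tag{3}$$

that is, the covariance of the two vectors divided by the product of their standard deviations. An equivalent geometric way to define it is:

$$\rho = \cos(\vec{v} - \bar{v}, \vec{u} - \bar{u}), \tag{4}$$

where $\bar{v}$ and $\bar{u}$ are the mean values of $\vec{v}$ and $\vec{u}$, respectively.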
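The equivalence of (3) and (4) is easy to verify numerically. The following sketch computes both forms on a small example; the function names and the sample vectors are made up for illustration.

```python
from math import sqrt


def pearson_via_covariance(v, u):
    """Equation (3): covariance divided by the product of standard deviations."""
    m = len(v)
    mean_v, mean_u = sum(v) / m, sum(u) / m
    cov = sum((a - mean_v) * (b - mean_u) for a, b in zip(v, u)) / m
    sd_v = sqrt(sum((a - mean_v) ** 2 for a in v) / m)
    sd_u = sqrt(sum((b - mean_u) ** 2 for b in u) / m)
    return cov / (sd_v * sd_u)


def pearson_via_cosine(v, u):
    """Equation (4): cosine of the angle between the mean-centered vectors."""
    mean_v, mean_u = sum(v) / len(v), sum(u) / len(u)
    cv = [a - mean_v for a in v]
    cu = [b - mean_u for b in u]
    dot = sum(a * b for a, b in zip(cv, cu))
    return dot / (sqrt(sum(a * a for a in cv)) * sqrt(sum(b * b for b in cu)))


v = [1.0, 2.0, 4.0, 7.0]
u = [2.0, 1.0, 5.0, 6.0]
print(pearson_via_covariance(v, u), pearson_via_cosine(v, u))
```

Because correlation is a cosine after centering, an angle-preserving hash applied to centered vectors also preserves information about their correlation.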


Genome-wide association mapping

by courtesy of D. Weigel


Challenges in two-locus mapping

Scale of the problem

- Typical datasets include on the order of $10^5$ to $10^7$ SNPs.

- Hence we have to consider on the order of $10^{10}$ to $10^{14}$ SNP pairs.

- Enormous multiple hypothesis testing problem.

- Enormous computational runtime problem.
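The pair counts follow from counting unordered SNP pairs. As a quick worked calculation (rounded, and assuming each pair is counted once):

```latex
\binom{n}{2} = \frac{n(n-1)}{2} \approx \frac{n^2}{2}, \qquad
n = 10^{5} \;\Rightarrow\; \binom{n}{2} \approx 5 \times 10^{9}, \qquad
n = 10^{7} \;\Rightarrow\; \binom{n}{2} \approx 5 \times 10^{13}.
```

Both values agree with the stated range of $10^{10}$ to $10^{14}$ up to a small constant factor.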

Our contribution

- We assume binary phenotypes (cases and controls).

- Genotypes may be homozygous or heterozygous.

- We assume $m$ individuals with $n$ SNPs each.

- We define an algorithm that rapidly detects epistatic interactions in a runtime subquadratic in $n$ (Achlioptas et al., KDD 2011).


Common approaches in the literature

Exhaustive enumeration

- Only with special hardware such as GPU implementations: EPIBLASTER (Kam-Thong et al., EJHG 2010)

Filtering approaches

- Statistical criterion, e.g. SNPs with large main effect (Zhang et al., 2007)

- Biological criterion, e.g. underlying PPI (Emily et al., 2009)

Index structure approaches

- fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008)

- TEAM, efficient updates of contingency tables (Zhang et al., 2010)


Difference in correlation for epistasis detection

- We phrase epistasis detection as a difference in correlation problem:

$$\operatorname*{argmax}_{i,j} \; |\rho_{\text{cases}}(x_i, x_j) - \rho_{\text{controls}}(x_i, x_j)| \tag{5}$$

- Different degree of linkage disequilibrium of two loci in cases and controls


Difference in correlation

Theorem

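- Given a matrix of cases $A$ and a matrix of controls $B$ of identical size.

- Finding the maximally correlated pair on

$$\begin{pmatrix} A & A \\ B & 1-B \end{pmatrix} \tag{6}$$

- and on

$$\begin{pmatrix} A & 1-A \\ B & B \end{pmatrix} \tag{7}$$

- is identical to

$$\operatorname*{argmax}_{i,j} \; |\rho_A(x_i, x_j) - \rho_B(x_i, x_j)|. \tag{8}$$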
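The idea behind the stacked matrices can be checked on a toy example. The sketch below reads "correlation" in the lightbulb sense, as the number of rows on which two binary columns agree (an assumption), and verifies for matrix (6) that the agreement between column $i$ of the left block and column $j$ of the right block equals the agreements in $A$ plus the disagreements in $B$. Maximizing it therefore maximizes the agreement in cases minus the agreement in controls; matrix (7) covers the opposite sign. All names and data are hypothetical.

```python
def complement(M):
    """Entrywise 1 - M for a binary matrix given as a list of rows."""
    return [[1 - x for x in row] for row in M]


def stack(top_left, top_right, bottom_left, bottom_right):
    """Build the block matrix [[TL, TR], [BL, BR]]."""
    top = [left + right for left, right in zip(top_left, top_right)]
    bottom = [left + right for left, right in zip(bottom_left, bottom_right)]
    return top + bottom


def agreements(M, i, j):
    """Number of rows on which columns i and j of M carry the same entry."""
    return sum(row[i] == row[j] for row in M)


def best_cross_pair(S, n):
    """Pair (i, j), i < j, maximizing agreement between left column i and right column j."""
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return max(pairs, key=lambda ij: agreements(S, ij[0], n + ij[1]))


A = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 1, 0]]  # cases
B = [[1, 0, 0], [0, 1, 1], [1, 0, 1], [0, 0, 0]]  # controls
n = len(A[0])
S = stack(A, A, B, complement(B))                  # matrix (6)
print(best_cross_pair(S, n))
```

In this toy data, columns 0 and 1 agree on three of four rows in the cases but on only one row in the controls, so they form the pair that the search on matrix (6) singles out.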


The lightbulb epistasis algorithm

Algorithm

1. Binarize original matrices $A_0$ and $B_0$ into $A$ and $B$ by locality sensitive hashing.

2. Compute maximally correlated pair $P_1$ on $\begin{pmatrix} A & A \\ B & 1-B \end{pmatrix}$ via lightbulb.

3. Compute maximally correlated pair $P_2$ on $\begin{pmatrix} A & 1-A \\ B & B \end{pmatrix}$ via lightbulb.

4. Report the maximum of $P_1$ and $P_2$.


Experiments: Nordborg lab SNP dataset

Results on Nordborg SNP dataset

# SNPs   Measurements  Pairs       Exponent  Speedup  Top 10  Top 100  Top 500  Top 1K
100,000  8,255,645     8,186,657   1.38      611      1.00    0.86     0.82     0.80
100,000  52,762,001    51,732,700  1.54      97       1.00    1.00     0.99     0.98

Runtime

- Runtime is empirically $O(n^{1.5})$.

- Epistasis detection on the human genome would require 1 day of computation on a typical desktop PC.
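The slide does not define the "Exponent" and "Speedup" columns. One reading that happens to reproduce both table rows is: exponent = $\log_n(\text{measurements})$ and speedup = (number of all SNP pairs) / (pairs examined). The sketch below checks this arithmetic; both definitions are assumptions, not statements from the talk.

```python
from math import comb, log


def table_check(n_snps, measurements, pairs):
    """Return (exponent, speedup) under the assumed definitions, rounded as in the table."""
    exponent = log(measurements) / log(n_snps)  # measurements = n ** exponent
    speedup = comb(n_snps, 2) / pairs           # exhaustive pairs / examined pairs
    return round(exponent, 2), round(speedup)


print(table_check(100_000, 8_255_645, 8_186_657))
print(table_check(100_000, 52_762_001, 51_732_700))
```

Under this reading, the exponents in the table are consistent with the empirical $O(n^{1.5})$ runtime quoted above.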


First finding

by P. Sämann

- 567 subjects

- 1,075,163 SNPs

- phenotype: Hippocampus volume

- genome-wide significant results ($p < 10^{-12}$)

- near genes involved in cell-cell signaling


Summary

Efficient graph comparison and node pair search

- We define kernels on graphs with discrete node labels, whose runtime is only linear in the number of edges $m$ and the number of iterations $h$ of the Weisfeiler-Lehman algorithm.

- We define a scheme to find the most correlated pair of nodes in a graph, which is subquadratic in the number of nodes $n$.

- A variant of this correlation search algorithm can be used to search for interacting genetic loci in subquadratic time.


Thank you

Group members:

- Nino Shervashidze
- Panagiotis Achlioptas
- Tony Kam-Thong
- Chloé-Agathe Azencott
- Barbara Rakitsch
- Limin Li
- Dominik Grimm
- Theofanis Karaletsos
- Christoph Lippert
- Oliver Stegle
- Hyokun Yun

Collaborators:

- F. Holsboer, MPI Psychiatry
- K. Mehlhorn, MPI Computer Science
- B. Müller-Myhsok, MPI Psychiatry
- B. Schölkopf, MPI-IS
- A. Smola, Yahoo! Research
- D. Weigel, MPI Dev. Biology

Sponsors:

- A.-v.-Humboldt (Chloé)
- DFG
- Microsoft Research Cambridge
- VW (Oliver)


Main references

- Nino Shervashidze and Karsten Borgwardt. Fast subtree kernels on graphs, NIPS 2009.

- Nino Shervashidze et al. Weisfeiler-Lehman graph kernels, JMLR 2011.

- Panagiotis Achlioptas et al. Two-locus association mapping in subquadratic runtime, KDD 2011.

- Tony Kam-Thong et al. Epistasis detection on quantitative phenotypes by exhaustive enumeration using GPUs, ISMB 2011.

- Tony Kam-Thong et al. EPIBLASTER-Fast exhaustive two-locus epistasis detection strategy using graphical processing units, European Journal of Human Genetics, 2011.
