Computational Bottlenecks in Graph Mining
Karsten Borgwardt
Machine Learning and Computational Biology Research Group
Max Planck Institute for Intelligent Systems & Max Planck Institute for Developmental Biology, Tübingen, Germany
MLG, San Diego
August 21, 2011
Karsten Borgwardt Computational Bottlenecks in Graph Mining August 21, 2011 1
Graph mining is everywhere
- Graphs are everywhere
  - Bioinformatics
  - Social Network Analysis
  - Natural Language Processing
- Hot topics in databases/data mining
  - Frequent subgraph mining
  - Dense subgraph mining
  - Graph indexing and search
- Recent trends
  - Data: Growing size of graphs is a challenge for classic approaches
  - Methods: Kernel Machine Learning approaches to graph mining
Problem 1: Measure the similarity of two graphs
- How similar are two graphs?
  - How similar is their structure?
  - How similar are their node labels and edge labels?
- Applications
  - Function prediction for molecules and proteins
  - Change detection in networks of friendship
  - Comparison of semantic structures in NLP
Graph comparison
1. Graph isomorphism and subgraph isomorphism checking
   - Exact match
   - Exponential runtime
2. Graph edit distances
   - Involves definition of a cost function
   - Typically subgraph isomorphism as intermediate step
3. Topological descriptors
   - Lose some of the structural information represented by the graph or
   - Exponential runtime effort
4. Graph kernels (Gärtner et al., 2003; Kashima et al., 2003)
   - Goal 1: Polynomial runtime
   - Goal 2: Applicable to large graphs
   - Goal 3: Applicable to graphs with attributes
Speeding up Graph kernels
- Walks (NIPS 2006c, JMLR 2010)
  - Defined by Gärtner et al. and Kashima et al. in 2003
  - Slow: $O(n^6)$, where n is the number of nodes in G and G′
  - We use Sylvester equations and Kronecker products to compute the same kernel in $O(n^3)$
- Shortest paths (ICDM 2005)
  - Literature claimed there was no obvious way to define a graph kernel based on shortest paths.
  - We defined a graph kernel comparing the lengths of shortest paths in two graphs.
  - Wiener Index from chemoinformatics is an instance of this kernel.
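A kernel that compares shortest-path lengths can be sketched as follows. This is only an illustration under stated assumptions: graphs are unweighted, unlabeled and undirected, node identifiers are mutually comparable (integers here), and two paths are counted as matching when their lengths are equal (a delta kernel on lengths). The function names are hypothetical and do not come from the talk.

```python
from collections import Counter, deque

def shortest_path_lengths(adj):
    """Histogram of shortest-path lengths over all unordered, connected node pairs.

    adj: dict mapping node -> list of neighbours (undirected, unweighted).
    """
    hist = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for target, d in dist.items():
            if source < target:  # count each unordered pair once
                hist[d] += 1
    return hist

def shortest_path_kernel(adj_g, adj_h):
    """Number of (path in G, path in H) pairs with equal length (assumed delta kernel)."""
    hist_g = shortest_path_lengths(adj_g)
    hist_h = shortest_path_lengths(adj_h)
    return sum(count * hist_h[length] for length, count in hist_g.items())

def wiener_index(adj):
    """Sum of shortest-path lengths over all unordered node pairs."""
    return sum(length * count for length, count in shortest_path_lengths(adj).items())

path3 = {0: [1], 1: [0, 2], 2: [1]}            # path on three nodes
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}   # complete graph on three nodes

print(shortest_path_kernel(path3, triangle))
print(wiener_index(path3))
```

Summarising each graph by a histogram of path lengths keeps the comparison step cheap. The all-pairs shortest-path computation dominates the cost.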
Speeding up Graph kernels
- Subgraphs of limited size k (AISTATS 2009)
  - Suggested as ’graphlets’ (k=4) by Pržulj (Bioinformatics, 2007)
  - Corresponding graph kernel scales as $O(n^8)$
  - We turn this into a fast kernel on unlabeled graphs $O(n d^{k-1})$.
- Results from group theory (ICML 2008, 2009)
  - Use concepts from group theory to derive feature vector representation of graphs
  - computable in $O(n^3)$
- Unresolved question:
  - How to compute kernels efficiently on large, labeled graphs?
Weisfeiler-Lehman kernel (Shervashidze and Borgwardt, NIPS 2009)
[Figure: one iteration of the Weisfeiler-Lehman subtree kernel on two labeled graphs G and G′]
a. Given labeled graphs G and G′. Node labels of G: 1, 1, 2, 3, 4, 5. Node labels of G′: 1, 2, 2, 3, 4, 5.
b. 1st iteration, result of steps 1 and 2: multiset-label determination and sorting. G: 1,4; 1,4; 2,35; 3,245; 4,1135; 5,234. G′: 1,4; 2,3; 2,45; 3,245; 4,1235; 5,234.
c. 1st iteration, result of step 3: label compression. 1,4 → 6; 2,3 → 7; 2,35 → 8; 2,45 → 9; 3,245 → 10; 4,1135 → 11; 4,1235 → 12; 5,234 → 13.
d. 1st iteration, result of step 4: relabeling. G: 6, 6, 8, 10, 11, 13. G′: 6, 7, 9, 10, 12, 13.
e. End of the 1st iteration: feature vector representations of G and G′ (counts of original node labels 1 to 5, followed by counts of compressed node labels 6 to 13):

$$\phi^{(1)}_{\mathrm{WLsubtree}}(G) = (2, 1, 1, 1, 1,\; 2, 0, 1, 0, 1, 1, 0, 1)$$

$$\phi^{(1)}_{\mathrm{WLsubtree}}(G') = (1, 2, 1, 1, 1,\; 1, 1, 0, 1, 1, 0, 1, 1)$$

$$k^{(1)}_{\mathrm{WLsubtree}}(G, G') = \langle \phi^{(1)}_{\mathrm{WLsubtree}}(G), \phi^{(1)}_{\mathrm{WLsubtree}}(G') \rangle = 11.$$
Subtree-like patterns
[Figure: a graph with nodes 1 to 6 and a subtree-like pattern built from its nodes]
Weisfeiler-Lehman kernel: Theoretical runtime properties
- Fast Weisfeiler-Lehman kernel (NIPS 2009 and JMLR 2011)
  - Algorithm: Repeat the following steps h times
    1. Sort: Represent each node v as sorted list $L_v$ of its neighbors ($O(m)$)
    2. Compress: Compress this list into a hash value $h(L_v)$ ($O(m)$)
    3. Relabel: Relabel v by the hash value $h(L_v)$ ($O(n)$)
- Runtime analysis
  - per graph pair: Runtime $O(m h)$
  - for N graphs: Runtime $O(N m h + N^2 n h)$ (naively $O(N^2 m h)$)
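The sort, compress and relabel loop can be sketched in a few lines. This is a minimal illustration, not the linear-time implementation from the papers. It assumes the tuple (own label, sorted neighbour labels) can itself serve as the compressed label, and it relies on Python's general-purpose sort and hashing. The function names and the node identifiers "a" to "f" are hypothetical. The edge lists are an assumption: they were reconstructed from the multiset labels in the Weisfeiler-Lehman example of this talk.

```python
from collections import Counter

def to_adjacency(nodes, edges):
    """Undirected adjacency lists from an edge list."""
    adj = {v: [] for v in nodes}
    for u, w in edges:
        adj[u].append(w)
        adj[w].append(u)
    return adj

def wl_features(adj, labels, h):
    """Counts of original labels plus the labels produced by h WL iterations."""
    features = Counter(labels.values())
    current = dict(labels)
    for _ in range(h):
        # Sort and compress: the canonical tuple (own label, sorted neighbour
        # labels) stands in for the hash value of the multiset label.
        relabeled = {
            v: (current[v], tuple(sorted(current[w] for w in adj[v])))
            for v in adj
        }
        current = relabeled  # relabel all nodes simultaneously
        features.update(current.values())
    return features

def wl_subtree_kernel(adj_g, labels_g, adj_h, labels_h, h=1):
    """Inner product of the two WL feature vectors."""
    phi_g = wl_features(adj_g, labels_g, h)
    phi_h = wl_features(adj_h, labels_h, h)
    return sum(count * phi_h[label] for label, count in phi_g.items())

# Assumed reconstruction of the two example graphs G and G'.
labels_g = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 4, "f": 5}
edges_g = [("a", "e"), ("b", "e"), ("c", "d"), ("c", "f"),
           ("d", "e"), ("d", "f"), ("e", "f")]
labels_h = {"a": 1, "b": 2, "c": 2, "d": 3, "e": 4, "f": 5}
edges_h = [("a", "e"), ("b", "d"), ("c", "e"), ("c", "f"),
           ("d", "e"), ("d", "f"), ("e", "f")]

adj_g = to_adjacency(labels_g, edges_g)
adj_h = to_adjacency(labels_h, edges_h)
print(wl_subtree_kernel(adj_g, labels_g, adj_h, labels_h, h=1))  # 11
```

Using the canonical tuple as the new label avoids a lookup table shared between graphs, at the cost of labels that grow with each iteration. A production version would compress them to short integers.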
Weisfeiler-Lehman kernel: Empirical Runtime properties
[Figure: four panels showing runtime in seconds of the pairwise and global computations as a function of the number of graphs N, the graph size n, the subtree height h, and the graph density c]
Weisfeiler-Lehman kernel: Runtime and accuracy
[Figure: runtime (scale from 10 sec to 1000 days) and accuracy (scale from 50 % to 85 %) of the WL, RG, 3 Graphlet, RW and SP kernels on the MUTAG, NCI1, NCI109 and D&D datasets; axis label: graph size]
Problem 2: Find the most similar nodes in a graph
[Figure created by Social Graph]
The lightbulb approach (Paturi et al., COLT 1989)
Maximum correlation
- The lightbulb algorithm tackles the maximum correlation problem on an $m \times n$ matrix A with binary entries:

$$\operatorname*{argmax}_{i,j} \; |\rho_A(x_i, x_j)|. \tag{1}$$
Quadratic runtime algorithm
- The problem can be solved by naive enumeration of all $n^2$ possible solutions.
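For reference, the quadratic baseline can be sketched as follows. It assumes the matrix is given column by column, that no column is constant (otherwise the correlation is undefined), and that each unordered pair is checked once. The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def most_correlated_pair(columns):
    """Check all n(n-1)/2 column pairs and return the (i, j) maximising |rho|."""
    return max(
        combinations(range(len(columns)), 2),
        key=lambda pair: abs(pearson(columns[pair[0]], columns[pair[1]])),
    )

columns = [
    [0, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 1],  # complement of column 0, so rho = -1
]
print(most_correlated_pair(columns))  # (0, 3)
```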
The lightbulb approach
Lightbulb algorithm
1. Given a binary matrix A with m rows and n columns.
2. Repeat l times:
   - Sample k rows
   - Increase a counter for all pairs of columns that match on these k rows.
3. The counters divided by l give an estimate of the correlation $P(x_i = x_j)$.
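The sampling loop above can be sketched as follows. Several choices here are assumptions rather than part of the talk: rows are sampled without replacement, columns are bucketed by their pattern on the sampled rows so that only matching pairs are touched, and the names `lightbulb`, `rounds` and `seed` are hypothetical. The normalised counter is the fraction of rounds in which a pair agreed on all k sampled rows, so it grows with the agreement probability of that pair.

```python
import random
from collections import Counter
from itertools import combinations

def lightbulb(rows, k, rounds, seed=0):
    """Fraction of rounds in which each column pair matched on all k sampled rows.

    rows: list of m binary rows, each of length n.
    """
    rng = random.Random(seed)
    n = len(rows[0])
    counts = Counter()
    for _ in range(rounds):
        sample = rng.sample(rows, k)  # sample k rows
        # Columns with the same pattern on the sample agree on all k rows.
        buckets = {}
        for j in range(n):
            pattern = tuple(row[j] for row in sample)
            buckets.setdefault(pattern, []).append(j)
        for cols in buckets.values():
            for pair in combinations(cols, 2):
                counts[pair] += 1
    return {pair: c / rounds for pair, c in counts.items()}

# Toy data: five random columns plus a sixth that agrees with column 0
# on 180 of 200 rows (a planted correlated pair).
data_rng = random.Random(1)
rows = []
for i in range(200):
    row = [data_rng.randint(0, 1) for _ in range(5)]
    row.append(row[0] if i >= 20 else 1 - row[0])
    rows.append(row)

estimates = lightbulb(rows, k=3, rounds=300)
best = max(estimates, key=estimates.get)
print(best)
```

Larger k separates the planted pair more sharply from the background pairs, but requires more rounds before the planted pair is observed at all.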
Subquadratic runtime
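- With probability $1 - n^{-\alpha}$, the lightbulb algorithm retrieves the most correlated pair in $O\!\left(\alpha\, n^{1 + \frac{\ln p_1}{\ln q_2}} \ln^2 n\right) = O\!\left(n \left(\alpha\, n^{\frac{\ln p_1}{\ln q_2}} \ln^2 n\right)\right)$.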
Difference between the node-pair and the lightbulb setting
Discrepancies
- Node attributes are non-binary in general
- Pearson’s correlation coefficient
Locality sensitive hashing (Charikar, 2002)
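Given a collection of vectors in $\mathbb{R}^m$ we choose a random vector $\vec{r}$ from the m-dimensional Gaussian distribution. Corresponding to this vector $\vec{r}$, we define a hash function $h_{\vec{r}}$ as follows:

$$h_{\vec{r}}(\vec{u}) = \begin{cases} 1 & \text{if } \vec{r} \cdot \vec{u} \geq 0 \\ 0 & \text{if } \vec{r} \cdot \vec{u} < 0 \end{cases} \tag{2}$$

Theorem

For vectors $\vec{v}, \vec{u}$, $\Pr[h_{\vec{r}}(\vec{u}) = h_{\vec{r}}(\vec{v})] = 1 - \frac{\theta(\vec{u}, \vec{v})}{\pi}$, where $\theta$ is the angle between the two vectors.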
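The theorem can be checked empirically with a short simulation. This is a sketch only: the number of trials, the seed and the function names are arbitrary, hypothetical choices, and the Gaussian vector is drawn component by component from a standard normal distribution.

```python
import math
import random

def random_hyperplane_hash(u, r):
    """Hash bit of equation (2): 1 if r . u >= 0, else 0."""
    return 1 if sum(ri * ui for ri, ui in zip(r, u)) >= 0 else 0

def collision_rate(u, v, trials=20000, seed=0):
    """Fraction of random Gaussian vectors r for which u and v hash equally."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        r = [rng.gauss(0.0, 1.0) for _ in u]
        if random_hyperplane_hash(u, r) == random_hyperplane_hash(v, r):
            hits += 1
    return hits / trials

def angle(u, v):
    """Angle between two non-zero vectors, in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norms)))

u = [1.0, 0.0, 0.0]
v = [1.0, 1.0, 0.0]
expected = 1 - angle(u, v) / math.pi  # the angle is pi/4, so this is 0.75
observed = collision_rate(u, v)
print(expected, observed)
```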
Pearson’s correlation coefficient
Link between correlation and cosine
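Karl Pearson defined the correlation of two vectors $\vec{v}, \vec{u}$ in $\mathbb{R}^m$ as

$$\rho = \frac{\operatorname{cov}(\vec{v}, \vec{u})}{\sigma_v \sigma_u}, \tag{3}$$

that is, the covariance of the two vectors divided by the product of their standard deviations. An equivalent geometric way to define it is:

$$\rho = \cos(\vec{v} - \bar{v}, \vec{u} - \bar{u}), \tag{4}$$

where $\bar{v}$ and $\bar{u}$ are the mean values of $\vec{v}$ and $\vec{u}$, respectively.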
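The equivalence follows by writing out both definitions. The derivation below is a sketch that assumes the population convention (division by $m$) for covariance and standard deviation, and uses $\vec{1}$ for the all-ones vector; the factors of $1/m$ cancel either way.

```latex
\begin{align*}
\rho
  &= \frac{\operatorname{cov}(\vec{v}, \vec{u})}{\sigma_v \sigma_u}
   = \frac{\frac{1}{m}\sum_{i=1}^{m} (v_i - \bar{v})(u_i - \bar{u})}
          {\sqrt{\frac{1}{m}\sum_{i=1}^{m} (v_i - \bar{v})^2}\;
           \sqrt{\frac{1}{m}\sum_{i=1}^{m} (u_i - \bar{u})^2}} \\
  &= \frac{\langle \vec{v} - \bar{v}\vec{1},\; \vec{u} - \bar{u}\vec{1} \rangle}
          {\lVert \vec{v} - \bar{v}\vec{1} \rVert \, \lVert \vec{u} - \bar{u}\vec{1} \rVert}
   = \cos\theta\!\left(\vec{v} - \bar{v}\vec{1},\; \vec{u} - \bar{u}\vec{1}\right).
\end{align*}
```

Combined with the collision probability $1 - \theta/\pi$ of random hyperplane hashing, this means that hashing the mean-centered vectors yields equal bits with probability $1 - \arccos(\rho)/\pi$, which increases with $\rho$.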
Genome-wide association mapping
by courtesy of D. Weigel
Challenges in two-locus mapping
Scale of the problem
- Typical datasets include on the order of $10^5$ to $10^7$ SNPs.
- Hence we have to consider on the order of $10^{10}$ to $10^{14}$ SNP pairs.
- Enormous multiple hypothesis testing problem.
- Enormous computational runtime problem.
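The pair counts follow from simple arithmetic, illustrated below with a hypothetical helper `snp_pairs`. Counting unordered pairs gives $n(n-1)/2$, roughly half of $n^2$, so the quoted orders of magnitude correspond to $n^2$.

```python
def snp_pairs(n):
    """Number of unordered SNP pairs among n SNPs."""
    return n * (n - 1) // 2

for n in (10**5, 10**6, 10**7):
    print(f"{n:.0e} SNPs -> {snp_pairs(n):.1e} unordered pairs, n^2 = {n * n:.0e}")
```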
Our contribution
- We assume binary phenotypes (cases and controls).
- Genotypes may be homozygous or heterozygous.
- We assume m individuals with n SNPs each.
- We define an algorithm that rapidly detects epistatic interactions in a runtime subquadratic in n (Achlioptas et al., KDD 2011).
Common approaches in the literature
Exhaustive enumeration
- Only with special hardware such as GPU implementations: EPIBLASTER (Kam-Thong et al., EJHG 2010)

Filtering approaches
- Statistical criterion, e.g. SNPs with large main effect (Zhang et al., 2007)
- Biological criterion, e.g. underlying PPI (Emily et al., 2009)

Index structure approaches
- fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008)
- TEAM, efficient updates of contingency tables (Zhang et al., 2010)
Difference in correlation for epistasis detection
- We phrase epistasis detection as a difference in correlation problem:

$$\operatorname*{argmax}_{i,j} \; |\rho_{\text{cases}}(x_i, x_j) - \rho_{\text{controls}}(x_i, x_j)|. \tag{5}$$

- Different degree of linkage disequilibrium of two loci in cases and controls
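As a point of comparison, objective (5) can be evaluated exhaustively. The sketch below is the quadratic baseline, not the subquadratic algorithm of the talk. It assumes that rows are individuals, columns are SNPs, and no column is constant within cases or controls. All names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def column(matrix, j):
    """Column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]

def max_correlation_difference(cases, controls):
    """SNP pair (i, j) maximising |rho_cases - rho_controls| by full enumeration."""
    n = len(cases[0])

    def score(pair):
        i, j = pair
        rho_cases = pearson(column(cases, i), column(cases, j))
        rho_controls = pearson(column(controls, i), column(controls, j))
        return abs(rho_cases - rho_controls)

    return max(combinations(range(n), 2), key=score)

# Toy data: SNPs 0 and 2 are identical in cases and complementary in controls.
cases = [[0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 1]]
controls = [[0, 0, 1], [1, 0, 0], [0, 1, 1], [1, 1, 0]]
print(max_correlation_difference(cases, controls))  # (0, 2)
```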
Difference in correlation
Theorem
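- Given a matrix of cases A and a matrix of controls B of identical size.
- Finding the maximally correlated pair on

$$\begin{pmatrix} A & A \\ B & 1-B \end{pmatrix} \tag{6}$$

- and on

$$\begin{pmatrix} A & 1-A \\ B & B \end{pmatrix} \tag{7}$$

- is identical to

$$\operatorname*{argmax}_{i,j} \; |\rho_A(x_i, x_j) - \rho_B(x_i, x_j)|. \tag{8}$$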
The lightbulb epistasis algorithm
Algorithm
1. Binarize original matrices $A_0$ and $B_0$ into A and B by locality sensitive hashing.
2. Compute maximally correlated pair $P_1$ on $\begin{pmatrix} A & A \\ B & 1-B \end{pmatrix}$ via lightbulb.
3. Compute maximally correlated pair $P_2$ on $\begin{pmatrix} A & 1-A \\ B & B \end{pmatrix}$ via lightbulb.
4. Report the maximum of $P_1$ and $P_2$.
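The two block matrices used in steps 2 and 3 can be built as below. This sketch covers only the construction and rests on assumptions: A and B are binary matrices stored as lists of rows (individuals) over the same SNP columns, and $1-B$ is read as the entrywise complement. The function name is hypothetical. The binarization and the lightbulb search are not shown.

```python
def stacked_matrices(A, B):
    """Return the block matrices [[A, A], [B, 1-B]] and [[A, 1-A], [B, B]].

    A, B: binary matrices as lists of rows with the same number of columns.
    """
    def complement(row):
        return [1 - x for x in row]

    first = [row + row for row in A] + [row + complement(row) for row in B]
    second = [row + complement(row) for row in A] + [row + row for row in B]
    return first, second

A = [[0, 1], [1, 1]]  # cases
B = [[1, 0], [0, 0]]  # controls
first, second = stacked_matrices(A, B)
print(first)
print(second)
```

In the first matrix, column i of the left block and column j of the right block agree on a case row when SNPs i and j agree, and on a control row when they disagree. A high match count therefore indicates a pair that is more correlated in cases than in controls. The second matrix covers the opposite sign.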
Experiments: Nordborg lab SNP dataset
Results on Nordborg SNP dataset

# SNPs   Measurements  Pairs       Exponent  Speedup  Top 10  Top 100  Top 500  Top 1K
100,000  8,255,645     8,186,657   1.38      611      1.00    0.86     0.82     0.80
100,000  52,762,001    51,732,700  1.54      97       1.00    1.00     0.99     0.98
Runtime
- Runtime is empirically $O(n^{1.5})$.
- Epistasis detection on the human genome would require 1 day of computation on a typical desktop PC.
First finding
by P. Sämann
- 567 subjects
- 1,075,163 SNPs
- phenotype: Hippocampus volume
- genome-wide significant results ($p < 10^{-12}$)
- near genes involved in cell-cell signaling
Summary
Efficient graph comparison and node pair search
- We define kernels on graphs with discrete node labels, whose runtime is only linear in the number of edges m and the number of iterations h of the Weisfeiler-Lehman algorithm.
- We define a scheme to find the most correlated pair of nodes in a graph, which is subquadratic in the number of nodes n.
- A variant of this correlation search algorithm can be used to search for interacting genetic loci in subquadratic time.
Thank you
Group members:
- Nino Shervashidze
- Panagiotis Achlioptas
- Tony Kam-Thong
- Chloé-Agathe Azencott
- Barbara Rakitsch
- Limin Li
- Dominik Grimm
- Theofanis Karaletsos
- Christoph Lippert
- Oliver Stegle
- Hyokun Yun

Collaborators:
- F. Holsboer, MPI Psychiatry
- K. Mehlhorn, MPI Computer Science
- B. Müller-Myhsok, MPI Psychiatry
- B. Schölkopf, MPI-IS
- A. Smola, Yahoo! Research
- D. Weigel, MPI Dev. Biology

Sponsors:
- A.-v.-Humboldt (Chloé)
- DFG
- Microsoft Research Cambridge
- VW (Oliver)
Main references
- Nino Shervashidze and Karsten Borgwardt. Fast subtree kernels on graphs, NIPS 2009.
- Nino Shervashidze et al. Weisfeiler-Lehman graph kernels, JMLR 2011.
- Panagiotis Achlioptas et al. Two-locus association mapping in subquadratic runtime, KDD 2011.
- Tony Kam-Thong et al. Epistasis detection on quantitative phenotypes by exhaustive enumeration using GPUs, ISMB 2011.
- Tony Kam-Thong et al. EPIBLASTER-Fast exhaustive two-locus epistasis detection strategy using graphical processing units, European Journal of Human Genetics, 2011.