TRANSCRIPT
Recent Advances in Stochastic Flow Clustering
Srinivasan Parthasarathy
Data Mining Research Laboratory, Dept. of Computer Science and Engineering
The Ohio State University
http://www.cse.ohio-state.edu/~srini
Graph Clustering: A Fundamental Problem
Given a graph, discover groups of nodes that are strongly connected to one another but weakly connected to the rest of the graph.
What Makes this Problem Hard?
• Scale – High-throughput experiments, social media, high-resolution images.
• Noise – False positive interactions; false negatives.
• Novel Topological Characteristics – Hub nodes; power-law degree distributions.
• Domain Insights – Balance; known biological relationships.
• Dynamics – Changes to nodes, links, and content.
Extant Solutions
• Spectral methods [Shi '00]
• Edge-based agglomerative/divisive methods [Newman '04]
• Graclus/Kernel K-Means [Dhillon '07]
• Metis [Karypis '98] + MQI [Leskovec, Lang '10]
• Markov Clustering [van Dongen '00]
• A host of specialized solutions (e.g. MCODE, LINK-CLUSTER, etc.)
Markov Clustering (MCL)
Stijn van Dongen, 2000
The original stochastic flow clustering algorithm
[Figure: example graph on nodes 1–4 with edges 1–2, 1–4, 2–3, 2–4]

Canonical transition matrix for the example graph (column 2 gives the out-flows of node 2; row 2 gives the in-flows of node 2):

        1      2      3      4
  1    0.33   0.25          0.33
  2    0.33   0.25   0.5    0.33
  3           0.25   0.5
  4    0.33   0.25          0.33
Column Stochastic Matrix: a matrix where each column sums to 1.
Stochastic Flow: an entry in a column stochastic matrix, interpreted as the "flow" or "transition probability".
Repeatedly apply certain operations to the flow matrix until the matrix converges and can be interpreted as a clustering.
Converged flow matrix for the example graph:

        1      2      3      4
  1
  2    1.0    1.0           1.0
  3                  1.0
  4
The MCL algorithm

Input: A, adjacency matrix. Initialize M to MG, the canonical transition matrix: M := MG := (A + I) D^-1.

Repeat until convergence:
1. Expand: M := M*M. Enhances flow to well-connected nodes (i.e. nodes within a community).
2. Inflate: M := M.^r (r usually 2), then renormalize columns. Increases inequality in each column ("rich get richer, poor get poorer"), which reduces flow across communities.
3. Prune. Saves memory by removing entries close to zero; enables faster convergence.

Output clusters. Clustering interpretation: nodes flowing into the same sink node are assigned the same cluster label.
[van Dongen ’00]
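The loop above is just a few lines of linear algebra. A minimal NumPy sketch (an illustration of the operations, not van Dongen's implementation; the pruning threshold and convergence test are simplifying assumptions), run on the 4-node example graph from the earlier slide:

```python
import numpy as np

def mcl(A, r=2, max_iter=100, tol=1e-6, prune=1e-5):
    """Minimal MCL loop: Expand, Inflate, Prune until the flow matrix converges."""
    n = A.shape[0]
    M = A + np.eye(n)                  # add self-loops: A + I
    M = M / M.sum(axis=0)              # canonical transition matrix M_G = (A + I) D^-1
    for _ in range(max_iter):
        M_prev = M.copy()
        M = M @ M                      # Expand: spread flow along longer paths
        M = M ** r                     # Inflate: Hadamard power (r usually 2) ...
        M = M / M.sum(axis=0)          # ... then renormalize columns
        M[M < prune] = 0.0             # Prune entries close to zero
        M = M / M.sum(axis=0)
        if np.abs(M - M_prev).max() < tol:
            break
    return M.argmax(axis=0)            # nodes flowing to the same sink share a label

# The 4-node example graph from the slides, 0-indexed: edges 1-2, 1-4, 2-3, 2-4
A = np.zeros((4, 4))
for i, j in [(0, 1), (0, 3), (1, 2), (1, 3)]:
    A[i, j] = A[j, i] = 1.0
labels = mcl(A)
```

Reading off a cluster label as the argmax of each column matches the sink-node interpretation above.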
MCL Strengths
1. Theoretically well founded. [van Dongen '00]
2. Simple, linear-algebraic operations.
3. Noise tolerant. [Brohee '06, Vlasblom '09]
MCL Limitations
1. Outputs many small clusters. [Satuluri, Parthasarathy '09]
2. Does not scale well. [Chakrabarti, Faloutsos '06]
MCL Flaws
1. Outputs many small clusters. Fix I: Regularized MCL.
2. Does not scale well. Fix II: Multi-Level Regularized MCL. Fix III: Localized Graph Sparsification.
Key Idea I: The Regularize operator

Why does MCL output many clusters? Due to overfitting: it does not penalize divergence of flows between neighbors.

Remedy: penalize divergence in flows between neighbors, using KL divergence (a well-known measure for comparing probability distributions).

This turns out to have a nice closed-form solution:

Regularize(M) := M * (A + I) D^-1 = M * MG
The Regularized MCL (R-MCL) algorithm

Input: A, adjacency matrix. Initialize M to MG, the canonical transition matrix: M := MG := (A + I) D^-1.

Repeat until convergence:
1. Regularize: M := M*MG. Takes into account the flows of the neighbors.
2. Inflate: M := M.^r (r usually 2), then renormalize columns. Increases inequality in each column ("rich get richer, poor get poorer") [Hadamard power + rescaling].
3. Prune. Saves memory by removing entries close to zero; enables faster convergence.

Output clusters: nodes flowing into the same sink node are assigned the same cluster label.
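The only change relative to MCL is the update step: Expand (M := M*M) becomes Regularize (M := M*MG). A minimal NumPy sketch (illustrative tolerances, pruning omitted for brevity; not the authors' implementation):

```python
import numpy as np

def r_mcl(A, r=2, max_iter=100, tol=1e-6):
    """R-MCL loop: Regularize (M @ M_G) replaces MCL's Expand (M @ M)."""
    n = A.shape[0]
    MG = A + np.eye(n)
    MG = MG / MG.sum(axis=0)           # M_G = (A + I) D^-1, fixed across iterations
    M = MG.copy()
    for _ in range(max_iter):
        M_prev = M.copy()
        M = M @ MG                     # Regularize: blend each flow with neighbors' flows
        M = M ** r                     # Inflate: Hadamard power ...
        M = M / M.sum(axis=0)          # ... plus column rescaling
        if np.abs(M - M_prev).max() < tol:
            break
    return M.argmax(axis=0)            # same sink => same cluster label

# Same 4-node example graph as on the MCL slides (0-indexed edges)
A = np.zeros((4, 4))
for i, j in [(0, 1), (0, 3), (1, 2), (1, 3)]:
    A[i, j] = A[j, i] = 1.0
labels = r_mcl(A)
```

Because MG stays fixed, each column's flow is repeatedly averaged against its neighbors' flows before inflation, which is what discourages neighboring nodes from diverging into separate tiny clusters.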
Key Idea II: Multi-level Regularized MCL

1. Coarsen: repeatedly coarsen the input graph (input graph → intermediate graphs → coarsest graph).
2. Run curtailed R-MCL on the coarsest graph, project the flow onto the next finer graph, and repeat up the hierarchy.
3. On the input graph, run R-MCL to convergence and output clusters.

Why this helps: it is faster to run on the smaller graphs first; the coarse levels capture the global topology of the graph; and the projected flow is a good initialization for the refined flow matrix.
Comparison with MCL on Protein Interaction Networks

Dataset (n, m)           Quality Change   Speedup (Time)
Yeast (5k, 15k)          36%              2.5x (0.4s)
Yeast_Noisy (6k, 200k)   300%             57x (8s)
Human (10k, 60k)         21.6%            200x (2s)

[Hardware: quad-core Intel i5 CPU, 3.2 GHz, with 16GB RAM]
Comparison with Graclus and Metis
Quality: MLR-MCL improves upon both Graclus and Metis.
Speed: MLR-MCL is faster than Graclus, comparable to Metis.
Key Idea III: Graph Sparsification

Is there a simple pre-processing of the graph to reduce the edge set that can "clarify" or "simplify" its cluster structure?

[Figure: original vs. sparsified graph]
Our Approach

Main Idea: retain edges which are likely to be intra-cluster edges, while discarding likely inter-cluster edges.

Similarity-based Sparsification Heuristic: an edge (i,j) is likely to be an intra-cluster edge if vertices i and j have highly overlapping adjacency lists:

Sim(i,j) = |Adj(i) ∩ Adj(j)| / |Adj(i) ∪ Adj(j)|
Algorithm: Global Sparsification (G-Spar)

Parameter: sparsification ratio, s

1. For each edge (i,j): calculate Sim(i,j).
2. Retain the top s% of edges in order of Sim; discard the others.
Dense clusters are over-represented, sparse clusters under-represented. Works great when the goal is just to find the top communities.
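G-Spar can be sketched in a few lines over an adjacency-list graph (the helper names and the two-triangle test graph below are my own illustration, not from the paper):

```python
def jaccard(adj, i, j):
    """Overlap of adjacency lists, counting each node as its own neighbor."""
    ni, nj = adj[i] | {i}, adj[j] | {j}
    return len(ni & nj) / len(ni | nj)

def g_spar(adj, s):
    """Global Sparsification: keep the top s fraction of all edges, ranked by Sim."""
    edges = {(min(i, j), max(i, j)) for i in adj for j in adj[i]}
    ranked = sorted(edges, key=lambda e: jaccard(adj, *e), reverse=True)
    keep = max(1, round(s * len(ranked)))
    return set(ranked[:keep])

# Two triangles {0,1,2} and {3,4,5} joined by the low-similarity bridge edge (2,3)
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
kept = g_spar(adj, s=6 / 7)   # retain 6 of the 7 edges
```

The bridge edge has the lowest Jaccard score, so the global ranking drops it first, which illustrates why G-Spar cleanly separates well-defined communities.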
Algorithm: Local Sparsification (L-Spar)

Parameter: sparsification exponent, e (0 < e < 1)

1. For each node i of degree d_i:
   (a) For each neighbor j: calculate Sim(i,j).
   (b) Retain the top (d_i)^e neighbors in order of Sim, for node i.

Edges compete to be retained locally ("think globally, act locally" paradigm).
Ensures representation of clusters of varying densities.
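The per-node selection above can be sketched as follows (the adjacency-list representation and the two-triangle example are illustrative, not taken from the paper):

```python
import math

def jaccard(adj, i, j):
    """Overlap of adjacency lists, counting each node as its own neighbor."""
    ni, nj = adj[i] | {i}, adj[j] | {j}
    return len(ni & nj) / len(ni | nj)

def l_spar(adj, e):
    """Local Sparsification: node i keeps its top ceil(d_i ** e) neighbors by Sim."""
    kept = set()
    for i, neigh in adj.items():
        k = max(1, math.ceil(len(neigh) ** e))
        top = sorted(neigh, key=lambda j: jaccard(adj, i, j), reverse=True)[:k]
        kept.update((min(i, j), max(i, j)) for j in top)   # canonical undirected edges
    return kept

# Two triangles joined by a bridge; both endpoints of (2,3) prefer their own triangle
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
kept = l_spar(adj, e=0.5)
```

Unlike the global ranking, each node keeps a budget proportional to d_i^e, so a sparse-but-real cluster still keeps some of its edges even when a dense cluster would out-rank it globally.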
But... similarity computation is expensive!

Solution: a randomized, approximate solution based on Minwise Hashing [Broder et al., 1998].
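The minwise-hashing idea: two sets' minimum hash values collide with probability exactly equal to their Jaccard similarity, so comparing short signatures approximates Sim without intersecting full adjacency lists. A sketch (the linear hash family and the 128-function signature length are illustrative choices, not parameters from the paper):

```python
import random

random.seed(7)
P = (1 << 61) - 1   # large Mersenne prime modulus for hashes h(x) = (a*x + b) mod P
HASHES = [(random.randrange(1, P), random.randrange(P)) for _ in range(128)]

def signature(s):
    """MinHash signature: the minimum of each hash function over the set's elements."""
    return [min((a * x + b) % P for x in s) for a, b in HASHES]

def approx_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)

A = set(range(1, 11))    # {1, ..., 10}
B = set(range(6, 16))    # {6, ..., 15}: true Jaccard similarity is 5/15 = 1/3
est = approx_jaccard(signature(A), signature(B))
```

Signatures are computed once per node, after which each edge's similarity estimate costs only a signature comparison.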
L-Spar: Results Using MLR-MCL

Dataset   Spars. Ratio   Speedup   Quality
Yeast     17%            17x       +4%
Human     40%            6x        +1%

[Hardware: quad-core Intel i5 CPU, 3.2 GHz, with 16GB RAM]
Results using MLR-MCL

                         Spars.   RandomEdge         G-Spar             L-Spar
Dataset (Nodes, Edges)   Ratio    Spdup  QualityΔ    Spdup  QualityΔ    Spdup  QualityΔ
BioGrid (6K, 200K)       17%      6x     -16%        38x    -23%        17x    +4%
Wiki (1.1M, 53M)         15%      19x    -58%        92x    -54%        23x    -4.5%
Orkut (3M, 117M)         17%      6x     -32%        39x    -59%        22x    0
Twitter (146K, 83M)      4%       63x    -90%        188x   +10%        22x    +40%

L-Spar enables high speed-ups without significant loss of accuracy.
1 Theoretical Results

• Theorem 2.1(c) of [1]: The multiplicity of 0 as an eigenvalue of the graph Laplacian is equal to the number of connected components of the graph.
• Theorem 2.2(d) of [1]: ∑_{i=1}^{n} λ_i = 2|E(G)| = ∑_v d(v)
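Both statements are easy to verify numerically; a small sketch on a graph with two components (the two-triangle example is my own, not from [1]):

```python
import numpy as np

# Two disjoint triangles: 2 connected components, 6 edges, every degree 2
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(axis=1)) - A      # graph Laplacian L = D - A
eig = np.linalg.eigvalsh(L)         # real eigenvalues (L is symmetric)

zero_multiplicity = int(np.sum(np.abs(eig) < 1e-9))  # Theorem 2.1(c): equals #components
eigen_sum = eig.sum()                                # Theorem 2.2(d): equals 2|E| = sum of degrees
```

Here the spectrum is {0, 0, 3, 3, 3, 3}: two zero eigenvalues for the two components, summing to 12 = 2·|E|.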
2 SIGMOD Paper

Flickr: 33911; Flickr.spars: 33953 (64903 nodes total)
Wikipedia: 1; Wikipedia.spars: 164 (1129060 nodes total)
Orkut: 186; Orkut.spars: 257 (3072626 nodes total)

Table 1: Spectral Gap Comparisons

Dataset   Original              L-Spar                G-Spar
BioGrid   λ_4197 = 0.34284616   λ_4197 = 0.08680336   λ_4894 = 0.14459744
DIP       λ_45 = 0.117378       λ_54 = 0.03226163     λ_888 = 0.036117
Human     λ_219 = 0.1038301     λ_234 = 0.05650493    λ_2266 = 0.05255889

Figure 1: BioGrid Eigenvalues Comparison
References

[1] B. Mohar. The Laplacian spectrum of graphs. Graph Theory, Combinatorics, and Applications, 2:871–898, 1991.
Impact of Sparsification on Spectrum: Yeast PPI
• Global sparsification results in multiple components.
• Local sparsification seems to match the trends of the original graph.

Human PPI

Synthetic data (Fortunato '09): as the clustering problem gets harder, L-Spar is more beneficial.
SOCIAL IMPACT: EMERGENCY RESPONSE AND FLOOD MAPPING
Copyright 2006, Data Mining Research Laboratory
Crisis Informatics and Flood Mapping
• Disaster informatics, or crisis informatics, is the study of the use of information and technology in the different phases of disasters or crises.
• Flood mapping: mapping the extent of flood damage is a key step for relief and recovery.
Chennai Floods (2015): Social Sensing Enhanced Flood Mapping

[Figure: prior to the 1st depression vs. after the 3rd depression]
Markov Clustering and Flood Mapping
• Water delineation: segmentation of remote-sensed images is a key strategy employed in flood mapping.
• Confounding factors: cloud cover, sinuous river beds, urban-area reflectance effects.
• Key Idea: semi-supervised flood mapping using MLR-MCL as a key pre-processing step.
Procedure
1. Cluster patches of the satellite image with MLR-MCL (graph-based image segmentation [Shi-Malik '00]).
2. Guided patch labeling: (a) volunteer crowdsourcing; (b) social-media-induced labels.
3. Semi-supervised learning of flood extent: HUG-FM (a KNN variant) or SEANO (a neural net).
Houston Floods (original image)

Otsu Thresholding
• Clustering-based image thresholding [Otsu '79]
• Converts the image to a binary image
• Simple and widely used, but prone to false positives
• Detects highways as waterbodies
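Otsu's method itself is compact: exhaustively choose the gray level that maximizes the between-class variance of the two pixel classes it induces. A sketch for 8-bit images (the synthetic bimodal "image" is my own illustration):

```python
import numpy as np

def otsu_threshold(img):
    """Return the gray level t maximizing between-class variance w0*w1*(mu0-mu1)^2."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()                      # gray-level probabilities
    levels = np.arange(256)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()      # class weights below/above t
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = (levels[:t] * p[:t]).sum() / w0  # class means
        mu1 = (levels[t:] * p[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t

# A synthetic bimodal "image": dark water-like pixels at 50, bright land at 200
img = np.array([50] * 500 + [200] * 500)
t = otsu_threshold(img)
```

The false-positive problem noted above follows directly from this formulation: any dark, low-reflectance surface (such as a highway) falls on the "water" side of a purely intensity-based split.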
Watershed Algorithm
• Relies on pre-identified landmarks [Beucher '79, Meyer '92]
• Applies gradient transformation and thresholding
• Prone to smoothing errors
Normal Thresholding
• Improved variant of Otsu and watershed
• Relies on land-cover identification
• Nuanced threshold separation of types of land cover from water
• State-of-the-art in remote sensing [2016]
HUG-FM + MLR-MCL
Quantitative evaluation on the Houston dataset

Method       Accuracy   F1
Otsu Thr     0.89       0.74
Watershed    0.89       0.68
Normal Thr   0.87       0.84
HUG-FM       0.96       0.87
SEANO        0.97       0.90

HUG-FM and SEANO rely on MLR-MCL preprocessing; quality (SEANO) vs. speed (HUG-FM) tradeoff.
Standard Remote Sensing Methods

[Figure panels: Chennai Floods 11/24 (between the 2nd and 3rd depression); Watershed (Beucher, Meyer 1992); N-cuts (100 partitions) (Shi, Malik '01); HUG-FM + MLR-MCL patching]
Take Home: Recent Advances in MCL

Key Idea 1: Regularization. Avoids fragmenting community structure. [SIGKDD'09, ACM BCB'10, Bioinformatics 2012]
Key Idea 2: Multi-level regularization. Improves scalability. [SIGKDD'09, ACM BCB'10]
Key Idea 3: Sparsification, a simple pre-processing step that makes a difference. Reduces clustering time from hours down to minutes [SIGMOD'11, WWW'13]; theoretical rationale in [SoCG'17].
Key Ideas 4 & 5: Soft clustering [ISMB'12] and GPU acceleration [HiPC'14].

Social Impact: use of MLR-MCL for flood mapping shows promise.
References (incomplete)
1. MCL – Graph Clustering by Flow Simulation. S. van Dongen, Ph.D. thesis, University of Utrecht, 2000.
2. Graclus – Weighted Graph Cuts without Eigenvectors: A Multilevel Approach. Dhillon et al., IEEE Trans. PAMI, 2007.
3. Metis – A fast and high quality multilevel scheme for partitioning irregular graphs. Karypis and Kumar, SIAM J. on Scientific Computing, 1998.
4. Normalized Cuts and Image Segmentation. Shi and Malik, IEEE Trans. PAMI, 2000.
5. Finding and evaluating community structure in networks. Newman and Girvan, Phys. Rev. E 69, 2004.
6. The identification of functional modules from the genomic association of genes. Snel et al., PNAS, 2002.
Thanks & Acknowledgements
• Joint work with: Albert Liang, Peter Jacobs, Nikhita Vedula, Venu Satuluri (Twitter), Yu-Keng Shih (GraphSQL), Sitaram Asur (HP Laboratories), Duygu Ucar (Jackson Laboratories)
• Grant acknowledgements: NSF HazardSEES #1520870 and SOCS #IIS-1111118
• Software and references: https://sites.google.com/site/stochasticflowclustering/