TRANSCRIPT
Recent Advances in Stochastic Flow Clustering
Srinivasan Parthasarathy
Data Mining Research Laboratory, Dept. of Computer Science and Engineering
The Ohio State University
http://www.cse.ohio-state.edu/~srini
Graph Clustering: A Fundamental Problem
Given a graph, discover groups of nodes that are strongly connected to one another but weakly connected to the rest of the graph.
What Makes this Problem Hard?
• Scale – High-throughput experiments, social media, high-resolution images.
• Noise – False positive interactions; false negatives.
• Novel Topological Characteristics – Hub nodes; power-law degree distributions.
• Domain Insights – Balance; known biological relationships.
• Dynamics – Changes to nodes, links, and content.
Extant Solutions
• Spectral methods [Shi '00]
• Edge-based agglomerative/divisive methods [Newman '04]
• Graclus/Kernel K-Means [Dhillon '07]
• Metis [Karypis '98] + MQI [Leskovec, Lang '10]
• Markov Clustering [van Dongen '00]
• A host of specialized solutions (e.g. MCODE, LINK-CLUSTER, etc.)
Markov Clustering (MCL)
Stijn van Dongen, 2000
The original stochastic flow clustering algorithm
[Figure: example graph on nodes 1–4 with edges 1–2, 1–4, 2–3, 2–4]

Canonical transition matrix for the example graph (column 2 gives the out-flows of node 2; row 2 gives the in-flows of node 2):

        1      2      3      4
  1    0.33   0.25          0.33
  2    0.33   0.25   0.5    0.33
  3           0.25   0.5
  4    0.33   0.25          0.33
Column Stochastic Matrix: a matrix where each column sums to 1.
Stochastic Flow: an entry in a column stochastic matrix, interpreted as the "flow" or "transition probability".
Repeatedly apply certain operations to the flow matrix until the matrix converges and can be interpreted as a clustering.
Converged flow matrix for the example graph:

        1      2      3      4
  1
  2    1.0    1.0           1.0
  3                  1.0
  4
The MCL algorithm

Input: A, adjacency matrix. Initialize M to MG, the canonical transition matrix: M := MG := (A + I) D^-1.

Repeat until convergence:
1. Expand: M := M*M. Enhances flow to well-connected nodes (i.e. nodes within a community).
2. Inflate: M := M.^r (r usually 2), then renormalize columns. Increases inequality in each column ("rich get richer, poor get poorer"), which reduces flow across communities.
3. Prune. Saves memory by removing entries close to zero; enables faster convergence.

Output clusters. Clustering interpretation: nodes flowing into the same sink node are assigned the same cluster label.
[van Dongen ’00]
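The loop above is just a few lines of linear algebra. A minimal NumPy sketch (an illustration of the operations, not van Dongen's implementation; the pruning threshold and convergence test are simplifying assumptions), run on the 4-node example graph from the earlier slide:

```python
import numpy as np

def mcl(A, r=2, max_iter=100, tol=1e-6, prune=1e-5):
    """Minimal MCL loop: Expand, Inflate, Prune until the flow matrix converges."""
    n = A.shape[0]
    M = A + np.eye(n)                  # add self-loops: A + I
    M = M / M.sum(axis=0)              # canonical transition matrix M_G = (A + I) D^-1
    for _ in range(max_iter):
        M_prev = M.copy()
        M = M @ M                      # Expand: spread flow along longer paths
        M = M ** r                     # Inflate: Hadamard power (r usually 2) ...
        M = M / M.sum(axis=0)          # ... then renormalize columns
        M[M < prune] = 0.0             # Prune entries close to zero
        M = M / M.sum(axis=0)
        if np.abs(M - M_prev).max() < tol:
            break
    return M.argmax(axis=0)            # nodes flowing to the same sink share a label

# The 4-node example graph from the slides, 0-indexed: edges 1-2, 1-4, 2-3, 2-4
A = np.zeros((4, 4))
for i, j in [(0, 1), (0, 3), (1, 2), (1, 3)]:
    A[i, j] = A[j, i] = 1.0
labels = mcl(A)
```

Reading off a cluster label as the argmax of each column matches the sink-node interpretation above.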
MCL Strengths
1. Theoretically well founded. [van Dongen '00]
2. Simple, linear-algebraic operations.
3. Noise tolerant. [Brohee '06, Vlasblom '09]
MCL Limitations
1. Outputs many small clusters. [Satuluri, Parthasarathy '09]
2. Does not scale well. [Chakrabarti, Faloutsos '06]
MCL Flaws
1. Outputs many small clusters. Fix I: Regularized MCL.
2. Does not scale well. Fix II: Multi-Level Regularized MCL. Fix III: Localized Graph Sparsification.
Key Idea I: The Regularize operator

Why does MCL output many clusters? Due to overfitting: it does not penalize divergence of flows between neighbors.

Remedy: penalize divergence in flows between neighbors, using KL divergence (a well-known measure for comparing probability distributions).

This turns out to have a nice closed-form solution:

Regularize(M) := M * (A + I) D^-1 = M * MG
The Regularized MCL (R-MCL) algorithm

Input: A, adjacency matrix. Initialize M to MG, the canonical transition matrix: M := MG := (A + I) D^-1.

Repeat until convergence:
1. Regularize: M := M*MG. Takes into account the flows of the neighbors.
2. Inflate: M := M.^r (r usually 2), then renormalize columns. Increases inequality in each column ("rich get richer, poor get poorer") [Hadamard power + rescaling].
3. Prune. Saves memory by removing entries close to zero; enables faster convergence.

Output clusters: nodes flowing into the same sink node are assigned the same cluster label.
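The only change relative to MCL is the update step: Expand (M := M*M) becomes Regularize (M := M*MG). A minimal NumPy sketch (illustrative tolerances, pruning omitted for brevity; not the authors' implementation):

```python
import numpy as np

def r_mcl(A, r=2, max_iter=100, tol=1e-6):
    """R-MCL loop: Regularize (M @ M_G) replaces MCL's Expand (M @ M)."""
    n = A.shape[0]
    MG = A + np.eye(n)
    MG = MG / MG.sum(axis=0)           # M_G = (A + I) D^-1, fixed across iterations
    M = MG.copy()
    for _ in range(max_iter):
        M_prev = M.copy()
        M = M @ MG                     # Regularize: blend each flow with neighbors' flows
        M = M ** r                     # Inflate: Hadamard power ...
        M = M / M.sum(axis=0)          # ... plus column rescaling
        if np.abs(M - M_prev).max() < tol:
            break
    return M.argmax(axis=0)            # same sink => same cluster label

# Same 4-node example graph as on the MCL slides (0-indexed edges)
A = np.zeros((4, 4))
for i, j in [(0, 1), (0, 3), (1, 2), (1, 3)]:
    A[i, j] = A[j, i] = 1.0
labels = r_mcl(A)
```

Because MG stays fixed, each column's flow is repeatedly averaged against its neighbors' flows before inflation, which is what discourages neighboring nodes from diverging into separate tiny clusters.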
Key Idea II: Multi-level Regularized MCL

1. Coarsen: repeatedly coarsen the input graph (input graph → intermediate graphs → coarsest graph).
2. Run curtailed R-MCL on the coarsest graph, project the flow onto the next finer graph, and repeat up the hierarchy.
3. On the input graph, run R-MCL to convergence and output clusters.

Why this helps: it is faster to run on the smaller graphs first; the coarse levels capture the global topology of the graph; and the projected flow is a good initialization for the refined flow matrix.
Comparison with MCL on Protein Interaction Networks

Dataset (n, m)           Quality Change   Speedup (Time)
Yeast (5k, 15k)          36%              2.5x (0.4s)
Yeast_Noisy (6k, 200k)   300%             57x (8s)
Human (10k, 60k)         21.6%            200x (2s)

[Hardware: quad-core Intel i5 CPU, 3.2 GHz, with 16GB RAM]
Comparison with Graclus and Metis
Quality: MLR-MCL improves upon both Graclus and Metis.
Speed: MLR-MCL is faster than Graclus, comparable to Metis.
Key Idea III: Graph Sparsification

Is there a simple pre-processing of the graph to reduce the edge set that can "clarify" or "simplify" its cluster structure?

[Figure: original vs. sparsified graph]
Our Approach

Main Idea: retain edges which are likely to be intra-cluster edges, while discarding likely inter-cluster edges.

Similarity-based Sparsification Heuristic: an edge (i,j) is likely to be an intra-cluster edge if vertices i and j have highly overlapping adjacency lists:

Sim(i,j) = |Adj(i) ∩ Adj(j)| / |Adj(i) ∪ Adj(j)|
Algorithm: Global Sparsification (G-Spar)

Parameter: sparsification ratio, s

1. For each edge (i,j): calculate Sim(i,j).
2. Retain the top s% of edges in order of Sim; discard the others.
Dense clusters are over-represented, sparse clusters under-represented. Works great when the goal is just to find the top communities.
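G-Spar can be sketched in a few lines over an adjacency-list graph (the helper names and the two-triangle test graph below are my own illustration, not from the paper):

```python
def jaccard(adj, i, j):
    """Overlap of adjacency lists, counting each node as its own neighbor."""
    ni, nj = adj[i] | {i}, adj[j] | {j}
    return len(ni & nj) / len(ni | nj)

def g_spar(adj, s):
    """Global Sparsification: keep the top s fraction of all edges, ranked by Sim."""
    edges = {(min(i, j), max(i, j)) for i in adj for j in adj[i]}
    ranked = sorted(edges, key=lambda e: jaccard(adj, *e), reverse=True)
    keep = max(1, round(s * len(ranked)))
    return set(ranked[:keep])

# Two triangles {0,1,2} and {3,4,5} joined by the low-similarity bridge edge (2,3)
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
kept = g_spar(adj, s=6 / 7)   # retain 6 of the 7 edges
```

The bridge edge has the lowest Jaccard score, so the global ranking drops it first, which illustrates why G-Spar cleanly separates well-defined communities.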
Algorithm: Local Sparsification (L-Spar)

Parameter: sparsification exponent, e (0 < e < 1)

1. For each node i of degree d_i:
   (a) For each neighbor j: calculate Sim(i,j).
   (b) Retain the top (d_i)^e neighbors in order of Sim, for node i.

Edges compete to be retained locally ("think globally, act locally" paradigm).
Ensures representation of clusters of varying densities.
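The per-node selection above can be sketched as follows (the adjacency-list representation and the two-triangle example are illustrative, not taken from the paper):

```python
import math

def jaccard(adj, i, j):
    """Overlap of adjacency lists, counting each node as its own neighbor."""
    ni, nj = adj[i] | {i}, adj[j] | {j}
    return len(ni & nj) / len(ni | nj)

def l_spar(adj, e):
    """Local Sparsification: node i keeps its top ceil(d_i ** e) neighbors by Sim."""
    kept = set()
    for i, neigh in adj.items():
        k = max(1, math.ceil(len(neigh) ** e))
        top = sorted(neigh, key=lambda j: jaccard(adj, i, j), reverse=True)[:k]
        kept.update((min(i, j), max(i, j)) for j in top)   # canonical undirected edges
    return kept

# Two triangles joined by a bridge; both endpoints of (2,3) prefer their own triangle
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
kept = l_spar(adj, e=0.5)
```

Unlike the global ranking, each node keeps a budget proportional to d_i^e, so a sparse-but-real cluster still keeps some of its edges even when a dense cluster would out-rank it globally.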
But... similarity computation is expensive!

Solution: a randomized, approximate solution based on Minwise Hashing [Broder et al., 1998].
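The minwise-hashing idea: two sets' minimum hash values collide with probability exactly equal to their Jaccard similarity, so comparing short signatures approximates Sim without intersecting full adjacency lists. A sketch (the linear hash family and the 128-function signature length are illustrative choices, not parameters from the paper):

```python
import random

random.seed(7)
P = (1 << 61) - 1   # large Mersenne prime modulus for hashes h(x) = (a*x + b) mod P
HASHES = [(random.randrange(1, P), random.randrange(P)) for _ in range(128)]

def signature(s):
    """MinHash signature: the minimum of each hash function over the set's elements."""
    return [min((a * x + b) % P for x in s) for a, b in HASHES]

def approx_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)

A = set(range(1, 11))    # {1, ..., 10}
B = set(range(6, 16))    # {6, ..., 15}: true Jaccard similarity is 5/15 = 1/3
est = approx_jaccard(signature(A), signature(B))
```

Signatures are computed once per node, after which each edge's similarity estimate costs only a signature comparison.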
L-Spar: Results Using MLR-MCL

Dataset   Spars. Ratio   Speedup   Quality
Yeast     17%            17x       +4%
Human     40%            6x        +1%

[Hardware: quad-core Intel i5 CPU, 3.2 GHz, with 16GB RAM]
Results using MLR-MCL

                         Spars.   RandomEdge         G-Spar             L-Spar
Dataset (Nodes, Edges)   Ratio    Spdup  QualityΔ    Spdup  QualityΔ    Spdup  QualityΔ
BioGrid (6K, 200K)       17%      6x     -16%        38x    -23%        17x    +4%
Wiki (1.1M, 53M)         15%      19x    -58%        92x    -54%        23x    -4.5%
Orkut (3M, 117M)         17%      6x     -32%        39x    -59%        22x    0
Twitter (146K, 83M)      4%       63x    -90%        188x   +10%        22x    +40%

L-Spar enables high speed-ups without significant loss of accuracy.
1 Theoretical Results

• Theorem 2.1(c) of [1]: The multiplicity of 0 as an eigenvalue of the graph Laplacian is equal to the number of connected components of the graph.
• Theorem 2.2(d) of [1]: ∑_{i=1}^{n} λ_i = 2|E(G)| = ∑_v d(v)
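Both statements are easy to verify numerically; a small sketch on a graph with two components (the two-triangle example is my own, not from [1]):

```python
import numpy as np

# Two disjoint triangles: 2 connected components, 6 edges, every degree 2
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(axis=1)) - A      # graph Laplacian L = D - A
eig = np.linalg.eigvalsh(L)         # real eigenvalues (L is symmetric)

zero_multiplicity = int(np.sum(np.abs(eig) < 1e-9))  # Theorem 2.1(c): equals #components
eigen_sum = eig.sum()                                # Theorem 2.2(d): equals 2|E| = sum of degrees
```

Here the spectrum is {0, 0, 3, 3, 3, 3}: two zero eigenvalues for the two components, summing to 12 = 2·|E|.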
2 SIGMOD Paper

Flickr: 33911; Flickr.spars: 33953 (64903 nodes total)
Wikipedia: 1; Wikipedia.spars: 164 (1129060 nodes total)
Orkut: 186; Orkut.spars: 257 (3072626 nodes total)

Table 1: Spectral Gap Comparisons

Dataset   Original              L-Spar                G-Spar
BioGrid   λ_4197 = 0.34284616   λ_4197 = 0.08680336   λ_4894 = 0.14459744
DIP       λ_45 = 0.117378       λ_54 = 0.03226163     λ_888 = 0.036117
Human     λ_219 = 0.1038301     λ_234 = 0.05650493    λ_2266 = 0.05255889

Figure 1: BioGrid Eigenvalues Comparison
References

[1] B. Mohar. The Laplacian spectrum of graphs. Graph Theory, Combinatorics, and Applications, 2:871–898, 1991.
Impact of Sparsification on Spectrum: Yeast PPI
• Global sparsification results in multiple components.
• Local sparsification seems to match the trends of the original graph.

Human PPI

Synthetic data (Fortunato '09): as the clustering problem gets harder, L-Spar is more beneficial.
SOCIAL IMPACT: EMERGENCY RESPONSE AND FLOOD MAPPING
Copyright 2006, Data Mining Research Laboratory
Crisis Informatics and Flood Mapping
• Disaster informatics, or crisis informatics, is the study of the use of information and technology in the different phases of disasters or crises.
• Flood mapping: mapping the extent of flood damage is a key step for relief and recovery.
Chennai Floods (2015): Social Sensing Enhanced Flood Mapping

[Figure: prior to the 1st depression vs. after the 3rd depression]
Markov Clustering and Flood Mapping
• Water delineation: segmentation of remote-sensed images is a key strategy employed in flood mapping.
• Confounding factors: cloud cover, sinuous river beds, urban-area reflectance effects.
• Key Idea: semi-supervised flood mapping using MLR-MCL as a key pre-processing step.
Procedure
1. Cluster patches of the satellite image with MLR-MCL (graph-based image segmentation [Shi-Malik '00]).
2. Guided patch labeling: (a) volunteer crowdsourcing; (b) social-media-induced labels.
3. Semi-supervised learning of flood extent: HUG-FM (a KNN variant) or SEANO (a neural net).
Houston Floods (original image)

Otsu Thresholding
• Clustering-based image thresholding [Otsu '79]
• Converts the image to a binary image
• Simple and widely used, but prone to false positives
• Detects highways as waterbodies
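Otsu's method itself is compact: exhaustively choose the gray level that maximizes the between-class variance of the two pixel classes it induces. A sketch for 8-bit images (the synthetic bimodal "image" is my own illustration):

```python
import numpy as np

def otsu_threshold(img):
    """Return the gray level t maximizing between-class variance w0*w1*(mu0-mu1)^2."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()                      # gray-level probabilities
    levels = np.arange(256)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()      # class weights below/above t
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = (levels[:t] * p[:t]).sum() / w0  # class means
        mu1 = (levels[t:] * p[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t

# A synthetic bimodal "image": dark water-like pixels at 50, bright land at 200
img = np.array([50] * 500 + [200] * 500)
t = otsu_threshold(img)
```

The false-positive problem noted above follows directly from this formulation: any dark, low-reflectance surface (such as a highway) falls on the "water" side of a purely intensity-based split.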
Watershed Algorithm
• Relies on pre-identified landmarks [Beucher '79, Meyer '92]
• Applies gradient transformation and thresholding
• Prone to smoothing errors
Normal Thresholding
• Improved variant of Otsu and watershed
• Relies on land-cover identification
• Nuanced threshold separation of types of land cover from water
• State-of-the-art in remote sensing [2016]
HUG-FM + MLR-MCL
Quantitative evaluation on the Houston dataset

Method       Accuracy   F1
Otsu Thr     0.89       0.74
Watershed    0.89       0.68
Normal Thr   0.87       0.84
HUG-FM       0.96       0.87
SEANO        0.97       0.90

HUG-FM and SEANO rely on MLR-MCL preprocessing; quality (SEANO) vs. speed (HUG-FM) tradeoff.
Standard Remote Sensing Methods

[Figure panels: Chennai Floods 11/24 (between the 2nd and 3rd depression); Watershed (Beucher, Meyer 1992); N-cuts (100 partitions) (Shi, Malik '01); HUG-FM + MLR-MCL patching]
Take Home: Recent Advances in MCL

Key Idea 1: Regularization. Avoids fragmenting community structure. [SIGKDD'09, ACM BCB'10, Bioinformatics 2012]
Key Idea 2: Multi-level regularization. Improves scalability. [SIGKDD'09, ACM BCB'10]
Key Idea 3: Sparsification, a simple pre-processing step that makes a difference. Reduces clustering time from hours down to minutes [SIGMOD'11, WWW'13]; theoretical rationale in [SoCG'17].
Key Ideas 4 & 5: Soft clustering [ISMB'12] and GPU acceleration [HiPC'14].

Social Impact: use of MLR-MCL for flood mapping shows promise.
References (incomplete)
1. MCL – Graph Clustering by Flow Simulation. S. van Dongen, Ph.D. thesis, University of Utrecht, 2000.
2. Graclus – Weighted Graph Cuts without Eigenvectors: A Multilevel Approach. Dhillon et al., IEEE Trans. PAMI, 2007.
3. Metis – A fast and high quality multilevel scheme for partitioning irregular graphs. Karypis and Kumar, SIAM J. on Scientific Computing, 1998.
4. Normalized Cuts and Image Segmentation. Shi and Malik, IEEE Trans. PAMI, 2000.
5. Finding and evaluating community structure in networks. Newman and Girvan, Phys. Rev. E 69, 2004.
6. The identification of functional modules from the genomic association of genes. Snel et al., PNAS, 2002.
Thanks & Acknowledgements
• Joint work with: Albert Liang, Peter Jacobs, Nikhita Vedula, Venu Satuluri (Twitter), Yu-Keng Shih (GraphSQL), Sitaram Asur (HP Laboratories), Duygu Ucar (Jackson Laboratories)
• Grant acknowledgements: NSF HazardSEES #1520870 and SOCS #IIS-1111118
• Software and references: https://sites.google.com/site/stochasticflowclustering/