clique-based network clustering
TRANSCRIPT
Guang Ouyang
Advisor: Dipak Dey
1
Facebook, LinkedIn, Internet, Instagram, Twitter, Google+, Quora, WeChat, Stack Overflow, ResearchGate
2
Small World: everyone and everything is six or fewer steps away, by way of introduction, from any other person in the world.
Power Law: the degree distribution is a long-tailed power-law distribution.
Community Structure: community groups based on common location, interests, occupations, etc. are quite common in real networks.
3
Detect community structure in large and complex networks.
Community can be viewed as a summary of the whole network, and therefore easy to visualize and analyze.
Communities provide important information for applications such as market segmentation and building recommender systems.
4
Network data is a graph structure made up of ‘nodes’ and ‘links’ that connect them.
Network data tends to have ‘discrete’ similarity matrix.
Most clustering algorithms work on the “continuous” distance or similarity matrix.
Real-world networks are usually very large; even quadratic time or space in the number of nodes is unbearable.
5
Edge list: [(1,2), (1,3), (3,4), (4,5), (5,3), (3,6), (6,1), (7,4), (6,7)].
Adjacency matrix:

   1 2 3 4 5 6 7
1  0 1 1 0 0 1 0
2  1 0 0 0 0 0 0
3  1 0 0 1 1 1 0
4  0 0 1 0 1 0 1
5  0 0 1 1 0 0 0
6  1 0 1 0 0 0 1
7  0 0 0 1 0 1 0
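As a quick illustration, the edge list above can be expanded into the adjacency matrix with a few lines of NumPy (a minimal sketch; the variable names are our own):

```python
import numpy as np

# Edge list from the slide, with nodes labeled 1..7
edges = [(1, 2), (1, 3), (3, 4), (4, 5), (5, 3), (3, 6), (6, 1), (7, 4), (6, 7)]

n = 7
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i - 1, j - 1] = 1   # shift to 0-based indices
    A[j - 1, i - 1] = 1   # links are undirected, so A is symmetric

degrees = A.sum(axis=1)   # row sums give node degrees
m = A.sum() // 2          # total number of links
```

The symmetric 0/1 matrix is the "discrete" similarity matrix mentioned earlier.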
6
No statistically precise definition so far.
Generally speaking, a community is a set of nodes densely connected internally.
Nodes in two different communities are loosely connected.
7
A random network without real clustering structure should not be split (type 1 error of over-splitting).
Two weakly connected communities should not be merged (type 2 error of under-splitting).
Modern network data is usually huge, so space- and time-efficient clustering is needed.
8
Minimum-cut method (spectral clustering)
Hierarchical clustering
Girvan-Newman algorithm (betweenness)
Modularity maximization
Stochastic block model and variants, including the mixed membership model
Finding maximal cliques
9
A measure of the strength of division of a network into clusters or communities:

Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)    (1)

where i and j denote nodes, c denotes the cluster membership vector, A_{ij} is the (i, j) entry of the adjacency matrix A, k_i is the degree of node i, and m is the total number of links in the network.
10
11
Degrees of the 7 nodes: (3, 3, 3, 4, 3, 2, 2).
Total degree: 2m = 20.
The modularity matrix below has (i, j) entry B_{ij} = A_{ij} - k_i k_j / (2m):

       1      2      3      4      5      6      7
1   -0.45   0.55   0.55   0.40  -0.45  -0.30  -0.30
2    0.55  -0.45   0.55   0.40  -0.45  -0.30  -0.30
3    0.55   0.55  -0.45   0.40  -0.45  -0.30  -0.30
4    0.40   0.40   0.40  -0.80   0.40  -0.40  -0.40
5   -0.45  -0.45  -0.45   0.40  -0.45   0.70   0.70
6   -0.30  -0.30  -0.30  -0.40   0.70  -0.20   0.80
7   -0.30  -0.30  -0.30  -0.40   0.70   0.80  -0.20

Nodes 1, 2, 3, 4 tend to form one community and nodes 5, 6, 7 another. The modularity Q of this division is the sum of all within-community cells of the modularity matrix divided by 2m: Q = 0.355.
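The Q = 0.355 figure can be checked numerically. Below is a short sketch that sums the within-community cells of the modularity matrix from this slide (the membership vector and variable names are our own):

```python
import numpy as np

# Modularity matrix from the slide (entries B_ij = A_ij - k_i*k_j/(2m), 2m = 20)
B = np.array([
    [-0.45,  0.55,  0.55,  0.4, -0.45, -0.3, -0.3],
    [ 0.55, -0.45,  0.55,  0.4, -0.45, -0.3, -0.3],
    [ 0.55,  0.55, -0.45,  0.4, -0.45, -0.3, -0.3],
    [ 0.4,   0.4,   0.4,  -0.8,  0.4,  -0.4, -0.4],
    [-0.45, -0.45, -0.45,  0.4, -0.45,  0.7,  0.7],
    [-0.3,  -0.3,  -0.3,  -0.4,  0.7,  -0.2,  0.8],
    [-0.3,  -0.3,  -0.3,  -0.4,  0.7,   0.8, -0.2],
])

c = np.array([0, 0, 0, 0, 1, 1, 1])      # community membership: {1,2,3,4}, {5,6,7}
delta = (c[:, None] == c[None, :])       # Kronecker delta over node pairs
two_m = 20.0
Q = (B * delta).sum() / two_m            # sum of within-community cells / 2m
```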
High modularity implies dense connections inside communities and sparse connections between communities.
Approximate maximization algorithms:
• Greedy algorithms
• Simulated annealing
• Leading eigenvector
• Louvain method
• Ensemble learning (currently fastest)
12
Benchmark model: simulate stochastic block network 1 with built-in cluster structure.
Each cluster has 40 nodes.
Modularity-based clustering on a random network from the stochastic block model.
The modularity maximization approach works well if clusters have similar sizes.
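Such a benchmark network can be simulated directly. The sketch below samples a stochastic block model with three 40-node clusters; the function name and the 0.30/0.05 link probabilities are illustrative, not the slide's actual parameters:

```python
import numpy as np

def sample_sbm(sizes, P, seed=None):
    """Sample an undirected stochastic block model: node pair (i, j) is linked
    with probability P[b_i, b_j], where b_i is the block label of node i."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(sizes)), sizes)   # block label per node
    probs = P[np.ix_(labels, labels)]                  # pairwise link probabilities
    U = rng.random(probs.shape)
    A = (np.triu(U, k=1) < np.triu(probs, k=1)).astype(int)
    return A + A.T, labels                             # symmetrize; diagonal stays 0

# Three built-in clusters of 40 nodes each, dense inside and sparse between
P = np.full((3, 3), 0.05) + np.eye(3) * 0.25           # 0.30 within, 0.05 between
A, labels = sample_sbm([40, 40, 40], P, seed=0)
```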
13
A random network without cluster structure may be split (Erdos-Renyi network).
Small clusters in a large network may be merged (resolution limit).
Multi-resolution methods may not reduce both types of error simultaneously.
This is a bottleneck of many other network clustering algorithms.
14
Erdos Renyi network of 40 nodes, density 0.1
Modularity Maximized Clustering: Q=0.37
15
Stochastic Block Model 2 with
Two small clusters have 20 nodes each, and the largest cluster has 100 nodes.
The largest cluster is split.
Modularity maximization algorithms tend to fail in networks with clusters of very different sizes.
Modularity maximized clustering with Q=0.429
16
Stochastic Block Model 3 with link probability
Cluster sizes: [800, 400, 50, 20]. Modularity method clustering results:
• 7 nodes in cluster 3 are merged with cluster 1.
• All 20 nodes in cluster 4 are merged with cluster 1.
17
Algorithm 1
◦ Global algorithm
◦ Cluster internal link density above a user-defined threshold is guaranteed.
Algorithm 2
◦ Local algorithm
◦ The risk of splitting a cluster is quantified and under user control.
◦ The risk of merging clusters is minimized.
18
Objective Function:

Q_p(c) = \frac{1}{2m} \sum_{i \neq j} (A_{ij} - p) \left( 2\,\delta(c_i, c_j) - 1 \right)    (2)

where p is a user-defined parameter in [0, 1], δ is the Kronecker delta symbol, A is the adjacency matrix, c is the community membership vector, and m is the total link count.

Reward table:
                                        Connected pair   Disconnected pair
Pair of nodes in the same cluster            1 - p             -p
Pair of nodes in different clusters         -1 + p              p
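One direct reading of the reward table is a per-pair score (A_ij - p) that is kept for within-cluster pairs and negated for between-cluster pairs. The sketch below implements that reading; the normalization by 2m is our assumption, and the function name is hypothetical:

```python
import numpy as np

def clique_objective(A, c, p):
    """Sum the reward table over all ordered pairs of distinct nodes, / (2m).

    Same cluster:      connected -> 1 - p,   disconnected -> -p
    Different cluster: connected -> -1 + p,  disconnected ->  p
    Each pair (i, j) therefore contributes (A_ij - p) * (2*delta(c_i, c_j) - 1).
    """
    A = np.asarray(A, dtype=float)
    c = np.asarray(c)
    m = A.sum() / 2.0                                   # total link count
    delta = (c[:, None] == c[None, :]).astype(float)    # Kronecker delta table
    contrib = (A - p) * (2 * delta - 1)
    np.fill_diagonal(contrib, 0.0)                      # only distinct pairs count
    return contrib.sum() / (2 * m)
```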
19
It is guaranteed that every found community has internal link density higher than the user-defined threshold p.
◦ If p = 1, every found community is a clique.
◦ If p = 25%, every found community has internal link density higher than 25%.
◦ Communities with link density "significantly" higher than p will not be split.
◦ Communities with link density lower than p will definitely be split into smaller communities.
20
Maximize objective function (2):

\max_{s} \; s^{T} B_p \, s    (3)

where s is an n-by-1 vector of community membership with binary entries +1 or -1, A is the adjacency matrix, J is the all-ones matrix, and I is the identity matrix.
Searching over all possible divisions is NP-hard.
Approximate spectral method:
◦ Find the largest eigenvalue w of the p-clique matrix

B_p = A - p\,(J - I)    (4)

◦ Choose a corresponding eigenvector v of w.
◦ Use the sign of v to split the network of n nodes.
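This spectral step can be sketched with a dense matrix for small networks. The example graph below is our own: two triangles joined by a single link, which a threshold p = 0.5 should separate, since the within-triangle density (1) is above p and the between-triangle density (1/9) is below it:

```python
import numpy as np

def p_clique_bipartition(A, p):
    """Split nodes by the sign of a leading eigenvector of B_p = A - p(J - I)."""
    n = A.shape[0]
    B_p = A - p * (np.ones((n, n)) - np.eye(n))
    w, V = np.linalg.eigh(B_p)        # symmetric matrix: eigenvalues ascending
    v = V[:, -1]                      # eigenvector of the largest eigenvalue w
    return np.where(v >= 0, 1, -1)    # membership vector s with entries +/-1

# Two triangles {0,1,2} and {3,4,5} joined by the single link (2, 3)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

s = p_clique_bipartition(A, p=0.5)
```

The overall sign of an eigenvector is arbitrary, so only the grouping of nodes, not which group is labeled +1, is meaningful.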
21
sign(v) is the best approximate solution to (3).
If the primary bipartition criterion holds, the division by v will be executed.
If the primary criterion fails but the secondary criterion holds, the division by v will still be executed.
If both criteria fail, the division by v will be cancelled.
22
Python (SciPy) wrapper of the ARPACK software.
Iterative matrix-vector products find eigenvalues of large sparse or structured matrices.
B_p = A - p(J - I) is dense but structured, so a matrix-vector product requires much less than the usual O(n^2) operations:
◦ The adjacency matrix A is usually sparse.
◦ Av requires only O(m) operations.
◦ (J - I)v requires only O(n) operations.
◦ Time complexity: O(m + n) per iteration.
◦ Space complexity: O(m + n) (applicable to huge graphs).
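The structured matrix-vector product can be sketched with SciPy's `LinearOperator`, so that `eigsh` (SciPy's ARPACK wrapper) only ever touches B_p through cheap matvecs and the dense matrix is never formed (the helper name is our own):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, eigsh

def leading_eigenpair(A_sparse, p):
    """Leading eigenpair of B_p = A - p(J - I) without forming the dense matrix.

    B_p v = A v - p * sum(v) * 1 + p * v, so one matvec costs O(m + n).
    """
    n = A_sparse.shape[0]

    def matvec(v):
        v = np.ravel(v)
        return A_sparse @ v - p * v.sum() * np.ones(n) + p * v

    op = LinearOperator((n, n), matvec=matvec, dtype=float)
    w, V = eigsh(op, k=1, which='LA')   # largest algebraic eigenvalue via ARPACK
    return w[0], V[:, 0]

# Same toy graph as before: two triangles joined by one link
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

w, v = leading_eigenpair(csr_matrix(A), p=0.5)
```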
23
It is usually hard to tell how many communities there are in a large network.
First split the network into two parts, then divide these two parts, and so forth.
Use the bipartition criteria in slide 21 as the stopping criterion of this recursive dividing procedure.
24
Clustering results at p = 0.1, p = 0.05, and p = 0.02.
Stochastic Block Network 2 with
Two small clusters have 20 nodes each, and the largest cluster has 100 nodes. Expected link density: 0.1125.
25
Zachary's Karate Club data (34 members). Link density: 0.139.
p = 0.1 p = 0.15
26
Doubtful Sound dolphin network (62 dolphins). Link density: 0.084.
p = 0.03 p = 0.2
27
Increasing p: zoom in.
◦ Smaller communities are found.
◦ Risk of merging clusters (type 2 error) is lower.
◦ Risk of splitting a cluster / Erdos-Renyi sub-network (type 1 error) is higher.
Decreasing p: zoom out.
◦ Larger communities are found.
◦ Risk of merging clusters (type 2 error) is higher.
◦ Risk of splitting a cluster / Erdos-Renyi sub-network (type 1 error) is lower.
28
Objective: choose parameter p such that at most 2.5% of nodes in an Erdos-Renyi sub-network will be trimmed off.
Cause of type 1 error:
◦ Due to random fluctuation in link formation, 2.5% of nodes have fewer than 0.975·np links with the remaining 97.5% of nodes.
◦ The threshold p is higher than the link density between the 2.5% group and the 97.5% group of nodes.
Strategy:
◦ Choose p to be significantly smaller than the observed total link density.
29
Solution: choose p according to (5).
Intuition:
◦ Use a truncated normal distribution to approximate the distribution of the link density between the 2.5% group and the 97.5% group.
Experiment results:
◦ In 100 SBM networks, the type 1 error is bounded by 5% (mostly around 3.5%).
◦ In SBM networks with average degree less than 5, the type 1 error is less than 2%.
(5)
30
When will two clusters of given sizes and given link probability be merged?
The risk of type 2 error will be bounded by 2.5% if condition (6), which involves the observed link density, holds.
(6)
31
Challenge:
◦ In splitting a sub-network, we usually do not know the link density within or between two clusters.
◦ In theory, there may be cases where inequalities (5) and (6) conflict.
Solution:
◦ Choose p to be the upper bound in (5).
◦ Develop a more flexible algorithm which allows p to vary from one sub-network to another. This may reduce the chance of a conflict between inequalities (5) and (6).
32
A measure of consistency between found communities R and real communities F.
where I is the Kullback-Leibler divergence, H is the entropy, N is the diffusion matrix, and the two remaining quantities are the numbers of real and found communities.
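One common normalization of mutual information can be sketched as below; the slides may use a slightly different normalization, so treat this as an illustrative variant (the function name is ours):

```python
import numpy as np

def nmi(found, real):
    """Normalized mutual information NMI = 2 I(R; F) / (H(R) + H(F))."""
    found, real = np.asarray(found), np.asarray(real)
    fs, rs = np.unique(found), np.unique(real)
    # Joint distribution of (found label, real label): the "diffusion" matrix
    P = np.array([[np.mean((found == f) & (real == r)) for r in rs] for f in fs])
    pf, pr = P.sum(axis=1), P.sum(axis=0)              # marginal distributions
    mask = P > 0
    I = (P[mask] * np.log(P[mask] / np.outer(pf, pr)[mask])).sum()
    Hf = -(pf[pf > 0] * np.log(pf[pf > 0])).sum()      # entropy of found labels
    Hr = -(pr[pr > 0] * np.log(pr[pr > 0])).sum()      # entropy of real labels
    return 1.0 if Hf + Hr == 0 else 2 * I / (Hf + Hr)
```

NMI is 1 when the two labelings agree up to relabeling and 0 when they are independent.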
33
(7)
Review stochastic block networks 1 through 3 using NMI.
Results:
◦ Type 1 error is overly controlled for small and sparse networks such as SBM 1.
34
         Size    Link density    Auto-chosen p    Average NMI     s.e.    Simulations
SBM 1     120       0.0723           0.0239          0.8484      0.0195       100
SBM 2     140       0.1125           0.0579          0.9483      0.0078       100
SBM 3    1270       0.0722           0.0574          0.9993      0.0001       100
35
Panels: modularity clustering vs. clustering at p = 0.0888.
Stochastic Block model 4 with
Cluster sizes: (100, 20, 20). Expected link density: 0.1507. Auto-chosen parameter p from (5): 0.0888.
Using the auto-chosen parameter p ends up merging small clusters 2 and 3.
Clusters 2 and 3 will be divided only if we zoom in further by increasing p.
The modularity method not only merged clusters 2 and 3, but also split cluster 1.
36
Binary bipartition tree: S0 splits into S1 and S2; S1 into C1 and S3; S2 into C2 and C3; S3 into C4 and C5. Each sub-network S uses its own local threshold p(S).
The observed link density and node count in sub-network S determine the local threshold p(S).
Maximize the localized clique-index
where T is the binary tree representing the hierarchical clustering process, p(S) is the automatically chosen local threshold parameter for sub-network S, and an indicator records whether nodes i and j are divided in the bipartition of S.
37
(8)
Every bipartition of sub-network S brings the contribution:
The best bipartition is obtained from the sign of the leading eigenvector of the matrix:
The bipartition of S will be cancelled if its contribution is not positive.
38
(9)
(10)
Each matrix-vector product takes time O(m).
Finding the leading eigenvector takes O(n) matrix-vector products.
On average, the height of the binary tree representing the hierarchical clustering procedure is O(log(n)).
For both the global and the localized algorithm, the time complexity is O(m n log(n)), or O(n^2 log(n)) for a sparse network.
39
Stochastic Block model 4 with
Cluster sizes: (100, 20, 20). Average NMI over 100 simulations: 0.9717. The localized clustering algorithm is able to detect the built-in community structure.
40
Stochastic Block Model 5 with 7000 nodes and 10 built-in clusters
Cluster sizes with internal link density: [(3000,0.08), (2000, 0.09), (1000, 0.1), (400,0.15),
(200,0.2), (100, 0.25), (100, 0.25), (100, 0.25), (80, 0.3), (20, 0.7)]
Link density between different clusters: 0.005. Average NMI over 20 simulations: 0.9895. Average running time: 1.66 seconds.
41
Stochastic Block Model 6 with 20000 nodes and 25 clusters.
Cluster sizes with internal link density: [(3350, 0.045), (3000, 0.05), (2000, 0.07), (2000, 0.07), (2000, 0.07), (1000, 0.09), (1000, 0.09), (1000, 0.09), (1000, 0.09), (500, 0.12), (500, 0.12), (400, 0.14), (400, 0.14), (400, 0.14), (400, 0.14), (200, 0.30), (200, 0.30), (200, 0.30), (100, 0.40), (100, 0.40), (50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80)]
Link density between clusters: 0.0001
Average NMI over 10 simulations: 0.8960
Average running time: 12.6 seconds
42
Review of SBM networks 1 through 6:
Clustering quality is high for large networks or networks with high link density.
43
         Built-in clusters    Size    Link density    Average NMI     s.e.    Simulations
SBM 1            3             120       0.0723          0.8972      0.0195       100
SBM 2            3             140       0.1125          0.9476      0.0051       100
SBM 3            4            1200       0.0722          0.9687      0.0028       100
SBM 4            3             140       0.0888          0.9717      0.0033       100
SBM 5           10            7000       0.0285          0.9895      0.0022        20
SBM 6           25           20000       0.005           0.8960      0.0029        10
Global Algorithm:
◦ Good for applications with specific requirements on the internal link density of every found community.
Localized Algorithm:
◦ Good for finding statistically significant communities.
◦ Type 1 error seems to be overly controlled for sparse networks.
◦ The conflict between type 1 and type 2 errors is effectively avoided in the simulated sample networks.
44
The Erdos-Renyi model may not serve as a good null model of a random network without built-in community structure. Statistically significant communities under other null models need consideration.
Extend the algorithm to directed networks, networks with numerical values in the adjacency matrix, and networks with additional profile information on each node.
Develop a close-to-linear-time clustering algorithm.
45