clique-based network clustering
TRANSCRIPT
Guang Ouyang
Advisor: Dipak Dey
1
Facebook, LinkedIn, Internet, Instagram, Twitter, Google+, Quora, WeChat, Stack Overflow, ResearchGate
2
Small World: everyone and everything is six or fewer steps away, by way of introduction, from any other person in the world.
Power Law: the degree distribution is a long-tailed power-law distribution.
Community Structure: community groups based on common location, interests, occupations, etc. are quite common in real networks.
3
Detect community structure in large and complex networks.
Community can be viewed as a summary of the whole network, and therefore easy to visualize and analyze.
Communities provide important information for applications such as market segmentation and building recommender systems.
4
Network data is a graph structure made up of ‘nodes’ and ‘links’ that connect them.
Network data tends to have ‘discrete’ similarity matrix.
Most clustering algorithms work on the “continuous” distance or similarity matrix.
Real-world networks are usually very large; even quadratic time or space in the number of nodes is unbearable.
5
Edge list: [(1,2), (1,3), (3,4), (4,5), (5,3), (3,6), (6,1), (7,4), (6,7)].
Adjacency matrix:

   1 2 3 4 5 6 7
1  0 1 1 0 0 1 0
2  1 0 0 0 0 0 0
3  1 0 0 1 1 1 0
4  0 0 1 0 1 0 1
5  0 0 1 1 0 0 0
6  1 0 1 0 0 0 1
7  0 0 0 1 0 1 0
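As a quick illustration, the edge list above can be expanded into the adjacency matrix with a few lines of NumPy (a minimal sketch; the variable names are our own):

```python
import numpy as np

# Edge list from the slide, with nodes labeled 1..7
edges = [(1, 2), (1, 3), (3, 4), (4, 5), (5, 3), (3, 6), (6, 1), (7, 4), (6, 7)]

n = 7
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i - 1, j - 1] = 1   # shift to 0-based indices
    A[j - 1, i - 1] = 1   # links are undirected, so A is symmetric

degrees = A.sum(axis=1)   # row sums give node degrees
m = A.sum() // 2          # total number of links
```

The symmetric 0/1 matrix is the "discrete" similarity matrix mentioned earlier.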
6
No statistically precise definition so far.
Generally speaking, a community is a set of nodes densely connected internally.
Nodes in two different communities are loosely connected.
7
A random network without real clustering structure should not be split (type 1 error of over-splitting).
Two weakly connected communities should not be merged (type 2 error of under-splitting).
Modern network data is usually huge, so space- and time-efficient clustering is needed.
8
Minimum-cut method (spectral clustering)
Hierarchical clustering
Girvan-Newman algorithm (betweenness)
Modularity maximization
Stochastic block model and variants, including the mixed membership model
Finding maximal cliques
9
A measure of the strength of division of a network into clusters or communities:

Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)    (1)

where i and j denote nodes, c denotes the cluster membership vector, A_{ij} is the (i, j) entry of the adjacency matrix A, k_i is the degree of node i, and m is the total number of links in the network.
10
11
Degrees of the 7 nodes: (3, 3, 3, 4, 3, 2, 2).
Total degree: 2m = 20.
The modularity matrix below has (i, j) entry B_{ij} = A_{ij} - k_i k_j / (2m):

       1      2      3      4      5      6      7
1   -0.45   0.55   0.55   0.40  -0.45  -0.30  -0.30
2    0.55  -0.45   0.55   0.40  -0.45  -0.30  -0.30
3    0.55   0.55  -0.45   0.40  -0.45  -0.30  -0.30
4    0.40   0.40   0.40  -0.80   0.40  -0.40  -0.40
5   -0.45  -0.45  -0.45   0.40  -0.45   0.70   0.70
6   -0.30  -0.30  -0.30  -0.40   0.70  -0.20   0.80
7   -0.30  -0.30  -0.30  -0.40   0.70   0.80  -0.20

Nodes 1, 2, 3, 4 tend to form one community and nodes 5, 6, 7 another. The modularity Q of this division is the sum of all within-community cells of the modularity matrix divided by 2m: Q = 0.355.
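The Q = 0.355 figure can be checked numerically. Below is a short sketch that sums the within-community cells of the modularity matrix from this slide (the membership vector and variable names are our own):

```python
import numpy as np

# Modularity matrix from the slide (entries B_ij = A_ij - k_i*k_j/(2m), 2m = 20)
B = np.array([
    [-0.45,  0.55,  0.55,  0.4, -0.45, -0.3, -0.3],
    [ 0.55, -0.45,  0.55,  0.4, -0.45, -0.3, -0.3],
    [ 0.55,  0.55, -0.45,  0.4, -0.45, -0.3, -0.3],
    [ 0.4,   0.4,   0.4,  -0.8,  0.4,  -0.4, -0.4],
    [-0.45, -0.45, -0.45,  0.4, -0.45,  0.7,  0.7],
    [-0.3,  -0.3,  -0.3,  -0.4,  0.7,  -0.2,  0.8],
    [-0.3,  -0.3,  -0.3,  -0.4,  0.7,   0.8, -0.2],
])

c = np.array([0, 0, 0, 0, 1, 1, 1])      # community membership: {1,2,3,4}, {5,6,7}
delta = (c[:, None] == c[None, :])       # Kronecker delta over node pairs
two_m = 20.0
Q = (B * delta).sum() / two_m            # sum of within-community cells / 2m
```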
High modularity implies dense connections inside communities and sparse connections between communities.
Approximate maximization algorithms:
• Greedy algorithms
• Simulated annealing
• Leading eigenvector
• Louvain method
• Ensemble learning (currently fastest)
12
Benchmark model: simulate stochastic block network 1 with built-in cluster structure.
Each cluster has 40 nodes.
Modularity-based clustering on a random network from the stochastic block model.
The modularity maximization approach works well if clusters have similar sizes.
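Such a benchmark network can be simulated directly. The sketch below samples a stochastic block model with three 40-node clusters; the function name and the 0.30/0.05 link probabilities are illustrative, not the slide's actual parameters:

```python
import numpy as np

def sample_sbm(sizes, P, seed=None):
    """Sample an undirected stochastic block model: node pair (i, j) is linked
    with probability P[b_i, b_j], where b_i is the block label of node i."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(sizes)), sizes)   # block label per node
    probs = P[np.ix_(labels, labels)]                  # pairwise link probabilities
    U = rng.random(probs.shape)
    A = (np.triu(U, k=1) < np.triu(probs, k=1)).astype(int)
    return A + A.T, labels                             # symmetrize; diagonal stays 0

# Three built-in clusters of 40 nodes each, dense inside and sparse between
P = np.full((3, 3), 0.05) + np.eye(3) * 0.25           # 0.30 within, 0.05 between
A, labels = sample_sbm([40, 40, 40], P, seed=0)
```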
13
A random network without cluster structure may be split (Erdos-Renyi network).
Small clusters in a large network may be merged (resolution limit).
Multi-resolution methods may not reduce both types of error simultaneously.
This is a bottleneck of many other network clustering algorithms.
14
Erdos Renyi network of 40 nodes, density 0.1
Modularity Maximized Clustering: Q=0.37
15
Stochastic Block Model 2 with
Two small clusters have 20 nodes each, and the largest cluster has 100 nodes.
The largest cluster is split.
Modularity maximization algorithms tend to fail in networks with clusters of very different sizes.
Modularity maximized clustering with Q=0.429
16
Stochastic Block Model 3 with link probability
Cluster sizes: [800, 400, 50, 20]. Modularity method clustering results:
• 7 nodes in cluster 3 are merged with cluster 1.
• All 20 nodes in cluster 4 are merged with cluster 1.
17
Algorithm 1
◦ Global algorithm
◦ Cluster internal link density above a user-defined threshold is guaranteed.
Algorithm 2
◦ Local algorithm
◦ The risk of splitting a cluster is quantified and under user control.
◦ The risk of merging clusters is minimized.
18
Objective Function:

Q_p(c) = \frac{1}{2m} \sum_{i \neq j} (A_{ij} - p) \left( 2\,\delta(c_i, c_j) - 1 \right)    (2)

where p is a user-defined parameter in [0, 1], δ is the Kronecker delta symbol, A is the adjacency matrix, c is the community membership vector, and m is the total link count.

Reward table:
                                        Connected pair   Disconnected pair
Pair of nodes in the same cluster            1 - p             -p
Pair of nodes in different clusters         -1 + p              p
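One direct reading of the reward table is a per-pair score (A_ij - p) that is kept for within-cluster pairs and negated for between-cluster pairs. The sketch below implements that reading; the normalization by 2m is our assumption, and the function name is hypothetical:

```python
import numpy as np

def clique_objective(A, c, p):
    """Sum the reward table over all ordered pairs of distinct nodes, / (2m).

    Same cluster:      connected -> 1 - p,   disconnected -> -p
    Different cluster: connected -> -1 + p,  disconnected ->  p
    Each pair (i, j) therefore contributes (A_ij - p) * (2*delta(c_i, c_j) - 1).
    """
    A = np.asarray(A, dtype=float)
    c = np.asarray(c)
    m = A.sum() / 2.0                                   # total link count
    delta = (c[:, None] == c[None, :]).astype(float)    # Kronecker delta table
    contrib = (A - p) * (2 * delta - 1)
    np.fill_diagonal(contrib, 0.0)                      # only distinct pairs count
    return contrib.sum() / (2 * m)
```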
19
It is guaranteed that every found community has internal link density higher than the user-defined threshold p.
◦ If p = 1, every found community is a clique.
◦ If p = 25%, every found community has internal link density higher than 25%.
◦ Communities with link density "significantly" higher than p will not be split.
◦ Communities with link density lower than p will definitely be split into smaller communities.
20
Maximize objective function (2):

\max_{s} \; s^{T} B_p \, s    (3)

where s is an n-by-1 vector of community membership with binary entries +1 or -1, A is the adjacency matrix, J is the all-ones matrix, and I is the identity matrix.
Searching over all possible divisions is NP-hard.
Approximate spectral method:
◦ Find the largest eigenvalue w of the p-clique matrix

B_p = A - p\,(J - I)    (4)

◦ Choose a corresponding eigenvector v of w.
◦ Use the sign of v to split the network of n nodes.
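This spectral step can be sketched with a dense matrix for small networks. The example graph below is our own: two triangles joined by a single link, which a threshold p = 0.5 should separate, since the within-triangle density (1) is above p and the between-triangle density (1/9) is below it:

```python
import numpy as np

def p_clique_bipartition(A, p):
    """Split nodes by the sign of a leading eigenvector of B_p = A - p(J - I)."""
    n = A.shape[0]
    B_p = A - p * (np.ones((n, n)) - np.eye(n))
    w, V = np.linalg.eigh(B_p)        # symmetric matrix: eigenvalues ascending
    v = V[:, -1]                      # eigenvector of the largest eigenvalue w
    return np.where(v >= 0, 1, -1)    # membership vector s with entries +/-1

# Two triangles {0,1,2} and {3,4,5} joined by the single link (2, 3)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

s = p_clique_bipartition(A, p=0.5)
```

The overall sign of an eigenvector is arbitrary, so only the grouping of nodes, not which group is labeled +1, is meaningful.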
21
sign(v) is the best approximate solution to (3).
If the primary bipartition criterion holds, the division by v will be executed.
If the primary criterion fails but the secondary criterion holds, the division by v will still be executed.
If both criteria fail, the division by v will be cancelled.
22
Python (SciPy) wrapper of the ARPACK software.
Iterative matrix-vector products find eigenvalues of large sparse or structured matrices.
B_p = A - p(J - I) is dense but structured, so a matrix-vector product requires much less than the usual O(n^2) operations:
◦ The adjacency matrix A is usually sparse.
◦ Av requires only O(m) operations.
◦ (J - I)v requires only O(n) operations.
◦ Time complexity: O(m + n) per iteration.
◦ Space complexity: O(m + n) (applicable to huge graphs).
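The structured matrix-vector product can be sketched with SciPy's `LinearOperator`, so that `eigsh` (SciPy's ARPACK wrapper) only ever touches B_p through cheap matvecs and the dense matrix is never formed (the helper name is our own):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, eigsh

def leading_eigenpair(A_sparse, p):
    """Leading eigenpair of B_p = A - p(J - I) without forming the dense matrix.

    B_p v = A v - p * sum(v) * 1 + p * v, so one matvec costs O(m + n).
    """
    n = A_sparse.shape[0]

    def matvec(v):
        v = np.ravel(v)
        return A_sparse @ v - p * v.sum() * np.ones(n) + p * v

    op = LinearOperator((n, n), matvec=matvec, dtype=float)
    w, V = eigsh(op, k=1, which='LA')   # largest algebraic eigenvalue via ARPACK
    return w[0], V[:, 0]

# Same toy graph as before: two triangles joined by one link
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

w, v = leading_eigenpair(csr_matrix(A), p=0.5)
```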
23
It is usually hard to tell how many communities there are in a large network.
First split the network into two parts, then divide these two parts, and so forth.
Use the bipartition criteria in slide 21 as the stopping criterion of this recursive dividing procedure.
24
Clustering results at p = 0.1, p = 0.05, and p = 0.02.
Stochastic Block Network 2 with
Two small clusters have 20 nodes each, and the largest cluster has 100 nodes. Expected link density: 0.1125.
25
Zachary's Karate Club data (34 members). Link density: 0.139.
p = 0.1 p = 0.15
26
Doubtful Sound dolphin network (62 dolphins). Link density: 0.084.
p = 0.03 p = 0.2
27
Increasing p: zoom in.
◦ Smaller communities are found.
◦ Risk of merging clusters (type 2 error) is lower.
◦ Risk of splitting a cluster / Erdos-Renyi sub-network (type 1 error) is higher.
Decreasing p: zoom out.
◦ Larger communities are found.
◦ Risk of merging clusters (type 2 error) is higher.
◦ Risk of splitting a cluster / Erdos-Renyi sub-network (type 1 error) is lower.
28
Objective: choose parameter p such that at most 2.5% of nodes in an Erdos-Renyi sub-network will be trimmed off.
Cause of type 1 error:
◦ Due to random fluctuation in link formation, 2.5% of nodes have fewer than 0.975·np links with the remaining 97.5% of nodes.
◦ The threshold p is higher than the link density between the 2.5% group and the 97.5% group of nodes.
Strategy:
◦ Choose p to be significantly smaller than the observed total link density.
29
Solution: choose p according to (5).
Intuition:
◦ Use a truncated normal distribution to approximate the distribution of the link density between the 2.5% group and the 97.5% group.
Experiment results:
◦ In 100 SBM networks, the type 1 error is bounded by 5% (mostly around 3.5%).
◦ In SBM networks with average degree less than 5, the type 1 error is less than 2%.
(5)
30
When will two clusters of given sizes and given link probability be merged?
The risk of type 2 error will be bounded by 2.5% if condition (6), which involves the observed link density, holds.
(6)
31
Challenge:
◦ In splitting a sub-network, we usually do not know the link density within or between two clusters.
◦ In theory, there may be cases where inequalities (5) and (6) conflict.
Solution:
◦ Choose p to be the upper bound in (5).
◦ Develop a more flexible algorithm which allows p to vary from one sub-network to another. This may reduce the chance of a conflict between inequalities (5) and (6).
32
A measure of consistency between found communities R and real communities F.
where I is the Kullback-Leibler divergence, H is the entropy, N is the diffusion matrix, and the two remaining quantities are the numbers of real and found communities.
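One common normalization of mutual information can be sketched as below; the slides may use a slightly different normalization, so treat this as an illustrative variant (the function name is ours):

```python
import numpy as np

def nmi(found, real):
    """Normalized mutual information NMI = 2 I(R; F) / (H(R) + H(F))."""
    found, real = np.asarray(found), np.asarray(real)
    fs, rs = np.unique(found), np.unique(real)
    # Joint distribution of (found label, real label): the "diffusion" matrix
    P = np.array([[np.mean((found == f) & (real == r)) for r in rs] for f in fs])
    pf, pr = P.sum(axis=1), P.sum(axis=0)              # marginal distributions
    mask = P > 0
    I = (P[mask] * np.log(P[mask] / np.outer(pf, pr)[mask])).sum()
    Hf = -(pf[pf > 0] * np.log(pf[pf > 0])).sum()      # entropy of found labels
    Hr = -(pr[pr > 0] * np.log(pr[pr > 0])).sum()      # entropy of real labels
    return 1.0 if Hf + Hr == 0 else 2 * I / (Hf + Hr)
```

NMI is 1 when the two labelings agree up to relabeling and 0 when they are independent.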
33
(7)
Review stochastic block networks 1 through 3 using NMI.
Results:
◦ Type 1 error is overly controlled for small and sparse networks such as SBM 1.
34
         Size    Link density    Auto-chosen p    Average NMI     s.e.    Simulations
SBM 1     120       0.0723           0.0239          0.8484      0.0195       100
SBM 2     140       0.1125           0.0579          0.9483      0.0078       100
SBM 3    1270       0.0722           0.0574          0.9993      0.0001       100
35
Panels: modularity clustering vs. clustering at p = 0.0888.
Stochastic Block model 4 with
Cluster sizes: (100, 20, 20). Expected link density: 0.1507. Auto-chosen parameter p from (5): 0.0888.
Using the auto-chosen parameter p ends up merging small clusters 2 and 3.
Clusters 2 and 3 will be divided only if we zoom in further by increasing p.
The modularity method not only merged clusters 2 and 3, but also split cluster 1.
36
Binary bipartition tree: S0 splits into S1 and S2; S1 into C1 and S3; S2 into C2 and C3; S3 into C4 and C5. Each sub-network S uses its own local threshold p(S).
The observed link density and node count in sub-network S determine the local threshold p(S).
Maximize the localized clique-index
where T is the binary tree representing the hierarchical clustering process, p(S) is the automatically chosen local threshold parameter for sub-network S, and an indicator records whether nodes i and j are divided in the bipartition of S.
37
(8)
Every bipartition of sub-network S brings the contribution:
The best bipartition is obtained from the sign of the leading eigenvector of the matrix:
The bipartition of S will be cancelled if its contribution is not positive.
38
(9)
(10)
Each matrix-vector product takes time O(m).
Finding the leading eigenvector takes O(n) matrix-vector products.
On average, the height of the binary tree representing the hierarchical clustering procedure is O(log(n)).
For both the global and the localized algorithm, the time complexity is O(m n log(n)), or O(n^2 log(n)) for a sparse network.
39
Stochastic Block model 4 with
Cluster sizes: (100, 20, 20). Average NMI over 100 simulations: 0.9717. The localized clustering algorithm is able to detect the built-in community structure.
40
Stochastic Block Model 5 with 7000 nodes and 10 built-in clusters
Cluster sizes with internal link density: [(3000,0.08), (2000, 0.09), (1000, 0.1), (400,0.15),
(200,0.2), (100, 0.25), (100, 0.25), (100, 0.25), (80, 0.3), (20, 0.7)]
Link density between different clusters: 0.005. Average NMI over 20 simulations: 0.9895. Average running time: 1.66 seconds.
41
Stochastic Block Model 6 with 20000 nodes and 25 clusters.
Cluster sizes with internal link density: [(3350, 0.045), (3000, 0.05), (2000, 0.07), (2000, 0.07), (2000, 0.07), (1000, 0.09), (1000, 0.09), (1000, 0.09), (1000, 0.09), (500, 0.12), (500, 0.12), (400, 0.14), (400, 0.14), (400, 0.14), (400, 0.14), (200, 0.30), (200, 0.30), (200, 0.30), (100, 0.40), (100, 0.40), (50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80)]
Link density between clusters: 0.0001
Average NMI over 10 simulations: 0.8960
Average running time: 12.6 seconds
42
Review of SBM networks 1 through 6:
Clustering quality is high for large networks or networks with high link density.
43
         Built-in clusters    Size    Link density    Average NMI     s.e.    Simulations
SBM 1            3             120       0.0723          0.8972      0.0195       100
SBM 2            3             140       0.1125          0.9476      0.0051       100
SBM 3            4            1200       0.0722          0.9687      0.0028       100
SBM 4            3             140       0.0888          0.9717      0.0033       100
SBM 5           10            7000       0.0285          0.9895      0.0022        20
SBM 6           25           20000       0.005           0.8960      0.0029        10
Global Algorithm:
◦ Good for applications with specific requirements on the internal link density of every found community.
Localized Algorithm:
◦ Good for finding statistically significant communities.
◦ Type 1 error seems to be overly controlled for sparse networks.
◦ The conflict between type 1 and type 2 errors is effectively avoided in the simulated sample networks.
44
The Erdos-Renyi model may not serve as a good null model of a random network without built-in community structure. Statistically significant communities under other null models need consideration.
Extend the algorithm to directed networks, networks with numerical values in the adjacency matrix, and networks with additional profile information on each node.
Develop a close-to-linear-time clustering algorithm.
45