clique-based network clustering

Guang Ouyang, Advisor: Dipak Dey

Upload: guang-ouyang

Post on 06-Aug-2015


Page 1: Clique-based Network Clustering

Guang Ouyang
Advisor: Dipak Dey

Page 2: Clique-based Network Clustering

Facebook, LinkedIn, Internet, Instagram, Tweets, Google+, Quora, WeChat, Stack Overflow, ResearchGate

Page 3: Clique-based Network Clustering

Small world: everyone and everything is six or fewer steps away, by way of introduction, from any other person in the world.

Power law: the degree distribution follows a long-tailed power law.

Community structure: community groups based on common location, interests, occupations, etc. are quite common in real networks.

Page 4: Clique-based Network Clustering

Detect community structure in large and complex networks.

Communities can be viewed as a summary of the whole network, which makes the network easier to visualize and analyze.

Communities provide important information for applications such as market segmentation and building recommender systems.

Page 5: Clique-based Network Clustering

Network data is a graph structure made up of ‘nodes’ and the ‘links’ that connect them.

Network data tends to have a ‘discrete’ similarity matrix.

Most clustering algorithms work on a “continuous” distance or similarity matrix.

Real-world networks are usually very large. Even O(n^2) is unbearable for efficiency or space.

Page 6: Clique-based Network Clustering

Edge list:[(1,2),(1,3),(3,4),(4,5),(5,3), (3,6),(6,1), (7,4), (6,7)].

Adjacency matrix:
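The adjacency matrix itself was an image on the slide, but it follows directly from the edge list. A minimal sketch, assuming the links are undirected and the nodes are labelled 1 to 7 (the function name `adjacency_matrix` is only illustrative):

```python
# Edge list from the slide (1-based node labels, undirected links).
edges = [(1, 2), (1, 3), (3, 4), (4, 5), (5, 3),
         (3, 6), (6, 1), (7, 4), (6, 7)]

def adjacency_matrix(edges, n):
    """Return the symmetric n x n 0/1 adjacency matrix as a list of rows."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u - 1][v - 1] = 1  # shift to 0-based indices
        A[v - 1][u - 1] = 1  # undirected link: mirror the entry
    return A

A = adjacency_matrix(edges, n=7)
for row in A:
    print(" ".join(str(x) for x in row))
```

Each row sum is the degree of the corresponding node, and the sum of all entries is twice the number of links.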


Page 7: Clique-based Network Clustering

No statistically precise definition so far.

Generally speaking, a community is a set of nodes densely connected internally.

Nodes in two different communities are loosely connected.

Page 8: Clique-based Network Clustering

A random network without real clustering structure should not be split (type 1 error of over-splitting).

Two weakly connected communities should not be merged (type 2 error of under-splitting).

Modern network data is usually huge, so space- and time-efficient clustering is needed.

Page 9: Clique-based Network Clustering

Minimum-cut method (spectral clustering)
Hierarchical clustering
Girvan-Newman algorithm (betweenness)
Modularity maximization
Stochastic block model and its variants, including the mixed membership model
Finding maximal cliques

Page 10: Clique-based Network Clustering

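Modularity: a measure of the strength of a division of a network into clusters or communities.

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \qquad (1)$$

where i and j denote nodes, $c_i$ denotes the cluster of node i, $A_{ij}$ is the (i, j) entry of the adjacency matrix A, $k_i$ is the degree of node i, m is the total number of links in the network, and δ is the Kronecker delta.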

Page 11: Clique-based Network Clustering

Degrees of the 7 nodes are: (k_1, ..., k_7) = (3, 3, 3, 4, 3, 2, 2)

Total degree: 2m = 20

The modularity matrix below has (i, j) entry: $B_{ij} = A_{ij} - \frac{k_i k_j}{2m}$

i \ j     1      2      3      4      5      6      7
1       -0.45   0.55   0.55   0.4   -0.45  -0.3   -0.3
2        0.55  -0.45   0.55   0.4   -0.45  -0.3   -0.3
3        0.55   0.55  -0.45   0.4   -0.45  -0.3   -0.3
4        0.4    0.4    0.4   -0.8    0.4   -0.4   -0.4
5       -0.45  -0.45  -0.45   0.4   -0.45   0.7    0.7
6       -0.3   -0.3   -0.3   -0.4    0.7   -0.2    0.8
7       -0.3   -0.3   -0.3   -0.4    0.7    0.8   -0.2

Nodes 1, 2, 3, 4 tend to form one community and nodes 5, 6, 7 another. The modularity Q of this division is the sum of the within-community cells of the modularity matrix (the two diagonal blocks, shaded green on the slide) divided by 2m: Q = 7.1 / 20 = 0.355
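The table can be checked with a few lines of code. The sketch below assumes the 10-link network implied by the table itself (an entry B_ij is a link exactly when B_ij + k_i k_j / 2m equals 1); that link list is inferred, not copied from the slide.

```python
# Links inferred from the modularity matrix (1-based node labels).
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7)]
n = 7
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u - 1][v - 1] = A[v - 1][u - 1] = 1

k = [sum(row) for row in A]   # node degrees k_i
two_m = sum(k)                # total degree 2m

# Modularity matrix B_ij = A_ij - k_i * k_j / (2m).
B = [[A[i][j] - k[i] * k[j] / two_m for j in range(n)] for i in range(n)]

community = [0, 0, 0, 0, 1, 1, 1]   # nodes 1-4 versus nodes 5-7
Q = sum(B[i][j] for i in range(n) for j in range(n)
        if community[i] == community[j]) / two_m

print(k, two_m)       # [3, 3, 3, 4, 3, 2, 2] 20
print(round(Q, 3))    # 0.355
```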

Page 12: Clique-based Network Clustering

High modularity implies dense connections inside communities and sparse connections between communities.

Approximate maximization algorithms:
• Greedy algorithms
• Simulated annealing
• Leading eigenvector
• Louvain method
• Ensemble learning (currently fastest)

Page 13: Clique-based Network Clustering

Benchmark model to simulate stochastic block network 1 (SBM 1) with built-in cluster structure.

[equation not preserved: link probabilities of SBM 1]

Each cluster has 40 nodes.

Modularity-based clustering on a random network from the stochastic block model.

The modularity maximization approach works well if clusters have similar size.
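A stochastic block model of this kind can be simulated in a few lines. The sketch below is illustrative only: the function `sample_sbm` and the probabilities `p_in = 0.2` and `p_out = 0.01` are assumptions, since the slide's own link probabilities are not preserved in the transcript.

```python
import random

def sample_sbm(sizes, p_in, p_out, seed=0):
    """Sample an undirected stochastic block model.

    sizes : list of cluster sizes
    p_in  : link probability inside a cluster (assumed value)
    p_out : link probability between clusters (assumed value)
    Returns (edges, labels) with 0-based node ids.
    """
    rng = random.Random(seed)
    labels = [c for c, size in enumerate(sizes) for _ in range(size)]
    n = len(labels)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            prob = p_in if labels[i] == labels[j] else p_out
            if rng.random() < prob:
                edges.append((i, j))
    return edges, labels

# Three clusters of 40 nodes each, as on the slide.
edges, labels = sample_sbm([40, 40, 40], p_in=0.2, p_out=0.01)
n = len(labels)
density = len(edges) / (n * (n - 1) / 2)
print(n, len(edges), round(density, 4))
```

With these assumed values the expected link density is (2340 * 0.2 + 4800 * 0.01) / 7140, about 0.0723. Setting `p_in` equal to `p_out` gives an Erdős-Rényi network with no built-in clusters.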

Page 14: Clique-based Network Clustering

A random network without cluster structure may be split (Erdős-Rényi network).

Small clusters in a large network may be merged (resolution limit).

Multi-resolution methods may not reduce both types of error simultaneously.

This is also a bottleneck of many other network clustering algorithms.

Page 15: Clique-based Network Clustering

Erdős-Rényi network of 40 nodes, density 0.1

Modularity-maximized clustering: Q = 0.37

Page 16: Clique-based Network Clustering

Stochastic Block Model 2 with [equation not preserved: link probabilities]

The two small clusters have 20 nodes each, and the largest cluster has 100 nodes.

The largest cluster is split.

Modularity maximization algorithms tend to fail in networks with clusters of very different sizes.

Modularity-maximized clustering with Q = 0.429

Page 17: Clique-based Network Clustering

Stochastic Block Model 3 with link probability [equation not preserved]

Cluster size: [800, 400, 50, 20]

Modularity method clustering results:
• 7 nodes in cluster 3 are merged with cluster 1
• All 20 nodes in cluster 4 are merged with cluster 1

Page 18: Clique-based Network Clustering

Algorithm 1
◦ Global algorithm
◦ Cluster internal link density above a user-defined threshold is guaranteed

Algorithm 2
◦ Local algorithm
◦ Risk of splitting a cluster is quantified and under user control
◦ Risk of merging clusters is minimized

Page 19: Clique-based Network Clustering

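Objective function (summand as given by the reward table; the normalizing constant in m is not preserved):

$$Q_p(c) \;\propto\; \sum_{i \neq j} (A_{ij} - p)\,\bigl(2\,\delta(c_i, c_j) - 1\bigr) \qquad (2)$$

where p is a user-defined parameter in [0, 1], δ is the Kronecker delta symbol, A is the adjacency matrix, c is the community membership vector, and m is the total link count.

Reward table:

                                      Connected pair of nodes   Disconnected pair of nodes
Pair of nodes in the same cluster     1 - p                     -p
Pair of nodes in different clusters   -1 + p                    p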
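The reward table translates directly into code. The sketch below sums the four rewards over all unordered node pairs; the function name `clique_objective` and the toy graph are hypothetical, and the slide's normalization by the link count m is left out because it is not recoverable from the transcript.

```python
from itertools import combinations

def clique_objective(n, edges, membership, p):
    """Sum the reward-table entries over all unordered node pairs.

    Same cluster:       connected -> 1 - p,   disconnected -> -p
    Different clusters: connected -> -1 + p,  disconnected -> p
    Normalization by the link count m is omitted (assumption).
    """
    linked = {frozenset(e) for e in edges}
    total = 0.0
    for i, j in combinations(range(n), 2):
        a = 1 if frozenset((i, j)) in linked else 0
        sign = 1 if membership[i] == membership[j] else -1
        total += sign * (a - p)
    return total

# Hypothetical toy graph: two triangles joined by a single link.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = clique_objective(6, edges, [0, 0, 0, 1, 1, 1], p=0.5)
merged = clique_objective(6, edges, [0, 0, 0, 0, 0, 0], p=0.5)
print(split, merged)   # 6.5 -0.5
```

Keeping the two triangles apart scores higher than merging them, because the merged cluster would have internal link density 7/15, which is below p = 0.5.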

Page 20: Clique-based Network Clustering

It is guaranteed that every community found has internal link density higher than the user-defined threshold p.
◦ If p = 1, every community found is a clique.
◦ If p = 25%, every community has internal link density higher than 25%.
◦ Communities with link density “significantly” higher than p will not be split.
◦ Communities with link density lower than p will definitely be split into smaller communities.
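The guarantee is stated in terms of internal link density, which is easy to compute for any candidate community. A minimal sketch, assuming an undirected edge list in which each link appears once (the helper name `internal_link_density` is illustrative):

```python
def internal_link_density(nodes, edges):
    """Fraction of node pairs inside `nodes` that are joined by a link."""
    members = set(nodes)
    size = len(members)
    if size < 2:
        return 0.0  # convention chosen here: no pairs, so density 0
    inside = sum(1 for u, v in edges if u in members and v in members)
    return inside / (size * (size - 1) / 2)

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
triangle = internal_link_density([0, 1, 2], edges)   # a clique
path = internal_link_density([2, 3, 4], edges)       # a 3-node path
print(triangle, path)
```

The triangle has density 1 and would be accepted even at p = 1, while the path has density 2/3 and would only be kept as one community for thresholds below that value.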

Page 21: Clique-based Network Clustering

Maximize objective function (2) over bipartitions:

$$\max_{s \in \{-1, 1\}^n} \; s^T \left[ A - p\,(J - I) \right] s \qquad (3)$$

where s is the n by 1 vector of community membership with binary entries 1 or -1, A is the adjacency matrix, J is the all-ones matrix, and I is the identity matrix.

Searching over all possible divisions is NP-hard.

Approximate spectral method:
◦ Find the largest eigenvalue w of the p-clique matrix:

$$C_p = A - p\,(J - I) \qquad (4)$$

◦ Choose a corresponding eigenvector v of w
◦ Use the sign of v to split the network of n nodes

Page 22: Clique-based Network Clustering

The sign vector of v is the best approximate solution to (3).
◦ If [...], the division by v will be executed.
◦ If [...], but [...], the division by v will still be executed.
◦ If [...], and [...], the division by v will be cancelled.

Page 23: Clique-based Network Clustering

Python SciPy wrapper of the ARPACK software
◦ Iterative matrix-vector products find the eigenvalues of large sparse or structured matrices.

The p-clique matrix is dense but structured.

Its matrix-vector product requires much less than the usual O(n^2) operations:
◦ The adjacency matrix A is usually sparse
◦ Av requires only O(m) operations
◦ Jv requires only O(n) operations
◦ Time complexity: O(m + n) per iteration
◦ Space complexity: O(m + n) (applicable to huge graphs)
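The structured matrix-vector product is the key to the efficiency claim. The sketch below assumes the p-clique matrix has the form A - p(J - I) and uses a plain shifted power iteration as a stand-in for ARPACK, so it runs with the standard library only; the function names and the toy graph are hypothetical.

```python
import random

def pclique_matvec(adj, p, v):
    """Return (A - p*(J - I)) v in O(m + n) time, never forming the dense matrix."""
    total = sum(v)  # J v equals sum(v) in every coordinate: O(n)
    return [sum(v[j] for j in adj[i]) - p * (total - v[i])  # sparse A v: O(m) overall
            for i in range(len(v))]

def leading_eigenpair(adj, p, iters=500, seed=0):
    """Shifted power iteration, a simple stand-in for ARPACK."""
    n = len(adj)
    shift = float(n)  # eigenvalues lie in [-(n-1), n-1], so C + n*I is positive definite
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [cv + shift * x for cv, x in zip(pclique_matvec(adj, p, v), v)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(pclique_matvec(adj, p, v), v))  # Rayleigh quotient
    return eigenvalue, v

# Hypothetical toy graph as adjacency lists: two triangles joined by the link (2, 3).
adj = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
w, v = leading_eigenpair(adj, p=0.5)
side = [x > 0 for x in v]   # the sign of v gives the bipartition
print(round(w, 4), side)
```

On this toy graph the signs of v separate the two triangles. For large networks an ARPACK-based solver fed with the same matrix-vector product would converge in far fewer iterations than power iteration.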

Page 24: Clique-based Network Clustering

Usually it is hard to tell how many communities there are in a large network.

First split the network into two parts, then divide these two parts, and so forth.

Use the Bipartition Criteria in slide 21 as the stopping criterion of this recursive dividing procedure.
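The recursive procedure is independent of how a single bipartition is computed. The sketch below shows only the recursion: `bipartition` is any function that returns two parts, or `None` when the stopping criterion cancels the division. The splitter `largest_gap_split` is a hypothetical toy stand-in working on numbers, not the spectral bipartition from the slides.

```python
def recursive_bisection(nodes, bipartition):
    """Split `nodes` recursively until `bipartition` declines to split.

    bipartition(nodes) must return (left, right), or None when the
    stopping criterion cancels the division.
    """
    result = bipartition(nodes)
    if result is None:
        return [list(nodes)]
    left, right = result
    return (recursive_bisection(left, bipartition)
            + recursive_bisection(right, bipartition))

def largest_gap_split(values, min_gap=5):
    """Toy splitter: cut a sorted list at its largest gap if the gap exceeds min_gap."""
    values = sorted(values)
    if len(values) < 2:
        return None
    gap, i = max((values[i + 1] - values[i], i) for i in range(len(values) - 1))
    if gap <= min_gap:
        return None  # stopping criterion: no gap is wide enough
    return values[:i + 1], values[i + 1:]

clusters = recursive_bisection([1, 2, 3, 10, 11, 12, 30, 31], largest_gap_split)
print(clusters)   # [[1, 2, 3], [10, 11, 12], [30, 31]]
```

The number of clusters is never specified in advance; it is determined entirely by where the stopping criterion cancels further divisions.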

Page 25: Clique-based Network Clustering

Stochastic Block Network 2 with [equation not preserved: link probabilities]

The two small clusters have 20 nodes each, and the largest cluster has 100 nodes.

Expected link density: 0.1125

[figure: clustering results for p = 0.1, p = 0.05, and p = 0.02]

Page 26: Clique-based Network Clustering

Karate Club Member data (34 people)

Link density: 0.139

[figure: clustering results for p = 0.1 and p = 0.15]

Page 27: Clique-based Network Clustering

Doubtful Sound Dolphin data (62 dolphins)

Link density: 0.084

[figure: clustering results for p = 0.03 and p = 0.2]

Page 28: Clique-based Network Clustering

Increasing p: zoom in
◦ Smaller communities are found.
◦ Risk of merging clusters (type 2 error) is lower.
◦ Risk of splitting a cluster or Erdős-Rényi sub-network (type 1 error) is higher.

Decreasing p: zoom out
◦ Larger communities are found.
◦ Risk of merging clusters (type 2 error) is higher.
◦ Risk of splitting a cluster or Erdős-Rényi sub-network (type 1 error) is lower.

Page 29: Clique-based Network Clustering

Objective: choose parameter p such that at most 2.5% of the nodes in an Erdős-Rényi sub-network will be trimmed off.

Cause of type 1 error:
◦ Due to random fluctuation in link formation, 2.5% of nodes have fewer than 0.975 np links with the remaining 97.5% of nodes.
◦ Threshold p is higher than the link density between the 2.5% group and the 97.5% group of nodes.

Strategy:
◦ Choose p to be significantly smaller than the observed total link density.

Page 30: Clique-based Network Clustering

Solution:

[inequality (5) not preserved: upper bound on the threshold p]

Intuition:
◦ Use a truncated normal distribution to approximate the distribution of link density between the 2.5% group and the 97.5% group.

Experiment results:
◦ In 100 SBM networks, the type 1 error is bounded by 5% (mostly 3.5%).
◦ In SBM networks with average degree less than 5, the type 1 error is less than 2%.

Page 31: Clique-based Network Clustering

When will two clusters of sizes [...] and [...], with link probability [...], be merged?

[equation not preserved] where [...] is the observed link density.

The risk of type 2 error will be bounded by 2.5% if

[inequality (6) not preserved]

Page 32: Clique-based Network Clustering

Challenge:
◦ In splitting a sub-network, we usually don’t know the link density [...] or [...] between two clusters.
◦ In theory, there may be cases where inequalities (5) and (6) conflict.

Solution:
◦ Choose p to be the upper bound in (5).
◦ Develop a more flexible algorithm which allows p to vary from one sub-network to another. This may reduce the chance of a conflict between inequalities (5) and (6).

Page 33: Clique-based Network Clustering

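Normalized mutual information (NMI): a measure of consistency between the found communities R and the real communities F.

$$\mathrm{NMI}(R, F) = \frac{2\, I(R, F)}{H(R) + H(F)} = \frac{-2 \sum_{i=1}^{c_R} \sum_{j=1}^{c_F} N_{ij} \log\!\left( \frac{N_{ij}\, N}{N_{i\cdot}\, N_{\cdot j}} \right)}{\sum_{i=1}^{c_R} N_{i\cdot} \log\!\left( \frac{N_{i\cdot}}{N} \right) + \sum_{j=1}^{c_F} N_{\cdot j} \log\!\left( \frac{N_{\cdot j}}{N} \right)} \qquad (7)$$

where I is the mutual information (a Kullback-Leibler divergence), H is entropy, N is the confusion matrix with row sums $N_{i\cdot}$, column sums $N_{\cdot j}$ and total N, and $c_R$ and $c_F$ are the numbers of communities in R and F.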
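NMI can be computed from two label vectors by building the confusion matrix on the fly. The sketch below assumes the common normalization 2 I / (H(R) + H(F)); the function name `nmi` and the convention of returning 1.0 when both labelings consist of a single cluster are choices made here, not taken from the slides.

```python
from collections import Counter
from math import log

def nmi(real, found):
    """Normalized mutual information 2*I / (H(real) + H(found)) of two labelings."""
    n = len(real)
    joint = Counter(zip(real, found))   # confusion matrix entries N_ij
    row = Counter(real)                 # row sums
    col = Counter(found)                # column sums
    mutual = sum(c / n * log(c * n / (row[r] * col[f]))
                 for (r, f), c in joint.items())
    h_real = -sum(c / n * log(c / n) for c in row.values())
    h_found = -sum(c / n * log(c / n) for c in col.values())
    if h_real + h_found == 0:
        return 1.0   # both labelings put every node in one cluster
    return 2 * mutual / (h_real + h_found)

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])      # identical up to relabelling
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # statistically independent
print(same, unrelated)
```

The measure ignores the names of the clusters: a perfect recovery with permuted labels still scores 1, and a labeling that carries no information about the real one scores 0.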

Page 34: Clique-based Network Clustering

Review of stochastic block networks 1 through 3 using NMI

        Size   Link density   Auto-chosen p   Average NMI   s.e.     Number of simulations
SBM 1   120    0.0723         0.0239          0.8484        0.0195   100
SBM 2   140    0.1125         0.0579          0.9483        0.0078   100
SBM 3   1270   0.0722         0.0574          0.9993        0.0001   100

Results:
◦ Type 1 error is overly controlled for small and sparse networks such as SBM 1.

Page 35: Clique-based Network Clustering

Stochastic Block Model 4 with [equation not preserved: link probabilities]

Cluster size: (100, 20, 20)

Expected link density: 0.1507

Auto-chosen parameter p from (5): 0.0888

Using the auto-chosen parameter p ends up merging small clusters 2 and 3.

Clusters 2 and 3 will be divided only if we zoom in more by increasing p.

The modularity method not only merged clusters 2 and 3, but also split cluster 1.

[figure: clustering by modularity; clustering with p = 0.0888]

Page 36: Clique-based Network Clustering

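[figure: binary tree of the hierarchical clustering, with sub-networks S0, S1, S2, S3 and clusters C1 to C5; each sub-network is split using its own local threshold p(S0), p(S1), p(S2), p(S3)]

[...] are the observed link density and node count in sub-network S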

Page 37: Clique-based Network Clustering

Maximize the localized clique-index

[equation (8) not preserved]

where T is the binary tree representing the hierarchical clustering process, p(S) is the automatic choice of the local threshold parameter p for sub-network S, and [...] is the indicator of whether nodes i and j are divided in the bipartition of S.

Page 38: Clique-based Network Clustering

Every bipartition in sub-network S brings the contribution:

[equation (9) not preserved]

The best bipartition is obtained from the sign of the leading eigenvector of the matrix:

[equation (10) not preserved]

The bipartition on S will be cancelled if the contribution [...].

Page 39: Clique-based Network Clustering

Each matrix-vector product takes time O(m).

Finding the leading eigenvector takes O(n) matrix-vector products.

On average, the height of the binary tree representing the hierarchical clustering procedure is O(log(n)).

For both the global and the localized algorithm, the time complexity is O(mn log(n)), or O(n^2 log(n)) for sparse networks.

Page 40: Clique-based Network Clustering

Stochastic Block Model 4 with [equation not preserved: link probabilities]

Cluster size: (100, 20, 20)

Average NMI among 100 simulations is 0.9717.

The localized clustering algorithm is able to detect the built-in community structure.

Page 41: Clique-based Network Clustering

Stochastic Block Model 5 with 7000 nodes and 10 built-in clusters

Cluster sizes with internal link density: [(3000, 0.08), (2000, 0.09), (1000, 0.1), (400, 0.15), (200, 0.2), (100, 0.25), (100, 0.25), (100, 0.25), (80, 0.3), (20, 0.7)]

Link density between different clusters: 0.005

Average NMI among 20 simulations: 0.9895

Average running time: 1.66 seconds

Page 42: Clique-based Network Clustering

Stochastic Block Model 6 with 20000 nodes and 25 clusters

Cluster sizes with internal link density: [(3350, 0.045), (3000, 0.05), (2000, 0.07), (2000, 0.07), (2000, 0.07), (1000, 0.09), (1000, 0.09), (1000, 0.09), (1000, 0.09), (500, 0.12), (500, 0.12), (400, 0.14), (400, 0.14), (400, 0.14), (400, 0.14), (200, 0.30), (200, 0.30), (200, 0.30), (100, 0.40), (100, 0.40), (50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80)]

Link density between clusters: 0.0001

Average NMI among 10 simulations: 0.8960

Average running time: 12.6 seconds

Page 43: Clique-based Network Clustering

Review of SBM networks 1 through 6:

        Built-in clusters   Size    Link density   Average NMI   s.e.     Number of simulations
SBM 1   3                   120     0.0723         0.8972        0.0195   100
SBM 2   3                   140     0.1125         0.9476        0.0051   100
SBM 3   4                   1270    0.0722         0.9687        0.0028   100
SBM 4   3                   140     0.0888         0.9717        0.0033   100
SBM 5   10                  7000    0.0285         0.9895        0.0022   20
SBM 6   25                  20000   0.005          0.8960        0.0029   10

Clustering quality is high for large networks or networks with high link density.

Page 44: Clique-based Network Clustering

Global algorithm:
◦ Good for applications with specific requirements on the internal link density of every community found.

Localized algorithm:
◦ Good for finding statistically significant communities.
◦ Type 1 error seems to be overly controlled for sparse networks.
◦ The conflict between type 1 error and type 2 error is effectively avoided in the sample simulated networks.

Page 45: Clique-based Network Clustering

The Erdős-Rényi model may not serve as a good null model of a random network without built-in community structure. Statistically significant communities under other null models need consideration.

Extend the algorithm to directed networks, networks with numerical values in the adjacency matrix, and networks with additional profile information on each node.

Develop a close-to-linear-time clustering algorithm.