Graph-based Clustering · cse.msu.edu/~cse802/s17/slides/lec_20_21_22_clustering.pdf
TRANSCRIPT
Graph-based Clustering
● Transform the data into a graph representation
– Vertices are the data points to be clustered
– Edges are weighted based on the similarity between data points
⇒ Graph partitioning
Each connected component is a cluster
Clustering as Graph Partitioning
● Two things are needed:
1. An objective function to determine the best way to "cut" the edges of the graph
2. An algorithm to find the optimal partition (optimal according to the objective function)
Objective Function for Partitioning
● Suppose we want to partition the set of vertices V into two sets, V1 and V2
– One possible objective function is to minimize the graph cut:

  Cut(V1, V2) = Σ_{i∈V1, j∈V2} w_ij

  where w_ij is the weight of the edge between nodes i and j

(Figure: two partitions of the same 6-node weighted graph with edge weights 0.1, 0.3, 0.2, 0.1, 0.1; the left partition has Cut = 0.2, the right has Cut = 0.4.)
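The cut objective is just the total weight of the crossing edges. A minimal sketch, using a hypothetical edge list (the exact topology of the slide's figure is an assumption):

```python
def cut(edges, V1, V2):
    """Total weight of edges with one endpoint in V1 and the other in V2."""
    return sum(w for (i, j), w in edges.items()
               if (i in V1 and j in V2) or (i in V2 and j in V1))

# Hypothetical undirected edge list, each edge stored once as (i, j) -> w_ij
edges = {(1, 3): 0.1, (2, 3): 0.3, (3, 4): 0.2, (4, 5): 0.1, (4, 6): 0.1}
print(cut(edges, {1, 2, 3}, {4, 5, 6}))  # 0.2: only edge (3,4) crosses
```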
Objective Function for Partitioning
● Limitation of minimizing the graph cut:
– The optimal solution may simply split a single node off from the rest of the graph, which is not a desirable solution

(Figure: cutting the single edge of weight 0.1 that attaches one node to the rest of the graph gives Cut = 0.1.)
Objective Function for Partitioning
● We should not only minimize the graph cut but also look for "balanced" clusters:

  Ratio cut(V1, V2) = Cut(V1, V2)/|V1| + Cut(V1, V2)/|V2|

  Normalized cut(V1, V2) = Cut(V1, V2)/d_V1 + Cut(V1, V2)/d_V2

  where d_V1 = Σ_{i∈V1} d_i, d_V2 = Σ_{j∈V2} d_j, and d_i = Σ_j w_ij

V1 and V2 are the sets of nodes in partitions 1 and 2; |Vi| is the number of nodes in partition Vi.
Example
Partition that splits off the node of degree 0.1:
  Cut = 0.1
  Ratio cut = 0.1/1 + 0.1/5 = 0.12
  Normalized cut = 0.1/0.1 + 0.1/1.5 = 1.07

Balanced partition:
  Cut = 0.2
  Ratio cut = 0.2/3 + 0.2/3 = 0.13
  Normalized cut = 0.2/1 + 0.2/0.6 = 0.53

(Figure: the same 6-node weighted graph under the two partitions.)
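Both balanced objectives can be checked numerically. The edge list below is an assumption, reconstructed to be consistent with the example numbers (v1-v3: 0.1, v2-v3: 0.3, v3-v4: 0.2, v4-v5: 0.1, v4-v6: 0.1):

```python
# Hypothetical edge list consistent with the worked example above
edges = {(1, 3): 0.1, (2, 3): 0.3, (3, 4): 0.2, (4, 5): 0.1, (4, 6): 0.1}

def cut(V1, V2):
    return sum(w for (i, j), w in edges.items()
               if (i in V1 and j in V2) or (i in V2 and j in V1))

def vol(S):
    # d_S: sum of the (weighted) degrees of the nodes in S
    return sum(w * ((i in S) + (j in S)) for (i, j), w in edges.items())

def ratio_cut(V1, V2):
    return cut(V1, V2) / len(V1) + cut(V1, V2) / len(V2)

def normalized_cut(V1, V2):
    return cut(V1, V2) / vol(V1) + cut(V1, V2) / vol(V2)

rest = {1, 2, 3, 4, 5}
print(round(ratio_cut({6}, rest), 2),
      round(normalized_cut({6}, rest), 2))        # 0.12 1.07
print(round(ratio_cut({1, 2, 3}, {4, 5, 6}), 2),
      round(normalized_cut({1, 2, 3}, {4, 5, 6}), 2))  # 0.13 0.53
```

Both objectives prefer the balanced partition, even though its plain cut (0.2) is larger than the singleton cut (0.1).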
Example
If the graph is unweighted (all edges have weight 1):

Partition that splits off a single node:
  Cut = 1
  Ratio cut = 1/1 + 1/5 = 1.2
  Normalized cut = 1/1 + 1/9 = 1.11

Balanced partition:
  Cut = 1
  Ratio cut = 1/3 + 1/3 = 0.67
  Normalized cut = 1/5 + 1/5 = 0.4

(Figure: the same 6-node graph with all edge weights equal to 1.)
Algorithm for Graph Partitioning
● How to minimize the objective function?
– We can use a heuristic (greedy) approach
  • Example: METIS graph partitioning, http://www.cs.umn.edu/~metis
– An elegant way to optimize the function is to use ideas from spectral graph theory
  • This leads to a class of algorithms known as spectral clustering
Spectral Clustering
● Spectral properties of a graph
– Spectral properties: the eigenvalues/eigenvectors of the adjacency matrix can be used to represent a graph
● There is a relationship between the spectral properties of a graph and the graph partitioning problem
Spectral Properties of a Graph
● Start with a similarity/adjacency matrix, W, of a graph
● Define a diagonal matrix D:

  D_ij = Σ_{k=1}^n w_ik   if i = j
       = 0                otherwise

– If W is a binary 0/1 matrix, then D_ii represents the degree of node i
Preliminaries
(Figure: a graph with two disjoint components, edges v1-v3, v2-v3 and v4-v5, v4-v6, all weights 1.)

  W =
    [0 0 1 0 0 0]
    [0 0 1 0 0 0]
    [1 1 0 0 0 0]
    [0 0 0 0 1 1]
    [0 0 0 1 0 0]
    [0 0 0 1 0 0]

  D =
    [1 0 0 0 0 0]
    [0 1 0 0 0 0]
    [0 0 2 0 0 0]
    [0 0 0 2 0 0]
    [0 0 0 0 1 0]
    [0 0 0 0 0 1]

  D_ij = Σ_{k=1}^n w_ik if i = j, 0 otherwise

Two clusters give two block-diagonal matrices.
Graph Laplacian Matrix
Laplacian: L = D - W

  W =
    [0 0 1 0 0 0]
    [0 0 1 0 0 0]
    [1 1 0 0 0 0]
    [0 0 0 0 1 1]
    [0 0 0 1 0 0]
    [0 0 0 1 0 0]

  L =
    [ 1  0 -1  0  0  0]
    [ 0  1 -1  0  0  0]
    [-1 -1  2  0  0  0]
    [ 0  0  0  2 -1 -1]
    [ 0  0  0 -1  1  0]
    [ 0  0  0 -1  0  1]

(Figure: the same two-component graph.) The Laplacian also has a block structure: two clusters give two diagonal blocks.
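Constructing W, D, and L for this two-component graph takes only a few lines; a minimal numpy sketch (nodes 0-indexed):

```python
import numpy as np

# Edges v1-v3, v2-v3, v4-v5, v4-v6 with unit weights (0-indexed nodes)
W = np.zeros((6, 6))
for i, j in [(0, 2), (1, 2), (3, 4), (3, 5)]:
    W[i, j] = W[j, i] = 1.0
D = np.diag(W.sum(axis=1))   # D_ii = degree of node i
L = D - W                    # graph Laplacian
print(np.diag(D))            # [1. 1. 2. 2. 1. 1.]
```

Note that every row of L sums to zero, which is exactly why the all-ones vector is an eigenvector with eigenvalue 0.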
Properties of Graph Laplacian
● L = (D - W) is a symmetric matrix
● L is a positive semi-definite matrix
– Consequence: all eigenvalues of L are ≥ 0
Spectral Clustering
Consider a data set with N data points:
1. Construct an N × N similarity matrix, W
2. Compute the N × N Laplacian matrix, L = D - W
3. Compute the k "smallest" eigenvectors of L
   a) Each eigenvector v_i is an N × 1 column vector
   b) Create a matrix V containing eigenvectors v1, v2, ..., vk as columns (you may exclude the first eigenvector)
4. Cluster the rows of V into K clusters using k-means or another clustering algorithm
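The four steps above can be sketched end to end in numpy. This is a minimal illustration, not a production implementation; the tiny farthest-first k-means stands in for any off-the-shelf k-means:

```python
import numpy as np

def spectral_clustering(W, k, iters=20):
    D = np.diag(W.sum(axis=1))
    L = D - W                               # step 2: Laplacian
    _, eigvecs = np.linalg.eigh(L)          # eigh sorts eigenvalues ascending
    V = eigvecs[:, :k]                      # step 3: k smallest eigenvectors
    centers = [V[0]]                        # step 4: k-means on the rows of V
    for _ in range(1, k):                   # deterministic farthest-first init
        d2 = np.min([((V - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(V[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((V[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = V[labels == c].mean(axis=0)
    return labels

# The two-component example graph from the earlier slides (0-indexed)
W = np.zeros((6, 6))
for i, j in [(0, 2), (1, 2), (3, 4), (3, 5)]:
    W[i, j] = W[j, i] = 1.0
print(spectral_clustering(W, 2))  # nodes {0,1,2} share one label, {3,4,5} the other
```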
Summary
● The spectral properties of a graph (i.e., its eigenvalues and eigenvectors) contain information about its clustering structure
● To find k clusters, apply k-means or another algorithm to the first k eigenvectors of the graph Laplacian matrix
Minimum Spanning Tree
● Given the MST of the data points, remove the longest (most inconsistent) edge, then the next longest edge, and so on
● One useful statistic that can be estimated from the MST is the edge-length distribution
● For instance, in the case of 2 dense clusters immersed in a sparse set of points, the edge-length distribution is bimodal: short within-cluster edges and long between-cluster edges
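The remove-the-longest-edges procedure can be sketched with scipy's MST routine; the two tight point groups below are made up for illustration:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree

# Two made-up tight groups of 2D points
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
dist = np.sqrt(((pts[:, None] - pts[None]) ** 2).sum(-1))
mst = minimum_spanning_tree(dist).toarray()   # MST edge weights, one per edge

k = 2                                          # desired number of clusters
edges = np.argwhere(mst > 0)
longest_first = edges[np.argsort(mst[tuple(edges.T)])[::-1]]
for i, j in longest_first[: k - 1]:            # delete the k-1 longest edges
    mst[i, j] = 0
n_components, labels = connected_components(mst, directed=False)
print(n_components)  # 2
```

The longest MST edge is the sparse "bridge" between the two dense groups, so deleting it recovers the clusters.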
Cluster Validity
● Which clustering method is appropriate for a particular data set?
● How does one determine whether the results of a clustering method truly characterize the data?
● How do you know when you have a good set of clusters?
● Is it unusual to find a cluster as compact and isolated as the observed clusters?
● How do we guard against elaborate interpretation of randomly distributed data?
Cluster Validity
● Clustering algorithms find clusters even if there are no natural clusters in the data
● Easy to design new methods, difficult to validate them
(Figure: K-means with K = 3 on 100 uniformly distributed 2D points still returns three "clusters".)
● Cluster stability: perturb the data by bootstrapping. How do the clusters change over the ensemble?
Hierarchical Clustering
• Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. Two approaches:
• Agglomerative ("bottom up"): each point starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy; more popular
• Divisive ("top down"): all points start in one cluster, and splits are performed recursively as one moves down the hierarchy
How do we define similarity between two clusters, or between a point and a cluster?
Agglomerative Clustering Example
• Cluster six elements {a}, {b}, {c}, {d}, {e} and {f} in 2D; use Euclidean distance as the dissimilarity measure
• Build the hierarchy from the individual elements by progressively merging clusters
• Which elements to merge into a cluster? Usually, merge the two closest elements according to the chosen distance
Suppose we have merged the two closest elements b and c to obtain clusters {a}, {b, c}, {d}, {e} and {f}.
To merge further, we need the distance between {a} and {b, c}. Two common ways to define the distance between two clusters:
• The maximum distance between elements of each cluster (also called complete-linkage clustering): max { d(x, y) : x ∈ A, y ∈ B }
• The minimum distance between elements of each cluster (single-linkage clustering): min { d(x, y) : x ∈ A, y ∈ B }
Stop clustering either when the clusters are too far apart to be merged or when there is a sufficiently small number of clusters.
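Both linkage rules are available in scipy. A small sketch with six made-up 2D points standing in for {a..f} (the slide's actual coordinates are not given):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Six made-up 2D points playing the role of a..f
pts = np.array([[0, 0], [1, 0], [1.2, 0.4], [4, 4], [4.5, 4], [5, 5]], float)
d = pdist(pts)                               # condensed Euclidean distances
for method in ("single", "complete"):
    Z = linkage(d, method=method)            # merge history (the dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")   # flatten to 2 clusters
    print(method, labels)
```

On well-separated groups like these, single and complete linkage agree; they differ mainly on elongated or chained clusters, where single link tends to chain and complete link favors compact clusters.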
Single-link vs. Complete-link Hierarchical Clustering
(Figures: 2D PCA projection of the Iris data; minimum spanning tree clustering of the 2D PCA projection; K-means clustering of the Iris data, with assignments shown on the 2D PCA projection; single-link clustering of the Iris data; complete-link clustering of the Iris data.)
Angkor Wat
Hindu temple built by a Khmer king ~1,150 AD; the Khmer kingdom declined in the 15th century; French explorers discovered the hidden ruins in the late 1800s.
Apsaras of Angkor Wat
• Angkor Wat contains a unique gallery of ~2,000 women depicted in detailed full-body portraits
• What facial types are represented in these portraits?
Clustering of Apsara Faces
(Figures: shape alignment of the faces using 127 facial landmarks; the resulting single-link clusters, numbered 1-10.)
How to validate the clusters or groups?
Ground Truth
Khmer Dance and Cultural Center
Exploratory Data Analysis
Clustering with large weights assigned to chin and nose.
Example devata faces from the clusters differ largely in chin and nose, reflecting the weights chosen for the similarity measure.
(Figure: 2D MDS projection of the similarity matrix.)
Exploratory Data Analysis
(Figure: 3D MDS projection of the similarity matrix.)
Spectral Clustering & Graph Partitioning
● We have shown that the spectral properties of a graph are related to its clusters
– How are they related to minimizing the graph cut?
Graph Partitioning
● Recall the following objectives for graph partitioning:

  Ratio cut(V1, V2) = Cut(V1, V2)/|V1| + Cut(V1, V2)/|V2|

  Normalized cut(V1, V2) = Cut(V1, V2)/d_V1 + Cut(V1, V2)/d_V2

  where d_V1 = Σ_{i∈V1} d_i, d_V2 = Σ_{j∈V2} d_j, d_i = Σ_j w_ij,
  and Cut(V1, V2) = Σ_{i∈V1, j∈V2} w_ij
Ratio Cut
● Let x_i indicate the membership of node v_i in a cluster:

  x_i =  √(|V2|/|V1|)   if v_i ∈ V1
  x_i = -√(|V1|/|V2|)   if v_i ∈ V2

● Also:

  x^T L x = (1/2) Σ_{i,j} w_ij (x_i - x_j)^2
          = (1/2) [ Σ_{i∈V1, j∈V2} w_ij (x_i - x_j)^2 + Σ_{i∈V2, j∈V1} w_ij (x_i - x_j)^2 ]

– where L is the graph Laplacian matrix (within-cluster terms vanish because x_i = x_j there)
Ratio Cut
Substituting the definition of x into x^T L x:

  x^T L x = (1/2) Σ_{i∈V1, j∈V2} w_ij ( √(|V2|/|V1|) + √(|V1|/|V2|) )^2
            + (1/2) Σ_{i∈V2, j∈V1} w_ij ( √(|V1|/|V2|) + √(|V2|/|V1|) )^2
          = Cut(V1, V2) ( |V2|/|V1| + |V1|/|V2| + 2 )
          = Cut(V1, V2) ( (|V1| + |V2|)/|V1| + (|V1| + |V2|)/|V2| )
          = (|V1| + |V2|) × RatioCut(V1, V2)
Ratio Cut
● Therefore:

  min_{V1,V2} RatioCut(V1, V2) = min_x x^T L x

– Thus, we have related the ratio cut to the Laplacian matrix L
– But there is one issue:
  • The trivial solution x = 0 (the all-zeros vector) minimizes x^T L x
  • We need to look for a non-trivial solution
– Look for constraints that must be satisfied by x:

  x^T 1 = Σ_{i=1}^n x_i = |V1| √(|V2|/|V1|) - |V2| √(|V1|/|V2|) = √(|V1||V2|) - √(|V1||V2|) = 0

The solution x must be orthogonal to the vector of all 1s.
Ratio Cut
Another constraint that must be satisfied by x:

  x^T x = Σ_{i=1}^n x_i^2 = Σ_{i∈V1} |V2|/|V1| + Σ_{i∈V2} |V1|/|V2| = |V2| + |V1| = n
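These facts are easy to verify numerically. A sketch on the 6-node unit-weight example (the edge list is an assumption), checking x^T 1 = 0, x^T x = n, and x^T L x = (|V1| + |V2|) × RatioCut(V1, V2):

```python
import numpy as np

# Assumed 6-node unit-weight graph with a single bridging edge (2,3)
W = np.zeros((6, 6))
for i, j in [(0, 2), (1, 2), (2, 3), (3, 4), (3, 5)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

V1, V2 = [0, 1, 2], [3, 4, 5]
x = np.array([np.sqrt(len(V2) / len(V1)) if i in V1
              else -np.sqrt(len(V1) / len(V2)) for i in range(6)])

cut = W[np.ix_(V1, V2)].sum()
ratio_cut = cut / len(V1) + cut / len(V2)
print(x @ np.ones(6), x @ x, x @ L @ x, (len(V1) + len(V2)) * ratio_cut)
```

With |V1| = |V2| = 3 the indicator vector is simply (1, 1, 1, -1, -1, -1), and the last two printed values coincide, as the derivation promises.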
Ratio Cut
● This is a constrained optimization problem:

  min_x x^T L x   subject to   x^T x = n

  where x_i =  √(|V2|/|V1|)  if v_i ∈ V1
        x_i = -√(|V1|/|V2|)  if v_i ∈ V2

● Instead, we solve a relaxation of the problem, dropping the requirement that x take only these two discrete values. Using a Lagrange multiplier λ:

  F = x^T L x - λ(x^T x - n)
  ∂F/∂x = 2Lx - 2λx = 0  ⇒  Lx = λx
Putting It All Together
● We have shown that:
– Minimizing the graph cut is equivalent to finding x that minimizes x^T L x such that x^T x = n
– The solution for x is given by the eigenvectors of L
– Thus, the spectral decomposition of the graph Laplacian yields the solution of the (relaxed) graph partitioning problem
Spectral Clustering with Ratio Cut
● At the constrained minimum, Lx = λx, so x^T L x = λ x^T x = nλ and

  min_x x^T L x = n λ_min

● But λ_min = 0, with eigenvector 1 = (1 1 1 ... 1)^T
● Since we want a solution where x^T 1 = 0, we need x ≠ 1
● Instead of the smallest eigenvalue, we look for the eigenvector corresponding to the next smallest eigenvalue
● In summary, finding the eigenvector that corresponds to the second smallest eigenvalue is a relaxation of the ratio cut graph partitioning problem (for k = 2)
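The recipe above, second-smallest eigenvector (the so-called Fiedler vector) with a sign split, is a few lines of numpy. The connected 6-node unit-weight graph is an assumption matching the earlier examples:

```python
import numpy as np

# Assumed connected 6-node graph: two triads joined by the bridge (2,3)
W = np.zeros((6, 6))
for i, j in [(0, 2), (1, 2), (2, 3), (3, 4), (3, 5)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
fiedler = vecs[:, 1]                # skip the constant eigenvector (lambda = 0)
labels = (fiedler > 0).astype(int)  # 2-way split by sign
print(labels)  # {v1,v2,v3} share one label, {v4,v5,v6} the other
```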
Properties of Graph Laplacian
● L = (D - W) is a symmetric matrix
● L is a positive semi-definite matrix
– For all real-valued vectors x: x^T L x ≥ 0
– Consequence: all eigenvalues of L are ≥ 0

  x^T L x = x^T (D - W) x = x^T D x - x^T W x
          = Σ_i d_i x_i^2 - Σ_{i,j} w_ij x_i x_j
          = (1/2) ( Σ_i d_i x_i^2 - 2 Σ_{i,j} w_ij x_i x_j + Σ_j d_j x_j^2 )
          = (1/2) Σ_{i,j=1}^N w_ij (x_i - x_j)^2 ≥ 0

  (where d_i = Σ_j w_ij)
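The identity behind positive semi-definiteness can be spot-checked numerically on a random symmetric weight matrix:

```python
import numpy as np

# Check x^T L x = (1/2) * sum_ij w_ij (x_i - x_j)^2 on random data
rng = np.random.default_rng(0)
W = rng.random((5, 5))
W = (W + W.T) / 2            # symmetric weights
np.fill_diagonal(W, 0)       # no self-loops
L = np.diag(W.sum(axis=1)) - W
x = rng.standard_normal(5)
quad = x @ L @ x
pairwise = 0.5 * sum(W[i, j] * (x[i] - x[j]) ** 2
                     for i in range(5) for j in range(5))
print(np.isclose(quad, pairwise), quad >= 0)  # True True
```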
Properties of Laplacian Matrix
Suppose e = [1 1 ... 1]^T. Then

  W e = [ Σ_j w_1j, Σ_j w_2j, ..., Σ_j w_dj ]^T = [ d_1, d_2, ..., d_d ]^T = D e

so W e = D e.
Properties of Laplacian Matrix
● From W e = D e, (D - W) e = 0, i.e.

  L e = 0 = 0 · e   (eigenvalue equation)

● Since e ≠ [0 ... 0]^T, λ = 0 is an eigenvalue of L
– 0 is an eigenvalue of L with the corresponding eigenvector e = [1 1 1 1 ... 1]^T
– Furthermore, since L is positive semi-definite, 0 is the smallest eigenvalue of L
Properties of Laplacian Matrix
● More generally, if L is block diagonal with k blocks,

  L = [ L1  0   ...  0  ]
      [ 0   L2  ...  0  ]
      [ ... ... ...  ...]
      [ 0   0   ...  Lk ]

– then:
  • There are k eigenvalues of L which have the value 0
  • The corresponding eigenvectors are

  [e 0 ... 0]^T, [0 e ... 0]^T, ..., [0 0 ... e]^T

  where e is [1 1 ... 1]^T (of the appropriate block length)
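This k-zero-eigenvalues property is easy to confirm on a made-up example: a graph consisting of three disjoint triangles (k = 3):

```python
import numpy as np

# Three disjoint triangles: a graph with k = 3 connected components
k, m = 3, 3
W = np.zeros((k * m, k * m))
tri = np.ones((m, m)) - np.eye(m)    # adjacency of one triangle
for b in range(k):
    W[b * m:(b + 1) * m, b * m:(b + 1) * m] = tri
L = np.diag(W.sum(axis=1)) - W
vals = np.linalg.eigvalsh(L)
print(int(np.sum(np.abs(vals) < 1e-9)))  # 3
```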
Properties of Laplacian Matrix
For the two-component example graph (L = D - W as computed earlier):

Eigenvalues of L:

  Λ = diag(0, 0, 1, 1, 3, 3)

Eigenvectors of L (columns of V, ordered by eigenvalue; the first two, for eigenvalue 0, are the scaled cluster indicators):

  V ≈
    [0.58  0     0.71  0     0.41  0   ]
    [0.58  0    -0.71  0     0.41  0   ]
    [0.58  0     0     0    -0.82  0   ]
    [0     0.58  0     0     0    -0.82]
    [0     0.58  0     0.71  0     0.41]
    [0     0.58  0    -0.71  0     0.41]

(signs and the ordering within each repeated eigenvalue are arbitrary)
Properties of Laplacian Matrix
(Same eigenvalues and eigenvectors as on the previous slide.) If we cluster the data using only the first 2 eigenvectors, we get the two desired clusters.
Properties of Laplacian Matrix
(Figure: the earlier 6-node graph with an added bridging edge v3-v4; all edge weights 1.)

  W =
    [0 0 1 0 0 0]
    [0 0 1 0 0 0]
    [1 1 0 1 0 0]
    [0 0 1 0 1 1]
    [0 0 0 1 0 0]
    [0 0 0 1 0 0]

Laplacian, L = D - W:

  L =
    [ 1  0 -1  0  0  0]
    [ 0  1 -1  0  0  0]
    [-1 -1  3 -1  0  0]
    [ 0  0 -1  3 -1 -1]
    [ 0  0  0 -1  1  0]
    [ 0  0  0 -1  0  1]

The clusters are no longer perfectly separated: L is no longer block diagonal.
Properties of Laplacian Matrix
For the perturbed graph:

Eigenvalues of L:

  Λ = diag(0, 0.44, 1, 1, 3, 4.56)

Eigenvectors of L: the first column of V is the constant vector (all entries 0.41, for λ = 0); the second column, the eigenvector for λ = 0.44, is approximately (0.46, 0.46, 0.26, -0.26, -0.46, -0.46)^T. Its signs still separate the two clusters {v1, v2, v3} and {v4, v5, v6}; the remaining columns correspond to λ = 1, 1, 3, 4.56.
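The effect of the bridging edge on the spectrum can be reproduced directly (unit weights, nodes 0-indexed):

```python
import numpy as np

def laplacian_eigs(edge_list, n=6):
    W = np.zeros((n, n))
    for i, j in edge_list:
        W[i, j] = W[j, i] = 1.0
    return np.linalg.eigvalsh(np.diag(W.sum(axis=1)) - W)

# Without and with the bridging edge (2,3)
separated = laplacian_eigs([(0, 2), (1, 2), (3, 4), (3, 5)])
bridged = laplacian_eigs([(0, 2), (1, 2), (2, 3), (3, 4), (3, 5)])
print(np.round(separated, 2))  # two zero eigenvalues: two components
print(np.round(bridged, 2))    # one zero eigenvalue; second smallest ~0.44
```

The second smallest eigenvalue of the bridged graph is (5 - √17)/2 ≈ 0.44; the weaker the connection between the two groups, the closer it stays to 0.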
Properties of Laplacian Matrix
Eigenvalues of the graph Laplacian:
0, 0.5505, 0.5505, 3, 3, 3, 3, 5.4495, 5.4495

Eigenvectors of the Laplacian: the first column of V is the constant vector (all entries 0.33, for λ = 0); the eigenvectors for the repeated small eigenvalue 0.5505 take distinct values on the three groups of nodes.

Can be used to obtain 3 clusters.