TRANSCRIPT
CSE 881: Data Mining
Lecture 20: Cluster Validation
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
– Accuracy, precision, recall
For cluster analysis, the analogous question is: how do we evaluate the “goodness” of the resulting clusters?
But “clusters are in the eye of the beholder”!
Then why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
Clusters found in Random Data
[Figure: the same random points in the unit square shown four ways: the original random points, and the clusters found by K-means, DBSCAN, and complete link.]
Measures of Cluster Validity
Internal Index (Unsupervised): Used to measure the goodness of a clustering structure without respect to external information
– Sum of Squared Error (SSE)
External Index (Supervised): Used to measure the extent to which cluster labels match externally supplied class labels
– Entropy
Relative Index: Used to compare two different clusterings or clusters
– Often an external or internal index is used for this function, e.g., SSE or entropy
Unsupervised Cluster Validation
Cluster evaluation based on the proximity matrix
– Correlation between proximity and incidence matrices
– Visualize the proximity matrix
Cluster evaluation based on cohesion and separation
Measuring Cluster Validity Via Correlation
Two matrices
– Proximity matrix
– “Incidence” matrix
  One row and one column for each data point
  An entry is 1 if the associated pair of points belongs to the same cluster
  An entry is 0 if the associated pair of points belongs to different clusters
Compute the correlation between the proximity and incidence matrices
– Since the matrices are symmetric, only the correlation between the n(n−1)/2 entries above the diagonal needs to be calculated
High correlation (negative, when proximity is a distance) indicates that points that belong to the same cluster are close to each other
Not a good measure for some density- or contiguity-based clusters
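The computation above can be sketched in Python (an illustrative addition, not from the slides; the function name is hypothetical, and Euclidean distance is assumed as the proximity):

```python
import numpy as np

def proximity_incidence_correlation(X, labels):
    """Correlation between the proximity (distance) matrix and the
    cluster-incidence matrix, using only the n(n-1)/2 upper-triangle entries."""
    n = len(X)
    # Proximity matrix: pairwise Euclidean distances.
    prox = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # Incidence matrix: 1 if the pair shares a cluster, else 0.
    inc = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(n, k=1)  # entries above the diagonal
    return np.corrcoef(prox[iu], inc[iu])[0, 1]

# Two tight, well-separated clusters: small distances coincide with
# incidence 1, so the correlation is strongly negative (close to -1).
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
corr = proximity_incidence_correlation(X, labels)
```

Because proximity here is a distance, good clusterings give large negative correlations, matching the negative Corr values on the next slide.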
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.
[Figure: K-means clusterings of two data sets in the unit square; Corr = -0.9235 for the first data set and Corr = -0.5810 for the second.]
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually
[Figure: scatter plot of well-separated clusters and the corresponding 100×100 similarity matrix ordered by cluster label, showing a crisp block-diagonal structure (similarity scale 0 to 1).]
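The sorted similarity matrix can be built in a few lines (an illustrative sketch, not from the slides; the function name is hypothetical, and taking similarity as 1 minus the rescaled Euclidean distance is one common choice, assumed here):

```python
import numpy as np

def sorted_similarity_matrix(X, labels):
    """Similarity matrix with rows/columns ordered by cluster label.
    Similarity is 1 - distance rescaled to [0, 1] (an assumed convention)."""
    order = np.argsort(labels, kind="stable")  # group points by cluster
    Xs = X[order]
    d = np.sqrt(((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(-1))
    return 1 - d / d.max()

# With well-separated clusters the matrix is visibly block-diagonal:
# same-cluster entries near 1, cross-cluster entries near 0.
X = np.array([[0, 0], [0.1, 0.1], [5, 5], [5.1, 5.1]])
S = sorted_similarity_matrix(X, np.array([0, 0, 1, 1]))
```

Plotting `S` with an image viewer (e.g., `plt.imshow(S)`) reproduces the kind of block structure shown on the slide.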
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: 100×100 sorted similarity matrix and scatter plot for DBSCAN clusters found in random data; the diagonal blocks are much less distinct.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: 100×100 sorted similarity matrix and scatter plot for K-means clusters found in random data.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: 100×100 sorted similarity matrix and scatter plot for complete-link clusters found in random data.]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clusters (labeled 1–7) in a more complicated data set, and the corresponding sorted similarity matrix of about 3000 points.]
Internal Measures: SSE
Internal Index: Used to measure the goodness of a clustering structure without respect to external information
– SSE
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
[Figure: a scatter data set and its SSE-versus-K curve for K = 2 to 30.]
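Estimating the number of clusters from the SSE curve (the “elbow”) can be sketched with a minimal Lloyd's k-means (an illustrative addition, not the slides' implementation; the function name and the deterministic initialization are assumptions):

```python
import numpy as np

def kmeans_sse(X, k, n_iter=50):
    """SSE of a basic Lloyd's k-means run (deterministic spread-out init)."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        # assign each point to its nearest center
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
        # recompute each center as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
    return ((X - centers[labels]) ** 2).sum()

# Two obvious clusters: SSE collapses from K=1 to K=2, then flattens;
# the bend ("elbow") at K=2 estimates the number of clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
sse = [kmeans_sse(X, k) for k in (1, 2, 3)]
```

With this two-blob data, `sse[0]` is far larger than `sse[1]`, while `sse[2]` gives only a small further reduction.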
Internal Measures: SSE
SSE curve for a more complicated data set
[Figure: the data set with clusters labeled 1–7, and the SSE of clusters found using K-means.]
Unsupervised Cluster Validity Measure
More generally, given K clusters:
  overall validity = Σ_{i=1..K} wi · validity(Ci)
– validity(Ci): a function of cohesion, separation, or both
– a weight wi is associated with each cluster Ci
– For SSE: wi = 1 and validity(Ci) = Σ_{x ∈ Ci} (x − ci)²
Internal Measures: Cohesion and Separation
Cluster cohesion:
– Measures how closely related the objects in a cluster are
Cluster separation:
– Measures how distinct or well-separated a cluster is from other clusters
Graph-based versus Prototype-based Views
Graph-based View
Cluster cohesion: Measures how closely related the objects in a cluster are
  Cohesion(Ci) = Σ_{x ∈ Ci, y ∈ Ci} proximity(x, y)
Cluster separation: Measures how distinct or well-separated a cluster is from other clusters
  Separation(Ci, Cj) = Σ_{x ∈ Ci, y ∈ Cj} proximity(x, y)
Prototype-Based View
Cluster cohesion:
  Cohesion(Ci) = Σ_{x ∈ Ci} proximity(x, ci)
– Equivalent to SSE if proximity is the square of the Euclidean distance
Cluster separation (c is the overall mean):
  Separation(Ci, Cj) = proximity(ci, cj)
  Separation(Ci) = proximity(ci, c)
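The prototype-based measures are short to compute (an illustrative sketch, not from the slides; function names are hypothetical, and squared Euclidean distance is assumed as the proximity, so cohesion equals the cluster's SSE):

```python
import numpy as np

def prototype_cohesion(X, labels, i):
    """Prototype-based cohesion of cluster i: sum of squared distances
    of its points to the cluster centroid (equals the cluster SSE)."""
    pts = X[labels == i]
    c = pts.mean(axis=0)
    return ((pts - c) ** 2).sum()

def prototype_separation(X, labels, i, j):
    """Prototype-based separation: squared distance between centroids."""
    ci = X[labels == i].mean(axis=0)
    cj = X[labels == j].mean(axis=0)
    return ((ci - cj) ** 2).sum()

# Two clusters on a line: centroids at (1, 0) and (11, 0).
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
coh = prototype_cohesion(X, labels, 0)       # (0-1)^2 + (2-1)^2 = 2
sep = prototype_separation(X, labels, 0, 1)  # (1-11)^2 = 100
```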
Unsupervised Cluster Validity Measures
Prototype-based vs Graph-based Cohesion
For SSE and points in Euclidean space, it can be shown that the sum of pairwise squared distances between points in a cluster is equivalent to the SSE of the cluster:
  SSE(Ci) = Σ_{x ∈ Ci} (x − ci)² = (1 / (2 mi)) Σ_{x ∈ Ci} Σ_{y ∈ Ci} (x − y)²
Total Sum of Squares (TSS)
c: overall mean
ci: centroid of each cluster Ci
mi: number of points in cluster Ci
  TSS = Σ_x dist(x, c)²
  SSE = Σ_{i=1..k} Σ_{x ∈ Ci} dist(x, ci)²
  SSB = Σ_{i=1..k} mi · dist(ci, c)²
[Figure: points around three centroids c1, c2, c3, with overall mean c.]
Total Sum of Squares (TSS)
Example: points 1, 2, 4, 5 on a line; overall mean m = 3, cluster means m1 = 1.5, m2 = 4.5

K = 1 cluster:
  TSS = (1−3)² + (2−3)² + (4−3)² + (5−3)² = 10
  SSE = (1−3)² + (2−3)² + (4−3)² + (5−3)² = 10
  SSB = 4 × (3−3)² = 0

K = 2 clusters:
  TSS = (1−3)² + (2−3)² + (4−3)² + (5−3)² = 10
  SSE = (1−1.5)² + (2−1.5)² + (4−4.5)² + (5−4.5)² = 1
  SSB = 2 × (3−1.5)² + 2 × (4.5−3)² = 9

TSS = SSE + SSB
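The decomposition can be checked numerically on the slide's own example, points 1, 2, 4, 5 split into {1, 2} and {4, 5} (a minimal sketch; the variable names are ours):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
labels = np.array([0, 0, 1, 1])

c = x.mean()                      # overall mean = 3
TSS = ((x - c) ** 2).sum()        # total sum of squares
SSE = sum(((x[labels == j] - x[labels == j].mean()) ** 2).sum()
          for j in (0, 1))        # within-cluster scatter
SSB = sum((labels == j).sum() * (x[labels == j].mean() - c) ** 2
          for j in (0, 1))        # between-cluster scatter
print(TSS, SSE, SSB)              # 10.0 1.0 9.0
```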
Total Sum of Squares (TSS)
TSS = SSE + SSB
Given a data set, TSS is fixed
A clustering with large SSE has small SSB, while one with small SSE has large SSB
Goal is to minimize SSE and maximize SSB
Internal Measures: Silhouette Coefficient
The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings
For an individual point i:
– Calculate a = average distance of i to the points in its own cluster
– Calculate b = min (average distance of i to the points in another cluster), where the minimum is taken over the other clusters
– The silhouette coefficient for the point is then given by
  s = 1 − a/b if a < b (or s = b/a − 1 if a ≥ b, not the usual case)
– Typically between 0 and 1
– The closer to 1 the better
Can calculate the average silhouette width for a cluster or a clustering
[Figure: a point with its within-cluster distances a and its distances b to another cluster.]
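The per-point computation can be sketched as follows (an illustrative addition; the function name is hypothetical, and the code uses the standard equivalent form s = (b − a) / max(a, b), which reduces to 1 − a/b when a < b and to b/a − 1 otherwise):

```python
import numpy as np

def silhouette_point(X, labels, i):
    """Silhouette coefficient of point i under Euclidean distance."""
    d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
    same = (labels == labels[i])
    same[i] = False                           # exclude the point itself
    a = d[same].mean()                        # avg distance within own cluster
    b = min(d[labels == k].mean()             # nearest other cluster
            for k in set(labels) - {labels[i]})
    return (b - a) / max(a, b)

# Point 0 sits in a tight cluster far from the other one, so s is near 1.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_point(X, labels, 0)  # a = 1, b = 10.5
```

Averaging `silhouette_point` over all points gives the average silhouette width of a clustering.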
Unsupervised Evaluation of Hierarchical Clustering
Distance Matrix:
[Figure: single-link dendrogram over points 3, 6, 2, 5, 4, 1; merge heights range from about 0.05 to 0.2.]
Unsupervised Evaluation of Hierarchical Clustering
CPCC (Cophenetic Correlation Coefficient)
– Correlation between the original distance matrix and the cophenetic distance matrix
[Figure: single-link dendrogram (points 3, 6, 2, 5, 4, 1) and the cophenetic distance matrix for single link.]
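A minimal sketch of the CPCC for single link (illustrative only; the helper names are ours, and a real analysis would typically use library routines such as SciPy's `linkage`/`cophenet`). The cophenetic distance between two points is the merge height at which they first join the same cluster:

```python
import numpy as np
from itertools import combinations

def single_link_cophenetic(D):
    """Cophenetic distance matrix from single-link agglomerative clustering.
    D is a full symmetric distance matrix."""
    n = len(D)
    clusters = [{i} for i in range(n)]
    coph = np.zeros((n, n))
    while len(clusters) > 1:
        # merge the pair of clusters with the smallest single-link distance
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: min(D[p][q] for p in clusters[ab[0]]
                                              for q in clusters[ab[1]]))
        h = min(D[p][q] for p in clusters[a] for q in clusters[b])
        for p in clusters[a]:
            for q in clusters[b]:
                coph[p, q] = coph[q, p] = h   # record the merge height
        clusters[a] |= clusters[b]
        del clusters[b]
    return coph

def cpcc(D, coph):
    """Correlation between original and cophenetic distances."""
    iu = np.triu_indices(len(D), k=1)
    return np.corrcoef(np.asarray(D)[iu], coph[iu])[0, 1]

# Three points on a line at 0, 1, 5: {0},{1} merge at height 1,
# then join {2} at height 4.
D = np.array([[0, 1, 5], [1, 0, 4], [5, 4, 0]], dtype=float)
coph = single_link_cophenetic(D)
c = cpcc(D, coph)
```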
Unsupervised Evaluation of Hierarchical Clustering
[Figure: single-link dendrogram (points 3, 6, 2, 5, 4, 1; heights up to 0.2) and complete-link dendrogram (points 3, 6, 4, 1, 2, 5; heights up to 0.4) for the same data.]
Supervised Cluster Validation: Entropy and Purity
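The slide's table is not reproduced in the transcript; the standard definitions it relies on can be sketched as follows (an illustrative addition; function names are ours — purity of a cluster is the fraction of its points in the majority class, and entropy is that of the cluster's class distribution):

```python
import numpy as np

def cluster_purity(labels, classes, i):
    """Purity of cluster i: fraction of its points in its majority class."""
    cls = np.asarray(classes)[np.asarray(labels) == i]
    counts = np.unique(cls, return_counts=True)[1]
    return counts.max() / counts.sum()

def cluster_entropy(labels, classes, i):
    """Entropy (bits) of the class distribution within cluster i."""
    cls = np.asarray(classes)[np.asarray(labels) == i]
    p = np.unique(cls, return_counts=True)[1] / len(cls)
    return -(p * np.log2(p)).sum()

# A cluster with 3 points of class 1 and 1 point of class 2.
pur = cluster_purity([0, 0, 0, 0], [1, 1, 1, 2], 0)   # 3/4
ent = cluster_entropy([0, 0, 0, 0], [1, 1, 1, 2], 0)  # about 0.81 bits
```

High purity and low entropy both indicate that a cluster is dominated by a single class.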
Supervised Cluster Validation: Precision and Recall
Cluster i: mi1 points of class 1, mi2 points of class 2
Overall data: m1 points of class 1, m2 points of class 2
Precision for cluster i w.r.t. class j = mij / Σk mik
Recall for cluster i w.r.t. class j = mij / mj
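These two ratios translate directly to code (an illustrative sketch; the function name is hypothetical):

```python
import numpy as np

def cluster_precision_recall(labels, classes, i, j):
    """Precision and recall of cluster i with respect to class j.
    m_ij = points of class j in cluster i; precision divides by the
    cluster size, recall by the class size."""
    in_cluster = (np.asarray(labels) == i)
    in_class = (np.asarray(classes) == j)
    m_ij = (in_cluster & in_class).sum()
    return m_ij / in_cluster.sum(), m_ij / in_class.sum()

# Cluster 0 holds points {0, 1, 2}; class 1 holds points {0, 1}.
p, r = cluster_precision_recall([0, 0, 0, 1, 1], [1, 1, 2, 2, 2], 0, 1)
# precision = 2/3 (two of the cluster's three points are class 1)
# recall    = 1.0 (the cluster captures all of class 1)
```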
Supervised Cluster Validation: Hierarchical Clustering
Hierarchical F-measure:
Framework for Cluster Validity
Need a framework to interpret any measure
– For example, if our measure of evaluation has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity
– The more “atypical” a clustering result is, the more likely it represents valid structure in the data
– Can compare the values of an index that result from random data or random clusterings to those of a clustering result
  If the value of the index is unlikely under randomness, then the cluster results are valid
For comparing the results of two different sets of cluster analyses, a framework is less necessary
– However, there is the question of whether the difference between two index values is significant
Statistical Framework for SSE
Example: 2-d data with 100 points
– Suppose a clustering algorithm produces SSE = 0.005
– Does it mean that the clusters are statistically significant?
[Figure: scatter plot of the 100 points in the unit square.]
Statistical Framework for SSE
– Generate 500 sets of random data points of size 100, distributed over the range 0.2–0.8 for x and y values
– Perform clustering with k = 3 clusters for each data set
– Plot the histogram of SSE and compare with the observed value 0.005
[Figure: histogram of the 500 SSE values (roughly 0.016 to 0.034, all far above 0.005) and a scatter plot of one random data set.]
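The Monte Carlo procedure above can be sketched as follows (illustrative only; the k-means helper is a minimal Lloyd's implementation, not the slides' code, and the absolute SSE scale depends on such details, so the values need not match the slide's histogram exactly):

```python
import numpy as np

def kmeans_sse(X, k, n_iter=20):
    """Basic Lloyd's k-means, returning the final SSE."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
    return ((X - centers[labels]) ** 2).sum()

# Reference distribution: SSE of k=3 clusterings of 500 random data sets
# (100 points uniform on [0.2, 0.8]^2, as on the slide).
rng = np.random.default_rng(0)
sses = [kmeans_sse(rng.uniform(0.2, 0.8, (100, 2)), k=3) for _ in range(500)]
# An observed SSE of 0.005 lies far below every value in this reference
# distribution, so such a clustering is unlikely under structureless data.
```

A histogram of `sses` plays the role of the null distribution against which the observed SSE is judged.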
Statistical Framework for Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: two data sets in the unit square; Corr = -0.9235 (statistically significant) for the first, Corr = -0.5810 (not statistically significant) for the second.]
Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
(Jain and Dubes, Algorithms for Clustering Data)