TRANSCRIPT
CSE 881: Data Mining
Lecture 20: Cluster Validation
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
– Accuracy, precision, recall
For cluster analysis, the analogous question is: how do we evaluate the “goodness” of the resulting clusters?
But “clusters are in the eye of the beholder”!
Then why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
Clusters found in Random Data
[Figure: the same random points in the unit square shown four ways: the original random points, and the clusters found by K-means, DBSCAN, and complete link.]
Measures of Cluster Validity
Internal Index (Unsupervised): Used to measure the goodness of a clustering structure without respect to external information
– Sum of Squared Error (SSE)
External Index (Supervised): Used to measure the extent to which cluster labels match externally supplied class labels
– Entropy
Relative Index: Used to compare two different clusterings or clusters
– Often an external or internal index is used for this function, e.g., SSE or entropy
Unsupervised Cluster Validation
Cluster evaluation based on the proximity matrix
– Correlation between proximity and incidence matrices
– Visualize the proximity matrix
Cluster evaluation based on cohesion and separation
Measuring Cluster Validity Via Correlation
Two matrices
– Proximity matrix
– “Incidence” matrix
  One row and one column for each data point
  An entry is 1 if the associated pair of points belongs to the same cluster
  An entry is 0 if the associated pair of points belongs to different clusters
Compute the correlation between the proximity and incidence matrices
– Since the matrices are symmetric, only the correlation between the n(n−1)/2 entries above the diagonal needs to be calculated
High correlation (negative, when proximity is a distance) indicates that points that belong to the same cluster are close to each other
Not a good measure for some density- or contiguity-based clusters
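The computation above can be sketched in Python (an illustrative addition, not from the slides; the function name is hypothetical, and Euclidean distance is assumed as the proximity):

```python
import numpy as np

def proximity_incidence_correlation(X, labels):
    """Correlation between the proximity (distance) matrix and the
    cluster-incidence matrix, using only the n(n-1)/2 upper-triangle entries."""
    n = len(X)
    # Proximity matrix: pairwise Euclidean distances.
    prox = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # Incidence matrix: 1 if the pair shares a cluster, else 0.
    inc = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(n, k=1)  # entries above the diagonal
    return np.corrcoef(prox[iu], inc[iu])[0, 1]

# Two tight, well-separated clusters: small distances coincide with
# incidence 1, so the correlation is strongly negative (close to -1).
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
corr = proximity_incidence_correlation(X, labels)
```

Because proximity here is a distance, good clusterings give large negative correlations, matching the negative Corr values on the next slide.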
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.
[Figure: K-means clusterings of two data sets in the unit square; Corr = -0.9235 for the first data set and Corr = -0.5810 for the second.]
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually
[Figure: scatter plot of well-separated clusters and the corresponding 100×100 similarity matrix ordered by cluster label, showing a crisp block-diagonal structure (similarity scale 0 to 1).]
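The sorted similarity matrix can be built in a few lines (an illustrative sketch, not from the slides; the function name is hypothetical, and taking similarity as 1 minus the rescaled Euclidean distance is one common choice, assumed here):

```python
import numpy as np

def sorted_similarity_matrix(X, labels):
    """Similarity matrix with rows/columns ordered by cluster label.
    Similarity is 1 - distance rescaled to [0, 1] (an assumed convention)."""
    order = np.argsort(labels, kind="stable")  # group points by cluster
    Xs = X[order]
    d = np.sqrt(((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(-1))
    return 1 - d / d.max()

# With well-separated clusters the matrix is visibly block-diagonal:
# same-cluster entries near 1, cross-cluster entries near 0.
X = np.array([[0, 0], [0.1, 0.1], [5, 5], [5.1, 5.1]])
S = sorted_similarity_matrix(X, np.array([0, 0, 1, 1]))
```

Plotting `S` with an image viewer (e.g., `plt.imshow(S)`) reproduces the kind of block structure shown on the slide.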
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: 100×100 sorted similarity matrix and scatter plot for DBSCAN clusters found in random data; the diagonal blocks are much less distinct.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: 100×100 sorted similarity matrix and scatter plot for K-means clusters found in random data.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: 100×100 sorted similarity matrix and scatter plot for complete-link clusters found in random data.]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clusters (labeled 1–7) in a more complicated data set, and the corresponding sorted similarity matrix of about 3000 points.]
Internal Measures: SSE
Internal Index: Used to measure the goodness of a clustering structure without respect to external information
– SSE
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
[Figure: a scatter data set and its SSE-versus-K curve for K = 2 to 30.]
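Estimating the number of clusters from the SSE curve (the “elbow”) can be sketched with a minimal Lloyd's k-means (an illustrative addition, not the slides' implementation; the function name and the deterministic initialization are assumptions):

```python
import numpy as np

def kmeans_sse(X, k, n_iter=50):
    """SSE of a basic Lloyd's k-means run (deterministic spread-out init)."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        # assign each point to its nearest center
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
        # recompute each center as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
    return ((X - centers[labels]) ** 2).sum()

# Two obvious clusters: SSE collapses from K=1 to K=2, then flattens;
# the bend ("elbow") at K=2 estimates the number of clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
sse = [kmeans_sse(X, k) for k in (1, 2, 3)]
```

With this two-blob data, `sse[0]` is far larger than `sse[1]`, while `sse[2]` gives only a small further reduction.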
Internal Measures: SSE
SSE curve for a more complicated data set
[Figure: the data set with clusters labeled 1–7, and the SSE of clusters found using K-means.]
Unsupervised Cluster Validity Measure
More generally, given K clusters:
  overall validity = Σ_{i=1..K} wi · validity(Ci)
– validity(Ci): a function of cohesion, separation, or both
– a weight wi is associated with each cluster Ci
– For SSE: wi = 1 and validity(Ci) = Σ_{x ∈ Ci} (x − ci)²
Internal Measures: Cohesion and Separation
Cluster cohesion:
– Measures how closely related the objects in a cluster are
Cluster separation:
– Measures how distinct or well-separated a cluster is from other clusters
Graph-based versus Prototype-based Views
Graph-based View
Cluster cohesion: Measures how closely related the objects in a cluster are
  Cohesion(Ci) = Σ_{x ∈ Ci, y ∈ Ci} proximity(x, y)
Cluster separation: Measures how distinct or well-separated a cluster is from other clusters
  Separation(Ci, Cj) = Σ_{x ∈ Ci, y ∈ Cj} proximity(x, y)
Prototype-Based View
Cluster cohesion:
  Cohesion(Ci) = Σ_{x ∈ Ci} proximity(x, ci)
– Equivalent to SSE if proximity is the square of the Euclidean distance
Cluster separation (c is the overall mean):
  Separation(Ci, Cj) = proximity(ci, cj)
  Separation(Ci) = proximity(ci, c)
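The prototype-based measures are short to compute (an illustrative sketch, not from the slides; function names are hypothetical, and squared Euclidean distance is assumed as the proximity, so cohesion equals the cluster's SSE):

```python
import numpy as np

def prototype_cohesion(X, labels, i):
    """Prototype-based cohesion of cluster i: sum of squared distances
    of its points to the cluster centroid (equals the cluster SSE)."""
    pts = X[labels == i]
    c = pts.mean(axis=0)
    return ((pts - c) ** 2).sum()

def prototype_separation(X, labels, i, j):
    """Prototype-based separation: squared distance between centroids."""
    ci = X[labels == i].mean(axis=0)
    cj = X[labels == j].mean(axis=0)
    return ((ci - cj) ** 2).sum()

# Two clusters on a line: centroids at (1, 0) and (11, 0).
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
coh = prototype_cohesion(X, labels, 0)       # (0-1)^2 + (2-1)^2 = 2
sep = prototype_separation(X, labels, 0, 1)  # (1-11)^2 = 100
```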
Unsupervised Cluster Validity Measures
Prototype-based vs Graph-based Cohesion
For SSE and points in Euclidean space, it can be shown that the sum of pairwise squared distances between points in a cluster is equivalent to the SSE of the cluster:
  SSE(Ci) = Σ_{x ∈ Ci} (x − ci)² = (1 / (2 mi)) Σ_{x ∈ Ci} Σ_{y ∈ Ci} (x − y)²
Total Sum of Squares (TSS)
c: overall mean
ci: centroid of each cluster Ci
mi: number of points in cluster Ci
  TSS = Σ_x dist(x, c)²
  SSE = Σ_{i=1..k} Σ_{x ∈ Ci} dist(x, ci)²
  SSB = Σ_{i=1..k} mi · dist(ci, c)²
[Figure: points around three centroids c1, c2, c3, with overall mean c.]
Total Sum of Squares (TSS)
Example: points 1, 2, 4, 5 on a line; overall mean m = 3, cluster means m1 = 1.5, m2 = 4.5

K = 1 cluster:
  TSS = (1−3)² + (2−3)² + (4−3)² + (5−3)² = 10
  SSE = (1−3)² + (2−3)² + (4−3)² + (5−3)² = 10
  SSB = 4 × (3−3)² = 0

K = 2 clusters:
  TSS = (1−3)² + (2−3)² + (4−3)² + (5−3)² = 10
  SSE = (1−1.5)² + (2−1.5)² + (4−4.5)² + (5−4.5)² = 1
  SSB = 2 × (3−1.5)² + 2 × (4.5−3)² = 9

TSS = SSE + SSB
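The decomposition can be checked numerically on the slide's own example, points 1, 2, 4, 5 split into {1, 2} and {4, 5} (a minimal sketch; the variable names are ours):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
labels = np.array([0, 0, 1, 1])

c = x.mean()                      # overall mean = 3
TSS = ((x - c) ** 2).sum()        # total sum of squares
SSE = sum(((x[labels == j] - x[labels == j].mean()) ** 2).sum()
          for j in (0, 1))        # within-cluster scatter
SSB = sum((labels == j).sum() * (x[labels == j].mean() - c) ** 2
          for j in (0, 1))        # between-cluster scatter
print(TSS, SSE, SSB)              # 10.0 1.0 9.0
```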
Total Sum of Squares (TSS)
TSS = SSE + SSB
Given a data set, TSS is fixed
A clustering with large SSE has small SSB, while one with small SSE has large SSB
Goal is to minimize SSE and maximize SSB
Internal Measures: Silhouette Coefficient
The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings
For an individual point i:
– Calculate a = average distance of i to the points in its own cluster
– Calculate b = min (average distance of i to the points in another cluster), where the minimum is taken over the other clusters
– The silhouette coefficient for the point is then given by
  s = 1 − a/b if a < b (or s = b/a − 1 if a ≥ b, not the usual case)
– Typically between 0 and 1
– The closer to 1 the better
Can calculate the average silhouette width for a cluster or a clustering
[Figure: a point with its within-cluster distances a and its distances b to another cluster.]
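The per-point computation can be sketched as follows (an illustrative addition; the function name is hypothetical, and the code uses the standard equivalent form s = (b − a) / max(a, b), which reduces to 1 − a/b when a < b and to b/a − 1 otherwise):

```python
import numpy as np

def silhouette_point(X, labels, i):
    """Silhouette coefficient of point i under Euclidean distance."""
    d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
    same = (labels == labels[i])
    same[i] = False                           # exclude the point itself
    a = d[same].mean()                        # avg distance within own cluster
    b = min(d[labels == k].mean()             # nearest other cluster
            for k in set(labels) - {labels[i]})
    return (b - a) / max(a, b)

# Point 0 sits in a tight cluster far from the other one, so s is near 1.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_point(X, labels, 0)  # a = 1, b = 10.5
```

Averaging `silhouette_point` over all points gives the average silhouette width of a clustering.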
Unsupervised Evaluation of Hierarchical Clustering
Distance Matrix:
[Figure: single-link dendrogram over points 3, 6, 2, 5, 4, 1; merge heights range from about 0.05 to 0.2.]
Unsupervised Evaluation of Hierarchical Clustering
CPCC (Cophenetic Correlation Coefficient)
– Correlation between the original distance matrix and the cophenetic distance matrix
[Figure: single-link dendrogram (points 3, 6, 2, 5, 4, 1) and the cophenetic distance matrix for single link.]
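A minimal sketch of the CPCC for single link (illustrative only; the helper names are ours, and a real analysis would typically use library routines such as SciPy's `linkage`/`cophenet`). The cophenetic distance between two points is the merge height at which they first join the same cluster:

```python
import numpy as np
from itertools import combinations

def single_link_cophenetic(D):
    """Cophenetic distance matrix from single-link agglomerative clustering.
    D is a full symmetric distance matrix."""
    n = len(D)
    clusters = [{i} for i in range(n)]
    coph = np.zeros((n, n))
    while len(clusters) > 1:
        # merge the pair of clusters with the smallest single-link distance
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: min(D[p][q] for p in clusters[ab[0]]
                                              for q in clusters[ab[1]]))
        h = min(D[p][q] for p in clusters[a] for q in clusters[b])
        for p in clusters[a]:
            for q in clusters[b]:
                coph[p, q] = coph[q, p] = h   # record the merge height
        clusters[a] |= clusters[b]
        del clusters[b]
    return coph

def cpcc(D, coph):
    """Correlation between original and cophenetic distances."""
    iu = np.triu_indices(len(D), k=1)
    return np.corrcoef(np.asarray(D)[iu], coph[iu])[0, 1]

# Three points on a line at 0, 1, 5: {0},{1} merge at height 1,
# then join {2} at height 4.
D = np.array([[0, 1, 5], [1, 0, 4], [5, 4, 0]], dtype=float)
coph = single_link_cophenetic(D)
c = cpcc(D, coph)
```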
Unsupervised Evaluation of Hierarchical Clustering
[Figure: single-link dendrogram (points 3, 6, 2, 5, 4, 1; heights up to 0.2) and complete-link dendrogram (points 3, 6, 4, 1, 2, 5; heights up to 0.4) for the same data.]
Supervised Cluster Validation: Entropy and Purity
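The slide's table is not reproduced in the transcript; the standard definitions it relies on can be sketched as follows (an illustrative addition; function names are ours — purity of a cluster is the fraction of its points in the majority class, and entropy is that of the cluster's class distribution):

```python
import numpy as np

def cluster_purity(labels, classes, i):
    """Purity of cluster i: fraction of its points in its majority class."""
    cls = np.asarray(classes)[np.asarray(labels) == i]
    counts = np.unique(cls, return_counts=True)[1]
    return counts.max() / counts.sum()

def cluster_entropy(labels, classes, i):
    """Entropy (bits) of the class distribution within cluster i."""
    cls = np.asarray(classes)[np.asarray(labels) == i]
    p = np.unique(cls, return_counts=True)[1] / len(cls)
    return -(p * np.log2(p)).sum()

# A cluster with 3 points of class 1 and 1 point of class 2.
pur = cluster_purity([0, 0, 0, 0], [1, 1, 1, 2], 0)   # 3/4
ent = cluster_entropy([0, 0, 0, 0], [1, 1, 1, 2], 0)  # about 0.81 bits
```

High purity and low entropy both indicate that a cluster is dominated by a single class.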
Supervised Cluster Validation: Precision and Recall
Cluster i: mi1 points of class 1, mi2 points of class 2
Overall data: m1 points of class 1, m2 points of class 2
Precision for cluster i w.r.t. class j = mij / Σk mik
Recall for cluster i w.r.t. class j = mij / mj
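These two ratios translate directly to code (an illustrative sketch; the function name is hypothetical):

```python
import numpy as np

def cluster_precision_recall(labels, classes, i, j):
    """Precision and recall of cluster i with respect to class j.
    m_ij = points of class j in cluster i; precision divides by the
    cluster size, recall by the class size."""
    in_cluster = (np.asarray(labels) == i)
    in_class = (np.asarray(classes) == j)
    m_ij = (in_cluster & in_class).sum()
    return m_ij / in_cluster.sum(), m_ij / in_class.sum()

# Cluster 0 holds points {0, 1, 2}; class 1 holds points {0, 1}.
p, r = cluster_precision_recall([0, 0, 0, 1, 1], [1, 1, 2, 2, 2], 0, 1)
# precision = 2/3 (two of the cluster's three points are class 1)
# recall    = 1.0 (the cluster captures all of class 1)
```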
Supervised Cluster Validation: Hierarchical Clustering
Hierarchical F-measure:
Framework for Cluster Validity
Need a framework to interpret any measure
– For example, if our measure of evaluation has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity
– The more “atypical” a clustering result is, the more likely it represents valid structure in the data
– Can compare the values of an index that result from random data or random clusterings to those of a clustering result
  If the value of the index is unlikely under randomness, then the cluster results are valid
For comparing the results of two different sets of cluster analyses, a framework is less necessary
– However, there is the question of whether the difference between two index values is significant
Statistical Framework for SSE
Example: 2-d data with 100 points
– Suppose a clustering algorithm produces SSE = 0.005
– Does it mean that the clusters are statistically significant?
[Figure: scatter plot of the 100 points in the unit square.]
Statistical Framework for SSE
– Generate 500 sets of random data points of size 100, distributed over the range 0.2–0.8 for x and y values
– Perform clustering with k = 3 clusters for each data set
– Plot the histogram of SSE and compare with the observed value 0.005
[Figure: histogram of the 500 SSE values (roughly 0.016 to 0.034, all far above 0.005) and a scatter plot of one random data set.]
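The Monte Carlo procedure above can be sketched as follows (illustrative only; the k-means helper is a minimal Lloyd's implementation, not the slides' code, and the absolute SSE scale depends on such details, so the values need not match the slide's histogram exactly):

```python
import numpy as np

def kmeans_sse(X, k, n_iter=20):
    """Basic Lloyd's k-means, returning the final SSE."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
    return ((X - centers[labels]) ** 2).sum()

# Reference distribution: SSE of k=3 clusterings of 500 random data sets
# (100 points uniform on [0.2, 0.8]^2, as on the slide).
rng = np.random.default_rng(0)
sses = [kmeans_sse(rng.uniform(0.2, 0.8, (100, 2)), k=3) for _ in range(500)]
# An observed SSE of 0.005 lies far below every value in this reference
# distribution, so such a clustering is unlikely under structureless data.
```

A histogram of `sses` plays the role of the null distribution against which the observed SSE is judged.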
Statistical Framework for Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: two data sets in the unit square; Corr = -0.9235 (statistically significant) for the first, Corr = -0.5810 (not statistically significant) for the second.]
Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
(Jain and Dubes, Algorithms for Clustering Data)