CSE 881: Data Mining
Lecture 20: Cluster Validation

Page 1: CSE 881: Data Mining, Lecture 20: Cluster Validation

Page 2: Cluster Validity

For supervised classification, we have a variety of measures to evaluate how good our model is:
– Accuracy, precision, recall

For cluster analysis, the analogous question is: how do we evaluate the “goodness” of the resulting clusters?

But “clusters are in the eye of the beholder”!

Then why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters

Page 3: Clusters found in Random Data

[Figure: four scatter plots (x vs y, range 0 to 1) of the same random points: the raw points and the clusters found by K-means, DBSCAN, and complete link]

Page 4: Measures of Cluster Validity

Internal index (unsupervised): measures the goodness of a clustering structure without respect to external information.
– Example: Sum of Squared Error (SSE)

External index (supervised): measures the extent to which cluster labels match externally supplied class labels.
– Example: Entropy

Relative index: compares two different clusterings or clusters.
– Often an external or internal index is used for this purpose, e.g., SSE or entropy.

Page 5: Unsupervised Cluster Validation

Cluster evaluation based on the proximity matrix:
– Correlation between the proximity and incidence matrices
– Visual inspection of the proximity matrix

Cluster evaluation based on cohesion and separation

Page 6: Measuring Cluster Validity via Correlation

Two matrices:
– Proximity matrix
– “Incidence” matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 if the pair belongs to different clusters.

Compute the correlation between the proximity and incidence matrices.
– Since the matrices are symmetric, only the n(n-1)/2 entries above the diagonal need to be considered.

High correlation indicates that points belonging to the same cluster are close to each other.

This is not a good measure for some density-based or contiguity-based clusters.
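The computation can be sketched in a few lines of NumPy. This is a minimal sketch rather than code from the lecture: the function name and the use of Euclidean distance as the proximity are assumptions. Because distance (not similarity) is used as the proximity here, a strongly negative correlation indicates a good clustering.

```python
import numpy as np

def cluster_validity_correlation(X, labels):
    # Proximity matrix: Euclidean distance between every pair of points.
    diff = X[:, None, :] - X[None, :, :]
    prox = np.sqrt((diff ** 2).sum(axis=-1))
    # Incidence matrix: 1 if the pair shares a cluster label, else 0.
    inc = (labels[:, None] == labels[None, :]).astype(float)
    # The matrices are symmetric, so only the n(n-1)/2 entries above
    # the diagonal are needed.
    iu = np.triu_indices(len(X), k=1)
    return np.corrcoef(prox[iu], inc[iu])[0, 1]

# Two tight, well-separated clusters: distance and incidence should be
# strongly negatively correlated (same cluster -> small distance).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.05, (20, 2)), rng.normal(5, 0.05, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
corr = cluster_validity_correlation(X, labels)
```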

Page 7: Measuring Cluster Validity via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.

[Figure: two scatter plots (x vs y, range 0 to 1) of the two data sets]

Corr = -0.9235 Corr = -0.5810

Page 8: Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect it visually.

[Figure: scatter plot of the data (x vs y, range 0 to 1) and its cluster-ordered similarity matrix over the 100 points (similarity scale 0 to 1), showing a clear block-diagonal structure]
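The reordering step itself is simple. Below is an assumed NumPy sketch (not code from the lecture); the `1 - d / d.max()` similarity transform is one simple choice, not necessarily the one used to produce the figures. A good clustering shows up as bright blocks on the diagonal of the sorted matrix.

```python
import numpy as np

def sorted_similarity_matrix(X, labels):
    # Pairwise Euclidean distances, rescaled into a [0, 1] similarity.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    sim = 1.0 - dist / dist.max()
    # Reorder rows and columns so points in the same cluster are adjacent.
    order = np.argsort(labels, kind="stable")
    return sim[np.ix_(order, order)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (15, 2)), rng.normal(3, 0.1, (15, 2))])
labels = np.array([0] * 15 + [1] * 15)
perm = rng.permutation(30)          # shuffle so the clusters are interleaved
X, labels = X[perm], labels[perm]

S = sorted_similarity_matrix(X, labels)
# After sorting, the first 15 rows/columns belong to one cluster.
within = (S[:15, :15].mean() + S[15:, 15:].mean()) / 2
between = S[:15, 15:].mean()
```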

Page 9: Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: cluster-ordered similarity matrix (100 points, similarity scale 0 to 1) and scatter plot for DBSCAN on random data]

Page 10: Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: cluster-ordered similarity matrix (100 points, similarity scale 0 to 1) and scatter plot for K-means on random data]

Page 11: Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: cluster-ordered similarity matrix (100 points, similarity scale 0 to 1) and scatter plot for complete link on random data]

Page 12: Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clustering of a more complex data set (clusters labeled 1 through 7) and its cluster-ordered similarity matrix (about 3000 points, similarity scale 0 to 1)]

Page 13: Internal Measures: SSE

Internal index: measures the goodness of a clustering structure without respect to external information.
– Example: SSE

SSE is good for comparing two clusterings or two clusters (average SSE).

SSE can also be used to estimate the number of clusters.

[Figure: a data set of well-separated clusters and the corresponding SSE-vs-K curve (K = 2 to 30)]
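Using SSE to estimate the number of clusters can be sketched as follows. This is a minimal Lloyd's-algorithm implementation written just for this example (the lecture does not specify an implementation); on well-separated blobs, the SSE drops sharply up to the true number of clusters and only marginally afterwards, producing the familiar elbow.

```python
import numpy as np

def kmeans_sse(X, k, n_restarts=5, n_iter=40):
    """Plain Lloyd's algorithm with random restarts; returns the best
    (lowest) final SSE found for k clusters."""
    best = np.inf
    rng = np.random.default_rng(0)
    for _ in range(n_restarts):
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
            assign = d2.argmin(axis=1)
            for j in range(k):
                if np.any(assign == j):
                    centers[j] = X[assign == j].mean(axis=0)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        best = min(best, d2.min(axis=1).sum())
    return best

# Three well-separated blobs: the SSE curve should show an elbow at k = 3.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.1, (30, 2)) for c in (0.0, 3.0, 6.0)])
sse = {k: kmeans_sse(X, k) for k in (1, 2, 3, 4)}
```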

Page 14: Internal Measures: SSE

SSE curve for a more complicated data set.

[Figure: the seven-cluster data set (clusters labeled 1 through 7) and the SSE of the clusters found using K-means]

Page 15: Unsupervised Cluster Validity Measure

More generally, given K clusters:

overall validity = \sum_{i=1}^{K} w_i \,\mathrm{validity}(C_i)

– validity(C_i): a function of cohesion, separation, or both
– w_i: a weight associated with each cluster C_i
– For SSE: w_i = 1 and \mathrm{validity}(C_i) = \sum_{x \in C_i} \mathrm{dist}(x, c_i)^2

Page 16: Internal Measures: Cohesion and Separation

Cluster cohesion: measures how closely related the objects in a cluster are.

Cluster separation: measures how distinct or well separated a cluster is from other clusters.

Page 17: Graph-based versus Prototype-based Views

Page 18: Graph-based View

Cluster cohesion: measures how closely related the objects in a cluster are.

Cluster separation: measures how distinct or well separated a cluster is from other clusters.

\mathrm{Cohesion}(C_i) = \sum_{x \in C_i,\, y \in C_i} \mathrm{proximity}(x, y)

\mathrm{Separation}(C_i, C_j) = \sum_{x \in C_i,\, y \in C_j} \mathrm{proximity}(x, y)
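The two graph-based sums can be sketched in NumPy. The similarity function 1/(1 + d) is an assumed choice of proximity (any similarity would do); with a similarity as the proximity, large cohesion and small separation are good.

```python
import numpy as np

def pairwise_similarity(X):
    # One simple proximity: similarity = 1 / (1 + Euclidean distance).
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    return 1.0 / (1.0 + d)

def cohesion(S, labels, i):
    # Graph-based cohesion: sum of proximities over pairs inside C_i.
    idx = np.where(labels == i)[0]
    return S[np.ix_(idx, idx)].sum()

def separation(S, labels, i, j):
    # Graph-based separation: sum of proximities across C_i and C_j.
    return S[np.ix_(np.where(labels == i)[0],
                    np.where(labels == j)[0])].sum()

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(4, 0.1, (10, 2))])
labels = np.array([0] * 10 + [1] * 10)
S = pairwise_similarity(X)
coh0 = cohesion(S, labels, 0)        # large: similar points inside C_0
sep01 = separation(S, labels, 0, 1)  # small: the clusters are far apart
```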

Page 19: Prototype-Based View

Cluster cohesion:

\mathrm{Cohesion}(C_i) = \sum_{x \in C_i} \mathrm{proximity}(x, c_i)

– Equivalent to SSE if proximity is the square of the Euclidean distance

Cluster separation (c_i, c_j are cluster centroids; c is the overall centroid):

\mathrm{Separation}(C_i, C_j) = \mathrm{proximity}(c_i, c_j)
\mathrm{Separation}(C_i) = \mathrm{proximity}(c_i, c)

Page 20: Unsupervised Cluster Validity Measures

Page 21: Prototype-based vs Graph-based Cohesion

For SSE and points in Euclidean space, it can be shown that the SSE of a cluster is equivalent to the (suitably scaled) sum of squared pairwise distances between points in the cluster:

\sum_{x \in C_i} \lVert x - c_i \rVert^2 = \frac{1}{2 m_i} \sum_{x \in C_i} \sum_{y \in C_i} \lVert x - y \rVert^2
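This equivalence between the prototype-based and graph-based views is easy to check numerically (a sketch; the identity verified is SSE = (1/2m) Σₓ Σ_y ‖x − y‖² for a single cluster):

```python
import numpy as np

rng = np.random.default_rng(4)
C = rng.normal(size=(25, 3))        # one cluster of 25 points in 3-d
m = len(C)

# Prototype-based cohesion: SSE about the centroid.
sse = ((C - C.mean(axis=0)) ** 2).sum()

# Graph-based cohesion: sum of squared pairwise distances,
# divided by twice the cluster size.
pair_sq = ((C[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
graph = pair_sq.sum() / (2 * m)
```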

Page 22: Total Sum of Squares (TSS)

c: overall mean
c_i: centroid of cluster C_i
m_i: number of points in cluster C_i

TSS = \sum_{x} \mathrm{dist}(x, c)^2
SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(x, c_i)^2
SSB = \sum_{i=1}^{k} m_i \, \mathrm{dist}(c_i, c)^2

[Figure: the overall mean c and the cluster centroids c_1, c_2, c_3]

Page 23: Total Sum of Squares (TSS)

Example: points 1, 2, 4, 5 on a line; overall mean m = 3.

K = 1 cluster:
TSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
SSE = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
SSB = 4 (3-3)^2 = 0

K = 2 clusters, {1, 2} with centroid m_1 = 1.5 and {4, 5} with centroid m_2 = 4.5:
TSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
SSE = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
SSB = 2 (3-1.5)^2 + 2 (4.5-3)^2 = 9

In both cases, TSS = SSE + SSB.
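The arithmetic of this worked example can be reproduced directly in plain Python, confirming TSS = SSE + SSB for both partitions:

```python
# The slide's example: points 1, 2, 4, 5 with overall mean 3.
points = [1, 2, 4, 5]
m = sum(points) / len(points)                       # overall mean = 3.0

def sse_of(cluster):
    # SSE of one cluster about its own centroid.
    c = sum(cluster) / len(cluster)
    return sum((x - c) ** 2 for x in cluster)

tss = sum((x - m) ** 2 for x in points)             # fixed for the data

# K = 1: one cluster containing everything.
sse1 = sse_of(points)
ssb1 = len(points) * (m - m) ** 2

# K = 2: clusters {1, 2} and {4, 5}.
clusters = [[1, 2], [4, 5]]
sse2 = sum(sse_of(c) for c in clusters)
ssb2 = sum(len(c) * (sum(c) / len(c) - m) ** 2 for c in clusters)
```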

Page 24: Total Sum of Squares (TSS)

TSS = SSE + SSB

Given a data set, TSS is fixed: a clustering with large SSE has small SSB, while one with small SSE has large SSB.

The goal is to minimize SSE and, equivalently, maximize SSB.

Page 25: Internal Measures: Silhouette Coefficient

The silhouette coefficient combines the ideas of cohesion and separation, but for individual points, as well as for clusters and clusterings.

For an individual point i:
– Calculate a = average distance of i to the points in its own cluster
– Calculate b = min (average distance of i to the points in another cluster)
– The silhouette coefficient for the point is then

s = 1 - a/b if a < b (or s = b/a - 1 if a >= b, not the usual case)

– s is typically between 0 and 1; the closer to 1, the better.

The average silhouette width can be computed for a cluster or for an entire clustering.
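A direct O(n²) implementation of the per-point silhouette (an assumed sketch, not the lecture's code; note that both cases of the formula collapse to (b - a) / max(a, b)):

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette coefficients; assumes every cluster has
    at least two points."""
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i])
        same[i] = False                       # exclude the point itself
        a = d[i, same].mean()                 # cohesion term
        b = min(d[i, labels == L].mean()      # nearest other cluster
                for L in set(labels) - {labels[i]})
        s[i] = (b - a) / max(a, b)            # = 1 - a/b or b/a - 1
    return s

# Two tight, well-separated clusters: silhouettes close to 1.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.05, (12, 2)), rng.normal(3, 0.05, (12, 2))])
labels = np.array([0] * 12 + [1] * 12)
s = silhouette(X, labels)
```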

Page 26: Unsupervised Evaluation of Hierarchical Clustering

Distance matrix:

[Figure: single-link dendrogram over points 3, 6, 2, 5, 4, 1, with merge heights between 0 and 0.2]

Page 27: Unsupervised Evaluation of Hierarchical Clustering

CPCC (CoPhenetic Correlation Coefficient):
– Correlation between the original distance matrix and the cophenetic distance matrix
– The cophenetic distance between two points is the height at which they are first merged into the same cluster in the dendrogram

[Figure: single-link dendrogram and its cophenetic distance matrix]
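Assuming SciPy is available, CPCC can be computed with `scipy.cluster.hierarchy.cophenet`, which returns both the coefficient and the cophenetic distances:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Two well-separated blobs: the dendrogram should preserve the
# distance structure well, giving a CPCC close to 1.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(2, 0.1, (10, 2))])

dists = pdist(X)                      # original pairwise distances (condensed)
Z = linkage(dists, method="single")   # single-link dendrogram
cpcc, coph_dists = cophenet(Z, dists) # CPCC and cophenetic distances
```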

Page 28: Unsupervised Evaluation of Hierarchical Clustering

[Figure: single-link dendrogram (leaves 3, 6, 2, 5, 4, 1; merge heights up to 0.2) and complete-link dendrogram (leaves 3, 6, 4, 1, 2, 5; merge heights up to 0.4) for the same data]

Page 29: Supervised Cluster Validation: Entropy and Purity
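The entropy and purity of a clustering can be sketched as below (an assumed implementation using the standard definitions: per-cluster values from the class distribution inside each cluster, combined as a weighted average by cluster size):

```python
import numpy as np

def entropy_and_purity(labels, classes):
    """Overall entropy and purity of a clustering, weighted by
    cluster size. Lower entropy and higher purity are better."""
    n = len(labels)
    total_entropy = total_purity = 0.0
    for c in np.unique(labels):
        members = classes[labels == c]
        m_i = len(members)
        _, counts = np.unique(members, return_counts=True)
        p = counts / m_i                      # class distribution in cluster c
        ent = -(p * np.log2(p)).sum()         # entropy of cluster c
        total_entropy += (m_i / n) * ent
        total_purity += (m_i / n) * p.max()   # purity = max_j p_ij
    return total_entropy, total_purity

# A perfectly pure clustering: entropy 0, purity 1.
labels  = np.array([0, 0, 0, 1, 1, 1])
classes = np.array(['a', 'a', 'a', 'b', 'b', 'b'])
ent, pur = entropy_and_purity(labels, classes)
```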

Page 30: Supervised Cluster Validation: Precision and Recall

Let m_{ij} be the number of points of class j in cluster i.

Precision of cluster i w.r.t. class j = \frac{m_{ij}}{\sum_k m_{ik}}

Recall of cluster i w.r.t. class j = \frac{m_{ij}}{\sum_k m_{kj}} = \frac{m_{ij}}{m_j}

(Cluster i contains m_{i1} points of class 1 and m_{i2} points of class 2; the overall data contains m_1 points of class 1 and m_2 points of class 2.)
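These definitions translate directly to code (a sketch; the function name and the small example are mine):

```python
import numpy as np

def precision_recall(labels, classes, i, j):
    """Precision and recall of cluster i with respect to class j,
    based on the counts m_ij from the contingency of labels vs classes."""
    in_cluster = (labels == i)
    in_class = (classes == j)
    m_ij = np.sum(in_cluster & in_class)
    precision = m_ij / in_cluster.sum()   # m_ij / m_i (cluster size)
    recall = m_ij / in_class.sum()        # m_ij / m_j (class size)
    return precision, recall

# Cluster 0 holds three points of class 1 and one of class 2;
# all three class-1 points land in cluster 0.
labels  = np.array([0, 0, 0, 0, 1, 1])
classes = np.array([1, 1, 1, 2, 2, 2])
p, r = precision_recall(labels, classes, 0, 1)
```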

Page 31: Supervised Cluster Validation: Hierarchical Clustering

Hierarchical F-measure:

Page 32: Framework for Cluster Validity

We need a framework to interpret any measure.
– For example, if our evaluation measure has the value 10, is that good, fair, or poor?

Statistics provide a framework for cluster validity.
– The more “atypical” a clustering result is, the more likely it represents valid structure in the data.
– We can compare the value of an index for a clustering result to the distribution of values obtained from random data or random clusterings.
– If the observed value of the index is unlikely under that distribution, the cluster results are valid.

For comparing the results of two different sets of cluster analyses, a framework is less necessary.
– However, there is still the question of whether the difference between two index values is significant.

Page 33: Statistical Framework for SSE

Example: 2-d data with 100 points.
– Suppose a clustering algorithm produces SSE = 0.005.
– Does this mean that the clusters are statistically significant?

[Figure: scatter plot of the 100 points (x, y in the range 0 to 1)]

Page 34: Statistical Framework for SSE

– Generate 500 sets of 100 random data points, distributed over the range 0.2 – 0.8 for both x and y.
– Cluster each random data set with K-means, k = 3.
– Plot the histogram of the resulting SSE values and compare it with the observed value of 0.005.

[Figure: histogram of SSE over the 500 random data sets (values roughly between 0.016 and 0.034, all far above 0.005) and one sample random data set]
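The Monte Carlo procedure above can be sketched as follows. This is an assumed implementation, not the lecture's: it uses a plain Lloyd's algorithm, uniform random data, and fewer trials than the slide's 500 to keep the sketch fast. The resulting empirical p-value for SSE = 0.005 should be zero, i.e., no random data set clusters that well.

```python
import numpy as np

def kmeans_sse(X, k, rng, n_iter=30):
    """Plain Lloyd's algorithm; returns the final SSE for k clusters."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).sum()

observed_sse = 0.005                  # the value from the slide's example
rng = np.random.default_rng(7)
# Reference distribution: SSE of k-means (k = 3) on random data sets of
# 100 points uniform over [0.2, 0.8] x [0.2, 0.8].
random_sses = [kmeans_sse(rng.uniform(0.2, 0.8, (100, 2)), 3, rng)
               for _ in range(50)]
# Empirical p-value: fraction of random data sets with SSE at least as
# small as the observed value.
p_value = np.mean([s <= observed_sse for s in random_sses])
```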

Page 35: Statistical Framework for Correlation

Correlation of the incidence and proximity matrices for the K-means clusterings of the two data sets shown earlier.

[Figure: the two scatter plots (x vs y, range 0 to 1)]

Corr = -0.9235 (statistically significant)
Corr = -0.5810 (not statistically significant)

Page 36: Final Comment on Cluster Validity

“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”

(Jain and Dubes, Algorithms for Clustering Data)