Data Clustering 2 – K Means contd & Hierarchical Methods
Data Clustering – An Introduction

TRANSCRIPT

Page 1:

Data Clustering 2 – K Means contd & Hierarchical Methods

Page 2:

Recap: K-Means Clustering

1. Place K points into the feature space. These points represent initial cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the assignments do not change.
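As a rough illustration (not from the slides), a minimal NumPy sketch of these four steps; the function and variable names are just for this example:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Basic K-Means sketch: X is an (n, d) array of patterns, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: place K points into the feature space as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignment = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each pattern to the closest cluster centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # Step 4: stop once the assignments do not change
        if np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
        # Step 3: recalculate the positions of the K centroids
        for j in range(k):
            if np.any(assignment == j):
                centroids[j] = X[assignment == j].mean(axis=0)
    return centroids, assignment
```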

Page 3:

K-Means Clustering

Interactive Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Page 4:

Discussions (4)

4. Intuitively, what is the ideal partition of the following problem? Can the K-means clustering algorithm give a satisfactory answer to this problem?

Elongated clusters

Clusters with different variation

Page 5:

Limitations of K-Means

At each iteration of the K-Means clustering algorithm, a pattern can be assigned to one cluster only, i.e. the assignment is ‘hard’.

[Figure: two clusters in a 2-D feature space (x1, x2), Cluster 1 with centroid m1 and Cluster 2 with centroid m2, plus an extra pattern x midway between the two centroids]

Observe an extra pattern x: it lies in the middle of the two cluster centroids. So with the K-Means clustering algorithm, it will either (1) drag m1 down, or (2) drag m2 up.

But intuitively, it should have equal contributions to both clusters…

Page 6:

Limitations of K-Means (2)

Cannot effectively solve problems with elongated clusters or clusters with different variation.

Page 7:

Post Processing

After running a clustering algorithm, different procedures can be used to improve the final assignment:

Split clusters with the highest SSE

Merge clusters (e.g. those that are closest)

Introduce a new centroid (often point furthest from any cluster centre)
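As a hedged sketch of these heuristics (NumPy assumed; `X`, `centroids` and `assignment` follow the K-Means sketch earlier and are illustrative names): find the cluster with the highest SSE as a split candidate, and the point furthest from any cluster centre as a candidate new centroid.

```python
import numpy as np

def cluster_sse(X, centroids, assignment):
    """SSE of each cluster: sum of squared distances from its members to its centroid."""
    return np.array([((X[assignment == j] - c) ** 2).sum()
                     for j, c in enumerate(centroids)])

def post_processing_candidates(X, centroids, assignment):
    sse = cluster_sse(X, centroids, assignment)
    split_candidate = int(sse.argmax())                      # cluster with the highest SSE
    # distance from every point to its nearest cluster centre
    nearest = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).min(axis=1)
    new_centroid = X[nearest.argmax()]                       # point furthest from any centre
    return split_candidate, new_centroid
```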

Page 8:

Pros and Cons of KM

Advantages
May be computationally faster than hierarchical clustering (if K is small).
May produce tighter clusters than hierarchical clustering, especially if the clusters are globular.

Disadvantages
A fixed number of clusters can make it difficult to predict what K should be.
Different initial partitions can result in different final clusters.
Potential empty clusters (not always bad).
Does not work well with non-globular clusters.

Page 9:

Pros and Cons of KM

Page 10:

Outliers

SSE can be affected greatly by outliers.
An outlier is a piece of data that does not fit within the distribution of the rest of the data.
Outlier analysis is a large research topic in data mining.

Key question: Is the outlier noise? Or correct and “interesting”?
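A tiny, hypothetical illustration of the first point (NumPy assumed, made-up coordinates): adding one outlier to an otherwise tight cluster inflates its SSE dramatically.

```python
import numpy as np

def sse(points):
    """Sum of squared distances of the points to their mean (the cluster centre)."""
    return ((points - points.mean(axis=0)) ** 2).sum()

tight_cluster = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0]])
outlier = np.array([[9.0, 9.0]])

print(sse(tight_cluster))                          # small: the points sit close together
print(sse(np.vstack([tight_cluster, outlier])))    # far larger once the outlier is included
```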

Page 11:

Hierarchical (agglomerative) Clustering

Hierarchical clustering results in a series of clustering results.
The results start off with each object in its own cluster and end with all of the objects in the same cluster.
The intermediate clusters are created by a series of merges.
The resultant tree-like structure is called a dendrogram.

Page 12:

Dendrogram

Page 13:

The Hierarchical Clustering Algorithm


1) Each item is assigned to its own cluster (n clusters of size one)

2) Let the distances between the clusters equal the distances between the objects they contain

3) Find the closest pair of clusters and merge them into a single cluster (one less cluster)

4) Re-compute the distances between the new cluster and each of the old clusters

5) Repeat steps 3 and 4 until there is only one cluster left
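A minimal sketch of this procedure using SciPy's agglomerative clustering routines (SciPy is an assumption of this example, not something the slides prescribe):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.rand(20, 2)              # 20 items, each starting in its own cluster

# linkage() performs steps 1-5: it repeatedly merges the closest pair of clusters,
# re-computing inter-cluster distances after each merge, until one cluster remains.
Z = linkage(X, method='single')        # also 'complete', 'average', 'ward', ...

labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
# dendrogram(Z) draws the tree-like structure produced by the merges
```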

Page 14:

Re-computing Distances

Single: the smallest distance between any two points, one from each of the two clusters being compared/measured.
Average: the average distance between pairs.
Complete: the largest distance between any two points, one from each of the two clusters being compared/measured.

Other methods include Ward, McQuitty, Median and Centroid.

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
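To make the three linkage definitions concrete, a small sketch (NumPy assumed; `A` and `B` stand for the two clusters being compared) that derives each inter-cluster distance from the pairwise point distances:

```python
import numpy as np

def linkage_distances(A, B):
    """Inter-cluster distance between clusters A (m, d) and B (n, d) under three linkages."""
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return {
        'single':   pairwise.min(),   # smallest distance between any pair, one from each cluster
        'average':  pairwise.mean(),  # average distance over all pairs
        'complete': pairwise.max(),   # largest distance between any pair, one from each cluster
    }
```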

Page 15:

Re-computing Distances

Single Linkage

Complete Linkage

Average Linkage

Page 16:

Pros and Cons of HC

Advantages
Can produce an ordering of the objects, which may be informative for data display.
Smaller clusters are generated, which may be helpful for discovery.

Disadvantages
No provision can be made for a relocation of objects that may have been 'incorrectly' grouped at an early stage.
Use of different distance metrics for measuring distances between clusters may generate different results.

Page 17:

Clustering Gene Expression Data

Gene expression data often consists of thousands of genes observed over tens of conditions.
Clustering on expression level will help towards identifying the functionality of unknown genes.
Clustering can be used to reduce the dimensionality of the data, making it easier to model.

Page 19:

Other Clustering Methods: Fuzzy Clustering

For example: Fuzzy c-means.
In real applications there is often no sharp boundary between clusters, so fuzzy clustering is often better suited.
Fuzzy c-means is a fuzzification of K-Means and the most well-known fuzzy clustering method.

Page 20:

Fuzzy Clustering

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletFCM.html

Cluster membership is now a weight between zero and one

The distance to a centroid is multiplied by the membership weight
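A hedged sketch of the standard fuzzy c-means update rules (NumPy assumed; `m` is the usual fuzzifier parameter, which the slides do not mention, and the code is a generic formulation rather than the lecture's own):

```python
import numpy as np

def fuzzy_memberships(X, centroids, m=2.0):
    """Membership weight u[i, j] in [0, 1] of pattern i in cluster j; each row sums to 1."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)                      # avoid division by zero at a centroid
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fuzzy_centroids(X, u, m=2.0):
    """New centroids: means of all points, each point weighted by its membership raised to m."""
    w = u ** m
    return (w.T @ X) / w.sum(axis=0)[:, None]
```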

Page 21:

DBSCAN (from Ranjay Sankar notes, University of Florida)

• DBSCAN is a density-based clustering algorithm
• Density = number of points within a specified radius (Eps)
• A point is a core point if it has more than a specified number of points (MinPts) within Eps (a core point is in the interior of a cluster)

Page 22:

DBSCAN (from Ranjay Sankar notes, University of Florida)

• A border point has fewer than MinPts within Eps but is in the neighbourhood of a core point

• A noise point is any point that is neither a core point nor a border point
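A brief sketch using scikit-learn's DBSCAN implementation (scikit-learn is an assumption here; Eps and MinPts map onto its `eps` and `min_samples` parameters):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(100, 2)

db = DBSCAN(eps=0.1, min_samples=5).fit(X)    # Eps = 0.1, MinPts = 5
labels = db.labels_                           # cluster index per point; -1 marks noise points
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True          # core points; clustered non-core points are border points
```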

Page 23:

DBSCAN (from Ranjay Sankar notes, University of Florida)


http://www.cise.ufl.edu/class/cis4930sp09dm/notes/dm5part4.pdf

Page 24:

DBSCAN

• Density-Reachable (directly and indirectly):

• A point p is directly density-reachable from p2

• p2 is directly density-reachable from p1

• p1 is directly density-reachable from q

• p <- p2 <- p1 <- q form a chain

Page 25:

DBSCAN (University at Buffalo)


http://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf

Page 26:

DBSCAN (from Ranjay Sankar notes, University of Florida)

Page 27:

Other Clustering Methods: Clustering as Optimisation

Remember one flaw with hierarchical?

“No provision can be made for a relocation of objects that may have been 'incorrectly' grouped at an early stage.”

And with K-Means:
“Different initial partitions can result in different final clusters.”

Both are using Local Search.

Page 28:

Clustering as Optimisation

We can overcome this by using different optimisation techniques for finding global optima, e.g. evolutionary algorithms:
Represent a clustering as a chromosome.
Evolve better clusterings using a fitness function: SSE?

Page 29:

Clustering as Optimisation

E.g. represent a clustering as a chromosome containing integers: [1, 2, 1, 1, 2, 2, 3, 1, 1, 3]


Objects: a, b, c, d, e, f, g, h, i, j

Cluster 1: a, c, d, h, i
Cluster 2: b, e, f
Cluster 3: g, j
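A hedged sketch of evaluating such a chromosome with an SSE fitness function (NumPy assumed; the feature matrix for objects a–j is invented purely for illustration):

```python
import numpy as np

chromosome = np.array([1, 2, 1, 1, 2, 2, 3, 1, 1, 3])  # cluster label for objects a..j
X = np.random.rand(10, 2)                              # illustrative feature vectors for a..j

def fitness(chromosome, X):
    """SSE of the clustering encoded by the chromosome (lower is better)."""
    total = 0.0
    for label in np.unique(chromosome):
        members = X[chromosome == label]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

# An evolutionary algorithm would mutate and recombine such chromosomes,
# keeping the clusterings whose fitness (SSE) is lowest.
```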

Page 30:

Feature Selection

Recall the clustering process involved: selection of appropriate features.
This process is known as feature selection.
A very large research topic in data mining.
It will affect the final clustering.
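A small sketch of the last point (scikit-learn assumed; the feature column indices are illustrative): clustering the same objects on different feature subsets generally produces different partitions.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 4)     # a dataset with four candidate features

labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[:, [0, 1]])
labels_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[:, [2, 3]])
# The two partitions will generally differ; comparing them fairly needs a
# clustering-comparison metric (see the later slides on evaluating cluster quality).
```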

Page 31:

Feature Selection

Scatterplots from different features of the same dataset

[Figure: three scatterplots of the same dataset plotted using different pairs of features]

Page 32:

Evaluating Cluster Quality

How do we know if the discovered clusters are any good?
The choice of the correct metric for judging the worth of a clustering arrangement is vital for success.
There are as many metrics as methods!
Each has its own merits and drawbacks.

Page 33:

Evaluating Cluster Quality

For example, K-Means clustering judges the worth of a clustering arrangement based on the square of how far each item in the cluster is from the centre. This is the sum of squared Euclidean distances, where X is a cluster of size k, x_i an element in the cluster, and c is the centre of the cluster:

SS(X) = \sum_{i=1}^{k} d(x_i, c)^2
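A direct translation of this formula into a short sketch (NumPy assumed; `cluster` is the array of the k members of X):

```python
import numpy as np

def sum_of_squares(cluster):
    """SS(X): sum of squared Euclidean distances from each member x_i to the centre c."""
    c = cluster.mean(axis=0)                 # c, the centre of the cluster
    return ((cluster - c) ** 2).sum()        # sum over i of d(x_i, c)^2
```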

Page 34:

Evaluating Cluster Quality

Unsupervised: cohesion / homogeneity & separation
Supervised: how close to the “true” clustering
Relative: use the above for comparing two or more clusterings, e.g. K-means vs Hierarchical

Page 35:

Cohesion & Separation

Cohesion

Separation

Page 36:

Evaluating Cluster Quality

Other variations: Silhouette

+1, indicating points that are very distant from neighbouring clusters
0, indicating points that are not distinctly in one cluster or another
-1, indicating points that are probably assigned to the wrong cluster.
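A hedged sketch of computing silhouette values with scikit-learn (an assumption; the slides do not name a library):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.random.rand(150, 2)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

per_point = silhouette_samples(X, labels)   # one value in [-1, +1] per point, as described above
overall = silhouette_score(X, labels)       # mean silhouette over all points
```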

Page 37:

Supervised

But what if we know something about the “true clusters”? Can we use this to test the effectiveness of different clustering algorithms?

Page 38:

Comparing Clusters

Metrics exist to measure how similar two clustering arrangements are.

Thus if a method produces a set of similar clustering arrangements (according to the metric) then the method is consistent

We will consider the Weighted-Kappa metric, which has been adapted from Medical Statistics
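Weighted Kappa itself is covered in the later lecture; purely as a stand-in illustration of comparing two clustering arrangements, here is a sketch using the Adjusted Rand Index from scikit-learn (a different metric from Weighted Kappa, and an assumption of this example):

```python
from sklearn.metrics import adjusted_rand_score

# Two clustering arrangements of the same ten objects; label values are arbitrary.
arrangement_1 = [1, 2, 1, 1, 2, 2, 3, 1, 1, 3]
arrangement_2 = [2, 1, 2, 2, 1, 1, 3, 2, 2, 3]   # the same partition under different label names

print(adjusted_rand_score(arrangement_1, arrangement_2))   # 1.0: the partitions agree exactly
```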

Page 39:

Lab in 2 weeks


In the lab:

Explore some more complex data

Using Weighted Kappa to evaluate the clusters – essentially a value between 0 and 1 – more next week

Explore the effect of feature selection on the scatterplots and on clustering

Page 40:

Lecture in 2 weeks...
More on Weighted Kappa and other cluster evaluation methods

And Coursework!