TRANSCRIPT
Data Clustering 2 – K-Means (cont'd) & Hierarchical Methods
Recap: K-Means Clustering
1. Place K points into the feature space. These points represent the initial cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the assignments no longer change.
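A minimal NumPy sketch of these four steps (the function name, random initialisation, and iteration cap are illustrative choices, not from the slides):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal K-means over an (n, d) array of patterns X."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K patterns at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each pattern to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # Step 4: stop once the assignments no longer change.
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Step 3: recalculate each centroid as the mean of its members.
        for j in range(k):
            if np.any(assignments == j):
                centroids[j] = X[assignments == j].mean(axis=0)
    return centroids, assignments
```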
K-Means Clustering
Interactive Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
Discussions (4)
4. Intuitively, what is the ideal partition of the following problem? Can the K-means clustering algorithm give a satisfactory answer?
Elongated clusters
Clusters with different variances
Limitations of K-Means
At each iteration of the K-means clustering algorithm, a pattern can be assigned to one cluster only, i.e. the assignment is 'hard'.
[Figure: two clusters in the (x1, x2) feature plane, with centroids m1 and m2 and an extra pattern x midway between them.]
Observe an extra pattern x: it lies exactly between the two cluster centroids, so the K-means algorithm will either (1) drag m1 down, or (2) drag m2 up.
But intuitively, it should contribute equally to both clusters…
Limitations of K-Means (2)
K-means cannot effectively solve problems with elongated clusters or clusters with different variances.
Post-Processing
After running a clustering algorithm, different procedures can be used to improve the final assignment:
Split the cluster with the highest SSE (see the sketch after this list)
Merge clusters (e.g. those that are closest)
Introduce a new centroid (often the point furthest from any cluster centre)
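As a sketch of the first option, the highest-SSE cluster could be split by re-running 2-means on its members (a hypothetical helper built on the k_means sketch above):

```python
import numpy as np

def split_highest_sse(X, centroids, assignments):
    """Split the cluster with the highest SSE by running 2-means on its members."""
    sse = [((X[assignments == j] - c) ** 2).sum()
           for j, c in enumerate(centroids)]
    worst = int(np.argmax(sse))                   # cluster with the highest SSE
    sub_centroids, _ = k_means(X[assignments == worst], k=2)
    # Replace the worst centroid with the two new sub-cluster centroids.
    return np.vstack([np.delete(centroids, worst, axis=0), sub_centroids])
```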
Pros and Cons of K-Means
Advantages:
May be computationally faster than hierarchical clustering (if K is small).
May produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages:
A fixed number of clusters makes it difficult to predict what K should be.
Different initial partitions can result in different final clusters.
Clusters may end up empty (not always bad).
Does not work well with non-globular clusters.
Outliers
SSE can be affected greatly by outliers.
An outlier is a piece of data that does not fit within the distribution of the rest of the data.
Outlier analysis is a large research topic in data mining.
Key question: is the outlier noise, or correct and “interesting”?
Hierarchical (Agglomerative) Clustering
Hierarchical clustering produces a series of clustering results.
The results start off with each object in its own cluster and end with all of the objects in the same cluster.
The intermediate clusters are created by a series of merges.
The resulting tree-like structure is called a dendrogram.
Dendrogram
The Hierarchical Clustering Algorithm
1) Each item is assigned to its own cluster (n clusters of size one)
2) Let the distances between the clusters equal the distances between the objects they contain
3) Find the closest pair of clusters and merge them into a single cluster (one less cluster)
4) Re-compute the distances between the new cluster and each of the old clusters
5) Repeat steps 3 and 4 until there is only one cluster left
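A naive Python sketch of these five steps with single linkage (the cubic pairwise search and the names are illustrative only):

```python
import numpy as np

def agglomerative(X):
    """Naive agglomerative clustering; returns the sequence of merges."""
    # Step 1: each item starts in its own cluster (n clusters of size one).
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        # Steps 2-3: find the closest pair of clusters under single linkage
        # (smallest distance between any pair of members, one from each).
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # Step 4: merge; distances to the new cluster are re-computed on the
        # next pass of the loop.  Step 5: repeat until one cluster remains.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
```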
Re-computing Distances

Linkage – Description
Single – the smallest distance between any pair of objects, one from each of the two clusters being compared.
Average – the average distance over all such pairs.
Complete – the largest distance between any pair of objects, one from each of the two clusters being compared.

Other methods include Ward, McQuitty, median, and centroid linkage.
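In practice SciPy's scipy.cluster.hierarchy implements these linkage options directly; a small sketch on made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))              # made-up 2-D patterns

for method in ("single", "average", "complete", "ward"):
    Z = linkage(X, method=method)         # (n - 1) x 4 table of merges
    print(method, "first merge:", Z[0])

# dendrogram(linkage(X, method="average"))  # draws the tree if matplotlib is available
```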
Interactive Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
Re-computing Distances
[Figure: the same pair of clusters measured under single, complete, and average linkage.]
Pros and Cons of Hierarchical Clustering
Advantages:
Can produce an ordering of the objects, which may be informative for data display.
Smaller clusters are generated, which may be helpful for discovery.
Disadvantages:
No provision can be made for a relocation of objects that may have been 'incorrectly' grouped at an early stage.
Use of different distance metrics for measuring distances between clusters may generate different results.
Clustering Gene Expression Data
Gene expression data often consist of thousands of genes observed over tens of conditions.
Clustering on expression level will help towards identifying the functionality of unknown genes.
Clustering can be used to reduce the dimensionality of the data, making it easier to model.
Other Clustering Methods: Fuzzy Clustering
For example: fuzzy c-means.
In real applications there is often no sharp boundary between clusters, so fuzzy clustering is often better suited.
Fuzzy c-means is a fuzzification of K-means and the most well-known fuzzy clustering method.
Fuzzy Clustering
Interactive Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletFCM.html
Cluster membership is now a weight between zero and one.
The distance to a centroid is multiplied by the membership weight.
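A minimal sketch of the standard fuzzy c-means update loop with fuzzifier m = 2 (variable names are illustrative):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Fuzzy c-means sketch: returns centroids and an (n, c) membership matrix."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)      # memberships sum to one per pattern
    for _ in range(n_iter):
        w = u ** m                         # fuzzified membership weights
        # Centroids are membership-weighted means of the patterns.
        centroids = (w.T @ X) / w.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)).
        ratio = d[:, :, None] / d[:, None, :]
        u = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
    return centroids, u
```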
DBSCAN (from Ranjay Sankar notes, University of Florida)
• DBSCAN is a density-based clustering algorithm
• Density = number of points within a specified radius (Eps)
• A point is a core point if it has more than a specified number of points (MinPts) within Eps (a core point is in the interior of a cluster)
• A border point has fewer than MinPts within Eps but is in the neighbourhood of a core point
• A noise point is any point that is neither a core point nor a border point
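A small sketch using scikit-learn's DBSCAN, assuming it is available; Eps and MinPts map to the eps and min_samples parameters, and the data is made up:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),   # dense blob: core/border points
               rng.normal(3, 0.3, (50, 2)),
               rng.uniform(-2, 5, (5, 2))])   # sparse scatter: likely noise

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_                           # -1 marks noise points
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True          # True for core points
border = (labels != -1) & ~core               # clustered but not core
print("clusters:", set(labels) - {-1}, "noise points:", (labels == -1).sum())
```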
DBSCAN (from Ranjay Sankar notes, University of Florida): http://www.cise.ufl.edu/class/cis4930sp09dm/notes/dm5part4.pdf
DBSCAN
• Density-reachable (directly and indirectly):
• A point p is directly density-reachable from p2
• p2 is directly density-reachable from p1
• p1 is directly density-reachable from q
• p <- p2 <- p1 <- q form a chain, so p is (indirectly) density-reachable from q
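A small sketch of the direct case, following the definitions above (q must be a core point and p must lie within Eps of q; names are illustrative):

```python
import numpy as np

def is_core(q, X, eps, min_pts):
    """Core point: more than min_pts points within eps (as defined above)."""
    return (np.linalg.norm(X - q, axis=1) <= eps).sum() > min_pts

def directly_density_reachable(p, q, X, eps, min_pts):
    """p is directly density-reachable from q iff q is a core point
    and p lies within q's eps-neighbourhood."""
    return is_core(q, X, eps, min_pts) and np.linalg.norm(p - q) <= eps
```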
DBSCAN (University of Buffalo): http://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf
Other Clustering Methods: Clustering as Optimisation
Remember one flaw with hierarchical clustering?
"No provision can be made for a relocation of objects that may have been 'incorrectly' grouped at an early stage."
And with K-means:
"Different initial partitions can result in different final clusters."
Both are using local search.
Clustering as Optimisation
We can overcome this by using optimisation techniques that search for global optima, e.g. evolutionary algorithms:
Represent a clustering as a chromosome.
Evolve better clusterings using a fitness function, e.g. SSE.
Clustering as Optimisation
E.g. represent a clustering as a chromosome of integers, one gene per object:
[1, 2, 1, 1, 2, 2, 3, 1, 1, 3]
[Figure: ten objects a–j plotted in the plane and grouped according to the chromosome above: Cluster 1 = {a, c, d, h, i}, Cluster 2 = {b, e, f}, Cluster 3 = {g, j}.]
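A sketch of how such a chromosome could be scored, using total SSE as the fitness to minimise (the evolutionary machinery of selection, crossover and mutation is omitted, and the names are illustrative):

```python
import numpy as np

def sse_fitness(chromosome, X):
    """Fitness of a clustering chromosome: total SSE (lower is better).
    chromosome[i] is the integer cluster label of object X[i]."""
    labels = np.asarray(chromosome)
    total = 0.0
    for label in np.unique(labels):
        members = X[labels == label]
        centre = members.mean(axis=0)              # centre of this cluster
        total += ((members - centre) ** 2).sum()   # SSE of this cluster
    return total

# e.g. sse_fitness([1, 2, 1, 1, 2, 2, 3, 1, 1, 3], X) for the ten objects a-j
```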
Feature Selection
Recall that the clustering process involved selection of appropriate features.
This process is known as feature selection, a very large research topic in data mining.
It will affect the final clustering.
Feature Selection
[Figure: scatterplots of different feature pairs from the same dataset; the apparent cluster structure changes with the features chosen.]
Evaluating Cluster Quality
How do we know if the discovered clusters are any good?
The choice of the correct metric for judging the worth of a clustering arrangement is vital for success.
There are as many metrics as methods!
Each has its own merits and drawbacks.
Evaluating Cluster Quality
For example, K-means judges the worth of a clustering arrangement by the square of how far each item in the cluster is from the centre. This is the sum of squared Euclidean distances:

$$SS(X) = \sum_{i=1}^{k} d(x_i, c)^2$$

where X is a cluster of size k, x_i is an element of the cluster, and c is the centre of the cluster.
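A direct translation of this formula into NumPy (a sketch; X holds the cluster's elements row-wise):

```python
import numpy as np

def cluster_sse(X):
    """Sum of squared Euclidean distances from each element to the cluster centre."""
    c = X.mean(axis=0)              # centre of the cluster
    return ((X - c) ** 2).sum()     # sums over both elements and dimensions
```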
Evaluating Cluster Quality
Unsupervised: cohesion / homogeneity and separation.
Supervised: how close to the "true" clustering.
Relative: use the above for comparing two or more clusterings, e.g. K-means vs hierarchical.
Cohesion & Separation
[Figure: cohesion (how tightly grouped the points within each cluster are) versus separation (how far apart the clusters are).]
Evaluating Cluster Quality
Other variations: the silhouette score.
+1 indicates points that are very distant from neighbouring clusters.
0 indicates points that are not distinctly in one cluster or another.
-1 indicates points that are probably assigned to the wrong cluster.
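As a sketch, the silhouette of a point i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is its mean intra-cluster distance and b(i) its mean distance to the nearest other cluster; scikit-learn averages this over all points (assuming scikit-learn is available, with made-up data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),   # two well-separated blobs
               rng.normal(4, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))            # close to +1 for good clusters
```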
Supervised
But what if we know something about the "true" clusters? Can we use this to test the effectiveness of different clustering algorithms?
Comparing Clusters
Metrics exist to measure how similar two clustering arrangements are.
Thus, if a method produces a set of similar clustering arrangements (according to the metric), the method is consistent.
We will consider the Weighted-Kappa metric, which has been adapted from medical statistics.
Lab in 2 weeks.
In the lab:
Explore some more complex data.
Use Weighted Kappa to evaluate the clusters; essentially a value between 0 and 1 (more next week).
Explore the effect of feature selection on the scatterplots and on clustering.
Lecture in 2 weeks...
More on Weighted Kappa and other cluster evaluation methods.
And coursework!