TRANSCRIPT
Data Clustering 2 – K-Means (cont'd) & Hierarchical Methods
Recap: K-Means Clustering
1. Place K points into the feature space. These points represent the initial cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the assignments no longer change.
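A minimal NumPy sketch of these four steps (the function name, random initialisation, and iteration cap are illustrative choices, not from the slides):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal K-means over an (n, d) array of patterns X."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K patterns at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each pattern to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # Step 4: stop once the assignments no longer change.
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Step 3: recalculate each centroid as the mean of its members.
        for j in range(k):
            if np.any(assignments == j):
                centroids[j] = X[assignments == j].mean(axis=0)
    return centroids, assignments
```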
K-Means Clustering
Interactive Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
Discussions (4)
4. Intuitively, what is the ideal partition of the following problem? Can the K-means clustering algorithm give a satisfactory answer?
Elongated clusters
Clusters with different variances
Limitations of K-Means
At each iteration of the K-means clustering algorithm, a pattern can be assigned to one cluster only, i.e. the assignment is 'hard'.
[Figure: two clusters in the (x1, x2) feature plane, with centroids m1 and m2 and an extra pattern x midway between them.]
Observe an extra pattern x: it lies exactly between the two cluster centroids, so the K-means algorithm will either (1) drag m1 down, or (2) drag m2 up.
But intuitively, it should contribute equally to both clusters…
Limitations of K-Means (2)
K-means cannot effectively solve problems with elongated clusters or clusters with different variances.
Post-Processing
After running a clustering algorithm, different procedures can be used to improve the final assignment:
Split the cluster with the highest SSE (see the sketch after this list)
Merge clusters (e.g. those that are closest)
Introduce a new centroid (often the point furthest from any cluster centre)
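As a sketch of the first option, the highest-SSE cluster could be split by re-running 2-means on its members (a hypothetical helper built on the k_means sketch above):

```python
import numpy as np

def split_highest_sse(X, centroids, assignments):
    """Split the cluster with the highest SSE by running 2-means on its members."""
    sse = [((X[assignments == j] - c) ** 2).sum()
           for j, c in enumerate(centroids)]
    worst = int(np.argmax(sse))                   # cluster with the highest SSE
    sub_centroids, _ = k_means(X[assignments == worst], k=2)
    # Replace the worst centroid with the two new sub-cluster centroids.
    return np.vstack([np.delete(centroids, worst, axis=0), sub_centroids])
```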
Pros and Cons of K-Means
Advantages:
May be computationally faster than hierarchical clustering (if K is small).
May produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages:
A fixed number of clusters makes it difficult to predict what K should be.
Different initial partitions can result in different final clusters.
Clusters may end up empty (not always bad).
Does not work well with non-globular clusters.
Outliers
SSE can be affected greatly by outliers.
An outlier is a piece of data that does not fit within the distribution of the rest of the data.
Outlier analysis is a large research topic in data mining.
Key question: is the outlier noise, or correct and “interesting”?
Hierarchical (Agglomerative) Clustering
Hierarchical clustering produces a series of clustering results.
The results start off with each object in its own cluster and end with all of the objects in the same cluster.
The intermediate clusters are created by a series of merges.
The resulting tree-like structure is called a dendrogram.
Dendrogram
The Hierarchical Clustering Algorithm
1) Each item is assigned to its own cluster (n clusters of size one)
2) Let the distances between the clusters equal the distances between the objects they contain
3) Find the closest pair of clusters and merge them into a single cluster (one less cluster)
4) Re-compute the distances between the new cluster and each of the old clusters
5) Repeat steps 3 and 4 until there is only one cluster left
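A naive Python sketch of these five steps with single linkage (the cubic pairwise search and the names are illustrative only):

```python
import numpy as np

def agglomerative(X):
    """Naive agglomerative clustering; returns the sequence of merges."""
    # Step 1: each item starts in its own cluster (n clusters of size one).
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        # Steps 2-3: find the closest pair of clusters under single linkage
        # (smallest distance between any pair of members, one from each).
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # Step 4: merge; distances to the new cluster are re-computed on the
        # next pass of the loop.  Step 5: repeat until one cluster remains.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
```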
Re-computing Distances

Linkage – Description
Single – the smallest distance between any pair of objects, one from each of the two clusters being compared.
Average – the average distance over all such pairs.
Complete – the largest distance between any pair of objects, one from each of the two clusters being compared.

Other methods include Ward, McQuitty, median, and centroid linkage.
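In practice SciPy's scipy.cluster.hierarchy implements these linkage options directly; a small sketch on made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))              # made-up 2-D patterns

for method in ("single", "average", "complete", "ward"):
    Z = linkage(X, method=method)         # (n - 1) x 4 table of merges
    print(method, "first merge:", Z[0])

# dendrogram(linkage(X, method="average"))  # draws the tree if matplotlib is available
```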
Interactive Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
Re-computing Distances
[Figure: the same pair of clusters measured under single, complete, and average linkage.]
Pros and Cons of Hierarchical Clustering
Advantages:
Can produce an ordering of the objects, which may be informative for data display.
Smaller clusters are generated, which may be helpful for discovery.
Disadvantages:
No provision can be made for a relocation of objects that may have been 'incorrectly' grouped at an early stage.
Use of different distance metrics for measuring distances between clusters may generate different results.
Clustering Gene Expression Data
Gene expression data often consist of thousands of genes observed over tens of conditions.
Clustering on expression level will help towards identifying the functionality of unknown genes.
Clustering can be used to reduce the dimensionality of the data, making it easier to model.
Other Clustering Methods: Fuzzy Clustering
For example: fuzzy c-means.
In real applications there is often no sharp boundary between clusters, so fuzzy clustering is often better suited.
Fuzzy c-means is a fuzzification of K-means and the most well-known fuzzy clustering method.
Fuzzy Clustering
Interactive Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletFCM.html
Cluster membership is now a weight between zero and one.
The distance to a centroid is multiplied by the membership weight.
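A minimal sketch of the standard fuzzy c-means update loop with fuzzifier m = 2 (variable names are illustrative):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Fuzzy c-means sketch: returns centroids and an (n, c) membership matrix."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)      # memberships sum to one per pattern
    for _ in range(n_iter):
        w = u ** m                         # fuzzified membership weights
        # Centroids are membership-weighted means of the patterns.
        centroids = (w.T @ X) / w.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)).
        ratio = d[:, :, None] / d[:, None, :]
        u = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
    return centroids, u
```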
DBSCAN (from Ranjay Sankar notes, University of Florida)
• DBSCAN is a density-based clustering algorithm
• Density = number of points within a specified radius (Eps)
• A point is a core point if it has more than a specified number of points (MinPts) within Eps (a core point is in the interior of a cluster)
• A border point has fewer than MinPts within Eps but is in the neighbourhood of a core point
• A noise point is any point that is neither a core point nor a border point
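A small sketch using scikit-learn's DBSCAN, assuming it is available; Eps and MinPts map to the eps and min_samples parameters, and the data is made up:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),   # dense blob: core/border points
               rng.normal(3, 0.3, (50, 2)),
               rng.uniform(-2, 5, (5, 2))])   # sparse scatter: likely noise

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_                           # -1 marks noise points
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True          # True for core points
border = (labels != -1) & ~core               # clustered but not core
print("clusters:", set(labels) - {-1}, "noise points:", (labels == -1).sum())
```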
DBSCAN (from Ranjay Sankar notes, University of Florida): http://www.cise.ufl.edu/class/cis4930sp09dm/notes/dm5part4.pdf
DBSCAN
• Density-reachable (directly and indirectly):
• A point p is directly density-reachable from p2
• p2 is directly density-reachable from p1
• p1 is directly density-reachable from q
• p <- p2 <- p1 <- q form a chain, so p is (indirectly) density-reachable from q
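A small sketch of the direct case, following the definitions above (q must be a core point and p must lie within Eps of q; names are illustrative):

```python
import numpy as np

def is_core(q, X, eps, min_pts):
    """Core point: more than min_pts points within eps (as defined above)."""
    return (np.linalg.norm(X - q, axis=1) <= eps).sum() > min_pts

def directly_density_reachable(p, q, X, eps, min_pts):
    """p is directly density-reachable from q iff q is a core point
    and p lies within q's eps-neighbourhood."""
    return is_core(q, X, eps, min_pts) and np.linalg.norm(p - q) <= eps
```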
DBSCAN (University of Buffalo): http://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf
Other Clustering Methods: Clustering as Optimisation
Remember one flaw with hierarchical clustering?
"No provision can be made for a relocation of objects that may have been 'incorrectly' grouped at an early stage."
And with K-means:
"Different initial partitions can result in different final clusters."
Both are using local search.
Clustering as Optimisation
We can overcome this by using optimisation techniques that search for global optima, e.g. evolutionary algorithms:
Represent a clustering as a chromosome.
Evolve better clusterings using a fitness function, e.g. SSE.
Clustering as Optimisation
E.g. represent a clustering as a chromosome of integers, one gene per object:
[1, 2, 1, 1, 2, 2, 3, 1, 1, 3]
[Figure: ten objects a–j plotted in the plane and grouped according to the chromosome above: Cluster 1 = {a, c, d, h, i}, Cluster 2 = {b, e, f}, Cluster 3 = {g, j}.]
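A sketch of how such a chromosome could be scored, using total SSE as the fitness to minimise (the evolutionary machinery of selection, crossover and mutation is omitted, and the names are illustrative):

```python
import numpy as np

def sse_fitness(chromosome, X):
    """Fitness of a clustering chromosome: total SSE (lower is better).
    chromosome[i] is the integer cluster label of object X[i]."""
    labels = np.asarray(chromosome)
    total = 0.0
    for label in np.unique(labels):
        members = X[labels == label]
        centre = members.mean(axis=0)              # centre of this cluster
        total += ((members - centre) ** 2).sum()   # SSE of this cluster
    return total

# e.g. sse_fitness([1, 2, 1, 1, 2, 2, 3, 1, 1, 3], X) for the ten objects a-j
```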
Feature Selection
Recall that the clustering process involved selection of appropriate features.
This process is known as feature selection, a very large research topic in data mining.
It will affect the final clustering.
Feature Selection
[Figure: scatterplots of different feature pairs from the same dataset; the apparent cluster structure changes with the features chosen.]
Evaluating Cluster Quality
How do we know if the discovered clusters are any good?
The choice of the correct metric for judging the worth of a clustering arrangement is vital for success.
There are as many metrics as methods!
Each has its own merits and drawbacks.
Evaluating Cluster Quality
For example, K-means judges the worth of a clustering arrangement by the square of how far each item in the cluster is from the centre. This is the sum of squared Euclidean distances:

$$SS(X) = \sum_{i=1}^{k} d(x_i, c)^2$$

where X is a cluster of size k, x_i is an element of the cluster, and c is the centre of the cluster.
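A direct translation of this formula into NumPy (a sketch; X holds the cluster's elements row-wise):

```python
import numpy as np

def cluster_sse(X):
    """Sum of squared Euclidean distances from each element to the cluster centre."""
    c = X.mean(axis=0)              # centre of the cluster
    return ((X - c) ** 2).sum()     # sums over both elements and dimensions
```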
Evaluating Cluster Quality
Unsupervised: cohesion / homogeneity and separation.
Supervised: how close to the "true" clustering.
Relative: use the above for comparing two or more clusterings, e.g. K-means vs hierarchical.
Cohesion & Separation
[Figure: cohesion (how tightly grouped the points within each cluster are) versus separation (how far apart the clusters are).]
Evaluating Cluster Quality
Other variations: the silhouette score.
+1 indicates points that are very distant from neighbouring clusters.
0 indicates points that are not distinctly in one cluster or another.
-1 indicates points that are probably assigned to the wrong cluster.
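As a sketch, the silhouette of a point i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is its mean intra-cluster distance and b(i) its mean distance to the nearest other cluster; scikit-learn averages this over all points (assuming scikit-learn is available, with made-up data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),   # two well-separated blobs
               rng.normal(4, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))            # close to +1 for good clusters
```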
Supervised
But what if we know something about the "true" clusters? Can we use this to test the effectiveness of different clustering algorithms?
Comparing Clusters
Metrics exist to measure how similar two clustering arrangements are.
Thus, if a method produces a set of similar clustering arrangements (according to the metric), the method is consistent.
We will consider the Weighted-Kappa metric, which has been adapted from medical statistics.
Lab in 2 weeks.
In the lab:
Explore some more complex data.
Use Weighted Kappa to evaluate the clusters; essentially a value between 0 and 1 (more next week).
Explore the effect of feature selection on the scatterplots and on clustering.
Lecture in 2 weeks...
More on Weighted Kappa and other cluster evaluation methods.
And coursework!