Introduction to Clustering


Page 1: Introduction to Clustering

Alex Prunka, Nathan Heminger, and Chris Andrade. Clustering Analysis: K-means, Hierarchical, R-Trees

Introduction to Clustering

• What is Clustering?
  – Finding structure in a collection of unlabeled data.

• Types of Clustering Algorithms
  – Partitional
    • Divides data into non-overlapping subsets (clusters)
    • No cluster-internal structure
  – Hierarchical
    • Clusters are organized as trees
    • Each node is considered a cluster

Clustering - What is Clustering - Types of Clustering Algorithms - Partitional and Hierarchical

Page 2: Introduction to Clustering


K-means

• Overview
  – Partitional algorithm (K user-defined partitions)

• Simple Implementation
  – InitializeCentroids();            // some heuristic, or random
  – while (!stopState) {              // stop state: some heuristic, e.g. centroids stable
  –   ComputeDataPointMembership();   // assign each point to its nearest centroid
  –   RecomputeCentroidPositions();   // move each centroid to the center of its cluster
  – }                                 // end loop

• Time Complexity
  – O(n·k) per iteration (O(n·k·i) over i iterations)

• Space Complexity
  – O(n + k): the n data points plus the k centroids

K-means - Overview - Implementation - Time and Space Complexity
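The loop above can be sketched concretely. Below is a minimal pure-Python version for 2-D points, assuming random initial centroids sampled from the data and "centroids unchanged" as the stop state (one common choice among many; the function name and sample data are illustrative):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal 2-D k-means following the loop above: initialize centroids,
    then alternate membership and centroid-update steps until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # InitializeCentroids()
    for _ in range(iters):                         # while (!stopState)
        # ComputeDataPointMembership(): nearest centroid by squared distance
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # RecomputeCentroidPositions(): mean of each cluster's members
        new = [(sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
               if cl else centroids[i]             # keep an empty cluster's centroid
               for i, cl in enumerate(clusters)]
        if new == centroids:                       # stopState: centroids stable
            break
        centroids = new
    return centroids, clusters

# Two well-separated blobs; k-means should recover them as two clusters of 3.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, clus = kmeans(pts, k=2)
```

Each pass of the loop performs n·k distance computations, which is the O(n·k) per-iteration cost cited above.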

Page 3: Introduction to Clustering


Sample Run

Page 4: Introduction to Clustering


K-means

• Properties
  – There are always K clusters
  – There is always at least one item in each cluster
  – The clusters are non-hierarchical and do not overlap

• Pros
  – Easy to implement
  – Fast (if K is small)
  – Produces tighter clusters than hierarchical clustering, especially if the clusters are globular

• Cons
  – Different initial partitions affect the outcome
  – Difficult to determine what K should be
  – Does not work well with “non-globular” clusters
  – Different values of K affect the final clusters

Figure: Natural clustering output with k-means
Source: http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/K-Means_Clustering_Overview.htm

Clustering - Properties - Pros - Cons

Page 5: Introduction to Clustering


Hierarchical Methods

Hierarchical Methods - Agglomerative vs. Divisive - Single-Link, Complete-Link, Average-Link

Hierarchical Methods
As opposed to partitional algorithms, which work by partitioning data into clusters, hierarchical algorithms produce a dendrogram (tree diagram) representing a hierarchy of clusters, from individual elements up to a single super-cluster.

Agglomerative vs. Divisive
Hierarchical algorithms work by either building up or breaking down these clusters. Whether clusters are built up (merged) or broken down (split) determines whether the algorithm is agglomerative or divisive.

Single-Link, Complete-Link, & Average-Link
Single link – minimum distance between any pair of points drawn from the two clusters.
Complete link – maximum distance between any pair of points drawn from the two clusters.
Average link – average distance over all pairs of points drawn from the two clusters.

(Jain)

Figure: Illustration of Agglomerative Hierarchical Algorithm. (Wikipedia)
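The three linkage criteria can be written down directly. A minimal sketch for clusters of 2-D points follows (the function name linkage_distance and the sample clusters are illustrative, not from the slides):

```python
def euclid(p, q):
    # Euclidean distance between two 2-D points
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def linkage_distance(a, b, link="single"):
    """Distance between clusters a and b (lists of 2-D points) under the
    single-, complete-, and average-link criteria described above."""
    d = [euclid(p, q) for p in a for q in b]   # every inter-cluster pair
    if link == "single":                       # closest pair across the clusters
        return min(d)
    if link == "complete":                     # farthest pair across the clusters
        return max(d)
    return sum(d) / len(d)                     # average over all pairs

a = [(0, 0), (0, 1)]
b = [(3, 0), (4, 0)]
```

Single link always yields the smallest of the three values and complete link the largest, which is why single link tends to chain clusters together while complete link favors compact clusters.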

Page 6: Introduction to Clustering


Hierarchical Algorithm Illustration

Figure: Illustration of Hierarchical Agglomerative Single-Link Algorithm http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/links.html

Pseudocode
1. Begin by placing each individual element into its own cluster.
2. Compute the distance between all clusters, based on the link type.
3. Merge the two most similar (closest) clusters.
4. Repeat steps 2–3 until only one cluster remains.

(Jain)

Hierarchical Methods - Pseudocode and Illustration
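Steps 1–4 above translate almost line for line into code. Below is a minimal single-link sketch (stopping at k clusters instead of 1 so the result is easy to inspect; names and sample data are illustrative):

```python
def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def single_link(a, b):
    # single-link distance: closest pair of points across the two clusters
    return min(dist(p, q) for p in a for q in b)

def agglomerative(points, k=1):
    """Steps 1-4 above with single-link distance; k=1 runs the merging
    to completion, a larger k stops early with k clusters."""
    clusters = [[p] for p in points]                 # 1. each element in its own cluster
    while len(clusters) > k:                         # 4. repeat until done
        # 2. compute the distance between all cluster pairs
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)               # 3. merge the two closest clusters
    return clusters

out = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], k=2)
```

Each pass scans all cluster pairs, which is where the quadratic-per-step (cubic overall) cost of a naive implementation comes from.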

Page 7: Introduction to Clustering


Hierarchical Method Results : Clustering Output

Dendrogram
The dendrogram is the fundamental representation of the hierarchical clustering method.

Advantages of the Dendrogram
Unlike the k-means method, the hierarchical method generates a hierarchy of clusterings, from 1 cluster up to n, where n is the number of elements to cluster.

The analyst can trace the sequence of merges leading to larger clusters.

There is no need to guess in advance which value of K (the number of clusters) is appropriate.

(Jain)

Figure: Illustration of Agglomerative Hierarchical Algorithm. (Wikipedia)

Hierarchical Methods - Dendrograms

Page 8: Introduction to Clustering


Hierarchical Clustering : Clustering Output

Simple Uniform Random Data Input
The data is randomly and evenly distributed throughout the graph; no apparent clustering exists.

Time Complexity & Space Complexity
Time complexity should be O(n²), but implementation difficulties increased it to O(n³): the table containing the distances between points had to be re-computed at each step.

Space complexity is O(n²); the dominant factor is the matrix containing the pairwise distances between points.
(Jain), (A Tutorial on Clustering Algorithms)

Figure: Simple Uniform Data Input, Hierarchical Agglomerative Average-Link Clustering.

Hierarchical Methods - Simple Uniform Data for Sanity Check - Time and Space Complexity
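The O(n²) space term corresponds to materializing the pairwise distance matrix. A minimal sketch (illustrative names):

```python
def distance_matrix(points):
    """Pairwise Euclidean distances: n x n entries, the O(n^2) space
    term dominating hierarchical clustering's memory use."""
    n = len(points)
    return [[((points[i][0] - points[j][0]) ** 2
              + (points[i][1] - points[j][1]) ** 2) ** 0.5
             for j in range(n)]
            for i in range(n)]

m = distance_matrix([(0, 0), (3, 4), (0, 4)])
```

One standard remedy for the re-computation mentioned above is to cache this matrix and, after each merge, update only the rows and columns belonging to the merged clusters.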

Page 9: Introduction to Clustering


Hierarchical Clustering : Natural Clustering Output

Figure: Simple Uniform Data Input, Hierarchical Agglomerative Average-Link Clustering.

Clustering Output Performance
The real challenges arise when trying to extract natural clusters that exist in the data.

Human Analysis
Humans readily recognize patterns, such as shapes, in data.

Hierarchical Clustering
The hierarchical clustering algorithm appears to produce output that is fairly consistent with human expectations. However, where the circle and rectangle intersect, the clusters can be seen to bleed slightly into one another.

Pages 10-18: Introduction to Clustering

Results

Figures: clustering output for the sample runs (images not preserved in the transcript).

Page 19: Introduction to Clustering


Works Cited

Jain, A.K., Murty, M.N., and Flynn, P.J. "Data Clustering: A Review." ACM Computing Surveys, Vol. 31, No. 3, Sept. 1999. 30 Oct. 2008. <http://mutex.gmu.edu:2338/ft_gateway.cfm?id=331504&type=pdf&coll=portal&dl=ACM>

"Data Clustering." Wikipedia: The Free Encyclopedia. 12 Nov. 2008. 18 Nov. 2008. <http://en.wikipedia.org/wiki/Data_clustering>

"K-means algorithm." Wikipedia: The Free Encyclopedia. 12 Nov. 2008. 18 Nov. 2008. <http://en.wikipedia.org/wiki/K-means>

"R-tree." Wikipedia: The Free Encyclopedia. 12 Nov. 2008. 18 Nov. 2008. <http://en.wikipedia.org/wiki/R-tree>

"A Tutorial on Clustering Algorithms." 12 Nov. 2008. <http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html>

Monz, Christof. "Machine Learning for Data Mining, Week 6: Clustering." 11 Dec. 2008. <http://www.dcs.qmul.ac.uk/~christof/html/courses/ml4dm/week06-clustering-4pp.pdf>