Introduction to Clustering


Page 1: Introduction to Clustering

Alex Prunka, Nathan Heminger, and Chris Andrade. Clustering Analysis: K-means, Hierarchical, R-Trees

Introduction to Clustering

• What is Clustering?
  – Finding structure in a collection of unlabeled data.

• Types of Clustering Algorithms
  – Partitional
    • Divides data into non-overlapping subsets (clusters)
    • No cluster-internal structure
  – Hierarchical
    • Clusters are organized as trees
    • Each node is considered a cluster

Clustering - What is Clustering - Types of Clustering Algorithms - Partitional and Hierarchical

Page 2: Introduction to Clustering


K-means

• Overview
  – Partitional algorithm (K user-defined partitions)

• Simple Implementation
  – InitializeCentroids();            // some heuristic, or random
  – while (!stopState) {              // stop state: some heuristic, e.g. centroids stable
  –   ComputeDataPointMembership();   // assign each point to its nearest centroid
  –   RecomputeCentroidPositions();   // move each centroid to the center of its cluster
  – }                                 // end loop

• Time Complexity
  – O(n·k) per iteration (O(n·k·i) over i iterations)

• Space Complexity
  – O(n + k): the n data points plus the k centroids

K-means - Overview - Implementation - Time and Space Complexity
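The loop above can be sketched concretely. Below is a minimal pure-Python version for 2-D points, assuming random initial centroids sampled from the data and "centroids unchanged" as the stop state (one common choice among many; the function name and sample data are illustrative):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal 2-D k-means following the loop above: initialize centroids,
    then alternate membership and centroid-update steps until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # InitializeCentroids()
    for _ in range(iters):                         # while (!stopState)
        # ComputeDataPointMembership(): nearest centroid by squared distance
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # RecomputeCentroidPositions(): mean of each cluster's members
        new = [(sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
               if cl else centroids[i]             # keep an empty cluster's centroid
               for i, cl in enumerate(clusters)]
        if new == centroids:                       # stopState: centroids stable
            break
        centroids = new
    return centroids, clusters

# Two well-separated blobs; k-means should recover them as two clusters of 3.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, clus = kmeans(pts, k=2)
```

Each pass of the loop performs n·k distance computations, which is the O(n·k) per-iteration cost cited above.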

Page 3: Introduction to Clustering


Sample Run

Page 4: Introduction to Clustering


K-means

• Properties
  – There are always K clusters
  – There is always at least one item in each cluster
  – The clusters are non-hierarchical and do not overlap

• Pros
  – Easy to implement
  – Fast (if K is small)
  – Produces tighter clusters than hierarchical clustering, especially if the clusters are globular

• Cons
  – Different initial partitions affect the outcome
  – Difficult to determine what K should be
  – Does not work well with “non-globular” clusters
  – Different values of K affect the final clusters

Figure: Natural clustering output with k-means
Source: http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/K-Means_Clustering_Overview.htm

Clustering - Properties - Pros - Cons

Page 5: Introduction to Clustering


Hierarchical Methods

Hierarchical Methods - Agglomerative vs. Divisive - Single-Link, Complete-Link, Average-Link

Hierarchical Methods
As opposed to partitional algorithms, which work by partitioning data into clusters, hierarchical algorithms produce a dendrogram (tree diagram) representing a hierarchy of clusters, from individual elements up to a single super-cluster.

Agglomerative vs. Divisive
Hierarchical algorithms work by either building up or breaking down these clusters. Whether clusters are built up (merged) or broken down (split) determines whether the algorithm is agglomerative or divisive.

Single-Link, Complete-Link, & Average-Link
Single link – minimum distance between any pair of points drawn from the two clusters.
Complete link – maximum distance between any pair of points drawn from the two clusters.
Average link – average distance over all pairs of points drawn from the two clusters.

(Jain)

Figure: Illustration of Agglomerative Hierarchical Algorithm. (Wikipedia)
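The three linkage criteria can be written down directly. A minimal sketch for clusters of 2-D points follows (the function name linkage_distance and the sample clusters are illustrative, not from the slides):

```python
def euclid(p, q):
    # Euclidean distance between two 2-D points
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def linkage_distance(a, b, link="single"):
    """Distance between clusters a and b (lists of 2-D points) under the
    single-, complete-, and average-link criteria described above."""
    d = [euclid(p, q) for p in a for q in b]   # every inter-cluster pair
    if link == "single":                       # closest pair across the clusters
        return min(d)
    if link == "complete":                     # farthest pair across the clusters
        return max(d)
    return sum(d) / len(d)                     # average over all pairs

a = [(0, 0), (0, 1)]
b = [(3, 0), (4, 0)]
```

Single link always yields the smallest of the three values and complete link the largest, which is why single link tends to chain clusters together while complete link favors compact clusters.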

Page 6: Introduction to Clustering


Hierarchical Algorithm Illustration

Figure: Illustration of Hierarchical Agglomerative Single-Link Algorithm http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/links.html

Pseudocode
1. Begin by placing each individual element into its own cluster.
2. Compute the distance between all clusters, based on the link type.
3. Merge the two most similar (closest) clusters.
4. Repeat steps 2–3 until only one cluster remains.

(Jain)

Hierarchical Methods - Pseudocode and Illustration
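Steps 1–4 above translate almost line for line into code. Below is a minimal single-link sketch (stopping at k clusters instead of 1 so the result is easy to inspect; names and sample data are illustrative):

```python
def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def single_link(a, b):
    # single-link distance: closest pair of points across the two clusters
    return min(dist(p, q) for p in a for q in b)

def agglomerative(points, k=1):
    """Steps 1-4 above with single-link distance; k=1 runs the merging
    to completion, a larger k stops early with k clusters."""
    clusters = [[p] for p in points]                 # 1. each element in its own cluster
    while len(clusters) > k:                         # 4. repeat until done
        # 2. compute the distance between all cluster pairs
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)               # 3. merge the two closest clusters
    return clusters

out = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], k=2)
```

Each pass scans all cluster pairs, which is where the quadratic-per-step (cubic overall) cost of a naive implementation comes from.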

Page 7: Introduction to Clustering


Hierarchical Method Results : Clustering Output

Dendrogram
The dendrogram is the fundamental representation of the hierarchical clustering method.

Advantages of the Dendrogram
Unlike the k-means method, the hierarchical method generates a hierarchy of clusterings, from 1 cluster up to n, where n is the number of elements to cluster.

The analyst can trace the sequence of merges leading to larger clusters.

There is no need to guess in advance which value of K (the number of clusters) is appropriate.

(Jain)

Figure: Illustration of Agglomerative Hierarchical Algorithm. (Wikipedia)

Hierarchical Methods - Dendrograms

Page 8: Introduction to Clustering


Hierarchical Clustering : Clustering Output

Simple Uniform Random Data Input
The data is randomly and evenly distributed throughout the graph; no apparent clustering exists.

Time Complexity & Space Complexity
Time complexity should be O(n²), but implementation difficulties increased it to O(n³): the table containing the distances between points had to be re-computed at each step.

Space complexity is O(n²); the dominant factor is the matrix containing the pairwise distances between points.
(Jain), (A Tutorial on Clustering Algorithms)

Figure: Simple Uniform Data Input, Hierarchical Agglomerative Average-Link Clustering.

Hierarchical Methods - Simple Uniform Data for Sanity Check - Time and Space Complexity
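The O(n²) space term corresponds to materializing the pairwise distance matrix. A minimal sketch (illustrative names):

```python
def distance_matrix(points):
    """Pairwise Euclidean distances: n x n entries, the O(n^2) space
    term dominating hierarchical clustering's memory use."""
    n = len(points)
    return [[((points[i][0] - points[j][0]) ** 2
              + (points[i][1] - points[j][1]) ** 2) ** 0.5
             for j in range(n)]
            for i in range(n)]

m = distance_matrix([(0, 0), (3, 4), (0, 4)])
```

One standard remedy for the re-computation mentioned above is to cache this matrix and, after each merge, update only the rows and columns belonging to the merged clusters.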

Page 9: Introduction to Clustering


Hierarchical Clustering : Natural Clustering Output

Figure: Simple Uniform Data Input, Hierarchical Agglomerative Average-Link Clustering.

Clustering Output Performance
The real challenges arise when trying to extract natural clusters that exist in the data.

Human Analysis
Humans readily recognize patterns, such as shapes, in data.

Hierarchical Clustering
The hierarchical clustering algorithm appears to produce output that is fairly consistent with human expectations. However, where the circle and rectangle intersect, the clusters can be seen to bleed slightly into one another.

Pages 10-18: Introduction to Clustering

Results

Figures: clustering output for the sample runs (images not preserved in the transcript).

Page 19: Introduction to Clustering


Works Cited

Jain, A.K., Murty, M.N., and Flynn, P.J. "Data Clustering: A Review." ACM Computing Surveys, Vol. 31, No. 3, Sept. 1999. 30 Oct. 2008. <http://mutex.gmu.edu:2338/ft_gateway.cfm?id=331504&type=pdf&coll=portal&dl=ACM>

"Data Clustering." Wikipedia: The Free Encyclopedia. 12 Nov. 2008. 18 Nov. 2008. <http://en.wikipedia.org/wiki/Data_clustering>

"K-means algorithm." Wikipedia: The Free Encyclopedia. 12 Nov. 2008. 18 Nov. 2008. <http://en.wikipedia.org/wiki/K-means>

"R-tree." Wikipedia: The Free Encyclopedia. 12 Nov. 2008. 18 Nov. 2008. <http://en.wikipedia.org/wiki/R-tree>

"A Tutorial on Clustering Algorithms." 12 Nov. 2008. <http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html>

Monz, Christof. "Machine Learning for Data Mining, Week 6: Clustering." 11 Dec. 2008. <http://www.dcs.qmul.ac.uk/~christof/html/courses/ml4dm/week06-clustering-4pp.pdf>