clustering - wordpress.com · modern clustering problem may involve euclidean spaces of very high...
TRANSCRIPT
![Page 1: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/1.jpg)
Clustering Lecture-9
Barna Saha
Acknowledgement: Some of the slides taken from Jeff Ullman’s course on Mining Massive Datasets
![Page 2: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/2.jpg)
The Problem of Clustering � Given a set of points, with a notion of distance
between points, group the points into some number of clusters, so that members of a cluster are “close” to each other, while members of different clusters are “far.”
2
![Page 3: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/3.jpg)
Example: Clusters
3
x x x x x x x x x x x x x
x x
x xx x x x
x x x x
x x x x
x x x x x x x x x
x
x
x
![Page 4: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/4.jpg)
Clustering in Low Dimensional Euclidean
Space is Easy
![Page 5: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/5.jpg)
Modern Clustering Problem � May involve Euclidean spaces of very high
dimension.
� Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance, Edit Distance etc.
� Example: � Cluster documents by topics based on occurrences of
unusual words � Cluster moviegoers by the type or types of movies
they like � Cluster genes by their sequence similarity
![Page 6: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/6.jpg)
Clustering Strategies � Two fundamentally different approaches
� Hierarchical or Agglomerative Clustering � Start with each point in its own cluster
� Merge clusters based on ``closeness’’
� Point Assignment � Start with some clusters (possibly empty)
� Consider points and insert them in appropriate clusters
![Page 7: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/7.jpg)
Which is Better?
� Point assignment good when clusters are nice, convex shapes.
� Hierarchical can win when shapes are weird.
7
![Page 8: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/8.jpg)
8
Hierarchical Clustering � Two important questions:
1. How do you determine the “nearness” of clusters?
2. How do you represent a cluster of more than one point?
![Page 9: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/9.jpg)
9
Hierarchical Clustering – (2) � Key problem: as you build clusters, how do you
represent the location of each cluster, to tell which pair of clusters is closest?
� Euclidean case: each cluster has a centroid = average of its points. � Measure intercluster distances by distances of
centroids.
![Page 10: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/10.jpg)
Example
x (1.5,1.5)
x (4.5,0.5) x (1,1)
x (4.7,1.3)
![Page 11: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/11.jpg)
Tree showing the grouping of points
(0,0) (1,2) (2,1) (4,1) (5,0) (5,3)
![Page 12: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/12.jpg)
And in the Non-Euclidean Case?
� The only “locations” we can talk about are the points themselves. � I.e., there is no “average” of two points.
� Approach 1: clustroid = point “closest” to other points. � Treat clustroid as if it were centroid, when computing
intercluster distances.
![Page 13: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/13.jpg)
13
“Closest” Point? � Possible meanings:
1. Smallest maximum distance to the other points.
2. Smallest average distance to other points. 3. Smallest sum of squares of distances to other
points. 4. Etc., etc.
![Page 14: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/14.jpg)
Efficiency of Hierarchical Clustering
� Start by computing O(n2) distances
� Subsequent steps taken O((n-1)2), O((n-2)2),….
� Total=O(n3)
� Can be reduced to O(n2logn) using priority queue (See 7.2.2)
� Can you reduce it to o(n2)?
![Page 15: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/15.jpg)
K-Means � An example of a point-assignment based clustering
� Initially choose k points that are likely to be in different clusters
� Make these points the centroids of their clusters
� For each remaining point p DO � Find the centroid to which p is closest
� Add p to the cluster of that centroid
� Adjust the centroid of that cluster to account for p
� Optional: reassign all points based on the new centroids. Repeat as long as there is any change in assignment.
![Page 16: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,](https://reader034.vdocument.in/reader034/viewer/2022042711/5f771a83884ef770715e4e13/html5/thumbnails/16.jpg)
How to select the K centers?