TRANSCRIPT
CSE/STAT 416: Hierarchical Clustering
Vinitra Swamy, University of Washington, July 29, 2020
Clustering
[Figure: news articles grouped into clusters such as SPORTS and WORLD NEWS]
Define Clusters
In its simplest form, a cluster is defined by
▪ The location of its center (the centroid)
▪ The shape and size of its spread
Clustering is the process of finding these clusters and assigning each example to a particular cluster.
▪ Example $x_i$ is assigned a cluster label $z_i \in \{1, 2, \dots, k\}$
▪ Usually based on the closest centroid
We will define some score for a clustering that determines how good the assignments are
▪ Based on the distance from assigned examples to each cluster centroid
Not Always Easy
Many clusters are harder to learn with this setup
▪ Distance to a centroid alone does not determine the clusters
Step 0: Start by choosing the initial cluster centroids
▪ A common default choice is to choose centroids at random
▪ We will see later that there are smarter ways of initializing
[Figure: three randomly chosen initial centroids $\mu_1$, $\mu_2$, $\mu_3$]
Step 1: Assign each example to its closest cluster centroid
$z_i \leftarrow \arg\min_{j \in [k]} \lVert \mu_j - x_i \rVert_2^2$
Step 2: Update the centroids to be the mean of all the points assigned to that cluster.
$\mu_j \leftarrow \frac{1}{n_j} \sum_{i : z_i = j} x_i$
Computes center of mass for cluster!
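Steps 1 and 2 repeat until convergence. A minimal NumPy sketch of this loop (function and variable names are my own, not the course's reference code):

```python
import numpy as np

def kmeans(X, init_centroids, n_iters=20):
    """Lloyd's algorithm: alternate Step 1 (assign) and Step 2 (update).

    X:              (n, d) array of examples
    init_centroids: (k, d) array of initial centroids (e.g. chosen at random)
    """
    centroids = init_centroids.astype(float).copy()
    for _ in range(n_iters):
        # Step 1: assign each example to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean (center of mass) of its points
        for j in range(len(centroids)):
            if np.any(z == j):
                centroids[j] = X[z == j].mean(axis=0)
    return z, centroids
```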
Smart Initializing w/ k-means++
Making sure the initialized centroids are "good" is critical to finding quality local optima. Our purely random approach was wasteful, since it's very possible that the initial centroids start close together.
Idea: Try to select a set of points farther away from each other.
k-means++ does a slightly smarter random initialization:
1. Choose the first centroid $\mu_1$ uniformly at random from the data
2. For the current set of centroids (starting with just $\mu_1$), compute the distance $d(x_i)$ between each datapoint $x_i$ and its closest centroid
3. Choose a new centroid from the remaining datapoints, with the probability of $x_i$ being chosen proportional to $d(x_i)^2$
4. Repeat steps 2 and 3 until we have selected $k$ centroids
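A minimal NumPy sketch of these four steps (names are illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """k-means++ initialization: spread the initial centroids apart.

    Each new centroid is drawn with probability proportional to the
    squared distance d(x_i)^2 from x_i to its closest chosen centroid.
    """
    rng = np.random.default_rng(rng)
    # Step 1: choose the first centroid uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: squared distance from each point to its closest centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # Step 3: sample the next centroid proportional to d(x_i)^2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```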
Problems with k-means
In real life, cluster assignments are not always clear-cut
▪ E.g., the moon landing: Science? World News? Conspiracy?
Because k-means minimizes Euclidean distance, it assumes all the clusters are spherical
We can change this with a weighted Euclidean distance
▪ This still assumes every cluster has the same shape/orientation
Failure Modes of k-means
If we don't meet the assumption of spherical clusters, we will get unexpected results
[Figure: k-means failures on disparate cluster sizes, overlapping clusters, and differently shaped/oriented clusters]
Mixture Models
A much more flexible approach is modeling with a mixture model
Model each cluster as a different probability distribution and learn their parameters
▪ E.g., a mixture of Gaussians
▪ Allows for different cluster shapes and sizes
▪ Typically learned using the Expectation Maximization (EM) algorithm
Mixture models also allow soft assignments to clusters
▪ E.g., 54% chance the document is about world news, 45% science, 1% conspiracy theory, 0% other
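As a concrete illustration (assuming scikit-learn is available), `GaussianMixture` fits a mixture of Gaussians with EM, and `predict_proba` returns exactly these soft assignments; the data below is synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two elongated, differently oriented clusters that k-means would struggle with
rng = np.random.default_rng(0)
a = rng.normal([0, 0], [3.0, 0.3], size=(200, 2))   # wide, flat cluster
b = rng.normal([0, 5], [0.3, 3.0], size=(200, 2))   # tall, thin cluster
X = np.vstack([a, b])

# 'full' covariances let each cluster learn its own shape and orientation;
# the parameters are fit with the EM algorithm under the hood
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X)

probs = gmm.predict_proba(X)   # soft assignments: each row sums to 1
```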
Hierarchical Clustering
Nouns: Lots of data is hierarchical by nature
https://index.pocketcluster.io/facebookresearch-poincare-embeddings.html
Species
Motivation: If we learn clusters in hierarchies, we can
▪ Avoid choosing the number of clusters beforehand
▪ Use dendrograms to visualize clusters at different granularities
▪ Use any distance metric (k-means requires Euclidean distance)
▪ Often find more complex shapes than k-means
Finding Shapes
[Figure: cluster shapes found by k-means, mixture models, and hierarchical clustering]
Types of Algorithms
Divisive, a.k.a. top-down:
▪ Start with all the data in one big cluster and then recursively split the data into smaller clusters
- Example: recursive k-means
Agglomerative, a.k.a. bottom-up:
▪ Start with each datapoint in its own cluster. Merge clusters until all points are in one big cluster.
- Example: single linkage
Divisive Clustering
Start with all the data in one cluster, and then run k-means to divide the data into smaller clusters. Repeatedly run k-means on each cluster to make sub-clusters.
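One way to sketch this recursion, using scikit-learn's `KMeans` for each 2-way split (the stopping rules here, `min_size` and `max_depth`, are illustrative choices, not the course's):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_cluster(X, min_size=5, depth=0, max_depth=3):
    """Recursively split the data with 2-means (bisecting k-means).

    Leaves of the returned nested-list tree are arrays of points.
    """
    # Stopping rules: cluster too small to split, or tree deep enough
    if len(X) < 2 * min_size or depth >= max_depth:
        return X
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    if len(np.unique(labels)) < 2:   # k-means failed to split -> stop
        return X
    return [divisive_cluster(X[labels == s], min_size, depth + 1, max_depth)
            for s in (0, 1)]
```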
Example: Using Wikipedia
[Figure: recursive splits of Wikipedia articles. Wikipedia splits into Athletes and Non-athletes; Athletes split into Baseball and Soccer/Ice hockey; Non-athletes split into Musicians, artists, actors and Scholars, politicians, government officials; …]
Choices to Make
For divisive clustering, you need to make the following choices:
▪ Which algorithm to use
▪ How many clusters per split
▪ When to split vs. when to stop:
- Max cluster size: number of points in a cluster falls below a threshold
- Max cluster radius: distance to the furthest point falls below a threshold
- Specified # of clusters: split until the pre-specified # of clusters is reached
Agglomerative Clustering
Algorithm at a glance
1. Initialize each point in its own cluster
2. Define a distance metric between clusters
While there is more than one cluster
3. Merge the two closest clusters
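The loop above can be sketched directly (a naive $O(n^3)$ illustration with single linkage and Euclidean distance; not an efficient implementation):

```python
import numpy as np

def agglomerative(X):
    """Naive bottom-up clustering: record the order of merges.

    Returns a list of (cluster_a, cluster_b, distance) merges, where
    clusters are tuples of point indices.
    """
    clusters = [(i,) for i in range(len(X))]   # each point in its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(np.linalg.norm(X[p] - X[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[2]:
                    best = (i, j, d)
        i, j, d = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```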
Step 1
1. Initialize each point to be its own cluster
Step 2: Define a distance metric between clusters
This formula defines the distance between two clusters as the smallest distance between any pair of points across the two clusters.
Single Linkage: $\text{dist}(C_1, C_2) = \min_{x_i \in C_1, x_j \in C_2} d(x_i, x_j)$
Step 3: Merge the closest pair of clusters
3. Merge the two closest clusters
Repeat
4. Repeat step 3 until all points are in one cluster
Repeat: Notice that the height of the dendrogram grows as we group points that are farther from each other
4. Repeat step 3 until all points are in one cluster
Repeat
Repeat: Looking at the dendrogram, we can see there is a bit of an outlier!
We can tell because a point joins a cluster at a really large distance.
4. Repeat step 3 until all points are in one cluster
Repeat: The tall links in the dendrogram show that we are merging clusters that are far away from each other
4. Repeat step 3 until all points are in one cluster
Repeat: Final result after merging all clusters
4. Repeat step 3 until all points are in one cluster
Final Result
Brain Break
Dendrogram: The x-axis shows the datapoints (arranged in a very particular order)
The y-axis shows the distance between pairs of clusters
Dendrogram: The path from a leaf shows all the clusters a single point belongs to, and the order in which its clusters merged
Cut Dendrogram
Choose a distance $D^*$ to "cut" the dendrogram
▪ Use the largest clusters with distance $< D^*$
▪ Usually ignore the nested clusters after cutting
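As an illustration, SciPy (assuming it is available) builds this merge tree and performs the cut; `fcluster` with `criterion="distance"` keeps the largest clusters whose merge distance is below the threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0], [0.5], [10.0], [10.5], [50.0]])

# Build the merge tree with single linkage
Z = linkage(X, method="single")

# "Cut" at distance D* = 5: every branch crossing D* becomes its own cluster
labels = fcluster(Z, t=5.0, criterion="distance")
```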
Think
pollev.com/cse416
How many clusters would we have if we use this threshold?
1 min
Pair
pollev.com/cse416
How many clusters would we have if we use this threshold?
2 min
Cut Dendrogram
Every branch that crosses $D^*$ becomes its own cluster
Choices to Make
For agglomerative clustering, you need to make the following choices:
▪ Distance metric $d(x_i, x_j)$
▪ Linkage function:
- Single linkage: $\min_{x_i \in C_1, x_j \in C_2} d(x_i, x_j)$
- Complete linkage: $\max_{x_i \in C_1, x_j \in C_2} d(x_i, x_j)$
- Centroid linkage: $d(\mu_1, \mu_2)$
- Others
▪ Where and how to cut the dendrogram
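The three linkage functions listed above can be written out directly (a small NumPy sketch; function names are my own):

```python
import numpy as np

def single_linkage(C1, C2):
    """Smallest distance between any pair of points across the clusters."""
    return min(np.linalg.norm(p - q) for p in C1 for q in C2)

def complete_linkage(C1, C2):
    """Largest distance between any pair of points across the clusters."""
    return max(np.linalg.norm(p - q) for p in C1 for q in C2)

def centroid_linkage(C1, C2):
    """Distance between the two cluster centroids."""
    return np.linalg.norm(C1.mean(axis=0) - C2.mean(axis=0))
```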
[Figure: dendrogram with data points on the x-axis, cluster distance on the y-axis, and a cut at $D^*$]
Practical Notes
For visualization, generally a smaller # of clusters is better
For tasks like outlier detection, cut based on:
▪ A distance threshold
▪ Or some other metric that measures how much the distance increased after a merge
No matter what metric or threshold you use, no method is "incorrect". Some are just more useful than others.
Computational Cost
Computing all pairs of distances is pretty expensive!
▪ A simple implementation takes $O(n^2 \log n)$
This can be implemented much more cleverly by taking advantage of the triangle inequality
▪ "Any side of a triangle must be less than the sum of the other two sides"
The best known algorithm is $O(n^2)$
Concept Inventory
This week we want to practice recalling vocabulary. Spend 10 minutes trying to write down all the terms for concepts we have learned in this class and try to bucket them into the following categories.
Regression
Classification
Document Retrieval
Misc: For things that fit in multiple places
You donβt need to define/explain the terms for this exercise, but you should know what they are!
Try to do this for at least 5 minutes before looking at your notes.