matt dickenson clustering · agenda introduction why clustering works popular approaches dbscan...
TRANSCRIPT
![Page 1: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/1.jpg)
Clustering
Matt Dickenson
![Page 2: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/2.jpg)
Agenda
● Introduction● Why Clustering Works● Popular Approaches
○ DBSCAN○ K-Means○ Gaussian EM
● When to Use Clustering● When Clustering Fails
Image: https://xkcd.com/388/
![Page 3: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/3.jpg)
Overview
![Page 4: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/4.jpg)
Overview
![Page 5: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/5.jpg)
Overview
● Unsupervised approach○ No labeled data
● Group similar points together○ How to define similarity?
similar different
![Page 6: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/6.jpg)
Why Clustering Works
● Often datasets consist of underlying groups
Image: http://theoffice.wikia.com
Sales
Accounting
![Page 7: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/7.jpg)
Why Clustering Works
● Often datasets consist of underlying groups● Usually simple to define similarity as a distance
Euclidean Manhattan
BARN
Levenshtein
YARD
![Page 8: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/8.jpg)
Popular Clustering Algorithms
Images: http://shapeofdata.wordpress.com
K-Meanscentroid-based
DBSCANneighborhood-based
Gaussian E-Mdistribution-based
![Page 9: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/9.jpg)
DBSCAN
Image: http://shapeofdata.wordpress.com
![Page 10: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/10.jpg)
DBSCAN
Image: http://shapeofdata.wordpress.com
“Density-based spatial clustering of applications with noise”
![Page 11: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/11.jpg)
DBSCAN
● 2 parameters○ ε controls the size of a point’s “neighborhood”○ How many neighbors a point must have to be considered “core”
● Result○ List of core points○ Assigned points & outliers
![Page 12: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/12.jpg)
DBSCAN
![Page 13: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/13.jpg)
DBSCANε = 4, minPts = 30
![Page 14: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/14.jpg)
DBSCANIterate through points, count neighbors within ε distance
ε = 4, minPts = 30
![Page 15: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/15.jpg)
DBSCANIf the point has |neighbors| > minPts, consider it “core”
ε = 4, minPts = 30
![Page 16: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/16.jpg)
DBSCAN
97 neighbors
If the point has |neighbors| > minPts, consider it “core”
ε = 4, minPts = 30
![Page 17: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/17.jpg)
DBSCANCheck neighbors’ neighbors for density
ε = 4, minPts = 30
![Page 18: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/18.jpg)
DBSCANAssign core point’s neighbors to its cluster
ε = 4, minPts = 30
![Page 19: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/19.jpg)
DBSCAN
ε = 4, minPts = 30
Iterate through points until you find next core point
![Page 20: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/20.jpg)
DBSCAN
ε = 4, minPts = 30
Check neighbors’ neighbors for density
![Page 21: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/21.jpg)
DBSCAN
ε = 4, minPts = 30
Assign unlabeled neighbors to new core point
![Page 22: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/22.jpg)
DBSCAN
ε = 4, minPts = 30
Continue until all points have been checked
![Page 23: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/23.jpg)
DBSCAN
ε = 4, minPts = 30
Result: 3 clusters
![Page 24: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/24.jpg)
DBSCANResult: 2 clusters
ε = 5, minPts = 30
![Page 25: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/25.jpg)
DBSCANResult: 9 clusters
ε = 2, minPts = 10
![Page 26: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/26.jpg)
DBSCAN
● Pros ○ Don’t have to specify number of clusters○ Fairly robust to outliers○ Handles non-spherical clusters
● Cons○ Non-deterministic
■ For neighbors of two core points, cluster assignment depends on order that data is processed
○ Non-probabilistic■ Returns cluster assignments without uncertainty/probability■ Tuning parameters can be difficult
![Page 27: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/27.jpg)
K-Means
Image: http://shapeofdata.wordpress.com
![Page 28: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/28.jpg)
K-Means
● 1 parameter, k, controls number of clusters● Distance function
○ Typically Euclidean distance in feature space● Result: list of centroids and point labels● Caveat: there are many ways to implement K-Means
![Page 29: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/29.jpg)
K-Means
![Page 30: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/30.jpg)
K-MeansHypothesis: k=2
![Page 31: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/31.jpg)
K-MeansPick k random data points as initial centroids
![Page 32: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/32.jpg)
K-MeansCompute the distance of each point from each centroid
![Page 33: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/33.jpg)
K-MeansCompute the distance of each point from each centroid
![Page 34: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/34.jpg)
K-MeansAssign each point to the closest cluster
![Page 35: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/35.jpg)
K-MeansUpdate centroids to the mean of new clusters
![Page 36: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/36.jpg)
K-MeansUpdate centroids to the mean of new clusters
![Page 37: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/37.jpg)
K-MeansRepeat until clusters converge
![Page 38: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/38.jpg)
K-MeansRepeat until clusters converge little/no change between iters
![Page 39: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/39.jpg)
K-MeansRepeat until clusters converge
![Page 40: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/40.jpg)
K-MeansRepeat until clusters converge
![Page 41: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/41.jpg)
K-Means
● Pros ○ Quick to implement (and usually quick to run)○ Simple, intuitive algorithm
● Cons○ Can be difficult to choose k in practice○ Clusters will all have the same radius○ Sensitive to outliers
![Page 42: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/42.jpg)
How do I choose a good value of k?
● Visually inspect the data○ This is impractical for high-dimensional data
![Page 43: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/43.jpg)
How do I choose a good value of k?
● Visually inspect the data○ This is impractical for high-dimensional data
● Try several different values○ Which value of k minimizes the total error? ○ Increasing k will always decrease the total error○ Pick the value that gives the greatest gain (“elbow method”)
![Page 44: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/44.jpg)
How do I choose a good value of k?
● Visually inspect the data○ This is impractical for high-dimensional data
● Try several different values○ Which value of k minimizes the total error? ○ Increasing k will always decrease the total error○ Pick the value that gives the greatest gain (“elbow method”)
● Other metrics○ Silhouette score (more detail next week)
![Page 45: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/45.jpg)
Expectation-Maximization
Image: http://shapeofdata.wordpress.com
![Page 46: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/46.jpg)
Expectation-Maximization
● You can use EM with a variety of distributions● We will use it with a mixture of Gaussians (GMM)● 2 parameters per cluster
○ μ represents cluster center○ σ represents cluster shape & scale (standard deviation)
● Result○ Parameters for final clusters○ Labels for each point (“soft” assignments)
![Page 47: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/47.jpg)
Gaussian EM
![Page 48: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/48.jpg)
Gaussian EMMake initial guesses of parameter values
![Page 49: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/49.jpg)
Gaussian EMMake initial guesses of parameter values
intentionally bad guesses!
![Page 50: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/50.jpg)
Gaussian EMAssign each point to its most likely cluster (E-step)
![Page 51: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/51.jpg)
Gaussian EMUpdate cluster parameters (M-step)
![Page 52: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/52.jpg)
Gaussian EM
Repeat E-Step Repeat M-Step
iteration 2
![Page 53: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/53.jpg)
iteration 3
Gaussian EM
Repeat E-Step Repeat M-Step
![Page 54: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/54.jpg)
iteration 4
Gaussian EM
Repeat E-Step Repeat M-Step
![Page 55: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/55.jpg)
iteration 5
Gaussian EM
Repeat E-Step Repeat M-Step
![Page 56: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/56.jpg)
iteration 6
Gaussian EM
Repeat E-Step Repeat M-Step
![Page 57: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/57.jpg)
Gaussian EMContinue until clusters converge
![Page 58: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/58.jpg)
Expectation-Maximization
● Pros○ Each cluster is a statistical distribution ○ “Soft” assignment with probabilistic uncertainty○ Robust to poor initial clusters○ Clusters can be different sizes○ Very flexible approximation
● Cons○ Must choose number of clusters in advance○ Distance function is less flexible○ Somewhat sensitive to outliers
![Page 59: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/59.jpg)
Comparison
Truth DBSCAN
![Page 60: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/60.jpg)
Comparison
Truth DBSCAN
![Page 61: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/61.jpg)
Comparison
Truth K-Means
![Page 62: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/62.jpg)
Comparison
Truth K-Means
![Page 63: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/63.jpg)
Comparison
Truth Gaussian EM
![Page 64: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/64.jpg)
Comparison
Truth Gaussian EM
![Page 65: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/65.jpg)
Choosing a clustering algorithm
● DBSCAN○ You have a good sense of what an appropriate ε threshold is○ You are concerned about outliers
● K-Means○ You want to get started very quickly○ You have time to try different values of k
● Expectation-Maximization○ You want to convey uncertainty about your predictions○ You know the number of clusters or have time to try several values
![Page 66: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/66.jpg)
Other Clustering Approaches
Hierarchicalhttp://www.sthda.com/sthda/RDoc/figure/clustering/cluster-analysi
s-in-r-hierarchical-clustering2-1.png
Graph-Basedhttps://hemberg-lab.github.io/scRNA.seq.course/figures/graph_net
work.jpg
Neural Modelhttp://www.pitt.edu/~is2470pb/Spring05/FinalProjects/Group1a/tut
orial/kohonen1.gif
![Page 67: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/67.jpg)
Other Clustering Approaches
Image: http://scikit-learn.org/stable/modules/clustering.html
![Page 68: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/68.jpg)
When Clustering Can Fail
non-spherical different densities high-dimensional data
DBSCAN ✔ ✘ ✘
K-Means ✘ ✘
Gaussian EM ✘ ✔
http://hameddaily.blogspot.com/2015/03/when-not-to-use-gaussian-mixtures-model.html
https://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means
https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Schlegel_wireframe_8-cell.png/400px-Schlegel_wireframe_8-cell.png
![Page 69: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/69.jpg)
When to Use Clustering
● When you...○ suspect your data consists of underlying groups○ cannot collect labels for classification○ can define “similar” points via a distance metric
![Page 70: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/70.jpg)
Resources
● Geometric view of clustering: https://shapeofdata.wordpress.com/category/clustering/ ● K-Means
○ https://www.datascience.com/blog/k-means-clustering ○ https://www.naftaliharris.com/blog/visualizing-k-means-clustering/ ○ Elbow method: https://dataskeptic.com/blog/episodes/2016/the-elbow-method
● DBSCAN○ https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/ ○ http://lineardigressions.com/episodes/2017/11/19/dbscan
● EM○ http://hameddaily.blogspot.com/2015/03/when-not-to-use-gaussian-mixtures-model.html○ https://pythonmachinelearning.pro/clustering-with-gaussian-mixture-models/
● Other clustering approaches○ http://lineardigressions.com/episodes/2016/2/29/t-sne-reduce-your-dimensions-keep-your-clusters
● TF Example: http://playground.tensorflow.org ● Image compression: http://scikit-learn.org/stable/auto_examples/cluster/plot_color_quantization.html
![Page 71: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/71.jpg)
Thanks!
Image: https://toptal.io/
![Page 72: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/72.jpg)
Lab Session
![Page 73: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/73.jpg)
Recap
● Clustering is an unsupervised approach● DBSCAN
○ Intuitive parameters○ Difficult to control number of clusters
● K-Means○ Easy to control number of clusters○ Sensitive to outliers
● Gaussian EM○ Probabilistic cluster assignment
![Page 74: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/74.jpg)
Evaluating Clustering Models
● How can we find the “best” parameters? ○ Grid Search
● How can we evaluate cluster quality without labeled data? ○ Silhouette Score
● Code Lab
![Page 75: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/75.jpg)
Grid Search
Image: http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf
![Page 76: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/76.jpg)
Grid Search
1 2 3 ... n
K-Means
k
DBSCAN
1 2 3 ... nεminPts
1
2
...
n
![Page 77: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/77.jpg)
Grid Search
● Pros○ Very simple○ Works well with a small number parameters
● Cons○ Assumes all parameters are equally important
■ Not always true! ■ When this assumption is violated, consider random search
There are many other advanced parameter search methods, too
![Page 78: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/78.jpg)
Silhouette Score
● Revisiting the question “how do I choose k?” ● General scoring method for clustering algorithms
○ Works for models other than K-Means too
![Page 79: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/79.jpg)
Silhouette Score
● a(i): average dissimilarity of i to all other objects in cluster A ● d(i, X): average dissimilarity between i and objects in cluster X● b(i) = minX≠A d(i, X)
Image: doi:10.1016/0377-0427(87)90125-7
![Page 80: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/80.jpg)
Silhouette Score
● a: mean distance between a sample & all other points in the same cluster ● b: mean distance between a sample & all other points in the next nearest cluster
![Page 81: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/81.jpg)
Silhouette Score
● Only defined for n>1 and k>1
-1 ≤ s(i) ≤ 1
![Page 82: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/82.jpg)
Silhouette Score
● Close to 1: a ≪ b○ Dissimilarity within clusters ≪ smallest dissimilarity between clusters○ Objects are well-clustered
● Close to 0: a ≅ b○ This object could belong to either of two clusters without affecting quality
● Close to -1: b ≪ a○ This object is very different from the others in its cluster○ Likely assigned to the wrong cluster
![Page 83: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/83.jpg)
Silhouette Score
![Page 84: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/84.jpg)
Silhouette Score
![Page 85: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/85.jpg)
Silhouette Score
![Page 86: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/86.jpg)
Silhouette Score
![Page 87: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/87.jpg)
Silhouette Score
![Page 88: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/88.jpg)
Silhouette Score
![Page 89: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/89.jpg)
Silhouette Score
● Pros○ Simple to evaluate and compare○ Allows comparison of different clustering methods○ Tends to correlate with qualitative judgment of “good” clusters
● Cons○ Can be sensitive to data scaling
■ Scale your features to equal “importance” ○ Average silhouette score can mask differences between clusters○ Biased toward many small clusters
Many other cluster quality metrics exist, too
![Page 90: Matt Dickenson Clustering · Agenda Introduction Why Clustering Works Popular Approaches DBSCAN K-Means Gaussian EM When to Use Clustering When Clustering Fails](https://reader033.vdocument.in/reader033/viewer/2022043007/5f953163ff705041a706d166/html5/thumbnails/90.jpg)
Resources
● Parameter search○ https://medium.com/rants-on-machine-learning/smarter-parameter-sweeps-or-why-grid-search-is-pl
ain-stupid-c17d97a0e881 ○ https://blog.sigopt.com/posts/evaluating-hyperparameter-optimization-strategies
● Silhouette score○ http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-gl
r-auto-examples-cluster-plot-kmeans-silhouette-analysis-py ○ http://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient ○ https://kapilddatascience.wordpress.com/2015/12/08/k-mean-clustering-using-silhouette-analysis-
with-example-part-3/ ○ https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set