k-meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · the k-means algorithm...
TRANSCRIPT
![Page 1: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/1.jpg)
1
K-Means
Class Algorithmic Methods of Data MiningProgram M. Sc. Data ScienceUniversity Sapienza University of RomeSemester Fall 2018Slides by Carlos Castillo http://chato.cl/
Sources:● Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis:
Fundamental Concepts and Algorithms, Cambridge University Press, May 2014. Example 13.1. [download]
● Evimaria Terzi: Data Mining course at Boston University http://www.cs.bu.edu/~evimaria/cs565-13.html
![Page 2: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/2.jpg)
2
Boston University Slideshow Title Goes Here
The k-means problem
• consider set X={x1,...,xn} of n points in Rd
• assume that the number k is given• problem:
• find k points c1,...,ck (named centers or means)
so that the cost
is minimized
![Page 3: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/3.jpg)
3
Boston University Slideshow Title Goes Here
The k-means problem
• k=1 and k=n are easy special cases (why?)
• an NP-hard problem if the dimension of the data is at least 2 (d≥2)
• in practice, a simple iterative algorithm works quite well
![Page 4: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/4.jpg)
4
Boston University Slideshow Title Goes Here
The k-means algorithm
• voted among the top-10 algorithms in data mining
• one way of solving the k-means problem
![Page 5: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/5.jpg)
5
K-means algorithm
![Page 6: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/6.jpg)
6
Boston University Slideshow Title Goes Here
The k-means algorithm
1.randomly (or with another method) pick k cluster centers {c1,...,ck}
2.for each j, set the cluster Xj to be the set of points in X that are the closest to center cj
3.for each j let cj be the center of cluster Xj (mean of the vectors in Xj)
1.repeat (go to step 2) until convergence
![Page 7: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/7.jpg)
7
Boston University Slideshow Title Goes Here
Sample execution
![Page 8: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/8.jpg)
8
1-dimensional clustering exercise
Exercise:
● For the data in the figure● Run k-means with k=2 and initial centroids u1=2, u2=4
(Verify: last centroids are 18 units apart)
● Try with k=3 and initialization 2,3,30
http://www.dataminingbook.info/pmwiki.php/Main/BookDownload Exercise 13.1
![Page 9: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/9.jpg)
9
Limitations of k-means
● Clusters of different size● Clusters of different density● Clusters of non-globular shape● Sensitive to initialization
![Page 10: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/10.jpg)
10
Boston University Slideshow Title Goes Here
Limitations of k-means: different sizes
![Page 11: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/11.jpg)
11
Boston University Slideshow Title Goes Here
Limitations of k-means: different density
![Page 12: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/12.jpg)
12
Boston University Slideshow Title Goes Here
Limitations of k-means: non-spherical shapes
![Page 13: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/13.jpg)
13
Boston University Slideshow Title Goes Here
Effects of bad initialization
![Page 14: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/14.jpg)
14
Boston University Slideshow Title Goes Here
k-means algorithm
• finds a local optimum
• often converges quickly but not always
• the choice of initial points can have large influence in the result
• tends to find spherical clusters
• outliers can cause a problem
• different densities may cause a problem
![Page 15: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/15.jpg)
15
Advanced: k-means initialization
![Page 16: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/16.jpg)
16
Boston University Slideshow Title Goes Here
Initialization
• random initialization
• random, but repeat many times and take the best solution• helps, but solution can still be bad
• pick points that are distant to each other• k-means++• provable guarantees
![Page 17: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/17.jpg)
17
Boston University Slideshow Title Goes Here
k-means++
David Arthur and Sergei Vassilvitskii
k-means++: The advantages of careful seeding
SODA 2007
![Page 18: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/18.jpg)
18
Boston University Slideshow Title Goes Here
k-means algorithm: random initialization
![Page 19: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/19.jpg)
19
Boston University Slideshow Title Goes Here
k-means algorithm: random initialization
![Page 20: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/20.jpg)
20
Boston University Slideshow Title Goes Here
12
34
k-means algorithm: initialization with further-first traversal
![Page 21: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/21.jpg)
21
Boston University Slideshow Title Goes Here
k-means algorithm: initialization with further-first traversal
![Page 22: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/22.jpg)
22
Boston University Slideshow Title Goes Here
12
3
but... sensitive to outliers
![Page 23: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/23.jpg)
23
Boston University Slideshow Title Goes Here
but... sensitive to outliers
![Page 24: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/24.jpg)
24
Boston University Slideshow Title Goes Here
Here random may work well
![Page 25: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/25.jpg)
25
Boston University Slideshow Title Goes Here
k-means++ algorithm
• interpolate between the two methods
• let D(x) be the distance between x and the nearest center selected so far
• choose next center with probability proportional to
(D(x))a = Da(x)
✦ a = 0 random initialization✦ a = ∞ furthest-first traversal✦ a = 2 k-means++
![Page 26: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/26.jpg)
26
Boston University Slideshow Title Goes Here
k-means++ algorithm
• initialization phase: • choose the first center uniformly at random• choose next center with probability proportional
to D2(x)
• iteration phase:• iterate as in the k-means algorithm until
convergence
![Page 27: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/27.jpg)
27
Boston University Slideshow Title Goes Here
k-means++ initialization
1
2
3
![Page 28: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/28.jpg)
28
Boston University Slideshow Title Goes Here
k-means++ result
![Page 29: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/29.jpg)
29
Boston University Slideshow Title Goes Here
• approximation guarantee comes just from the first iteration (initialization)
• subsequent iterations can only improve cost
k-means++ provable guarantee
![Page 30: K-Meansaris.me/contents/teaching/data-mining-ds-2018/slides/kmeans.pdf · The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j,](https://reader033.vdocument.in/reader033/viewer/2022050107/5f44e193c84a0374f3022b4f/html5/thumbnails/30.jpg)
30
Boston University Slideshow Title Goes Here
Lesson learned
• no reason to use k-means and not k-means++
• k-means++ :• easy to implement• provable guarantee• works well in practice