Clustering
http://net.pku.edu.cn/~course/cs402/2009
Hongfei Yan, School of EECS, Peking University
7/8/2009
Refer to Aaron Kimball’s slides
Google News
• They didn’t pick all 3,400,217 related articles by hand…
• Or Amazon.com
• Or Netflix…
Other less glamorous things...
• Hospital records
• Scientific imaging
  – Related genes, related stars, related sequences
• Market research
  – Segmenting markets, product positioning
• Social network analysis
• Data mining
• Image segmentation…
The Distance Measure
• How the similarity of two elements in a set is determined, e.g.:
  – Euclidean distance
  – Manhattan distance
  – Inner product space
  – Maximum norm
  – Or any metric you define over the space…
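As a minimal sketch (my own illustration, not from the slides), the first two metrics can be written as:

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))
```

For example, `euclidean((0, 0), (3, 4))` is 5.0 while `manhattan((0, 0), (3, 4))` is 7.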
Types of Algorithms
• Hierarchical clustering vs.
• Partitional clustering
Hierarchical Clustering
• Builds or breaks up a hierarchy of clusters.
Partitional Clustering
• Partitions set into all clusters simultaneously.
K-Means Clustering
• Simple partitional clustering
• Choose the number of clusters, k
• Choose k points to be cluster centers
• Then…
K-Means Clustering
iterate {
  compute the distance from every point to each of the k centers
  assign each point to the nearest center
  compute the average of the points assigned to each center
  replace the k centers with the new averages
}
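The loop above can be sketched in Python (a minimal 1-D illustration under my own naming, not the course's implementation):

```python
def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm over 1-D points, following the pseudocode above."""
    for _ in range(iterations):
        # Assign each point to its nearest center.
        clusters = {i: [] for i in range(len(centers))}
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Replace each center with the average of its assigned points
        # (keep the old center if no points were assigned to it).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in clusters.items()]
    return centers
```

For example, `kmeans([1, 2, 10, 11], [0, 5])` converges to `[1.5, 10.5]`.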
But!
• The complexity is pretty high:
  – k · n · O(distance metric) · num(iterations)
• Moreover, it can be necessary to send tons of data to each Mapper Node. Depending on your bandwidth and memory available, this could be impossible.
Furthermore
• There are three big ways a data set can be large:
  – There are a large number of elements in the set.
  – Each element can have many features.
  – There can be many clusters to discover.
• Conclusion: clustering can be huge, even when you distribute it.
Canopy Clustering
• Preliminary step to help parallelize computation.
• Clusters data into overlapping canopies using a super-cheap distance metric.
• Efficient
• Accurate
Canopy Clustering
while there are unmarked points {
  pick a point which is not strongly marked
  call it a canopy center
  mark all points within some threshold of it as in its canopy
  strongly mark all points within some stronger (tighter) threshold
}
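A hedged Python sketch of this loop (my own illustration; the function name, the loose/tight thresholds `t1 > t2`, and the `dist` parameter are assumptions in the spirit of the slide):

```python
def canopy(points, t1, t2, dist):
    """Canopy clustering: t1 is the loose threshold, t2 the tight one (t2 < t1)."""
    canopies = []
    candidates = list(points)   # points not yet strongly marked
    while candidates:
        center = candidates[0]
        # Every point within t1 of the center joins this canopy
        # (canopies may overlap, so we scan all points).
        members = [p for p in points if dist(p, center) < t1]
        canopies.append((center, members))
        # Points within t2 are strongly marked: they can never become centers.
        candidates = [p for p in candidates if dist(p, center) >= t2]
    return canopies
```

With `dist = lambda a, b: abs(a - b)`, `canopy([0, 1, 10, 11], 3, 2, dist)` produces two canopies, one around 0 and one around 10.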
After the canopy clustering…
• Resume hierarchical or partitional clustering as usual.
• Treat objects in separate canopies as being at infinite distance, so they are never compared.
MapReduce Implementation:
• Problem
  – Efficiently partition a large data set (say… movies with user ratings!) into a fixed number of clusters using canopy clustering, k-means clustering, and a Euclidean distance measure.
The Distance Metric
• The canopy metric ($)
• The K-Means metric ($$$)
Steps!
• Get data into a form you can use (MR)
• Pick canopy centers (MR)
• Assign data points to canopies (MR)
• Pick k-means cluster centers
• K-means algorithm (MR)
  – Iterate!
Data Massage
• This isn’t interesting, but it has to be done.
Selecting Canopy Centers
Assigning Points to Canopies
K-Means Map
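The original slides illustrate this step with diagrams only. As a hedged sketch (my own names, not Hadoop API code; points assumed to be 1-D scalars for brevity), one round of the k-means map and reduce might look like:

```python
def kmeans_map(point, centers, dist):
    # Emit (nearest_center_index, point) so the reducer can average per center.
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, point

def kmeans_reduce(center_index, points):
    # Average all points assigned to this center to produce the new center.
    return center_index, sum(points) / len(points)
```

The driver would run map over all points, group the emitted pairs by center index, run reduce per group, and feed the new centers into the next iteration.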
Elbow Criterion
• Choose a number of clusters s.t. adding a cluster doesn’t add interesting information.
• Rule of thumb to determine what number of clusters should be chosen.
• Initial assignment of cluster seeds has bearing on final model performance.
• Often required to run clustering several times to get maximal performance.
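A minimal sketch (my own, not from the slides) of the quantity usually plotted against k to find the elbow, the within-cluster sum of squared errors:

```python
def within_cluster_sse(points, centers):
    """Sum of squared distances from each 1-D point to its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)
```

Computing this for k = 1, 2, 3, … and plotting it, one picks the k where the curve stops dropping sharply.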
Clustering Conclusions
• Clustering is slick
• And it can be done super efficiently
• And in lots of different ways
Homework
• Lab 4 – Clustering the Netflix movie data
• Hw4 – Read IIR chapter 16
  – Flat clustering