machine learning queens college lecture 7: clustering

61
Machine Learning Queens College Lecture 7: Clustering

Upload: sherman-page

Post on 18-Jan-2016

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Machine Learning Queens College Lecture 7: Clustering

Machine Learning

Queens College

Lecture 7: Clustering

Page 2: Machine Learning Queens College Lecture 7: Clustering

Clustering

• Unsupervised technique.

• Task is to identify groups of similar entities in the available data.– And sometimes remember how these groups

are defined for later use.

2

Page 3: Machine Learning Queens College Lecture 7: Clustering

People are outstanding at this

3

Page 4: Machine Learning Queens College Lecture 7: Clustering

People are outstanding at this

4

Page 5: Machine Learning Queens College Lecture 7: Clustering

People are outstanding at this

5

Page 6: Machine Learning Queens College Lecture 7: Clustering

Dimensions of analysis

6

Page 7: Machine Learning Queens College Lecture 7: Clustering

Dimensions of analysis

7

Page 8: Machine Learning Queens College Lecture 7: Clustering

Dimensions of analysis

8

Page 9: Machine Learning Queens College Lecture 7: Clustering

Dimensions of analysis

9

Page 10: Machine Learning Queens College Lecture 7: Clustering

How would you do this computationally or

algorithmically?

10

Page 11: Machine Learning Queens College Lecture 7: Clustering

Machine Learning Approaches

• Possible Objective functions to optimize– Maximum Likelihood/Maximum A Posteriori– Empirical Risk Minimization– Loss function Minimization

What makes a good cluster?

11

Page 12: Machine Learning Queens College Lecture 7: Clustering

Cluster Evaluation

• Intrinsic Evaluation– Measure the compactness of the clusters. – or similarity of data points that are assigned to

the same cluster

• Extrinsic Evaluation– Compare the results to some gold standard

or labeled data.– Not covered today.

12

Page 13: Machine Learning Queens College Lecture 7: Clustering

Intrinsic Evaluation cluster variability

• Intercluster Variability (IV)– How different are the data points within the same

cluster?

• Extracluster Variability (EV)– How different are the data points in distinct clusters?

• Goal: Minimize IV while maximizing EV

13

Page 14: Machine Learning Queens College Lecture 7: Clustering

Degenerate Clustering Solutions

14

Page 15: Machine Learning Queens College Lecture 7: Clustering

Degenerate Clustering Solutions

15

Page 16: Machine Learning Queens College Lecture 7: Clustering

Two approaches to clustering

• Hierarchical– Either merge or split clusters.

• Partitional– Divide the space into a fixed number of

regions and position their boundaries appropriately

16

Page 17: Machine Learning Queens College Lecture 7: Clustering

Hierarchical Clustering

17

Page 18: Machine Learning Queens College Lecture 7: Clustering

Hierarchical Clustering

18

Page 19: Machine Learning Queens College Lecture 7: Clustering

Hierarchical Clustering

19

Page 20: Machine Learning Queens College Lecture 7: Clustering

Hierarchical Clustering

20

Page 21: Machine Learning Queens College Lecture 7: Clustering

Hierarchical Clustering

21

Page 22: Machine Learning Queens College Lecture 7: Clustering

Hierarchical Clustering

22

Page 23: Machine Learning Queens College Lecture 7: Clustering

Hierarchical Clustering

23

Page 24: Machine Learning Queens College Lecture 7: Clustering

Hierarchical Clustering

24

Page 25: Machine Learning Queens College Lecture 7: Clustering

Hierarchical Clustering

25

Page 26: Machine Learning Queens College Lecture 7: Clustering

Hierarchical Clustering

26

Page 27: Machine Learning Queens College Lecture 7: Clustering

Hierarchical Clustering

27

Page 28: Machine Learning Queens College Lecture 7: Clustering

Hierarchical Clustering

28

Page 29: Machine Learning Queens College Lecture 7: Clustering

Agglomerative Clustering

29

Page 30: Machine Learning Queens College Lecture 7: Clustering

Agglomerative Clustering

30

Page 31: Machine Learning Queens College Lecture 7: Clustering

Agglomerative Clustering

31

Page 32: Machine Learning Queens College Lecture 7: Clustering

Agglomerative Clustering

32

Page 33: Machine Learning Queens College Lecture 7: Clustering

Agglomerative Clustering

33

Page 34: Machine Learning Queens College Lecture 7: Clustering

Agglomerative Clustering

34

Page 35: Machine Learning Queens College Lecture 7: Clustering

Agglomerative Clustering

35

Page 36: Machine Learning Queens College Lecture 7: Clustering

Agglomerative Clustering

36

Page 37: Machine Learning Queens College Lecture 7: Clustering

Agglomerative Clustering

37

Page 38: Machine Learning Queens College Lecture 7: Clustering

Agglomerative Clustering

38

Page 39: Machine Learning Queens College Lecture 7: Clustering

Dendrogram

39

Page 40: Machine Learning Queens College Lecture 7: Clustering

K-Means

• K-Means clustering is a partitional clustering algorithm.– Identify different partitions of the space for a

fixed number of clusters– Input: a value for K – the number of clusters– Output: K cluster centroids.

40

Page 41: Machine Learning Queens College Lecture 7: Clustering

K-Means

41

Page 42: Machine Learning Queens College Lecture 7: Clustering

K-Means Algorithm

• Given an integer K specifying the number of clusters

• Initialize K cluster centroids– Select K points from the data set at random– Select K points from the space at random

• For each point in the data set, assign it to the cluster center it is closest to

• Update each centroid based on the points that are assigned to it

• If any data point has changed clusters, repeat42

Page 43: Machine Learning Queens College Lecture 7: Clustering

How does K-Means work?

• When an assignment is changed, the distance of the data point to its assigned cluster is reduced.– IV is lower

• When a cluster centroid is moved, the mean distance from the centroid to its data points is reduced– IV is lower.

• At convergence, we have found a local minimum of IV

43

Page 44: Machine Learning Queens College Lecture 7: Clustering

K-means Clustering

44

Page 45: Machine Learning Queens College Lecture 7: Clustering

K-means Clustering

45

Page 46: Machine Learning Queens College Lecture 7: Clustering

K-means Clustering

46

Page 47: Machine Learning Queens College Lecture 7: Clustering

K-means Clustering

47

Page 48: Machine Learning Queens College Lecture 7: Clustering

K-means Clustering

48

Page 49: Machine Learning Queens College Lecture 7: Clustering

K-means Clustering

49

Page 50: Machine Learning Queens College Lecture 7: Clustering

K-means Clustering

50

Page 51: Machine Learning Queens College Lecture 7: Clustering

K-means Clustering

51

Page 52: Machine Learning Queens College Lecture 7: Clustering

Potential Problems with K-Means

• Optimal?– Is the K-means solution optimal?– K-means approaches a local minimum, but

this is not guaranteed to be globally optimal.

• Consistent?– Different seed centroids can lead to different

assignments.

52

Page 53: Machine Learning Queens College Lecture 7: Clustering

Suboptimal K-means Clustering

53

Page 54: Machine Learning Queens College Lecture 7: Clustering

Inconsistent K-means Clustering

54

Page 55: Machine Learning Queens College Lecture 7: Clustering

Inconsistent K-means Clustering

55

Page 56: Machine Learning Queens College Lecture 7: Clustering

Inconsistent K-means Clustering

56

Page 57: Machine Learning Queens College Lecture 7: Clustering

Inconsistent K-means Clustering

57

Page 58: Machine Learning Queens College Lecture 7: Clustering

Soft K-means

• In K-means, each data point is forced to be a member of exactly one cluster.

• What if we relax this constraint?

58

Page 59: Machine Learning Queens College Lecture 7: Clustering

Soft K-means

• Still define a cluster by a centroid, but now we calculate a centroid as a weighted center of all data points.

• Now convergence is based on a stopping threshold rather than changing assignments

59

Page 60: Machine Learning Queens College Lecture 7: Clustering

Another problem for K-Means

60

Page 61: Machine Learning Queens College Lecture 7: Clustering

Next Time

• Evaluation for Classification and Clustering– How do you know if you’re doing well

61