The Clustering Problem
Yongsub Lim
Applied Algorithm Laboratory
KAIST
04/19/2023 The Clustering Problem 2
Contents
• The Clustering Problem
• Basic Algorithms
  K-Means
  K-Clustering of Max. Spacing
• Two-Phase Algorithms
• Other Algorithms
The Clustering Problem
• Given data, the goal is to discover “meaningful” groups
• Data in the same group are similar, and
• Data in different groups are not similar
Example of clustering
[Figure, shown over three slides: example data plotted on axes x1 and x2]
Applications of Clustering
• The image segmentation problem can be considered as a clustering of pixels of an image
• In unsupervised learning, before building a decision rule, we group unlabeled training data through clustering
Applications of Clustering
• In a network or a graph, we can group vertices that are highly connected to one another
• Clustering is also useful in biology to classify genes
Basic Algorithms
• Two algorithms will be introduced
• K-Means iteratively computes the centers of K clusters
• K-Clustering of Max. Spacing uses a minimum spanning tree
• The objective functions of these two algorithms are different
K-Means
• Choose the means of the K clusters randomly
• At each iteration,
  Assign every data point to the cluster whose mean is the nearest among the K means
  Re-compute the means of all clusters
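The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `k_means`, the `iterations` and `seed` parameters, and the choice to seed the means with K randomly picked data points are all assumptions made for the example.

```python
import math
import random


def k_means(points, k, iterations=100, seed=0):
    """Plain K-Means on a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    # Assumption: the random initial means are K distinct data points.
    means = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assign every point to the cluster whose mean is nearest.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, means[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Re-compute the mean of every cluster from its current members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return means, labels
```

Calling `k_means(points, 2)` on two well-separated blobs returns one mean per blob and a cluster index for every point. The result can depend on the random start.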
K-Means
• The objective is to minimize the sum of squared distances between each cluster's center and its members
• It clusters for high density within each cluster
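Written out in the usual squared-Euclidean form, the K-Means objective looks as follows. The symbols C_k (the k-th cluster) and \mu_k (its mean) are notation introduced here for illustration and do not appear on the slides.

```latex
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x
```

Re-computing each mean as the average of its members is the step that lowers this sum for a fixed assignment.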
K-Means Algorithm
• Worst case
  Two initial centers, randomly chosen
  This may not be what we want!
K-Clustering of Max. Spacing
• Given data, find K clusters that maximize the minimum distance between any pair of clusters
• Spacing: the minimum distance between any pair of data points in different clusters
K-Clustering of Max. Spacing
• Treat the given data as a complete graph whose edge weights are Euclidean distances
• Compute a minimum spanning tree (MST)
• Delete the K-1 most expensive edges of the MST
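A minimal sketch of these steps, assuming points are coordinate tuples and 1 ≤ K ≤ number of points. Instead of building the full MST and then deleting edges, it runs Kruskal's algorithm and stops once K components remain, which leaves the same clusters. The name `max_spacing_clusters` and the returned `spacing` value are choices made for this example.

```python
import math
from itertools import combinations


def max_spacing_clusters(points, k):
    """Kruskal-style K-clustering of maximum spacing: returns (labels, spacing)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Complete graph: one edge per pair, weighted by Euclidean distance.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    components, spacing = n, None
    for weight, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # both ends are already in the same cluster
        if components == k:
            spacing = weight  # shortest edge between two different clusters
            break
        parent[ri] = rj  # accept the edge, as Kruskal's algorithm would
        components -= 1
    roots = sorted({find(i) for i in range(n)})
    labels = [roots.index(find(i)) for i in range(n)]
    return labels, spacing
```

With K = 1 there is no pair of clusters, so `spacing` is returned as `None`.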
K-Clustering of Max. Spacing
[Figure: clusterings Calg and Copt, with the bound "≤ spacing of Calg"]
K-Clustering of Max. Spacing
• It involves no randomness
• Its objective seems better, or more reasonable, than that of K-Means
K-Means vs. Max. Spacing
• A good clustering has
  High density within each cluster (K-Means)
  Long distance between clusters (Max. Spacing)
Two-Phase Algorithms
• Two algorithms will be introduced
• In the first phase, both cluster the data without any restriction on K
• In the second phase, if the number of clusters is larger than K, clusters are merged using Max. Spacing
Hierarchical EMST
Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms
• HEMST removes all edges whose weights are greater than a threshold (mean + standard deviation of the edge weights)
• If the number of clusters is less than the given K, it proceeds in the same way as Max. Spacing
• If not, it runs Max. Spacing on a reduced data set made up of, for each cluster, the point nearest to the center of that cluster
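The thresholding step in the first bullet might look like the sketch below. It covers only that step, not the full HEMST procedure, and it rests on assumptions not stated on the slide: the edges are those of a Euclidean MST built with Prim's algorithm, the standard deviation is the population one (`statistics.pstdev`), and there are at least two points. The function names are hypothetical.

```python
import math
import statistics


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph: (weight, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [(math.inf, -1)] * n  # (cheapest distance to the tree, tree endpoint)
    best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] != -1:
            edges.append((best[u][0], best[u][1], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v][0]:
                    best[v] = (d, u)
    return edges


def threshold_split(points):
    """Drop MST edges heavier than mean + std; return a cluster label per point."""
    edges = mst_edges(points)
    weights = [w for w, _, _ in edges]
    # Assumption: population standard deviation of the MST edge weights.
    threshold = statistics.mean(weights) + statistics.pstdev(weights)
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for w, i, j in edges:
        if w <= threshold:  # keep only the edges at or below the threshold
            parent[find(i)] = find(j)
    roots = sorted({find(i) for i in range(len(points))})
    return [roots.index(find(i)) for i in range(len(points))]
```

The number of clusters that comes out is decided by the data, not by K, which is why a second step is needed to reach exactly K clusters.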
Modified K-Means Process
• MKP, in the first phase, is similar to K-Means
• The difference is that if a data point is far enough from all clusters, it becomes the center of a new cluster
• While running, if the number of clusters is larger than a threshold, the two nearest clusters are merged
• In the second phase, apply Max. Spacing
M.F. Jiang, S.S. Tseng, C.M. Su, Two-phase clustering process for outliers detection
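A rough sketch of the first phase as described in the bullets above, under several assumptions that are not from the slides or the cited paper: a single pass over the data, the first `k` points as seeds, a caller-supplied distance `far` deciding what "far enough" means, and cluster centers taken as the mean of the members. The function and parameter names are hypothetical.

```python
import math


def mean(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(members) for xs in zip(*members))


def modified_k_means_pass(points, k, far, max_clusters):
    """One pass over the data; returns the clusters as lists of points."""
    clusters = [[p] for p in points[:k]]  # assumption: first k points are seeds
    for p in points[k:]:
        centers = [mean(m) for m in clusters]
        nearest = min(range(len(clusters)), key=lambda c: math.dist(p, centers[c]))
        if math.dist(p, centers[nearest]) > far:
            clusters.append([p])  # far from every cluster: p starts a new one
        else:
            clusters[nearest].append(p)
        if len(clusters) > max_clusters:
            # Too many clusters: merge the two whose centers are closest.
            centers = [mean(m) for m in clusters]
            a, b = min(
                ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                key=lambda ij: math.dist(centers[ij[0]], centers[ij[1]]),
            )
            clusters[a].extend(clusters.pop(b))  # a < b, so index a stays valid
    return clusters
```

In this sketch an isolated point ends up alone in its own small cluster, which is the kind of set the second phase can then examine.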
Modified K-Means Process
• This scheme can identify outliers by using Max. Spacing
Two-Phase Algorithms
• Both give more weight to members of small sets in the first phase
• The members of a small set are the data most likely to be clustered together, so it is reasonable to decrease the distances between them
Other Algorithms
• HCS uses the min-cut of a graph
• It recursively separates the data into two disjoint subsets (by a min-cut) until all clusters are highly connected
• A graph is highly connected if the minimum number of edges whose removal disconnects the graph is greater than |V|/2
Erez Hartuv, Ron Shamir, A clustering algorithm based on graph connectivity
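The recursion can be sketched as follows. To find a global minimum cut this example uses the Stoer-Wagner algorithm, which is a choice made here and not necessarily the routine used in the cited paper. The graph is assumed to be unweighted and given as a vertex list plus a list of edge pairs. Singletons are returned as their own clusters, and the function names are hypothetical.

```python
def min_cut(vertices, weight):
    """Stoer-Wagner global minimum cut: returns (cut_value, one_side_of_the_cut)."""
    w = {u: {v: weight.get(frozenset((u, v)), 0) for v in vertices if v != u}
         for u in vertices}
    groups = {v: [v] for v in vertices}  # original vertices merged into each node
    active = list(vertices)
    best_value, best_side = float("inf"), []
    while len(active) > 1:
        start = active[0]
        conn = {v: w[start][v] for v in active[1:]}
        s, t, cut_value = start, start, 0
        while conn:
            # Add the vertex most tightly connected to the set built so far.
            nxt = max(conn, key=conn.get)
            cut_value = conn.pop(nxt)
            for v in conn:
                conn[v] += w[nxt][v]
            s, t = t, nxt
        if cut_value < best_value:  # cut separating the last vertex t
            best_value, best_side = cut_value, list(groups[t])
        groups[s].extend(groups[t])  # merge t into s for the next phase
        for v in active:
            if v != s and v != t:
                w[s][v] += w[t][v]
                w[v][s] = w[s][v]
        active.remove(t)
    return best_value, best_side


def hcs(vertices, edges):
    """Split by minimum cuts until every part is highly connected."""
    vertices = list(vertices)
    if len(vertices) < 2:
        return [vertices]
    inside = set(vertices)
    weight = {frozenset(e): 1 for e in edges if set(e) <= inside}
    cut_value, side = min_cut(vertices, weight)
    if cut_value > len(vertices) / 2:  # highly connected: stop splitting
        return [vertices]
    rest = [v for v in vertices if v not in side]
    return hcs(side, edges) + hcs(rest, edges)
```

For two complete graphs on four vertices joined by a single edge, the first cut removes that edge (1 ≤ 8/2), and each half is then kept because its minimum cut of 3 exceeds 4/2.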
Other Algorithms
• Voting
• Apply K-Means N times
• If a pair of data points belonged to the same cluster more than a threshold t times, they are grouped
Ana L.N. Fred, Anil K. Jain, Data Clustering Using Evidence Accumulation
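A small sketch of the voting step, assuming the N K-Means runs have already been done and their results are available as lists of cluster indices, one list per run. Grouping is taken to be transitive here (if a is grouped with b and b with c, all three share a cluster), which is an assumption of this example. The function name `vote` is hypothetical.

```python
from itertools import combinations


def vote(labelings, t):
    """Group points that share a cluster in more than t of the given labelings."""
    n = len(labelings[0])
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        # Count the runs in which points i and j landed in the same cluster.
        together = sum(1 for labels in labelings if labels[i] == labels[j])
        if together > t:
            parent[find(i)] = find(j)
    roots = sorted({find(i) for i in range(n)})
    return [roots.index(find(i)) for i in range(n)]
```

Because only co-membership is counted, the cluster indices do not need to agree between runs: a run that labels the groups (1, 0) counts the same as one that labels them (0, 1).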
Thanks