The Clustering Problem Yongsub Lim Applied Algorithm Laboratory KAIST

Post on 31-Dec-2015

Page 1: The Clustering Problem

The Clustering Problem

Yongsub Lim

Applied Algorithm Laboratory

KAIST

Page 2: The Clustering Problem

04/19/2023 The Clustering Problem 2

Contents

• The Clustering Problem

• Basic Algorithms
  • K-Means
  • K-Clustering of Max. Spacing

• Two-Phase Algorithms

• Other Algorithms

Page 3: The Clustering Problem

The Clustering Problem

• Given data, the goal is to discover “meaningful” groups

• Data in the same group are similar, and

• Data in different groups are not similar

Page 4: The Clustering Problem

Example of clustering

[Figure: scatter plot of the data on axes x1 and x2]

Page 7: The Clustering Problem

Applications of Clustering

• Image segmentation can be viewed as clustering the pixels of an image

• In unsupervised learning, before building a decision rule, we group unlabeled training data through clustering

Page 8: The Clustering Problem

Applications of Clustering

• In a network or graph, we can group vertices so that the vertices within one group are highly connected

• Clustering is also useful in biology to classify genes

Page 9: The Clustering Problem

Basic Algorithms

• Two algorithms will be introduced

• K-Means iteratively computes the centers of K clusters

• K-Clustering of Max. Spacing uses a minimum spanning tree

• The objective functions of the two are different

Page 10: The Clustering Problem

K-Means

• Choose the means of the K clusters randomly

• At each iteration,
  • Assign every data point to the cluster whose mean is the nearest among the K means
  • Re-compute the means of all clusters
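
The two steps above can be sketched in plain Python. This is a minimal illustration, not the presenter's implementation: the function name `kmeans`, the `iterations` and `seed` parameters, and the choice of drawing the initial means from the data points are all assumptions.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd-style K-Means on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1 (assumption): pick K distinct data points as the initial means.
    means = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Step 2a: assign every point to the cluster with the nearest mean.
        labels = [min(range(k), key=lambda c: math.dist(p, means[c]))
                  for p in points]
        # Step 2b: re-compute each mean from its current members.
        new_means = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_means.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_means.append(means[c])  # keep the mean of an empty cluster
        if new_means == means:  # nothing moved: converged
            break
        means = new_means
    return means, labels
```

Calling `kmeans(points, 2)` on a list of 2-D tuples returns the final means and one cluster label per point.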

Page 11: The Clustering Problem

K-Means

• The objective is to minimize the sum of squared distances between each cluster center and its members

• It favors high density within each cluster
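
One common way to write this objective uses squared Euclidean distances. The symbols C_k (the k-th cluster) and μ_k (its mean) are notation introduced here, not taken from the slides.

```latex
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^{2},
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x
```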

Page 12: The Clustering Problem

K-Means Algorithm

• Worst case: the two initial centers are chosen randomly

This may not be what we want!
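
A small hand-made example shows how an unlucky start can trap K-Means in a poor clustering. The data, the starting centers, and the helper names `lloyd` and `cost` are hypothetical; the starts are fixed by hand rather than drawn randomly so that the outcome is reproducible.

```python
import math

def lloyd(points, means, iterations=100):
    """Run K-Means iterations from the given initial means."""
    k = len(means)
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest mean.
        labels = [min(range(k), key=lambda c: math.dist(p, means[c]))
                  for p in points]
        # Re-compute the means; an empty cluster keeps its old mean.
        new_means = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_means.append(
                tuple(sum(x) / len(members) for x in zip(*members))
                if members else means[c])
        if new_means == means:
            break
        means = new_means
    return means, labels

def cost(points, means, labels):
    """Sum of squared distances from each point to its cluster mean."""
    return sum(math.dist(p, means[lab]) ** 2 for p, lab in zip(points, labels))

# Four corners of a wide rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Unlucky start: both centers on the vertical mid-line.
bad_means, bad_labels = lloyd(points, [(2, 0), (2, 1)])
# Lucky start: one center on each short side.
good_means, good_labels = lloyd(points, [(0, 0), (4, 1)])

bad_cost = cost(points, bad_means, bad_labels)     # 16.0
good_cost = cost(points, good_means, good_labels)  # 1.0
```

The unlucky start converges immediately to a top/bottom split, even though the left/right split is sixteen times cheaper.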

Page 13: The Clustering Problem

K-Clustering of Max. Spacing

• Given data, find K clusters that maximize the minimum distance between any pair of clusters

• Spacing: the minimum distance between any pair of data points in different clusters
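
The definition of spacing translates directly into code. The sketch below assumes Euclidean distance and a hypothetical representation where `labels[i]` is the cluster of `points[i]`.

```python
import math
from itertools import combinations

def spacing(points, labels):
    """Minimum distance between two points that lie in different clusters."""
    return min(math.dist(points[i], points[j])
               for i, j in combinations(range(len(points)), 2)
               if labels[i] != labels[j])

points = [(0, 0), (0, 1), (5, 0), (5, 1)]
left_right = spacing(points, [0, 0, 1, 1])  # 5.0
top_bottom = spacing(points, [0, 1, 0, 1])  # 1.0
```

Of the two clusterings, the left/right split has the larger spacing, so it is the one this objective prefers.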

Page 14: The Clustering Problem

K-Clustering of Max. Spacing

Page 15: The Clustering Problem

K-Clustering of Max. Spacing

• Treat the given data as a complete graph whose edge weights are Euclidean distances

• Compute a minimum spanning tree (MST)

• Delete the K-1 most expensive edges of the MST
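
These three steps can be sketched as follows. The use of Kruskal's algorithm with a small union-find is an implementation choice made here, since the slides do not say how the MST is computed, and the function name `max_spacing_clusters` is hypothetical.

```python
import math
from itertools import combinations

def max_spacing_clusters(points, k):
    """K-clustering by deleting the K-1 heaviest edges of an MST."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Complete graph: every pair of points, weighted by Euclidean distance.
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))

    # Kruskal: keep an edge when it joins two different components.
    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))

    # mst is sorted by weight, so dropping its tail deletes the
    # K-1 most expensive edges; then rebuild the components.
    kept = mst[:len(mst) - (k - 1)]
    parent = list(range(n))
    for _, i, j in kept:
        parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]
```

Building the complete graph takes quadratic space, which is fine for a sketch but would need a geometric MST method for large data sets.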

Page 16: The Clustering Problem

K-Clustering of Max. Spacing

[Figure: clusterings Calg and Copt, with a distance marked as ≤ spacing of Calg]
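
The standard optimality argument for this algorithm, which the figure presumably illustrates (an assumption, since only its labels survive), can be summarized as one chain of inequalities. The symbols C', d*, p, q, p', q' are notation introduced here.

```latex
% C'   : any K-clustering different from C_alg
% d^*  : weight of the cheapest deleted MST edge, so spacing(C_alg) = d^*
% p, q : two points in the same cluster of C_alg but in different clusters of C'
% (p', q') : an edge on the MST path from p to q whose endpoints lie in
%            different clusters of C'; it was kept, so its weight is at most d^*
\operatorname{spacing}(C') \;\le\; d(p', q') \;\le\; d^{*} \;=\; \operatorname{spacing}(C_{\text{alg}})
```

Such a pair p, q must exist: both clusterings have K non-empty clusters, so if C' differs from Calg, some cluster of Calg is split by C'.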

Page 17: The Clustering Problem

K-Clustering of Max. Spacing

• It involves no randomness

• Its objective seems better, or more reasonable, than that of K-Means

Page 18: The Clustering Problem

K-Means vs. Max. Spacing

• Good clustering is
  • High density in one cluster (K-Means)
  • Long distance between clusters (Max. Spacing)

Page 19: The Clustering Problem

K-Means vs. Max. Spacing

Page 20: The Clustering Problem

Two-Phase Algorithms

• Two algorithms will be introduced

• In the first phase, both cluster the data without any restriction on K

• In the second phase, if the number of clusters is larger than K, the clusters are merged using Max. Spacing

Page 21: The Clustering Problem

Hierarchical EMST

Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms

Page 22: The Clustering Problem

Hierarchical EMST

• HEMST removes all MST edges whose weight is greater than a threshold (mean + standard deviation of the edge weights)

• If the number of clusters is less than the given K, it proceeds in the same way as Max. Spacing

• If not, it runs Max. Spacing on a set of representative points, one per cluster, each being the point nearest to the center of its cluster

Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms
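
The thresholding step in the first bullet can be sketched as below. This covers only that step, not the full HEMST procedure. The function names are hypothetical, Prim's algorithm is an implementation choice, and using the population standard deviation is an assumption because the slide does not say which variant is meant.

```python
import math
import statistics

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (w, i, j)."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        w, i, j = min((math.dist(points[i], points[j]), i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        in_tree.add(j)
        edges.append((w, i, j))
    return edges

def hemst_first_cut(points):
    """Remove MST edges heavier than mean + std; return cluster labels."""
    edges = mst_edges(points)
    weights = [w for w, _, _ in edges]
    # Assumption: population standard deviation (statistics.pstdev).
    threshold = statistics.mean(weights) + statistics.pstdev(weights)
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    # Re-join only the edges that survive the threshold.
    for w, i, j in edges:
        if w <= threshold:
            parent[find(i)] = find(j)
    roots = [find(i) for i in range(len(points))]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]
```

On two tight groups joined by one long MST edge, only that long edge exceeds the threshold, so the groups come out as two clusters without K being specified.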

Page 27: The Clustering Problem

Modified K-Means Process

• The Modified K-Means Process, in its first phase, is similar to K-Means

• The difference is that if a data point is far enough from all clusters, it becomes the center of a new cluster

• While running, if the number of clusters exceeds a threshold, the two nearest clusters are merged

• In the second phase, apply Max. Spacing

M.F. Jiang, S.S. Tseng, C.M. Su, Two-phase clustering process for outliers detection
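
A rough single-pass sketch of the first phase, following only the description on this slide. It is not the procedure from the cited paper: the parameters `far` (distance beyond which a point starts a new cluster) and `max_clusters` (the merge threshold), as well as the function name, are hypothetical.

```python
import math

def modified_kmeans_pass(points, far, max_clusters):
    """One pass over the data in the spirit of the slide's first phase.

    Returns a list of clusters, each a list of member points.
    """
    def center(cluster):
        return tuple(sum(x) / len(cluster) for x in zip(*cluster))

    clusters = []
    for p in points:
        centers = [center(c) for c in clusters]
        nearest = min(range(len(centers)),
                      key=lambda c: math.dist(p, centers[c]), default=None)
        if nearest is None or math.dist(p, centers[nearest]) > far:
            clusters.append([p])         # far from every cluster: new center
        else:
            clusters[nearest].append(p)  # otherwise join the nearest cluster
        if len(clusters) > max_clusters:
            # Too many clusters: merge the two whose centers are nearest.
            centers = [center(c) for c in clusters]
            a, b = min(((i, j) for i in range(len(centers))
                        for j in range(i + 1, len(centers))),
                       key=lambda ij: math.dist(centers[ij[0]], centers[ij[1]]))
            clusters[a].extend(clusters.pop(b))
    return clusters
```

Because a distant point opens its own cluster instead of dragging an existing mean toward it, isolated points tend to end up in very small clusters.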

Page 28: The Clustering Problem

Modified K-Means Process

• This scheme can identify outliers by using Max. Spacing

M.F. Jiang, S.S. Tseng, C.M. Su, Two-phase clustering process for outliers detection

Page 31: The Clustering Problem

Two-Phase Algorithms

• Both give more weight to the members of small sets in the first phase

• A small set most likely consists of data that belong together, so it is reasonable to decrease the distances between its members

Page 32: The Clustering Problem

Other Algorithms

• HCS (Highly Connected Subgraphs) uses the min-cut of a graph

• It recursively separates the data into two disjoint subsets (along a min-cut) until all clusters are highly connected

• A graph is highly connected if the minimum number of edges whose removal disconnects the graph is greater than |V|/2

Erez Hartuv, Ron Shamir, A clustering algorithm based on graph connectivity
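
The "highly connected" test can be checked directly on tiny graphs. The sketch below finds the minimum cut by brute force over all vertex bipartitions, which is exponential and only meant to illustrate the definition; a real implementation would use a proper min-cut algorithm. The function names are hypothetical, and the graph is assumed to have at least two vertices.

```python
from itertools import combinations

def edge_connectivity(vertices, edges):
    """Minimum number of edges whose removal disconnects the graph.

    Brute force over all bipartitions; assumes at least two vertices.
    """
    vertices = list(vertices)
    first, rest = vertices[0], vertices[1:]
    best = None
    # Fix `first` on one side so each bipartition is counted once;
    # sizes stop short of len(rest) so the other side is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            side = {first, *extra}
            cut = sum((u in side) != (v in side) for u, v in edges)
            best = cut if best is None else min(best, cut)
    return best

def is_highly_connected(vertices, edges):
    """Slide's criterion: edge connectivity greater than |V| / 2."""
    return edge_connectivity(vertices, edges) > len(vertices) / 2

k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]  # complete graph
c4 = [(0, 1), (1, 2), (2, 3), (3, 0)]                  # 4-cycle
```

Here `is_highly_connected(range(4), k4)` holds because three edges must be removed, while the 4-cycle fails because two removals suffice.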

Page 33: The Clustering Problem

Other Algorithms

• Voting

• Apply K-Means N times

• If a pair of data points fell in the same cluster more than a threshold of t times, they are grouped

Ana L.N. Fred, Anil K. Jain, Data Clustering Using Evidence Accumulation
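
The voting step can be sketched as follows, taking the N labelings as input so the example stays deterministic. The function name, the sample labelings, and the choice to group transitively (via union-find) are assumptions made for illustration.

```python
from itertools import combinations

def vote_clusters(labelings, t):
    """Group points that share a cluster in more than `t` of the labelings."""
    n = len(labelings[0])
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        # Number of runs in which points i and j fell in the same cluster.
        votes = sum(labels[i] == labels[j] for labels in labelings)
        if votes > t:
            parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

# Hypothetical labelings standing in for N = 3 K-Means runs on 5 points.
runs = [[0, 0, 1, 1, 1],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 1, 1]]
groups = vote_clusters(runs, t=1)  # [0, 0, 1, 1, 1]
```

Only co-membership is counted, so the arbitrary numbering of clusters in each run does not matter.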

Page 34: The Clustering Problem

Thanks