A dissimilarity measure for the K-Modes clustering algorithm

Presenter: Bo-Sheng Wang
Authors: Fuyuan Cao, Jiye Liang, Deyu Li, Liang Bai, Chuangyin Dang

KBS, 2012

Outlines

• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments

Motivation

• In this paper, the limitations of the simple matching dissimilarity measure and Ng’s dissimilarity measure are revealed using some illustrative examples.

Limitations of simple matching dissimilarity measure

• Simple matching is a common approach. The simple matching dissimilarity measure is defined as:

$$\delta(x, y) = \begin{cases} 1, & \text{if } x \neq y \\ 0, & \text{otherwise} \end{cases}$$

• However, simple matching often results in:
– Weak intrasimilarity.
– Disregard of the similarity hidden between categorical values.
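The simple matching measure can be sketched in a few lines. This is a minimal illustration, assuming each object is a tuple of categorical values and that the per-attribute values are summed over all attributes; the function name and the sample objects are hypothetical.

```python
from typing import Sequence

def simple_matching(x: Sequence[str], y: Sequence[str]) -> int:
    """Count the attributes on which two categorical objects differ."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(1 for xa, ya in zip(x, y) if xa != ya)

# Hypothetical objects described by three categorical attributes.
a = ("red", "small", "round")
b = ("red", "large", "round")
c = ("blue", "large", "square")

print(simple_matching(a, b))  # 1
print(simple_matching(a, c))  # 3
```

Every mismatch costs the same regardless of how the values are distributed in the data, which is the limitation noted above.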

Limitations of Ng’s dissimilarity measure

• For the k-Modes algorithm with Ng’s dissimilarity measure, the simple matching dissimilarity measure is still used in the first iteration.

– Disregards the similarity hidden between categorical values.

Objectives

• Based on the idea of biological and genetic taxonomy and the rough membership function, a new dissimilarity measure for the k-Modes algorithm is defined.

• The dissimilarity measure between a mode of a cluster and an object is given by improving Ng’s dissimilarity measure.

Methodology

• Review some basic concepts of rough set theory.
– Definition 1: Categorical information system IS = (U, A, V, f)
– Definition 2: Binary relation IND(P)
– Definition 3: The rough membership function $\mu_X^P: U \to [0, 1]$
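As a rough illustration of these concepts, assuming the standard rough-set definitions: two objects are indiscernible with respect to P when they agree on every attribute in P, [x]_P is the set of objects indiscernible from x, and the rough membership of x in a set X ⊆ U is |[x]_P ∩ X| / |[x]_P|. The table and the helper names below are hypothetical.

```python
from fractions import Fraction

# Hypothetical categorical information system: object id -> attribute values.
TABLE = {
    "x1": {"color": "red", "size": "small"},
    "x2": {"color": "red", "size": "small"},
    "x3": {"color": "red", "size": "large"},
    "x4": {"color": "blue", "size": "large"},
}

def equivalence_class(table, x, attrs):
    """Return [x]_P: all objects that agree with x on every attribute in attrs."""
    key = tuple(table[x][a] for a in attrs)
    return {u for u, row in table.items() if tuple(row[a] for a in attrs) == key}

def rough_membership(table, x, attrs, target):
    """Return |[x]_P ∩ X| / |[x]_P| as an exact fraction in [0, 1]."""
    cls = equivalence_class(table, x, attrs)
    return Fraction(len(cls & set(target)), len(cls))

print(sorted(equivalence_class(TABLE, "x1", ["color"])))  # ['x1', 'x2', 'x3']
print(rough_membership(TABLE, "x1", ["color"], {"x1", "x2"}))  # 2/3
print(rough_membership(TABLE, "x1", ["color", "size"], {"x1", "x2"}))  # 1
```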

Methodology - A new dissimilarity measure between two objects

• Definition 4: A similarity measure between objects x and y with respect to a.

• Definition 5: The dissimilarity measure between x and y with respect to P.

• Example: A new dissimilarity measure between two objects
– Simple matching dissimilarity measure
– New dissimilarity measure

Methodology - A new dissimilarity measure between a mode and an object

• Ng’s dissimilarity measure
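A minimal sketch, assuming the commonly cited form of Ng's measure: for each attribute, the distance is 1 when the object's value differs from the mode's value, and 1 minus the fraction of cluster members that share the mode's value when the two match. The function name and the sample cluster are hypothetical, and the cluster is assumed to be non-empty.

```python
from fractions import Fraction

def ng_dissimilarity(mode, obj, cluster):
    """Sum per-attribute distances between a cluster mode and an object.

    A mismatch costs 1; a match costs 1 - (share of cluster members that
    carry the mode's value on that attribute).
    """
    total = Fraction(0)
    for j, (z, x) in enumerate(zip(mode, obj)):
        if z != x:
            total += 1
        else:
            same = sum(1 for member in cluster if member[j] == z)
            total += 1 - Fraction(same, len(cluster))
    return total

# Hypothetical cluster of four objects with two categorical attributes.
cluster = [("red", "small"), ("red", "small"), ("red", "large"), ("blue", "small")]
mode = ("red", "small")

print(ng_dissimilarity(mode, ("red", "small"), cluster))  # 1/2
print(ng_dissimilarity(mode, ("red", "large"), cluster))  # 5/4
```

Because the frequencies come from the current cluster, they are not available before the first assignment. This is consistent with the earlier remark that simple matching is still used in the first iteration.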

• Definition 7: The new dissimilarity measure between $x_i$ and $z_l$ with respect to P.

• Example: A new dissimilarity measure between a mode and an object
– Ng’s dissimilarity measure
– New dissimilarity measure

Methodology - Convergence and complexity analysis

• The objective of clustering a set of n = |U| objects into k clusters is to find W and Z that minimize the total dissimilarity between the objects and the modes of their clusters.
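For reference, the usual k-Modes objective can be written as follows. This assumes the standard formulation, in which W = [w_{li}] is a k × n 0/1 partition matrix, Z = {z_1, ..., z_k} is the set of modes, and d is whichever dissimilarity measure is in use; the symbol F is an assumption, not taken from the slide.

```latex
F(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}\, d(z_l, x_i),
\qquad \text{subject to } w_{li} \in \{0, 1\}, \quad
\sum_{l=1}^{k} w_{li} = 1 \ \ (1 \le i \le n).
```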

• This process can be formulated as the k-Modes algorithm, which alternately updates W and Z.
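The alternating scheme can be sketched as below. This is a generic k-Modes loop under stated assumptions, not the paper's exact procedure: it uses simple matching as a stand-in for the dissimilarity measure, takes the initial modes as input, and stops when the assignments no longer change. All names are hypothetical.

```python
from collections import Counter

def simple_matching(x, y):
    """Number of attributes on which x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def k_modes(objects, initial_modes, dissimilarity=simple_matching, max_iter=100):
    """Alternate between assigning objects (W) and updating modes (Z)."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        # Fix Z, update W: assign each object to its nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda l: dissimilarity(modes[l], x))
            for x in objects
        ]
        if new_labels == labels:
            break  # assignments did not change, so stop
        labels = new_labels
        # Fix W, update Z: take the most frequent value of each attribute.
        for l in range(len(modes)):
            members = [x for x, lab in zip(objects, labels) if lab == l]
            if members:  # keep the old mode if a cluster becomes empty
                modes[l] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    return labels, modes

# Hypothetical data set with two categorical attributes.
objects = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("a", "x")]
labels, modes = k_modes(objects, initial_modes=[("a", "x"), ("b", "z")])
print(labels)  # [0, 0, 1, 1, 0]
print(modes)   # [('a', 'x'), ('b', 'z')]
```

A measure that depends on cluster frequencies, such as Ng's, would also need the current cluster members passed to `dissimilarity`; the two-argument signature here is a simplification.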

• Now we consider the convergence of the k-Modes algorithm with the proposed dissimilarity measure $NDis_P(z_l, x_i)$.

• Proof. For a given W, we have:

Experiments

• Evaluation on scalability
• Evaluation on clustering efficiency

Conclusions

• The new measure unifies the dissimilarity between two objects and the dissimilarity between an object and a mode.

• The k-Modes algorithm using the new dissimilarity measure can be safely and effectively used in case of large data sets.

• The results of experiments using synthetic data sets and five real data sets from UCI show the effectiveness of the new dissimilarity measure.

Comments

• Advantages
– The method can save some time.

• Applications
– Dissimilarity measure
