
Clustering Validity Checking Methods: Part II

Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis
Department of Informatics,
Athens University of Economics & Business
Email: {mhalk, yannis, mvazirg}@aueb.gr

ABSTRACT

Clustering results validation is an important topic in the context of pattern recognition. We review approaches and systems in this context. In the first part of this paper we presented clustering validity checking approaches based on internal and external criteria. In the second, current part, we present a review of clustering validity approaches based on relative criteria. Also we discuss the results of an experimental study based on widely known validity indices. Finally the paper illustrates the issues that are under-addressed by the recent approaches and proposes the research directions in the field.

Keywords: clustering validation, pattern discovery, unsupervised learning.

1. INTRODUCTION

Clustering is perceived as an unsupervised process since there are no predefined classes and no examples that would show what kind of desirable relations should be valid among the data. As a consequence, the final partitions of a data set require some sort of evaluation in most applications [15]. For instance, questions like "how many clusters are there in the data set?", "does the resulting clustering scheme fit our data set?", "is there a better partitioning for our data set?" call for clustering results validation and are the subjects of a number of methods discussed in the literature. They aim at the quantitative evaluation of the results of the clustering algorithms and are known under the general term cluster validity methods.

In Part I of the paper we introduced the fundamental concepts of cluster validity and we presented a review of clustering validity indices that are based on external and internal criteria. As mentioned in Part I, these approaches are based on statistical tests and their major drawback is their high computational cost. Moreover, the indices related to these approaches aim at measuring the degree to which a data set confirms an a-priori specified scheme. On the other hand, clustering validity approaches which are based on relative criteria aim at finding the best clustering scheme that a clustering algorithm can define under certain assumptions and parameters. Here the basic idea is the evaluation of a clustering structure by comparing it to other clustering schemes, resulting from the same algorithm but with different parameter values.

The remainder of the paper is organized as follows. In the next section we discuss the fundamental concepts and representative indices for validity approaches based on relative criteria. In Section 3 an experimental study based on some of these validity indices is presented, using synthetic and real data sets. We conclude in Section 4 by summarizing and providing the trends in cluster validity.

2. Relative Criteria

The basis of the above described validation methods is statistical testing. Thus, the major drawback of techniques based on internal or external criteria is their high computational demands. A different validation approach is discussed in this section. It is based on relative criteria and does not involve statistical tests. The fundamental idea of this approach is to choose the best clustering scheme of a set of defined schemes according to a pre-specified criterion. More specifically, the problem can be stated as follows:

"Let P_alg be the set of parameters associated with a specific clustering algorithm (e.g. the number of clusters nc). Among the clustering schemes C_i, i = 1, ..., nc, defined by a specific algorithm, for different values of the parameters in P_alg, choose the one that best fits the data set."

Then, we can consider the following cases of the problem:

I) P_alg does not contain the number of clusters, nc, as a parameter. In this case, the choice of the optimal parameter values is described as follows: We run the algorithm for a wide range of its parameters' values and we choose the largest range for which nc remains constant (usually nc << N, the number of tuples). Then we choose as appropriate values of the P_alg parameters the values that correspond to the middle of this range. Also, this procedure identifies the number of clusters that underlie our data set.

II) P_alg contains nc as a parameter. The procedure of identifying the best clustering scheme is based on a validity index. Selecting a suitable performance index, q, we proceed with the following steps:

• The clustering algorithm is run for all values of nc between a minimum nc,min and a maximum nc,max. The minimum and maximum values have been defined a-priori by the user.

• For each of the values of nc, the algorithm is run r times, using a different set of values for the other parameters of the algorithm (e.g. different initial conditions).

• The best value of the index q obtained for each nc is plotted as a function of nc.

Based on this plot we may identify the best clustering scheme. We have to stress that there are two approaches for defining the best clustering, depending on the behaviour of q with respect to nc. Thus, if the validity index does not exhibit an increasing or decreasing trend as nc increases, we seek the maximum (minimum) of the plot. On the other hand, for indices that increase (or decrease) as the number of clusters increases, we search for the values of nc at which a significant local change in the value of the index occurs. This change appears as a "knee" in the plot and it is an indication of the number of clusters underlying the dataset. Moreover, the absence of a knee may be an indication that the data set possesses no clustering structure.
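
To make the procedure of case II concrete, the following is a minimal sketch in Python, assuming K-Means as the clustering algorithm and a user-supplied validity index; the names best_scheme and validity_q are illustrative, not from the paper.

    import numpy as np
    from sklearn.cluster import KMeans

    def best_scheme(data, validity_q, nc_min=2, nc_max=8, r=10, maximize=True):
        """Keep the best value of index q over r runs for each nc."""
        best_q = {}
        for nc in range(nc_min, nc_max + 1):
            runs = []
            for seed in range(r):  # r runs with different initial conditions
                km = KMeans(n_clusters=nc, n_init=1, random_state=seed).fit(data)
                runs.append(validity_q(data, km.labels_, km.cluster_centers_))
            best_q[nc] = max(runs) if maximize else min(runs)
        return best_q  # plot versus nc; look for the max/min or a "knee"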

In the sequel, some representative validity indices for crisp and fuzzy clustering are presented.

2.1 Crisp clustering

Crisp clustering considers non-overlapping partitions, meaning that a data point either belongs to a class or not. This section discusses validity indices suitable for crisp clustering.

2.1.1 The modified Hubert Γ statistic

The definition of the modified Hubert Γ statistic [18] is given by the equation

    Γ = (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} P(i, j) · Q(i, j)    (1)

where N is the number of objects in a dataset, M = N(N−1)/2, P is the proximity matrix of the data set and Q is an N×N matrix whose (i, j) element is equal to the distance between the representative points (v_ci, v_cj) of the clusters where the objects x_i and x_j belong. Similarly, we can define the normalized Hubert Γ statistic, given by the equation

    Γ̂ = [(1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} (P(i, j) − μ_P)(Q(i, j) − μ_Q)] / (σ_P · σ_Q)    (2)

If d(v_ci, v_cj) is close to d(x_i, x_j) for i, j = 1, 2, ..., N, P and Q will be in close agreement, and the values of Γ and Γ̂ (normalized Γ) will be high. Thus, a high value of Γ (Γ̂) indicates the existence of compact clusters, and in the plot of normalized Γ versus nc we seek a significant knee that corresponds to a significant increase of normalized Γ. The number of clusters at which the knee occurs is an indication of the number of clusters underlying the data. We note that for nc = 1 and nc = N the index is not defined.
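
A minimal sketch of both statistics (eqs. 1-2), assuming the proximity matrix P holds pairwise distances and Q holds distances between the representatives of the clusters the two points belong to:

    import numpy as np
    from scipy.spatial.distance import cdist

    def hubert_gamma(data, labels, centers, normalized=False):
        P = cdist(data, data)                        # proximity matrix of the data
        Q = cdist(centers[labels], centers[labels])  # distances between representatives
        iu = np.triu_indices(len(data), k=1)         # i < j pairs, M = N(N-1)/2 of them
        p, q = P[iu], Q[iu]
        if normalized:
            return np.mean((p - p.mean()) * (q - q.mean())) / (p.std() * q.std())
        return np.mean(p * q)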

2.1.2 Dunn family of indices

A cluster validity index for crisp clustering proposed in [5] aims at the identification of "compact and well separated clusters". The index is defined in equation (3) for a specific number of clusters nc:

    D_nc = min_{i=1,...,nc} { min_{j=i+1,...,nc} [ d(c_i, c_j) / max_{k=1,...,nc} diam(c_k) ] }    (3)

where d(c_i, c_j) is the dissimilarity function between two clusters c_i and c_j, defined as d(c_i, c_j) = min_{x∈c_i, y∈c_j} d(x, y), and diam(c) is the diameter of a cluster, which may be considered as a measure of the clusters' dispersion. The diameter of a cluster C can be defined as follows:

    diam(C) = max_{x,y∈C} d(x, y)    (4)

If the dataset contains compact and well-separated clusters, the distance between the clusters is expected to be large and the diameter of the clusters is expected to be small. Thus, based on the Dunn index definition, we may conclude that large values of the index indicate the presence of compact and well-separated clusters.

The index D_nc does not exhibit any trend with respect to the number of clusters. Thus, the maximum in the plot of D_nc versus the number of clusters can be an indication of the number of clusters that fits the data.
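
A minimal sketch of the index (eqs. 3-4), computed from a labeled partition:

    import numpy as np
    from scipy.spatial.distance import cdist

    def dunn_index(data, labels):
        clusters = [data[labels == k] for k in np.unique(labels)]
        diam = max(cdist(c, c).max() for c in clusters)  # largest cluster diameter
        d_min = min(cdist(ci, cj).min()                  # closest pair of clusters
                    for i, ci in enumerate(clusters)
                    for cj in clusters[i + 1:])
        return d_min / diam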

The problems of the Dunn index are: i) its considerable time complexity, ii) its sensitivity to the presence of noise in datasets, since noise is likely to increase the values of diam(c) (i.e., the denominator of equation (3)).

Three indices that are more robust to the presence of noise are proposed in [14]. They are known as Dunn-like indices since they are based on the Dunn index. Moreover, the three indices use for their definition the concepts of the Minimum Spanning Tree (MST), the relative neighbourhood graph (RNG) and the Gabriel graph respectively [18].

Consider the index based on the MST. Let a cluster c_i and the complete graph G_i whose vertices correspond to the vectors of c_i. The weight, w_e, of an edge, e, of this graph equals the distance between its two end points, x, y. Let E_i^MST be the set of edges of the MST of the graph G_i and e_i^MST the edge in E_i^MST with the maximum weight. Then the diameter of c_i is defined as the weight of e_i^MST. The Dunn-like index based on the concept of the MST is given by the equation


    D_nc^MST = min_{i=1,...,nc} { min_{j=i+1,...,nc} [ d(c_i, c_j) / max_{k=1,...,nc} diam^MST(c_k) ] }    (5)

The number of clusters at which D_nc^MST takes its maximum value indicates the number of clusters in the underlying data. Based on similar arguments we may define the Dunn-like indices for the GG and RNG graphs.
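
As an illustration, a sketch of the MST-based diameter used by eq. (5): the diameter of a cluster is the largest edge weight in the MST of its complete distance graph.

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.sparse.csgraph import minimum_spanning_tree

    def diam_mst(cluster_points):
        weights = cdist(cluster_points, cluster_points)  # complete graph G_i
        mst = minimum_spanning_tree(weights)             # E_i^MST
        return mst.toarray().max()                       # weight of e_i^MST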

2.1.3 The Davies-Bouldin (DB) index

A similarity measure R_ij between the clusters c_i and c_j is defined based on a measure of dispersion of a cluster c_i, let s_i, and a dissimilarity measure between two clusters, d_ij. The R_ij index is defined to satisfy the following conditions [4]:

1. R_ij ≥ 0
2. R_ij = R_ji
3. if s_i = 0 and s_j = 0 then R_ij = 0
4. if s_j > s_k and d_ij = d_ik then R_ij > R_ik
5. if s_j = s_k and d_ij < d_ik then R_ij > R_ik

These conditions state that R_ij is non-negative and symmetric.

A simple choice for R_ij that satisfies the above conditions is [4]:

    R_ij = (s_i + s_j) / d_ij    (6)

Then the DB index is defined as

    DB_nc = (1/nc) Σ_{i=1}^{nc} R_i,  where  R_i = max_{j=1,...,nc, j≠i} R_ij, i = 1, ..., nc    (7)

It is clear from the above definition that DB_nc is the average similarity between each cluster c_i, i = 1, ..., nc, and its most similar one. It is desirable for the clusters to have the minimum possible similarity to each other; therefore we seek clusterings that minimize DB. The DB_nc index exhibits no trends with respect to the number of clusters, and thus we seek the minimum value of DB_nc in its plot versus the number of clusters.
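
A minimal sketch of eqs. (6)-(7), assuming the mean distance to the cluster center as the dispersion s_i and the distance between centers as d_ij (one of the choices discussed in [4]):

    import numpy as np
    from scipy.spatial.distance import cdist

    def davies_bouldin(data, labels, centers):
        nc = len(centers)
        s = np.array([np.linalg.norm(data[labels == i] - centers[i], axis=1).mean()
                      for i in range(nc)])                        # dispersion s_i
        d = cdist(centers, centers)                               # d_ij between centers
        R = (s[:, None] + s[None, :]) / np.where(d > 0, d, np.inf)  # R_ij, eq. (6)
        np.fill_diagonal(R, -np.inf)                              # exclude j = i
        return R.max(axis=1).mean()                               # eq. (7)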

Some alternative definitions of the dissimilarity between two clusters, as well as of the dispersion of a cluster c_i, are given in [4].

Three variants of the DB_nc index are proposed in [14]. They are based on the MST, RNG and GG concepts, similarly to the cases of the Dunn-like indices.

Other validity indices for crisp clustering have been proposed in [3] and [13]. The implementation of most of these indices is computationally very expensive, especially when the number of clusters and objects in the data set grows very large [19]. In [13], an evaluation study of thirty validity indices proposed in the literature is presented. It is based on tiny data sets (about 50 points each) with well-separated clusters. The results of this study [13] place Calinski and Harabasz (1974), Je(2)/Je(1) (1984), C-index (1976), Gamma and Beale among the six best indices. However, it is noted that although the results concerning these methods are encouraging, they are likely to be data dependent. Thus, the behaviour of the indices may change if different data structures are used. Also, some indices are based on only a sample of the clustering results. A representative example is Je(2)/Je(1), whose computation is based only on the information provided by the items involved in the last cluster merge.

2.1.4 RMSSTD, SPR, RS, CD

This family of validity indices is applicable when hierarchical algorithms are used to cluster the datasets. Hereafter we refer to the definitions of four validity indices, which have to be used simultaneously to determine the number of clusters existing in the data set. These four indices are applied to each step of a hierarchical clustering algorithm and they are known as [16]:

• Root-mean-square standard deviation (RMSSTD) of the new cluster
• Semi-partial R-squared (SPR)
• R-squared (RS)
• Distance between two clusters (CD)

Getting into a more detailed description of them, we can say that:

RMSSTD of a new clustering scheme defined at a level of a clustering hierarchy is the square root of the variance of all the variables (attributes used in the clustering process). This index measures the homogeneity of the formed clusters at each step of the hierarchical algorithm. Since the objective of cluster analysis is to form homogeneous groups, the RMSSTD of a cluster should be as small as possible. In case the values of RMSSTD are higher than those of the previous step, we have an indication that the new clustering scheme is worse.

In the following definitions we shall use the term SS, which means Sum of Squares and refers to the equation:

    SS = Σ_{i=1}^{N} (X_i − X̄)²    (8)

Along with this we shall use some additional symbolism, like:

i) SS_w, referring to the sum of squares within a group,

ii) SS_b, referring to the sum of squares between groups,

iii) SS_t, referring to the total sum of squares of the whole data set.

SPR for the new cluster is the difference between the SS_w of the new cluster and the sum of the SS_w values of the clusters joined to obtain the new cluster (loss of homogeneity), divided by the SS_t for the whole data set. This index measures the loss of homogeneity after merging the two clusters of a single algorithm step. If the index value is zero then the new cluster is obtained by merging two perfectly homogeneous clusters. If its value is high then the new cluster is obtained by merging two heterogeneous clusters.

RS of the new cluster is the ratio of SS_b over SS_t. SS_b is a measure of the difference between groups. Since SS_t = SS_b + SS_w, the greater the SS_b the smaller the SS_w, and vice versa. As a result, the greater the differences between groups are, the more homogeneous each group is, and vice versa. Thus, RS may be considered as a measure of dissimilarity between clusters. Furthermore, it measures the degree of homogeneity between groups. The values of RS range between 0 and 1. A value of RS equal to zero (0) indicates that no difference exists among groups. On the other hand, when RS equals 1 there is an indication of significant difference among groups.

The CD index measures the distance between the two clusters that are merged in a given step of the hierarchical clustering. This distance depends on the representatives selected for the hierarchical clustering we perform. For instance, in the case of centroid hierarchical clustering the representatives of the formed clusters are the centers of each cluster, so CD is the distance between the centers of the clusters. In the case of single linkage, CD measures the minimum Euclidean distance between all possible pairs of points. In the case of complete linkage, CD is the maximum Euclidean distance between all pairs of data points, and so on.

Using these four indices we determine the number of clusters that exist in a data set, plotting a graph of the values of all these indices for a number of different stages of the clustering algorithm. In this graph we search for the steepest knee or, in other words, the greatest jump of these indices' values from a higher to a smaller number of clusters.

Non-hierarchical clustering. In the case of non-hierarchical clustering (e.g. K-Means) we may also use some of these indices in order to evaluate the resulting clustering. The indices that are more meaningful to use in this case are RMSSTD and RS. The idea here is to run the algorithm a number of times for a different number of clusters each time. Then the respective graphs of the validity indices are plotted for these clusterings and, as the previous example shows, we search for the significant "knee" in these graphs. The number of clusters at which the "knee" is observed indicates the optimal clustering for our data set. In this case the validity indices described before take the following form:

    RMSSTD = [ Σ_{i=1,...,nc, j=1,...,v} Σ_{k=1}^{n_ij} (x_k − x̄_j)²  /  Σ_{i=1,...,nc, j=1,...,v} (n_ij − 1) ]^{1/2}    (9)

and

    RS = SS_b / SS_t = (SS_t − SS_w) / SS_t
       = [ Σ_{j=1}^{v} Σ_{k=1}^{n_j} (x_k − x̄_j)² − Σ_{i=1,...,nc, j=1,...,v} Σ_{k=1}^{n_ij} (x_k − x̄_j)² ] / Σ_{j=1}^{v} Σ_{k=1}^{n_j} (x_k − x̄_j)²    (10)

where nc is the number of clusters, v the number of variables (data dimensionality), n_j is the number of data values of the j-th dimension, while n_ij corresponds to the number of data values of the j-th dimension that belong to cluster i. Also, x̄_j is the mean of the data values of the j-th dimension.
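
A minimal sketch of eqs. (9)-(10) for a flat (e.g. K-Means) partition, interpreting the deviations in SS_w as deviations from the per-cluster, per-dimension means (the pooled within-cluster variance), which is the homogeneity reading of RMSSTD:

    import numpy as np

    def rmsstd_rs(data, labels):
        _, v = data.shape
        ss_t = ((data - data.mean(axis=0)) ** 2).sum()     # total sum of squares
        ss_w = sum(((data[labels == k] - data[labels == k].mean(axis=0)) ** 2).sum()
                   for k in np.unique(labels))             # within-group sum of squares
        dof = sum((np.sum(labels == k) - 1) * v for k in np.unique(labels))
        rmsstd = np.sqrt(ss_w / dof)                       # eq. (9)
        rs = (ss_t - ss_w) / ss_t                          # eq. (10)
        return rmsstd, rs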

2.1.5 The SD validity index

Another clustering validity approach is proposed in [8]. The SD validity index definition is based on the concepts of average scattering for clusters and total separation between clusters. In the sequel, we give the fundamental definitions for this index.

Average scattering for clusters. The average scattering for clusters is defined as

    Scat(nc) = (1/nc) Σ_{i=1}^{nc} ||σ(v_i)|| / ||σ(S)||    (11)

Total separation between clusters. The definition of total scattering (separation) between clusters is given by equation (12):

    Dis(nc) = (D_max / D_min) Σ_{k=1}^{nc} ( Σ_{z=1}^{nc} ||v_k − v_z|| )^{−1}    (12)

where D_max = max(||v_i − v_j||), ∀i, j ∈ {1, 2, 3, ..., nc}, is the maximum distance between cluster centers, and D_min = min(||v_i − v_j||), ∀i, j ∈ {1, 2, ..., nc}, is the minimum distance between cluster centers.

Now, we can define a validity index based on equations (11) and (12) as follows:

    SD(nc) = a · Scat(nc) + Dis(nc)    (13)

where a is a weighting factor equal to Dis(c_max), where c_max is the maximum number of input clusters.

The first term, Scat(nc), defined in equation (11), indicates the average compactness of clusters (i.e., intra-cluster distance). A small value for this term indicates compact clusters, and as the scattering within clusters increases (i.e., they become less compact) the value of Scat(nc) also increases. The second term, Dis(nc), indicates the total separation between the nc clusters (i.e., an indication of inter-cluster distance). Contrary to the first term, Dis(nc) is influenced by the geometry of the clusters and increases with the number of clusters. The two terms of SD have different ranges, thus a weighting factor is needed in order to incorporate both terms in a balanced way. The number of clusters, nc, that minimizes the above index is an optimal value. Also, the influence of the maximum number of clusters c_max, related to the weighting factor, on the selection of the optimal clustering scheme is discussed in [8]. It is proved that SD proposes an optimal number of clusters almost irrespectively of the c_max value.

2.1.6 The S_Dbw validity index

A recent validity index is proposed in [10]. It exploits the inherent features of clusters to assess the validity of results and select the optimal partitioning for the data under concern. Similarly to the SD index, its definition is based on compactness and separation, taking also into account density.

In the following, S_Dbw is formalized based on: i. clusters' compactness (in terms of intra-cluster variance), and ii. density between clusters (in terms of inter-cluster density).

Let D = {v_i | i = 1, ..., nc} be a partitioning of a data set S into nc convex clusters, where v_i is the center of each cluster as it results from applying a clustering algorithm to S.

Let stdev be the average standard deviation of clusters, defined as

    stdev = (1/nc) · ( Σ_{i=1}^{nc} ||σ(v_i)|| )^{1/2}

Further, the term ||x|| is defined as ||x|| = (x^T x)^{1/2}, where x is a vector.

Then the overall inter-cluster density is defined as:

Definition 1. Inter-cluster Density (ID). It evaluates the average density in the region among clusters in relation to the density of the clusters. The goal is for the density among clusters to be low in comparison with the density in the clusters. Then, inter-cluster density is defined as follows:

    Dens_bw(nc) = [1 / (nc(nc−1))] Σ_{i=1}^{nc} Σ_{j=1, j≠i}^{nc} density(u_ij) / max{density(v_i), density(v_j)}    (14)

where v_i, v_j are the centers of the clusters c_i, c_j, respectively, and u_ij the middle point of the line segment defined by the clusters' centers v_i, v_j. The term density(u) is defined in equation (15):

    density(u) = Σ_{l=1}^{n_ij} f(x_l, u)    (15)

where n_ij is the number of tuples that belong to the clusters c_i and c_j, i.e., x_l ∈ c_i ∪ c_j ⊆ S. It represents the number of points in the neighbourhood of u. In our work, the neighbourhood of a data point, u, is defined to be a hyper-sphere with center u and radius the average standard deviation of the clusters, stdev. More specifically, the function f(x, u) is defined as:

    f(x, u) = 0 if d(x, u) > stdev, and 1 otherwise    (16)

A point belongs to the neighbourhood of u if its distance from u is smaller than the average standard deviation of the clusters. Here we assume that the data ranges are normalized across all dimensions for finding the neighbours of a multidimensional point [1].

Definition 2. Intra-cluster variance. It measures the average scattering of clusters. Its definition is given by equation (11).

Then the validity index S_Dbw is defined as:

    S_Dbw(nc) = Scat(nc) + Dens_bw(nc)    (17)

The definition of S_Dbw indicates that both criteria of "good" clustering (i.e., compactness and separation) are properly combined, enabling reliable evaluation of clustering results. Also, the density variations among clusters are taken into account to achieve more reliable results. The number of clusters, nc, that minimizes the above index is an optimal value indicating the number of clusters present in the data set.

Moreover, an approach based on the S_Dbw index is proposed in [9]. It evaluates the clustering schemes of a data set as defined by different clustering algorithms and selects the algorithm resulting in the optimal partitioning of the data.

In general terms, S_Dbw enables the selection both of the algorithm and of its parameter values for which the optimal partitioning of a data set is defined (assuming that the data set presents clustering tendency). However, the index cannot handle properly arbitrarily shaped clusters. The same applies to all the aforementioned indices.
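
A minimal sketch of eqs. (14)-(17), reusing the Scat term of eq. (11) and following the stdev and f(x, u) definitions above:

    import numpy as np

    def s_dbw(data, labels, centers):
        nc = len(centers)
        clusters = [data[labels == i] for i in range(nc)]
        sigma = [np.linalg.norm(c.var(axis=0)) for c in clusters]
        scat = np.mean(sigma) / np.linalg.norm(data.var(axis=0))   # eq. (11)
        stdev = np.sqrt(np.sum(sigma)) / nc                        # average stdev
        def density(u, pts):                                       # eqs. (15)-(16)
            return np.sum(np.linalg.norm(pts - u, axis=1) <= stdev)
        dens = 0.0
        for i in range(nc):
            for j in range(i + 1, nc):
                pts = np.vstack([clusters[i], clusters[j]])        # c_i union c_j
                u_ij = (centers[i] + centers[j]) / 2               # midpoint of centers
                d_max = max(density(centers[i], pts), density(centers[j], pts))
                dens += density(u_ij, pts) / d_max if d_max > 0 else 0
        dens_bw = 2 * dens / (nc * (nc - 1))                       # eq. (14)
        return scat + dens_bw                                      # eq. (17)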

2.2 Fuzzy Clustering

In this section, we present validity indices suitable for fuzzy clustering. The objective is to seek clustering schemes where most of the vectors of the dataset exhibit a high degree of membership in one cluster. Fuzzy clustering is defined by a matrix U = [u_ij], where u_ij denotes the degree of membership of the vector x_i in cluster j. Also, a set of cluster representatives is defined. Similarly to the crisp clustering case, a validity index, q, is defined and we search for the minimum or maximum in the plot of q versus nc. Also, in case q exhibits a trend with respect to the number of clusters, we seek a significant knee of decrease (or increase) in the plot of q.

In the sequel, two categories of fuzzy validity indices are discussed. The first category uses only the membership values, u_ij, of a fuzzy partition of data. The second involves both the U matrix and the dataset itself.

2.2.1 Validity indices involving only the membership values

Bezdek proposed in [2] the partition coefficient, which is defined as

    PC = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{nc} u_ij²    (18)

The PC index values range in [1/nc, 1], where nc is the number of clusters. The closer to unity the index, the "crisper" the clustering is. In case all membership values to a fuzzy partition are equal, that is, u_ij = 1/nc, PC obtains its lower value. Thus, the closer the value of PC is to 1/nc, the fuzzier the clustering is. Furthermore, a value close to 1/nc indicates that there is no clustering tendency in the considered dataset or the clustering algorithm failed to reveal it.

The partition entropy coefficient is another index of this category. It is defined as follows:

    PE = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{nc} u_ij · log_a(u_ij)    (19)

where a is the base of the logarithm. The index is computed for values of nc greater than 1 and its values range in [0, log_a nc]. The closer the value of PE to 0, the crisper the clustering is. As in the previous case, index values close to the upper bound (i.e., log_a nc) indicate the absence of any clustering structure in the dataset or the inability of the algorithm to extract it.

The drawbacks of these indices are:

i) their monotonous dependency on the number of clusters. Thus, we seek significant knees of increase (for PC) or decrease (for PE) in the plots of the indices versus the number of clusters,

ii) their sensitivity to the fuzzifier, m. The fuzzifier is a parameter of the fuzzy clustering algorithm and indicates the fuzziness of the clustering results. As m → 1 the indices give the same values for all values of nc. On the other hand, when m → ∞, both PC and PE exhibit a significant knee at nc = 2,

iii) the lack of direct connection to the geometry of the data [3], since they do not use the data itself.
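
A minimal sketch of eqs. (18)-(19), taking U as an N×nc membership matrix whose rows sum to 1:

    import numpy as np

    def partition_coefficient(U):
        return np.sum(U ** 2) / U.shape[0]                          # eq. (18)

    def partition_entropy(U, base=np.e):
        u = np.clip(U, 1e-12, 1)                                    # avoid log(0)
        return -np.sum(u * np.log(u) / np.log(base)) / U.shape[0]   # eq. (19)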

2.2.2 Indices involving the membership values and the dataset

The Xie-Beni index [19], XB, also called the compactness and separation validity function, is a representative index of this category.

Consider a fuzzy partition of the data set X = {x_j; j = 1, ..., N}, with v_i (i = 1, ..., nc) the centers of each cluster and u_ij the membership of data point j with regard to cluster i.

The fuzzy deviation of x_j from cluster i, d_ij, is defined as the distance between x_j and the center of cluster i, weighted by the fuzzy membership of data point j belonging to cluster i:

    d_ij = u_ij · ||x_j − v_i||    (20)

Also, for a cluster i, the sum of the squares of the fuzzy deviations of the data points in X, denoted σ_i, is called the variation of cluster i. The sum of the variations of all clusters, σ, is called the total variation of the data set. The term π = σ_i / n_i is called the compactness of cluster i; since n_i is the number of points belonging to cluster i, π is the average variation in cluster i.

Also, the separation of the fuzzy partitions is defined as the minimum distance between cluster centers, that is,

    D_min = min_{i≠j} ||v_i − v_j||

Then the XB index is defined as

    XB = π / (N · D_min)

where N is the number of points in the data set.

It is clear that small values of XB are expected for compact and well-separated clusters. We note, however, that XB is monotonically decreasing when the number of clusters nc gets very large and close to N. One way to eliminate this decreasing tendency of the index is to determine a starting point, c_max, of the monotonic behaviour and to search for the minimum value of XB in the range [2, c_max]. Moreover, the values of the index XB depend on the fuzzifier value, so if m → ∞ then XB → ∞.

Another index of this category is the Fukuyama-Sugeno index, which is defined as

    FS_m = Σ_{i=1}^{N} Σ_{j=1}^{nc} u_ij^m ( ||x_i − v_j||²_A − ||v_j − v̄||²_A )    (21)

where v̄ is the mean vector of X and A is an l×l positive definite, symmetric matrix. When A = I, the above distance becomes the Euclidean distance. It is clear that for compact and well-separated clusters we expect small values for FS_m. The first term in the brackets measures the compactness of the clusters, while the second one measures the distances of the cluster representatives.

Other fuzzy validity indices are proposed in [6], which are based on the concepts of hypervolume and density. Let Σ_j be the fuzzy covariance matrix of the j-th cluster, defined as

    Σ_j = Σ_{i=1}^{N} u_ij^m (x_i − v_j)(x_i − v_j)^T / Σ_{i=1}^{N} u_ij^m    (22)

The fuzzy hypervolume of the j-th cluster is given by the equation

    V_j = |Σ_j|^{1/2}

where |Σ_j| is the determinant of Σ_j and is a measure of cluster compactness.

Then the total fuzzy hypervolume is given by the equation

    FH = Σ_{j=1}^{nc} V_j    (23)

Small values of FH indicate the existence of compact clusters.

The average partition density is also an index of this category. It can be defined as

    PA = (1/nc) Σ_{j=1}^{nc} S_j / V_j    (24)

where S_j = Σ_{x∈X_j} u_xj is called the sum of the central members of cluster j, and X_j is the set of data points that are within a pre-specified region around v_j (i.e., the center of cluster j).
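
Minimal sketches of two indices from this subsection. The Xie-Beni function below follows the widely used form of [19], which normalizes by N times the squared minimum center distance; note the running text above normalizes by N · D_min. The second function computes the total fuzzy hypervolume of eqs. (22)-(23). U is taken as an nc×N membership matrix.

    import numpy as np
    from scipy.spatial.distance import cdist

    def xie_beni(data, U, centers, m=2):
        dist2 = cdist(centers, data) ** 2                  # ||x_j - v_i||^2
        sigma = np.sum((U ** m) * dist2)                   # total variation
        d = cdist(centers, centers)
        return sigma / (data.shape[0] * d[d > 0].min() ** 2)

    def fuzzy_hypervolume(data, U, centers, m=2):
        fh = 0.0
        for j, v in enumerate(centers):
            w = U[j] ** m
            diff = data - v                                # x_i - v_j
            cov = (w[:, None, None] *                      # eq. (22): fuzzy covariance
                   np.einsum('ni,nj->nij', diff, diff)).sum(0) / w.sum()
            fh += np.sqrt(np.linalg.det(cov))              # V_j = |Sigma_j|^(1/2)
        return fh                                          # eq. (23)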


Figure 1. Datasets: (a) DataSet1, (b) DataSet2, (c) DataSet3.

Table 1. Optimal number of clusters proposed by validity indices

    Index          DataSet1   DataSet2   DataSet3   NdSet
    RS, RMSSTD         3          2          5        3
    DB                 6          3          7        3
    SD                 4          3          6        3
    S_Dbw              4          2          7        3

A different measure is the partition density index, PD, defined as

    PD = S / FH    (25)

where S = Σ_{j=1}^{nc} S_j.

A few other indices are proposed and discussed in [11, 13].

2.3 Other approaches for cluster validity

Another approach for finding the best number of clusters of a data set was proposed in [17]. It introduces a practical clustering algorithm based on Monte Carlo cross-validation. More specifically, the algorithm consists of M cross-validation runs over M chosen train/test partitions of a data set, D. For each partition, the EM algorithm is used to define nc clusters on the training data, while nc is varied from 1 to c_max. Then, the log-likelihood L_k(D) is calculated for each model with k = nc clusters. It is defined using the probability density function of the data as

    L_k(D) = Σ_{i=1}^{N} log f_k(x_i / Φ_k)    (26)

where f_k is the probability density function for the data and Φ_k denotes the parameters that have been estimated from the data. This is repeated M times and the M cross-validated estimates are averaged for each nc. Based on these estimates we may define the posterior probabilities for each value of the number of clusters nc, p(nc/D). If one of the p(nc/D) is near 1, there is strong evidence that the particular number of clusters is the best for our data set.

The evaluation approach proposed in [17] is based on density functions considered for the data set. Thus, it is based on concepts related to probabilistic models in order to estimate the number of clusters that better fits a data set, and it does not use concepts directly related to the data (i.e., inter-cluster and intra-cluster distances).

3. AN EXPERIMENTAL STUDY

In this section we present a comparative experimental evaluation of the important validity measures, aiming at illustrating their advantages and disadvantages.

We consider relative validity indices proposed in the literature, such as RS-RMSSTD [16], DB [18] and the most recent ones, SD [8] and S_Dbw [10]. The definitions of these validity indices can be found in Section 2.1. For our study, we used four synthetic data sets: three two-dimensional ones, further referred to as DataSet1, DataSet2 and DataSet3 (see Figure 1a-c), and a six-dimensional dataset, NdSet, containing three clusters.

Table 1 summarizes the results of the validity indices (RS, RMSSTD, DB, SD and S_Dbw) for different clustering schemes of the above-mentioned data sets, as resulting from a clustering algorithm. For our study, we used the results of the algorithms K-Means [12] and CURE [7], with their input value (number of clusters) ranging between 2 and 8. The indices RS and RMSSTD propose the partitioning of DataSet1 into three clusters, while DB selects six clusters as the best partitioning. On the other hand, SD and S_Dbw select four clusters as the best partitioning for DataSet1, which is also the correct number of clusters fitting the underlying data. Moreover, the indices S_Dbw and DB select the correct number of clusters (i.e., seven) as the optimal partitioning for DataSet3, while RS-RMSSTD and SD select the clustering schemes of five and six clusters respectively. Also, all indices propose three clusters as the best partitioning for NdSet. In the case of DataSet2, DB and SD select three clusters as the optimal scheme, while RS-RMSSTD and S_Dbw select two clusters (i.e., the correct number of clusters fitting the data set).

Here, we have to mention that a validity index is not a clustering algorithm itself, but a measure to evaluate the results of clustering algorithms that gives an indication of a partitioning that best fits a data set. The semantics of clustering is not a totally resolved issue and, depending on the application domain, we may consider different aspects as more significant. For instance, for a specific application it may be important to have well-separated clusters, while for another to consider more the compactness of the clusters. In the case of S_Dbw, the relative importance of the two terms on which the index definition is based can be adjusted. Having an indication of a good partitioning as proposed by the index, the domain experts may analyse further the validation procedure results. Thus, they could select some of the partitioning schemes proposed by S_Dbw and select the one better fitting their demands for crisp or overlapping clusters. For instance, DataSet2 can be considered as having three clusters with two of them slightly overlapping, or as having two well-separated clusters. In this case we observe that the S_Dbw values for two and three clusters are not significantly different (0.311 and 0.324 respectively). This is an indication that we may select either of the two partitioning schemes, depending on the clustering interpretation. Then, we compare the values of the Scat and Dens_bw terms for the cases of two and three clusters. We observe that the two-clusters scheme corresponds to well-separated clusters (Dens_bw(2) = 0.0976 < Dens_bw(3) = 0.2154), while the three-clusters scheme contains more compact clusters (Scat(2) = 0.21409 > Scat(3) = 0.1089).

4. CONCLUSIONS AND TRENDS IN CLUSTERING VALIDITY

Cluster validity checking is one of the most important issues in cluster analysis, related to the inherent features of the data set under concern. It aims at the evaluation of clustering results and the selection of the scheme that best fits the underlying data.

The majority of algorithms are based on certain criteria in order to define the clusters in which a data set can be partitioned. Since clustering is an unsupervised method and there is no a-priori indication for the actual number of clusters present in a data set, there is a need for some kind of clustering results validation. We presented a survey of the best-known validity criteria available in the literature, classified in three categories: external, internal, and relative. Moreover, we discussed some representative validity indices of these criteria, along with a sample experimental evaluation.

As we discussed earlier, the validity assessment approaches work better when the clusters are mostly compact. However, there are a number of applications where we have to handle arbitrarily shaped clusters (e.g. spatial data, medicine, biology). In this case the traditional validity criteria (variance, density and its continuity, separation) are no longer sufficient. There is a need for developing quality measures that assess the quality of the partitioning taking into account i. intra-cluster quality, ii. inter-cluster separation and iii. the geometry of the clusters, using sets of representative points, or even multidimensional curves, rather than a single representative point.

Another challenge is the definition of an integrated quality assessment model for data mining results. The fundamental concepts and criteria for a global data mining validity checking process have to be introduced and integrated to define a quality model. This will contribute to more efficient usage of data mining techniques for the extraction of valid, interesting, and exploitable patterns of knowledge.

ACKNOWLEDGEMENTS

This work was supported by the General Secretariat for Research and Technology through the PENED ("99E&85") project. We thank C. Amanatidis for his suggestions and his help in the experimental study. Also, we are grateful to C. Rodopoulos for the implementation of the CURE algorithm, as well as to Dr Eui-Hong (Sam) Han for providing information and the source code for the CURE algorithm.

REFERENCES

[1] Michael J. A. Berry, Gordon Linoff. Data Mining Techniques For Marketing, Sales and Customer Support. John Wiley & Sons, Inc., 1996.

[2] Bezdek, J.C., Ehrlich, R., Full, W. "FCM: Fuzzy C-Means Algorithm", Computers and Geosciences, 1984.

[3] Dave, R. N. "Validating fuzzy partitions obtained through c-shells clustering", Pattern Recognition Letters, Vol. 17, pp. 613-623, 1996.

[4] Davies, D.L., Bouldin, D.W. "A cluster separation measure", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 1, No. 2, 1979.

[5] Dunn, J. C. "Well separated clusters and optimal fuzzy partitions", J. Cybern., Vol. 4, pp. 95-104, 1974.

[6] Gath, I., Geva, A.B. "Unsupervised optimal fuzzy clustering", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11(7), 1989.

[7] Guha, S., Rastogi, R., Shim, K. "CURE: An Efficient Clustering Algorithm for Large Databases", in Proceedings of the ACM SIGMOD Conference, 1998.

[8] Halkidi, M., Vazirgiannis, M., Batistakis, Y. "Quality scheme assessment in the clustering process", Proceedings of PKDD, Lyon, France, 2000.

[9] Halkidi, M., Vazirgiannis, M. "A data set oriented approach for clustering algorithm selection", Proceedings of PKDD, Freiburg, Germany, 2001.

[10] Halkidi, M., Vazirgiannis, M. "Clustering Validity Assessment: Finding the optimal partitioning of a data set", in Proceedings of ICDM, California, USA, November 2001.

[11] Krishnapuram, R., Frigui, H., Nasraoui, O. "Quadratic shell clustering algorithms and the detection of second-degree curves", Pattern Recognition Letters, Vol. 14(7), 1993.

[12] MacQueen, J.B. "Some Methods for Classification and Analysis of Multivariate Observations", in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Volume I: Statistics, pp. 281-297, 1967.

[13] Milligan, G.W. and Cooper, M.C. "An Examination of Procedures for Determining the Number of Clusters in a Data Set", Psychometrika, Vol. 50, pp. 159-179, 1985.

[14] Pal, N.R., Biswas, J. "Cluster Validation using graph theoretic concepts", Pattern Recognition, Vol. 30(6), 1997.

[15] Rezaee, R., Lelieveldt, B.P.F., Reiber, J.H.C. "A new cluster validity index for the fuzzy c-mean", Pattern Recognition Letters, 19, pp. 237-246, 1998.

[16] Sharma, S.C. Applied Multivariate Techniques. John Wiley & Sons, 1996.

[17] Smyth, P. "Clustering using Monte Carlo Cross-Validation", Proceedings of KDD Conference, 1996.

[18] Theodoridis, S., Koutroumbas, K. Pattern Recognition, Academic Press, 1999.

[19] Xie, X. L., Beni, G. "A Validity measure for Fuzzy Clustering", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 13, No. 4, 1991.
