Enhancing clustering quality of geo-demographic analysis using context fuzzy clustering type-2 and particle swarm optimization

Le Hoang Son ∗
VNU University of Science, Vietnam National University, Viet Nam

Applied Soft Computing xxx (2014) xxx–xxx, article in press (G Model ASOC-2296; No. of Pages 19). Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/asoc
Please cite this article in press as: L.H. Son, Enhancing clustering quality of geo-demographic analysis using context fuzzy clustering type-2 and particle swarm optimization, Appl. Soft Comput. J. (2014), http://dx.doi.org/10.1016/j.asoc.2014.04.025
1568-4946/© 2014 Elsevier B.V. All rights reserved.

∗ Correspondence to: 334 Nguyen Trai, Thanh Xuan, Hanoi 010000, Viet Nam. Tel.: +84 904171284; fax: +84 0438623938. E-mail addresses: [email protected], [email protected]

Article history: Received in revised form 14 February 2014; available online xxx

Keywords: Context clustering; Fuzzy clustering type-2; Geo-demographic analysis; Heuristic algorithms; Particle swarm optimization

Abstract

Geo-Demographic Analysis, one of the most interesting inter-disciplinary research topics between Geographic Information Systems and Data Mining, plays a very important role in policy decisions, population migration and services distribution. Among the soft computing methods used for this problem, clustering is the most popular because it has advantages over the rest in processing time, quality of results and memory usage. Nonetheless, the state-of-the-art clustering algorithm, FGWC, has low clustering quality since it was constructed on the basis of traditional fuzzy sets. In this paper, we present a novel interval type-2 fuzzy clustering algorithm, deployed on an extension of the traditional fuzzy sets namely Interval Type-2 Fuzzy Sets, to enhance the clustering quality of FGWC. Additional techniques such as the interval context variable, Particle Swarm Optimization and parallel computing are attached to speed up the algorithm. The experimental evaluation through various case studies shows that the proposed method obtains better clustering quality than some best-known ones.

Introduction

Geo-Demographic Analysis (GDA), which was defined as “the analysis of spatially referenced geo-demographic and lifestyle data” [33], is one of the most interesting inter-disciplinary research topics between Geographic Information Systems and Data Mining, and is widely used in the public and private sectors for the planning and provision of products and services. There are various examples showing the need for GDA in practical applications. Shelton et al. [34] performed a geo-demographic classification of mortality patterns in Britain and mapped the main causes of death in England and Wales from 1981 to 2000 to geographical locations, so that decision makers could better understand the distribution of major causes. Michael [23] conducted a GDA to gather community attitudes on the future growth of Werri Beach and Gerringong, NSW (Nelson), Australia, focusing primarily on what actions the Council should take to manage population growth within existing neighborhoods. Páez et al. [29] presented a geo-demographic framework using data from Montreal, Canada to identify potential commercial partnerships that could exploit the characteristics of smart cards. Campbell et al. [8] provided a detailed GDA of over 37,000 gifted and talented students admitted to the National Academy for Gifted and Talented Youth in England in 2003/2005 and showed that the National Academy had nonetheless reached significant numbers of students in the poorest areas, something over 3000 students, and 8% of students identified as gifted and talented at this stage. Day et al. [11] conducted a survey that determined clusters of nations grouped by health outcomes, comparing life expectancy and a range of health system indicators within and between clusters in order to provide sensible groupings for international comparisons. Other typical applications of GDA, such as the spatial and socio-economic determinants of tuberculosis, urban green space accessibility for different ethnic and religious groups, and the investigation of children's disorders, can be found in [1,6,9,32,36,37].

To perform GDA, soft computing methods such as Principal Component Analysis (PCA), Self-Organizing Maps (SOM) and clustering are often used. Walford [41] described a method using PCA to study the spatial distribution of the 1991 census data scores. However, the results of PCA depend on the scaling of the variables, and its applicability is limited by certain assumptions made in the derivation. Loureiro et al. [21] introduced the use of SOM as an adequate tool for GDA. Based on the variations in edge length along a path between two units on the SOM, the authors presented a new way of calculating the fuzzy memberships of a fuzzy clustering method. However, it requires a lot of memory space to store all neurons and weights; what is more, the speed of the training



phase is quite slow. Because of these limitations, clustering is often used instead, since it has advantages over the rest in processing time, quality of results and memory usage. Our previous work in [36] gave an overview of clustering methods for GDA such as Fuzzy C-Means (FCM) [3], agglomerative hierarchical clustering [11], Neighborhood Effects (NE) [13], K-Means clustering [20] and Fuzzy Geographically Weighted Clustering (FGWC) [24]. Among them, FGWC was considered the most favored algorithm and was used in most research articles on GDA applications.

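$$u'_k = \alpha \times u_k + \beta \times \frac{1}{A} \sum_{j=1}^{c} w_{kj} \times u_j \tag{1}$$

$$\alpha + \beta = 1 \tag{2}$$

$$w_{kj} = \frac{(pop_k \times pop_j)^{b}}{d_{kj}^{a}} \tag{3}$$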
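As a rough illustration of Eqs. (1)–(3), the following sketch applies one geographic adjustment to a membership matrix. It is not the authors' implementation: the function name `fgwc_update` and its argument layout are hypothetical, the weighted sum is assumed to run over the other areas, and the scaling factor $A$ is assumed to be the per-area total of the weighted sum over all clusters, so that each adjusted row again sums to one.

```python
import numpy as np

def fgwc_update(u, pop, dist, alpha=0.5, a=1.0, b=1.0):
    """Geographically adjust cluster memberships, following Eqs. (1)-(3).

    u    : (n_areas, n_clusters) old memberships, each row summing to 1
    pop  : (n_areas,) populations of the areas
    dist : (n_areas, n_areas) distances between areas
    """
    u = np.asarray(u, dtype=float)
    pop = np.asarray(pop, dtype=float)
    dist = np.asarray(dist, dtype=float)
    beta = 1.0 - alpha                         # Eq. (2)
    n = len(pop)
    off_diag = ~np.eye(n, dtype=bool)          # assumption: an area does not weight itself
    w = np.zeros((n, n))
    w[off_diag] = np.outer(pop, pop)[off_diag] ** b / dist[off_diag] ** a  # Eq. (3)
    weighted = w @ u                           # sum_j w_kj * u_j for every area k
    A = weighted.sum(axis=1, keepdims=True)    # assumed scaling factor, one per area
    A[A == 0.0] = 1.0                          # guard for an area without neighbours
    return alpha * u + beta * weighted / A     # Eq. (1)
```

With `alpha=1.0` the geographic term vanishes and the memberships are returned unchanged; smaller values of `alpha` pull each area towards the memberships of large, nearby areas.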

FGWC calculates the influence of one area upon another by Eqs. (1)–(3), where $u'_k$ ($u_k$) is the new (old) cluster membership of area k. The two parameters α and β are scaling variables. $pop_k$ and $pop_j$ are the populations of areas k and j, respectively. The number $d_{kj}$ is the distance between k and j. The two numbers a and b are user-definable parameters. A is a factor that scales the “sum” term; it is calculated across all clusters and ensures that the memberships of a given area sum to one over all clusters.

Although FGWC is the most popular clustering algorithm for GDA, it still has limitations in computing speed and clustering quality. One of our previous works [35] presented a method called CFGWC that accelerates FGWC by attaching context variable terms. Other works [36,37] showed preliminary results in improving the clustering quality of FGWC through intuitionistic fuzzy sets and geographical spatial effects. Our focus in this work is therefore to continue with the clustering quality problem of FGWC. FGWC was constructed on the basis of traditional fuzzy sets, which have limitations in their membership degrees as pointed out by Mendel [25]; this observation motivates us to rebuild FGWC on an extension of the traditional fuzzy sets to enhance its clustering quality. Now, let us explain why clustering algorithms on traditional fuzzy sets have low clustering quality.

According to Mendel [25], traditional fuzzy sets cannot process some exceptional cases where the membership degrees are not crisp values but fuzzy ones. For example, after examining all symptoms a doctor may conclude that the possibility of a patient having tuberculosis is from 60 to 80 percent. Even if modern medical machines are provided, the doctor cannot give an exact number for that possibility. This shows that crisp membership values cannot model some real-world situations and should be replaced with fuzzy ones. Rhee [30] stated that using traditional fuzzy sets often results in bad clustering quality because uncertainties in the distance measure, fuzzifier, centers, prototype and initialization of prototype parameters can create imperfect representations of the pattern sets. For example, when pattern sets contain clusters of different volume or density, patterns lying on the left side of a cluster may contribute more to another cluster than to this one, so choosing a suitable value for the fuzzifier is difficult. A bad selection can yield undesirable clustering results for pattern sets that include noise. Because of those limitations, some preliminary results on deploying fuzzy clustering methods on an extension of the traditional fuzzy sets, called Interval Type-2 Fuzzy Sets (IT2FS), have been introduced. Mendel [25] defined IT2FS as follows.

$$A = \left\{ \left(x, u, \mu_A(x, u) = 1\right) \mid \forall x \in A,\ \forall u \in J_X \subseteq [0, 1] \right\}. \tag{4}$$
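To make Eq. (4) concrete, the sketch below stores the primary membership of one element as an interval and assigns a secondary grade of 1 to every value inside it. The class name `IntervalMembership` is hypothetical, and two simplifications are assumed: $J_X$ is a single closed interval, and values outside it receive a grade of 0.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IntervalMembership:
    """Primary membership of one element, kept as the interval [lower, upper]."""
    lower: float
    upper: float

    def __post_init__(self):
        if not (0.0 <= self.lower <= self.upper <= 1.0):
            raise ValueError("need 0 <= lower <= upper <= 1")

    def secondary_grade(self, u: float) -> float:
        # Every primary membership inside the interval has grade 1, as in Eq. (4);
        # returning 0 outside the interval is an assumption of this sketch.
        return 1.0 if self.lower <= u <= self.upper else 0.0

    def is_type1(self) -> bool:
        # A degenerate interval carries no uncertainty: a traditional fuzzy membership.
        return self.lower == self.upper
```

For the tuberculosis example discussed earlier, `IntervalMembership(0.6, 0.8)` keeps the whole range given by the doctor, whereas `IntervalMembership(0.7, 0.7)` collapses to a crisp membership degree.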

From Eq. (4), we recognize that IT2FS is a generalization of the traditional fuzzy sets, since IT2FS will return to the traditional fuzzy sets when there is no uncertainty in the third dimension. Based upon this definition, some authors introduced several interval type-2 fuzzy clustering algorithms, such as in the works of Hwang and Rhee [15] and Rhee [30]. Specifically, Hwang and Rhee [15] presented a type-2 fuzzy clustering algorithm to solve the problem of choosing distance measures in the FCM algorithm, taking the difference of each type-2 membership function area with the corresponding type-1 membership value. Rhee [30] presented an improvement of this algorithm using two different values of fuzzifiers to solve the uncertainty of the fuzzifier in FCM. Some other variants of the interval type-2 fuzzy clustering algorithms can be found in [2,10,12,14,17,19,22,26,27,31,42].

Motivated by those results, in this article we present a novel interval type-2 fuzzy clustering algorithm called Context Fuzzy Geographically Weighted Clustering on IT2FS, or CFGWC2 for short, to enhance the clustering quality of FGWC. CFGWC2 differs from the interval type-2 fuzzy clustering algorithms above in two ways. Firstly, CFGWC2 is specially designed for the GDA problem, which requires incorporating geographical spatial effects into the algorithm itself. Secondly, it is equipped with additional techniques to speed up the whole algorithm, namely:

• An interval context variable, which is an extension of the single context variable of Pedrycz [28], is proposed and used to clarify the clustering results and accelerate the computing speed.
• In order to avoid bad initialization, which may occur in other interval type-2 fuzzy clustering algorithms, and to converge quickly to the (sub-)optimal solutions, a meta-heuristic optimization method, Particle Swarm Optimization (PSO) [18], is used to determine good initial centers for CFGWC2.
• Since context values in the interval context variable can be processed simultaneously in CFGWC2, a parallel computing technique is adapted to CFGWC2 to reduce the computational costs.

The items listed above are the contributions of this paper. The proposed algorithm is implemented and compared with some relevant methods in terms of clustering quality to verify its efficiency.

The rest of this paper is organized as follows. Section “The proposed methodology” elaborates the proposed method in detail, including the additional techniques one after another. The numerical experiments through various case studies and discussions are given in Section “Results”. Finally, Section “Conclusions” gives the conclusions and outlines future work.

The proposed methodology

In the previous section, we saw that CFGWC2 is an interval type-2 fuzzy clustering algorithm for the GDA problem, equipped with additional techniques such as the interval context variable, PSO and parallel computing. Since those techniques are necessary for the description of CFGWC2, they are presented first in Sections “Using PSO for the determination of initial centers” and “The interval context”. The CFGWC2 algorithm, together with the parallel computing mechanism, is described in Section “The CFGWC2 algorithm”.


Using PSO for the determination of initial centers

This section describes the technique that finds good initial centers for clustering algorithms by PSO. The idea is to give a preliminary classification of the original pattern set so that the “temporary” cluster results can be used to orient the classification in the main algorithm. The objective function is shown in Eq. (5), and its constraints are given in Eqs. (6)–(7):

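$$J = \sum_{k=1}^{N} \sum_{j=1}^{C} \left\| X_k - V_j \right\|^2 \to \min. \tag{5}$$

$$\min_{\substack{j = 1, \ldots, C \\ j \neq i}} \left\| V_i - V_j \right\| > \max_{\substack{s = 1, \ldots, POP(i) \\ X_s \in Cluster(i)}} \left\| X_s - V_i \right\|, \quad i = 1, \ldots, C \tag{6}$$

$$\left| Cluster(i) \right| \le \varepsilon_1 \quad \text{where } POP(i) = 1 \text{ and } i = 1, \ldots, C \tag{7}$$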

Constraint (6) requires that all clusters are separated from one another; in other words, the minimal distance from a cluster's center to the other centers must exceed the maximal distance from this center to the data points in the cluster. POP(i) is the population, i.e. the number of patterns, in the cluster Cluster(i). Constraint (7) limits the number of outliers in the result: the number of clusters holding a single pattern must not exceed a pre-defined threshold ε1.
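The objective and the two constraints can be checked numerically for a given set of centers. The sketch below is only illustrative: the helper names are hypothetical, patterns are assigned to their nearest center, and Eq. (5) is evaluated as the double sum over all patterns and all centers.

```python
import numpy as np

def assign(X, V):
    """Label every pattern with the index of its nearest center."""
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
    return d.argmin(axis=1)

def objective(X, V):
    """Eq. (5): squared distances summed over all patterns and all centers."""
    return float(((X[:, None, :] - V[None, :, :]) ** 2).sum())

def is_separated(X, V, labels):
    """Constraint (6): nearest other center lies farther than the cluster radius."""
    for i in range(len(V)):
        members = X[labels == i]
        if len(members) == 0:
            continue
        radius = np.linalg.norm(members - V[i], axis=1).max()
        gap = np.linalg.norm(np.delete(V, i, axis=0) - V[i], axis=1).min()
        if not gap > radius:
            return False
    return True

def n_outlier_clusters(labels, C):
    """Left-hand side of constraint (7): clusters holding exactly one pattern."""
    sizes = np.bincount(labels, minlength=C)
    return int((sizes == 1).sum())
```

A candidate solution is feasible when `is_separated` returns `True` and `n_outlier_clusters` does not exceed the chosen threshold ε1.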

For the problem (5)–(7), we use PSO [18] to determine the (sub-)optimal solutions, with the beginning population initiated with P particles. Each particle is a vector $z = (z_1, z_2, \ldots, z_C)$ where $z_i$ ($i = 1, \ldots, C$) is a pattern randomly chosen from the original pattern set. The velocities of $z_i$ are set to zero. Details of the algorithm are described by the pseudo-code in Table 1.

Notice that Eq. (9) is used solely for the first of the MaxStep_PSO iterations. In the next iterations, the centers are calculated from the previous ones. Additionally, the value of $MD_i$ in Eq. (10) is set to zero when this cluster has no elements. The fitness value of a particle is calculated by Eq. (13) where (�1, �2) are the ratio constants. Eqs. (14)–(16) are used to update the velocities and positions of all particles. In those equations, $c_1$ is the ratio that keeps the velocity intact, $c_2$ is the ratio that changes the velocity following pBest, and $c_3$ shows the influence level of gBest on the velocity. Since the role of $z_i$ ($i = 1, \ldots, C$) from the second iteration onwards is taken over by the center $V_i$, the domain of the random number in Eq. (14) is set to (−1, 1) to ensure that the values of the centers stay bounded within the domain of the pattern set. After the number of iteration steps defined by MaxStep_PSO, the solution has improved because of the amelioration after each “flying step” based on the fitness function. The output $V^{(0)} = (V_1, V_2, \ldots, V_C)$ is taken from the particle holding the current gBest and is used as the initial center for CFGWC2.
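The procedure of Table 1 can be sketched compactly as follows. This is a hedged reading rather than the authors' code: the function name, the default values of `c` and `ratios`, and the use of a seeded NumPy generator are all assumptions. A smaller value of Eq. (13) is treated as better, one random number is drawn per coordinate in Eq. (14), and a cluster with $MD_i = 0$ is counted in Eq. (11) only when its center coincides with another one.

```python
import numpy as np

def pso_initial_centers(X, C, P=10, max_step=20, c=(0.4, 0.3, 0.3),
                        ratios=(0.5, 0.5), seed=0):
    """Return initial centers V(0) for C clusters, following Table 1."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Step 1: each particle holds C randomly chosen patterns; velocities are zero.
    z = np.stack([X[rng.choice(N, C, replace=False)] for _ in range(P)])
    vel = np.zeros_like(z)
    p_best, p_fit = z.copy(), np.full(P, np.inf)
    g_best, g_fit = z[0].copy(), np.inf

    def evaluate(centers, first):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)                          # Eq. (8)
        pop = np.bincount(labels, minlength=C)             # POP(i)
        V, md = centers.copy(), np.zeros(C)
        for i in range(C):
            members = X[labels == i]
            if len(members) == 0:
                continue                                   # MD_i stays zero
            if first:
                V[i] = members.mean(axis=0)                # Eq. (9), first iteration only
            md[i] = np.linalg.norm(members - V[i], axis=1).max()  # Eq. (10)
        gaps = np.linalg.norm(V[:, None] - V[None], axis=2)
        np.fill_diagonal(gaps, np.inf)
        sep = int((gaps.min(axis=1) <= md).sum())          # Eq. (11)
        out = int((pop <= 1).sum())                        # Eq. (12)
        fit = 1.0 / (ratios[0] / (1 + sep) + ratios[1] / (1 + out))  # Eq. (13)
        return V, fit

    for step in range(max_step):
        for p in range(P):
            z[p], fit = evaluate(z[p], first=(step == 0))
            if fit < p_fit[p]:                             # personal best
                p_fit[p], p_best[p] = fit, z[p].copy()
            if fit < g_fit:                                # global best
                g_fit, g_best = fit, z[p].copy()
        r1 = rng.uniform(-1, 1, z.shape)
        r2 = rng.uniform(-1, 1, z.shape)
        vel = c[0] * vel + c[1] * r1 * (p_best - z) + c[2] * r2 * (g_best - z)  # Eq. (14)
        z = z + vel                                        # Eq. (15)
    return g_best
```

The coefficients in `c` sum to one, as Eq. (16) requires. Calling the function with half of `max_step` and with the full `max_step` would give the two different initial centers that the parallel scheme hands to its two clustering machines.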

The interval context

In order to clarify the clustering results and accelerate the computing speed of the clustering algorithms, the context variable could be used. According to Pedrycz [28], a (single) context variable in Y ⊂ X is defined through the map below.

$$A : Y \to [0, 1], \qquad y_k \mapsto f_k = A(y_k), \tag{17}$$

where $f_k$ can be understood as the representation of the level of relation of the kth point to the supposed context. There are several ways to define the relation between $f_k$ and the membership of the kth point to the ith cluster, for instance, using the sum operator (18) or the maximum operator (19).

$$\sum_{i=1}^{c} u_{ki} = f_k, \quad k = 1, \ldots, N, \tag{18}$$

$$\max_{i = 1, \ldots, c} u_{ki} = f_k, \quad k = 1, \ldots, N \tag{19}$$

In our previous work [35], we defined a context variable to narrow the original geographical dataset under conditions on certain dimensions. The reason for using a context in the clustering algorithm is twofold. Firstly, a context variable helps to clarify the results according to the users' purposes. Because only a subset of the original dataset that has considerable meaning to the context is invoked, the result focuses on the area that really has many relevant points. Secondly, it helps to improve the computing speed. A traditional clustering method not only takes a long time to process the whole data but also makes the results less meaningful to the considered context. On the contrary, context-based clustering methods both accelerate the speed and improve the semantics.

Nevertheless, there are some limitations in definition (17). Firstly, the importance of the kth point to the supposed context is decided by a single value $f_k$, which is not enough to reflect the variety of evaluations that different people give to this relation. In other words, one person can assume that the importance is only 0.3 while another affirms that it should be 0.6. Secondly, the old approach excludes the roles of other data points in the context. This is a misleading assumption since all characteristics always have relationships, either direct or indirect, with the others. To address these limitations, we extend the use of context by introducing a new term: “the interval context variable”. An interval context is defined as $f = [f_1, f_2]$ where each $f_i$ (i = 1, 2) is stated through the map in Eq. (17). For the most important points, the value of f is high, e.g. [0.6, 0.8]. Similarly, the value of f for less important points is low, e.g. [0, 0.15]. This interval reflects the “fuzziness” of the context; in other words, we have just performed a “fuzzy” step for the considered context. It helps us overcome the shortcomings of the single context variable and is suitable for CFGWC2, which works on IT2FS. Details of applying the interval context variable in CFGWC2 are presented in Section “The CFGWC2 algorithm”.
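One simple way to obtain an interval context from several differing evaluations is sketched below. The text only states that each endpoint comes from a map of the form of Eq. (17); taking the smallest and the largest evaluation as $f_1$ and $f_2$ is an assumption of this sketch, and the function name is hypothetical.

```python
import numpy as np

def interval_context(evaluations):
    """Build [f1, f2] for every point from several context evaluations.

    evaluations : (n_points, n_evaluators) values in [0, 1], each column being
                  one single-valued context map in the sense of Eq. (17).
    Returns an (n_points, 2) array whose columns are f1 and f2.
    """
    e = np.asarray(evaluations, dtype=float)
    if e.min() < 0.0 or e.max() > 1.0:
        raise ValueError("context values must lie in [0, 1]")
    # Assumption: the interval spans the lowest and the highest evaluation.
    return np.column_stack([e.min(axis=1), e.max(axis=1)])
```

For the example in the text, evaluations of 0.3 and 0.6 for the same point give the interval context [0.3, 0.6].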

The CFGWC2 algorithm

We have covered the general background of choosing initial centers by PSO in Section “Using PSO for the determination of initial centers” and the basic definition of the interval context in Section “The interval context”. Now, we use both of them, together with the parallel computing mechanism, in the main activity of the CFGWC2 algorithm. The mechanism of CFGWC2 is illustrated in Fig. 1.

According to Fig. 1, the parallel computing mechanism of CFGWC2 requires three machines, the first of which (Machine 1) is responsible for generating initial centers for the remaining machines. The center values of Machine 2 and Machine 3 differ, however, since the stopping conditions of PSO are not identical. After (MaxStep_PSO/2) iteration steps, the first center $V^{(0)}$ is output and transferred to Machine 2, and the second center is sent to Machine 3 after (MaxStep_PSO) iterations. This guarantees different results in Machine 2 and Machine 3, and is suitable for the determination of the upper and lower centers and membership degrees of clustering algorithms on IT2FS, i.e. $U^{(1)}, V^{(1)}$ (Machine 2) and $U^{(2)}, V^{(2)}$ (Machine 3) in Fig. 1.

Table 1
The pseudo-code of the PSO procedure.

Input:
- The pattern set X whose dimension is r
- The number of elements (clusters) – N (C)
- The number of particles in the beginning population – P
- Maximal number of iteration steps in PSO – MaxStep_PSO
Output:
- Final center $V^{(0)}$

Particle Swarm Optimization (PSO)
1: Initialization
2: Repeat
3:   For each particle
4:     Assign remaining patterns to its clusters:
       $$X_j \in Cluster(i) \Leftrightarrow \left\| z_i - X_j \right\| = \min\left\{ \left\| z_k - X_j \right\| \mid k = 1, \ldots, C \right\} \tag{8}$$
5:     Calculate population POP(i) from current clusters
6:     Calculate center $V_i$ and the maximal distance from $V_i$ to the cluster's elements:
       $$V_i^{(l)} = \Big( \sum_{X_s \in Cluster(i)} X_s^{(l)} \Big) / POP(i), \quad l = 1, \ldots, r,\ i = 1, \ldots, C \tag{9}$$
       $$MD_i = \max_{s = 1, \ldots, POP(i)} \left\{ \left\| X_s - V_i \right\| \right\} = \max_{s = 1, \ldots, POP(i)} \left\{ \sqrt{\sum_{l=1}^{r} \left( X_s^{(l)} - V_i^{(l)} \right)^2} \right\}, \quad X_s \in Cluster(i), \tag{10}$$
7:     Calculate the separated status and the number of outliers:
       $$SEP(z) = \left| Cluster(i) \right| \quad \text{where } \frac{\min_{j = 1, \ldots, C;\ j \neq i} \left\| V_i - V_j \right\|}{MD_i} \le 1;\ i = 1, \ldots, C, \tag{11}$$
       $$OUT(z) = \left| Cluster(i) \right| \quad \text{where } POP(i) \le 1;\ i = 1, \ldots, C. \tag{12}$$
8:     Compute the fitness value of particles:
       $$f(z) = \frac{1}{\dfrac{\text{�}_1}{1 + SEP(z)} + \dfrac{\text{�}_2}{1 + OUT(z)}} \tag{13}$$
9:     Calculate pBest and gBest as in the traditional PSO algorithm [18]
10:  End For
11:  For each particle
12:    Update new velocity and position:
       $$velocity_{ij} = c_1 * velocity_{ij} + c_2 * rand(-1, 1) * (z_{pBest,j} - z_{ij}) + c_3 * rand(-1, 1) * (z_{gBest,j} - z_{ij}), \tag{14}$$
       $$z_{ij} = z_{ij} + velocity_{ij}, \tag{15}$$
       $$c_1 + c_2 + c_3 = 1. \tag{16}$$
13:  End For
14: Until MaxStep_PSO

In Machine 2 and Machine 3, we send the initial centers $V^{(0)}$ to a type-2 fuzzy clustering procedure accompanied with the interval context variable, called Context-FGWC2, to get the crisp center $V^{(1)}$ (Machine 2) and $V^{(2)}$ (Machine 3). If the difference between the initial and crisp centers is smaller than a threshold (Eps) or the maximal number of iterations (MaxStep) is reached, then we stop the Context-FGWC2 procedure and take the crisp center and membership degree, i.e. $U^{(1)}, V^{(1)}$ (Machine 2) and $U^{(2)}, V^{(2)}$ (Machine 3), as the final results. Otherwise, we assign $V^{(0)} = V^{(1)}$ in Machine 2 and $V^{(0)} = V^{(2)}$ in Machine 3 and start a new iteration in Context-FGWC2 until the stopping conditions hold.

Once the upper and lower centers and membership degrees are calculated, we use a defuzzification method based on the Partition Coefficient and Exponential Separation (PCAES) [40] validity index to obtain the final center and membership degree as below.

$$V^{(*)} = \begin{cases} V^{(1)} & \text{if } PCAES(V^{(1)}) \ge PCAES(V^{(2)}) \\ V^{(2)} & \text{otherwise} \end{cases} \tag{20}$$

This index measures whether an identified cluster has the potential to be a good cluster. It was compared with other indexes such as Partition Entropy (PE), Partition Coefficient (PC), Fuzzy Hypervolume (FHV), Xie & Beni, Pal & Bezdek, Modification PC (MPC) and Zahid et al., and showed impressive results, even in a noisy environment. The definition of PCAES is given below.

$$PCAES(C) = \sum_{j=1}^{C} PCAES[j] \tag{21}$$

where

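$$PCAES[j] = \sum_{k=1}^{N} \frac{u_{kj}^2}{u_M} - \exp\left( -\frac{\min_{i \neq j} \left\{ \left\| V_j - V_i \right\|^2 \right\}}{\beta_T} \right), \tag{22}$$

$$u_M = \min_{1 \le i \le C} \left\{ \sum_{k=1}^{N} u_{ki}^2 \right\} \tag{23}$$

$$\beta_T = \frac{\sum_{l=1}^{C} \left\| V_l - \overline{V} \right\|^2}{C} \tag{24}$$

$\overline{V} = (\overline{V}_1, \overline{V}_2, \ldots, \overline{V}_r)$ where $\overline{V}_i$ ($i = 1, \ldots, r$) is calculated as,

$$\overline{V}_i = \frac{\sum_{l=1}^{C} V_{li}}{C}. \tag{25}$$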
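The index of Eqs. (21)–(25) and the selection rule of Eq. (20) translate almost line by line into array operations. In the sketch below the function names are hypothetical, memberships are assumed to be stored as an N × C matrix and centers as a C × r matrix, and at least two distinct centers are assumed so that $\beta_T$ is non-zero.

```python
import numpy as np

def pcaes(U, V):
    """PCAES validity index of a fuzzy partition, following Eqs. (21)-(25).

    U : (N, C) membership degrees u_kj
    V : (C, r) cluster centers
    """
    U = np.asarray(U, dtype=float)
    V = np.asarray(V, dtype=float)
    C = V.shape[0]
    compact = (U ** 2).sum(axis=0)                  # sum_k u_kj^2 for every cluster j
    u_M = compact.min()                             # Eq. (23)
    v_bar = V.mean(axis=0)                          # Eq. (25)
    beta_T = ((V - v_bar) ** 2).sum() / C           # Eq. (24)
    d2 = ((V[:, None] - V[None]) ** 2).sum(axis=2)  # squared distances between centers
    np.fill_diagonal(d2, np.inf)                    # exclude i == j from the minimum
    separation = np.exp(-d2.min(axis=1) / beta_T)
    return float((compact / u_M - separation).sum())   # Eqs. (21)-(22)

def select_final(U1, V1, U2, V2):
    """Eq. (20): keep the result whose PCAES value is larger (ties go to the first)."""
    return (U1, V1) if pcaes(U1, V1) >= pcaes(U2, V2) else (U2, V2)
```

`select_final` returns the membership matrix together with the chosen center, since the membership degree related to the selected center is the one kept as the final result.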

PCAES[j] measures the compactness and separation of cluster j ($j = 1, \ldots, C$). These values are summed up to calculate PCAES(C) ∈ (−C, C). A large PCAES(C) value means that each of the C clusters is compact and separated from the other clusters. It is a criterion for choosing the suitable clustering output. Depending on which center is chosen, the related membership degree is used as the final membership $U^{(*)}$.

Now, we describe the Context-FGWC2 procedure. Recall from Section “The interval context” that an interval context was defined as $f = [f_1, f_2]$, so that we can apply $f_i$ (i = 1, 2) in each machine.


Table 2
The pseudo-code of the Context-FGWC2 procedure.

Input:
- Initial center $V^{(0)}$, the pattern set X, an interval fuzzifier $[m_1, m_2]$
- The number of elements (clusters) – N (C), the dimension of dataset r
- Geographic parameters α, β, a and b, precision ε, MaxStep iteration
Output:
- Final center $V^{(3)}$

Context-FGWC2
1: $V^{(3)} \leftarrow V^{(0)}$
2: Repeat
3:   $V^{(0)} \leftarrow V^{(3)}$
4:   Compute $U(x) = [\underline{U}(x), \overline{U}(x)]$ by (26)–(29)
5:   $V^{(A)} \leftarrow V^{(0)}$
6:   For $l = 1, \ldots, r$:
7:     Sort X following by l in ascending order
8:     Find index $k_0$ satisfying (30). Otherwise, $k_0 \leftarrow N - 1$
9:     Calculate $U^{(1)(l)}$, $V^{(1)}$ by (31)–(32)
10:    If $V^{(1)} = V^{(A)}$
11:      For $s = l + 1, \ldots, r$: $U_{kj}^{(1)(s)} \leftarrow U_{kj}$ ($j = 1, \ldots, C$, $k = 1, \ldots, N$)
12:      Go to Step 16
13:    Else $V^{(A)} \leftarrow V^{(1)}$
14:    End If
15:  End For
16:  $V_R \leftarrow V^{(1)}$
17:  Calculate $U^{(1)}$ by (33)
18:  Repeat from Step 5 to 17 to calculate $V_L$, $U^{(2)}$
19:  Perform Type-Reduction by (36)
20:  Determine the population of each cluster by (37)
21:  Update $U^{(C)}(x)$ by geo-characteristics in (2), (3) and (38)–(40)
22:  Perform Type-Reduction and compute center $V^{(2)}$ by (41) and (42) to get $U_{GT}(x)$
23:  $V^{(B)} \leftarrow V^{(2)}$
24:  Repeat from Step 6 to 18 to calculate $V_R$, $V_L$ from $V^{(B)}$ and $U_{GT}(x)$
25:  Perform defuzzification to calculate $V^{(3)}$ by (43)
26: Until $\left\| V^{(3)} - V^{(0)} \right\| \le \varepsilon$ or MaxStep is reached

Specifically, $f_1$ ($f_2$) is used in the Context-FGWC2 procedure of Machine 2 (3). Because different context values and initial centers are used in those machines, the upper and lower centers and membership degrees fully reflect the basic principle of IT2FS. The basic idea of the Context-FGWC2 procedure in Machine 2 is to use an interval of primary membership, consisting of the lower and upper memberships calculated from the initial center, and to update the interval by geo-characteristics and the context value $f_1$. The pseudo-code of Context-FGWC2 is shown in Table 2.

In Step 4 of Context-FGWC2, the intervals of primary membership consisting of the upper and lower memberships are calculated by Eqs. (26)–(29). Notice that in (26)–(27), the sum of membership degrees over all clusters is equal to $f_{1k}$, where $f_{1k}$ is the context value of the kth point in the pattern set. Analogously, the values of the upper and lower memberships depend on this context value as shown in (28)–(29).

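\[ \overline{U}(x) = \left\{ \overline{U}_{kj} \in (0, 1) \;\middle|\; k = \overline{1, N};\ j = \overline{1, C};\ \sum_{j=1}^{C} \overline{U}_{kj} = f_{1k} \right\} \tag{26} \]

\[ \underline{U}(x) = \left\{ \underline{U}_{kj} \in (0, 1) \;\middle|\; k = \overline{1, N};\ j = \overline{1, C};\ \sum_{j=1}^{C} \underline{U}_{kj} = f_{1k} \right\} \tag{27} \]

\[ \overline{U}_{kj} = \begin{cases} \dfrac{f_{1k}}{\sum_{i=1}^{C} \left( \dfrac{\| X_k - V_j^{(0)} \|}{\| X_k - V_i^{(0)} \|} \right)^{2/(m_1 - 1)}}, & \text{if } \dfrac{f_{1k}}{\sum_{i=1}^{C} \left( \dfrac{\| X_k - V_j^{(0)} \|}{\| X_k - V_i^{(0)} \|} \right)} \ge \dfrac{1}{C} \\[3ex] \dfrac{f_{1k}}{\sum_{i=1}^{C} \left( \dfrac{\| X_k - V_j^{(0)} \|}{\| X_k - V_i^{(0)} \|} \right)^{2/(m_2 - 1)}}, & \text{otherwise} \end{cases} \tag{28} \]

\[ \underline{U}_{kj} = \begin{cases} \dfrac{f_{1k}}{\sum_{i=1}^{C} \left( \dfrac{\| X_k - V_j^{(0)} \|}{\| X_k - V_i^{(0)} \|} \right)^{2/(m_1 - 1)}}, & \text{if } \dfrac{f_{1k}}{\sum_{i=1}^{C} \left( \dfrac{\| X_k - V_j^{(0)} \|}{\| X_k - V_i^{(0)} \|} \right)} < \dfrac{1}{C} \\[3ex] \dfrac{f_{1k}}{\sum_{i=1}^{C} \left( \dfrac{\| X_k - V_j^{(0)} \|}{\| X_k - V_i^{(0)} \|} \right)^{2/(m_2 - 1)}}, & \text{otherwise} \end{cases} \tag{29} \]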
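As an illustration of Step 4, the following is a minimal NumPy sketch of the interval membership computation in Eqs. (28)–(29). It is not the MPI/C implementation used in the experiments. The function name `interval_memberships`, the array layout (rows are points, columns are clusters) and the small distance floor that avoids division by zero are assumptions made for this example only.

```python
import numpy as np

def interval_memberships(X, V, f1, m1, m2):
    """Sketch of Eqs. (28)-(29): lower and upper memberships scaled by context f1.

    X: (N, r) patterns, V: (C, r) initial centers, f1: (N,) context values,
    [m1, m2]: interval fuzzifier with m1, m2 > 1.
    """
    C = V.shape[0]
    # d[k, j] = ||X_k - V_j||; floored to avoid division by zero (assumption)
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)
    # ratio[k, j, i] = ||X_k - V_j|| / ||X_k - V_i||
    ratio = d[:, :, None] / d[:, None, :]
    u_m1 = f1[:, None] / (ratio ** (2.0 / (m1 - 1.0))).sum(axis=2)
    u_m2 = f1[:, None] / (ratio ** (2.0 / (m2 - 1.0))).sum(axis=2)
    # Quantity compared with 1/C in the conditions of Eqs. (28)-(29)
    test = f1[:, None] / ratio.sum(axis=2)
    upper = np.where(test >= 1.0 / C, u_m1, u_m2)   # Eq. (28)
    lower = np.where(test < 1.0 / C, u_m1, u_m2)    # Eq. (29)
    return lower, upper
```

When m1 = m2 the two bounds coincide and each row sums to the context value f1k, which is the constraint stated in Eqs. (26)–(27).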

After we have the interval of primary membership, the maximum (minimum) center VR (VL) and the related membership matrix U(1) (U(2)) are calculated by the same steps from Step 6 to 17. Specifically, in Step 8 index k0 in the range [1, N − 1] satisfying Eq. (30) will be selected as a pivot to calculate U(1)(l) in Eq. (31).

\[ X_{k_0 l} \le \frac{\sum_{j=1}^{C} v_{jl}^{(A)}}{C} \le X_{(k_0 + 1) l} \tag{30} \]

\[ U_{kj}^{(1)(l)} = \begin{cases} \underline{U}_{kj} & \text{if } k \le k_0 \\ \overline{U}_{kj} & \text{otherwise} \end{cases}, \quad (j = \overline{1, C},\ k = \overline{1, N}) \tag{31} \]

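Using the average operator of fuzzifier, center V(1) is calculated below.

\[ V_{ji}^{(1)} = \frac{\sum_{k=1}^{N} \left( U_{kj}^{(1)(l)} \right)^{(m_1 + m_2)/2} X_{ki}}{\sum_{k=1}^{N} \left( U_{kj}^{(1)(l)} \right)^{(m_1 + m_2)/2}}, \quad (j = \overline{1, C},\ i = \overline{1, r}) \tag{32} \]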
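Steps 7–9 for a single feature l can be sketched as follows. This is a hedged illustration, not the author's code. The function name `right_center_pass` is hypothetical. The sketch also assumes that points up to the pivot receive the lower membership and the remaining points the upper one when searching for the maximum center, and it returns the membership matrix in sorted order.

```python
import numpy as np

def right_center_pass(X, U_lower, U_upper, V_A, l, m1, m2):
    """One pass of Steps 7-9 on feature l (sketch, maximum-center search).

    X: (N, r) patterns, U_lower/U_upper: (N, C) interval memberships,
    V_A: (C, r) current centers. Returns results in sorted-point order.
    """
    N = X.shape[0]
    order = np.argsort(X[:, l], kind="stable")          # Step 7: sort by feature l
    Xs, lo, up = X[order], U_lower[order], U_upper[order]
    pivot = V_A[:, l].mean()                             # middle term of Eq. (30)
    k0 = N - 1                                           # Step 8 fallback
    for k in range(1, N):                                # k is a 1-based index
        if Xs[k - 1, l] <= pivot <= Xs[k, l]:
            k0 = k
            break
    ranks = np.arange(1, N + 1)[:, None]
    U1 = np.where(ranks <= k0, lo, up)                   # Eq. (31), assumed bar order
    m = (m1 + m2) / 2.0                                  # average fuzzifier
    W = U1 ** m
    V1 = (W.T @ Xs) / W.sum(axis=0)[:, None]             # Eq. (32)
    return order, k0, U1, V1
```

Swapping the roles of `lo` and `up` in the `np.where` call gives the corresponding pass for the minimum center.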

Next, in Step 10 we check whether V(1) = V(A) or not. If this condition holds, we conclude that the maximum center VR = V(1) and the related membership matrix U(1) is found in Eq. (33).

\[ U^{(1)} = \frac{\sum_{l=1}^{r} U^{(1)(l)}}{r} \tag{33} \]

Otherwise, we make another loop with the next feature l in the pattern set. By the similar process, in Step 18 we can compute the minimum center VL and the related membership matrix U(2) where Eqs. (31) and (33) are replaced with (34) and (35), respectively.

\[ U_{kj}^{(2)(l)} = \begin{cases} \overline{U}_{kj} & \text{if } k \le k_0 \\ \underline{U}_{kj} & \text{otherwise} \end{cases}, \quad (j = \overline{1, C},\ k = \overline{1, N}) \tag{34} \]

\[ U^{(2)} = \frac{\sum_{l=1}^{r} U^{(2)(l)}}{r} \tag{35} \]

\[ U^{(C)} = \frac{U^{(1)} + U^{(2)}}{2} \tag{36} \]

Fig. 1. The mechanism of CFGWC2.

From these related membership matrices, Step 19 obtains the membership degree of traditional fuzzy sets (a.k.a. type-1) through Eq. (36). This process is called the type-reduction and used to calculate the population of each cluster. Step 20 calculates the population of each cluster by this rule:

\[ \text{If } U_{kj}^{(C)} > U_{ki}^{(C)} \text{ and } i \ne j \text{ then } X_k \text{ is assigned to cluster } j, \quad (k = \overline{1, N};\ i = \overline{1, C}) \tag{37} \]

Based on the population, Step 21 determines the geographical weights of all areas by Eq. (3), and the modification of membership degree following by geo-characteristics is performed through Eqs. (2), (3) and (38)–(40).

\[ U^{G}(x) = G(U^{(C)}(x)) = \left[ \underline{U}_{kj}^{G}, \overline{U}_{kj}^{G} \right], \quad (j = \overline{1, C},\ k = \overline{1, N}) \tag{38} \]

\[ \underline{U}_{kj}^{G} = \alpha \times U_{kj}^{(2)} + \beta \times \frac{1}{A} \sum_{i=1}^{C} w_{ji} \times U_{ki}^{(2)} \tag{39} \]

\[ \overline{U}_{kj}^{G} = \alpha \times U_{kj}^{(1)} + \beta \times \frac{1}{A} \sum_{i=1}^{C} w_{ji} \times U_{ki}^{(1)}, \quad (i, j = \overline{1, C},\ i \ne j,\ k = \overline{1, N}) \tag{40} \]

Notice that parameter A in Eqs. (39) and (40) is a factor to scale the "sum" term and is calculated across all clusters, ensuring that the sum of the memberships for a given area k for all clusters is equal to the context value f1k (k = 1, N). Step 22 performs the type-reduction for the modified membership degree and calculates new center V(2) by Eqs. (41) and (42), respectively.

\[ U_{kj}^{GT} = \frac{\underline{U}_{kj}^{G} + \overline{U}_{kj}^{G}}{2}, \quad (j = \overline{1, C},\ k = \overline{1, N}) \tag{41} \]

\[ V_{ji}^{(2)} = \frac{\sum_{k=1}^{N} \left( U_{kj}^{GT} \right)^{(m_1 + m_2)/2} X_{ki}}{\sum_{k=1}^{N} \left( U_{kj}^{GT} \right)^{(m_1 + m_2)/2}}, \quad (j = \overline{1, C},\ i = \overline{1, r}) \tag{42} \]
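The type-reduction and center update of Eqs. (41)–(42), together with the assignment rule of Eq. (37), can be sketched in a few lines of NumPy. The function names below are hypothetical and the sketch only illustrates the formulas. In particular, ties in the assignment rule are broken in favour of the lowest cluster index, which the text does not specify.

```python
import numpy as np

def type_reduce(U_lower, U_upper):
    """Average the interval bounds into one type-1 matrix (form of Eqs. (36), (41))."""
    return (U_lower + U_upper) / 2.0

def assign_clusters(U):
    """Eq. (37): each point goes to the cluster with the largest membership.

    np.argmax returns the first maximum, so ties go to the lowest index (assumption).
    """
    return np.argmax(U, axis=1)

def crisp_centers(U, X, m1, m2):
    """Eq. (42): weighted mean of the patterns with exponent (m1 + m2) / 2."""
    m = (m1 + m2) / 2.0
    W = U ** m                                  # (N, C) weights
    return (W.T @ X) / W.sum(axis=0)[:, None]   # (C, r) centers
```

For example, `crisp_centers(type_reduce(U_lower, U_upper), X, m1, m2)` returns a C × r array of centers, and `np.bincount(assign_clusters(U), minlength=C)` gives the population of each cluster.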

Now, we have modified membership degree UG and crisp center V(2). Since we work on IT2FS, V(2) should be an interval containing the minimum and maximum centers VL, VR. This work is done through Steps 23 and 24. In order to verify whether the outputted centers are the solution or not, Step 25 performs the defuzzification for the interval center as in Eq. (43) and gets the crisp one V(3). This center is used to check the stopping condition described in Step 26.

\[ V^{(3)} = \begin{cases} V_L & \text{if } \| V_L - V^{(0)} \| \le \| V_R - V^{(0)} \| \\ V_R & \text{otherwise} \end{cases} \tag{43} \]
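A minimal sketch of Steps 25–26 is given below. It assumes that the norm in Eq. (43) and in the stopping condition is the Frobenius norm of the C × r center matrix, which the text does not state explicitly. The function names are hypothetical.

```python
import numpy as np

def defuzzify(V_L, V_R, V_0):
    """Eq. (43): keep the interval endpoint that is closer to the previous center."""
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm (assumption)
    if np.linalg.norm(V_L - V_0) <= np.linalg.norm(V_R - V_0):
        return V_L
    return V_R

def has_converged(V_3, V_0, eps):
    """Stopping test of Step 26: ||V(3) - V(0)|| <= eps."""
    return bool(np.linalg.norm(V_3 - V_0) <= eps)
```

In the outer loop of Table 2, the iteration would stop as soon as `has_converged` returns True or the step counter reaches MaxStep.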

In order to avoid unstoppable iteration, we limit the maximal number of iteration steps to MaxStep. If the number of iteration steps exceeds this threshold, the Context-FGWC2 procedure will stop immediately. Once the stopping condition holds, we receive the type-2 membership degree UG and the interval center [VL, VR]. The crisp center V(3) and the distribution of pattern set after clustering can be extracted from them. (UG, V(3)) are the output of Context-FGWC2, and the crisp center V(3) is denoted in Fig. 1 as V(1) (Machine 2) and V(2) (Machine 3).

The work of Context-FGWC2 in Machine 3 is analogous to that in Machine 2, except that the maximal number of iteration steps in Machine 3 is equal to half of that in Machine 2 (∼MaxStep/2). The reason for this alteration lies in the synchronization process. Specifically, the results in Machine 2 and 3 are transferred to Machine 1 after completion, so if a machine takes too much time to generate the outputs, it will cause a large delay for the overall system. Because the initial center of Machine 3 is somehow better than that of Machine 2, the convergence may be faster and is not affected by the number of iteration steps. In practice, the number of machines can be reduced; for instance, the work of Machine 1 can be assigned to one of the two remaining machines. Because it takes much time to transfer data between machines, it is better if we can decrease the waiting time. If so, the number of transferred steps between machines is reduced by half and the overall processing time is reduced remarkably.

The advantages of CFGWC2 are four-fold: Firstly, it is capable of handling bad initialization and immature convergence by the PSO procedure; secondly, the clustering results focus on the users' purposes by the interval context; thirdly, the computing speed of CFGWC2 is ameliorated through the interval context and the parallel computing mechanism; fourthly, the most important advantage of CFGWC2 is the high clustering quality in comparison with some relevant methods since this algorithm was deployed on IT2FS, which is more general and able to handle the existing limitations of the traditional fuzzy sets. The disadvantage of CFGWC2 could be the computational costs and its complex activities. Nevertheless, by employing some additional techniques we hope that the disadvantages could be ameliorated, and CFGWC2 achieves good clustering results.

Results

Experimental environment

This section describes the experimental environment used in the next ones.

• Experimental tools: We have implemented the proposed algorithm (CFGWC2) in addition to these algorithms: NE [13], FGWC [24] and CFGWC [35] in MPI/C programming language and executed them on a Linux Cluster 1350 with eight computing nodes of 51.2 GFlops. Each node contains two Intel Xeon dual core 3.2 GHz, 2 GB Ram. The experimental results are taken as the average values after 10 runs.
• Cluster validity: We use PCAES validity function described in Eqs. (21)–(25).
• Dataset: We use two kinds of datasets below.
  - A real dataset of socio-economic demographic variables from United Nation Organization (UNO) [39] containing the statistic about population of 230 countries over ten years (2001–2010). Missing data were processed by Binning method [16]. The two-dimensional distribution is illustrated in Fig. 2.
  - A benchmark demographic dataset from The University of Edinburgh, Scotland (Fig. 3) including expression levels of 2880 genes taken in 11 different areas [7]. This dataset was used in many different research papers on gene expression by geographical factors such as in [4,5].
• Objective: We compare the clustering quality of CFGWC2 with those of other algorithms through PCAES index. Additionally, the evaluation about the computational times of these algorithms is also mentioned.

Fig. 2. The two-dimensional distribution of UNO dataset.

Fig. 3. The two-dimensional distribution of Colon Cancer dataset.

Table 3. PCAES values of all algorithms in Case 1 on UNO dataset.

     m = 1.5                                        m = 2.0
C    CFGWC2      CFGWC     FGWC       NE            CFGWC2      CFGWC     FGWC       NE
2    1091.30832  11.49441  106.87815  106.87815     730.86493   15.80779  107.95304  107.95304
3    3508.71041  14.20249  102.97090  103.08807     1764.55205  15.48401  104.51216  104.62430
4    1026.1004   9.66077   101.00239  101.05883     1882.45315  9.60082   102.01264  102.07279
5    851.56196   13.83029  98.86012   98.89076      828.00298   20.09243  98.70007   98.73446
6    734.85210   23.45840  105.61367  105.11415     713.06259   13.36007  106.82538  95.32594

     m = 2.5                                        m = 3.0
C    CFGWC2      CFGWC     FGWC       NE            CFGWC2      CFGWC     FGWC       NE
2    435.14908   15.35085  110.80574  110.80576     222.59648   14.84918  111.54395  111.54397
3    699.52639   17.05059  112.36477  112.46454     448.65676   18.15664  121.39454  121.45259
4    758.04253   12.13725  111.70188  111.77472     530.12028   15.16747  123.22859  123.30832
5    729.73602   13.80425  109.59175  109.64291     544.21607   17.33470  122.96865  123.03807
6    660.41492   21.53153  107.14039  107.19830     534.99351   18.78905  122.06920  123.31178

Fig. 4. Average PCAES of algorithms on UNO dataset by fuzzifiers.

Evaluation by various case studies

In this section, we evaluate the proposed algorithm in comparison with the relevant methods by various case studies about the parameters of algorithms. Main findings are found below.

Case 1. In this case, some parameters of these algorithms are set up as below.

- The default geo-characteristics are: a = b = 1, α = 0.7, β = 0.3. These values determine the geo-modification process stated in Eqs. (1)–(3). Our previous work [35] suggested using α value ≥ 0.6 in order to increase the clustering quality.

- We use the default context values in [35] for CFGWC algorithm below.

f = (f1, f2, .., fN), where fi = 0 if k = 0 and fi = rand(0, 1)2k otherwise, k = i mod 4, i = 1, N. (44)

- In CFGWC2, m2 = 2 × m1 = 2 × m where m is the fuzzifier of NE, FGWC and CFGWC. The interval context is f = [f1, f2] where f1 = f and f2 = 1. A broad interval of fuzzifiers and contexts will create more distinct results than a narrow one.

- In PSO, MaxStep PSO = 100 and population size is 500. Other parameters are (c1, c2, c3) = (0.2, 0.3, 0.5) and (�1, �2) = (1, 1). As suggested by Thien et al. [38], these values will make the convergence to the optimum faster.

- Threshold ε and MaxStep of all algorithms are 10^−3 and 500, respectively.

Table 3 describes the PCAES values of all algorithms on UNO dataset. The experiments are performed with different values of the number of clusters and fuzzifiers. Results show that PCAES values of CFGWC2 are the largest among all. This means that the clustering quality of CFGWC2 is better than those of other algorithms. In order to comprehend the experimental results, we illustrate the PCAES values of all algorithms through various cases of fuzzifiers in Fig. 4. From this figure, we recognize that PCAES values of CFGWC2 are larger than those of other algorithms.

For example, PCAES of CFGWC2 in Fig. 4 is 13 times greater than that of FGWC when m = 1.5. These numbers in cases of NE and CFGWC are 14 and 99 times, respectively. Similarly, when m = 3.0, PCAES of CFGWC2 is still larger than those of other algorithms, i.e. 3.79 (FGWC), 3.78 (NE) and 27 times (CFGWC). These evidences confirm that the clustering quality of CFGWC2 is the best among all. Nonetheless, PCAES values of CFGWC2 tend to decrease when the fuzzifier increases. For instance, PCAES values of CFGWC2 from m = 1.5 to m = 3.0 are 1442, 1183, 656 and 456, respectively. The average reducing ratio per half of a fuzzifier is 31%. This means that each time the value of fuzzifier is increased by 0.5, PCAES value of CFGWC2 is reduced by 31 percent on average. On the other hand, the average PCAES values of other algorithms seem to be stable through different values of fuzzifier, i.e. 109 (FGWC), 108 (NE) and 15 (CFGWC). By rough calculation, we can easily find the value of fuzzifier that makes PCAES value of CFGWC2 smaller than those of other algorithms, i.e. m ≥ 5.0. This fact tells us that CFGWC2 should be used when the fuzzifier is small. When designing the FCM algorithm, Bezdek et al. [3] stated that the fuzzifier should be from 1.5 to 2.5, ideally m = 2.0, for the sake of optimal centers found by the algorithm. Thus, we may see that some cases such as m ≥ 5.0 will never happen in practical applications. However, this finding may be useful for us to choose the appropriate value of parameters. Is there any change of the order of algorithms in terms of PCAES values by different values of number of clusters? Following Table 3, the answer is absolutely no. For a given number of clusters, PCAES value of CFGWC2 is always larger than those of other algorithms. Indeed, this shows the stability of the proposed algorithm.

Table 4. The computational time of all algorithms in Case 1 on UNO dataset (s).

     m = 1.5                         m = 2.0
C    CFGWC2  CFGWC  FGWC  NE         CFGWC2  CFGWC  FGWC  NE
2    7.68    0.04   0.04  0.03       10.165  0.04   0.04  0.04
3    14.55   0.03   0.09  0.11       14.31   0.04   0.10  0.13
4    12.94   0.07   0.08  0.12       12.86   0.08   0.11  0.14
5    11.14   0.07   0.16  0.12       17.49   0.07   0.17  0.14
6    20.94   0.07   0.24  0.19       24.56   0.11   0.30  0.22

     m = 2.5                         m = 3.0
C    CFGWC2  CFGWC  FGWC  NE         CFGWC2  CFGWC  FGWC  NE
2    5.23    0.03   0.04  0.03       10.06   0.04   0.04  0.04
3    14.98   0.04   0.08  0.15       15.40   0.06   0.09  0.12
4    15.96   0.09   0.17  0.21       18.06   0.11   0.19  0.17
5    17.57   0.11   0.19  0.19       22.02   0.27   0.23  0.18
6    24.82   0.17   0.31  0.36       24.87   0.23   0.36  0.30
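The averages and the 31% figure quoted for CFGWC2 in Case 1 can be checked from the CFGWC2 columns of Table 3 with a short script. The text does not give the formula for the "average reducing ratio", so the sketch below assumes it is the mean of the relative decreases between consecutive fuzzifier values. Under that reading, the per-fuzzifier averages truncate to 1442, 1183, 656 and 456 and the ratio comes out at roughly 31%.

```python
# CFGWC2 columns of Table 3 (Case 1, UNO dataset), keyed by fuzzifier m
cfgwc2_case1 = {
    1.5: [1091.30832, 3508.71041, 1026.1004, 851.56196, 734.85210],
    2.0: [730.86493, 1764.55205, 1882.45315, 828.00298, 713.06259],
    2.5: [435.14908, 699.52639, 758.04253, 729.73602, 660.41492],
    3.0: [222.59648, 448.65676, 530.12028, 544.21607, 534.99351],
}

def column_means(table):
    """Average PCAES over the number of clusters, ordered by fuzzifier."""
    return [sum(values) / len(values) for _, values in sorted(table.items())]

def mean_step_reduction(values):
    """Mean relative decrease between consecutive entries (assumed definition)."""
    steps = [(a - b) / a for a, b in zip(values, values[1:])]
    return sum(steps) / len(steps)

means = column_means(cfgwc2_case1)
ratio = mean_step_reduction(means)
print([int(v) for v in means], round(100 * ratio, 1))
```

The same two helpers can be applied to the CFGWC2 columns of any other PCAES table in this section.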

The computational time of all algorithms for exporting the results in Table 3 is described in Table 4. Clearly, the computational time of CFGWC2 is longer than those of other algorithms.

When m = 3.0, the average computational times of CFGWC2, FGWC, NE and CFGWC are 18.1, 0.182, 0.162 and 0.142 s, respectively. Similar results are obtained in m = 2.0 and m = 2.5. As we may see in the pseudo-code of Context-FGWC2, it requires huge computation to process the interval membership matrix. By using some additional techniques to speed up this algorithm, the computational time of CFGWC2 is reduced remarkably. The maximal (minimal) computational time of CFGWC2 in Table 4 is 24.87 (5.23) s. With the increasing of computing powers nowadays, the computational cost in this case is acceptable. Table 4 also gives us the average increment levels of the computational time of algorithms per fuzzifier. Each time the fuzzifier is increased by one unit, the computational time of CFGWC2 is increased by 16.8 percent. The percent values of FGWC, CFGWC and NE are 29.5%, 57% and 64.9%, respectively. When the fuzzifier is large enough, these times could approximate one another.

Now, we evaluate the proposed algorithm on a larger dataset than UNO. In Fig. 5, we measure the average PCAES values of all algorithms on Colon Cancer dataset by fuzzifiers. The results show that PCAES values of CFGWC2 are larger than those of other algorithms. For example, when m = 1.5, the average PCAES value of CFGWC2 is 1.13 times larger than that of CFGWC. These numbers in cases of FGWC and NE are 2.2 and 2.19 times, respectively. Similarly, when m = 3.0, the average PCAES of CFGWC2 is 1.32 times, 1.15 times and 1.16 times larger than those of CFGWC, FGWC and NE, respectively. These evidences confirm that the clustering quality of CFGWC2 is the best among all even on a large dataset such as Colon Cancer. Nonetheless, PCAES values of CFGWC2 and other algorithms tend to decrease when the fuzzifier increases. The values of CFGWC2 from m = 1.5 to m = 3.0 are 48.77, 34.18, 26.95 and 22.94, respectively. This result is similar to that on the UNO dataset and shows that we should choose a small value of fuzzifier in this case in order to obtain good clustering quality of CFGWC2. Even when PCAES values of CFGWC2 reduce, they are still better than those of other algorithms. The average PCAES value of CFGWC2 is approximately 1.4 times larger than those of other algorithms through various cases of fuzzifiers. This means that when the fuzzifier increases, PCAES values of both CFGWC2 and other algorithms reduce, but the values of CFGWC2 are still larger than those of other algorithms. However, small PCAES values of CFGWC2 in cases of large fuzzifier are not a good choice for us, and we should keep the fuzzifier as small as possible.

In Fig. 6, we verify whether or not PCAES values of CFGWC2 are larger than those of other algorithms by the number of clusters. This figure clearly points out that the line of PCAES values of CFGWC2 is higher than those of other algorithms. The starting point of all lines (C = 2) shows that PCAES values of the algorithms are close to one another, i.e. 7.87 (CFGWC2), 8.67 (CFGWC), 7.182 (FGWC) and 7.184 (NE). However, the differences between those lines become obvious when the number of clusters increases. For example, when C = 3, PCAES values of CFGWC2, CFGWC, FGWC and NE are 23.4, 19.3, 16.67 and 16.62, respectively. When C = 6, the difference between CFGWC2 and other algorithms is maximal since the amplitudes of those lines expand. PCAES values of those algorithms in this case of clusters are 56.2, 47.5, 33.8 and 33.2, respectively. Thus, three remarks are extracted from this figure: (i) the clustering quality of CFGWC2 is the best even when all algorithms are tested by the number of clusters; (ii) the higher the number of clusters is, the larger PCAES value of CFGWC2 is; (iii) the value of fuzzifier should be inversely proportional to that of the number of clusters for the sake of high PCAES values of CFGWC2 as shown in Figs. 5 and 6.

In Fig. 7, we verify the changes of PCAES values of CFGWC2 by fuzzifiers on various datasets. Clearly, PCAES values on a large dataset (Colon Cancer) are much smaller than those on a small dataset (UNO). For example, the average PCAES values of CFGWC2 on UNO and Colon Cancer are 1442 and 48.77, respectively when m = 1.5. Similar results can be seen when m = 3.0 with PCAES values on UNO and Colon Cancer being 456 and 22.94, respectively. Thus, two remarks are found from this test: Firstly, the sizes of inputted datasets should be small or medium for the high PCAES values of CFGWC2; secondly, the changes of PCAES values through various fuzzifiers on a large dataset are smaller than those on a small one.

Running on a large dataset such as Colon Cancer results in high computational time of CFGWC2 as shown in Fig. 8. This figure compares the average computational time of CFGWC2 on UNO and Colon Cancer datasets by fuzzifiers. The average processing time of CFGWC2 per fuzzifier on Colon Cancer is 418 s whilst that processing time on UNO is 15.7 s. From this result, we should consider the first remark about small or medium inputted datasets when running CFGWC2 algorithm.

The major remark in this case is the confirmation of the best clustering quality of CFGWC2 among all.

Fig. 5. Average PCAES of algorithms on Colon Cancer dataset by fuzzifiers.

Fig. 6. Average PCAES of algorithms on Colon Cancer dataset by number of clusters.

Fig. 7. Changes of PCAES values of CFGWC2 by fuzzifiers on various datasets.

Fig. 8. Average computational time of CFGWC2 on UNO and Colon Cancer datasets.

Case 2. In Case 2, we make some changes of the parameters of all algorithms. Specifically, geo-characteristics are α = 0.4 and β = 0.6. Other parameters are kept intact as in Case 1. The aim is to verify whether the clustering quality of the proposed algorithm is better than that of others or not when α value (geographic parameter) is smaller than that of Case 1.

The results in Table 5 show that PCAES values of CFGWC2 are still the largest among all algorithms. For example, when m = 1.5, the average PCAES value of CFGWC2 is 959. It is 9.42, 9.44 and 28.4 times larger than those of NE, FGWC and CFGWC, respectively. Similar results are found with the three remaining cases of fuzzifier in which the PCAES values of CFGWC2 are still larger than those of other algorithms. Thus, the change of geographic parameters does not affect the outcome results of algorithms. Now, we investigate the impact of reducing the value of parameter α on PCAES values of all algorithms. Firstly, the average PCAES values of CFGWC2 per the number of clusters do not reduce when the fuzzifier increases. For example, these values in cases from m = 1.5 to m = 3.0 are 959, 877, 1144 and 696, respectively. In Table 3, we got a remark that CFGWC2 should be used when the fuzzifier is small. Nonetheless, it does not hold in this case since the reduction of α value will increase the change of the membership degree of an area following by other ones' as shown in Eq. (1). As a result, PCAES value does not depend on the fuzzifier. This fact shows that the changes of geographic parameters can help us reduce the dependence of CFGWC2 on the fuzzifier. Secondly, the average PCAES values of CFGWC2 in this case are smaller than those in the previous one when m ≤ 2.0 and are larger than those in the previous one for the rest. For example, PCAES values of CFGWC2 in Case 1 and Case 2 when m = 1.5 are 1442 and 959, respectively. Nonetheless, these values in case of m = 3.0 are 456 and 696, respectively. This means that reducing α value will decrease the clustering quality of CFGWC2. Nevertheless, the reducing ratio of PCAES values is not as large as that of the previous case when the fuzzifier increases. Each time the fuzzifier is increased by 0.5, PCAES values of CFGWC2 in Case 1 and Case 2 are reduced by 31% and 5.76%, respectively. This explains why PCAES values of CFGWC2 in Case 2 are larger than those in Case 1 when m > 2.0. Thus, we should set the value of fuzzifier m > 2.0 when α value decreases or α < 0.5 for the large PCAES values in CFGWC2 algorithm. Finally, the difference of PCAES values between CFGWC2 and other algorithms in Case 2 is smaller than that in Case 1. The maximal difference in Case 2 is recorded at m = 1.5 when the average PCAES value of CFGWC2 is 9.42, 9.44 and 28.4 times greater than NE, FGWC and CFGWC, respectively. In Case 1, the maximal difference is also recorded at m = 1.5 when the average PCAES value of CFGWC2 is 14, 13 and 99 times larger than NE, FGWC and CFGWC, respectively. The minimal difference in Case 2 is (6.25, 6.26, 11.67) for the list above. These numbers in Case 1 are (3.78, 3.79, 27.05), respectively. Thus, the reduction of α value makes the difference of PCAES values between algorithms small.

Table 5. PCAES values of all algorithms in Case 2 on UNO dataset.

     m = 1.5                                        m = 2.0
C    CFGWC2      CFGWC     FGWC       NE            CFGWC2      CFGWC     FGWC       NE
2    1063.54223  20.3629   106.61419  106.61419     856.68444   20.06651  107.24664  107.24664
3    1159.25575  33.53309  102.97252  103.20172     1070.81389  36.77730  103.81761  104.03909
4    999.55488   31.28929  101.34948  101.45983     974.06185   36.35071  101.77679  101.89110
5    883.39827   30.07119  99.71663   99.77627      823.12417   52.03082  99.33932   99.40264
6    691.62333   53.32656  97.22073   97.75520      664.32020   60.78891  96.34740   96.57951

     m = 2.5                                        m = 3.0
C    CFGWC2      CFGWC     FGWC       NE            CFGWC2      CFGWC     FGWC       NE
2    617.29692   20.00514  109.12593  109.12593     427.75450   19.79795  110.32253  110.32243
3    2612.09686  42.52125  108.10623  108.30283     974.07089   48.02871  112.83833  112.97839
4    890.07623   49.62835  106.85262  106.98115     755.75772   62.70671  112.48978  112.62659
5    813.10817   67.13987  104.97583  105.06321     691.34934   80.02289  111.26262  111.37412
6    790.12919   78.78573  102.57698  102.95150     632.35675   87.51471  108.95160  109.66323

The computational times of algorithms on UNO dataset in this case are described in Table 6. Similar to the previous case, the computational time of CFGWC2 is larger than those of other algorithms. Nonetheless, the average computational times of CFGWC2 through various fuzzifiers are smaller than those of Case 1. From m = 1.5 to m = 3.0, these values in Case 1 and Case 2 are (13.45, 15.87, 15.71, 18.08) and (11.42, 12.73, 12.67, 17), respectively. Therefore, reducing α value makes CFGWC2 run faster.

Table 6. Computational time of all algorithms in Case 2 on UNO dataset (s).

     m = 1.5                         m = 2.0
C    CFGWC2  CFGWC  FGWC  NE         CFGWC2  CFGWC  FGWC  NE
2    9.00    0.01   0.02  0.03       4.37    0.02   0.03  0.04
3    13.30   0.02   0.07  0.09       15.17   0.02   0.06  0.29
4    8.94    0.03   0.07  0.12       7.69    0.04   0.07  0.13
5    9.75    0.06   0.06  0.09       11.28   0.05   0.14  0.15
6    16.12   0.09   0.19  0.12       25.16   0.05   0.11  0.15

     m = 2.5                         m = 3.0
C    CFGWC2  CFGWC  FGWC  NE         CFGWC2  CFGWC  FGWC  NE
2    5.63    0.03   0.03  0.04       9.48    0.03   0.03  0.04
3    11.50   0.04   0.08  0.17       14.57   0.03   0.07  0.10
4    9.78    0.05   0.08  0.09       13.17   0.06   0.09  0.09
5    11.02   0.09   0.10  0.12       22.22   0.18   0.10  0.11
6    25.42   0.12   0.12  0.13       25.56   0.23   0.12  0.16

Fig. 9. Average PCAES of algorithms in Case 2 on Colon Cancer dataset by fuzzifiers.

In Fig. 9, we verify the effectiveness of CFGWC2 on Colon Cancer dataset by comparing the average PCAES values of all algorithms by fuzzifiers. This figure clearly shows that PCAES values of CFGWC2 are larger than those of other algorithms. The maximal difference of PCAES values between those algorithms is recorded at m = 2.0 when the average PCAES value of CFGWC2 is 4.87 times, 4.67 times and 4.93 times larger than those of CFGWC, FGWC and NE, respectively. The minimal difference is at m = 3.0 when those equivalent values are 2.28, 2.11 and 2.19 times. PCAES values of CFGWC, FGWC and NE are close to one another in this case with the domain of values belonging to the interval [22.18, 25.49] as shown in the figure. Obviously, the clustering quality of CFGWC2 is still the best among all even though some changes of geographic parameters and datasets have been done.

In Fig. 10, we study the changes of average PCAES values of CFGWC2 with different datasets and cases. The aim of this test is to investigate the impact of geographic parameters and datasets on PCAES values of CFGWC2. Results show that PCAES values of CFGWC2 in this case are larger than those of Case 1 of Colon Cancer dataset. For example, when m = 1.5, the average PCAES values of CFGWC2 in Case 2 and Case 1 are 79.6 and 48.7, respectively. In m = 2.0, the difference of PCAES between those cases is maximal with PCAES values being 119 (Case 2) and 34.1 (Case 1). This means that the change of geographic parameter, especially reducing the value of α, enhances PCAES values of the proposed algorithm. Nevertheless, PCAES values of CFGWC2 on Colon Cancer dataset are much smaller than those on UNO. When m = 2.5, the average PCAES value of CFGWC2 in Case 2 of UNO dataset is 1144.5. These values in Case 1 and Case 2 of Colon Cancer are 26.95 and 60.27, respectively. Similar results are found with other cases of fuzzifiers. Obviously, using small datasets obtains better PCAES values of CFGWC2 than large ones.

Is there any change of computational time of CFGWC2 with different cases and datasets? Fig. 11 helps us answer this question by drawing three lines representing the computational time of CFGWC2 in Case 2 of Colon Cancer dataset (gray line), in Case 2 of UNO (blue, dot line) and in Case 1 of Colon Cancer dataset (green, double dot line). This figure states that using low values of geographic parameters (α) in CFGWC2 reduces the computational time of this algorithm. The proof for this consideration is that the line of "Case 2 – Colon Cancer" is always lower than that of "Case 1 – Colon Cancer". However, the "Case 2 – Colon Cancer" line is much higher than that of "Case 2 – UNO". Since the size of Colon Cancer dataset is 14 times larger than that of UNO, this increases the computational time of CFGWC2 as shown in the figure. Even in this situation, the computational time of CFGWC2 is not much higher than those of other algorithms in this case because these times increase concurrently. Thus, the computational time of CFGWC2 is acceptable in this situation.

In short, the changes of geographic parameters in this case do not affect the order of algorithms in terms of clustering quality, and the clustering quality of CFGWC2 is proved to be the best among all.

Case 3. In this case, we narrow the interval context and the interval fuzzifier. Specifically, the interval fuzzifier of CFGWC2 is [m1, m2] = [m, 1.5 × m] where m is the fuzzifier of NE, FGWC and CFGWC. The interval context is f = [f1, f2] where f2 (f1) is the maximal (minimal) value between the function in Eq. (44) and the standard Gaussian function in Eq. (45). Other parameters are kept intact as in Case 1.

\[ f = (f_1, f_2, \ldots, f_N), \quad \text{where } f_i = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} i^2}, \quad (i = \overline{1, N}) \tag{45} \]
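The construction of the interval context in Case 3 can be sketched as follows. The sketch evaluates the standard Gaussian function of Eq. (45) at i = 1, ..., N and takes the element-wise minimum and maximum against a second context vector. That second vector stands for the values produced by Eq. (44) and is passed in as an argument rather than generated here. The function names are hypothetical.

```python
import numpy as np

def gaussian_context(N):
    """Eq. (45): f_i = exp(-i^2 / 2) / sqrt(2 * pi) for i = 1, ..., N."""
    i = np.arange(1, N + 1, dtype=float)
    return np.exp(-0.5 * i ** 2) / np.sqrt(2.0 * np.pi)

def interval_context(f_a, f_b):
    """Element-wise (minimal, maximal) context values, used as [f1, f2] in Case 3."""
    f_a = np.asarray(f_a, dtype=float)
    f_b = np.asarray(f_b, dtype=float)
    return np.minimum(f_a, f_b), np.maximum(f_a, f_b)
```

With `f44` holding the N context values of Eq. (44), `interval_context(f44, gaussian_context(len(f44)))` returns the pair (f1, f2) described above.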


Fig. 10. Changes of PCAES values of CFGWC2 in Case 2 with different datasets & cases.

Fig. 11. Changes of computational time of CFGWC2 in Case 2 by datasets & cases.

Table 7. PCAES values of all algorithms in Case 3 on UNO dataset.

| C | CFGWC2 (m = 1.5) | CFGWC (m = 1.5) | FGWC (m = 1.5) | NE (m = 1.5) | CFGWC2 (m = 2.0) | CFGWC (m = 2.0) | FGWC (m = 2.0) | NE (m = 2.0) |
|---|---|---|---|---|---|---|---|---|
| 2 | 139.95086 | 6.04220 | 106.87815 | 106.87815 | 5258.34615 | 5.76239 | 107.95304 | 107.95304 |
| 3 | 568.21771 | 51.29265 | 102.97089 | 103.08807 | 15,285.74240 | 52.38541 | 104.51216 | 104.62430 |
| 4 | 448.02083 | 225.29319 | 101.00240 | 101.05884 | 292.73635 | 171.86304 | 102.01262 | 102.07281 |
| 5 | 988.99686 | 326.01488 | 98.86013 | 98.89074 | 1098.36009 | 134.54998 | 98.70013 | 98.73448 |
| 6 | 6640.59369 | 259.32112 | 104.96678 | 105.43589 | 7153.92664 | 1286.04887 | 95.29829 | 95.32741 |

| C | CFGWC2 (m = 2.5) | CFGWC (m = 2.5) | FGWC (m = 2.5) | NE (m = 2.5) | CFGWC2 (m = 3.0) | CFGWC (m = 3.0) | FGWC (m = 3.0) | NE (m = 3.0) |
|---|---|---|---|---|---|---|---|---|
| 2 | 1577.22397 | 8.16377 | 110.80570 | 110.80575 | 499.29655 | 7.53170 | 111.54397 | 111.54390 |
| 3 | 365.63816 | 53.58911 | 112.36477 | 112.46454 | 1861.65345 | 63.74519 | 121.39453 | 121.45250 |
| 4 | 478.69445 | 353.06716 | 111.70188 | 111.77475 | 2435.806574 | 415.55560 | 123.22859 | 123.30832 |
| 5 | 15,865.95189 | 376.79995 | 109.59178 | 109.64291 | 4064.86640 | 285.29656 | 122.96855 | 123.03813 |
| 6 | 617.06103 | 165.71077 | 107.14640 | 107.19815 | 15,167.8462 | 332.15287 | 121.56386 | 121.64574 |


Table 8. Computational time of all algorithms in Case 3 on UNO dataset (s).

| C | CFGWC2 (m = 1.5) | CFGWC (m = 1.5) | FGWC (m = 1.5) | NE (m = 1.5) | CFGWC2 (m = 2.0) | CFGWC (m = 2.0) | FGWC (m = 2.0) | NE (m = 2.0) |
|---|---|---|---|---|---|---|---|---|
| 2 | 4.83 | 0.01 | 0.04 | 0.03 | 5.18 | 0.01 | 0.05 | 0.04 |
| 3 | 12.84 | 0.02 | 0.09 | 0.14 | 9.85 | 13.12 | 0.08 | 0.43 |
| 4 | 13.95 | 12.97 | 0.10 | 0.15 | 15.72 | 15.79 | 0.13 | 0.18 |
| 5 | 18.27 | 0.02 | 0.19 | 0.19 | 18.85 | 12.42 | 0.14 | 0.15 |
| 6 | 17.09 | 13.37 | 0.29 | 0.20 | 19.49 | 14.27 | 0.44 | 0.37 |

| C | CFGWC2 (m = 2.5) | CFGWC (m = 2.5) | FGWC (m = 2.5) | NE (m = 2.5) | CFGWC2 (m = 3.0) | CFGWC (m = 3.0) | FGWC (m = 3.0) | NE (m = 3.0) |
|---|---|---|---|---|---|---|---|---|
| 2 | 5.59 | 0.01 | 0.03 | 0.05 | 7.89 | 0.01 | 0.04 | 0.06 |
| 3 | 11.93 | 0.02 | 0.09 | 0.13 | 10.51 | 0.03 | 0.09 | 0.14 |
| 4 | 15.47 | 13.24 | 0.14 | 0.20 | 17.52 | 13.34 | 0.16 | 0.30 |
| 5 | 17.22 | 11.52 | 0.16 | 0.18 | 15.96 | 12.45 | 0.17 | 0.23 |
| 6 | 20.11 | 13.71 | 0.43 | 0.32 | 19.70 | 13.88 | 0.33 | 0.30 |

The results of the algorithms with the new configuration are given in Tables 7 and 8. Table 7 reports PCAES values, whilst Table 8 shows the computational time of the algorithms. The PCAES values in Table 7 point out that the clustering quality of CFGWC2 is the best among all. With m = 1.5, the average PCAES values of (CFGWC2, CFGWC, FGWC and NE) are (1757, 173, 102, 103), respectively. Analogously, when m = 3.0, these values are (4805, 220, 120.13, 120.19), respectively. This clearly shows that CFGWC2 still obtains the best clustering quality among all even though the interval context and the interval fuzzifier have been narrowed.

Some changes of PCAES values of the algorithms in this case are highlighted here. Firstly, PCAES values of CFGWC2 tend to increase with the fuzzifier, although not monotonically. For example, when m = 1.5, the average PCAES of CFGWC2 is 1757. It peaks at 5817 when m = 2.0, is 3780 when m = 2.5, and rises again to 4805 when m = 3.0. This result is opposite to that of Case 1, where we remarked that the PCAES value of CFGWC2 tends to decrease when the fuzzifier increases. Thus, we should set a high value of the fuzzifier with the configuration in this case in order to get good clustering quality of CFGWC2. Secondly, we compare the average PCAES values of CFGWC2 in Table 7 with those in Table 3 and find that the values in Table 7 are much higher than those in Table 3. The pairs of PCAES values of CFGWC2 in (Table 3, Table 7) from m = 1.5 to m = 3.0 are (1442, 1757), (1183, 5817), (656, 3780) and (456, 4805), respectively. Indeed, narrowing the context and the fuzzifier really enhances the clustering quality of CFGWC2, as shown in the comparison above. Thirdly, the difference of PCAES between CFGWC2 and the other algorithms is smaller than that of Case 1. Besides, this difference is stable across the fuzzifiers. For example, the maximal difference between CFGWC2 and the other algorithms is recorded at m = 3.0, when the average PCAES value of CFGWC2 is 21 times, 40 times and 39 times larger than those of CFGWC, FGWC and NE, respectively. The minimal difference is recorded at m = 1.5, when the equivalent values are 10 times, 17 times and 17 times, respectively. Comparing those results with the ones in Case 1, we can recognize that narrowing the context and the fuzzifier in CFGWC2 results in a stable difference between CFGWC2 and the other algorithms.
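The averages and ratios quoted for m = 1.5 and m = 3.0 can be re-derived from the Table 7 entries. The following is a minimal sketch, assuming the averages are plain arithmetic means over C = 2..6; the names `TABLE7`, `column_means`, `averages` and `ratios` are ours.

```python
# PCAES values copied from Table 7 (UNO dataset, Case 3), one row per C = 2..6.
# Column order in every row: CFGWC2, CFGWC, FGWC, NE.
TABLE7 = {
    1.5: [
        (139.95086, 6.04220, 106.87815, 106.87815),
        (568.21771, 51.29265, 102.97089, 103.08807),
        (448.02083, 225.29319, 101.00240, 101.05884),
        (988.99686, 326.01488, 98.86013, 98.89074),
        (6640.59369, 259.32112, 104.96678, 105.43589),
    ],
    3.0: [
        (499.29655, 7.53170, 111.54397, 111.54390),
        (1861.65345, 63.74519, 121.39453, 121.45250),
        (2435.806574, 415.55560, 123.22859, 123.30832),
        (4064.86640, 285.29656, 122.96855, 123.03813),
        (15167.8462, 332.15287, 121.56386, 121.64574),
    ],
}


def column_means(rows):
    """Average each algorithm's PCAES over the numbers of clusters."""
    return [sum(column) / len(column) for column in zip(*rows)]


# Per-fuzzifier averages, then the ratio of CFGWC2 to each other algorithm.
averages = {m: column_means(rows) for m, rows in TABLE7.items()}
ratios = {m: [avg[0] / other for other in avg[1:]] for m, avg in averages.items()}

for m in sorted(averages):
    print(m, [round(v, 2) for v in averages[m]], [round(r, 1) for r in ratios[m]])
```

Truncating the computed means and ratios to integers gives the figures used in the text, which suggests the quoted values were truncated rather than rounded.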

Table 8 shows results similar to those of Table 4 in Case 1, in that CFGWC2 runs longer than the other algorithms. The maximal and minimal computational times of CFGWC2 are 20.11 and 4.83 s, respectively. Because these numbers are small, the computational cost of CFGWC2 is acceptable.

In Fig. 12, we illustrate the average PCAES of all algorithms on the Colon Cancer dataset by fuzzifiers. The PCAES line of CFGWC2 is visibly higher than those of the other algorithms, which shows that the clustering quality of CFGWC2 is the best among all. Besides, PCAES values of CFGWC2 do not decrease when the fuzzifier increases. This result is similar to that in Case 2 and opposite to that in Case 1. This evidence stresses that changing the geographic parameters, contexts and fuzzifiers can help CFGWC2 reduce its dependence on the fuzzifier.

Fig. 13 shows the changes of PCAES values of CFGWC2 by different datasets and cases. Similar to Case 2, we highlight the comparisons between the results in this case (Case 3 on Colon Cancer) and those in Case 1 on Colon Cancer and in Case 3 on the UNO dataset. The results point out that the average PCAES values of CFGWC2 in this case are much smaller than those in Case 3 on the UNO dataset. The maximal and minimal average PCAES values of CFGWC2 in Case 3 on UNO are 5817 and 1757, respectively, whereas those in this case are 60.1 and 46.3. The difference in PCAES between the two cases is large, even larger than that in Case 2 shown in Fig. 10. Thus, we recommend not using large datasets with the parameter configuration of this case, in order to avoid such small PCAES values of CFGWC2. Nonetheless, PCAES values of CFGWC2 in this case and in Case 1 on Colon Cancer are close to each other: Fig. 13 shows that the bars of these cases are nearly equal, and the maximal difference of PCAES between the two cases is 32.7. Compared with the equivalent results in Fig. 10, there is not much change in the PCAES value when the fuzzifiers and contexts are modified as in this case.

In Fig. 14, we examine the changes of computational time of CFGWC2 by different datasets and cases. The results show that the average computational time of CFGWC2 in this case is larger than that in Case 1 on Colon Cancer. This result is opposite to that of Case 2 and tells us that using the new interval contexts and fuzzifiers makes CFGWC2 run slower than the algorithm without these configurations. However, both “Case 3 – Colon Cancer” and “Case 1 – Colon Cancer” are much slower than “Case 3 – UNO”, which takes approximately 15 s on average to process a given value of the fuzzifier.

Experiments with the changes of context and fuzzifier in Case 3 re-confirm the superiority of CFGWC2 over the other algorithms in terms of clustering quality.

Case 4. The values of the interval context in Case 3 are near zero. In this case, we perform the experiment with another interval context whose values are near one.

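$$f = (f_1, f_2, \ldots, f_N), \quad \text{where } f_i = \begin{cases} 1 & \text{if } k = 0 \\ \dfrac{\mathrm{rand}(0, 1)}{2k} + \dfrac{1}{2} & \text{otherwise} \end{cases}, \quad (k = i \bmod 4,\ i = \overline{1, N}) \tag{46}$$

$$f = (f_1, f_2, \ldots, f_N), \quad \text{where } f_i = \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} i^2}, \quad (i = \overline{1, N}) \tag{47}$$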

The new interval context is defined as f = [f1, f2], where f2 (f1) is the maximal (minimal) value between the function in Eq. (46) and the modified Gaussian function in Eq. (47). The interval fuzzifier of CFGWC2 is still [m1, m2] = [m, 1.5 × m]. Other parameters are kept intact as in Case 1.
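As an illustration, the Case 4 context can be generated as follows. This sketch assumes that Eq. (46) reads rand(0, 1)/(2k) + 1/2 for k ≠ 0 (our reading of the typeset formula) and that rand(0, 1) is a uniform draw; the seeded generator and the function names `random_context` and `shifted_gaussian_context` are hypothetical choices made for reproducibility.

```python
import math
import random


def random_context(n, rng):
    """Eq. (46) as read here: 1 if k == 0, else rand(0, 1) / (2k) + 1/2, k = i mod 4."""
    values = []
    for i in range(1, n + 1):
        k = i % 4
        values.append(1.0 if k == 0 else rng.random() / (2 * k) + 0.5)
    return values


def shifted_gaussian_context(n):
    """Eq. (47): f_i = 1/2 + exp(-i^2 / 2) / sqrt(2 * pi) for i = 1..n."""
    return [0.5 + math.exp(-0.5 * i * i) / math.sqrt(2.0 * math.pi)
            for i in range(1, n + 1)]


rng = random.Random(0)  # hypothetical fixed seed, only to make the sketch repeatable
n = 8
ctx_random = random_context(n, rng)
ctx_gauss = shifted_gaussian_context(n)

# Interval context: f1 takes the element-wise minimum, f2 the element-wise maximum.
f1 = [min(a, b) for a, b in zip(ctx_random, ctx_gauss)]
f2 = [max(a, b) for a, b in zip(ctx_random, ctx_gauss)]
print(list(zip(f1, f2)))
```

Under this reading every bound lies in [0.5, 1], so the interval sits in the upper half of the unit range, in contrast to the Case 3 context.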

Table 9 describes PCAES values of all algorithms in Case 4 on the UNO dataset. The results affirm the remark from the previous cases that PCAES values of CFGWC2 are much larger than those of the other algorithms. The average PCAES values of CFGWC2, CFGWC, FGWC and NE over the numbers of clusters and fuzzifiers are 1266, 116, 109 and 110, respectively. Thus, PCAES of CFGWC2 is 10.8 times, 11.57 times and 11.51 times higher than those of CFGWC, FGWC and NE, respectively, and the clustering quality of CFGWC2 is the best among all.


Fig. 12. Average PCAES of algorithms in Case 3 on Colon Cancer dataset by fuzzifiers.

Fig. 13. Changes of PCAES values of CFGWC2 in Case 3 by different datasets & cases.

Fig. 14. Changes of computational time of CFGWC2 in Case 3 by datasets & cases.


Table 9. PCAES values of all algorithms in Case 4 on UNO dataset.

| C | CFGWC2 (m = 1.5) | CFGWC (m = 1.5) | FGWC (m = 1.5) | NE (m = 1.5) | CFGWC2 (m = 2.0) | CFGWC (m = 2.0) | FGWC (m = 2.0) | NE (m = 2.0) |
|---|---|---|---|---|---|---|---|---|
| 2 | 5045.66670 | 2.90326 | 106.87815 | 106.87815 | 1581.05213 | 3.00038 | 107.95304 | 107.95304 |
| 3 | 3875.38385 | 357.41475 | 102.97089 | 103.08807 | 3661.44151 | 59.75581 | 104.51216 | 104.62430 |
| 4 | 1558.83769 | 353.10979 | 101.00240 | 101.05883 | 1607.32553 | 84.23204 | 102.01262 | 102.07279 |
| 5 | 1304.13622 | 351.67087 | 98.86010 | 98.89074 | 1382.93286 | 76.83167 | 98.70007 | 98.73447 |
| 6 | 1122.92581 | 336.08991 | 104.96682 | 105.11414 | 1133.25453 | 244.14817 | 106.40734 | 118.10708 |

| C | CFGWC2 (m = 2.5) | CFGWC (m = 2.5) | FGWC (m = 2.5) | NE (m = 2.5) | CFGWC2 (m = 3.0) | CFGWC (m = 3.0) | FGWC (m = 3.0) | NE (m = 3.0) |
|---|---|---|---|---|---|---|---|---|
| 2 | 832.81547 | 14.86731 | 110.80575 | 110.80574 | 489.21252 | 2.61660 | 111.54398 | 111.54395 |
| 3 | 137.44910 | 66.55067 | 112.36477 | 112.46454 | 545.80669 | 46.76435 | 121.39451 | 121.45261 |
| 4 | 132.35684 | 65.09628 | 111.70188 | 111.77476 | 364.85714 | 52.27548 | 123.22859 | 123.30831 |
| 5 | 119.87765 | 62.66657 | 109.59177 | 109.64291 | 95.71117 | 48.18992 | 122.96870 | 123.03806 |
| 6 | 109.09223 | 57.22595 | 107.14639 | 107.09815 | 227.36512 | 47.98189 | 123.39548 | 122.64568 |

In order to investigate the impact of the new interval context on PCAES values of CFGWC2, we calculate the average PCAES values of CFGWC2 from m = 1.5 to m = 3.0, which are (2581, 1873, 266, 344), respectively. These results lead to two remarks: (i) in contrast to the remark in Case 1, PCAES values of CFGWC2 do not decrease monotonically when the fuzzifier increases (the average rises from 266 at m = 2.5 to 344 at m = 3.0); (ii) PCAES values of CFGWC2 are large when the fuzzifier is small, i.e. m ≤ 2.0, and small otherwise. The second remark is similar to that in Case 2.

Comparing the PCAES values of CFGWC2 in Table 9 with those in Table 3, we recognize that when the fuzzifier is small (m ≤ 2.0), the values in Table 9 are larger than those in Table 3. Nevertheless, the results are reversed for the remaining fuzzifiers. This reflects the large distinction between PCAES values when m ≤ 2.0 and those when m > 2.0 in this case. Thus, the remark drawn from this observation is that we should choose a fuzzifier m ≤ 2.0 with the configuration in this case in order to obtain high PCAES values of CFGWC2. The maximal difference of PCAES between CFGWC2 and the other algorithms is found at m = 2.0, when the average PCAES value of CFGWC2 is 20 times, 18 times and 17 times larger than those of CFGWC, FGWC and NE, respectively. This difference is small in comparison with those in Table 3. Indeed, using the new interval context results in a small difference of PCAES values between CFGWC2 and the other algorithms.
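The recommendation m ≤ 2.0 can be checked with a short calculation. The sketch below assumes plain arithmetic means over C = 2..6 of the CFGWC2 column of Table 9, and takes the Case 1 (Table 3) averages 1442, 1183, 656 and 456 as quoted in the Case 3 discussion; the variable names are ours.

```python
FUZZIFIERS = (1.5, 2.0, 2.5, 3.0)

# CFGWC2 PCAES values from Table 9 (Case 4, UNO), listed for C = 2..6.
case4_cfgwc2 = {
    1.5: [5045.66670, 3875.38385, 1558.83769, 1304.13622, 1122.92581],
    2.0: [1581.05213, 3661.44151, 1607.32553, 1382.93286, 1133.25453],
    2.5: [832.81547, 137.44910, 132.35684, 119.87765, 109.09223],
    3.0: [489.21252, 545.80669, 364.85714, 95.71117, 227.36512],
}

# Average CFGWC2 PCAES in Case 1 (Table 3), as quoted in the Case 3 discussion.
case1_avg = dict(zip(FUZZIFIERS, (1442, 1183, 656, 456)))

# Mean over the numbers of clusters for each fuzzifier.
case4_avg = {m: sum(values) / len(values) for m, values in case4_cfgwc2.items()}

# Fuzzifiers for which the Case 4 configuration beats Case 1 on average PCAES.
improved = [m for m in FUZZIFIERS if case4_avg[m] > case1_avg[m]]

print({m: round(v, 1) for m, v in case4_avg.items()})
print(improved)
```

The list `improved` contains only the two smallest fuzzifiers, which matches the advice to keep m ≤ 2.0 under this configuration.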

We also measure the computational times of the algorithms and report them in Table 10. This table points out that the computational time of CFGWC2 is longer than those in Table 4. The maximal and minimal computational times of CFGWC2 are 27.08 s (m = 2.5) and 4.49 s (m = 2.0), respectively. The maximal value is larger than those of the previous cases. However, the minimal one is the smallest among all. Thus, the remark above about choosing m ≤ 2.0 is re-confirmed.

Table 10. Computational time of all algorithms in Case 4 on UNO dataset (s).

| C | CFGWC2 (m = 1.5) | CFGWC (m = 1.5) | FGWC (m = 1.5) | NE (m = 1.5) | CFGWC2 (m = 2.0) | CFGWC (m = 2.0) | FGWC (m = 2.0) | NE (m = 2.0) |
|---|---|---|---|---|---|---|---|---|
| 2 | 12.16 | 0.02 | 0.03 | 0.03 | 4.49 | 0.03 | 0.04 | 0.03 |
| 3 | 14.56 | 0.06 | 0.10 | 0.10 | 14.35 | 0.09 | 0.10 | 0.19 |
| 4 | 18.82 | 0.08 | 0.12 | 0.14 | 16.48 | 0.18 | 0.23 | 0.14 |
| 5 | 23.82 | 0.17 | 0.14 | 0.15 | 22.46 | 0.22 | 0.16 | 0.12 |
| 6 | 24.53 | 0.24 | 0.20 | 0.24 | 24.86 | 0.32 | 0.35 | 0.21 |

| C | CFGWC2 (m = 2.5) | CFGWC (m = 2.5) | FGWC (m = 2.5) | NE (m = 2.5) | CFGWC2 (m = 3.0) | CFGWC (m = 3.0) | FGWC (m = 3.0) | NE (m = 3.0) |
|---|---|---|---|---|---|---|---|---|
| 2 | 5.91 | 0.05 | 0.04 | 0.04 | 5.37 | 0.04 | 0.05 | 0.04 |
| 3 | 15.27 | 0.09 | 0.20 | 0.31 | 15.15 | 0.11 | 0.09 | 0.12 |
| 4 | 18.91 | 0.38 | 0.37 | 0.15 | 18.66 | 0.18 | 0.17 | 0.16 |
| 5 | 17.02 | 0.40 | 0.33 | 0.14 | 25.06 | 0.25 | 0.34 | 0.16 |
| 6 | 27.08 | 0.24 | 0.40 | 0.45 | 24.66 | 0.39 | 0.33 | 0.28 |
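The extreme running times of CFGWC2 in Table 10, and the settings at which they occur, can be located with a small search. This is an illustrative sketch only; the dictionary layout and the names `cfgwc2_times`, `slowest` and `fastest` are ours.

```python
CLUSTERS = (2, 3, 4, 5, 6)

# CFGWC2 running times in seconds from Table 10, listed for C = 2..6.
cfgwc2_times = {
    1.5: [12.16, 14.56, 18.82, 23.82, 24.53],
    2.0: [4.49, 14.35, 16.48, 22.46, 24.86],
    2.5: [5.91, 15.27, 18.91, 17.02, 27.08],
    3.0: [5.37, 15.15, 18.66, 25.06, 24.66],
}

# Flatten to (time, fuzzifier, clusters) so max/min compare on time first.
runs = [
    (seconds, m, c)
    for m, times in cfgwc2_times.items()
    for c, seconds in zip(CLUSTERS, times)
]

slowest = max(runs)
fastest = min(runs)
print("slowest:", slowest)
print("fastest:", fastest)
```

The search also shows the number of clusters involved: the slowest run uses C = 6 and the fastest uses C = 2.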

In Fig. 15, we measure the average PCAES values of all algorithms on the Colon Cancer dataset by fuzzifiers.

The results show that the average PCAES value of CFGWC2 is larger than those of the other algorithms. For example, when m = 1.5, the PCAES values of CFGWC2, CFGWC, FGWC and NE are 55.51, 21.96, 23.08 and 21.90, respectively. However, the PCAES values of not only CFGWC2 but also the other algorithms tend to decrease when the fuzzifier increases, so the differences in PCAES between the algorithms get smaller. When the fuzzifier is large enough, the PCAES values of all algorithms are quite small. In other words, the clustering quality of every algorithm decreases as the fuzzifier increases. Thus, an important remark of this case is that we should not choose large values of the fuzzifier if we want to keep good clustering quality of CFGWC2.

In Fig. 16, we investigate the impact of parameters on PCAES values of CFGWC2. Using a narrowed interval fuzzifier together with an interval context whose values are near one, as in this case, does not improve the PCAES values of CFGWC2 significantly. From m = 1.5 to m = 3.0, the average PCAES values of the “Case 4 – Colon Cancer” bar are not always larger than those of “Case 1 – Colon Cancer”. For example, when m = 1.5, the PCAES values of these bars are 55.51 and 48.77, respectively; when m = 2.5, they are 26.71 and 26.95. We also draw another bar for “Case 3 – Colon Cancer” to show the impact of parameters clearly. Fig. 16 points out that the average PCAES values of “Case 3 – Colon Cancer” are better than those of both “Case 1 – Colon Cancer” and “Case 4 – Colon Cancer”. This means that the impact of the parameters in this case on PCAES values of CFGWC2 is weaker than that in Case 3.

The impact of datasets on PCAES values of CFGWC2 is illustrated in Fig. 17. PCAES values of CFGWC2 in “Case 4 – Colon Cancer” are much smaller than those in “Case 4 – UNO”. Thus, we obtain a remark similar to that of the previous cases. In Fig. 18, we compare the average computational time of CFGWC2 across various datasets and cases. From this figure, we recognize that CFGWC2 in “Case 4 – Colon Cancer” runs slower than in “Case 1 – Colon Cancer” and in “Case 4 – UNO”. When m < 2.7, it is also slower than in “Case 3 – Colon Cancer”. Thus, the fuzzifier should be set small if the configuration of parameters in this case is used for CFGWC2.


Summary of the findings

In this section, we sum up the main findings in Section “Evaluation by various case studies” as follows:


Fig. 15. Average PCAES of algorithms in Case 4 on Colon Cancer dataset by fuzzifiers.

Fig. 16. Impact of parameters to PCAES values of CFGWC2 in Case 4.

Fig. 17. Impact of datasets to PCAES values of CFGWC2 in Case 4.

Fig. 18. Changes of computational time of CFGWC2 in Case 4 by datasets & cases.

- The clustering quality of CFGWC2 is the best among all, even on a large dataset such as Colon Cancer.
- CFGWC2 is stable across various numbers of clusters and fuzzifiers.
- PCAES values of CFGWC2 are directly proportional to the number of clusters.
- In order to achieve the best clustering quality in CFGWC2, the parameters should be set up as follows: geographic parameters smaller than 0.5, fuzzifier m > 2.0, and the interval context and interval fuzzifier narrowed as in Case 3.
- The changes of PCAES values of CFGWC2 by fuzzifiers on a large dataset are smaller than those on a small one.
- The sizes of the input datasets should be small or medium to obtain high PCAES values of CFGWC2.
- The computational cost of CFGWC2 is acceptable.

Conclusions

In this paper, we concentrated on improving the clustering quality of FGWC, the state-of-the-art clustering algorithm for the GDA problem. A novel interval type-2 fuzzy clustering algorithm named CFGWC2, built on an extension of the traditional fuzzy sets called Interval Type-2 Fuzzy Sets, was presented. It integrates additional techniques to speed up the whole algorithm, namely the interval context variable, Particle Swarm Optimization and parallel computing. The experimental results from various case studies on two benchmark datasets showed that CFGWC2 obtained better clustering quality than the other relevant algorithms. The experiments also suggested which parameter values should be chosen for the best quality of the proposed algorithm. Further work will examine CFGWC2 for handling very large datasets, partly classified datasets and time-series datasets. Additionally, some applications of the proposed method in real-life situations will be considered.

Acknowledgements

The authors are greatly indebted to the editor-in-chief, Prof. R. Roy, the anonymous reviewers, and Ms. Hoang Thi Thu Huong, FPT, Vietnam, for their valuable comments and suggestions, which improved the quality and clarity of the paper. We kindly acknowledge Mr. Truong Chi Cuong, Ms. Hoang Thi Tuan Dung and Ms. Bui Thi Huong Lan for some calculations in this research. This work is sponsored by the VNU Project under contract No. QG.13.01.
