
  • 7/30/2019 Cluster Cat Vars

    1/17


usually they are the average values of individual variables. Then the distances of each object from all centroids are calculated again. If an object is closer to the centroid of any other cluster, it is moved to that cluster. This process is repeated as long as any object can be moved. If the centroid is created from average values of individual variables, the method is called k-means. If the centroid is created from medians, the method is called k-medians. In the first case, the Euclidean distance (see [13]) is used. However, some software systems (SYSTAT) also offer further measures (see below). In the k-medoids method, a certain vector of observations is taken for the center of the cluster.
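To make the distinction concrete, here is a minimal sketch (not from the paper) of the centroid-update step that separates k-means from k-medians; the data are hypothetical:

```python
import statistics

# Illustrative sketch: one centroid-update step, contrasting k-means
# (component-wise means) with k-medians (component-wise medians)
# for a single cluster of observations.

def centroid(cluster, how="mean"):
    """Return the centroid of a cluster of equal-length numeric vectors."""
    columns = zip(*cluster)
    if how == "mean":
        return [statistics.mean(col) for col in columns]
    if how == "median":
        return [statistics.median(col) for col in columns]
    raise ValueError("how must be 'mean' or 'median'")

cluster = [(1.0, 2.0), (3.0, 2.0), (100.0, 2.0)]
print(centroid(cluster, "mean"))    # the mean is pulled towards the outlier
print(centroid(cluster, "median"))  # the median is robust to it
```

In a full algorithm this step alternates with the reassignment of objects to their nearest centroid until no object moves.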

Methods of hierarchical cluster analysis can be agglomerative (step-by-step clustering of objects and groups into larger groups) or divisive (step-by-step splitting of the whole set of objects into smaller subsets and individual objects). Further, we can distinguish monothetic clustering (only one variable is considered in individual steps) and polythetic clustering (all variables are considered in individual steps).

    Methods of hierarchical cluster analysis (as well as some other methods) are based on

the proximity matrix. This matrix is symmetric with zeros on the diagonal. The values off the diagonal express dissimilarities for the corresponding pairs of objects, variables or categories. These dissimilarities are the values of certain coefficients, or they are derived from the values of similarity coefficients. For example, if we consider a similarity measure S, then the dissimilarity measure is obtained by subtracting this value from one, i.e. D = 1 − S.

More information and examples of the methods of cluster analysis can be found in books [2], [5], [9] and [13]. Implementation in the SAS system is described in [14].

    3. Categorical data

Categorical variables are characterized by values which are categories. Two main types of these variables can be distinguished: dichotomous, for which there are only two categories, and multi-categorical. Dichotomous variables are often coded by the values zero and one. For similarity measuring it is necessary to take into account whether the variables are symmetric or asymmetric. In the first case, both categories have the same importance (male, female). In the second case, one category is more important (the presence of a word in a textual document is more important than its absence).

Multi-categorical variables can be classified into three types: nominal, ordinal and quantitative. Unlike the other types, categories of nominal variables cannot be ordered (from the point of view of intensity, etc.). Categories of ordinal variables can be ordered, but we usually cannot do arithmetic operations with them (it depends on the relations among categories, see below). We can do arithmetic operations with quantitative variables (number of children). We can apply traditional distance measures in this case, and so this type will not be considered in the paper.

For this reason, we will further denote nominal, ordinal and dichotomous variables as categorical. These variables are also called qualitative. We will suppose that dichotomous variables are binary with categories zero and one. The same similarity measures are usually used for clustering of both objects and variables in the case of binary data.

217 (3/2009)


If binary variables are symmetric, one can apply the same measures as for quantitative data. Moreover, many specific coefficients have been proposed for this kind of data, as well as for data files with asymmetric binary variables.

If there are no special means for clustering multi-categorical data in a software package, then transformation of the data file to a file with binary data is usually needed. The distinction between nominal and ordinal types is necessary.

First, we will mention the data file with nominal variables. In comparison with classification tasks involving a target variable (regression and discriminant analyses, decision trees), the number of dummy variables must be equal to the number of categories, see Table 1. In this way it is guaranteed that one can obtain only two possible values of similarity: one for the matched categories, and the second for unmatched categories.

    Table 1 Recoding of the nominal variable School for three binary variables P1 to P3
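The content of Table 1 is not reproduced in this transcript; the following sketch (with hypothetical school categories) illustrates the recoding it describes, one dummy variable per category:

```python
def recode_nominal(values, categories):
    """One dummy per category: exactly one 1 per row, so two objects
    agree either in all dummy positions (same category) or in none
    of the positions that carry a 1 (different categories)."""
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical categories of a nominal variable School.
categories = ["basic", "secondary", "university"]
rows = recode_nominal(["secondary", "university"], categories)
print(rows)  # [[0, 1, 0], [0, 0, 1]]
```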

There are two processes for transforming ordinal data. The first one consists of transformation of a data file to a binary data file. In comparison to the case with nominal variables, k possible values of similarity should be considered, where k is the number of categories. This is guaranteed by the coding shown in Table 2.

    Table 2 Recoding of the ordinal variable Reaction for three binary variables P1 to P3

The second process makes use of the fact that values of an ordinal variable can be ordered. Under the assumption of the same distances between categories, arithmetic operations can be done. It is recommended to code the categories from 1 to k and divide these codes by the maximum value. In this way, the values will be in the interval from 0 to 1. Then we can apply the techniques designed for quantitative data.
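Both recoding processes can be sketched as follows (illustrative code, not from the paper); `recode_ordinal_binary` assumes the cumulative coding that Table 2, whose content is not reproduced in this transcript, appears to describe:

```python
def recode_ordinal_binary(rank, k):
    """Cumulative coding: category r (1..k) -> r ones followed by zeros,
    so more distant ranks disagree in more dummy positions, giving
    k possible similarity values."""
    return [1 if i < rank else 0 for i in range(k)]

def recode_ordinal_numeric(rank, k):
    """Codes 1..k divided by the maximum k: values fall in (0, 1]."""
    return rank / k

print(recode_ordinal_binary(2, 3))   # [1, 1, 0]
print(recode_ordinal_numeric(2, 3))  # 2/3
```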

    4. Object clustering

In the following text, we will consider a simpler case in which all variables are of the same type (for other cases see [13]).

    Binary variables

If objects are characterized only by binary variables, then the usual process consists of creating the proximity matrix, followed by application of hierarchical cluster analysis.


Some software systems offer special measures (the SPSS system offers 26 measures including general ones; the SYSTAT system offers 5 measures). Formulas of these measures are usually expressed by means of frequencies from the contingency table. Let the symbols from Table 3 be given.

Table 3 Two-way frequency table for objects xi and xj

In the case of symmetric variables, Sokal and Michener's simple matching coefficient is used, for example. For two objects, it is the ratio of the number of variables with the same values (0 or 1) in both objects to the total number of variables:

\[ S_{ij} = \frac{a+d}{a+b+c+d} \qquad (1) \]

The similarity between two objects characterized by asymmetric variables can be measured by Jaccard's coefficient. Its value is expressed as the ratio of the number of variables which have the value 1 for both objects to the number of variables with at least one value equal to 1:

\[ S_{ij} = \frac{a}{a+b+c} \qquad (2) \]

Further, we can apply Yule's Q, which is calculated by the formula

\[ Q = \frac{ad-bc}{ad+bc} \qquad (3) \]
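Under the notation of Table 3 (a = both values 1, b and c = mixed values, d = both values 0), coefficients (1)–(3) can be computed as in this short illustrative sketch:

```python
def binary_counts(x, y):
    """Frequencies of Table 3: a = both 1, b = x 1 and y 0,
    c = x 0 and y 1, d = both 0."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def simple_matching(x, y):          # formula (1)
    a, b, c, d = binary_counts(x, y)
    return (a + d) / (a + b + c + d)

def jaccard(x, y):                  # formula (2)
    a, b, c, _ = binary_counts(x, y)
    return a / (a + b + c)

def yules_q(x, y):                  # formula (3)
    a, b, c, d = binary_counts(x, y)
    return (a * d - b * c) / (a * d + b * c)

x = [1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0]
print(simple_matching(x, y), jaccard(x, y), yules_q(x, y))
```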

The publications [11] and [13] provide a more detailed treatment of these measures for binary variables.

However, general measures can also be applied. For example, the Euclidean distance and the coefficient of disagreement (designed for data files with nominal variables, see below) can be used. The latter is the complement of the simple matching coefficient to the value 1. Further, the gamma coefficient (designed for clustering ordinal variables, see below) is a measure suitable for this purpose. In the case of binary variable analysis, it is called Yule's Q, see formula (3).

The coefficient of disagreement is provided by the SYSTAT and STATISTICA systems; the gamma coefficient is provided by SYSTAT. In addition, a proximity matrix created by other means can serve as an input for hierarchical cluster analysis. The SYSTAT system provides the possibility to create such matrices on the basis of 13 measures applicable to binary variables.

Monothetic divisive cluster analysis can be applied to objects characterized by symmetric binary variables. It starts from one cluster, which is split into two clusters. Any variable can serve for this purpose (one group will contain ones in this variable, the second group will contain zeros). If we denote the number of variables as m, then m possibilities


exist for splitting a data file into two groups of objects. For the next splitting, m − 1 possibilities exist, etc. The criterion for splitting is based on measurement of the dependency of two variables. This method is called MONA (MONothetic Analysis) in [8] and in the S-PLUS system. In this algorithm, the measure

\[ A_{kl} = \left| a_{kl} d_{kl} - b_{kl} c_{kl} \right| \]

is used for evaluation of the dependency between the k-th and l-th variables, where a_kl, b_kl, c_kl and d_kl are frequencies in the contingency table created for these variables. For each l-th variable the value

\[ A_l = \sum_{k \neq l} \left| a_{kl} d_{kl} - b_{kl} c_{kl} \right| \]

is calculated. The objects are split according to the variable for which the maximum of these values is achieved.
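A sketch of the MONA variable-selection step described above; the association measure |a·d − b·c| is the one used in [8], and the data set is hypothetical:

```python
def association(x, y):
    """|a*d - b*c| from the 2x2 table of binary variables x and y."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return abs(a * d - b * c)

def mona_split_variable(data):
    """Pick the variable maximizing the total association with the others."""
    m = len(data[0])
    cols = [[row[l] for row in data] for l in range(m)]
    totals = [sum(association(cols[l], cols[k]) for k in range(m) if k != l)
              for l in range(m)]
    return max(range(m), key=totals.__getitem__)

# Columns 0 and 1 agree perfectly, column 2 is unrelated to both.
data = [(1, 1, 0), (1, 1, 1), (0, 0, 0), (0, 0, 1)]
print(mona_split_variable(data))  # index of the splitting variable
```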

Further, the k-means and k-medians methods with Yule's Q, see formula (3), can be applied to data files with asymmetric dichotomous variables in SYSTAT.

    Nominal variables

A typical process for data files with nominal variables is creation of the proximity matrix on the basis of the simple matching coefficient and application of hierarchical cluster analysis. The simple matching coefficient is the ratio of the number of pairs with the same values in both elements to the total number of variables (when objects are clustered). The Sokal–Michener coefficient (1) is a special case of it. For the i-th and j-th objects it can be written as

\[ S_{ij} = \frac{1}{m} \sum_{l=1}^{m} S_{ijl} \]

where m is the number of variables, S_ijl = 1 if x_il = x_jl (the values of the l-th variable are equal for the i-th and j-th objects), and S_ijl = 0 in other cases. Dissimilarity is the complement of the simple matching coefficient to the value 1, i.e. D_ij = 1 − S_ij. This coefficient of disagreement expresses the ratio of the number of pairs with distinct values to the total number of variables (it is implemented in the STATISTICA and SYSTAT systems).

Another measure of the relationship between two objects (and also between two clusters) is the log-likelihood distance measure. Its implementation in software systems is linked with two-step cluster analysis in SPSS. This method has been designed for clustering a large number of objects, and it is based on the BIRCH method, which uses the principle of trees, see [15] and [16]. The log-likelihood distance measure is determined for data files with combinations of quantitative and qualitative variables. Dissimilarity is expressed on the basis of variability, whereas entropy is applied to categorical variables. For the l-th variable in the g-th cluster, the formula for the entropy can be written as

\[ E_{gl} = - \sum_{u=1}^{K_l} \frac{n_{glu}}{n_g} \ln \frac{n_{glu}}{n_g} \]


    where Kl is the number of categories of the l-th variable, nglu represents the frequency ofthe u-th category of the l-th variable in the g-th cluster, and ng is the number of objects inthe g-th cluster. Two objects are the most similar if the cluster composed of them has thesmallest entropy.
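The entropy of one categorical variable within one cluster can be computed as in this illustrative sketch (the natural logarithm is assumed):

```python
import math

def cluster_entropy(categories):
    """Entropy of one categorical variable within one cluster:
    -sum over categories of (n_glu / n_g) * ln(n_glu / n_g)."""
    n = len(categories)
    counts = {}
    for c in categories:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log(k / n) for k in counts.values())

print(cluster_entropy(["a", "a"]))  # a pure cluster: zero entropy
print(cluster_entropy(["a", "b"]))  # the two-category maximum, ln 2
```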

Other specific methods, in addition to the techniques mentioned, exist for clustering objects characterized by nominal variables. There are both k-clustering methods and modifications of the hierarchical approaches. The k-means and k-medians methods are the basis for the former. It is assumed that each variable has values v_lu (u = 1, 2, ..., K_l). Each cluster is represented by an m-dimensional vector which contains either the categories with the highest frequencies (in the k-modes method, see [6] and [7]) or figures about the frequencies of all individual variables' categories (in the k-histograms method, see [4]). These vectors are special types of centroids. Specific dissimilarity measures are applied. In the case of the k-modes algorithm, measurement based on the simple matching coefficient is used. However, we obtain only a locally optimal solution which depends on the order of objects in a data file, as in the case of clustering by the k-means algorithm.

ROCK and CACTUS are additional special methods. The ROCK (RObust Clustering using linKs) algorithm, see [3], is based on the principle of hierarchical clustering. First, a random sample of objects is chosen. These objects are clustered to the desired number of clusters, and then the remaining objects are assigned to the created clusters. The method uses a graph concept whose main terms are neighbors and links. A neighbor of a certain object is an object whose similarity to the investigated object is equal to or greater than a predefined threshold. A link between two objects is the number of common neighbors of these objects. The principle of the ROCK method lies in maximization of a function which takes into account both maximization of the sums of links for objects from the same cluster and minimization of the sums of links for objects from different clusters.

Let us denote by S(x_i, x_j) the similarity measure between objects x_i and x_j; this measure can achieve values between 0 and 1. If we define the threshold T in such a way that T ∈ ⟨0; 1⟩, then the objects x_i and x_j are neighbors if the condition S(x_i, x_j) ≥ T is satisfied. For binary data, Jaccard's similarity coefficient, see formula (2), is used in the algorithm. The similarity in the case of data files with multi-categorical variables is investigated within the same principle. If a value is missing, the corresponding variable is omitted from the comparison.

The second means to be used is a link, i.e., the number of common neighbors of objects x_i and x_j. It will be denoted as link(x_i, x_j) in the text that follows. A greater value of the link implies a greater probability that objects x_i and x_j belong to the same cluster. The resulting clusters are determined by maximization of the function

\[ E = \sum_{h=1}^{k} n_h \sum_{x_i, x_j \in C_h} \frac{\mathrm{link}(x_i, x_j)}{n_h^{1+2f(T)}} \]

where n_h is the size of cluster C_h. Each object belonging to the h-th cluster has approximately n_h^{f(T)} neighbors in this cluster, whereas for binary data the f(T) function is determined by the formula

\[ f(T) = \frac{1-T}{1+T} \]
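The neighbor and link concepts, together with the f(T) function for binary data, can be sketched as follows (illustrative code on a hypothetical data set, not the full ROCK algorithm):

```python
def jaccard(x, y):
    """Jaccard similarity of two binary vectors, formula (2)."""
    a = sum(1 for u, v in zip(x, y) if u == v == 1)
    bc = sum(1 for u, v in zip(x, y) if u != v)
    return a / (a + bc) if a + bc else 1.0  # two all-zero vectors

def neighbors(data, threshold):
    """Adjacency by the rule S(xi, xj) >= T."""
    n = len(data)
    return [[i != j and jaccard(data[i], data[j]) >= threshold
             for j in range(n)] for i in range(n)]

def link(adj, i, j):
    """Number of common neighbors of objects i and j."""
    return sum(1 for k in range(len(adj)) if adj[i][k] and adj[j][k])

def f(t):
    """f(T) = (1 - T) / (1 + T), used in the ROCK criterion for binary data."""
    return (1 - t) / (1 + t)

data = [(1, 1, 0), (1, 1, 1), (1, 0, 1), (0, 0, 1)]
adj = neighbors(data, 0.5)
print(link(adj, 0, 2), f(0.5))  # objects 0 and 2 share one neighbor
```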


The value n_h^{1+2f(T)} is the expected number of links between pairs of objects in the h-th cluster. The merging of clusters C_h and C_h' is realized by means of the measure

\[ g(C_h, C_{h'}) = \frac{\mathrm{link}[C_h, C_{h'}]}{(n_h + n_{h'})^{1+2f(T)} - n_h^{1+2f(T)} - n_{h'}^{1+2f(T)}} \]

where link[C_h, C_{h'}] is the sum of links over pairs of objects taken from the two clusters.

The pair most suitable for clustering is the pair of clusters for which this measure attains its maximum value. In the final phase, the remaining objects are assigned to the created clusters. From each h-th cluster, a set of objects is selected according to which the remaining objects should be assigned (this set will be denoted as ℒ_h and the number of objects in this set as L_h). Each remaining object is assigned to the cluster in which it has the most neighbors from the set, after normalization. If we denote the number of neighbors in the ℒ_h set as N_h, then the object is assigned to the cluster for which the value of the expression

\[ \frac{N_h}{(L_h + 1)^{f(T)}} \]

is maximal, whereas (L_h + 1)^{f(T)} is the number of neighbors for the objects compared with the ℒ_h set.

The CACTUS algorithm (CAtegorical ClusTering Using Summaries), see [1], is based on the idea of the common occurrence of certain categories of different variables. If the difference between the number of occurrences for the categories v_kt and v_lu of the k-th and l-th variables and the expected frequency (under the assumption of uniform distribution within the certain categories of the remaining variables, and the assumption of independence) is greater than a user-defined threshold, the categories are strongly connected. The algorithm has three phases: summarization, clustering and verification. During clustering, candidates for clusters are chosen, from which the final clusters are determined in the verification phase.

    Ordinal variables

Among the specialized methods we can mention the k-medians method, in which the vectors of medians of the individual variables are used as centroids. Application of the Manhattan distance (city-block distance) is recommended, which is defined as

\[ D(x_i, x_j) = \sum_{l=1}^{m} \left| x_{il} - x_{jl} \right| \]

for vectors x_i and x_j. In the SYSTAT system, the gamma coefficient can be used. It will be described in the following section in connection with measurement of ordinal variable similarities.


    5. Variable clustering

Clustering of categorical variables is usually realized by application of hierarchical cluster analysis to a proximity matrix created on the basis of suitable similarity measures.

    Binary variables

If variables are binary and symmetric, then one can use both the simple matching coefficient, see formula (1), and Pearson's correlation coefficient, which can be expressed (with symbols from Table 3) as

\[ r = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \qquad (4) \]

For asymmetric variables, the gamma coefficient can be applied, for example. It is called Yule's Q in binary data analysis, see formula (3). Moreover, some other specific coefficients and proximity matrices created by different means can be used.

    Nominal variables

For determination of the dissimilarity of nominal variables, the coefficient of disagreement is offered in some software packages (STATISTICA, SYSTAT). It expresses the ratio of the number of pairs of different values to the total number of objects. It can be calculated from the simple matching coefficient by subtracting it from one. For the k-th and l-th variables, the simple matching coefficient can be expressed as

\[ S_{kl} = \frac{1}{n} \sum_{i=1}^{n} S_{kli} \]

where n is the number of objects and S_kli = 1 if x_ik = x_il (the values of the l-th and k-th variables are the same for the i-th object), and S_kli = 0 otherwise. The disagreement coefficient is then calculated according to the formula D_kl = 1 − S_kl.

Theoretically, there is a wider range of possibilities, because symmetric measures of dependency can be used. They do not occur in the procedures for cluster analysis, but a proximity matrix created by different means can serve as a basis for the analysis. The well-known measures are derived from Pearson's chi-square statistic, which is calculated according to the formula

\[ \chi^2 = \sum_{r=1}^{K_k} \sum_{s=1}^{K_l} \frac{(n_{rs} - M_{rs})^2}{M_{rs}} \qquad (5) \]

where K_k is the number of categories of the k-th variable, K_l is the number of categories of the l-th variable, n_rs is the frequency in the contingency table (in the r-th row and s-th column), and M_rs is the expected frequency under the assumption of independence, i.e.


\[ M_{rs} = \frac{n_{r+} \, n_{+s}}{n} \]

where n_r+ and n_+s are marginal frequencies, expressed as

\[ n_{r+} = \sum_{s=1}^{K_l} n_{rs}, \qquad n_{+s} = \sum_{r=1}^{K_k} n_{rs} \]

This statistic is the basis for the phi coefficient, which is calculated by the formula

\[ \varphi = \sqrt{\frac{\chi^2}{n}} \]

Further, we can mention Pearson's coefficient of contingency, calculated as

\[ C = \sqrt{\frac{\chi^2}{\chi^2 + n}} \]

Cramér's V is another example of this type of similarity coefficient. It is expressed as

\[ V = \sqrt{\frac{\chi^2}{n(q-1)}} \]

where q = min{K_k, K_l}. For two binary variables, the value is the same as the value of the phi coefficient.
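The chi-square statistic (5) and Cramér's V can be sketched for an arbitrary two-way frequency table (illustrative code; the table is hypothetical):

```python
import math

def chi_square(table):
    """Pearson's chi-square statistic (5) for a two-way frequency table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return sum((table[r][s] - row_tot[r] * col_tot[s] / n) ** 2
               / (row_tot[r] * col_tot[s] / n)
               for r in range(len(table)) for s in range(len(table[0])))

def cramers_v(table):
    """Cramer's V: sqrt(chi2 / (n * (q - 1))), q = min(Kk, Kl)."""
    n = sum(sum(row) for row in table)
    q = min(len(table), len(table[0]))
    return math.sqrt(chi_square(table) / (n * (q - 1)))

# A 2x2 table with perfect association: V equals 1.
table = [[10, 0], [0, 10]]
print(chi_square(table), cramers_v(table))
```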

More symmetric dependency measures can be derived from pairs of asymmetric measures. The contingency table is again the basis. We can distinguish the row variable X_k and the column variable X_l. If we investigate the dependency of a column variable X_l on a row variable X_k, two situations can occur:

{1} the columns are independent of the rows,
{2} the columns depend on the values of variable X_k.

Let us have a new object for which we know the value of variable X_k but do not know the value of variable X_l. If we suppose situation {1}, we will estimate the value of variable X_l according to the category with the greatest relative frequency, p_+Mo = max_s(p_+s), where p_+s is the column subtotal of relative frequencies (p_+s = n_+s/n). The probability of error can be expressed as P{1} = 1 − p_+Mo. When we suppose situation {2}, we will estimate the value of variable X_l according to the row maximum corresponding to the known value of variable X_k. Let us denote this maximum as p_rMo = max_s(p_rs), where p_rs is a relative frequency in the r-th row and s-th column (p_rs = n_rs/n). Then the probability of error equals P{2} = 1 − p_rMo. The proportional reduction in error can be calculated according to the scheme

\[ \mathrm{PRE} = \frac{P\{1\} - P\{2\}}{P\{1\}} \]

Goodman and Kruskal's lambda coefficient is based on this formula. The asymmetric coefficient can be written as

\[ \lambda_{l|k} = \frac{\sum_{r=1}^{K_k} \max_s n_{rs} - \max_s n_{+s}}{n - \max_s n_{+s}} \]
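A sketch of the asymmetric lambda coefficient computed directly from a frequency table (illustrative; the table is hypothetical):

```python
def lambda_asym(table):
    """Goodman-Kruskal lambda for predicting the column variable
    from the row variable of a two-way frequency table."""
    n = sum(sum(row) for row in table)
    col_tot = [sum(col) for col in zip(*table)]
    row_modes = sum(max(row) for row in table)  # sum of row maxima
    col_mode = max(col_tot)                     # modal column total
    return (row_modes - col_mode) / (n - col_mode)

# Knowing the row halves the errors made when guessing the column mode.
table = [[6, 2], [2, 6]]
print(lambda_asym(table))
```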


For symmetric coefficients, the probabilities of error for predicting each variable from the other are considered. The final formula, written in frequency form, is Goodman and Kruskal's symmetric lambda,

\[ \lambda = \frac{\sum_{r} \max_s n_{rs} + \sum_{s} \max_r n_{rs} - \max_s n_{+s} - \max_r n_{r+}}{2n - \max_s n_{+s} - \max_r n_{r+}} \]

The uncertainty coefficient investigates the dependency in more detail. It is based on the principle of analysis of variance. If variability is expressed by characteristics other than variance, then the measure of dependency of variable X_l on variable X_k can be written as the following ratio:

\[ \frac{\mathrm{var}(X_l) - \sum_{r=1}^{K_k} \frac{n_{r+}}{n} \, \mathrm{var}(X_l \mid X_k = v_{kr})}{\mathrm{var}(X_l)} \]

where var(X_l) is the variability of the dependent variable, var(X_l | X_k = v_kr) is the variability within a group, and v_kr is the r-th category of the independent (explanatory) variable X_k. Variability of a nominal variable can be measured by different means. The uncertainty coefficient is based on the entropy, which can be written as

\[ H(X_l) = - \sum_{u=1}^{K_l} p_{lu} \ln p_{lu} \]

where p_lu is the relative frequency of the u-th category of the l-th variable. The symmetric measure is calculated as the harmonic mean of both asymmetric measures. The final formula is usually written in the simplified form

\[ U = \frac{2 \left[ H(X_k) + H(X_l) - H(X_k, X_l) \right]}{H(X_k) + H(X_l)} \]
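The simplified symmetric form can be sketched as follows (illustrative code; the entropies are computed with the natural logarithm, and the tables are hypothetical):

```python
import math

def entropy(probs):
    """Entropy of a probability distribution, -sum p * ln p."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_symmetric(table):
    """Symmetric uncertainty coefficient of a two-way frequency table,
    U = 2 * (H(Xk) + H(Xl) - H(Xk, Xl)) / (H(Xk) + H(Xl))."""
    n = sum(sum(row) for row in table)
    h_row = entropy([sum(row) / n for row in table])
    h_col = entropy([sum(col) / n for col in zip(*table)])
    h_joint = entropy([cell / n for row in table for cell in row])
    return 2 * (h_row + h_col - h_joint) / (h_row + h_col)

print(uncertainty_symmetric([[10, 0], [0, 10]]))  # perfect dependency
print(uncertainty_symmetric([[5, 5], [5, 5]]))    # independence
```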

    Ordinal variables

The dependency of ordinal variables is denoted as rank correlation, and its intensity is expressed by correlation coefficients. The best known among them is Spearman's correlation coefficient. If the investigated ordinal variables express an unambiguous ranking, the following formula can be used:

\[ r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]

where d_i is the difference between the ranks of the i-th object in the two variables.


    If this assumption is not satisfied, the process described in [12] must be applied.

Further measures investigate pairs of objects. If, in a pair of objects, the values of both investigated variables are greater (or less) for one of these objects, the pair is denoted as concordant. If for one variable the value is greater and for the second one it is less, the pair is denoted as discordant. In other cases (the same values for both objects exist for at least one variable), the pairs are tied. For the sake of simplification, we will use the following symbols:

P – number of concordant pairs,
Q – number of discordant pairs,
T_k – number of pairs with the same values of variable X_k but distinct values of variable X_l,
T_l – number of pairs with the same values of variable X_l but distinct values of variable X_k.

Goodman and Kruskal's gamma is a symmetric measure. It is expressed as

\[ \gamma = \frac{P - Q}{P + Q} \]

For two binary variables, it can be written as

\[ \gamma = \frac{ad - bc}{ad + bc} \]

and it coincides with Yule's Q, see formula (3).

Another symmetric measure is Kendall's tau-b (Kendall's coefficient of rank correlation). It is expressed as

\[ \tau_b = \frac{P - Q}{\sqrt{(P + Q + T_k)(P + Q + T_l)}} \]

For two binary variables, the value of this coefficient is the same as the value of Pearson's correlation coefficient, see formula (4).

Another correlation coefficient is the tau-c coefficient, which is denoted either Kendall's tau-c (SPSS) or Stuart's tau-c (SYSTAT, SAS). The formula is the following:

\[ \tau_c = \frac{2q(P - Q)}{n^2 (q - 1)} \]

where q = min{K_k, K_l}.

Further, Somers' d is used. Both symmetric and asymmetric types of this measure exist. The asymmetric one (with X_l dependent) is expressed as

\[ d_{l|k} = \frac{P - Q}{P + Q + T_l} \]

The symmetric measure is calculated as the harmonic mean of both asymmetric measures, i.e., the final formula is

\[ d = \frac{2(P - Q)}{(P + Q + T_k) + (P + Q + T_l)} \]
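All these pair-based measures rest on the counts P, Q, T_k and T_l; a sketch of the counting step together with gamma and tau-b (illustrative data):

```python
from itertools import combinations

def pair_counts(xs, ys):
    """Counts P (concordant), Q (discordant), Tk, Tl over all object pairs.
    Pairs tied on both variables are counted nowhere."""
    P = Q = Tk = Tl = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2 and y1 != y2:
            Tk += 1
        elif y1 == y2 and x1 != x2:
            Tl += 1
        elif (x1 - x2) * (y1 - y2) > 0:
            P += 1
        elif (x1 - x2) * (y1 - y2) < 0:
            Q += 1
    return P, Q, Tk, Tl

def gamma(xs, ys):
    P, Q, _, _ = pair_counts(xs, ys)
    return (P - Q) / (P + Q)

def tau_b(xs, ys):
    P, Q, Tk, Tl = pair_counts(xs, ys)
    return (P - Q) / ((P + Q + Tk) * (P + Q + Tl)) ** 0.5

xs = [1, 2, 3, 3]
ys = [1, 2, 3, 2]
print(gamma(xs, ys), tau_b(xs, ys))  # ties lower tau-b below gamma
```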

    Features of measures mentioned in this chapter are described in [10] and [12].


As concerns the possibilities of software packages in the area of creating dependency matrices that can be used as input matrices for cluster analysis, the SPSS system offers Pearson's and Spearman's coefficients and Kendall's tau-b. The offer of the SYSTAT system is larger: it includes the phi coefficient, Goodman and Kruskal's lambda, the uncertainty coefficient, Pearson's and Spearman's correlation coefficients, Kendall's tau-b, Stuart's tau-c, and Goodman and Kruskal's gamma.

    6. Category clustering

In the case of clustering categories of a nominal variable, hierarchical cluster analysis is usually applied to the proximity matrix, which is created on the basis of a suitable measure. The contingency table for two categorical variables is an input for the corresponding procedure in a software package. These processes can be applied in SPSS and SYSTAT. Moreover, in the SYSTAT system the suitable similarity measures can also be used in the k-means and k-medians methods.

The relationships between categories can be measured by means of special coefficients based on the chi-square statistic, see formula (5). For the determination of dissimilarity between categories v_ki and v_kj of the k-th (row) variable, we consider the contingency table of dimension 2 × K_l, where K_l is the number of categories of the l-th (column) variable. We can use the chi-square dissimilarity measure, which is written as

\[ D_{ij} = \sqrt{\chi^2} \]

where χ² is Pearson's statistic (5) calculated from this 2 × K_l table. Further, the phi coefficient can be used. It is calculated according to the formula

\[ D_{ij} = \sqrt{\frac{\chi^2}{n}} \]

In both cases, the coefficients measure dissimilarity, and D_rr = 0.

    7. Examples of applications

In this chapter, two examples will be presented: variable clustering and category clustering. The data file is from the research "Male and female with university diploma", No. 0136, Institute of Sociology of the Academy of Sciences of the Czech Republic. The author of this research is the "Gender in sociology" team; the data collection was performed by Sofres-Factum (Praha, 1998).


Example 1 – variable clustering

For this purpose, 13 variables expressing satisfaction with the respondent's job from different points of view were analyzed. Respondents evaluated their satisfaction on a scale from 1 (very satisfied) to 4 (very dissatisfied). A similarity matrix based on Kendall's tau-b was created in the SPSS system. This matrix was transformed to a dissimilarity matrix by subtracting the values from 1 in Microsoft Excel. The transformed matrix was analyzed by the complete linkage method of hierarchical cluster analysis (the distance between two clusters is determined by the greatest distance between two objects from these clusters) in the STATISTICA system (for the reason of better quality of graphs). The resulting dendrogram is shown in Figure 1.

If we make a cut at the distance 0.6 in the dendrogram, we obtain 6 clusters. The first cluster represents satisfaction with salary, remuneration and evaluation of work performance. The further clusters represent the following groups of variables: satisfaction with perspective with the company and the possibility of promotion; satisfaction with relationships in the company and relationships between males and females; satisfaction with the scope of employment and the use of the respondent's degree of education; satisfaction with the management of the company, the respondent's supervisor and the possibility to express one's own opinion; and satisfaction with the working burden.
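The pipeline of Example 1 (dissimilarities 1 − tau-b, then complete linkage) can be sketched in miniature; the following naive agglomeration and the dissimilarity values are purely illustrative, not the paper's actual 13-variable data, and in practice one would use SPSS/STATISTICA or a library routine:

```python
def complete_linkage(D, labels, k):
    """Naive agglomerative clustering with complete linkage on a
    dissimilarity matrix D, merging until k clusters remain."""
    clusters = [[i] for i in range(len(labels))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete linkage: greatest pairwise dissimilarity
                dist = max(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return [[labels[i] for i in c] for c in clusters]

# Hypothetical 1 - tau-b dissimilarities among four satisfaction items.
labels = ["salary", "bonus", "promotion", "workload"]
D = [[0.0, 0.2, 0.7, 0.9],
     [0.2, 0.0, 0.6, 0.8],
     [0.7, 0.6, 0.0, 0.5],
     [0.9, 0.8, 0.5, 0.0]]
print(complete_linkage(D, labels, 2))
```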

    Figure 1 Dendrogram of relationships among variables


Example 2 – category clustering

In this case, the categories of the variable expressing specialization in university studies were clustered on the basis of the categories of the variable containing information about the university diploma obtained: magister (Mgr.), engineer (Ing.), doctor (Dr. – RNDr., MUDr., JUDr., etc.). The respondents with a bachelor diploma (Bc.) were omitted from the analysis. The contingency table for these two variables is in Table 4.

    Table 4 Contingency table for variables Diploma and Specialization

Table 5 displays the proximity matrix based on the chi-square dissimilarity measure. It was created using the SPSS system.

    Table 5 Proximity matrix for categories of variable Specialization

The proximity matrix was analyzed by the complete linkage method of hierarchical cluster analysis in the STATISTICA system. The resulting dendrogram is shown in Figure 2.


    Figure 2 Dendrogram of relationships among categories

This example is only illustrative. It is well known that graduates from certain faculties and universities have a specific diploma. Graduates from faculties specializing in natural and social sciences (including law) usually obtain the Mgr. diploma first, but some of them continue their studies for a doctoral diploma (the RNDr. diploma for natural sciences and the JUDr. diploma for law). Physicians have the MUDr. diploma. Graduates from faculties specializing in pedagogy and art usually have the Mgr. diploma. Graduates from universities specializing in technical sciences, economics and agricultural sciences usually have the Ing. diploma. We obtain these three clusters if we make a cut at the distance 10 in the dendrogram.

    8. Further directions of development

Although many approaches and methods for clustering categorical data have been proposed in the literature, the capabilities of statistical software packages are limited. One expected direction of development is the implementation of more algorithms in software products. Besides clustering, the programs should offer other processing and analyses: missing value imputation, choice of variables for object clustering and dimensionality reduction, identification of outliers, and determination of the optimal number of clusters.

    Researchers are presently focusing on two areas: clustering of large data files, and on-line clustering when some additional objects arise during analysis (web pages). Another


area which should be addressed is the clustering of data files with different types of variables. In commercial software packages, only two-step cluster analysis in the SPSS system makes such clustering possible.

    References

[1] Ganti, V., Gehrke, J., Ramakrishnan, R. CACTUS – Clustering categorical data using summaries. Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego: ACM Press, 1999, 73–83.

[2] Gordon, A. D. Classification. 2nd ed. Boca Raton: Chapman & Hall/CRC, 1999.

[3] Guha, S., Rastogi, R., Shim, K. ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25 (5), 2000, 345–366.

[4] He, X., Ding, C. H. Q., Zha, H., Simon, H. D. Automatic topic identification using webpage clustering. Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM '01), 2001, 195–203.

[5] Hebák, P., Hustopecký, J., Pecáková, I., Plašil, M., Průša, M., Řezanková, H., Svobodová, A., Vlach, P. Vícerozměrné statistické metody (3) [Multivariate statistical methods (3)]. 2nd ed. Praha: Informatorium, 2007.

[6] Huang, Z. A fast clustering algorithm to cluster very large categorical data sets in data mining. Proc. of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, University of British Columbia, 1997, 1–8.

[7] Huang, Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2, 1998, 283–304.

[8] Kaufman, L., Rousseeuw, P. Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken: Wiley, 2005.

[9] Mirkin, B. Clustering for Data Mining: A Data Recovery Approach. Boca Raton: Chapman & Hall/CRC, 2005.

[10] Pecáková, I. Statistika v terénních průzkumech [Statistics in field surveys]. Praha: Professional Publishing, 2008.

[11] Řezanková, H. Measurement of binary variables similarities. Acta Oeconomica Pragensia, 9 (3), 2001, 129–136.

[12] Řezanková, H. Analýza dat z dotazníkových šetření [Analysis of questionnaire survey data]. Praha: Professional Publishing, 2007.

[13] Řezanková, H., Húsek, D., Snášel, V. Shluková analýza dat [Cluster analysis of data]. 2nd ed. Praha: Professional Publishing, 2009.

[14] Stankovičová, I., Vojtková, M. Viacrozmerné štatistické metódy s aplikáciami [Multivariate statistical methods with applications]. Bratislava: Iura Edition, 2007.

[15] Zhang, T., Ramakrishnan, R., Livny, M. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD Record, 25 (2), 1996, 103–114.

[16] Žambochová, M. Algoritmus BIRCH a jeho varianty pro shlukování velkých souborů dat [The BIRCH algorithm and its variants for clustering large data files]. Mezinárodní statisticko-ekonomické dny [CD-ROM]. Praha: VŠE, 2008.

Hana Řezanková, Fakulta informatiky a statistiky, VŠE v Praze, nám. W. Churchilla 4, 130 67 Praha 3 – Žižkov, e-mail: [email protected]


    Abstract

This paper deals with specific techniques proposed for cluster analysis when a data file includes categorical variables. Nominal, ordinal and dichotomous variables are considered as categorical. Three types of clustering are described: object clustering, variable clustering and category clustering. Both specific coefficients for measurement of similarity and specific methods are mentioned. Two illustrative examples are included in the paper. One of them shows variable clustering (the variables express satisfaction concerning the respondent's job from different points of view) and the second one concerns category clustering (specializations of respondents are clustered according to the type of university diploma); a combination of the SPSS and STATISTICA software systems is applied in both examples.

Key words: cluster analysis, categorical data analysis, similarity measures, dissimilarity measures.
