
Projection Pursuit Clustering for Exploratory Data Analysis

R. J. BOLTON and W. J. KRZANOWSKI

Exploratory projection pursuit is a set of data analytic techniques for finding "interesting" low-dimensional projections of multivariate data. One particularly interesting structure is that of clusters in the data. Projection pursuit clustering is a synthesis of projection pursuit and nonhierarchical clustering methods that simultaneously attempts to cluster the data and to find a low-dimensional representation of this cluster structure. We introduce a projection pursuit clustering index based on orthogonal canonical variates that takes account of scale in the data, and compare our index with one previously suggested. We show that the two indexes are identical when the data are sphered, and discuss when such sphering should be done. We also propose diagnostics for finding the optimum number of groups in projection pursuit clustering, and extend the technique to search for clusters in higher dimensions. All the methodology is illustrated and evaluated on one simulated and two real datasets.

Key Words: Dimension reduction; Multivariate data analysis; Segmentation; Sphering; Visualization.

1. INTRODUCTION

Suppose we have a dataset with N observations on P continuous variables that we wish to automatically cluster into k groups, where k is predetermined. We are particularly interested in recovering any cluster structure that may be present in subspaces of the data, especially in two or three dimensions for ease of visualization.

The standard approach taken to uncover clusters is to perform a cluster analysis, a set of techniques that use internal measures of distance or similarity in the data to partition the observations into k groups. A good overview of cluster analysis can be found in Everitt (1993); here we merely highlight the four main types of such analysis.

R. J. Bolton is Research Associate, Department of Mathematics, South Kensington Campus, Imperial College London, London, SW7 2AZ, United Kingdom (E-mail: [email protected]). W. J. Krzanowski is Professor, School of Mathematical Sciences, Laver Building, North Park Road, Exeter, EX4 4QE, United Kingdom (E-mail: [email protected]).

©2003 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America

Journal of Computational and Graphical Statistics, Volume 12, Number 1, Pages 121-142 DOI: 10.1198/1061860031374


Hierarchical techniques (see, e.g., Gordon 1987) are the most common clustering methods. Observations are either put initially into one group, which is then split successively to form increasing numbers of groups (divisive), or they are treated initially as individual groups which are then combined successively into decreasing numbers of groups (agglomerative). Nonhierarchical clustering describes partition methods in which the number of clusters is predefined. Cluster centers are updated and observations are allocated to clusters by optimizing some criterion of separation. See Krzanowski and Marriott (1995, secs. 10.14 and 10.15). A third approach is to fit mixtures of distributions, generally multivariate normal, to the data (McLachlan and Basford 1988). Finally, nonparametric clustering methods avoid distributional assumptions but can be very computer intensive. Examples include mode analysis (Wishart 1969) and nearest-neighbor clustering (Wong and Lane 1983).

Most clustering techniques use information from all variables in the dataset, defining similarities in the full dimension of the data space. This is often not an optimal cluster recovery approach, especially when the cluster structure is contained in a subspace of the data. The remaining data space can include noisy data that hides the cluster structure when all variables are analyzed. Methods for weighting variables for cluster analysis have been proposed and assessed (Gnanadesikan, Kettenring, and Tsao 1995), but produce mixed results.

Our aim in this article is to recover clusters in lower dimensional subspaces of the data by simultaneously performing dimension reduction and clustering. We are especially (but not exclusively) interested in subspaces of dimension two and three so that results can be plotted and any discovered clusters can be visualized. Such simultaneous clustering and dimension reduction can be achieved with projection pursuit clustering (PPC), a methodology that iteratively finds both an optimal clustering for a subspace of given dimension and an optimal subspace for this clustering. We implement a PPC index designed for multidimensional scaling by Bock (1986) and introduce a new PPC index based on canonical variates. We examine their performance when applied to some real datasets and discuss possible alterations to the indexes and transformations to the data to improve cluster recovery. In particular we discuss the practice of sphering the data to make these indexes affine invariant and to remove correlation, a consequence of which is that the PPC indexes become identical. We also investigate some PPC diagnostics for choosing the correct number of groups and dimensions for good cluster recovery. Overall, the methods provide a useful addition to the existing battery of techniques for graphical display and analysis.

2. PROJECTION PURSUIT CLUSTERING

2.1 BACKGROUND

Exploratory projection pursuit (PP) covers data analytic techniques for finding "interesting" low-dimensional projections of multivariate data. These techniques require us to define a projection index, I, that represents our notion of what is interesting; this index should be large for "interesting" projections and small for "uninteresting" projections. We then aim to maximize this index over all projections of the data to obtain the most interesting view. For instance, we can derive the first principal component from a unidimensional projection index

$$I_{PCA}(a) = a(X - 1_N\bar{X})'(X - 1_N\bar{X})a'; \qquad aa' = 1, \tag{2.1}$$

where $a$ is the $1 \times P$ projection vector, $X$ is the $N \times P$ data matrix with N measurements on P variables, $\bar{X}$ is the sample mean vector of $X$, and $1_N$ is a vector of ones of length N, the number of observations in the data matrix. The maximization of this index results in a simple eigenanalysis of $(X - 1_N\bar{X})'(X - 1_N\bar{X})$; the value of $a$ that maximizes $I_{PCA}(a)$ is the first principal component.
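As a concrete illustration of (2.1), here is a minimal numpy sketch (our own function names, not code from the paper) that evaluates $I_{PCA}$ for a projection vector and checks that the first principal component attains the maximum value.

```python
import numpy as np

def i_pca(a, X):
    """Evaluate I_PCA(a) = a (X - 1_N xbar)'(X - 1_N xbar) a' for a 1 x P vector a."""
    Xc = X - X.mean(axis=0)        # centre the N x P data matrix
    a = a / np.linalg.norm(a)      # enforce the constraint aa' = 1
    return float(a @ Xc.T @ Xc @ a)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

# The maximiser of I_PCA is the leading eigenvector of (X - 1_N xbar)'(X - 1_N xbar),
# i.e. the first principal component; np.linalg.eigh returns eigenvalues in
# ascending order, so the last column is the one we want.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
a1 = eigvecs[:, -1]
print(i_pca(a1, X), eigvals[-1])   # the two values agree up to rounding
```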

This projection index is simple to maximize and has an algebraic solution; however, this is the exception rather than the rule. Most projection indexes require an algorithm that will calculate I at values of a and maximize I according to some numerical optimization routine. Although this appears inefficient, one of the attractions of these projection pursuit algorithms is their capacity for local maximization of projection indexes. Local maxima represent interesting views of the data and it is natural to suppose that there will be several of these. See, for example, Huber (1985), Posse (1995), and Bolton and Krzanowski (1999) for more details. Some research has already associated projection pursuit with clustering; for example, Eslava and Marriott (1994) designed PP indexes specifically for uncovering low-dimensional cluster structure and Kwon (1999) discussed the use of the Holes PP index to obtain a projection of the data before applying cluster analysis.

Bock (1986), and later Heiser and Groenen (1997), looked at the connection between multidimensional scaling (MDS) and cluster analysis, their stress functions accounting for the presence of clusters in the data. We can characterize MDS in terms of projection pursuit if we identify the Stress function with the projection index and constrain the multidimensional configuration to orthogonal projections of the data. However, we perform projection pursuit on the data matrix with N measurements on P variables, as opposed to the N x N dissimilarity matrix in MDS. The matrix we analyze is thus different in the two techniques, but if squared Euclidean distances constitute the interobject (dis)similarities then MDS is equivalent to PCA (Borg and Groenen 1997).

Bock (1987) extended his ideas to develop a clustering method in MDS based on finding a low-dimensional configuration (projection) that minimizes the within-groups sum-of-squares. He also suggested an efficient nonhierarchical clustering algorithm to optimize his projection pursuit clustering. The clustering mechanism is based on a partition of the data into k clusters, where k is fixed. Observations are reallocated to the k clusters in order to minimize the within-group sum-of-squares in the lower-dimensional projection. The index Bock optimized was a function both of the projection and of the clustering, so he was able to design a dual optimization algorithm to make his method more efficient. The algorithm proceeds as follows: first define the projection pursuit clustering index, $I(A, C)$, as a function of the d-dimensional projection, $A$, and the clustering, $C$. The projection matrix is formed as a partition of mutually orthogonal one-dimensional projections, $a_i$, such that $A = (a_1' : \cdots : a_d')'$ and $AA' = I_d$, while the clustering of objects into k classes at step s of the procedure is denoted by $C_{k,s}$.

To initialize the process, let s = 0 and take an initial clustering of k groups, $C_{k,0}$. The iterative scheme then contains three stages:

1. Optimize $I(A, C = C_{k,s})$ over all possible projections A. Call this optimal projection $A_s$.

2. Optimize $I(A = A_s, C)$ over all possible clusterings C with k groups. That is, project into the subspace defined by $A_s$ and cluster within this projected subspace. Call this optimal clustering $C_{k,s+1}$.

3. Let s = s + 1. Iterate until there is no improvement in $I(A, C)$.

The key to the efficiency of the algorithm is in step two; calculations in the projected subspace involve fewer dimensions and thus fewer computations than in the whole space. Bock showed that the algorithm converges to local optima, which is what we require of a good projection pursuit algorithm.
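A schematic sketch of this dual optimization (our own code, not the authors'): the projection step is passed in as a function, and the clustering step uses a plain Lloyd-style k-means reallocation as a stand-in for the Hartigan-Wong algorithm cited in the paper. It assumes the initial labelling is a numpy integer array using groups 0, ..., k-1 and that no cluster empties out during the iterations.

```python
import numpy as np

def lloyd_kmeans(Y, labels, n_iter=100):
    """Recluster the projected data Y (N x d) by simple Lloyd iterations,
    started from an existing labelling (a stand-in for Hartigan-Wong)."""
    k = labels.max() + 1
    for _ in range(n_iter):
        centres = np.vstack([Y[labels == j].mean(axis=0) for j in range(k)])
        new = np.argmin(((Y[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

def ppc(X, labels0, projection_step, index, n_iter=50, tol=1e-10):
    """Alternate between the optimal projection for the current clustering and
    the optimal clustering within that projection, until I(A, C) stops
    improving (a simplified sketch of the dual optimization)."""
    labels, best = labels0.copy(), -np.inf
    A = None
    for _ in range(n_iter):
        A = projection_step(X, labels)          # step 1: d x P matrix with AA' = I_d
        labels = lloyd_kmeans(X @ A.T, labels)  # step 2: cluster in the subspace
        value = index(A, X, labels)
        if value <= best + tol:
            break
        best = value
    return A, labels, best
```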

Bock's approach involved maximization of the between-groups sum-of-squares in the projection step of the algorithm and minimization of the within-groups sum-of-squares by a standard k-means clustering algorithm (Hartigan and Wong 1979) in the clustering step. Since

$$T = B_C + W_C, \tag{2.2}$$

where $T$ is the total sums-of-squares-and-products matrix and $B_C$ and $W_C$ are the between- and within-groups sums-of-squares-and-products matrices (all $P \times P$) for a clustering C, we can see that this procedure is equivalent to maximizing the index

$$I_{dD\,Bock}(A, C) = \mathrm{Trace}(A B_C A') \tag{2.3}$$

such that $AA' = I_d$. The number of groups remains fixed throughout the PPC algorithm so that the PPC index can reach a local optimum; we can always inflate the value of the PPC index by partitioning a cluster into two subclusters. However, the Bock PPC index tends to find groups in the direction of largest variance in the data, even when the clustering structure is better defined in other directions. The Bock PPC index is therefore not necessarily the best index for PPC, and we now describe an alternative PPC index based on canonical variates.
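For a fixed clustering C, the projection step under the Bock index has a closed form: $\mathrm{Trace}(A B_C A')$ with $AA' = I_d$ is maximized by the d leading eigenvectors of $B_C$. A hedged numpy sketch (function names are ours) that also shows how T, $B_C$, and $W_C$ are formed:

```python
import numpy as np

def scatter_matrices(X, labels):
    """Total, between- and within-group sums-of-squares-and-products matrices
    (Equation 2.2: T = B_C + W_C)."""
    Xc = X - X.mean(axis=0)
    T = Xc.T @ Xc
    W = np.zeros_like(T)
    for j in np.unique(labels):
        G = X[labels == j] - X[labels == j].mean(axis=0)
        W += G.T @ G
    return T, T - W, W

def bock_projection(X, labels, d=2):
    """Step 1 for I_dD_Bock: the d eigenvectors of B_C with the largest
    eigenvalues, stacked as the rows of A (so that AA' = I_d)."""
    _, B, _ = scatter_matrices(X, labels)
    eigvals, eigvecs = np.linalg.eigh(B)     # ascending eigenvalues
    return eigvecs[:, ::-1][:, :d].T

def bock_index(A, X, labels):
    """Equation (2.3): Trace(A B_C A')."""
    _, B, _ = scatter_matrices(X, labels)
    return float(np.trace(A @ B @ A.T))
```

With these definitions, `ppc(X, labels0, bock_projection, bock_index)` runs the Bock variant of the dual optimization sketched above.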

2.2 A NEW PPC INDEX

Discriminant analysis provides us with mechanisms for separating predefined groups in low-dimensional space that we can apply as the projection step in PPC; one such discriminant method is canonical variate analysis. A canonical variate is the linear combination of the variables that maximizes the ratio of the between-group sums-of-squares to the within-group sums-of-squares. Using the above notation, we do this by maximizing successively $a_i B_C a_i' / a_i W_C a_i'$ with respect to the coefficients $a_i$, the complete set of canonical variates being the eigenvectors of $W_C^{-1} B_C$ arranged in descending order of their corresponding eigenvalues. This is equivalent to Fisher's linear discriminant function when there are just two groups (see Mardia, Kent, and Bibby 1979 for details).

In projection pursuit the projections are constrained to be unit length and mutually orthogonal. Orthogonal canonical variates (Krzanowski 1995), therefore, appear to be ideal candidates for PPC. In the context of PPC the orthogonal canonical variate (OCV) PPC index can be written as

$$I_{OCV}(A, C) = \sum_{i=1}^{d} \frac{a_i B_C a_i'}{a_i W_C a_i'}; \qquad a_i a_i' = 1, \quad a_i a_j' = 0 \ (i \neq j), \tag{2.4}$$

which incorporates canonical variates but imposes an orthogonality constraint. This PPC index involves only ratios of sums-of-squares and thus, unlike Bock's PPC index, scaling effects are taken into account. We can replace $a_i B_C a_i'$ in the numerator with $a_i T a_i'$ since $a_i T a_i' = a_i B_C a_i' + a_i W_C a_i'$, and thus the $a_i$ that maximizes $a_i T a_i' / a_i W_C a_i'$ also maximizes $a_i B_C a_i' / a_i W_C a_i'$. We can obtain the projection in stage one of the PPC algorithm through successive eigendecompositions of $W_C^{-1} T$ subject to orthogonality constraints (see Krzanowski 1995 for details). We transfer observations between groups to optimize the PPC index in the second stage of the iteration using a transfer algorithm (Banfield and Bassill 1977).
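One way to carry out these successive eigendecompositions (a sketch following the construction in Krzanowski 1995, not the authors' code; it assumes $W_C$ is positive definite) is to solve a generalized eigenproblem in an orthonormal basis of the complement of the vectors already found:

```python
import numpy as np
from scipy.linalg import eigh, null_space

def scatter(X, labels):
    """Total (T) and within-group (W_C) sums-of-squares-and-products matrices."""
    Xc = X - X.mean(axis=0)
    W = np.zeros((X.shape[1], X.shape[1]))
    for j in np.unique(labels):
        G = X[labels == j] - X[labels == j].mean(axis=0)
        W += G.T @ G
    return Xc.T @ Xc, W

def orthogonal_canonical_variates(X, labels, d=2):
    """Successively maximise a T a' / a W_C a' subject to unit length and
    orthogonality to the vectors already found."""
    T, W = scatter(X, labels)
    P = X.shape[1]
    A = np.zeros((0, P))
    for _ in range(d):
        Q = np.eye(P) if A.shape[0] == 0 else null_space(A)  # basis of the complement
        # Generalised symmetric eigenproblem (Q'TQ) b = lambda (Q'W_C Q) b;
        # scipy's eigh returns ascending eigenvalues, so take the last column.
        _, vecs = eigh(Q.T @ T @ Q, Q.T @ W @ Q)
        a = Q @ vecs[:, -1]
        A = np.vstack([A, a / np.linalg.norm(a)])
    return A                                   # d x P with mutually orthogonal rows

def i_ocv(A, X, labels):
    """Sum-of-ratios index (2.4), written with T in the numerator."""
    T, W = scatter(X, labels)
    return float(sum((a @ T @ a) / (a @ W @ a) for a in A))
```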

A disadvantage of the orthogonal canonical variates PPC index becomes evident when the groups are not well defined. In this case the maximum of the index tends to occur with all groups separated on the first projection and no separation on subsequent projections; the index becomes one-dimensional in nature.

We require a PPC index that will capture the cluster structure more efficiently, leading us to refine the d-dimensional OCV PPC index into

$$I_{dD\,OCV}(A, C) = \frac{\sum_{i=1}^{d} a_i T a_i'}{\sum_{i=1}^{d} a_i W_C a_i'}; \qquad a_i a_i' = 1, \quad a_i a_j' = 0 \ (i \neq j), \tag{2.5}$$

with notation as before. The optimal projection must now be found by projection pursuit, as there is no simple eigendecomposition except for the first dD OCV, $a_1$, which is obtained by normalizing the first canonical variate. The second dD OCV is found by fixing $a_1$ and then maximizing $I_{2D\,OCV}$ over all $a_2$ orthogonal to $a_1$; subsequent dD OCVs are found successively by fixing $a_1, \ldots, a_{d-1}$ and maximizing $I_{dD\,OCV}$ over all $a_d$ subject to the mutual orthogonality of the a's. The optimal clustering can be obtained in the d-dimensional subspace by using a standard k-means clustering algorithm (Hartigan and Wong 1979).

The $I_{dD\,OCV}$ PPC index is subtly different from the $I_{OCV}$ index in that it is the ratio of sums, rather than the sum of ratios; this prevents the ballooning of $I_{dD\,OCV}$ on the first projection that can happen with $I_{OCV}$. To see this, consider the two-dimensional case, which represents the dimension of our visual space; this can be extended trivially to more dimensions. We write the two PPC indexes explicitly in terms of the two projection vectors $a_1$ and $a_2$:

$$I_{OCV}((a_1 : a_2)', C) = \frac{a_1 T a_1'}{a_1 W_C a_1'} + \frac{a_2 T a_2'}{a_2 W_C a_2'} \tag{2.6}$$


$$I_{2D\,OCV}((a_1 : a_2)', C) = \frac{a_1 T a_1' + a_2 T a_2'}{a_1 W_C a_1' + a_2 W_C a_2'} \tag{2.7}$$

with notation as before. These two expansions highlight the different behavior of the indexes. A large value of $a_2 W_C a_2'$ has little effect on the value of the $I_{OCV}$ index if $a_1 W_C a_1'$ is very small. That is, if we partition into multiple groups on the first dimension, then we need have little or no separation on the second dimension and the index becomes essentially one-dimensional. However, a large value of $a_2 W_C a_2'$ will reduce the value of $I_{2D\,OCV}$. Both axes have an equal effect on the value of the index and the index is structurally two-dimensional. The advantage in using this index over the Bock index is that it is scale invariant.
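Because (2.7) has no eigen-solution beyond the first direction, $a_2$ must be found numerically. A minimal sketch of that search (our own code; the optimizer here is scipy's Nelder-Mead, chosen for simplicity rather than taken from the paper), with $a_1$ fixed at the normalized first canonical variate and T, $W_C$ precomputed as in the sketches above:

```python
import numpy as np
from scipy.linalg import null_space
from scipy.optimize import minimize

def i_2d_ocv(a1, a2, T, W):
    """Equation (2.7): ratio of sums for a two-dimensional projection."""
    return (a1 @ T @ a1 + a2 @ T @ a2) / (a1 @ W @ a1 + a2 @ W @ a2)

def second_direction(a1, T, W):
    """Maximise I_2D_OCV over unit vectors a2 orthogonal to a1 by optimising
    the coordinates of a2 in an orthonormal basis Q of a1's complement."""
    Q = null_space(a1[None, :])                  # P x (P-1) with orthonormal columns
    def neg_index(b):
        a2 = Q @ (b / np.linalg.norm(b))         # unit length, orthogonal to a1
        return -i_2d_ocv(a1, a2, T, W)
    res = minimize(neg_index, np.ones(Q.shape[1]), method="Nelder-Mead")
    return Q @ (res.x / np.linalg.norm(res.x)), -res.fun
```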

The $I_{dD\,OCV}$ PPC index works well until we encounter highly correlated datasets; then even this index is affected by highly correlated directions. A combination of these highly correlated directions will be effectively collinear and thus reduce the within-groups sum-of-squares artificially.

In the following examples we apply each PPC index in two dimensions (d = 2) for visualization purposes. Both datasets contain a priori groups, so we set the algorithm to look for the number of groups described in the original classification and set the starting seed as the original classification. This may appear to be leading the witness somewhat, but this method is in lieu of an ideal situation in which we would employ a visual grand tour to assist us in choosing our initial clustering.

3. EXAMPLES

We look at three example datasets. The first set, CRAB, consists of five (highly correlated) measurements on each of 200 Australian crabs (Campbell and Mahon 1974). Crabs are either orange or blue, male or female, and there are equal numbers of each combination, giving 4 groups of size 50. There is good separation between the colors but within each color the male/female separation is not so well defined since "for smaller crabs the sexes are physiologically less distinguishable" (Swayne, Cook, and Buja 2001); see Posse (1990) for a previous analysis of this dataset.

The second real dataset is the forensic GLASS dataset consisting of 214 observations (glass fragments) with measurements of the proportions of eight elements and of the refractive index, making a total of nine variables. The 214 fragments are classified into 6 groups of varying sizes with groups labeled 1-3 and 5-7; group 4 is omitted from the dataset, which is freely available from the StatLib Web site ([email protected]). The GLASS dataset is regarded as difficult to cluster and was analyzed by Ripley (1996), who suggested that a good classifier will produce a misclassification rate of about 24%. This is a higher dimensional dataset than the CRAB dataset, but contains fewer well-defined groups that are additionally of unequal size.

The last dataset is an artificial creation of the authors and is made up of 400 observations on 8 variables. The data arise from a mixture of eight multivariate normal distributions with mean vectors at the corners of a rectangular block in three dimensions, combined with five dimensions of Gaussian noise. The mean vector for each group and the common covariance matrix can be found in Appendix A. We expect poor group recovery in two dimensions, but will also investigate this dataset in higher dimensions in a later section.

3.1 APPLICATION OF THE PPC INDEXES TO THE EXAMPLES

We applied each PPC index to each of the three example datasets, setting the dimensionality equal to two for visualization and the number of groups equal to those in the original classification. For each application of the PPC index we display a plot of the resulting projection with points labeled according to the clustering found by the PPC algorithm. We show the PPC local optima for the CRAB data in Figures 1-2, for the GLASS data in Figures 3-4, and for the simulated data in Figures 5-6.

For the CRAB data, the clustering corresponding to the optimization of $I_{2D\,Bock}$ is in the direction of largest variance and appears to be simply a partition in the direction of the first projection (Figure 1). The clustering obtained from PPC with $I_{2D\,OCV}$ shows separation between the orange and blue crabs; the partitions within these color groups fall in the same direction (Figure 2).

The GLASS data appear to be better clustered by optimizing the $I_{2D\,Bock}$ index (Figure 3) rather than the $I_{2D\,OCV}$ index (Figure 4). The latter produces a one-dimensional solution on two highly correlated projections, although we will see that its misclassification rate is the same as that for the Bock index.

The results for the simulated data show better cluster recovery for the optimum of the $I_{2D\,OCV}$ index (Figure 6) than for the optimum of the $I_{2D\,Bock}$ index (Figure 5).

We display the number of misclassifications for each dataset and for each PPC index in Table 1. We include the misclassifications from three-dimensional implementations of the PPC indexes and from a standard k-means algorithm (Hartigan and Wong 1979) initialized with the original classification, for comparison purposes. Although these misclassification rates are not a perfect guide to the cluster recovery success of the algorithms (the original classifications do not necessarily define distinct clusters), we can see that none of these methods are consistently successful in their cluster recovery. The $I_{dD\,Bock}$ index can fail because it is affected by scale and the $I_{dD\,OCV}$ index can fail because of its susceptibility to correlation in the data. Neither index consistently outperforms the k-means algorithm and both two-dimensional versions are inferior to it in their recovery of clusters in the GLASS data. The generally lower misclassification rates for the three-dimensional indexes indicate that dimensionality is something we should consider. We look at diagnostics for choosing the dimensionality of a PPC index in a later section of this article.

We require a PPC method that is unaffected by both scale and correlation in the data. Although we may be able to introduce a factor to account for correlation in the $I_{2D\,OCV}$ index, this is a somewhat ad hoc approach. The usual procedure in PP is to center and sphere the data such that they have zero mean vector and covariance equal to the identity matrix. We examine this practice in the next section.


Figure 1. Local maximum of $I_{2D\,Bock}$ for CRAB dataset. Two dimensions, four groups.

Figure 2. Local maximum of $I_{2D\,OCV}$ for CRAB dataset. Two dimensions, four groups.


Figure 3. Local maximum of $I_{2D\,Bock}$ for GLASS dataset. Two dimensions, six groups.

Figure 4. Local maximum of $I_{2D\,OCV}$ for GLASS dataset. Two dimensions, six groups.


Figure 5. Local maximum of $I_{2D\,Bock}$ for simulated dataset. Two dimensions, eight groups.

Figure 6. Local maximum of $I_{2D\,OCV}$ for simulated dataset. Two dimensions, eight groups.


Table 1. Table Showing the Number of Misclassifications at Local Optima of the Clustering Algorithms for the Example Datasets

Dataset        CRAB    GLASS    Simulated
I_2D Bock       132      107          210
I_2D OCV         90      107           36
I_3D Bock       132       96          179
I_3D OCV         15       99          155
k-means         132       98          233

4. CENTERING AND SPHERING

Huber (1985) suggested that a good PP index should be invariant to location and scale (affine invariant). Jones and Sibson (1987) agreed that "visual interestingness is an affine invariant notion," but preferred instead to center and sphere the data before applying PP; all projections of the centered, sphered data $X_{cs}$ have unit variance and are uncorrelated. That is,

$$\mathrm{var}(X_{cs} A') = A X_{cs}' X_{cs} A' = A A' = I. \tag{4.1}$$

However, doubts have been raised as to the effectiveness of sphering the data before applying PP. Cook, Buja, Cabrera, and Hurley (1995) discussed the impact of sphering and echoed the views of Gower in the discussion of Jones and Sibson (1987), noting that "sphering is graphically distracting because it changes the shape of the data and may in some cases hide features that were previously visible." Despite this observation, Cook et al. (1995) decided to incorporate sphering into the XGobi data visualization tool since it was essential for the effectiveness of several PP indexes.
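A generic way to centre and sphere the data (a sketch, assuming the sample covariance matrix is nonsingular; this is the standard whitening transformation, not code from the paper):

```python
import numpy as np

def sphere(X):
    """Return centred, sphered data X_cs with zero mean and identity sample
    covariance, so that Equation (4.1) holds for any A with AA' = I_d."""
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)              # P x P sample covariance
    vals, vecs = np.linalg.eigh(S)            # assumes all eigenvalues > 0
    return Xc @ (vecs @ np.diag(vals ** -0.5) @ vecs.T)
```

After this transformation, `np.cov(sphere(X), rowvar=False)` is the identity matrix up to rounding, so any orthonormal projection of the sphered data is uncorrelated with unit variances.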

Sphering the data has a useful simplifying effect on the PPC indexes; in fact, centering and sphering the data makes the Bock and dD OCV indexes identical. The optimization of the Bock index becomes the maximization of

$$I_{dD\,Bock}(A, C) = \mathrm{Trace}(A(I - W_C)A'); \qquad AA' = I_d, \tag{4.2}$$

or, equivalently, the minimization of

$$I_{dD\,Bock}(A, C) = \mathrm{Trace}(A W_C A'); \qquad AA' = I_d. \tag{4.3}$$

The criterion we minimize is thus

$$\sum_{i=1}^{d} a_i W_C a_i'; \qquad a_i a_i' = 1, \quad a_i a_j' = 0 \ (j = 1, \ldots, i-1). \tag{4.4}$$

When we apply the dD OCV PPC index to sphered data the optimization reduces to maximizing

$$I_{dD\,OCV}(A, C) = \frac{d}{\sum_{i=1}^{d} a_i W_C a_i'}; \qquad a_i a_i' = 1, \quad a_i a_j' = 0 \ (j = 1, \ldots, i-1), \tag{4.5}$$


which is equivalent to minimizing Equation (4.4). We can easily obtain this result via the minimization of $a_i W_C a_i'$ through eigendecomposition of $W_C$. Thus $I_{dD\,OCV}$ is equivalent to $I_{dD\,Bock}$ when the data are sphered.

In applying canonical variates to a dataset X with a fixed clustering C, we take the eigenvectors corresponding to the largest eigenvalues of $W_C^{-1} B_C$ as the solution. We showed that we can replace $B_C$ with $T$ to obtain the same canonical variates; however, when the data are sphered, then $T = I_P$ and the eigendecomposition reduces to that of $W_C^{-1}$. Maximizing $\mathrm{Trace}(A W_C^{-1} A')$ is achieved with the same eigenvectors as minimizing $\mathrm{Trace}(A W_C A')$. Thus, maximizing $I_{OCV}(A, C)$ for a fixed clustering C reduces to the minimization of

$$\mathrm{Trace}(A W_C A'); \qquad AA' = I_d. \tag{4.6}$$

Since $W_C$ is symmetric, the eigenvectors are automatically orthogonal; thus the constraint $AA' = I_d$ is satisfied by the eigendecomposition. However, this PPC index is not identical to Bock's as the criterion maximized there is

$$\sum_{i=1}^{d} \frac{1}{a_i W_C a_i'}; \qquad a_i a_i' = 1, \quad a_i a_j' = 0 \ (j = 1, \ldots, i-1), \tag{4.7}$$

which is not equivalent to minimizing Equation (4.4). We require consistency; otherwise the k-means clustering step of the algorithm minimizes a different criterion from that in the projection step.

We can provide further evidence that applying canonical variates to the sphered data will give us an optimal projection step of the PPC algorithm for separating a given classification. Canonical variates are optimal linear discriminant functions in the sense of Fisher's linear discriminant function; that is to say, they are "those linear functions which separate the k sample means as much as possible" (Mardia et al. 1979, p. 343). Canonical variates are not strictly equivalent to the maximum likelihood (ML) rule for discriminating between multivariate normal populations, except in the case of two groups (Mardia et al. 1979, p. 320). An advantage of the equivalence with Fisher's function as opposed to the ML rule is that we need not assume multivariate normality. Canonical variates are generally not mutually orthogonal; however, when the data are sphered we automatically achieve the orthogonality we require for PP. We thus automatically preserve the structure of the sphered data in the canonical variates.
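On centred, sphered data the projection step therefore reduces to an eigendecomposition of $W_C$, taking the d eigenvectors with the smallest eigenvalues. A hedged sketch (function names are ours) that plugs into the alternating skeleton given in Section 2.1; the index is written as the negative projected within-group sum-of-squares so that larger values are better:

```python
import numpy as np

def within_group_ssp(X, labels):
    """Within-group sums-of-squares-and-products matrix W_C."""
    W = np.zeros((X.shape[1], X.shape[1]))
    for j in np.unique(labels):
        G = X[labels == j] - X[labels == j].mean(axis=0)
        W += G.T @ G
    return W

def sphered_ppc_projection(X, labels, d=2):
    """Minimise Trace(A W_C A') over AA' = I_d: take the d eigenvectors of W_C
    with the smallest eigenvalues (orthonormal because W_C is symmetric)."""
    vals, vecs = np.linalg.eigh(within_group_ssp(X, labels))  # ascending order
    return vecs[:, :d].T

def i_ppc(A, X, labels):
    """Common sphered index, written as a quantity to maximise."""
    return -float(np.trace(A @ within_group_ssp(X, labels) @ A.T))
```

Running `ppc(sphere(X), labels0, sphered_ppc_projection, i_ppc)` gives one pass of the sphered PPC algorithm under these assumptions.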

5. EXAMPLES REVISITED

We sphere our example datasets and reapply the PPC indexes (now identical) under the notation $I_{PPC}$; Table 2 shows the number of misclassifications for each example dataset after the data have been centered and sphered. We applied the $I_{PPC}$ index in two dimensions as before and with the original number of groups. Apart from the CRAB dataset, the $I_{PPC}$ index produces more misclassifications on the sphered data than at least one of the $I_{dD\,Bock}$ or $I_{2D\,OCV}$ PPC indexes did when the data were unsphered; the PPC index also has more misclassifications than a standard k-means algorithm for the GLASS and simulated data.


Table 2. Table Showing the Number of Misclassifications at Local Optima of the Clustering Algorithms for the Example Datasets

Dataset        CRAB    GLASS    Simulated
I_PPC            17      117           49
k-means          31       63           30

However, examination of the PPC solution for the sphered GLASS data in Figure 7 implies that the clustering found is not unreasonable.

We have already seen that varying the dimensionality of PPC indexes can affect cluster recovery. The number of groups and the starting classification can also affect the performance of the algorithm. We now look to investigate the question of how many groups and how many dimensions one should use when applying the PPC index to sphered data.

6. PPC DIAGNOSTICS

6.1 HOW MANY CLUSTERS?

When using a clustering method we are often interested in finding the most appropriate number of clusters in the dataset. To do this in the context of PPC, we adapt a diagnostic from cluster analysis and apply it to our example datasets. We pay particular attention to the simulated dataset, since we know both its dimensionality and the correct number of clusters.

Figure 7. Global optimum of $I_{PPC}$ for GLASS dataset. Two dimensions, six groups.


Table 3. Table Showing the Total Within-Group Sums-of-Squares (and their frequency from 100 runs of the PPC algorithm) at the Three Smallest Local Minima of the PPC Algorithm for the Simulated Datasets for Different Numbers of Clusters (3 sf)

k      Min.         2nd          3rd
6      313 (4)      318 (1)      319 (1)
7      242 (1)      244 (1)      245 (2)
8      173 (10)     234 (2)      236 (3)
9      159 (3)      160 (2)      161 (1)
10     149 (1)      150 (1)      195 (1)

One way of examining cluster recovery is to look at the cohesiveness of the clusters as measured by the within-group sums-of-squares. Table 3 shows the three smallest values of the resulting total within-group sums-of-squares from 100 random starts of the PPC algorithm in three dimensions for the simulated data; the frequency with which each value occurred is included in parentheses. Not only are we more likely to find the smallest local minimum of the PPC index, but we also gain some idea of the sensitivity of the index to the random start and to different numbers of clusters. We know the true number of clusters in the simulated dataset and thus only search for a selection of clusters for demonstration purposes.

Hartigan (1975) suggested the following rough rule of thumb for choosing the number of clusters when applying the k-means algorithm. If $W_k$ is the total within-group sums-of-squares for the k-means minimum with k clusters, then Hartigan considered it justifiable to add an extra cluster if $(n - k - 1)(W_k / W_{k+1} - 1) > 10$, where n is the total sample size. We investigate the reliability of this criterion by looking at the simulated data, which we know has eight clusters in three dimensions. Table 4 displays these calculated values for the minimum within-group sums-of-squares values. However, these values are all well above 10, questioning the validity of this threshold value. If we instead look at the maximum value, we see that it occurs when we increase from seven to eight clusters; the criterion is thus maximized at the correct number of clusters.
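The rule is simple to evaluate from the within-group sums-of-squares; the sketch below (helper name ours) reproduces the Table 4 values from the minima reported in Table 3 for the simulated data, where n = 400.

```python
def hartigan_statistic(W_k, W_k1, n, k):
    """Hartigan's (1975) rule of thumb: add a (k+1)-th cluster when
    (n - k - 1)(W_k / W_{k+1} - 1) exceeds 10."""
    return (n - k - 1) * (W_k / W_k1 - 1)

W = {6: 313, 7: 242, 8: 173, 9: 159, 10: 149}   # minima from Table 3
for k in range(6, 10):
    print(k, round(hartigan_statistic(W[k], W[k + 1], n=400, k=k), 1))
# 6 115.3, 7 156.3, 8 34.4, 9 26.2: the statistic peaks at the move from
# seven to eight clusters, the true number for the simulated data.
```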

Table 6 displays the calculated values for the minimum within group sums-of-squares values in Table 5 for the CRAB dataset. Although we do not know the dimensionality of the dataset, we performed the search in two dimensions so that we could easily visualize the PPC solutions. Using the same diagnostic we conjecture that there are three clusters in the CRAB data when we search with PPC in two dimensions. In two dimensions we can visualize PPC solutions for different numbers of clusters, and from Figures 8-11 we suggest that the "three cluster" conjecture is reasonable. These plots of the PPC solutions

Table 4. Table Showing Diagnostic Criterion Values for Choosing the Correct Number of Clusters for the Simulated Data in Three Dimensions (3 sf)

k                                   6      7      8      9     10
W_k                               313    242    173    159    149
(n - k - 1)(W_k/W_{k+1} - 1)      115    156   34.4   26.2      -


Table 5. Table Showing the Total Within Group Sums-of-Squares (and their frequency from 100 runs of the PPC algorithm) at the Three Smallest Local Minima of the PPC Algorithm for the CRAB Datasets for Different Numbers of Clusters

k     Min.         2nd          3rd
2     224 (12)     240 (22)     243 (8)
3     100 (14)     105 (11)     108 (4)
4     60.7 (7)     66.9 (4)     77.2 (1)
5     50.6 (3)     51.8 (3)     52.0 (2)

correspond to the global minimum of the index values. We can see that the two (Figure 8) and three (Figure 9) cluster solutions look to have well separated clusters, but the four (Figure 10) and five (Figure 11) cluster solutions show less distinct separation between all clusters.

For completeness we also look at the GLASS data. Again, we do not know the correct dimensionality, but suppose that we are interested in finding a clustering in two dimensions for visualization purposes. The results in Table 7 suggest that there are three clusters; the smallest local minimum of the PPC index in two dimensions consists of clusters of size 184, 28, and 2. The smallest cluster is well separated from the others and consists of two observations from class 5; the cluster with 28 observations contains 25 observations from class 7 and is also well separated from the other clusters. Other than for the two cluster solution, the PPC algorithm is extremely sensitive to its starting classification.

We note that for both the CRAB and GLASS datasets the PPC algorithm was least sensitive to its initial classification for two clusters. This may be due to the number of possible allocations of observations increasing as the number of clusters increases, thus increasing the space of solutions.

6.2 HOW MANY DIMENSIONS?

We noted in Section 3.1 that changing the dimensionality of the PPC index can affect the accuracy of cluster recovery. Since we obtain the d-dimensional projection in the PPC algorithm from the first d eigenvectors of the total within-group sums-of-squares matrix, the algorithm is easily and efficiently extended to higher dimensions.

We fix the number of clusters that the algorithm searches for, but vary the dimensional- ity of the search subspace. We then look at both the misclassification rate and the total within-

Table 6. Table Showing the Criterion for Choosing the Correct Number of Clusters for the CRAB Data in Two Dimensions

k                                   1      2      3      4      5
W_k                               400    224    100   60.7   50.6
(n - k - 1)(W_k/W_{k+1} - 1)      155    243    128   38.9      -


Figure 8. Global optimum of $I_{PPC}$ for CRAB dataset. Two dimensions, two groups.

Figure 9. Global optimum of $I_{PPC}$ for CRAB dataset. Two dimensions, three groups.


Figure 10. Global optimum of $I_{PPC}$ for CRAB dataset. Two dimensions, four groups.

Figure 11. Global optimum of $I_{PPC}$ for CRAB dataset. Two dimensions, five groups.


Table 7. Table Showing the Criterion for Choosing the Correct Number of Clusters for the GLASS Data in Two Dimensions

k                                   1      2      3      4      5      6      7
W_k                               428    232   64.2   41.4   27.1   19.9   17.3
(n - k - 1)(W_k/W_{k+1} - 1)      179    551    116    110   75.3   31.1      -

group sums-of-squares to see if we can identify the optimal dimensionality of the subspace. In Table 8 we show the within group sums-of-squares (with the number of misclassifications in parentheses) for each of our example datasets at all possible dimensionalities greater than or equal to two. In each case we initialized the PPC algorithm from the original classification of the dataset.

"Scree" plots are often used to identify the number of dimensions in principal compo- nents analysis. We can form similar plots for the total within-group sums-of-squares and identify the correct number of dimensions by a sharp increase in the slope immediately to the right. For example, If we look at scree plots of the total within-group sums-of-squares for the datasets (Figures 12-14), we see that the slope suddenly steepens in Figure 12 after the third dimension, and (to a lesser extent) in Figure 13 after the second dimension. We know that the dimensionality of the clustering in the simulated dataset is equal to three and the misclassification rate in Table 8 also suggests that the dimensionality of the cluster

structure in the CRAB dataset is equal to two. The scree plot for the GLASS data is less informative; we would perhaps conjecture that the dimensionality of the clustering here is equal to five as the slope steepens slightly after this dimension.
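A plot like Figure 12 can be reproduced directly from the Table 8 values; a matplotlib sketch using the simulated-data row of that table:

```python
import matplotlib.pyplot as plt

dims = list(range(1, 9))
wss = [5.3, 41.0, 173.1, 559.7, 937.5, 1339.2, 1737.8, 2136.8]  # Table 8, simulated data

plt.plot(dims, wss, marker="o")
plt.xlabel("Dimension")
plt.ylabel("Total within-group sums-of-squares")
plt.title("Scree plot, simulated dataset (eight groups)")
plt.show()
# The slope steepens markedly after dimension 3, the known dimensionality
# of the cluster structure.
```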

7. CONCLUSION

Projection pursuit clustering is successful in providing a low-dimensional clustering of multivariate data, especially when the data have been sphered. Clustering in two or three dimensions can be particularly useful for exploratory visualization of structure in the data. For those wary of sphering the data, the $I_{dD\,Bock}$ and $I_{dD\,OCV}$ indexes can also provide reasonable cluster recovery in some circumstances.

Table 8. Table Showing the Total Within-Group Sums-of-Squares for PPC Solutions (and their number of misclassifications) for Various Dimensionalities When Starting the PPC Algorithm at the Original Classifications of the Example Datasets

Dimension    1           2            3             4            5            6            7            8            9
CRAB         7.3 (87)    60.7 (17)    202.1 (33)    401.2 (30)   601.8 (31)
GLASS        4.3 (59)    34.2 (117)   104.0 (116)   244.5 (50)   381.6 (58)   603.2 (62)   795.6 (70)   964.3 (59)   1177.3 (63)
Simulated    5.3 (128)   41.0 (49)    173.1 (6)     559.7 (14)   937.5 (45)   1339.2 (42)  1737.8 (30)  2136.8 (30)


Figure 12. Scree plot by dimension of cumulative total within-group sums-of-squares for simulated dataset with eight groups.

Figure 13. Scree plot by dimension of cumulative total within-group sums-of-squares for CRAB dataset with four groups.


Diagnostics for clustering are traditionally based on separating mixtures of normal distributions and thus may not perform well on real data with unusually shaped clusters. Our diagnostics are intended to provide a guide for choosing the number of clusters given the dimensionality, and for choosing the dimensionality given the number of clusters. An extensive search through all combinations of cluster numbers and dimensionalities could be performed, but care should be exercised in the interpretation of the diagnostic results to choose the "best" combination of these parameters. A feature of the PPC algorithm is its propensity to find local cluster structure in the data, as there may be more than one valid clustering occurring in different subspaces. Further research lies in refining and combining diagnostics; however, the combination of projection and clustering and the local optimization nature of the algorithm makes good diagnostics more difficult to identify.

A. MEAN AND COVARIANCE FOR EXAMPLE DATASET

A.1 MEANS

$\mu_1 = (0, 0, 0, 0, 0, 0, 0, 0)$

$\mu_2 = (4, 0, 0, 0, 0, 0, 0, 0)$

$\mu_3 = (4, 1, 0, 0, 0, 0, 0, 0)$

$\mu_4 = (0, 1, 0, 0, 0, 0, 0, 0)$

$\mu_5 = (4, 1, 1, 0, 0, 0, 0, 0)$

$\mu_6 = (4, 0, 1, 0, 0, 0, 0, 0)$

$\mu_7 = (0, 1, 1, 0, 0, 0, 0, 0)$

$\mu_8 = (0, 0, 1, 0, 0, 0, 0, 0)$

A.2 COVARIANCE MATRIX

$$
\begin{pmatrix}
 0.8   & -0.15  &  0.15  & 0   & 0   & 0   & 0   & 0   \\
-0.15  &  0.05  & -0.025 & 0   & 0   & 0   & 0   & 0   \\
 0.15  & -0.025 &  0.05  & 0   & 0   & 0   & 0   & 0   \\
 0     &  0     &  0     & 0.3 & 0   & 0   & 0   & 0   \\
 0     &  0     &  0     & 0   & 0.3 & 0   & 0   & 0   \\
 0     &  0     &  0     & 0   & 0   & 0.3 & 0   & 0   \\
 0     &  0     &  0     & 0   & 0   & 0   & 0.3 & 0   \\
 0     &  0     &  0     & 0   & 0   & 0   & 0   & 0.3
\end{pmatrix}
$$
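Under this specification the simulated dataset can be regenerated roughly as follows (a sketch: the seed is arbitrary, and the equal group sizes of 50 are implied by 400 observations in eight groups rather than stated explicitly in the text).

```python
import numpy as np

block = np.array([[ 0.80, -0.150,  0.150],
                  [-0.15,  0.050, -0.025],
                  [ 0.15, -0.025,  0.050]])
cov = np.zeros((8, 8))
cov[:3, :3] = block
cov[3:, 3:] = 0.3 * np.eye(5)            # five dimensions of Gaussian noise

corners = [(0, 0, 0), (4, 0, 0), (4, 1, 0), (0, 1, 0),
           (4, 1, 1), (4, 0, 1), (0, 1, 1), (0, 0, 1)]
means = [np.r_[c, np.zeros(5)] for c in corners]

rng = np.random.default_rng(1)           # arbitrary seed, not the authors'
X = np.vstack([rng.multivariate_normal(m, cov, size=50) for m in means])
labels = np.repeat(np.arange(8), 50)     # 400 observations in eight groups
```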


Figure 14. Scree plot by dimension of cumulative total within-group sums-of-squares for GLASS dataset with six groups.

ACKNOWLEDGMENTS

We are grateful to the associate editor and referees for their suggestions, which have materially improved the presentation of this article. The first author was supported by grants from the EPSRC and Mathematical Market Research (MMR) Ltd.

[Received June 2000. Revised October 2001.]

REFERENCES

Banfield, C. F., and Bassill, L. C. (1977), "AS 113: A Transfer Algorithm for Nonhierarchical Classification," Applied Statistics, 26, 206-210.

Bock, H.-H. (1986), "Multidimensional Scaling in the Framework of Cluster Analysis," in Studien zur Klassifikation, eds. P. Degens, H.-J. Hermes, and O. Opitz, Frankfurt: INDEKS-Verlag, pp. 247-258.

(1987), "On the Interface Between Cluster Analysis, Principal Component Analysis, and Multidimensional Scaling," in Multivariate Statistical Modeling and Data Analysis, eds. H. Bozdogan and A. K. Gupta, Boston: Reidel.

Bolton, R. J., and Krzanowski, W. J. (1999), "A Characterization of Principal Components for Projection Pursuit," The American Statistician, 53, 108-109.

Borg, I., and Groenen, P. (1997), Modern Multidimensional Scaling: Theory and Applications, New York: Springer-Verlag.


Campbell, N. A., and Mahon, R. J. (1974), "A Multivariate Study of Variation in Two Species of Rock Crab of the Genus Leptograpsus," Australian Journal of Zoology, 22, 417-425.

Cook, D., Buja, A., Cabrera, J., and Hurley, C. (1995), "Grand Tour and Projection Pursuit," Journal of Computational and Graphical Statistics, 3, 155-172.

Eslava, G., and Marriott, F. H. C. (1994), "Some Criteria for Projection Pursuit," Statistics and Computing, 4, 13-20.

Everitt, B. S. (1993), Cluster Analysis (3rd ed.), London: Arnold.

Gnanadesikan, R., Kettenring, J. R., and Tsao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113-136.

Gordon, A. D. (1987), "A Review of Hierarchical Classification," Journal of the Royal Statistical Society, Series A, 150, 119-137.

Hartigan, J. A. (1975), Clustering Algorithms, New York: Wiley.

Hartigan, J. A., and Wong, M. A. (1979), "A k-Means Clustering Algorithm," Applied Statistics, 28, 100-108.

Heiser, W. J., and Groenen, P. J. F. (1997), "Cluster Differences Scaling with a Within-Clusters Loss Component and a Fuzzy Successive Approximation Strategy to Avoid Local Minima," Psychometrika, 62, 63-83.

Huber, P. J. (1985), "Projection Pursuit" (with discussion), The Annals of Statistics, 13, 435-525.

Jones, M. C., and Sibson, R. (1987), "What is Projection Pursuit?" (with discussion), Journal of the Royal Statistical Society, Series A, 150, 1-36.

Krzanowski, W. J. (1995), "Orthogonal Canonical Variates for Discrimination and Classification," Journal of Chemometrics, 9, 509-520.

Krzanowski, W. J., and Marriott, F. H. C. (1995), Multivariate Analysis Part 2: Classification, Covariance Structures and Repeated Measurements, London: Arnold.

Kwon, S. (1999), "Clustering in Multivariate Data: Visualization, Case and Variable Reduction," Ph. D. thesis, Iowa State University.

Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979), Multivariate Analysis, London: Academic Press.

McLachlan, G. J., and Basford, K. E. (1988), Mixture Models: Inference and Applications to Clustering, New York: Marcel Dekker.

Posse, C. (1990), "An Effective Two-Dimensional Projection Pursuit Algorithm," Communications in Statistics- Simulation and Computation, 19, 1143-1164.

(1995), "Tools for Two-Dimensional Exploratory Projection Pursuit," Journal of Computational and Graphical Statistics, 4, 83-100.

Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

Swayne, D. F., Cook, D., and Buja, A. (2000), "Interactive and Dynamic Graphics for Data Analysis Using XGobi," forthcoming.

Wishart, D. (1969), "Mode Analysis," in Numerical Taxonomy, ed. A. J. Cole, New York: Academic Press.

Wong, M. A., and Lane, T. (1983), "A kth Nearest Neighbour Clustering Procedure," Journal of the Royal Statistical Society, Series B, 45, 362-368.
