
©2009 Philip J. Ramsey, Ph.D. 1

Advanced Statistical Methods for Research Math 736/836

Cluster Analysis

Part 1:

Hierarchical Clustering.

Watch out for the occasional paper clip!

©2009 Philip J. Ramsey, Ph.D. 2

· Yet another important multivariate exploratory method is referred to as Cluster Analysis.

· Once again we are studying a multivariate method that is by itself the subject of numerous textbooks, websites, and academic courses.

· The Classification Society of North America (http://www.classification-society.org/csna/csna.html) deals extensively with Cluster Analysis and related topics.

· The Journal of Classification (http://www.classification-society.org/csna/joc.html) publishes numerous research papers on Cluster Analysis.

· Cluster Analysis, like PCA, is a method of analysis we refer to as unsupervised learning.

· A good text is Massart, D. L. and Kaufman, L. (1983), The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, New York: John Wiley & Sons.

©2009 Philip J. Ramsey, Ph.D. 3

· Supervised Learning refers to the scenario where we have a predetermined structure among the variables or observations.

· As an example, some of the variables are considered responses and the remaining variables are predictors or inputs to a model. Multiple Regression analysis is a good example of this type of supervised learning.

· Another type of supervised learning occurs when we have a predetermined classification structure among the observations and we attempt to develop a classification model that accurately predicts that structure as a function of covariates (variables) in the dataset.

· Logistic regression, Discriminant Analysis, and CART modeling are examples of this type of supervised learning.

· In Cluster Analysis we attempt to estimate an empirical classification scheme among the observations or variables or both.

©2009 Philip J. Ramsey, Ph.D. 4

· Clustering is a multivariate technique of grouping observations together that are considered similar in some manner – usually based upon a distance measure.

· Clustering can incorporate any number of variables for N observations.

· The variables must be numeric variables for which numerical differences or distances make sense – hierarchical clustering in JMP allows nominal and ordinal variables under certain conditions.

· The common situation is that the N observations are not scattered uniformly throughout the variable space, but rather form clumps, or locally dense areas, or modes, or clusters.

· The goal of Cluster Analysis is the identification of these naturally occurring clusters, which helps to characterize the distribution of the N observations.
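To make the distance idea concrete, here is a minimal sketch (not from the original slides) that computes a Euclidean distance matrix for a few made-up observations and finds the closest pair, which is exactly the kind of comparison a clustering algorithm starts from.

```python
import numpy as np

# Hypothetical observations (rows) measured on two variables (columns).
X = np.array([[1.0, 2.0],
              [1.2, 1.9],
              [5.0, 8.0],
              [5.1, 7.8]])

# Pairwise Euclidean distances: d[i, j] = ||X[i] - X[j]||.
d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

# The closest pair of distinct observations is the natural first "cluster".
np.fill_diagonal(d, np.inf)
i, j = np.unravel_index(np.argmin(d), d.shape)
print(f"Closest pair: observations {i} and {j}, distance {d[i, j]:.3f}")
```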

©2009 Philip J. Ramsey, Ph.D. 5

· Basically clustering consists of a set of algorithms to explore hidden structure among the observations.

· The goal is to separate the observations into groups or clusters such that observations within a cluster are as homogeneous as possible and the different groups are as heterogeneous as possible.

· Often we have no a priori hypotheses about the nature of the possible clusters and rely on the algorithms to define the clusters.

· Identifying a meaningful set of groupings from cluster analysis is as much or more a subject matter task as a statistical task.

· Generally no formal methods of inference are used in cluster analysis; it is strictly exploratory, although some t and F tests may be used.

· In some applications of cluster analysis, experts may have some predetermined number of clusters that should exist; however, the algorithm determines the composition of the clusters.

©2009 Philip J. Ramsey, Ph.D. 6

· Cluster Analysis techniques implemented in JMP generally fall into two broad categories.

· Hierarchical Clustering, where we have no preconceived notion of how many natural clusters may exist; it is a combining process.

· K-means Clustering where we have a predetermined idea of the number of clusters that may exist. Everitt and Dunn refer to this as “Optimization Methods.”

· A subset of K-means Clustering is referred to as mixture models analysis (models refer to multivariate probability distributions usually assumed to be Normal).

· If one has a very large dataset, say > 2000 records (depending on computing resources), then the K-means approach might be used due to the large number of possible classification groups that must be considered.

©2009 Philip J. Ramsey, Ph.D. 7

· The cluster structures have 4 basic forms:

· Disjoint Clusters where each object can be in only one cluster – K-means clustering falls in this category;

· Hierarchical Clusters where one cluster may be contained entirely within a superior cluster;

· Overlapping Clusters where objects can belong simultaneously to two or more clusters. Often constraints are placed on the amount of overlapping objects in clusters;

· Fuzzy Clusters are defined by probabilities of membership in clusters for each object. The clusters can be any of the three types listed above.

· The most common types of clusterings used in practice are disjoint or hierarchical.

©2009 Philip J. Ramsey, Ph.D. 8

· Hierarchical clustering is also known as agglomerative hierarchical clustering because we start with a set of N single member clusters and then begin combining them based upon various distance criteria.

· The process ends when we have a final, single cluster containing N members.

· The result of hierarchical clustering is presented graphically by way of a dendrogram or tree diagram.

· One problem with the hierarchical method is that a large number of possible classification schemes are developed and the researcher has to decide which of the schemes is most appropriate.

· Two-way hierarchical clustering can also be performed where we simultaneously cluster on the observations and the variables.

· Clustering among variables is typically based upon correlation measures.
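A minimal sketch of the agglomerative process and its dendrogram, using SciPy rather than JMP; the data are synthetic and purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
# Synthetic data: two loose clumps of observations on two variables.
X = np.vstack([rng.normal(0, 1, size=(10, 2)),
               rng.normal(5, 1, size=(10, 2))])

# Agglomerative clustering: start from single-member clusters and repeatedly
# merge the closest pair (Ward's criterion here).
Z = linkage(X, method="ward")

# The dendrogram shows the full sequence of merges, ending in one cluster.
dendrogram(Z, labels=[f"obs{i}" for i in range(len(X))])
plt.ylabel("Joining distance")
plt.show()
```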

©2009 Philip J. Ramsey, Ph.D. 9

· Clustering of observations is typically based upon Euclidean distance measures between the clusters.

· We typically try to find clusters of observations such that the distances (dissimilarities) between the clusters are maximized for a given set of clusters – as different as possible.

· There are a number of different methods by which the distances between clusters are computed and the methods usually give different results in terms of the cluster compositions.

· Example: The dataset BirthDeath95.JMP contains statistics on 25 nations from 1995. We will introduce hierarchical clustering using the JMP Cluster platform, which is located in the “Multivariate Methods” submenu. The platform provides most of the popular clustering algorithms.

©2009 Philip J. Ramsey, Ph.D. 10

· Example continued: The procedure begins with 25 individual clusters and combines observations into clusters until finally only a single cluster of 25 observations exists. The researcher must determine how many clusters are most appropriate. In JMP the user can dynamically select the number of clusters by clicking and dragging the diamond above or below the dendrogram (see picture to right) and see how the memberships change in order to come to a final set of clusters. In the dendrogram to the right, the countries are assigned markers according to their membership in 1 of the 4 clusters designated.

©2009 Philip J. Ramsey, Ph.D. 11

· Example continued: To the right is the cluster history from JMP. Notice that Greece and Italy are the first cluster formed followed by Australia and USA. At the 9th stage Australia, USA, and Argentina combine into a cluster. Eventually Greece and Italy join that cluster.

· The countries combine into fewer clusters at each stage until there exists only 1 cluster at the top of the dendrogram or tree.

Clustering History

Number of Clusters   Distance       Leader       Joiner
24                   0.141569596    Greece       Italy
23                   0.204865881    Australia    USA
22                   0.215801094    Philippines  Vietnam
21                   0.216828958    Cameroon     Nigeria
20                   0.226494960    Egypt        India
19                   0.370451385    Ethiopia     Somalia
18                   0.415752930    Chile        Costa Rica
17                   0.518773384    China        Indonesia
16                   0.574383932    Argentina    Australia
15                   0.609473010    Chile        Mexico
14                   0.637642141    Cameroon     Kenya
13                   0.701263019    Chile        Thailand
12                   0.739182206    China        Philippines
11                   0.744788865    Argentina    Greece
10                   0.877722286    Bolivia      Nicaragua
9                    0.894878623    Haiti        Zambia
8                    1.073430799    Bolivia      Egypt
7                    1.135510496    Chile        Kuwait
6                    1.595721560    Bolivia      Cameroon
5                    1.829760843    Chile        China
4                    2.246948815    Ethiopia     Haiti
3                    2.714981909    Argentina    Chile
2                    3.296092971    Bolivia      Ethiopia
1                    7.702060221    Argentina    Bolivia

©2009 Philip J. Ramsey, Ph.D. 12

· Example continued: The cluster designations can be saved to the data table for further analysis – option under the red arrow in the Report Window. Graphical analysis is often very important to understanding the cluster structure.

· In this case we will use a new feature in JMP 8 called Graph Builder located as the first option in the Graph Menu. Graph Builder basically allows the user to construct trellis graphics.

· The interface for Graph Builder allows the user to drag and drop variables from the Select Columns window to various areas of the graph builder template.

· The user simply tries various combinations until a desired graph is constructed. By right clicking on the graph it is possible to control what type of display appears on the graph. As an example, one may prefer box plots or histograms for each cell of the plot.

· The next slide shows a Graph Builder display for the 4 clusters.

©2009 Philip J. Ramsey, Ph.D. 13

· Example continued: Graph Builder display of the clusters.

[Graph Builder display, "Literacy & 3 more vs. Cluster": panels of Baby Mort, Literacy, Birth Rate, and Death Rate plotted against the four clusters]

From the graph can you see how the four variables define the four clusters? As an example, how do the clusters vary for Baby Mort?
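Outside JMP, a rough stand-in for the Graph Builder trellis can be sketched with pandas and seaborn; the file name and column names (Cluster, Literacy, BabyMort, BirthRate, DeathRate) are assumptions about how the clustered table might have been exported.

```python
import pandas as pd
import seaborn as sns

# Assumes the 4-cluster assignments were saved with the variables and exported.
df = pd.read_csv("birth_death_with_clusters.csv")   # hypothetical export

# Reshape to long form so each variable gets its own panel of the trellis.
long_df = df.melt(id_vars="Cluster",
                  value_vars=["Literacy", "BabyMort", "BirthRate", "DeathRate"],
                  var_name="Variable", value_name="Value")

# One box plot per variable, grouped by cluster number.
sns.catplot(data=long_df, x="Cluster", y="Value", col="Variable", kind="box")
```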

©2009 Philip J. Ramsey, Ph.D. 14

· Example continued: Below we show a Fit Y by X plot of Death Rate vs. Birth Rate with 90% density ellipses for each cluster.

[Bivariate Fit of Death Rate by Birth Rate, with 90% bivariate normal density ellipses for Clusters 1 through 4; countries labeled on the plot]

©2009 Philip J. Ramsey, Ph.D. 15

· Example continued: Below is a Bubble plot of the data. Note that circles are colored by cluster number.

©2009 Philip J. Ramsey, Ph.D. 16

· Hierarchical clustering, as mentioned, determines the clusters based upon distance measures. JMP supports the five most common distance measures used to create clusters.

· The goal at each stage of clustering is to combine clusters that are most similar in terms of distance between the clusters. The different measures of inter-cluster distance can arrive at very different clustering sequences.

· Average Linkage: the distance between two clusters is the average distance between pairs of observations, one in each cluster. Average linkage tends to join clusters with small variances and is slightly biased toward producing clusters with the same variance. The distance formula is

$$ d_{AB} = \frac{1}{n_A n_B} \sum_{i \in A} \sum_{j \in B} d_{ij} $$

©2009 Philip J. Ramsey, Ph.D. 17

· Centroid Method: the distance between two clusters is defined as the squared Euclidean distance between their means. The centroid method is more robust to outliers than most other hierarchical methods but in other respects might not perform as well as Ward's method or Average Linkage. Distance for the centroid method is

· Ward's Method: For each of k clusters let ESSk be the sum of squared deviations of each item in the cluster from the centroid of the cluster. If there are currently k clusters, then the total ESS = ESS1 + ESS2 + … + ESSk. At each stage all possible unions of cluster pairs are tried and the two clusters providing the least increase in ESS are combined. Initially ESS = 0 for the N individual clusters; the ESS for the final single cluster is given below.

Centroid method:

$$ d_{AB} = \left\lVert \bar{X}_A - \bar{X}_B \right\rVert^{2} $$

ESS for the final single cluster (Ward's method):

$$ ESS = \sum_{i=1}^{N} \left( x_i - \bar{x} \right)'\left( x_i - \bar{x} \right) $$

©2009 Philip J. Ramsey, Ph.D. 18

· Ward's Method: At each stage, the method is biased toward creating clusters with the same numbers of observations. For Ward's method the distance between clusters is calculated as

· Single Linkage: the distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster. Clusters with the smallest distance are joined at each stage. Single linkage has many desirable theoretical properties, but has performed poorly in Monte Carlo studies (Milligan 1980). The inter-cluster distance measure is

Ward's method:

$$ d_{AB} = \frac{\left\lVert \bar{X}_A - \bar{X}_B \right\rVert^{2}}{\dfrac{1}{n_A} + \dfrac{1}{n_B}} $$

Single linkage:

$$ d_{AB} = \min_{i \in A,\; j \in B} d_{ij} $$

©2009 Philip J. Ramsey, Ph.D. 19

· Single Linkage: As mentioned this method has done poorly in Monte Carlo studies, however it is the only clustering method that can detect long, string-like clusters often referred to as chains. It is also very good at detecting irregular, non-ellipsoidal shaped clusters. Ward’s method for example assumes that the underlying clusters are approximately ellipsoidal in shape.

· Complete Linkage: the distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster. At each stage pairs of clusters are joined that have the smallest distance. Complete linkage is strongly biased toward producing clusters with roughly equal diameters and can be severely distorted by moderate outliers (Milligan 1980). Distance for the Complete linkage cluster method is

$$ d_{AB} = \max_{i \in A,\; j \in B} d_{ij} $$
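The five inter-cluster distance measures defined above translate directly into code; the following is an unoptimized sketch, assuming A and B are arrays of observations and plain Euclidean distance between points.

```python
import numpy as np

def dist(x, y):
    """Euclidean distance between two observations."""
    return np.linalg.norm(x - y)

def average_linkage(A, B):
    # Mean of all pairwise distances, one observation from each cluster.
    return np.mean([dist(a, b) for a in A for b in B])

def centroid_distance(A, B):
    # Squared Euclidean distance between the cluster means.
    return float(np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2))

def single_linkage(A, B):
    # Minimum pairwise distance across the two clusters.
    return min(dist(a, b) for a in A for b in B)

def complete_linkage(A, B):
    # Maximum pairwise distance across the two clusters.
    return max(dist(a, b) for a in A for b in B)

def ward_distance(A, B):
    # Squared distance between the means, scaled by the cluster sizes.
    return centroid_distance(A, B) / (1.0 / len(A) + 1.0 / len(B))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 3.0], [5.0, 3.0]])
print(single_linkage(A, B), complete_linkage(A, B), average_linkage(A, B))
```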

©2009 Philip J. Ramsey, Ph.D. 20

· Example: The following is a simple example with 5 observations to illustrate the idea of clustering. We will use the complete linkage method. We will start with a 5 by 5 symmetric matrix of Euclidean distances between the 5 observations.

$$ \begin{pmatrix} 0 & & & & \\ 9 & 0 & & & \\ 3 & 7 & 0 & & \\ 6 & 5 & 9 & 0 & \\ 11 & 10 & 2 & 8 & 0 \end{pmatrix} $$

$$ d_{(35)1} = \max_{i \in A,\; j \in B} d_{ij} = 11, \qquad d_{(35)2} = \max_{i \in A,\; j \in B} d_{ij} = 10, \qquad d_{(35)4} = \max_{i \in A,\; j \in B} d_{ij} = 9 $$

At stage 1 we combine #3 and #5 to form a cluster since they are the closest (distance 2); the values above are the complete-linkage distances from the new cluster (35) to each remaining observation.

At stage two we compute a new distance matrix and join #2 and #4 since they are the closest.

$$ \begin{pmatrix} 0 & & & \\ 11 & 0 & & \\ 10 & 9 & 0 & \\ 9 & 6 & 5 & 0 \end{pmatrix} $$

(rows and columns ordered (35), 1, 2, 4)

©2009 Philip J. Ramsey, Ph.D. 21

· Example continued: At stage 3 we compute a new distance matrix

$$ d_{(35)(24)} = \max_{i \in A,\; j \in B} d_{ij} = 10, \qquad d_{(24)1} = \max_{i \in A,\; j \in B} d_{ij} = 9 $$

Since cluster (24) and #1 are closest they are joined into a new cluster.

Finally at the last stage cluster (241) is joined with cluster (35) to create the final cluster of 5 observations. The clustering stages can easily be visualized in this simple case without a dendrogram.

$$ \begin{pmatrix} 0 & & \\ 10 & 0 & \\ 11 & 9 & 0 \end{pmatrix} $$

(rows and columns ordered (35), (24), 1)
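The merge sequence in this worked example can be checked with SciPy by supplying the same 5-by-5 distance matrix in condensed form and requesting complete linkage; this is a verification sketch, not part of the original slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# The symmetric 5x5 matrix of Euclidean distances from the example.
D = np.array([[ 0,  9,  3,  6, 11],
              [ 9,  0,  7,  5, 10],
              [ 3,  7,  0,  9,  2],
              [ 6,  5,  9,  0,  8],
              [11, 10,  2,  8,  0]], dtype=float)

# linkage expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D), method="complete")
print(Z)
# Each row gives the two clusters merged, the merge distance, and the new
# cluster size: (3,5) at 2, then (2,4) at 5, then ((2,4),1) at 9, and
# finally everything at 11 (SciPy numbers the observations from 0).
```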

©2009 Philip J. Ramsey, Ph.D. 22

· We next work through the use of JMP for Hierarchical clustering. As mentioned Clustering is one of the platforms for Multivariate Methods and is located under that submenu.

For hierarchical clustering, select the desired distance measure; Ward's method is the default.

©2009 Philip J. Ramsey, Ph.D. 23

· Select the columns containing the data on which the hierarchical clustering will be performed. If the rows of the data matrix are identified by a label column then put that variable in the Label box.

If you do not want the data standardized prior to clustering, then deselect the Standardize Data default. The clusters are very scale dependent, so many experts advise standardization if the variable scales are not commensurate.

©2009 Philip J. Ramsey, Ph.D. 24

· In the report window the dendrogram is displayed and many analysis options exist under the hotspot at the top of the report.

Click and drag the red diamond on the dendrogram to change the number of clusters you wish to display. JMP selects a number of clusters by default, but it is not necessarily optimal. Alternatively, one can use the “Number of Clusters” option in the report menu.

The scree plot at the bottom displays the distance that was bridged in order to join clusters at each stage.
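A rough analogue of that scree of joining distances can be sketched with SciPy and matplotlib: plot the distance bridged at each merge against the number of clusters remaining and look for a sharp jump. The toy data below are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, size=(8, 2)) for m in (0, 4, 8)])  # toy data
Z = linkage(X, method="ward")

# Column 2 of the linkage matrix holds the distance bridged at each merge.
join_dist = Z[:, 2]
n_clusters = np.arange(len(join_dist), 0, -1)   # clusters remaining after each merge

plt.plot(n_clusters, join_dist, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Joining distance")
plt.gca().invert_xaxis()   # read left to right as clusters are combined
plt.show()
```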

©2009 Philip J. Ramsey, Ph.D. 25

· In the report window the dendrogram is displayed and many analysis options exist under the hotspot at the top of the report.

Once the number of clusters has been decided, it is a good idea to save the clusters to the data table and mark them. Simply select the options from the menu or right click at the bottom of the dendrogram. If you decide to change the number of clusters, JMP will update the markers and cluster designations. As shown earlier, we often wish to save the clusters to the data table for further analysis in other JMP platforms.
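The save-and-mark step has a direct SciPy/pandas analogue: cut the tree at the chosen number of clusters with fcluster and append the labels as a new column. The file and column names below are hypothetical.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

df = pd.read_csv("birth_death_1995.csv")             # hypothetical file
cols = ["Literacy", "BabyMort", "BirthRate", "DeathRate"]

# Standardize, cluster with Ward's method, then cut the tree at 4 clusters.
X = (df[cols] - df[cols].mean()) / df[cols].std()
Z = linkage(X.values, method="ward")
df["Cluster"] = fcluster(Z, t=4, criterion="maxclust")

# The saved labels can now be used in any further analysis or plot.
print(df.groupby("Cluster")[cols].mean())
```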

©2009 Philip J. Ramsey, Ph.D. 26

· If you mouse click on a branch of the dendrogram, then all of the observations in that branch are highlighted on the dendrogram and selected in the data table.

©2009 Philip J. Ramsey, Ph.D. 27

· A color map can be added to the dendrogram to help understand the relationships between the observations and the variables in the columns. The map contains a progressive color code from smallest value to largest value. As part of the color map, a two way clustering can be performed where a cluster analysis of the variables is added to the bottom of the observation dendrogram. Variables clustering is based on correlation, with negative correlations indicating dissimilarity.

[Dendrogram with color map: the 25 countries with a two-way clustering of the variables Literacy, Baby Mort, Birth Rate, and Death Rate]
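A sketch of the variables side of the two-way clustering, using 1 minus the correlation as the dissimilarity between columns so that negatively correlated variables are treated as far apart; the file and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

df = pd.read_csv("birth_death_1995.csv")             # hypothetical file
cols = ["Literacy", "BabyMort", "BirthRate", "DeathRate"]

# Dissimilarity between variables: 1 - correlation, so strongly negatively
# correlated variables are treated as very dissimilar.
dissim = 1.0 - df[cols].corr().values
np.fill_diagonal(dissim, 0.0)

# Cluster the variables and draw their dendrogram (the strip added to the map).
Z_vars = linkage(squareform(dissim, checks=False), method="average")
dendrogram(Z_vars, labels=cols)
```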

©2009 Philip J. Ramsey, Ph.D. 28

· If a color map is desired it can be advantageous to select a display order column, which will order the observations based on values of the specified column.

· A good candidate for an ordering column is to perform PCA and then save only the first PC. This can then be specified as the ordering column in the launch window.
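A sketch of building that ordering column outside JMP: compute a PCA on the standardized variables and keep only the first component score; the file and column names are assumptions.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("birth_death_1995.csv")             # hypothetical file
cols = ["Literacy", "BabyMort", "BirthRate", "DeathRate"]

# First principal component of the standardized data, saved as a new column.
scores = PCA(n_components=1).fit_transform(StandardScaler().fit_transform(df[cols]))
df["Prin1"] = scores[:, 0]

# Sorting by Prin1 gives the display order for the color map.
df = df.sort_values("Prin1")
```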

©2009 Philip J. Ramsey, Ph.D. 29

· The color map on the left is ordered, the one on the right is not.

[Two dendrograms with color maps: the left is ordered by the first principal component; the right is unordered]

©2009 Philip J. Ramsey, Ph.D. 30

· Example: We use the dataset CerealBrands.JMP, which contains information on 43 breakfast cereals. We have also saved the first principal component score as an ordering column.

©2009 Philip J. Ramsey, Ph.D. 31

· Example: Using two way clustering with Prin1 as an ordering variable we get the following set of clusters. We select 5 as the desired number of clusters. Do the clusters seem logical? Which variables seem important to the cluster development?

[Two-way dendrogram with color map: the 43 cereals clustered on Calories, Protein, Fat, Sodium, Fiber, Carbohydrates, Sugar, and Potassium]

©2009 Philip J. Ramsey, Ph.D. 32

· Example: Below is a Fit Y by X plot of Carbohydrates vs. Calories with the clusters displayed.

[Bivariate Fit of Carbohydrates by Calories with the clusters displayed; cereals labeled on the plot]

©2009 Philip J. Ramsey, Ph.D. 33

· Example: To the right is an analysis using single linkage. Notice that the clustering sequence is significantly different.

[Dendrogram of the cereal data using single linkage]

©2009 Philip J. Ramsey, Ph.D. 34

· An obvious question is which hierarchical clustering method is preferred.

· Unfortunately, over the decades numerous simulation studies have been performed to attempt to answer this question and the overall results tend to be inconsistent and confusing to say the least.

· In the studies, generally Ward’s method and average linkage have tended to perform the best in finding the correct clusters, while single linkage has tended to perform the worst.

· A problem in evaluating clustering algorithms is that each tends to favor clusters with certain characteristics such as size, shape, or dispersion.

· Therefore, a comprehensive evaluation of clustering algorithms requires that one look at artificial clusters with various characteristics. For the most part this has not been done.

©2009 Philip J. Ramsey, Ph.D. 35

· Most evaluation studies have tended to use compact clusters of equal variance and size; often the clusters are based on a multivariate normal distribution.

· Ward’s method is biased toward clusters of equal size and approximately spherical shape, while average linkage is biased toward clusters of equal variance and spherical shape.

· Therefore, it is not surprising that Ward's method and average linkage tend to be the winners in simulation studies.

· In fact, most clustering algorithms are biased toward regularly shaped regions and may perform very poorly if the clusters are irregular in shape.

· Recall that single linkage does well if one has elongated, irregularly shaped clusters.

· In practice, one has no idea about the characteristics of clusters.

©2009 Philip J. Ramsey, Ph.D. 36

· If the natural clusters are well separated from each other, then any of the clustering algorithms are likely to perform very well.

· To illustrate we use the artificial dataset WellSeparateCluster.JMP and apply both Ward’s method and single linkage clustering.

· The data consist of three very distinct (well separated) clusters.
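The WellSeparateCluster.JMP data are not reproduced here, but a quick synthetic stand-in makes the same point: with well-separated clusters, Ward's method and single linkage recover the same three groups.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# Three well-separated spherical Gaussian clusters of equal size (synthetic).
X = np.vstack([rng.normal(center, 0.5, size=(30, 2))
               for center in ([0, 0], [10, 0], [0, 10])])

for method in ("ward", "single"):
    labels = fcluster(linkage(X, method=method), t=3, criterion="maxclust")
    # With this much separation each method recovers three groups of 30 points.
    print(method, np.bincount(labels)[1:])
```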

©2009 Philip J. Ramsey, Ph.D. 37

· Below are the clustering results for both Ward's method, displayed on the left, and single linkage, displayed on the right.

· Notice that both methods easily identify the three clusters.

· Note, the clusters are multivariate normal with equal sizes.

©2009 Philip J. Ramsey, Ph.D. 38

· If the clusters are not well separated, then the various algorithms will perform quite differently. We illustrate with the data set PoorSeparateCluster.JMP and below is a plot of the clusters.

©2009 Philip J. Ramsey, Ph.D. 39

· The plot on the left is for Ward's method and on the right single linkage using Proc Cluster in SAS; the highlighted points are observations for which Proc Cluster could not determine cluster membership.

· Ward’s method has done quite well, while single linkage has done poorly and could not determine cluster membership for quite a few observations.

©2009 Philip J. Ramsey, Ph.D. 40

· Next we look at multivariate normal clusters, but this time they have different sizes and dispersions. The dataset UnequalCluster.JMP contains the results.

©2009 Philip J. Ramsey, Ph.D. 41

· Next we look at multivariate normal clusters, but this time they have different sizes and dispersions. The dataset UnequalCluster.JMP contains the results. On the left is Ward's method and on the right is the average linkage method using Proc Cluster in SAS.

· Ward’s method and average linkage produced almost identical results. However, they both tended toward clusters of equal size and assigned too many observations to the smallest cluster.

©2009 Philip J. Ramsey, Ph.D. 42

· Next we look at two elongated clusters. We will compare Ward’s method to single linkage clustering. Generally, single linkage is supposed to be superior for elongated clusters. The data are in the file ElongateCluster.JMP.

©2009 Philip J. Ramsey, Ph.D. 43

· On the left below is Ward’s method and to the right single linkage method using Proc Cluster in SAS.

· Ward's method finds two clusters of approximately equal size, but classifies many observations poorly. The single linkage method correctly identifies the two elongated clusters.

©2009 Philip J. Ramsey, Ph.D. 44

· When one has elongated clusters this indicates correlation or covariance structure among the variables used to form the clusters.

· Sometimes transformations on the variables can generate more spherical clusters that are more easily detected by Ward’s method or similar methods.

· In theory the method is straightforward; however, if one does not know the number of clusters or the covariance structure within each cluster, these have to be approximated from the data, which is not always easy to do well in practice.

· Proc Aceclus in SAS can be used to perform such transformations prior to clustering with Proc Cluster.

· We can perform a rough approximation to the method in JMP by converting the original variables to principal components and then clustering on the principal component scores.
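A sketch of that rough approximation: standardize the variables (equivalent to a PCA on correlations), save the principal component scores, and run Ward's method on the scores instead of the raw variables; the file and column names are placeholders.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, fcluster

df = pd.read_csv("elongate_cluster.csv")              # hypothetical export
X = StandardScaler().fit_transform(df[["X1", "X2"]])  # PCA on correlations

# Principal component scores are uncorrelated, which tends to make the
# clusters more spherical and easier for Ward's method to separate.
scores = PCA().fit_transform(X)
df["Cluster"] = fcluster(linkage(scores, method="ward"), t=2, criterion="maxclust")
```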

©2009 Philip J. Ramsey, Ph.D. 45

· To illustrate we will use the elongated cluster data and perform a PCA on correlations in the Multivariate platform and save the two principal component scores to the data table.

· Next we will perform clustering, using Ward’s method, on the principal component scores.

· Below is a scatter plot based on the PC’s.

Although not perfect spheres, the PC’s are more spherical in shape than the original variables and more easily separated in the Prin2 direction. Proc Aceclus produced similar results and is not shown.

©2009 Philip J. Ramsey, Ph.D. 46

· Below are the results of clustering on the PC’s using Ward’s method.

· Notice that the two clusters are perfectly classified by Ward's method on the PC's, while the method did not fare well on the original variables.

©2009 Philip J. Ramsey, Ph.D. 47

· To illustrate the clustering, we can use Graph Builder in JMP to show that indeed the clusters are primarily determined by the difference in Prin2 between the two clusters.

©2009 Philip J. Ramsey, Ph.D. 48

· To illustrate the clustering, we can use Graph Builder in JMP to show that indeed the clusters are primarily determined by the difference in Prin2 between the two clusters.

©2009 Philip J. Ramsey, Ph.D. 49

· We examine one more scenario where we have nonconvex, elongated clusters. Because of the cluster shape the PCA approach will not work in this case.

· The data are contained in the file NonConvexCluster.JMP.

· Below is a plot of the two clusters.

©2009 Philip J. Ramsey, Ph.D. 50

· Below are the results of clustering using Ward’s method and single linkage.

· Ward's method misclassifies some of the observations in cluster 2, while the single linkage method identifies the two clusters essentially correctly.