Stability of Cluster Analysis
Statistik Austria - Die Informationsmanager
Matthias Templ & Peter Filzmoser
Vienna University of Technology
Vienna, June 16, 2006
Stability of Cluster Analysis
For real data sets without obvious grouping structure the stability of clusters depends on:
1. Input data - the selection of variables
2. Preparation of the data
3. Distance measure used ∗
4. Clustering method
5. Number of clusters
Changing one parameter may result in completely different cluster results.
∗if a distance measure must be chosen
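As a small illustration of this sensitivity (a toy sketch with base R only, not from the slides): the same data clustered with the same hierarchical method but two different distance measures can already yield different partitions.

```r
# Toy illustration: same data, same method, two distance measures.
set.seed(1)
x <- scale(matrix(rnorm(40), ncol = 2))   # 20 observations, 2 variables

cl_euc <- cutree(hclust(dist(x, method = "euclidean")), k = 3)
cl_man <- cutree(hclust(dist(x, method = "manhattan")), k = 3)

# Cross-tabulate the two partitions; off-diagonal counts mark
# observations that switch clusters when the distance changes.
table(cl_euc, cl_man)
```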
1. Input Data - Variable Selection
> library(mvoutlier)
> library(cluster)
> data(humus)
> a <- agnes(t(prepare(humus[, -c(1:3)])))
> plot(a, which.plots = 2, col = c(4), col.main = 3, col.sub = 2)
1. Input Data - Variable Selection
[Figure: Dendrogram of agnes(x = t(prepare(humus[, −c(1:3)]))); Agglomerative Coefficient = 0.47; y-axis: Height. The dendrogram groups the element variables into chemically related branches.]
A chemical process can be seen in more detail in a map (later) by choosing similar variables.
1. Input Data - Variable Selection
A selection of variables may be useful when clustering high-dimensional data because . . .
• clustering with all variables may hide underlying processes
• we want to see some processes in more detail
• the inclusion of one irrelevant variable may hide the real clusters in the data
One simple way (amongst others) to select variables is graphical inspection of a dendrogram
that results from hierarchically clustering the variables.
2. Data Preparation
Most real data sets in practice have some or all of these properties:
• neither normal nor log-normal
• strongly skewed
• often multi-modal distributions
• dependencies between observations
• weak clustering structures
• data includes outliers
• variables show a striking difference in the amount of variability
2. Data Preparation
If a good clustering structure for a variable exists, we expect a distribution with two or more
modes. A transformation (e.g. a Box-Cox transformation) will preserve the modes
but remove large skewness.
Standardisation of the variables is needed if the variables show a striking difference in
the amount of variability.
Outliers can influence the clustering (depending on which clustering algorithm is chosen);
removing outliers before clustering may be useful.
Finding outliers is not a trivial task, especially in high dimensions (you can do this e.g.
with the package mvoutlier from Filzmoser et al. (2005)).
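A minimal sketch of these preparation steps on toy data, assuming a simple log-transform in place of Box-Cox and robust Mahalanobis distances via MASS::cov.rob as a stand-in for the mvoutlier machinery:

```r
library(MASS)                      # cov.rob: robust location/scatter

set.seed(1)
x <- matrix(rexp(200, rate = 1), ncol = 2)   # strongly skewed toy data
x <- scale(log(x))                 # reduce skewness, equalise variability

rob <- cov.rob(x)                  # robust estimates, resistant to outliers
md  <- mahalanobis(x, rob$center, rob$cov)
outlier <- md > qchisq(0.975, df = ncol(x))  # usual chi-squared cutoff

x_clean <- x[!outlier, ]           # cluster on the cleaned data
```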
3. Distance Measure
[Figure: Boxplots of the Rand Index over 500 bootstrap samples for pam and kmeans combined with the Euclidean, Manhattan, Gower, and random-forest distances; the index ranges from about 0.2 to 1.0.]
The clustered data and clustered subsets of the data are compared with the Rand Index.
Distance measures which result in high Rand Indices should be chosen.
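A hedged sketch of this bootstrap comparison, with the plain Rand index written out by hand (packages such as flexclust and mclust provide ready-made versions):

```r
# Rand index: proportion of observation pairs on which two partitions agree.
rand_index <- function(a, b) {
  p <- combn(length(a), 2)
  same_a <- a[p[1, ]] == a[p[2, ]]
  same_b <- b[p[1, ]] == b[p[2, ]]
  mean(same_a == same_b)
}

set.seed(1)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
full <- kmeans(x, centers = 2, nstart = 10)$cluster

# One bootstrap replicate: recluster a resampled data set and compare
# its labels with the full-data labels on the resampled observations.
idx  <- sample(nrow(x), replace = TRUE)
boot <- kmeans(x[idx, ], centers = 2, nstart = 10)$cluster
rand_index(full[idx], boot)        # near 1 for a stable clustering
```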
4. Clustering Method
[Figure: the same boxplots of the Rand Index for 500 bootstrap samples, read by clustering method rather than by distance.]
The clustered data and clustered subsets of the data are compared with the Rand Index.
Algorithms which result in high Rand Indices may be chosen.
5. Number of Clusters
[Figure: wb.ratio versus number of clusters (2 to 8) for bclust, clara, cmeans, kcca/Kmedians, Mclust, and specc (polydot kernel), each combined with the Euclidean, Gower, Manhattan, or random-forest distance (or none). Data: humus (Co, Ni, Cu, As, Mo); process: pollution.]
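The wb.ratio (average within-cluster distance divided by average between-cluster distance, as computed e.g. by fpc::cluster.stats) can be sketched by hand on toy data; low values indicate compact, well-separated clusters.

```r
# wb.ratio: mean within-cluster distance / mean between-cluster distance.
wb_ratio <- function(x, cl) {
  d    <- as.matrix(dist(x))
  up   <- upper.tri(d)                 # count each pair once
  same <- outer(cl, cl, "==") & up
  diff <- outer(cl, cl, "!=") & up
  mean(d[same]) / mean(d[diff])
}

set.seed(1)
x <- rbind(matrix(rnorm(60), ncol = 2),
           matrix(rnorm(60, mean = 3), ncol = 2))

# Evaluate candidate numbers of clusters, as in the figure.
sapply(2:8, function(k) wb_ratio(x, kmeans(x, k, nstart = 10)$cluster))
```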
Example
[Figures: the nine clusters found by Mclust on the scaled and transformed humus data, each with a map panel and a barplot of the element pattern (Ag, Al, As, B, Ba, Be, Bi, Ca, Cd, Co, Cr, Cu, Fe, Hg, K, La, Mg, Mn, Mo, Na, Ni, P, Pb, Rb, S, Sb, Sc, Si, Sr, Th, Tl, U, V, Y, Zn, C, H, N, LOI, pH, Cond). Cluster sizes (obs) and separations, in the order shown: 27/2.021, 35/2.672, 126/3.813, 22/2.464, 74/2.555, 34/2.366, 166/2.027, 100/2.118, 33/2.49.]
Pollution
The highest pollution is visualised by cluster 9; this can be seen in the graphic on the right,
where e.g. Co, Cu, and Ni are typical elements reflecting pollution.
[Figure: Mclust on the scaled and transformed humus data, with a validity measure and the cluster size for each cluster; all clusters are visualised each in its own map, the greyscale depending on the validity measure within each cluster. Cluster 5, for example, reflects sea spray.]
Conclusions
• Applying cluster analysis to real data gives highly non-stable results, for many
reasons.
• The selection of variables and of the optimal number of clusters on real data is a
non-trivial task.
• Cluster analysis can be seen as exploratory data analysis to get ideas about your data.
• Interactive tools which allow for various methods are very helpful.