10 Cluster Analysis
TRANSCRIPT
7/28/2019
Cluster analysis
The basic assumption with these methods is that measurements made for related samples tend to be similar.
Overall, the distance between similar samples is smaller than for unrelated samples.
Clustering methods
We'll look at three unsupervised clustering methods.
Univariate clustering. Evaluates individual variables (raw or scaled). Groups samples into homogeneous classes.
Hierarchical cluster analysis (HCA). Reduces the multiple variables for a sample to a single distance value. Ranks and links samples based on relative distances.
k-means clustering. Groups samples into a set number of classes. Uses all variables to determine relative distances.
Univariate clustering.
Creates k homogeneous classes.
Uses within-class variance as the measure of homogeneity.
Can be used to convert a quantitative variable into a discrete ordinal variable.
Another use is to simply evaluate whether a variable has any classification-type information.
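As a sketch of the idea, univariate clustering can be done with a one-dimensional k-means that minimizes within-class variance. The data and function below are illustrative, not from the lecture:

```python
import numpy as np

def univariate_kmeans(x, k, n_iter=100):
    """Partition a single variable into k homogeneous classes
    by minimizing within-class variance (1-D k-means)."""
    x = np.asarray(x, dtype=float)
    # Initialize centroids from evenly spaced quantiles for stability.
    centroids = np.quantile(x, np.linspace(0, 1, k))
    for _ in range(n_iter):
        # Assign each point to the nearest centroid.
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        # Recompute each centroid as the mean of its class.
        new = np.array([x[labels == j].mean() for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Two well-separated groups of "petal width"-like values.
x = [1.0, 1.2, 1.1, 0.9, 4.8, 5.1, 5.0, 4.9]
labels, centroids = univariate_kmeans(x, k=2)
```

The returned labels are exactly the discrete ordinal variable described above.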
Histogram (petal width): relative frequency vs. petal width for the iris dataset.

Iris dataset
Species: I. setosa, I. versicolor, I. virginica
Properties: petal width, petal length, sepal width, sepal length

We'll look at a single property - petal width.
Univariate clustering.
Histogram (petal width): relative frequency vs. petal width.
The goal is to partition the data so that you have k clusters of data.
Iris data: a simple ranking of the data indicates that we would get reasonable clustering based on petal width.
Iris data
Body Temp (from exam)
Not exactly the best classification.
It does show that there is some skew to the results (more men in class one and more women in class two) - and there is a fair amount of overlap.
So what's it good for?
Really only useful for an initial evaluation of individual variables.
Only want to use it when you have a small number of classes (or potential classes).
Main use is to convert quantitative (continuous) data to ordinal data.
HCA: Distance and similarity
Actual distances between your samples will vary based on the type and number of measurements present.
Similarity values are calculated to normalize the data to a standard scale.
For similar samples, s_ij approaches 1.
For dissimilar samples, s_ij approaches 0.
s_ij = 1 - d_ij / d_max
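A minimal sketch of the similarity calculation (the measurement matrix below is illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Example measurement matrix: rows are samples, columns are variables.
X = np.array([[1.0, 2.0],
              [1.1, 2.1],
              [5.0, 8.0]])

# Euclidean distances between all pairs of samples.
d = squareform(pdist(X))   # symmetric distance matrix

# Similarity: 1 for identical samples, 0 for the most distant pair.
s = 1.0 - d / d.max()
```

Samples 1 and 2 above are nearly identical, so their similarity is close to 1; the most distant pair gets exactly 0.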
Clustering
After all our distances or similarities have been
calculated, we need a way of determining how closely
our samples are related or grouped.
We start with the two most related samples and link them - forming an initial cluster.
The process is repeated until all samples have been
linked.
Clustering
Several methods of linking our samples are available.
The three most common are:
Single link
Complete link
Centroid link
Let's start by looking at the simplest method - single link (in two dimensions).
Single link
This approach determines linkage based on the
distance to the closest point in a cluster.
You start by assuming that the two closest points are
a cluster.
All points are initially compared as pairs and then the
search for links is expanded.
Now let's look at an example.
For single linkage, the distance from a newly merged cluster (ij) to another cluster C is:
d_(ij),C = 0.5 d_iC + 0.5 d_jC - 0.5 |d_iC - d_jC|
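A small single-link example using scipy (the points are hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five points in one line: two tight pairs plus one bridge point.
X = np.array([[0.0, 0.0],
              [0.1, 0.0],
              [1.0, 0.0],
              [5.0, 0.0],
              [5.1, 0.0]])

# Single linkage: the merge distance is that of the *closest* pair
# between the two clusters.
Z = linkage(X, method='single')

# Cut the tree into two clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
```

Because single link chains through the bridge point, the first three points end up in one cluster and the last two in the other.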
(Diagrams: step-by-step single-link clustering example.)
Other linkage methods
Complete link
Linkage is based on the farthest point in a cluster - gives a conservative linkage.
d_(ij),C = 0.5 d_iC + 0.5 d_jC + 0.5 |d_iC - d_jC|
Other linkage methods
Centroid link (Ward's Method)
Linkage is based on the center of the cluster.
d²_(ij),C = [n_i/(n_i+n_j)] d²_iC + [n_j/(n_i+n_j)] d²_jC - [n_i n_j/(n_i+n_j)²] d²_ij
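The difference between the three linkage rules can be seen on a toy data set (the points are hypothetical; scipy's `linkage` implements all three updates):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two groups: a triangle of three points and a distant pair.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.5, 0.9],
              [6.0, 0.0],
              [7.0, 0.0]])

# The height of the final merge differs by method: single uses the
# closest inter-cluster pair, complete the farthest, and centroid the
# distance between the cluster centers.
for method in ('single', 'complete', 'centroid'):
    Z = linkage(X, method=method)
    print(method, round(Z[-1, 2], 3))
```

For this data the final merge heights order as single < centroid < complete, illustrating why complete link is called conservative.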
HCA dendrogram
After conducting your linkage, you need a way to visualize the results.
Dendrograms can be used for this purpose and provide a very simple two-dimensional plot that indicates clustering, similarities and linkages.
Dendrograms
We can now see how our samples are linked.
The higher the linkage level, the lower the similarity.
(Similarity axis runs from 1.0 down to 0.0.)
Dendrograms
This plot appears to indicate that there are three groups of samples (A through J) that can only be linked at very low similarity values.
Dendrograms
Let's look again at our single linkage example and see what the dendrogram would look like.
(Example dendrogram; similarity axis from 1.0 to 0.0.)
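A dendrogram for ten hypothetical samples A-J can be generated with scipy; `no_plot=True` returns the tree layout (leaf order, merge heights) without drawing, and the same result can be passed to matplotlib for an actual plot:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical samples A-J, with values chosen to form three groups.
labels = list('ABCDEFGHIJ')
X = np.array([[0.0], [0.2], [0.4],           # group 1
              [5.0], [5.2], [5.4], [5.6],    # group 2
              [10.0], [10.2], [10.4]])       # group 3

Z = linkage(X, method='single')

# Compute the layout only; matplotlib is not required for this step.
tree = dendrogram(Z, labels=labels, no_plot=True)
print(tree['ivl'])  # leaf labels in dendrogram order
```

Each of the three groups forms its own subtree, so its labels appear contiguously along the bottom of the dendrogram.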
A real example
Substances commonly used as accelerants were assayed by capillary column GC/MS.
At present, accelerants are identified based on boiling point range.
Class assignments: A, B, C, D, E
Goal: to determine if multivariate data treatment has the potential for classification of accelerants.
Analysis conditions
Neat samples were spiked with a known amount of an internal standard.
- SP-5 25 m x 0.2 mm I.D. column
- 1 µl sample, 100:1 split injection
- 50°C, 5 min; 10°C/min ramp; hold at 250°C
- Total run time: 30 minutes
- Mass range: 50-150 AMU
- ISTD: octadeuteronaphthalene
Preprocessing of data
A total ion chromatographic profile was
extracted and normalized using the internal
standard.
Triplicate samples were averaged.
The first minute of data was discarded due to
the presence of a solvent tail.
The remaining data was simply summed at one-minute intervals - 19 variables.
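The binning step can be sketched as follows. The profile below is random placeholder data, and the 20-minute usable window is an assumption chosen only so that dropping the first minute leaves the 19 one-minute variables mentioned above:

```python
import numpy as np

# Placeholder for an ISTD-normalized total-ion profile sampled once
# per second (assumed 20 minutes of retained signal).
rng = np.random.default_rng(0)
profile = rng.random(20 * 60)

# Discard the first minute of data (solvent tail).
usable = profile[60:]

# Sum the remaining signal in one-minute intervals -> 19 variables.
variables = usable.reshape(19, 60).sum(axis=1)
```

Each sample is thereby reduced to a 19-element vector suitable for distance calculations.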
Classes
A. Light petroleum distillates - petroleum ethers, lighter fluid, naphtha, camping fuels, ...
B. Gasoline
C. Medium petroleum distillates - paint thinners, mineral spirits, ...
D. Kerosene - #1 fuel oil, jet A fuel, ...
E. Heavy petroleum distillates - #2 fuel oil, diesel fuel, ...
Representative data profiles
As can be seen, classes B, C and D show a significant level of overlap.
(Profiles shown for classes A-E.)
Production of dendrograms
Both raw and autoscaled data were processed and
dendrograms were produced using single linkage.
For the autoscaled data, complete and centroid
linkages were also evaluated.
For the dendrograms, classes are color coded and
labeled.
The classes were not used in producing the dendrograms.
Raw - single linkage
(Dendrogram, single linkage on raw data; similarity axis 0.70 to 1.00. Leaf order: classes c, d and e group cleanly; a and b intermix.)
Raw - complete linkage
(Dendrogram, complete linkage on raw data; similarity axis 0.00 to 1.00. Leaf order: b with a few a mixed in, then c, e and d.)
Raw - centroid linkage
(Dendrogram, centroid linkage on raw data; similarity axis 0.30 to 1.00. Leaf order: e, then intermixed a and b, then d, then c.)
Raw - comparison
Centroid linkage appears to give the best results.
(The three raw-data dendrograms are shown together for comparison.)
Autoscaled - single link
(Dendrogram, single linkage on autoscaled data; similarity axis 0.56 to 0.96. Leaf order: c, d, e, then intermixed a and b.)
Autoscaled - complete link
(Dendrogram, complete linkage on autoscaled data; similarity axis -0.78 to 0.82. Leaf order: d and e, then c, then b and a.)
Autoscaled - centroid link
(Dendrogram, centroid linkage on autoscaled data; similarity axis -0.34 to 0.86. Leaf order: c, then b with a few a mixed in, then a, d and e - the classes are largely separated.)
Iris dataset
A pretty famous data set published by R.A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7, 179-188 (1936).
He measured four physical properties of irises to see if they could be used to classify any of three different species.
Used length and width of the sepal and petal.
Iris dataset
Species: I. setosa, I. versicolor, I. virginica
Properties: petal width, petal length, sepal width, sepal length
150 samples - no missing values.
HCA analysis was conducted on both raw and scaled data. Both single linkage and complete linkage were evaluated.
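A sketch of this analysis, using the copy of the iris data that ships with scikit-learn (the library choice here is the editor's, not the lecture's):

```python
import numpy as np
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, fcluster

iris = load_iris()
X = iris.data                      # 150 samples x 4 measurements

# Autoscale: zero mean, unit variance for each variable.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Centroid linkage on the scaled data, cut into three clusters.
Z = linkage(Xs, method='centroid')
clusters = fcluster(Z, t=3, criterion='maxclust')
```

Comparing `clusters` against `iris.target` reproduces the slide's conclusion: one species separates cleanly while the other two overlap.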
Autoscaled data, centroid linkage
(Dendrogram of the 150 iris samples; dissimilarity axis 0 to 40.)
One class is distinct but the other two overlap.
Iris dataset
So it should be possible to classify samples. HCA just does not provide as useful a view as we had hoped for.
(Scatter plot: petal width vs. petal length, raw data.)
Iris dataset
So there was useful information in the dataset.
HCA - not a good tool. Reducing the four measurements into a single one actually makes the data worse.
Autoscaling - had little or no effect. The actual numbers were all of a similar range.
Moral - just because a method doesn't work does not mean that there is no useful information.
Classification of Mycobacteria
Investigators at the CDC wanted to see if it was possible to identify mycobacteria using pattern recognition of an HPLC analysis of mycolic acids.
Mycobacteria - include a number of respiratory and non-respiratory pathogens such as M. tuberculosis.
C70-C90 branched β-hydroxy mycolic acids were selected as they are known to be in the cell walls of these bacteria.
Classification of Mycobacteria
Eight species were investigated:
M. asiaticum, M. bovis, M. gastri, M. gordonae, M. kansasii, M. marinum, M. szulgai, M. tuberculosis
22 mycolic acids were used for the classification.
175 total samples.
Classification of Mycobacteria
Limitation: although the paper specified that it was necessary to normalize the data to account for variations in sample size, no standards were provided.
I chose to normalize to the total peak areas for each sample. This assumes that each species produces about the same amount of total mycolic acid and that the response/concentration is the same for each component.
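That normalization choice can be sketched as follows (the peak areas below are placeholders, and only three of the 22 acids are shown):

```python
import numpy as np

# Rows: samples; columns: mycolic acid peak areas (placeholder values).
areas = np.array([[10.0, 30.0, 60.0],
                  [ 5.0, 10.0, 35.0]])

# Normalize each sample to its total peak area so every row sums to 1.
# This assumes comparable total mycolic acid production across species.
normalized = areas / areas.sum(axis=1, keepdims=True)
```

After this step, differences between samples reflect the *relative* acid composition rather than the amount of material injected.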
Single linkage
Single linkage shows some clustering of the samples but is not very useful.
Complete linkage
Complete linkage gives somewhat better results.
We'll look at this sample again later using other tools.
Identification of Coffee
An attempt was made to identify the source of coffee beans:
Sulawesi, Costa Rica, Ethiopia, Sumatra, Kenya, Colombia
Method: mass spectral analysis of the headspace of bean samples. An m/e range of 47-99 was used.
Six samples were obtained from each source.
Identification of Coffee
The mass spectra represented the sum of spectra for all components present.
As is normal with mass spectra, each was normalized to the largest peak.
Only raw data was evaluated.
Representative spectra, 47-99 m/e
(Representative headspace spectra for each source: Sulawesi, Costa Rica, Ethiopia, Sumatra, Kenya, Colombia.)
Single linkage
(Single-linkage dendrogram of the coffee samples.)
Complete linkage
(Complete-linkage dendrogram of the coffee samples.)
So what's it good for?
This is a fast method of initial data exploration.
Try all of the options with both raw and scaled data. The plots can be rapidly evaluated.
You can also use principal component data. This will be covered in the next unit.
When you get ready to go on to other methods of clustering, knowing the best methods for linkage will also be useful.
k-means clustering
An iterative method where samples are initially partitioned into k classes and a centroid calculated for each.
Must use quantitative variables, but these can be raw, scaled or PCA based.
The positions of all samples are then calculated relative to the centroids; samples are reassigned to new clusters (if needed) and the process repeated.
Classification criteria can include within-class variance, the pooled covariance matrix or the total inertia matrix.
The number of clusters and the assignments can vary based on the initial starting points, so several iterations are commonly used to find a stable solution.
k-means clustering
(Flow: position initial class centroids → test class memberships → adjust centroids → retest/repeat.)
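A minimal sketch using scikit-learn's KMeans (the data are placeholders). The `n_init` parameter repeats the whole procedure from several random starting centroids and keeps the lowest within-class-variance solution, which addresses the starting-point dependence noted above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups in two variables (placeholder data).
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# n_init=10: run k-means from ten different starting points and
# keep the best (most homogeneous) clustering found.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

For well-separated data like this, all starting points converge to the same two-cluster solution.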
Using XLStat
Classification criteria that can be minimized:
Trace. Minimize the within-class variance, giving the most homogeneous clusters. Data should be autoscaled if this is used.
Determinant. Minimize the determinant of the pooled covariance matrix. More appropriate to use with unscaled data but gives less homogeneous clusters.
Wilks' lambda. Normalized version of the Determinant approach.
Trace/median. The centroid ends up being based on the median, not the mean as in the other approaches. Better when there is subclustering of data.
Using XLStat
XLStat's version of HCA (Agglomerative Hierarchical Clustering - AHC) will do a k-means analysis, but only with the trace method.
The k-means option provides more clustering control and is faster because no HCA is conducted.
However, AHC has an option to allow the routine to automatically set the number of clusters that appear to exist.
Iris dataset (again).
(Scatter plots: petal width vs. petal length, raw data.)
Arson dataset.
Here are the final class results from the k-means clustering.
(Shown against the centroid-linkage dendrogram of the autoscaled data; similarity axis -0.34 to 0.86.)
Coffee (a more complete data set).
(k-means class results: the samples group largely by source - Costa Rica, Sulawesi, Sumatra, Ethiopia, Kenya, Colombia.)

Mycobacteria
This data set was VERY difficult to visualize using a dendrogram.
(Dendrogram of all 175 samples; dissimilarity axis 0 to 1000. The leaf labels crowd together at this scale.)
Mycobacteria - autoclustering.
So what's it good for?
Can be used as a way to subdivide a dataset into related clusters.
Clusters are objectively determined based on similarities in multidimensional space.
While results can vary based on starting point, the effect can be minimized by using multiple starting points and repetitions.
Results are easier to see than with HCA. k-means and HCA complement each other.