cluster analysis using statgraphics analysis.pdf · (clusters) such that the items in a cluster are...
TRANSCRIPT
Cluster Analysis
using Statgraphics
Presented by
Dr. Neil W. Polhemus
Cluster Analysis
“A statistical classification technique in which cases, data, or objects (events, people, things, etc.) are sub-divided into groups (clusters) such that the items in a cluster are very similar (but not identical) to one another and very different from the items in other clusters. It is a discovery tool that reveals associations, patterns, relationships, and structures in masses of data.” from businessdictionary.com
Important Applications
• Market research – partitioning consumers into market segments.
• Medical imaging – identifying different types of tissue.
• Bioinformatics – separating sequences into gene families.
• Social networking – dividing people into communities.
• Educational data mining – identifying groups of students with different needs.
• Climatology – identifying weather patterns.
Example
Scatterplot matrix
LOG(Pop. Density)
Rural Population
Female Percentage
Age Dependency Ratio
Life Expectancy (Total)
Fertility Rate
LOG(Infant Mortality Rate)
Trade
LOG(GDP per Capita)
Consumer Price Inflation
Notation
n multivariate observations
𝑿 =
𝑥11 𝑥12 … 𝑥1𝑝𝑥21 𝑥22 … 𝑥2𝑝
::
𝑥𝑛1 𝑥𝑛2 … 𝑥𝑛𝑝
𝑥𝑖𝑗 = value of jth variable for the ith item
Data Input Dialog
Analysis Options
Cluster – may cluster either observations or variables. Method – procedure used to create clusters. Distance metric – how distance between points or clusters are measured. Standardize – whether to standardize the variables before calculating distance. Number of clusters – target number of clusters to be created.
Distance Metrics
• Squared Euclidean
𝑑 𝑥, 𝑦 = 𝑥𝑗 − 𝑦𝑗2
𝑝
𝑗=1
• Euclidean
𝑑 𝑥, 𝑦 = 𝑥𝑗 − 𝑦𝑗2
𝑝
𝑗=1
• City Block
𝑑 𝑥, 𝑦 = 𝑥𝑗 − 𝑦𝑗
𝑝
𝑗=1
Clustering Methods in Statgraphics Centurion
• Agglomerative hierarchical methods
– Begin with n clusters.
– Combine 2 clusters together based on how close they are to each other.
– Continue until only k clusters remain.
• Method of k-means
– Begin by creating k clusters using selected observations as seeds.
– Assign other observations to the closest cluster.
– Move points from one cluster to another until no more changes are indicated.
Single Linkage (Nearest Neighbor)
1. Begin with n clusters, one for each observation.
2. Calculate the minimum distance between all
pairs of points that are located in different
clusters.
3. For the pair that are closest to each other, join
those 2 clusters together.
4. Repeat steps 2 to 3 until the number of clusters
has been reduced to that desired.
Other Hierarchical Methods
• Complete linkage (farthest neighbor) – distance between clusters is maximum distance between members of the 2 clusters.
• Average linkage – distance between clusters is average distance between all pairs of points.
• Centroid – distance between clusters is distance between their centroids.
• Median – distance between clusters is distance between their multivariate medians.
• Ward’s method – distance between clusters is defined by increase in sums of squared deviations around the means if clusters were joined.
Tables and Graphs
Dendrogram
Dendrogram
Nearest Neighbor Method,Squared Euclidean
0
10
20
30
40
50
60
Dis
tan
ce
An
go
la
Alb
an
ia
Arg
en
tin
a
Arm
en
ia
Au
str
ia
Az
erb
aija
n
Be
lgiu
m
Be
nin
Ba
ng
lad
es
h
Bu
lga
ria
Ba
ha
ma
s
Bo
sn
ia a
nd
He
rze
go
vin
a
Bo
livia
Bra
zil
Ba
rba
do
s
Bh
uta
n
Bo
tsw
an
a
Ce
ntr
al A
fric
an
Re
pu
blic
Ca
na
da
Sw
itz
erl
an
d
Ch
ina
Cte
d'Iv
oir
eC
am
ero
on
Co
ng
o (
Re
p.)
Co
lom
bia
Co
mo
ros
Ca
pe
Ve
rde
Co
sta
Ric
a
Cy
pru
s
Cz
ec
h R
ep
ub
lic
Ge
rma
ny
De
nm
ark
Do
min
ica
n R
ep
ub
lic
Alg
eri
a
Ec
ua
do
r
Eg
yp
t
Sp
ain
Es
ton
ia
Eth
iop
ia
Fin
lan
d
Fiji
Fra
nc
e
Ga
bo
n
Un
ite
d K
ing
do
m
Ge
org
ia
Gh
an
a
Gu
ine
a
Ga
mb
ia
Gre
ec
e
Gre
na
da
Gu
ate
ma
la
Ho
nd
ura
s
Cro
ati
a
Ha
iti
Hu
ng
ary
Ind
on
es
ia
Ind
ia
Ire
lan
d
Ice
lan
d
Isra
el
Ita
ly
Ja
ma
ica
Jo
rda
n
Ja
pa
n
Ka
za
kh
sta
n
Ke
ny
a
Ky
rgy
z R
ep
ub
lic
Ca
mb
od
ia
So
uth
Ko
rea
La
o P
DR
Le
ba
no
n
St.
Lu
cia
Sri
La
nk
a
Le
so
tho
Lit
hu
an
ia
Lu
xe
mb
ou
rg
La
tvia
Mo
roc
co
Mo
ldo
va
Ma
da
ga
sc
ar
Ma
ldiv
es
Me
xic
o
Ma
ce
do
nia
Ma
lta
Mo
nte
ne
gro
Mo
ng
olia
Mo
za
mb
iqu
e
Ma
uri
tan
ia
Ma
uri
tiu
s
Ma
law
i
Ma
lay
sia
Na
mib
ia
Nig
eri
a
Nic
ara
gu
a
Ne
the
rla
nd
s
No
rwa
y
Ne
pa
l
Ne
w Z
ea
lan
d
Om
an
Pa
kis
tan
Pa
na
ma
Pe
ru
Ph
ilip
pin
es
Pa
pu
a N
ew
Gu
ine
a
Po
lan
d
Po
rtu
ga
l
Pa
rag
ua
y
Qa
tar
Ro
ma
nia
Ru
ss
ia
Rw
an
da
Sa
ud
i Ara
bia
Su
da
n
Se
ne
ga
l
Sin
ga
po
re
So
lom
on
Isla
nd
s
Sie
rra
Le
on
e
El S
alv
ad
or
Slo
va
k R
ep
ub
lic
Slo
ve
nia
Sw
ed
en
Sw
az
ilan
d
Sy
ria
n A
rab
Re
pu
blic
Ch
ad
Th
aila
nd
Ta
jikis
tan
To
ng
a
Tu
nis
ia
Tu
rke
y
Ta
nz
an
ia
Ug
an
da
Uk
rain
e
Uru
gu
ay
Un
ite
d S
tate
s
St.
Vin
ce
nt
an
d t
he
Gre
na
din
es
Ve
ne
zu
ela
Vie
tna
m
Va
nu
atu
Sa
mo
a
So
uth
Afr
ica
Za
mb
ia
Ward’s Method
Dendrogram
Ward's Method,Squared Euclidean
0
300
600
900
1200
1500
Dis
tan
ce
An
go
la
Alb
an
ia
Arg
en
tin
a
Arm
en
ia
Au
str
ia
Az
erb
aija
n
Be
lgiu
m
Be
nin
Ba
ng
lad
es
h
Bu
lga
ria
Ba
ha
ma
s
Bo
sn
ia a
nd
He
rze
go
vin
a
Bo
livia
Bra
zil
Ba
rba
do
s
Bh
uta
n
Bo
tsw
an
a
Ce
ntr
al A
fric
an
Re
pu
blic
Ca
na
da
Sw
itz
erl
an
d
Ch
ina
Cte
d'Iv
oir
eC
am
ero
on
Co
ng
o (
Re
p.)
Co
lom
bia
Co
mo
ros
Ca
pe
Ve
rde
Co
sta
Ric
a
Cy
pru
s
Cz
ec
h R
ep
ub
lic
Ge
rma
ny
De
nm
ark
Do
min
ica
n R
ep
ub
lic
Alg
eri
a
Ec
ua
do
r
Eg
yp
t
Sp
ain
Es
ton
ia
Eth
iop
ia
Fin
lan
d
Fiji
Fra
nc
e
Ga
bo
n
Un
ite
d K
ing
do
m
Ge
org
ia
Gh
an
a
Gu
ine
a
Ga
mb
ia
Gre
ec
e
Gre
na
da
Gu
ate
ma
la
Ho
nd
ura
s
Cro
ati
a
Ha
iti
Hu
ng
ary
Ind
on
es
ia
Ind
ia
Ire
lan
d
Ice
lan
d
Isra
el
Ita
ly
Ja
ma
ica
Jo
rda
n
Ja
pa
n
Ka
za
kh
sta
n
Ke
ny
a
Ky
rgy
z R
ep
ub
lic
Ca
mb
od
ia
So
uth
Ko
rea
La
o P
DR
Le
ba
no
n
St.
Lu
cia
Sri
La
nk
a
Le
so
tho
Lit
hu
an
ia
Lu
xe
mb
ou
rg
La
tvia
Mo
roc
co
Mo
ldo
va
Ma
da
ga
sc
ar
Ma
ldiv
es
Me
xic
o
Ma
ce
do
nia
Ma
lta
Mo
nte
ne
gro
Mo
ng
olia
Mo
za
mb
iqu
e
Ma
uri
tan
ia
Ma
uri
tiu
s
Ma
law
i
Ma
lay
sia
Na
mib
ia
Nig
eri
a
Nic
ara
gu
a
Ne
the
rla
nd
s
No
rwa
y
Ne
pa
l
Ne
w Z
ea
lan
d
Om
an
Pa
kis
tan
Pa
na
ma
Pe
ru
Ph
ilip
pin
es
Pa
pu
a N
ew
Gu
ine
a
Po
lan
d
Po
rtu
ga
l
Pa
rag
ua
y
Qa
tar
Ro
ma
nia
Ru
ss
ia
Rw
an
da
Sa
ud
i Ara
bia
Su
da
n
Se
ne
ga
l
Sin
ga
po
re
So
lom
on
Isla
nd
s
Sie
rra
Le
on
e
El S
alv
ad
or
Slo
va
k R
ep
ub
lic
Slo
ve
nia
Sw
ed
en
Sw
az
ilan
d
Sy
ria
n A
rab
Re
pu
blic
Ch
ad
Th
aila
nd
Ta
jikis
tan
To
ng
a
Tu
nis
ia
Tu
rke
y
Ta
nz
an
ia
Ug
an
da
Uk
rain
e
Uru
gu
ay
Un
ite
d S
tate
s
St.
Vin
ce
nt
an
d t
he
Gre
na
din
es
Ve
ne
zu
ela
Vie
tna
m
Va
nu
atu
Sa
mo
a
So
uth
Afr
ica
Za
mb
ia
Cluster Scatterplot
Cluster ScatterplotWard's Method,Squared Euclidean
0 1 2 3 4 5LOG(Infant Mortality Rate)
020
4060
80100
Rural Population
4.6
6.6
8.6
10.6
12.6
LO
G(G
DP
per
Cap
ita)
Cluster 123Centroids
Icicle Plot
Agglomeration Distance Plot
Agglomeration Distance Plot
Ward's Method,Squared Euclidean
0 30 60 90 120 150
Stage
0
300
600
900
1200
1500
Dis
tan
ce
Agglomeration Schedule
Creating 3 Clusters
Dendrogram
Ward's Method,Squared Euclidean
0
200
400
600
800
Dis
tan
ce
An
go
la
Alb
an
ia
Arg
en
tin
a
Arm
en
ia
Au
str
ia
Az
erb
aija
n
Be
lgiu
m
Be
nin
Ba
ng
lad
es
h
Bu
lga
ria
Ba
ha
ma
s
Bo
sn
ia a
nd
He
rze
go
vin
a
Bo
livia
Bra
zil
Ba
rba
do
s
Bh
uta
n
Bo
tsw
an
a
Ce
ntr
al A
fric
an
Re
pu
blic
Ca
na
da
Sw
itz
erl
an
d
Ch
ina
Cte
d'Iv
oir
eC
am
ero
on
Co
ng
o (
Re
p.)
Co
lom
bia
Co
mo
ros
Ca
pe
Ve
rde
Co
sta
Ric
a
Cy
pru
s
Cz
ec
h R
ep
ub
lic
Ge
rma
ny
De
nm
ark
Do
min
ica
n R
ep
ub
lic
Alg
eri
a
Ec
ua
do
r
Eg
yp
t
Sp
ain
Es
ton
ia
Eth
iop
ia
Fin
lan
d
Fiji
Fra
nc
e
Ga
bo
n
Un
ite
d K
ing
do
m
Ge
org
ia
Gh
an
a
Gu
ine
a
Ga
mb
ia
Gre
ec
e
Gre
na
da
Gu
ate
ma
la
Ho
nd
ura
s
Cro
ati
a
Ha
iti
Hu
ng
ary
Ind
on
es
ia
Ind
ia
Ire
lan
d
Ice
lan
d
Isra
el
Ita
ly
Ja
ma
ica
Jo
rda
n
Ja
pa
n
Ka
za
kh
sta
n
Ke
ny
a
Ky
rgy
z R
ep
ub
lic
Ca
mb
od
ia
So
uth
Ko
rea
La
o P
DR
Le
ba
no
n
St.
Lu
cia
Sri
La
nk
a
Le
so
tho
Lit
hu
an
ia
Lu
xe
mb
ou
rg
La
tvia
Mo
roc
co
Mo
ldo
va
Ma
da
ga
sc
ar
Ma
ldiv
es
Me
xic
o
Ma
ce
do
nia
Ma
lta
Mo
nte
ne
gro
Mo
ng
olia
Mo
za
mb
iqu
e
Ma
uri
tan
ia
Ma
uri
tiu
s
Ma
law
i
Ma
lay
sia
Na
mib
ia
Nig
eri
a
Nic
ara
gu
a
Ne
the
rla
nd
s
No
rwa
y
Ne
pa
l
Ne
w Z
ea
lan
d
Om
an
Pa
kis
tan
Pa
na
ma
Pe
ru
Ph
ilip
pin
es
Pa
pu
a N
ew
Gu
ine
a
Po
lan
d
Po
rtu
ga
l
Pa
rag
ua
y
Qa
tar
Ro
ma
nia
Ru
ss
ia
Rw
an
da
Sa
ud
i Ara
bia
Su
da
n
Se
ne
ga
l
Sin
ga
po
re
So
lom
on
Isla
nd
s
Sie
rra
Le
on
e
El S
alv
ad
or
Slo
va
k R
ep
ub
lic
Slo
ve
nia
Sw
ed
en
Sw
az
ilan
d
Sy
ria
n A
rab
Re
pu
blic
Ch
ad
Th
aila
nd
Ta
jikis
tan
To
ng
a
Tu
nis
ia
Tu
rke
y
Ta
nz
an
ia
Ug
an
da
Uk
rain
e
Uru
gu
ay
Un
ite
d S
tate
s
St.
Vin
ce
nt
an
d t
he
Gre
na
din
es
Ve
ne
zu
ela
Vie
tna
m
Va
nu
atu
Sa
mo
a
So
uth
Afr
ica
Za
mb
ia
Save Cluster Numbers
Discriminant Analysis
Discriminant Function Plot
Plot of Discriminant Functions
-5 -3 -1 1 3 5 7
Function 1
-3.5
-1.5
0.5
2.5
4.5
Fu
ncti
on
2
CLUSTNUMS123
Discriminant Function Coefficients
Cluster Scatterplot
Cluster Scatterplot
Ward's Method,Squared Euclidean
46 56 66 76 86
Life Expectancy (Total)
0
1
2
3
4
5
LO
G(I
nfa
nt
Mo
rtali
ty R
ate
)
Cluster 123Centroids
Cluster Scatterplot
Method of k-Means
1. k observations are selected to be the initial
seeds.
2. All remaining observations are assigned to
cluster with nearest seed.
3. The centroids of each cluster are calculated.
4. Each observation is checked to see if it is
closer to the centroid of another cluster. If so, it
is switched to that cluster.
5. Step 4 is repeated until there are no further
changes.
Seeds
• Select USA, China and India as initial seeds.
Cluster Summary
Cluster Scatterplot
USA=#3, China=#1, India=#2
World Map (coming soon)
CLUSTNUMS
123
References
• Johnson and Wichern (2012) Applied
Multivariate Analysis, sixth edition.
• Copy of slides and presentation at:
www.statgraphics.com/webinars