cluster analysis using statgraphics analysis.pdf · (clusters) such that the items in a cluster are...

Cluster Analysis

using Statgraphics

Presented by

Dr. Neil W. Polhemus

Cluster Analysis

“A statistical classification technique in which cases, data, or objects (events, people, things, etc.) are sub-divided into groups (clusters) such that the items in a cluster are very similar (but not identical) to one another and very different from the items in other clusters. It is a discovery tool that reveals associations, patterns, relationships, and structures in masses of data.” from businessdictionary.com

Important Applications

• Market research – partitioning consumers into market segments.

• Medical imaging – identifying different types of tissue.

• Bioinformatics – separating sequences into gene families.

• Social networking – dividing people into communities.

• Educational data mining – identifying groups of students with different needs.

• Climatology – identifying weather patterns.

Example

Scatterplot matrix

LOG(Pop. Density)

Rural Population

Female Percentage

Age Dependency Ratio

Life Expectancy (Total)

Fertility Rate

LOG(Infant Mortality Rate)

Trade

LOG(GDP per Capita)

Consumer Price Inflation

Notation

n multivariate observations

𝑿 =

𝑥11 𝑥12 … 𝑥1𝑝𝑥21 𝑥22 … 𝑥2𝑝

::

𝑥𝑛1 𝑥𝑛2 … 𝑥𝑛𝑝

𝑥𝑖𝑗 = value of jth variable for the ith item

Data Input Dialog

Analysis Options

Cluster – may cluster either observations or variables. Method – procedure used to create clusters. Distance metric – how distance between points or clusters are measured. Standardize – whether to standardize the variables before calculating distance. Number of clusters – target number of clusters to be created.

Distance Metrics

• Squared Euclidean

𝑑 𝑥, 𝑦 = 𝑥𝑗 − 𝑦𝑗2

𝑝

𝑗=1

• Euclidean

𝑑 𝑥, 𝑦 = 𝑥𝑗 − 𝑦𝑗2

𝑝

𝑗=1

• City Block

𝑑 𝑥, 𝑦 = 𝑥𝑗 − 𝑦𝑗

𝑝

𝑗=1

Clustering Methods in Statgraphics Centurion

• Agglomerative hierarchical methods

– Begin with n clusters.

– Combine 2 clusters together based on how close they are to each other.

– Continue until only k clusters remain.

• Method of k-means

– Begin by creating k clusters using selected observations as seeds.

– Assign other observations to the closest cluster.

– Move points from one cluster to another until no more changes are indicated.

Single Linkage (Nearest Neighbor)

1. Begin with n clusters, one for each observation.

2. Calculate the minimum distance between all

pairs of points that are located in different

clusters.

3. For the pair that are closest to each other, join

those 2 clusters together.

4. Repeat steps 2 to 3 until the number of clusters

has been reduced to that desired.

Other Hierarchical Methods

• Complete linkage (farthest neighbor) – distance between clusters is maximum distance between members of the 2 clusters.

• Average linkage – distance between clusters is average distance between all pairs of points.

• Centroid – distance between clusters is distance between their centroids.

• Median – distance between clusters is distance between their multivariate medians.

• Ward’s method – distance between clusters is defined by increase in sums of squared deviations around the means if clusters were joined.

Tables and Graphs

Dendrogram

Dendrogram

Nearest Neighbor Method,Squared Euclidean

0

10

20

30

40

50

60

Dis

tan

ce

An

go

la

Alb

an

ia

Arg

en

tin

a

Arm

en

ia

Au

str

ia

Az

erb

aija

n

Be

lgiu

m

Be

nin

Ba

ng

lad

es

h

Bu

lga

ria

Ba

ha

ma

s

Bo

sn

ia a

nd

He

rze

go

vin

a

Bo

livia

Bra

zil

Ba

rba

do

s

Bh

uta

n

Bo

tsw

an

a

Ce

ntr

al A

fric

an

Re

pu

blic

Ca

na

da

Sw

itz

erl

an

d

Ch

ina

Cte

d'Iv

oir

eC

am

ero

on

Co

ng

o (

Re

p.)

Co

lom

bia

Co

mo

ros

Ca

pe

Ve

rde

Co

sta

Ric

a

Cy

pru

s

Cz

ec

h R

ep

ub

lic

Ge

rma

ny

De

nm

ark

Do

min

ica

n R

ep

ub

lic

Alg

eri

a

Ec

ua

do

r

Eg

yp

t

Sp

ain

Es

ton

ia

Eth

iop

ia

Fin

lan

d

Fiji

Fra

nc

e

Ga

bo

n

Un

ite

d K

ing

do

m

Ge

org

ia

Gh

an

a

Gu

ine

a

Ga

mb

ia

Gre

ec

e

Gre

na

da

Gu

ate

ma

la

Ho

nd

ura

s

Cro

ati

a

Ha

iti

Hu

ng

ary

Ind

on

es

ia

Ind

ia

Ire

lan

d

Ice

lan

d

Isra

el

Ita

ly

Ja

ma

ica

Jo

rda

n

Ja

pa

n

Ka

za

kh

sta

n

Ke

ny

a

Ky

rgy

z R

ep

ub

lic

Ca

mb

od

ia

So

uth

Ko

rea

La

o P

DR

Le

ba

no

n

St.

Lu

cia

Sri

La

nk

a

Le

so

tho

Lit

hu

an

ia

Lu

xe

mb

ou

rg

La

tvia

Mo

roc

co

Mo

ldo

va

Ma

da

ga

sc

ar

Ma

ldiv

es

Me

xic

o

Ma

ce

do

nia

Ma

lta

Mo

nte

ne

gro

Mo

ng

olia

Mo

za

mb

iqu

e

Ma

uri

tan

ia

Ma

uri

tiu

s

Ma

law

i

Ma

lay

sia

Na

mib

ia

Nig

eri

a

Nic

ara

gu

a

Ne

the

rla

nd

s

No

rwa

y

Ne

pa

l

Ne

w Z

ea

lan

d

Om

an

Pa

kis

tan

Pa

na

ma

Pe

ru

Ph

ilip

pin

es

Pa

pu

a N

ew

Gu

ine

a

Po

lan

d

Po

rtu

ga

l

Pa

rag

ua

y

Qa

tar

Ro

ma

nia

Ru

ss

ia

Rw

an

da

Sa

ud

i Ara

bia

Su

da

n

Se

ne

ga

l

Sin

ga

po

re

So

lom

on

Isla

nd

s

Sie

rra

Le

on

e

El S

alv

ad

or

Slo

va

k R

ep

ub

lic

Slo

ve

nia

Sw

ed

en

Sw

az

ilan

d

Sy

ria

n A

rab

Re

pu

blic

Ch

ad

Th

aila

nd

Ta

jikis

tan

To

ng

a

Tu

nis

ia

Tu

rke

y

Ta

nz

an

ia

Ug

an

da

Uk

rain

e

Uru

gu

ay

Un

ite

d S

tate

s

St.

Vin

ce

nt

an

d t

he

Gre

na

din

es

Ve

ne

zu

ela

Vie

tna

m

Va

nu

atu

Sa

mo

a

So

uth

Afr

ica

Za

mb

ia

Ward’s Method

Dendrogram

Ward's Method,Squared Euclidean

0

300

600

900

1200

1500

Dis

tan

ce

An

go

la

Alb

an

ia

Arg

en

tin

a

Arm

en

ia

Au

str

ia

Az

erb

aija

n

Be

lgiu

m

Be

nin

Ba

ng

lad

es

h

Bu

lga

ria

Ba

ha

ma

s

Bo

sn

ia a

nd

He

rze

go

vin

a

Bo

livia

Bra

zil

Ba

rba

do

s

Bh

uta

n

Bo

tsw

an

a

Ce

ntr

al A

fric

an

Re

pu

blic

Ca

na

da

Sw

itz

erl

an

d

Ch

ina

Cte

d'Iv

oir

eC

am

ero

on

Co

ng

o (

Re

p.)

Co

lom

bia

Co

mo

ros

Ca

pe

Ve

rde

Co

sta

Ric

a

Cy

pru

s

Cz

ec

h R

ep

ub

lic

Ge

rma

ny

De

nm

ark

Do

min

ica

n R

ep

ub

lic

Alg

eri

a

Ec

ua

do

r

Eg

yp

t

Sp

ain

Es

ton

ia

Eth

iop

ia

Fin

lan

d

Fiji

Fra

nc

e

Ga

bo

n

Un

ite

d K

ing

do

m

Ge

org

ia

Gh

an

a

Gu

ine

a

Ga

mb

ia

Gre

ec

e

Gre

na

da

Gu

ate

ma

la

Ho

nd

ura

s

Cro

ati

a

Ha

iti

Hu

ng

ary

Ind

on

es

ia

Ind

ia

Ire

lan

d

Ice

lan

d

Isra

el

Ita

ly

Ja

ma

ica

Jo

rda

n

Ja

pa

n

Ka

za

kh

sta

n

Ke

ny

a

Ky

rgy

z R

ep

ub

lic

Ca

mb

od

ia

So

uth

Ko

rea

La

o P

DR

Le

ba

no

n

St.

Lu

cia

Sri

La

nk

a

Le

so

tho

Lit

hu

an

ia

Lu

xe

mb

ou

rg

La

tvia

Mo

roc

co

Mo

ldo

va

Ma

da

ga

sc

ar

Ma

ldiv

es

Me

xic

o

Ma

ce

do

nia

Ma

lta

Mo

nte

ne

gro

Mo

ng

olia

Mo

za

mb

iqu

e

Ma

uri

tan

ia

Ma

uri

tiu

s

Ma

law

i

Ma

lay

sia

Na

mib

ia

Nig

eri

a

Nic

ara

gu

a

Ne

the

rla

nd

s

No

rwa

y

Ne

pa

l

Ne

w Z

ea

lan

d

Om

an

Pa

kis

tan

Pa

na

ma

Pe

ru

Ph

ilip

pin

es

Pa

pu

a N

ew

Gu

ine

a

Po

lan

d

Po

rtu

ga

l

Pa

rag

ua

y

Qa

tar

Ro

ma

nia

Ru

ss

ia

Rw

an

da

Sa

ud

i Ara

bia

Su

da

n

Se

ne

ga

l

Sin

ga

po

re

So

lom

on

Isla

nd

s

Sie

rra

Le

on

e

El S

alv

ad

or

Slo

va

k R

ep

ub

lic

Slo

ve

nia

Sw

ed

en

Sw

az

ilan

d

Sy

ria

n A

rab

Re

pu

blic

Ch

ad

Th

aila

nd

Ta

jikis

tan

To

ng

a

Tu

nis

ia

Tu

rke

y

Ta

nz

an

ia

Ug

an

da

Uk

rain

e

Uru

gu

ay

Un

ite

d S

tate

s

St.

Vin

ce

nt

an

d t

he

Gre

na

din

es

Ve

ne

zu

ela

Vie

tna

m

Va

nu

atu

Sa

mo

a

So

uth

Afr

ica

Za

mb

ia

Cluster Scatterplot

Cluster ScatterplotWard's Method,Squared Euclidean

0 1 2 3 4 5LOG(Infant Mortality Rate)

020

4060

80100

Rural Population

4.6

6.6

8.6

10.6

12.6

LO

G(G

DP

per

Cap

ita)

Cluster 123Centroids

Icicle Plot

Agglomeration Distance Plot

Agglomeration Distance Plot


0 30 60 90 120 150

Stage

0

300

600

900

1200

1500

Dis

tan

ce

Agglomeration Schedule

Creating 3 Clusters

Dendrogram


0

200

400

600

800

Dis

tan

ce

An

go

la

Alb

an

ia

Arg

en

tin

a

Arm

en

ia

Au

str

ia

Az

erb

aija

n

Be

lgiu

m

Be

nin

Ba

ng

lad

es

h

Bu

lga

ria

Ba

ha

ma

s

Bo

sn

ia a

nd

He

rze

go

vin

a

Bo

livia

Bra

zil

Ba

rba

do

s

Bh

uta

n

Bo

tsw

an

a

Ce

ntr

al A

fric

an

Re

pu

blic

Ca

na

da

Sw

itz

erl

an

d

Ch

ina

Cte

d'Iv

oir

eC

am

ero

on

Co

ng

o (

Re

p.)

Co

lom

bia

Co

mo

ros

Ca

pe

Ve

rde

Co

sta

Ric

a

Cy

pru

s

Cz

ec

h R

ep

ub

lic

Ge

rma

ny

De

nm

ark

Do

min

ica

n R

ep

ub

lic

Alg

eri

a

Ec

ua

do

r

Eg

yp

t

Sp

ain

Es

ton

ia

Eth

iop

ia

Fin

lan

d

Fiji

Fra

nc

e

Ga

bo

n

Un

ite

d K

ing

do

m

Ge

org

ia

Gh

an

a

Gu

ine

a

Ga

mb

ia

Gre

ec

e

Gre

na

da

Gu

ate

ma

la

Ho

nd

ura

s

Cro

ati

a

Ha

iti

Hu

ng

ary

Ind

on

es

ia

Ind

ia

Ire

lan

d

Ice

lan

d

Isra

el

Ita

ly

Ja

ma

ica

Jo

rda

n

Ja

pa

n

Ka

za

kh

sta

n

Ke

ny

a

Ky

rgy

z R

ep

ub

lic

Ca

mb

od

ia

So

uth

Ko

rea

La

o P

DR

Le

ba

no

n

St.

Lu

cia

Sri

La

nk

a

Le

so

tho

Lit

hu

an

ia

Lu

xe

mb

ou

rg

La

tvia

Mo

roc

co

Mo

ldo

va

Ma

da

ga

sc

ar

Ma

ldiv

es

Me

xic

o

Ma

ce

do

nia

Ma

lta

Mo

nte

ne

gro

Mo

ng

olia

Mo

za

mb

iqu

e

Ma

uri

tan

ia

Ma

uri

tiu

s

Ma

law

i

Ma

lay

sia

Na

mib

ia

Nig

eri

a

Nic

ara

gu

a

Ne

the

rla

nd

s

No

rwa

y

Ne

pa

l

Ne

w Z

ea

lan

d

Om

an

Pa

kis

tan

Pa

na

ma

Pe

ru

Ph

ilip

pin

es

Pa

pu

a N

ew

Gu

ine

a

Po

lan

d

Po

rtu

ga

l

Pa

rag

ua

y

Qa

tar

Ro

ma

nia

Ru

ss

ia

Rw

an

da

Sa

ud

i Ara

bia

Su

da

n

Se

ne

ga

l

Sin

ga

po

re

So

lom

on

Isla

nd

s

Sie

rra

Le

on

e

El S

alv

ad

or

Slo

va

k R

ep

ub

lic

Slo

ve

nia

Sw

ed

en

Sw

az

ilan

d

Sy

ria

n A

rab

Re

pu

blic

Ch

ad

Th

aila

nd

Ta

jikis

tan

To

ng

a

Tu

nis

ia

Tu

rke

y

Ta

nz

an

ia

Ug

an

da

Uk

rain

e

Uru

gu

ay

Un

ite

d S

tate

s

St.

Vin

ce

nt

an

d t

he

Gre

na

din

es

Ve

ne

zu

ela

Vie

tna

m

Va

nu

atu

Sa

mo

a

So

uth

Afr

ica

Za

mb

ia

Save Cluster Numbers

Discriminant Analysis

Discriminant Function Plot

Plot of Discriminant Functions

-5 -3 -1 1 3 5 7

Function 1

-3.5

-1.5

0.5

2.5

4.5

Fu

ncti

on

2

CLUSTNUMS123

Discriminant Function Coefficients

Cluster Scatterplot

Cluster Scatterplot


46 56 66 76 86

Life Expectancy (Total)

0

1

2

3

4

5

LO

G(I

nfa

nt

Mo

rtali

ty R

ate

)

Cluster 123Centroids

Cluster Scatterplot

Method of k-Means

1. k observations are selected to be the initial

seeds.

2. All remaining observations are assigned to

cluster with nearest seed.

3. The centroids of each cluster are calculated.

4. Each observation is checked to see if it is

closer to the centroid of another cluster. If so, it

is switched to that cluster.

5. Step 4 is repeated until there are no further

changes.

Seeds

• Select USA, China and India as initial seeds.

Cluster Summary

Cluster Scatterplot

USA=#3, China=#1, India=#2

World Map (coming soon)

CLUSTNUMS

123

References

• Johnson and Wichern (2012) Applied

Multivariate Analysis, sixth edition.

• Copy of slides and presentation at:

www.statgraphics.com/webinars

http://www.statgraphics.com/webinars

cluster analysis using statgraphics analysis.pdf · (clusters) such that the items in a cluster are...

Documents