research methodology: tools - schwarz & partners · cluster analysis is a multivariate...

Research Methodology: Tools

Applied Data Analysis (with SPSS)

Lecture 03: Cluster Analysis

March 2014

Prof. Dr. Jürg Schwarz

Lic. phil. Heidi Bruderer Enzler

MSc Business Administration

Slide 2

Contents

Aims of the Lecture ... 3
Typical Syntax ... 4
Introduction ... 5
    Fictitious Example ... 5
Overview ... 8
Concept of Cluster Analysis ... 9
    Key Steps in Cluster Analysis ... 9
    Step 1: Measures of Proximity ... 10
    Proximity Measures for Interval Data ... 12
    Proximity Measures for Binary Data ... 14
    Step 2: How Are the Clusters Formed? ... 17
Cluster Analysis with SPSS: A Detailed Example ... 23
    Market Research: Customer Survey Regarding Brand Awareness ... 23
    SPSS: Analyze → Classify → Hierarchical Cluster ... 24
    Step 1: Measuring the Distance or Similarity Between Objects ... 26
    Step 2: Forming Clusters ... 27
    Step 3: Determine the Number of Clusters ... 30
    Step 4: Saving and Representing Cluster Membership ... 32
    Step 5: Cluster Interpretation ... 36

Aims of the Lecture

You will understand measures of distance and similarity.

You will understand the steps for performing a cluster analysis.

You will be able to perform a cluster analysis with SPSS.

(Hierarchical agglomerative methods: Between-groups linkage and Ward's method)

In particular, you will know how ...

◦ to interpret an agglomeration schedule.

◦ to read a dendrogram, and how to read the number of clusters from it.

◦ to interpret clusters.

Slide 4

Typical Syntax

Cluster analysis (without standardization of variables)

CLUSTER income awareness
  /METHOD BAVERAGE
  /MEASURE=SEUCLID
  /ID=person
  /PRINT SCHEDULE CLUSTER(2,5)
  /PRINT DISTANCE
  /PLOT DENDROGRAM VICICLE
  /SAVE CLUSTER(2,5).

◦ income awareness: Items

◦ BAVERAGE: "Between-groups linkage"

◦ SEUCLID: "Squared Euclidean distance"

◦ ID=person: Label for cases

◦ CLUSTER(2,5): Range of solutions (number of clusters)

◦ PLOT DENDROGRAM VICICLE: Display dendrogram and vertical icicle plot

Obtain mean values for clusters

SORT CASES BY CLU3_1.
SPLIT FILE SEPARATE BY CLU3_1.
FREQUENCIES VARIABLES=income awareness
  /FORMAT=NOTABLE
  /STATISTICS=MEAN
  /ORDER=ANALYSIS.
SPLIT FILE OFF.

◦ SPLIT FILE: Split file by CLU3_1 for analyses to come

◦ FREQUENCIES: Frequency analysis

Introduction

Fictitious Example

Market research: Customer survey on brand awareness

[Figure: scatterplot; y-axis: brand awareness (index), x-axis: annual income (index)]

Characteristics of the survey

Sample of 150 customers

The index for brand awareness is composed of 3 items:

◦ I am aware of whether people wear brand-name clothes.

◦ It is important to me to wear brand-name clothes.

◦ By wearing brand-name clothes, I make a statement about myself.

The data set also includes:

◦ Annual income

Slide 6

Question

Is there a linear relationship between brand awareness and income?

Hypothesis: The higher the person's income, the greater the brand awareness

Performing a regression analysis with SPSS

[Figure: scatterplot; y-axis: brand awareness (index), x-axis: annual income (index)]

Output (summarized)

Test of the overall model (F-Test):

Significance p = .014

Test coefficients:

Constant p < .001 (shown by SPSS as .000)

Income p = .014

Coefficient of determination:

R-Squared = .040

A very poor model: income explains only 4% of the variance in brand awareness.

But there appears to be a structure in the data.

Question

Is there a structure present in the data regarding brand awareness?

Are there clusters for a combination of annual income and brand awareness?

Performing a cluster analysis

[Figure: scatterplot of brand awareness (index) with the clusters found]


Output

Yes, SPSS identifies 3 clusters

Interpretation

Persons with low income have less brand awareness because they have fewer financial resources.

Persons with average incomes have the highest brand awareness because they dream of being rich.

Persons with high income have moderate brand awareness because they already hold a special status and don't need to show off.

Slide 8

Overview

Cluster analysis is a multivariate procedure that finds natural groups in data.

Information from multiple variables is used for the grouping (for example, income and brand awareness).


Goal of a cluster analysis

The elements within a group should be as similar as possible.

<=> Distance d should be small.

The similarities between the groups should be minimal.

<=> Distance D should be large.

Characteristics

Because measured values are used for grouping, cluster analysis is objective in a certain sense.

There is no "optical illusion."

[Figure: scatterplot of brand awareness (index) with the distance d within a cluster and the distance D between clusters]

Concept of Cluster Analysis

Key Steps in Cluster Analysis

0. Choose variables (based on theory and previous research)

1. Measures of distance or similarity between objects (measures of proximity)

◦ Depends on the data type: interval, frequency, binary

◦ Distance: geometric measurement. Similarity: content measurement

◦ Calculation of a proximity matrix

2. Forming clusters

◦ Various algorithms: hierarchical / non-hierarchical, agglomerative / divisive, etc.

3. Instruments / criteria for deciding on the number of clusters

◦ Instruments: Agglomeration schedule, structure diagram, dendrogram, icicle plot

◦ Criteria (not available in SPSS): F-value, information criteria etc.

4. Saving and representing cluster membership

◦ Performed by SPSS

5. Interpreting clusters

◦ Taking into consideration the mean values (possibly the variance) of the cluster elements

Slide 10

Step 1: Measures of Proximity

From the data ...

Raw data: one row per object (Object 1, Object 2, Object 3, ..., Object k) and one column per variable (Variable 1, Variable 2, Variable 3, ..., Variable j)

... to the proximity matrix (calculated within SPSS)

Distance or similarity: one row and one column per object (Object 1, Object 2, Object 3, ..., Object k), so that each cell holds the proximity between two objects
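The step from raw data to proximity matrix can be sketched in a few lines of Python. This is only an illustration of the idea, not what SPSS does internally: the three objects and their values are hypothetical, the function names are made up for this example, and the squared Euclidean distance is assumed as the proximity measure.

```python
def squared_euclidean(x, y):
    """Squared Euclidean distance between two objects (sequences of values)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def proximity_matrix(data):
    """Return the k x k matrix of pairwise distances for k objects."""
    return [[squared_euclidean(row_k, row_l) for row_l in data] for row_k in data]


# Hypothetical raw data: 3 objects (rows) measured on 2 variables (columns).
raw_data = [
    (1.0, 2.0),
    (2.0, 4.0),
    (4.0, 2.0),
]

matrix = proximity_matrix(raw_data)
for row in matrix:
    print(row)
```

The matrix is symmetric and has zeros on the diagonal, because every object has distance 0 to itself.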

Different measures of proximity depend on type of data

There are measures of distance (d) and measures of similarity (s).

Interval (for example, brand awareness, annual income)

◦ Euclidean distance (d)

◦ City block distance (d)

◦ Pearson correlation (s)

Frequencies (for example, number of customers)

◦ Chi-squared measure (d)

◦ Phi-squared measure (d)

Binary (for example, Yes/No, Male/Female)

◦ Euclidean distance (d)

◦ Russell and Rao (s)

◦ Simple matching (s)

◦ Dice (s)

(only a selection of 27!)

Slide 12

Proximity Measures for Interval Data

Example: Brand awareness and income

Pythagorean theorem for a right triangle

$$a^2 + b^2 = c^2 \;\Rightarrow\; c = \sqrt{a^2 + b^2}$$

Distance between "pers_001" and "pers_002":

$$d_{001,002} = \left[ (1.67 - 0.97)^2 + (2.95 - 1.73)^2 \right]^{1/2} = \left[ 0.490 + 1.488 \right]^{1/2} = 1.407$$

Coordinates {x-axis, y-axis}

{1.67, 1.73} Person 2

{0.97, 2.95} Person 1

Generalized equation

Minkowski metric (Hermann Minkowski, 1864 – 1909, German mathematician)

$$d_{k,l} = \left[ \sum_{j=1}^{J} \left| x_{kj} - x_{lj} \right|^{r} \right]^{1/r}$$

r = Minkowski constant

$d_{k,l}$ = Distance between objects k and l (for example, distance between persons 001 and 002)

J = Number of cluster variables (for example, income and awareness variables)

$x_{kj}$, $x_{lj}$ = Values of variable j for objects k and l (for example, income of persons 001 and 002)

Value of Minkowski constant

◦ r = 1: City block distance (also called L1-Norm)

◦ r = 2: Euclidean distance (also called L2-Norm)

City block distance = Manhattan distance = Taxicab distance
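As a check of the calculation above, the Minkowski metric can be written as a short Python function. This is a minimal sketch; the function name is chosen for illustration, and the coordinates are those of Person 1 and Person 2 from the example.

```python
def minkowski_distance(x, y, r):
    """Minkowski distance between objects x and y for Minkowski constant r."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


# Coordinates {x-axis, y-axis} of the two persons from the example.
person_1 = (0.97, 2.95)
person_2 = (1.67, 1.73)

city_block = minkowski_distance(person_1, person_2, r=1)  # L1 norm
euclidean = minkowski_distance(person_1, person_2, r=2)   # L2 norm

print(round(city_block, 3))  # 1.92
print(round(euclidean, 3))   # 1.407
```

The city block distance (0.70 + 1.22 = 1.92) is larger than the Euclidean distance (1.407), because it adds up the differences along each axis instead of taking the direct connection.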

Slide 14

Proximity Measures for Binary Data

Example: Car configuration

Determining the similarity between two objects by comparison

Are the following two cases (Mercedes, BMW) the same or are they different?

4 Cases

A = Feature is present in both objects

B, C = Feature is only present in one of the objects

D = Feature is not present in the objects

Absence is also a similarity that can influence the proximity measurement

Configuration   ABS   Airbag   ESP   Navi   Metallic
Mercedes        0     1        1     1      0
BMW             0     1        1     0      1
Case            D     A        A     C      B

0 = feature not present, 1 = feature present

Binary proximity measurement

The similarity measurement of the two objects i and j depends on whether and how the four cases above (A, B, C, D) are used and how they are weighted (weights $\alpha$, $\delta_1$, $\delta_2$ and $\lambda$).

General Case: Simple Matching Coefficient*

$$S_{ij} = \frac{\alpha \cdot a + \delta_1 \cdot d}{\alpha \cdot a + \lambda (b + c) + \delta_2 \cdot d}$$

Option            Description                                                       Definition
Russell and Rao   Case d reduces similarity.                                        $S_{ij} = \frac{a}{a + b + c + d}$
Simple Matching   Case d increases similarity.                                      $S_{ij} = \frac{a + d}{a + b + c + d}$
Dice              Case d is not considered. Similar features are weighted higher.   $S_{ij} = \frac{2a}{2a + b + c}$

*Sokal, R.R. and Michener, C.D. (1958). A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, 38:1409-1438.

a = Number of "A" cases, b = Number of "B" cases, c = Number of "C" cases, d = Number of "D" cases

Slide 16

Example: Car configuration

Measurement       Proximity
Russell and Rao   $S_{ij} = \frac{a}{a + b + c + d} = \frac{2}{2 + 1 + 1 + 1} = \frac{2}{5} = 0.4$
Simple Matching   $S_{ij} = \frac{a + d}{a + b + c + d} = \frac{2 + 1}{2 + 1 + 1 + 1} = \frac{3}{5} = 0.6$
Dice              $S_{ij} = \frac{2a}{2a + b + c} = \frac{2 \cdot 2}{2 \cdot 2 + 1 + 1} = \frac{4}{6} = 0.67$

Configuration   ABS   Airbag   ESP   Navi   Metallic
Mercedes        0     1        1     1      0
BMW             0     1        1     0      1
Case            D     A        A     C      B

0 = feature not present, 1 = feature present

Comments

$S_{ij}$ varies between 0 and 1.

There is no "correct" measurement of proximity.

Important question / decision: Is absence important? (↔ is d considered?)

Number of cases: a = 2, b = 1, c = 1, d = 1
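The three coefficients of the car configuration example can be reproduced with a small Python sketch. The function names are chosen for illustration only; following the table, case B is counted when the feature is present only in the second object (BMW) and case C when it is present only in the first (Mercedes).

```python
def count_cases(x, y):
    """Count the four cases for two binary feature vectors x and y.

    a: feature present in both objects, d: feature present in neither,
    b: feature present only in y, c: feature present only in x.
    """
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    c = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d


def russell_rao(a, b, c, d):
    return a / (a + b + c + d)


def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)


def dice(a, b, c, d):
    return 2 * a / (2 * a + b + c)


# Features in order: ABS, Airbag, ESP, Navi, Metallic
mercedes = (0, 1, 1, 1, 0)
bmw = (0, 1, 1, 0, 1)

a, b, c, d = count_cases(mercedes, bmw)
print(a, b, c, d)                            # 2 1 1 1
print(russell_rao(a, b, c, d))               # 0.4
print(simple_matching(a, b, c, d))           # 0.6
print(round(dice(a, b, c, d), 2))            # 0.67
```

The same pair of cars is rated between 0.4 and 0.67 depending on how the joint absence (case d) is treated, which is why the choice of measure is a decision about content and not a technical detail.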

Step 2: How Are the Clusters Formed?

How is proximity defined?

The proximity between clusters A and B is measured as:

1. Nearest neighbor (single linkage)

... Minimum of all distances between a case in cluster A and a case in cluster B.

2. Centroid clustering (other linkage)

... Distance between the centroids of clusters A and B.

3. Furthest neighbor (complete linkage)

... Maximum of all distances between a case in cluster A and a case in cluster B.

[Figure: clusters A and B with the distances used by methods 1, 2 and 3]

Slide 18

The proximity between clusters A and B is measured as (continued):

4. Between-groups linkage (average linkage)

... Mean value of all distances between a case in cluster A and a case in cluster B.

5. Within-groups linkage (other linkage)

... Mean value of all distances between cases within the group formed by combining clusters A and B.

6. Median clustering (other linkage)

... Distance between the SPSS-defined median for cluster A cases and the median for cluster B cases.

Special case using sum of squares

7. Ward's method

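For a cluster, the sum of squares is the sum of the squared distances of each case from the centroid. At each stage, Ward's method merges the two clusters whose combination leads to the smallest increase in the total sum of squares.

Sum of the squared distances

$$d_1^2 + d_2^2 + \dots = \sum_{i=1}^{k} d_i^2$$

[Figure: cluster with its centroid and the distances d_1, d_2 of the cases from it]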
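To make the definitions concrete, the following Python sketch computes the proximity between two small clusters for nearest neighbor, furthest neighbor, between-groups linkage, centroid clustering and Ward's method. It is an illustration under stated assumptions: the two clusters are hypothetical, the plain Euclidean distance is used, the function names are made up for this example, and within-groups linkage and median clustering are left out.

```python
import math


def dist(p, q):
    """Euclidean distance between two cases."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(cluster):
    """Mean of each variable over the cases of a cluster."""
    n = len(cluster)
    return tuple(sum(values) / n for values in zip(*cluster))


def sum_of_squares(cluster):
    """Sum of squared distances of each case from the cluster centroid."""
    center = centroid(cluster)
    return sum(dist(p, center) ** 2 for p in cluster)


def nearest_neighbor(A, B):
    return min(dist(p, q) for p in A for q in B)


def furthest_neighbor(A, B):
    return max(dist(p, q) for p in A for q in B)


def between_groups(A, B):
    return sum(dist(p, q) for p in A for q in B) / (len(A) * len(B))


def centroid_distance(A, B):
    return dist(centroid(A), centroid(B))


def ward_increase(A, B):
    """Increase in the total sum of squares if A and B were merged."""
    return sum_of_squares(A + B) - sum_of_squares(A) - sum_of_squares(B)


# Hypothetical clusters with two cases each.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 2.0)]

for measure in (nearest_neighbor, furthest_neighbor, between_groups,
                centroid_distance, ward_increase):
    print(measure.__name__, round(measure(cluster_a, cluster_b), 3))
```

For these data the between-groups value lies between the nearest-neighbor and the furthest-neighbor value, as expected for a mean of all pairwise distances.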

"Tree" of clustering algorithms

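There are different clustering algorithms:

[Figure: tree of clustering algorithms; the methods used in the course are marked]

Non-hierarchical (partitioning) procedures are often referred to as k-means procedures, after the most common method of this type.

Between-groups linkage is the default in SPSS.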

Slide 20

Features

Approach                 Proximity measure        Comment
Nearest neighbor         Distance or similarity   Tendency to form chains
Furthest neighbor        Distance or similarity   Tends to form small groups of similar sizes
Between-groups linkage   Distance or similarity   Lies "between" "nearest neighbor" and "furthest neighbor"
Other linkage            Distance only
Ward's method            Distance only            Tends to form groups of similar sizes

Example of a hierarchical method: Nearest neighbor (single linkage)

◦ Tendency to form chains

◦ Good for identifying outliers

◦ Groups that lie near each other are poorly separated

[Figure: nearest neighbor, Stage k and Stage k + 1; the merged cases form a "chain"]

Source: Kognitive Psychologie, Universität des Saarlandes (www.uni-saarland.de) (Access: March 2014)

Slide 22

Example of a hierarchical method: Furthest neighbor (complete linkage)

◦ Tends to form small groups of similar sizes

◦ Not appropriate for identifying outliers

[Figure: furthest neighbor, Stage k and Stage k + 1]

Source: Kognitive Psychologie, Universität des Saarlandes (www.uni-saarland.de) (Access: March 2014)

=> Same data as on previous slide, yet different solution!
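That the same data can lead to different solutions can be tried out with a naive agglomerative procedure in Python. This is a sketch under simplifying assumptions, not the SPSS algorithm: the cases are hypothetical, each case has only one variable, and the function name is chosen for illustration. Passing `min` gives nearest neighbor, passing `max` gives furthest neighbor.

```python
def agglomerate(points, linkage, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    points: list of numbers (one variable per case, for simplicity)
    linkage: min for nearest neighbor, max for furthest neighbor
    """
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(p - q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical one-dimensional data with slowly growing gaps between cases.
cases = [0.0, 1.0, 2.1, 3.3, 4.6, 6.0]

nearest = agglomerate(cases, min, n_clusters=2)
furthest = agglomerate(cases, max, n_clusters=2)

print(nearest)   # [[0.0, 1.0, 2.1, 3.3, 4.6], [6.0]]
print(furthest)  # [[0.0, 1.0, 2.1, 3.3], [4.6, 6.0]]
```

Nearest neighbor adds one case after the other to a single chain and leaves the last case as an outlier, whereas furthest neighbor builds several small groups first.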

Cluster Analysis with SPSS: A Detailed Example

Market Research: Customer Survey Regarding Brand Awareness

Data

Random sub-sample of n = 15

(Why such a small sub-sample? Just to keep track of what SPSS does.)

[Figure: scatterplot of the sub-sample; y-axis: brand awareness (index)]


Slide 24

SPSS: Analyze → Classify → Hierarchical Cluster

Syntax

CLUSTER income awareness          Variables used
  /METHOD BAVERAGE                Clustering method "Between-groups linkage"
  /MEASURE=SEUCLID                Proximity measure "Squared Euclidean distance"
  /ID=person                      Label for diagrams and tables
  /PRINT SCHEDULE CLUSTER(2,5)    Agglomeration schedule, display membership
  /PRINT DISTANCE                 "Distance matrix" (Proximity matrix)
  /PLOT DENDROGRAM VICICLE        Instruments for specifying the number of clusters
  /SAVE CLUSTER(2,5).             Save cluster membership

Clustering method "Between-groups linkage" (default)

<=> A better choice might be Ward's method. "Between-groups linkage" is only used to show in detail how SPSS performs a cluster analysis.

Proximity measure "Squared Euclidean distance"

<=> The squared Euclidean distance (default) should be used in the BAVERAGE, CENTROID, MEDIAN or WARD clustering methods.

Slide 26

Step 1: Measuring the Distance or Similarity Between Objects

Output

Proximity matrix: (Distance or similarity between objects)

Values represent the squared Euclidean distance

Example: Distance between persons 9 and 7

Step 2: Forming Clusters

Between-groups linkage

Stage 1: Cases 7 and 9 have the smallest distance ("Coefficients" = .041) => first cluster {7,9}

First cluster {7,9} is merged with case 10 in stage 5 ("Next Stage") => Cluster {7,9,10}

Stage 2: Cases 13 and 14 have the second smallest distance => second cluster {13,14}

Second cluster {13,14} is merged with case 11 in stage 3 => Cluster {11,13,14}

Agglomeration schedule: Shows how the clusters are combined at each stage.
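How an agglomeration schedule comes about can be sketched in Python. The five cases below are hypothetical (they are not the survey data), and the function names are made up for this example. The sketch assumes between-groups linkage on squared Euclidean distances and names each cluster after its lowest case number.

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance between two cases."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def agglomeration_schedule(data):
    """Between-groups (average) linkage on squared Euclidean distances.

    data maps a case number to its values. Returns one tuple per stage:
    (stage, cluster 1, cluster 2, coefficient).
    """
    clusters = {case: [case] for case in data}
    schedule = []
    stage = 0
    while len(clusters) > 1:
        best = None
        names = sorted(clusters)
        for i, first in enumerate(names):
            for second in names[i + 1:]:
                pairs = [(k, l) for k in clusters[first] for l in clusters[second]]
                total = sum(squared_euclidean(data[k], data[l]) for k, l in pairs)
                coefficient = total / len(pairs)
                if best is None or coefficient < best[0]:
                    best = (coefficient, first, second)
        coefficient, first, second = best
        stage += 1
        schedule.append((stage, first, second, coefficient))
        # The merged cluster keeps the lower case number as its name.
        clusters[first] = clusters[first] + clusters.pop(second)
    return schedule


# Hypothetical data: case number -> (income, awareness)
data = {1: (1, 1), 2: (2, 1), 3: (5, 5), 4: (5, 7), 5: (9, 1)}

for stage, first, second, coefficient in agglomeration_schedule(data):
    print(stage, first, second, coefficient)
```

In stage 3 of this example, cluster 1 stands for {1, 2} and cluster 3 for {3, 4}. After stage s there are n - s clusters left, where n is the sample size.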

Slide 28

Dendrogram

[Figure: dendrogram with the merges of stages 1, 5, 2 and 3 marked]

Icicle plot

14 clusters: Cases 7 and 9 in a cluster, all others in their own cluster.

13 clusters: 7 and 9 in a cluster, 13 and 14 in a cluster, all others in their own cluster.

12 clusters: 7 and 9 in a cluster, 11, 13 and 14 in a cluster, all others in their own cluster.

Because the columns look like icicles, this illustration is called an "icicle plot".

The diagram shows how the cases are grouped into clusters.

It is read from bottom to top.

Slide 30

Step 3: Determine the Number of Clusters

0) Theoretical and empirical reasons (Caution: optical illusion!)

In the case of brand awareness, the data and their interpretation point to three clusters.

A) Elbow criterion in the structure diagram (cannot be done with SPSS, but with Excel)

Attention:

There is usually a jump between the 1-cluster and the 2-cluster solution. However, this is not an elbow.

[Figure: structure diagram; x-axis: number of clusters (= sample size - "Stage"), 1 to 15; y-axis: proximity ("Coefficients"), 0.0 to 9.0]

Elbow => 3 clusters
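The elbow can also be looked for numerically. The following Python sketch is only one possible heuristic and not an SPSS feature: it takes the "Coefficients" column of an agglomeration schedule, ignores the last stage (the jump to a single cluster), and stops just before the merge with the largest increase. The coefficients are hypothetical values for a sample of 15 cases, and the function name is chosen for illustration.

```python
def clusters_by_elbow(coefficients):
    """Suggest a number of clusters from agglomeration coefficients.

    coefficients[s - 1] is the "Coefficients" value of stage s. After stage s
    there are n - s clusters, where n = number of stages + 1 is the sample
    size. The last stage (merge into one cluster) is ignored.
    """
    n = len(coefficients) + 1
    best_stage, best_jump = None, 0.0
    for stage in range(2, n - 1):  # stages 2 to n - 2
        jump = coefficients[stage - 1] - coefficients[stage - 2]
        if jump > best_jump:
            best_stage, best_jump = stage, jump
    if best_stage is None:
        raise ValueError("no increase in the coefficients found")
    # Stop just before the merge that causes the largest jump.
    return n - best_stage + 1


# Hypothetical coefficients of the 14 stages for a sample of 15 cases.
coefficients = [0.04, 0.06, 0.09, 0.12, 0.15, 0.20, 0.26,
                0.33, 0.41, 0.52, 0.66, 0.85, 3.10, 8.60]

print(clusters_by_elbow(coefficients))  # 3
```

Such a rule should only support the decision. Theoretical and empirical reasons for a certain number of clusters remain the first criterion.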

B) Dendrogram

Choose the number of clusters at the point where heterogeneity increases the most.

[Figure: dendrogram; axis: standardized distance; the greatest increase in heterogeneity is marked]

Slide 32

Step 4: Saving and Representing Cluster Membership

Displaying Cluster Membership Table

Example brand awareness: Assume 3 clusters

If you are uncertain about the number of clusters, specify a range.

Saving cluster membership

The saved membership is used, for example, for drawing a scatterplot.

Range of solutions: 2 to 5

Example brand awareness: Assume 3 clusters

Slide 34

Scatterplot in SPSS: Graphs → Chart Builder ...

One case was incorrectly assigned.

Slide 36

Step 5: Cluster Interpretation

In case of brand awareness, the interpretation was already discussed.

Using mean values

The mean values of the clusters provide information on how the clusters can be interpreted in relation to the original variables.

Simple example: Market research on purchasing habits of customers

Given a questionnaire about attitudes.

Among other items:

"What is your general attitude toward life?" (Variable x1)

"What is your attitude toward innovation?" (Variable x2)

"How willing are you to take risks?" (Variable x3)

The scale of the variables varies between 1 (lowest level) and 7 (highest level).

Data from 6 persons (objects in rows, attributes in columns)

Object     x1: General attitude to life   x2: Attitude to innovation   x3: Willingness to take risks
Person A   1                              2                            2
Person B   1                              3                            3
Person C   2                              4                            2
Person D   5                              4                            3
Person E   5                              4                            4
Person F   7                              6                            7

Mean values of the cluster in regard to the clustering variables:

Cluster 1 (A, B, C): pessimistic, anxious people

Cluster 2 (D, E): slightly optimistic "ordinary people"

Cluster 3 (F): life-affirming adventurers

Cluster     General attitude to life   Attitude to innovation   Willingness to take risks
(A, B, C)   1.3                        3                        2.3
(D, E)      5                          4                        3.5
(F)         7                          6                        7

Obtaining mean values:

SORT CASES BY CLU3_1.

SPLIT FILE SEPARATE BY CLU3_1.

FREQUENCIES VARIABLES=x1 x2 x3

/FORMAT=NOTABLE

/STATISTICS=MEAN

/ORDER=ANALYSIS.

SPLIT FILE OFF.
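The cluster means of the six persons can be verified with a short Python sketch. It is meant as a cross-check of the table and not as a replacement for the SPSS syntax; the dictionary `membership` plays the role of a saved variable such as CLU3_1, and the function name is chosen for illustration.

```python
# Data from the 6 persons: (x1, x2, x3)
data = {
    "A": (1, 2, 2),
    "B": (1, 3, 3),
    "C": (2, 4, 2),
    "D": (5, 4, 3),
    "E": (5, 4, 4),
    "F": (7, 6, 7),
}

# Cluster membership of each person in the 3-cluster solution.
membership = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 3}


def cluster_means(data, membership):
    """Mean of each variable per cluster, rounded to one decimal place."""
    groups = {}
    for person, cluster in membership.items():
        groups.setdefault(cluster, []).append(data[person])
    return {
        cluster: tuple(round(sum(column) / len(rows), 1) for column in zip(*rows))
        for cluster, rows in groups.items()
    }


for cluster, means in cluster_means(data, membership).items():
    print(cluster, means)
```

Cluster 1 has low means on all three variables and cluster 3 has the highest, which matches the interpretation given above.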

Slide 38

Notes:
