Research Methodology: Tools
Applied Data Analysis (with SPSS)
Lecture 03: Cluster Analysis
March 2014
Prof. Dr. Jürg Schwarz
Lic. phil. Heidi Bruderer Enzler
MSc Business Administration
Slide 2
Contents
Aims of the Lecture (Slide 3)
Typical Syntax (Slide 4)
Introduction (Slide 5)
  Fictitious Example (Slide 5)
Overview (Slide 8)
Concept of Cluster Analysis (Slide 9)
  Key Steps in Cluster Analysis (Slide 9)
  Step 1: Measures of Proximity (Slide 10)
  Proximity Measures for Interval Data (Slide 12)
  Proximity Measures for Binary Data (Slide 14)
  Step 2: How Are the Clusters Formed? (Slide 17)
Cluster Analysis with SPSS: A Detailed Example (Slide 23)
  Market Research: Customer Survey Regarding Brand Awareness (Slide 23)
  SPSS: Analyze > Classify > Hierarchical Cluster (Slide 24)
  Step 1: Measuring the Distance or Similarity Between Objects (Slide 26)
  Step 2: Forming Clusters (Slide 27)
  Step 3: Determine the Number of Clusters (Slide 30)
  Step 4: Saving and Representing Cluster Membership (Slide 32)
  Step 5: Cluster Interpretation (Slide 36)
Slide 3
Aims of the Lecture
You will understand measures of distance and similarity.
You will understand the steps for performing a cluster analysis.
You will be able to perform a cluster analysis with SPSS.
(Hierarchical agglomerative methods: Between-groups linkage and Ward's method)
In particular, you will know how ...
◦ to interpret an agglomeration schedule.
◦ to read a dendrogram, and how to read the number of clusters from it.
◦ to interpret clusters.
Slide 4
Typical Syntax
Cluster analysis (without standardization of variables)
CLUSTER income awareness              Items
/METHOD BAVERAGE                      "Between-groups linkage"
/MEASURE=SEUCLID                      "Squared Euclidean distance"
/ID=person                            Label for cases
/PRINT SCHEDULE CLUSTER(2,5)          Range of solutions (number of clusters)
/PRINT DISTANCE
/PLOT DENDROGRAM VICICLE              Display dendrogram and vertical icicle plot
/SAVE CLUSTER(2,5).
Obtain mean values for clusters
SORT CASES BY CLU3_1.
SPLIT FILE SEPARATE BY CLU3_1.        Split file by CLU3_1 for analyses to come
FREQUENCIES VARIABLES=income awareness     Frequency analysis
/FORMAT=NOTABLE
/STATISTICS=MEAN
/ORDER=ANALYSIS.
SPLIT FILE OFF.
Slide 5
Introduction
Fictitious Example
Market research: Customer survey on brand awareness
[Figure: plot of brand awareness [Index] against annual income [Index]]
Characteristics of the survey
Sample of 150 customers
The index for brand awareness is composed of 3 items:
◦ I am aware of whether people wear brand-name clothes.
◦ It is important to me to wear brand-name clothes.
◦ By wearing brand-name clothes, I make a statement about myself.
The data set also includes:
◦ Annual income
Slide 6
Question
Is there a linear relationship between brand awareness and income?
Hypothesis: The higher a person's income, the greater their brand awareness.
Performing a regression analysis with SPSS
[Figure: plot of brand awareness [Index] against annual income [Index]]
Output (summarized)
Test of the overall model (F-Test):
Significance p = .014
Test coefficients:
Constant p < .001 (shown by SPSS as .000)
Income p = .014
Coefficient of determination:
R-Squared = .040
A very poor model: with R-squared = .040, income explains only 4% of the variance in brand awareness.
But there appears to be a structure in the data.
Slide 7
Question
Is there a structure present in the data regarding brand awareness?
Are there clusters for a combination of annual income and brand awareness?
Performing a cluster analysis
[Figure: plot of brand awareness [Index] against annual income [Index], with the clusters marked]
Output
Yes, SPSS identifies 3 clusters
Interpretation
Persons with low income have less brand awareness because they have fewer financial resources.
Persons with average income have the highest brand awareness because they dream of being rich.
Persons with high income have moderate brand awareness because they already hold a special status and don't need to show off.
Slide 8
Overview
Cluster analysis is a multivariate procedure that finds natural groups in data.
Information from multiple variables is used for the grouping
(for example, income and brand awareness).
Goal of a cluster analysis
The elements within a group should be as similar as possible.
<=> Distance d should be small.
The similarities between the groups should be minimal.
<=> Distance D should be large.
Characteristics
Because measured values are used for grouping, cluster analysis is objective in a certain sense.
There is no "optical illusion."
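[Figure: plot of brand awareness [Index] against annual income [Index], marking the distance d within a group and the distance D between groups]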
Slide 9
Concept of Cluster Analysis
Key Steps in Cluster Analysis
0. Choose variables (based on theory and previous research)
1. Measures of distance or similarity between objects (measures of proximity)
◦ Depends on the data type: interval, frequency, binary
◦ Distance: geometric measurement. Similarity: content measurement
◦ Calculation of a proximity matrix
2. Forming clusters
◦ Various algorithms: hierarchical / non-hierarchical, agglomerative / divisive, etc.
3. Instruments / criteria for deciding on the number of clusters
◦ Instruments: Agglomeration schedule, structure diagram, dendrogram, icicle plot
◦ Criteria (not available in SPSS): F-value, information criteria etc.
4. Saving and representing cluster membership
◦ Performed by SPSS
5. Interpreting clusters
◦ Taking into consideration the mean values (possibly the variance) of the cluster elements
Slide 10
Step 1: Measures of Proximity
From the data ...
Raw data (objects x variables):
            Variable 1   Variable 2   Variable 3   ...   Variable j
Object 1
Object 2
Object 3
...
Object k
... to the proximity matrix (calculated within SPSS)
Proximity matrix (distance or similarity between objects):
            Object 1   Object 2   Object 3   ...   Object k
Object 1
Object 2
Object 3
...
Object k
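The step from raw data to a proximity matrix can be sketched in a few lines of Python. This is only an illustration, not what SPSS does internally: the function names and the three data rows are assumptions made up for the example, and the squared Euclidean distance is used as the proximity measure.

```python
from typing import List

def squared_euclidean(p: List[float], q: List[float]) -> float:
    """Squared Euclidean distance between two objects (rows of the raw data)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def proximity_matrix(data: List[List[float]]) -> List[List[float]]:
    """Return the k x k matrix of pairwise distances for k objects."""
    return [[squared_euclidean(p, q) for q in data] for p in data]

# Hypothetical raw data: 3 objects (rows) measured on 2 variables (columns).
raw_data = [
    [0.97, 2.95],
    [1.67, 1.73],
    [1.00, 3.00],
]

matrix = proximity_matrix(raw_data)
for row in matrix:
    print(["%.3f" % value for value in row])
```

The matrix is symmetric with zeros on the diagonal, which is why statistics packages typically display only one triangle of it.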
Slide 11
Different measures of proximity depend on the type of data
There are measures of distance (d) and measures of similarity (s).
Interval (for example, brand awareness, annual income)
◦ Euclidean distance (d)
◦ City block distance (d)
◦ Pearson correlation (s)
◦ ...
Frequencies (for example, number of customers)
◦ Chi-squared measure (d)
◦ Phi-squared measure (d)
◦ ...
Binary (for example, Yes/No, Male/Female)
◦ Euclidean distance (d)
◦ Russell and Rao (s)
◦ Simple matching (s)
◦ Dice (s)
(only a selection of 27!)
Slide 12
Proximity Measures for Interval Data
Example: Brand awareness and income
Pythagorean theorem for a right triangle:
$$a^2 + b^2 = c^2 \;\Rightarrow\; c = \sqrt{a^2 + b^2}$$
Distance between "pers_001" and "pers_002":
$$d_{001,002} = \left[(1.67 - 0.97)^2 + (2.95 - 1.73)^2\right]^{1/2} = \left[0.490 + 1.488\right]^{1/2} = 1.407$$
Coordinates {x-axis, y-axis}
{1.67, 1.73} Person 2
{0.97, 2.95} Person 1
Slide 13
Generalized equation
Minkowski metric (Hermann Minkowski, 1864 – 1909, German mathematician)
$$d_{k,l} = \left[\sum_{j=1}^{J} \left|x_{kj} - x_{lj}\right|^{r}\right]^{1/r}$$
r = Minkowski constant
$d_{k,l}$ = Distance between objects k and l (for example, distance between persons 001 and 002)
J = Number of cluster variables (for example, income and awareness variables)
$x_{kj}$, $x_{lj}$ = Values of variable j for objects k and l (for example, income of persons 001 and 002)
Value of Minkowski constant
◦ r = 1: City block distance (also called L1-Norm)
◦ r = 2: Euclidean distance (also called L2-Norm)
City block distance
= Manhattan distance
= Taxi distance
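As a quick check of the Minkowski metric, the following Python sketch evaluates it for the two persons from the interval-data example with r = 1 and r = 2. The function name `minkowski` and the variable names are illustrative choices, not part of the lecture.

```python
from typing import Sequence

def minkowski(x_k: Sequence[float], x_l: Sequence[float], r: float) -> float:
    """Minkowski distance between objects k and l for a constant r >= 1."""
    if r < 1:
        raise ValueError("r must be at least 1")
    return sum(abs(a - b) ** r for a, b in zip(x_k, x_l)) ** (1.0 / r)

# Coordinates {x-axis, y-axis} of the two persons from the example.
person_1 = (0.97, 2.95)
person_2 = (1.67, 1.73)

city_block = minkowski(person_1, person_2, r=1)  # sum of absolute differences
euclidean = minkowski(person_1, person_2, r=2)   # straight-line distance

print(round(city_block, 3), round(euclidean, 3))
```

With r = 2 the result reproduces the hand calculation (about 1.407); with r = 1 the two coordinate differences are simply added (0.70 + 1.22).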
Slide 14
Proximity Measures for Binary Data
Example: Car configuration
Determining the similarity between two objects by comparison
Are the following two cases (Mercedes, BMW) the same or are they different?
4 cases
A = Feature is present in both objects
B, C = Feature is present in only one of the objects
D = Feature is present in neither object
Absence is also a similarity that can influence the proximity measurement
Configuration (0 = feature not present, 1 = feature present)
            ABS   Airbag   ESP   Navi   Metallic
Mercedes    0     1        1     1      0
BMW         0     1        1     0      1
Case        D     A        A     C      B
Slide 15
Binary proximity measurement
The similarity measure for two objects i and j depends on whether and how the four cases above (A, B, C, D) are used and how they are weighted (weights α, δ_1, δ_2 and λ).
General case: Simple Matching Coefficient*
$$S_{ij} = \frac{\alpha \cdot a + \delta_1 \cdot d}{\alpha \cdot a + \lambda (b + c) + \delta_2 \cdot d}$$
Options           Description                                Definition
Russell and Rao   Case d reduces similarity                  $S_{ij} = \frac{a}{a + b + c + d}$
Simple Matching   Case d increases similarity                $S_{ij} = \frac{a + d}{a + b + c + d}$
Dice              Case d is not considered.                  $S_{ij} = \frac{2a}{2a + b + c}$
                  Similar features are weighted higher
a = Number of "A" cases, b = Number of "B" cases, c = Number of "C" cases, d = Number of "D" cases
*Sokal, R.R. and Michener, C.D., A statistical method for evaluating systematic relationships, University of Kansas Science Bulletin, 38:1409-1438, 1958.
Slide 16
Example: Car configuration
Measurement       Proximity
Russell and Rao   $S_{ij} = \frac{a}{a + b + c + d} = \frac{2}{2 + 1 + 1 + 1} = \frac{2}{5} = 0.4$
Simple Matching   $S_{ij} = \frac{a + d}{a + b + c + d} = \frac{2 + 1}{2 + 1 + 1 + 1} = \frac{3}{5} = 0.6$
Dice              $S_{ij} = \frac{2a}{2a + b + c} = \frac{2 \cdot 2}{2 \cdot 2 + 1 + 1} = \frac{4}{6} = 0.67$
Configuration (0 = feature not present, 1 = feature present)
            ABS   Airbag   ESP   Navi   Metallic
Mercedes    0     1        1     1      0
BMW         0     1        1     0      1
Case        D     A        A     C      B
Number of cases: a = 2, b = 1, c = 1, d = 1
Comments
Sij varies between 0 and 1
There is no "correct" measurement of proximity.
Important question / decision:
Is absence important?
(↔ is d considered?)
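The three coefficients for the car configuration can be recomputed with a short Python sketch. The helper names are assumptions chosen for illustration; the assignment of cases B and C follows the slide (B: present only in the second object, C: present only in the first).

```python
from typing import Sequence, Tuple

def count_cases(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count the four cases a, b, c, d for two binary feature vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # present in both
    b = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # only in second object
    c = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # only in first object
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)  # present in neither
    return a, b, c, d

# Features: ABS, Airbag, ESP, Navi, Metallic
mercedes = [0, 1, 1, 1, 0]
bmw = [0, 1, 1, 0, 1]

a, b, c, d = count_cases(mercedes, bmw)

russell_rao = a / (a + b + c + d)
simple_matching = (a + d) / (a + b + c + d)
dice = 2 * a / (2 * a + b + c)

print(a, b, c, d)
print(round(russell_rao, 2), round(simple_matching, 2), round(dice, 2))
```

The only difference between the three measures is how the joint absence d enters the fraction, which is exactly the decision raised in the comments.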
Slide 17
Step 2: How Are the Clusters Formed?
How is proximity defined?
The proximity between clusters A and B is measured as:
1. Nearest neighbor (single linkage)
... Minimum of all possible distances of the cases in cluster A and of the cases in cluster B.
2. Centroid clustering (other linkage)
... Distance between the centroids of clusters A and B.
3. Furthest neighbor (complete linkage)
... Maximum of all possible distances of the cases in cluster A and of the cases in cluster B.
[Figure: clusters A and B, with the distances used by methods 1, 2 and 3 drawn between them]
Slide 18
The proximity between clusters A and B is measured as (continued):
4. Between-groups linkage (average linkage)
... Mean value of all possible distances between the cases of clusters A and B.
5. Within-groups linkage (other linkage)
... Average value of all possible distances of cases within a group formed by combining clusters A and B.
6. Median clustering (other linkage)
... Distance between the SPSS-defined median for cluster A cases and the median for cluster B cases.
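Four of the linkage rules listed above can be compared side by side in Python. This is a minimal sketch under stated assumptions: the two clusters are made-up points, the plain Euclidean distance is used, and the function names are illustrative rather than taken from SPSS.

```python
from itertools import product
from math import sqrt
from typing import List, Tuple

Point = Tuple[float, float]

def euclidean(p: Point, q: Point) -> float:
    """Euclidean distance between two cases."""
    return sqrt(sum((u - v) ** 2 for u, v in zip(p, q)))

def centroid(cluster: List[Point]) -> Point:
    """Mean of each coordinate over all cases in the cluster."""
    n = len(cluster)
    return tuple(sum(values) / n for values in zip(*cluster))

def single_linkage(A: List[Point], B: List[Point]) -> float:
    """Nearest neighbor: minimum distance over all pairs of cases."""
    return min(euclidean(p, q) for p, q in product(A, B))

def complete_linkage(A: List[Point], B: List[Point]) -> float:
    """Furthest neighbor: maximum distance over all pairs of cases."""
    return max(euclidean(p, q) for p, q in product(A, B))

def average_linkage(A: List[Point], B: List[Point]) -> float:
    """Between-groups linkage: mean distance over all pairs of cases."""
    return sum(euclidean(p, q) for p, q in product(A, B)) / (len(A) * len(B))

def centroid_linkage(A: List[Point], B: List[Point]) -> float:
    """Centroid clustering: distance between the two cluster centroids."""
    return euclidean(centroid(A), centroid(B))

# Hypothetical clusters with two cases each.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 1.0)]

single = single_linkage(cluster_a, cluster_b)
complete = complete_linkage(cluster_a, cluster_b)
average = average_linkage(cluster_a, cluster_b)
between_centroids = centroid_linkage(cluster_a, cluster_b)

print(round(single, 3), round(average, 3), round(complete, 3), round(between_centroids, 3))
```

For any pair of clusters the average of the pairwise distances lies between their minimum and maximum, which is one way to read the remark that between-groups linkage sits between the nearest and furthest neighbor methods.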
Special case using sum of squares
7. Ward's method
For a cluster the sum of squares is the sum of squared distances of each case from the centroid.
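[Figure: cluster with the distances d1, d2, ... of the cases from the centroid]
Sum of the squared distances:
$$d_1^2 + d_2^2 + \dots + d_k^2 = \sum_{i=1}^{k} d_i^2$$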
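The sum of squares of a single cluster can be computed directly from its definition. The sketch below is illustrative only: the function name is an assumption, and the three cases reuse the values of Persons A, B and C from the Step 5 example of this lecture.

```python
from typing import List, Sequence

def within_cluster_sum_of_squares(cluster: List[Sequence[float]]) -> float:
    """Sum of squared Euclidean distances of each case from the cluster centroid."""
    n = len(cluster)
    centroid = [sum(values) / n for values in zip(*cluster)]
    return sum(
        (value - mean) ** 2
        for case in cluster
        for value, mean in zip(case, centroid)
    )

# Three cases measured on three variables (x1, x2, x3).
cluster = [(1, 2, 2), (1, 3, 3), (2, 4, 2)]

ss = within_cluster_sum_of_squares(cluster)
print(round(ss, 3))

# A cluster consisting of a single case has a sum of squares of zero.
print(within_cluster_sum_of_squares([(7, 6, 7)]))
```

A small sum of squares means the cases lie close to their centroid, so the quantity serves as a measure of heterogeneity within a cluster.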
Slide 19
"Tree" of clustering algorithms
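There are different clustering algorithms:
[Figure: tree of clustering algorithms, with the methods used in the course marked]
Non-hierarchical procedures are often called k-means procedures, after the most common algorithm of this type.
Between-groups linkage is the default in SPSS.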
Slide 20
Features
Approach                 Proximity measure        Comment
Nearest neighbor         Distance or similarity   Tendency to form chains
Furthest neighbor        Distance or similarity   Tends to form small groups of similar size
Between-groups linkage   Distance or similarity   Lies "between" "nearest neighbor" and "furthest neighbor"
Other linkage            Distance only
Ward's method            Distance only            Tends to form groups of similar size
Slide 21
Example of a hierarchical method: Nearest neighbor (single linkage)
◦ Tendency to form chains
◦ Good for identifying outliers
◦ Groups that lie near each other are poorly separated
Kognitive Psychologie, Universität des Saarlandes (www.uni-saarland.de) (Access: March 2014)
[Figure: nearest neighbor clustering at stage k and stage k + 1, showing the formation of a "chain"]
Slide 22
Example of a hierarchical method: Furthest neighbor (complete linkage)
◦ Tends to form small groups of similar size
◦ Not appropriate for identifying outliers
Kognitive Psychologie, Universität des Saarlandes (www.uni-saarland.de) (Access: March 2014)
[Figure: furthest neighbor clustering at stage k and stage k + 1]
=> Same data as on previous slide, yet different solution!
Slide 23
Cluster Analysis with SPSS: A Detailed Example
Market Research: Customer Survey Regarding Brand Awareness
Data
Random sub-sample of n = 15
(Why such a small sub-sample?
Just to keep track of what SPSS does.)
Bra
nd a
ware
ne
ss [
Index]
Annual income [Index]
Slide 24
SPSS: Analyze > Classify > Hierarchical Cluster
Slide 25
Syntax
CLUSTER income awareness Variables used
/METHOD BAVERAGE Clustering method "Linkage between groups"
/MEASURE= SEUCLID Proximity measure "Squared Euclidean distance"
/ID=person Label for diagrams and tables
/PRINT SCHEDULE CLUSTER(2,5) Agglomeration schedule, display membership
/PRINT DISTANCE "Distance matrix" (Proximity matrix)
/PLOT DENDROGRAM VICICLE Instruments for specifying the number of clusters
/SAVE CLUSTER(2,5). Save cluster membership
Clustering method "Between-groups linkage" (default)
<=> A better choice might be Ward's method. "Between-groups linkage" is only used to show in detail how SPSS performs a cluster analysis.
Proximity measure "Squared Euclidean distance"
<=> The squared Euclidean distance (default) should be used in the BAVERAGE, CENTROID, MEDIAN or WARD clustering methods.
Slide 26
Step 1: Measuring the Distance or Similarity Between Objects
Output
Proximity matrix: (Distance or similarity between objects)
Values represent the squared Euclidean distance
Example:
Distance between Persons 9 and 7
Slide 27
Step 2: Forming Clusters
Between-groups linkage
Stage 1: Cases 7 and 9 have the smallest distance ("Coefficients" = .041) => first cluster {7,9}
First cluster {7,9} is merged with case 10 in stage 5 ("Next Stage") => Cluster {7,9,10}
Stage 2: Cases 13 and 14 have the second smallest distance => second cluster {13,14}
Second cluster {13,14} is merged with case 11 in stage 3 => Cluster {11,13,14}
...
Agglomeration schedule: Shows how the clusters are combined at each stage.
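How an agglomeration schedule comes about can be sketched with a naive implementation of between-groups linkage on squared Euclidean distances. This is an illustration under stated assumptions, not the SPSS algorithm: the five data points are hypothetical, cases are numbered from 0, ties are broken by list order, and no attempt is made at efficiency.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

def squared_euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))

def average_linkage(A: List[int], B: List[int], points) -> float:
    """Mean squared Euclidean distance over all pairs of cases from A and B."""
    total = sum(squared_euclidean(points[i], points[j]) for i in A for j in B)
    return total / (len(A) * len(B))

def agglomerate(points) -> List[Tuple[int, List[int], List[int], float]]:
    """Return a schedule of (stage, cluster 1, cluster 2, coefficient)."""
    clusters = [[i] for i in range(len(points))]  # every case starts alone
    schedule = []
    stage = 1
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]], points),
        )
        coefficient = average_linkage(clusters[i], clusters[j], points)
        schedule.append((stage, sorted(clusters[i]), sorted(clusters[j]), coefficient))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        stage += 1
    return schedule

# Hypothetical data: 5 cases measured on 2 variables.
points = [(1.0, 1.0), (1.2, 1.1), (5.0, 5.0), (5.1, 4.9), (9.0, 1.0)]

schedule = agglomerate(points)
for stage, first, second, coefficient in schedule:
    print(stage, first, second, round(coefficient, 3))
```

Each row corresponds to one stage of the schedule: the two clusters that are combined and the distance ("Coefficients") at which this happens. With n cases there are always n - 1 stages.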
Slide 28
Dendrogram
[Figure: dendrogram, with the merges of stages 1, 5, 2 and 3 marked]
Slide 29
Icicle plot
14 clusters: Cases 7 and 9 in a cluster, all others in their own cluster.
13 clusters: 7 and 9 in a cluster, 13 and 14 in a cluster, all others in their own cluster.
12 clusters: 7 and 9 in a cluster, 11, 13 and 14 in a cluster, all others in their own cluster.
...
Because the columns look like icicles, this illustration is called an "icicle plot".
The diagram shows how the cases are grouped into clusters.
It is read from bottom to top.
Slide 30
Step 3: Determine the Number of Clusters
0) Theoretical and empirical reasons (Caution: optical illusion!)
In the case of brand awareness, there are indications of three clusters.
A) Elbow criterion in the structure diagram (cannot be produced in SPSS, but can be in Excel)
Attention:
There is usually a jump between the 1-cluster and the 2-cluster solution. However, this is not an elbow.
[Figure: structure diagram; y-axis: Proximity ("Coefficients"), 0.0 to 9.0; x-axis: Number of clusters (= Sample size - "Stage"), 1 to 15; elbow => 3 clusters]
Slide 31
B) Dendrogram
Choose the number of clusters at the point of the largest increase in heterogeneity.
[Figure: dendrogram; axis: standardized distance; the greatest increase in heterogeneity is marked]
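The rule of looking for the largest increase in heterogeneity can be written down as a small heuristic on the "Coefficients" column of an agglomeration schedule. The sketch below is an assumption-laden illustration: the coefficients are invented for a sample of 15 cases, the function name is hypothetical, and the jump from 2 clusters to 1 is skipped by default because it is usually large without being an elbow.

```python
from typing import List

def clusters_before_largest_jump(coefficients: List[float], min_clusters: int = 3) -> int:
    """Number of clusters remaining just before the largest increase in the coefficient.

    coefficients[0] belongs to stage 1, coefficients[1] to stage 2, and so on.
    After a given stage, (sample size - stage) clusters remain.
    """
    sample_size = len(coefficients) + 1
    best_clusters, best_jump = None, float("-inf")
    for stage in range(1, len(coefficients)):
        jump = coefficients[stage] - coefficients[stage - 1]  # stage -> stage + 1
        clusters_remaining = sample_size - stage
        if clusters_remaining < min_clusters:
            continue  # ignore the jump towards the single-cluster solution
        if jump > best_jump:
            best_clusters, best_jump = clusters_remaining, jump
    return best_clusters

# Hypothetical coefficients for stages 1 to 14 (sample size 15).
coefficients = [0.04, 0.06, 0.09, 0.12, 0.15, 0.20, 0.25,
                0.30, 0.40, 0.50, 0.60, 0.80, 3.00, 8.50]

suggested = clusters_before_largest_jump(coefficients)
print(suggested)
```

Such a rule is only an aid. The number it suggests should still be checked against the dendrogram and against theoretical considerations.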
Slide 32
Step 4: Saving and Representing Cluster Membership
Displaying Cluster Membership Table
Example brand awareness: Assume 3 clusters
If you are uncertain about the number of clusters, specify a range.
Slide 33
Saving cluster membership
Used, for example, for drawing a scatterplot
Range of solutions: 2 to 5
Example brand awareness: Assume 3 clusters
Slide 34
Scatterplot in SPSS: Graphs > Chart Builder ...
Slide 35
One case was incorrectly assigned.
Slide 36
Step 5: Cluster Interpretation
In the case of brand awareness, the interpretation was already discussed.
Using mean values
The mean values of the clusters provide information on how the clusters can be interpreted in relation to the original variables.
Simple example: Market research on purchasing habits of customers
Given: a questionnaire about attitudes.
Among other items:
"What is your general attitude toward life?" (Variable x1)
"What is your attitude toward innovation?" (Variable x2)
"How willing are you to take risks?" (Variable x3)
The scale of the variables ranges from 1 (lowest level) to 7 (highest level).
Data from 6 persons (objects x attributes)
           x1: General attitude to life   x2: Attitude to innovation   x3: Willingness to take risks
Person A   1                              2                            2
Person B   1                              3                            3
Person C   2                              4                            2
Person D   5                              4                            3
Person E   5                              4                            4
Person F   7                              6                            7
Slide 37
Mean values of the clusters with regard to the clustering variables:
Cluster 1 (A, B, C): pessimistic, anxious people
Cluster 2 (D, E): slightly optimistic "ordinary people"
Cluster 3 (F): life-affirming adventurers
Cluster (attributes)   General attitude to life   Attitude to innovation   Willingness to take risks
(A, B, C)              1.3                        3                        2.3
(D, E)                 5                          4                        3.5
(F)                    7                          6                        7
Obtaining mean values:
SORT CASES BY CLU3_1.
SPLIT FILE SEPARATE BY CLU3_1.
FREQUENCIES VARIABLES=x1 x2 x3
/FORMAT=NOTABLE
/STATISTICS=MEAN
/ORDER=ANALYSIS.
SPLIT FILE OFF.
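The cluster means in the table can be verified outside SPSS with a few lines of Python. The dictionary names below are illustrative, and the cluster membership is taken from the three-cluster solution shown on the slide.

```python
from typing import Dict, List, Tuple

# Data from 6 persons: (x1, x2, x3)
data: Dict[str, Tuple[int, int, int]] = {
    "A": (1, 2, 2),
    "B": (1, 3, 3),
    "C": (2, 4, 2),
    "D": (5, 4, 3),
    "E": (5, 4, 4),
    "F": (7, 6, 7),
}

# Cluster membership as it would be stored in a saved membership variable.
membership: Dict[str, int] = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 3}

def cluster_means(data, membership) -> Dict[int, List[float]]:
    """Mean of each variable within each cluster."""
    grouped: Dict[int, List[Tuple[int, int, int]]] = {}
    for person, values in data.items():
        grouped.setdefault(membership[person], []).append(values)
    return {
        cluster: [sum(column) / len(rows) for column in zip(*rows)]
        for cluster, rows in grouped.items()
    }

means = cluster_means(data, membership)
for cluster in sorted(means):
    print(cluster, [round(value, 1) for value in means[cluster]])
```

The rounded values match the table above, for example 1.3, 3 and 2.3 for cluster 1.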