Data reduction & classification
Mark Tranmer, CCSR

TRANSCRIPT

Page 1

Data reduction & classification

Mark Tranmer, CCSR

Page 2

Introduction

• Data reduction techniques investigate the inter-relationships in a set of variables of interest, either as an exploratory technique or as a way of constructing an index from those variables. PCA is the most commonly used data reduction technique; it reduces a large number of variables to just a few dimensions.

Page 3

Data reduction: examples

Examples include:

• Construct validity of questionnaires

• Factorial ecology – geographical studies: characteristics of areas

• Latent variables – intelligence, numeracy

• Constructing deprivation indices – we will consider this in detail later

Page 4

Data Reduction

• Today we will focus on Principal Component Analysis (PCA) as a data reduction technique.

• If you want to read more about the distinction between Factor Analysis and PCA, see Chatfield & Collins (1992) and Field (2005) in the reading list.

Page 5

Data Classification

• Data classification techniques involve a way of grouping together cases that are similar in a dataset. The most common way of doing this is through a cluster analysis.

• An example of the use of this technique is to identify similar areas on the basis of a set of census variables.

Page 6

• Today we will look at data reduction and classification from a theoretical perspective with some examples. We will also try out some of these techniques using SPSS. The examples will be based on data from the 1991 UK census.

Page 7

Principal Component Analysis (PCA)

• What is it?

PCA is a multivariate statistical technique used to examine the relationships among p correlated variables.

It may be useful to transform the original set of variables into a new set of uncorrelated variables called principal components.

Page 8

Principal Component Analysis (PCA)

• These new variables are linear combinations of the original variables and are derived in decreasing order of importance.

• If we start with p variables, we can obtain p principal components.

• The first principal component accounts for the most variation in the original data.

• We do not have a dependent variable and explanatory variables in PCA.
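These slides work in SPSS; purely as an illustration of the idea, here is a minimal Python sketch (with invented random data, not the census variables) showing how the components arise: the eigenvectors of the correlation matrix supply the coefficients of the linear combinations, and the eigenvalues order the components by the variance they account for.

```python
import numpy as np

# Invented illustrative data: n cases (rows) by p variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(104, 6))

# Standardise, then take the correlation matrix of the p variables,
# as SPSS does by default for PCA.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(Z, rowvar=False)

# Eigenvectors give the coefficients of the linear combinations;
# eigenvalues give the variance each component accounts for.
eigenvalues, eigenvectors = np.linalg.eigh(R)

# eigh returns ascending order, so reverse: the first component then
# accounts for the most variation in the original data.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# p variables yield p components; the component scores are uncorrelated.
scores = Z @ eigenvectors
print(eigenvalues / eigenvalues.sum())  # proportion of variance per component
```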

Page 9

Principal Component Analysis (PCA)

• The usual objective is to see if the first few components account for most of the variation in the data. If they do, it is argued that the dimensionality of the problem is less than p.

• E.g. if we have 20 variables, and these are largely explained by the first 2 components, we might say that we do not have a 20-dimensional problem, but in fact a 2-dimensional one.

Page 10

Principal Component Analysis (PCA)

• Hence, principal components analysis is also sometimes described as a ‘data reduction’ technique.

• We can also use the principal components to construct indices (e.g. an index of deprivation).

• We can sometimes use a PCA (or factor analysis) as an EDA procedure, to ‘get a feel for the data’.

Page 11

Principal Component Analysis (PCA)

• Finally, we can use the principal component scores in other analyses, such as regression, to get around the problem of multicollinearity, although interpretation may be difficult (see the sketch after this list).

• If the original variables are nearly uncorrelated it is not worth carrying out the PCA.

• No statistical model is assumed in PCA. It is in fact a mathematical technique.

• When the components are extracted one of the mathematical constraints is that these components are uncorrelated with one another.
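As a hedged illustration of the first bullet above (principal component regression), the sketch below regresses an outcome on a few uncorrelated component scores instead of the original collinear predictors. The data, the outcome y, and the choice of 2 components are all invented for the example, and scikit-learn stands in for SPSS.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Invented example: two nearly collinear predictors among six.
rng = np.random.default_rng(1)
X = rng.normal(size=(104, 6))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=104)  # near-duplicate variable
y = X[:, 0] + rng.normal(size=104)

# Regress on a few uncorrelated component scores instead of the
# original correlated variables.
Z = StandardScaler().fit_transform(X)
scores = PCA(n_components=2).fit_transform(Z)
model = LinearRegression().fit(scores, y)
print(model.coef_)  # coefficients refer to components, so interpretation is harder
```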

Page 12

Principal Component Analysis (PCA)

Example: making a deprivation index from census data.

We have 6 census variables for 104 districts and we wish to make a district level index of ‘deprivation’ from them.

Page 13

Principal components analysis in SPSS.

• PCA is one of the data reduction techniques in SPSS.

• We choose the factor analysis option from the data reduction menu.

• PCA is the default.

• We can save the factor scores – useful if we want to look at an index of deprivation.

Page 14

Session 3: Principal components analysis practical

• We will focus on six variables:

% of people in h/h with no car (CAR0)
% of people in h/h renting from local authority (RLA)
% of people aged 60 and over (A60P)
% Adults unemployed (UNEM)
% of people in households with more than 1 person per room (DENS)
% Adults from ‘non-white’ ethnic groups (NONW)

Page 15

Graphs

Pages 16–18: [graphs]

Page 19

Descriptive Statistics

Variable                                              N     Minimum   Maximum   Mean      Std. Deviation
% persons without a car                               104     6.51     57.22    27.2973   12.32367
% persons in housing rented from local authority      104     3.69     60.93    20.7763   10.70068
% persons aged 60 and over                            104    13.06     26.28    19.3941    2.36919
% persons unemployed (of total population)            104     2.12     10.59     4.9948    1.86621
% persons in hh with over 1.5 persons per room        104      .16     13.15     1.3517    1.63813
% persons non-white                                   104      .45     44.88     9.3351   10.15503
Valid N (listwise)                                    104

Page 20

Page 21

Correlations

Pearson correlations (N = 104 for every pair; variable abbreviations as defined on page 14):

        CAR0     RLA      A60P     UNEM     DENS     NONW
CAR0    1        .829**   .095     .922**   .594**   .471**
RLA     .829**   1        .031     .807**   .508**   .311**
A60P    .095     .031     1        -.103    -.252**  -.356**
UNEM    .922**   .807**   -.103    1        .586**   .565**
DENS    .594**   .508**   -.252**  .586**   1        .758**
NONW    .471**   .311**   -.356**  .565**   .758**   1

Sig. (2-tailed):

        CAR0     RLA      A60P     UNEM     DENS     NONW
CAR0             .000     .337     .000     .000     .000
RLA     .000              .758     .000     .000     .001
A60P    .337     .758              .296     .010     .000
UNEM    .000     .000     .296              .000     .000
DENS    .000     .000     .010     .000              .000
NONW    .000     .001     .000     .000     .000

**. Correlation is significant at the 0.01 level (2-tailed).

Page 22

Principal components analysis.

Pages 23–24

Page 25

Communalities

Variable                                              Initial   Extraction
% persons without a car                               1.000     .940
% persons in housing rented from local authority      1.000     .817
% persons aged 60 and over                            1.000     .738
% persons unemployed (of total population)            1.000     .893
% persons in hh with over 1.5 persons per room        1.000     .757
% persons non-white                                   1.000     .783

Extraction Method: Principal Component Analysis.

Page 26

Total Variance Explained

            Initial Eigenvalues                      Extraction Sums of Squared Loadings
Component   Total    % of Variance   Cumulative %    Total    % of Variance   Cumulative %
1           3.589    59.819          59.819          3.589    59.819          59.819
2           1.338    22.303          82.122          1.338    22.303          82.122
3            .600     9.996          92.118
4            .282     4.705          96.823
5            .144     2.393          99.216
6            .047      .784         100.000

Extraction Method: Principal Component Analysis.

Page 27

[Scree plot: eigenvalue (y-axis, 0–4) against component number (x-axis, 1–6).]

Page 28

Component Matrix(a)

Variable                                              Component 1   Component 2
% persons without a car                               .908          .340
% persons in housing rented from local authority      .825          .369
% persons aged 60 and over                            -.173         .841
% persons unemployed (of total population)            .930          .166
% persons in hh with over 1.5 persons per room        .811          -.315
% persons non-white                                   .729          -.502

Extraction Method: Principal Component Analysis.
a. 2 components extracted.

Page 29

Component Score Coefficient Matrix

Variable                                              Component 1   Component 2
% persons without a car                               .253          .254
% persons in housing rented from local authority      .230          .276
% persons aged 60 and over                            -.048         .629
% persons unemployed (of total population)            .259          .124
% persons in hh with over 1.5 persons per room        .226          -.235
% persons non-white                                   .203          -.375

Extraction Method: Principal Component Analysis. Component Scores.

Page 30

We can save the factor scores in the data worksheet

Page 31

Page 32

Principal components analysis

• By default, SPSS stops extracting components when the eigenvalues are smaller than 1. The eigenvalues get smaller for each successive component, and many textbooks suggest that we should only consider components that have eigenvalues greater than 1 (the Kaiser criterion). For our example, using this criterion, we have reduced our 6 dimensions to 2 components.
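A small sketch of that eigenvalue-greater-than-1 rule, applied to the eigenvalues reported in the Total Variance Explained table on page 26 (Python here is only an illustrative stand-in for SPSS):

```python
import numpy as np

# Eigenvalues from the Total Variance Explained table (page 26).
eigenvalues = np.array([3.589, 1.338, .600, .282, .144, .047])

# Kaiser criterion: retain only components with eigenvalue > 1.
retained = eigenvalues[eigenvalues > 1]
print(retained)            # [3.589 1.338] -> keep 2 of the 6 components
print(retained.sum() / 6)  # ~0.82: share of total variance they explain
```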

Page 33

Principal components analysis

• Our two components (labelled U1 and U2) are:

U1 = .253*CAR0 + .230*RLA - .048*A60P + .259*UNEM + .226*DENS + .203*NONW

U2 = .254*CAR0 + .276*RLA + .629*A60P + .124*UNEM - .235*DENS - .375*NONW
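To show how these equations would be applied, here is a hedged sketch: the districts DataFrame is an assumed stand-in for the SPSS data worksheet, and the coefficients are those from the Component Score Coefficient Matrix on page 29, applied to the standardised variables.

```python
import pandas as pd

# Component score coefficients (page 29), rows in the variable order
# CAR0, RLA, A60P, UNEM, DENS, NONW.
coef = pd.DataFrame(
    {"U1": [.253, .230, -.048, .259, .226, .203],
     "U2": [.254, .276, .629, .124, -.235, -.375]},
    index=["CAR0", "RLA", "A60P", "UNEM", "DENS", "NONW"])

def component_scores(districts: pd.DataFrame) -> pd.DataFrame:
    """districts: assumed DataFrame with one row per district and the
    six census variables as columns. Returns U1 and U2 per district."""
    z = (districts - districts.mean()) / districts.std(ddof=1)  # standardise
    # component_scores(districts)["U1"] would then be the deprivation index.
    return z[coef.index] @ coef
```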

Page 34

Principal components analysis

• We can look at the size of the ‘loadings’ (coefficients) to see the relative importance of each variable in each component.

• The first component has an eigenvalue of 3.59 and explains 59.82 % of the total variance. This component has high positive coefficients for all variables except A60P.

Page 35

Principal components analysis

• The second component has an eigenvalue of 1.34 and explains 22.30 % of the total variance. This component has a high positive coefficient for A60P. The other variables have relatively small coefficients. This is because the components are constrained to be uncorrelated.

Page 36

Principal components analysis

• Collectively, the two components explain around 82.12 % of the total variance in the original 6 variables (quite a lot!).

• We can save the component scores in SPSS.

• In this case they are called fac1 and fac2, and the scores have been standardised, i.e. (score - mean(score)) / sd(score).

Page 37

Principal components analysis

• Suppose we use the first component to construct a deprivation score. The first component is:

U1 = .25*CAR0 + .23*RLA - .05*A60P + .26*UNEM + .23*DENS + .20*NONW

• Hence, higher scores on this component will tend to indicate higher deprivation.

• By substituting in the values of each variable for each district we can get a deprivation score for each district.

• In fact, SPSS will calculate the scores for us.

Page 38

Principal components analysis

Comparing two districts, we can see that:

Hackney, London has a score on component 1 of 2.92.

Chiltern, Bucks has a score on component 1 of -1.73.

Page 39

Principal components analysis

Higher scores indicate higher deprivation.

Hence, we might conclude that at the district level, Hackney, London is more deprived than Chiltern, Bucks.

Page 40

Data classification

• Introduction to cluster analysis

• The aim of cluster analysis is to classify a set of cases, such as individuals or areas, into a small number of groups. These groups are usually non-overlapping.

• Two main types of cluster analysis that we will discuss today are:

• Hierarchical cluster analysis (e.g. single linkage, complete linkage, average linkage, Ward’s method).

• Non-hierarchical cluster analysis (e.g. k-means).

• Other methods of cluster analysis are covered in detail in Everitt and Dunn (2001).

Page 41

Data classification

• Note that the techniques covered here are based on the assumption that there is no prior information on the groupings of cases.

• For situations where we know the number and composition of groups for existing cases, and we wish to allocate new cases to particular groups, we can use techniques such as discriminant analysis (see, for example, Everitt and Dunn (2001), Chapter 11).

Page 42

Hierarchical Clustering Techniques

• (Agglomerative) hierarchical clustering techniques

• A hierarchical classification is achieved in a number of steps, starting from n clusters (one for each individual) and running through to one cluster which contains all n individuals. Once fusions (i.e. putting cases into a particular cluster) are made, these cases cannot be re-allocated to different clusters.

Page 43

Hierarchical Clustering Techniques

• Pro: no need to decide how many clusters you need.

• Pro: can be ideal for things like biological classifications (see Everitt and Dunn, page 129).

• Con: imposes a structure on the data that may be inappropriate.

• Con: once a case has been allocated to a particular cluster it cannot then be re-allocated to a different cluster. However, these techniques may still be useful in an exploratory sense, to get some idea of how many clusters there are before doing a non-hierarchical analysis (which does require the number of clusters to be specified).

Page 44

Hierarchical Clustering Techniques

• Distance (or similarity) measures are often used to determine whether two observations are from a particular cluster. The most common measures are:

• The minimum distance between observations in two clusters (single linkage)

• The maximum distance between observations in two clusters (complete linkage)

• The average distance between observations in two clusters (group average)

Page 45

Hierarchical Clustering Techniques

• Ward’s method

• An alternative approach, Ward’s method, does not use a distance measure to determine clusters. Instead, clusters are formed with this method by maximizing within-cluster homogeneity, as illustrated in the sketch below.
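The linkage options above map directly onto SciPy's hierarchical clustering routines; this sketch is only an illustration (the slides themselves use SPSS, and the data here are invented):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(2)
X = rng.normal(size=(14, 6))  # invented stand-in: 14 wards by 6 variables

# Agglomerative clustering: start from n singleton clusters and fuse
# step by step; fusions are never undone.
Z_single   = linkage(X, method="single")    # minimum inter-cluster distance
Z_complete = linkage(X, method="complete")  # maximum inter-cluster distance
Z_average  = linkage(X, method="average")   # average inter-cluster distance
Z_ward     = linkage(X, method="ward")      # Ward: maximise within-cluster homogeneity

# Cut the Ward tree into 3 clusters, mirroring the k used later on.
print(fcluster(Z_ward, t=3, criterion="maxclust"))
```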

Page 46

Non-hierarchical clustering methods

Non-hierarchical methods may also be used. The most common one, which will be discussed here, is ‘k-means cluster analysis’. (A full discussion of non-hierarchical methods can be found in Sharma, page 202 ff.)

Page 47

Non-hierarchical clustering methods

Non-hierarchical clustering techniques basically follow these steps (see Sharma 1997, page 202); a sketch follows the list.

• Select k initial cluster centroids, where k is the desired number of clusters.

• Assign each case to the cluster that is closest to it.

• Re-assign each case to one of the k clusters according to a pre-determined stopping rule.
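A k-means routine implements exactly those steps internally; here is a minimal scikit-learn sketch, again an illustrative stand-in for the SPSS procedure, with invented data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.normal(size=(14, 6))  # invented stand-in: 14 wards by 6 variables

# k = 3 clusters, chosen in advance; the algorithm iterates the
# assign/re-assign steps until its stopping rule is met.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)  # final cluster centres
print(kmeans.labels_)           # cluster membership for each case
```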

Page 48

Non-hierarchical clustering methods

Methods used to obtain the initial cluster centroids include (see the sketch after this list):

• Use the first k observations as centroids.

• Randomly select k non-missing observations as cluster centres.

• Use centroids supplied by the researcher, e.g. from previous research.
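In scikit-learn terms (again illustrative, not the SPSS procedure), the three initialisation choices look like this; X and previous_centroids are invented stand-ins:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = rng.normal(size=(14, 6))                  # invented data
previous_centroids = rng.normal(size=(3, 6))  # stand-in for earlier research

# 1. Use the first k observations as centroids (n_init=1 is required
#    when explicit centroids are supplied).
km_first = KMeans(n_clusters=3, init=X[:3], n_init=1).fit(X)

# 2. Randomly select k observations as cluster centres.
km_random = KMeans(n_clusters=3, init="random", n_init=10).fit(X)

# 3. Use centroids supplied by the researcher.
km_given = KMeans(n_clusters=3, init=previous_centroids, n_init=1).fit(X)
```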

Page 49

Cluster analysis - example

Data for 14 wards in Manchester, 1991 census data.

Page 50

Cluster analysis - example

Six variables

• Overcrowding

• No car

• Renting

• Unemployed

• Ethnic group

• Students

Pages 51–54

Page 55

Initial Cluster Centers

                                    Cluster 1   Cluster 2   Cluster 3
% hh with >= 1 person per room      3.55        5.32         .78
% hh with no car                    81.36       56.27       29.70
% hh renting                        97.83       58.35       27.84
% ad unemployed                     20.48       11.26        4.53
% pp from ethnic minority           32.25       31.17        8.73
% ad students                       13.87       12.47        6.53

Page 56

Cluster Membership

Case   Ward name       Cluster   Distance
1      Ardwick         1          5.017
2      Burnage         2         23.259
3      Central         1         14.936
4      Chorlton        3          2.771
5      Crumpsall       3          7.895
6      Didsbury        3         14.495
7      Fallowfield     2          9.093
8      Hulme           1         13.426
9      Levenshulme     3          5.668
10     Longsight       2         13.782
11     Moss Side       2         21.657
12     Rusholme        2          3.883
13     Whalley Range   2         20.152
14     Withington      3          7.412

Page 57

Final Cluster Centers

                                    Cluster 1   Cluster 2   Cluster 3
% hh with >= 1 person per room      3.75        4.45        1.84
% hh with no car                    79.44       56.84       41.33
% hh renting                        93.58       59.61       35.62
% ad unemployed                     17.38       11.85        7.25
% pp from ethnic minority           21.77       30.48       11.13
% ad students                       7.62        9.07         6.49

Page 58

Distances between Final Cluster Centers

Cluster        1         2         3
1                        42.118    70.942
2              42.118              34.996
3              70.942    34.996

ANOVA

                                    Cluster              Error
                                    Mean Square   df     Mean Square   df     F        Sig.
% hh with >= 1 person per room          9.517      2       1.162       11      8.193   .007
% hh with no car                     1364.526      2      60.069       11     22.716   .000
% hh renting                         3158.030      2      46.896       11     67.340   .000
% ad unemployed                        97.419      2       5.204       11     18.720   .000
% pp from ethnic minority             510.110      2     103.500       11      4.929   .030
% ad students                           9.143      2      10.395       11       .880   .442

The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.

Page 59

Number of Cases in each Cluster

Cluster 1     3.000
Cluster 2     6.000
Cluster 3     5.000
Valid        14.000
Missing        .000

Page 60

Reading List

Sharma, Subhash (1997). Applied Multivariate Techniques. Wiley. ISBN 0-471-31064-6.
Everitt, Brian and Dunn, Graham (2001). Applied Multivariate Data Analysis. Edward Arnold. ISBN 0-340-74122-8.
Chatfield, C. and Collins, A. (1980, reprinted 1992). Introduction to Multivariate Analysis. Chapman and Hall. ISBN 0-412-16040-4.
Field, Andy (2005). Discovering Statistics Using SPSS. Sage Publications Ltd. ISBN 0761944524.