Data reduction & classification
Mark Tranmer, CCSR

TRANSCRIPT

Page 1

Data reduction & classification

Mark Tranmer, CCSR

Page 2

Introduction

• Data reduction techniques investigate the inter-relationships in a set of variables of interest, either as an exploratory technique or as a way of constructing an index from those variables. PCA is the most commonly used data reduction technique; it reduces a large number of variables to just a few dimensions.

Page 3

Data reduction: examples

Examples include:

• Construct validity of questionnaires

• Factorial ecology – geographical studies: characteristics of areas

• Latent variables – intelligence, numeracy

• Constructing deprivation indices – we will consider this in detail later

Page 4

Data Reduction

• Today we will focus on Principal Component Analysis (PCA) as a data reduction technique.

• If you want to read more about the distinction between Factor Analysis and PCA, see Chatfield & Collins (1992) and Field (2005) in the reading list.

Page 5

Data Classification

• Data classification techniques involve a way of grouping together cases that are similar in a dataset. The most common way of doing this is through a cluster analysis.

• An example of the use of this technique is to identify similar areas on the basis of a set of census variables.

Page 6

• Today we will look at data reduction and classification from a theoretical perspective with some examples. We will also try out some of these techniques using SPSS. The examples will be based on data from the 1991 UK census.

Page 7

Principal Component Analysis (PCA)

• What is it?

PCA is a multivariate statistical technique used to examine the relationships among p correlated variables.

It may be useful to transform the original set of variables into a new set of uncorrelated variables called principal components.

Page 8

Principal Component Analysis (PCA)

• These new variables are linear combinations of the original variables and are derived in decreasing order of importance.

• If we start with p variables, we can obtain p principal components.

• The first principal component accounts for the most variation in the original data.

• We do not have a dependent variable and explanatory variables in PCA.
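These slides work in SPSS; purely as an illustration of the idea, here is a minimal Python sketch (with invented random data, not the census variables) showing how the components arise: the eigenvectors of the correlation matrix supply the coefficients of the linear combinations, and the eigenvalues order the components by the variance they account for.

```python
import numpy as np

# Invented illustrative data: n cases (rows) by p variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(104, 6))

# Standardise, then take the correlation matrix of the p variables,
# as SPSS does by default for PCA.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(Z, rowvar=False)

# Eigenvectors give the coefficients of the linear combinations;
# eigenvalues give the variance each component accounts for.
eigenvalues, eigenvectors = np.linalg.eigh(R)

# eigh returns ascending order, so reverse: the first component then
# accounts for the most variation in the original data.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# p variables yield p components; the component scores are uncorrelated.
scores = Z @ eigenvectors
print(eigenvalues / eigenvalues.sum())  # proportion of variance per component
```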

Page 9

Principal Component Analysis (PCA)

• The usual objective is to see if the first few components account for most of the variation in the data. If they do, it is argued that the dimensionality of the problem is less than p.

• E.g. if we have 20 variables, and these are largely explained by the first 2 components, we might say that we do not have a 20-dimensional problem, but in fact a 2-dimensional one.

Page 10

Principal Component Analysis (PCA)

• Hence, principal components analysis is also sometimes described as a ‘data reduction’ technique.

• We can also use the principal components to construct indices (e.g. an index of deprivation).

• We can sometimes use a PCA (or factor analysis) as an EDA procedure, to ‘get a feel for the data’.

Page 11

Principal Component Analysis (PCA)

• Finally, we can use the principal component scores in other analyses, such as regression, to get around the problem of multicollinearity, although interpretation may be difficult (see the sketch after this list).

• If the original variables are nearly uncorrelated it is not worth carrying out the PCA.

• No statistical model is assumed in PCA. It is in fact a mathematical technique.

• When the components are extracted one of the mathematical constraints is that these components are uncorrelated with one another.
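As a hedged illustration of the first bullet above (principal component regression), the sketch below regresses an outcome on a few uncorrelated component scores instead of the original collinear predictors. The data, the outcome y, and the choice of 2 components are all invented for the example, and scikit-learn stands in for SPSS.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Invented example: two nearly collinear predictors among six.
rng = np.random.default_rng(1)
X = rng.normal(size=(104, 6))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=104)  # near-duplicate variable
y = X[:, 0] + rng.normal(size=104)

# Regress on a few uncorrelated component scores instead of the
# original correlated variables.
Z = StandardScaler().fit_transform(X)
scores = PCA(n_components=2).fit_transform(Z)
model = LinearRegression().fit(scores, y)
print(model.coef_)  # coefficients refer to components, so interpretation is harder
```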

Page 12

Principal Component Analysis (PCA)

Example: making a deprivation index from census data.

We have 6 census variables for 104 districts and we wish to make a district level index of ‘deprivation’ from them.

Page 13

Principal components analysis in SPSS.

• PCA is one of the data reduction techniques in SPSS.

• We choose the factor analysis option from the data reduction menu.

• PCA is the default.

• We can save the factor scores – useful if we want to look at an index of deprivation.

Page 14

Session 3: Principal components analysis practical

• We will focus on six variables:

% of people in h/h with no car (CAR0)
% of people in h/h renting from local authority (RLA)
% of people aged 60 and over (A60P)
% Adults unemployed (UNEM)
% of people in households with more than 1 person per room (DENS)
% Adults from ‘non-white’ ethnic groups (NONW)

Page 15

Graphs

Pages 16–18: [graphs]

Page 19

Descriptive Statistics

Variable                                              N     Minimum   Maximum   Mean      Std. Deviation
% persons without a car                               104     6.51     57.22    27.2973   12.32367
% persons in housing rented from local authority      104     3.69     60.93    20.7763   10.70068
% persons aged 60 and over                            104    13.06     26.28    19.3941    2.36919
% persons unemployed (of total population)            104     2.12     10.59     4.9948    1.86621
% persons in hh with over 1.5 persons per room        104      .16     13.15     1.3517    1.63813
% persons non-white                                   104      .45     44.88     9.3351   10.15503
Valid N (listwise)                                    104

Page 20

Page 21

Correlations

Pearson correlations (N = 104 for every pair; variable abbreviations as defined on page 14):

        CAR0     RLA      A60P     UNEM     DENS     NONW
CAR0    1        .829**   .095     .922**   .594**   .471**
RLA     .829**   1        .031     .807**   .508**   .311**
A60P    .095     .031     1        -.103    -.252**  -.356**
UNEM    .922**   .807**   -.103    1        .586**   .565**
DENS    .594**   .508**   -.252**  .586**   1        .758**
NONW    .471**   .311**   -.356**  .565**   .758**   1

Sig. (2-tailed):

        CAR0     RLA      A60P     UNEM     DENS     NONW
CAR0             .000     .337     .000     .000     .000
RLA     .000              .758     .000     .000     .001
A60P    .337     .758              .296     .010     .000
UNEM    .000     .000     .296              .000     .000
DENS    .000     .000     .010     .000              .000
NONW    .000     .001     .000     .000     .000

**. Correlation is significant at the 0.01 level (2-tailed).

Page 22

Principal components analysis.

Pages 23–24

Page 25

Communalities

Variable                                              Initial   Extraction
% persons without a car                               1.000     .940
% persons in housing rented from local authority      1.000     .817
% persons aged 60 and over                            1.000     .738
% persons unemployed (of total population)            1.000     .893
% persons in hh with over 1.5 persons per room        1.000     .757
% persons non-white                                   1.000     .783

Extraction Method: Principal Component Analysis.

Page 26

Total Variance Explained

            Initial Eigenvalues                      Extraction Sums of Squared Loadings
Component   Total    % of Variance   Cumulative %    Total    % of Variance   Cumulative %
1           3.589    59.819          59.819          3.589    59.819          59.819
2           1.338    22.303          82.122          1.338    22.303          82.122
3            .600     9.996          92.118
4            .282     4.705          96.823
5            .144     2.393          99.216
6            .047      .784         100.000

Extraction Method: Principal Component Analysis.

Page 27

[Scree plot: eigenvalue (y-axis, 0–4) against component number (x-axis, 1–6).]

Page 28

Component Matrix(a)

Variable                                              Component 1   Component 2
% persons without a car                               .908          .340
% persons in housing rented from local authority      .825          .369
% persons aged 60 and over                            -.173         .841
% persons unemployed (of total population)            .930          .166
% persons in hh with over 1.5 persons per room        .811          -.315
% persons non-white                                   .729          -.502

Extraction Method: Principal Component Analysis.
a. 2 components extracted.

Page 29

Component Score Coefficient Matrix

Variable                                              Component 1   Component 2
% persons without a car                               .253          .254
% persons in housing rented from local authority      .230          .276
% persons aged 60 and over                            -.048         .629
% persons unemployed (of total population)            .259          .124
% persons in hh with over 1.5 persons per room        .226          -.235
% persons non-white                                   .203          -.375

Extraction Method: Principal Component Analysis. Component Scores.

Page 30

We can save the factor scores in the data worksheet

Page 31

Page 32

Principal components analysis

• By default, SPSS stops extracting components when the eigenvalues are smaller than 1. The eigenvalues get smaller for each successive component, and many textbooks suggest that we should only consider components that have eigenvalues greater than 1 (the Kaiser criterion). For our example, using this criterion, we have reduced our 6 dimensions to 2 components.
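A small sketch of that eigenvalue-greater-than-1 rule, applied to the eigenvalues reported in the Total Variance Explained table on page 26 (Python here is only an illustrative stand-in for SPSS):

```python
import numpy as np

# Eigenvalues from the Total Variance Explained table (page 26).
eigenvalues = np.array([3.589, 1.338, .600, .282, .144, .047])

# Kaiser criterion: retain only components with eigenvalue > 1.
retained = eigenvalues[eigenvalues > 1]
print(retained)            # [3.589 1.338] -> keep 2 of the 6 components
print(retained.sum() / 6)  # ~0.82: share of total variance they explain
```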

Page 33

Principal components analysis

• Our two components (labelled U1 and U2) are:

U1 = .253*CAR0 + .230*RLA - .048*A60P + .259*UNEM + .226*DENS + .203*NONW

U2 = .254*CAR0 + .276*RLA + .629*A60P + .124*UNEM - .235*DENS - .375*NONW
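To show how these equations would be applied, here is a hedged sketch: the districts DataFrame is an assumed stand-in for the SPSS data worksheet, and the coefficients are those from the Component Score Coefficient Matrix on page 29, applied to the standardised variables.

```python
import pandas as pd

# Component score coefficients (page 29), rows in the variable order
# CAR0, RLA, A60P, UNEM, DENS, NONW.
coef = pd.DataFrame(
    {"U1": [.253, .230, -.048, .259, .226, .203],
     "U2": [.254, .276, .629, .124, -.235, -.375]},
    index=["CAR0", "RLA", "A60P", "UNEM", "DENS", "NONW"])

def component_scores(districts: pd.DataFrame) -> pd.DataFrame:
    """districts: assumed DataFrame with one row per district and the
    six census variables as columns. Returns U1 and U2 per district."""
    z = (districts - districts.mean()) / districts.std(ddof=1)  # standardise
    # component_scores(districts)["U1"] would then be the deprivation index.
    return z[coef.index] @ coef
```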

Page 34

Principal components analysis

• We can look at the size of the ‘loadings’ (coefficients) to see the relative importance of each variable in each component.

• The first component has an eigenvalue of 3.59 and explains 59.82 % of the total variance. This component has high positive coefficients for all variables except A60P.

Page 35

Principal components analysis

• The second component has an eigenvalue of 1.34 and explains 22.30 % of the total variance. This component has a high positive coefficient for A60P. The other variables have relatively small coefficients. This is because the components are constrained to be uncorrelated.

Page 36

Principal components analysis

• Collectively, the two components explain around 82.12 % of the total variance in the original 6 variables (quite a lot!).

• We can save the component scores in SPSS.

• In this case they are called fac1 and fac2, and the scores have been standardised, i.e. (score - mean(score)) / sd(score).

Page 37

Principal components analysis

• Suppose we use the first component to construct a deprivation score. The first component is:

U1 = .25*CAR0 + .23*RLA - .05*A60P + .26*UNEM + .23*DENS + .20*NONW

• Hence, higher scores on this component will tend to indicate higher deprivation.

• By substituting in the values of each variable for each district we can get a deprivation score for each district.

• In fact, SPSS will calculate the scores for us.

Page 38

Principal components analysis

Comparing two districts, we can see that:

Hackney, London has a score on component 1 of 2.92.

Chiltern, Bucks has a score on component 1 of -1.73.

Page 39

Principal components analysis

Higher scores indicate higher deprivation.

Hence, we might conclude that at the district level, Hackney, London is more deprived than Chiltern, Bucks.

Page 40

Data classification

• Introduction to cluster analysis

• The aim of cluster analysis is to classify a set of cases, such as individuals or areas, into a small number of groups. These groups are usually non-overlapping.

• Two main types of cluster analysis that we will discuss today are:

• Hierarchical cluster analysis (e.g. single linkage, complete linkage, average linkage, Ward’s method).

• Non-hierarchical cluster analysis (e.g. k-means).

• Other methods of cluster analysis are covered in detail in Everitt and Dunn (2001).

Page 41

Data classification

• Note that the techniques covered here are based on the assumption that there is no prior information on the groupings of cases.

• For situations where we know the number and composition of groups for existing cases, and we wish to allocate new cases to particular groups, we can use techniques such as discriminant analysis (see, for example, Everitt and Dunn (2001), Chapter 11).

Page 42

Hierarchical Clustering Techniques

• (Agglomerative) hierarchical clustering techniques

• A hierarchical classification is achieved in a number of steps, starting from n clusters (one for each individual) and running through to one cluster which contains all n individuals. Once fusions (i.e. putting cases into a particular cluster) are made, these cases cannot be re-allocated to different clusters.

Page 43

Hierarchical Clustering Techniques

• Pro: no need to decide how many clusters you need.

• Pro: can be ideal for things like biological classifications (see Everitt and Dunn, page 129).

• Con: imposes a structure on the data that may be inappropriate.

• Con: once a case has been allocated to a particular cluster it cannot then be re-allocated to a different cluster. However, these techniques may still be useful in an exploratory sense, to get some idea of how many clusters there are before doing a non-hierarchical analysis (which does require the number of clusters to be specified).

Page 44

Hierarchical Clustering Techniques

• Distance (or similarity) measures are often used to determine whether two observations are from a particular cluster. The most common measures are:

• The minimum distance between observations in two clusters (single linkage)

• The maximum distance between observations in two clusters (complete linkage)

• The average distance between observations in two clusters (group average)

Page 45

Hierarchical Clustering Techniques

• Ward’s method

• An alternative approach, Ward’s method, does not use a distance measure to determine clusters. Instead, clusters are formed with this method by maximizing within-cluster homogeneity, as illustrated in the sketch below.
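The linkage options above map directly onto SciPy's hierarchical clustering routines; this sketch is only an illustration (the slides themselves use SPSS, and the data here are invented):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(2)
X = rng.normal(size=(14, 6))  # invented stand-in: 14 wards by 6 variables

# Agglomerative clustering: start from n singleton clusters and fuse
# step by step; fusions are never undone.
Z_single   = linkage(X, method="single")    # minimum inter-cluster distance
Z_complete = linkage(X, method="complete")  # maximum inter-cluster distance
Z_average  = linkage(X, method="average")   # average inter-cluster distance
Z_ward     = linkage(X, method="ward")      # Ward: maximise within-cluster homogeneity

# Cut the Ward tree into 3 clusters, mirroring the k used later on.
print(fcluster(Z_ward, t=3, criterion="maxclust"))
```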

Page 46

Non-hierarchical clustering methods

Non-hierarchical methods may also be used. The most common one, which will be discussed here, is ‘k-means cluster analysis’. (A full discussion of non-hierarchical methods can be found in Sharma, page 202 ff.)

Page 47

Non-hierarchical clustering methods

Non-hierarchical clustering techniques basically follow these steps (see Sharma 1997, page 202); a sketch follows the list.

• Select k initial cluster centroids, where k is the desired number of clusters.

• Assign each case to the cluster that is closest to it.

• Re-assign each case to one of the k clusters according to a pre-determined stopping rule.
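A k-means routine implements exactly those steps internally; here is a minimal scikit-learn sketch, again an illustrative stand-in for the SPSS procedure, with invented data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.normal(size=(14, 6))  # invented stand-in: 14 wards by 6 variables

# k = 3 clusters, chosen in advance; the algorithm iterates the
# assign/re-assign steps until its stopping rule is met.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)  # final cluster centres
print(kmeans.labels_)           # cluster membership for each case
```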

Page 48

Non-hierarchical clustering methods

Methods used to obtain the initial cluster centroids include (see the sketch after this list):

• Use the first k observations as centroids.

• Randomly select k non-missing observations as cluster centres.

• Use centroids supplied by the researcher, e.g. from previous research.
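In scikit-learn terms (again illustrative, not the SPSS procedure), the three initialisation choices look like this; X and previous_centroids are invented stand-ins:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = rng.normal(size=(14, 6))                  # invented data
previous_centroids = rng.normal(size=(3, 6))  # stand-in for earlier research

# 1. Use the first k observations as centroids (n_init=1 is required
#    when explicit centroids are supplied).
km_first = KMeans(n_clusters=3, init=X[:3], n_init=1).fit(X)

# 2. Randomly select k observations as cluster centres.
km_random = KMeans(n_clusters=3, init="random", n_init=10).fit(X)

# 3. Use centroids supplied by the researcher.
km_given = KMeans(n_clusters=3, init=previous_centroids, n_init=1).fit(X)
```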

Page 49

Cluster analysis - example

Data for 14 wards in Manchester, 1991 census data.

Page 50

Cluster analysis - example

Six variables

• Overcrowding

• No car

• Renting

• Unemployed

• Ethnic group

• Students

Pages 51–54

Page 55

Initial Cluster Centers

                                    Cluster 1   Cluster 2   Cluster 3
% hh with >= 1 person per room      3.55        5.32         .78
% hh with no car                    81.36       56.27       29.70
% hh renting                        97.83       58.35       27.84
% ad unemployed                     20.48       11.26        4.53
% pp from ethnic minority           32.25       31.17        8.73
% ad students                       13.87       12.47        6.53

Page 56

Cluster Membership

Case   Ward name       Cluster   Distance
1      Ardwick         1          5.017
2      Burnage         2         23.259
3      Central         1         14.936
4      Chorlton        3          2.771
5      Crumpsall       3          7.895
6      Didsbury        3         14.495
7      Fallowfield     2          9.093
8      Hulme           1         13.426
9      Levenshulme     3          5.668
10     Longsight       2         13.782
11     Moss Side       2         21.657
12     Rusholme        2          3.883
13     Whalley Range   2         20.152
14     Withington      3          7.412

Page 57

Final Cluster Centers

                                    Cluster 1   Cluster 2   Cluster 3
% hh with >= 1 person per room      3.75        4.45        1.84
% hh with no car                    79.44       56.84       41.33
% hh renting                        93.58       59.61       35.62
% ad unemployed                     17.38       11.85        7.25
% pp from ethnic minority           21.77       30.48       11.13
% ad students                       7.62        9.07         6.49

Page 58

Distances between Final Cluster Centers

Cluster        1         2         3
1                        42.118    70.942
2              42.118              34.996
3              70.942    34.996

ANOVA

                                    Cluster              Error
                                    Mean Square   df     Mean Square   df     F        Sig.
% hh with >= 1 person per room          9.517      2       1.162       11      8.193   .007
% hh with no car                     1364.526      2      60.069       11     22.716   .000
% hh renting                         3158.030      2      46.896       11     67.340   .000
% ad unemployed                        97.419      2       5.204       11     18.720   .000
% pp from ethnic minority             510.110      2     103.500       11      4.929   .030
% ad students                           9.143      2      10.395       11       .880   .442

The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.

Page 59

Number of Cases in each Cluster

Cluster 1     3.000
Cluster 2     6.000
Cluster 3     5.000
Valid        14.000
Missing        .000

Page 60

Reading List

Sharma, Subhash (1997). Applied Multivariate Techniques. Wiley. ISBN 0-471-31064-6.
Everitt, Brian and Dunn, Graham (2001). Applied Multivariate Data Analysis. Edward Arnold. ISBN 0-340-74122-8.
Chatfield, C. and Collins, A. (1980, reprinted 1992). Introduction to Multivariate Analysis. Chapman and Hall. ISBN 0-412-16040-4.
Field, Andy (2005). Discovering Statistics Using SPSS. Sage Publications Ltd. ISBN 0761944524.