principal coordinate analysis, correspondence analysis and multidimensional scaling: multivariate...

Post on 12-Jan-2016


Principal Coordinate Analysis, Correspondence Analysis and Multidimensional Scaling: Multivariate Analysis of Association Matrices

BIOL4062/5062

Hal Whitehead

• Association matrices

• Principal Coordinates Analysis (PCO)

• Correspondence Analysis (COA)

• Multidimensional Scaling (MDS)

The Association Matrix

[Figure: a square matrix whose rows and columns are both labelled by the units A, B, C, D, E, F, G, …]

Association matrices

• Social structure
– association between individuals

• Community ecology
– similarity between species, sites
– dissimilarities between species, sites

• Genetic distances

• Correlation matrices

• Covariance matrices

• Distance matrices
– Euclidean, Penrose, Mahalanobis

Similarity

Dissimilarity

Association matrices: Symmetric/Asymmetric

Symmetric: Genetic relatedness among bottlenose dolphins (Krutzen et al. 2003)

       LAT    GRI    VAX    KRI    MYR    WOW    HOB    WBE    HOR    AJA    PIK    ANV
GRI  -0.24
VAX   0.02   0.08
KRI   0.02  -0.04  -0.19
MYR  -0.27   0.44  -0.03  -0.11
WOW   0.22   0.11   0.32  -0.10   0.10
HOB  -0.04   0.11  -0.17  -0.13  -0.08  -0.12
WBE   0.15   0.07  -0.08   0.08  -0.08   0.23   0.13
HOR  -0.08   0.21  -0.14  -0.23   0.18   0.12   0.11   0.26
AJA  -0.24   0.23  -0.04  -0.16  -0.01  -0.16   0.07   0.25   0.32
PIK  -0.11   0.35  -0.07   0.04   0.02  -0.05   0.09   0.60   0.21   0.27
ANV  -0.05  -0.23  -0.39  -0.39  -0.21  -0.13  -0.41   0.11   0.11   0.02  -0.06
VEE   0.14   0.02   0.15  -0.11  -0.08   0.00  -0.09  -0.05   0.06   0.01  -0.17  -0.17

Asymmetric: Grooming rates of capuchin monkeys (Perry 1996)

                      Recipient
Actor      A      S      N      D      W      T
A          -     5.8    3.5    2.1    2.3   0.04
S        41.6     -    28.6   18.1    9.0    7.4
N        10.3   25.5     -     9.6    9.9    4.3
D        23.3    9.3   10.5     -    13.4    6.9
W        21.2   15.2   14.6   25.1     -    10.4
T         2.5    2.9    3.7    3.6    5.3     -

Principal Coordinates Analysis

• Consider a symmetric dissimilarity matrix

     A   B   C
B    5
C    3   7
D    5   4   4

• As a distance matrix

• And then plot it

Principal Coordinates Analysis

[Figure: the same matrix, with points A, B, C and D plotted so that the lines joining them have lengths 5, 3, 7, 5, 4 and 4]

Can represent:
– distances between 2 points in 1 dimension
– distances between 3 points in 2 dimensions
– distances between 4 points in 3 dimensions
– …
– distances between k points in k-1 dimensions

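Principal Coordinates Analysis: HOWEVER!

Triangle inequality violated if:

AB + AC < BC

No representation possible

[Figure: the matrix and plot as before, with A and B 5 apart and the residual labels "10" and "C ??" marking that C cannot be placed]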
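The check described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `triangle_violations` is hypothetical, and the second matrix, in which the B to C dissimilarity is raised from 7 to 10, is an assumed example suggested by the "10" on the slide rather than data taken from it.

```python
from itertools import combinations

def triangle_violations(D, labels):
    """List (i, j, k) where going from i to j via k is shorter than D[i][j]."""
    n = len(labels)
    violations = []
    for i, j in combinations(range(n), 2):
        for k in range(n):
            if k in (i, j):
                continue
            if D[i][k] + D[k][j] < D[i][j]:
                violations.append((labels[i], labels[j], labels[k]))
    return violations

labels = ["A", "B", "C", "D"]

# Dissimilarity matrix from the slides (AB=5, AC=3, BC=7, AD=5, BD=4, CD=4).
D_ok = [
    [0, 5, 3, 5],
    [5, 0, 7, 4],
    [3, 7, 0, 4],
    [5, 4, 4, 0],
]

# Assumed variant: B-C raised to 10, so AB + AC = 8 < BC.
D_bad = [row[:] for row in D_ok]
D_bad[1][2] = D_bad[2][1] = 10

print(triangle_violations(D_ok, labels))
print(triangle_violations(D_bad, labels))
```

Each reported triple names the pair whose dissimilarity is too large, followed by the third point that exposes the violation.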

Principal Coordinates Analysis

• Take distance (dissimilarity) matrix with k units

• Represent as k points in k-1 dimensional space
– if triangle inequality holds throughout

• Find direction of greatest variability
– 1st Principal Coordinate

• Find direction of next greatest variability (orthogonal)
– 2nd Principal Coordinate

• …

• k-1 Principal Coordinates

Reduces dimensionality of representation

Principal Coordinates Analysis

• Eigenvectors of the (centred) distance matrix give principal coordinates

• Eigenvalues give proportion of variance accounted for

• Triangle inequality equivalent to:
– matrix is positive semi-definite
– no unreal eigenvectors
– no negative eigenvalues
– analysis probably OK if there are only a few small negative eigenvalues
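One common way to carry out these steps is Gower's double-centring of the squared distances followed by an eigendecomposition. The sketch below assumes that formulation; the function name `pco` and the tolerance used to discard zero or negative eigenvalues are illustrative choices, not something specified in the slides. It is run on the small A, B, C, D matrix used earlier.

```python
import numpy as np

def pco(D):
    """Principal coordinates from a symmetric distance matrix D (a sketch)."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centring matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(B)      # B is symmetric
    order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep only clearly positive eigenvalues; negative ones have no real axis.
    keep = eigvals > 1e-9 * np.abs(eigvals).max()
    coords = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    return coords, eigvals

D_abcd = np.array([
    [0, 5, 3, 5],
    [5, 0, 7, 4],
    [3, 7, 0, 4],
    [5, 4, 4, 0],
])
coords_abcd, eigvals_abcd = pco(D_abcd)
print("eigenvalues:", np.round(eigvals_abcd, 3))
print("coordinates (one row per unit):")
print(np.round(coords_abcd, 3))
```

If every eigenvalue is non-negative, the distances between the rows of `coords` reproduce the input matrix; any negative eigenvalue signals that no exact Euclidean representation exists.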

Principal Coordinates Analysis (PCO) & Principal Components Analysis (PCA)

• PCO is equivalent to PCA on covariance matrix of transposed data matrix if distance matrix is Euclidean

• PCO is equivalent to PCA on correlation matrix of transposed data matrix if distance matrix is Penrose

• PCO only gives information on units or variables, not both

• Axes (principal coordinates) rarely interpretable in PCO

Principal Coordinates Analysis

Proportion of time chickadees seen together at feeder

       SCAO   AOPR   ARPO   YOSA   ROAY   SORA   BJAO
SCAO   1.00
AOPR   0.18   1.00
ARPO   0.07   0.27   1.00
YOSA   0.26   0.12   0.12   1.00
ROAY   0.21   0.19   0.18   0.31   1.00
SORA   0.06   0.02   0.03   0.15   0.04   1.00
BJAO   0.19   0.17   0.09   0.16   0.21   0.28   1.00

Ficken et al. Behav. Ecol. Sociobiol. 1981

Principal Coordinates Analysis

Proportion of time chickadees seen together at feeder

Transformed to distance matrix (√(1-X))

       SCAO   AOPR   ARPO   YOSA   ROAY   SORA   BJAO
SCAO   0.00
AOPR   0.91   0.00
ARPO   0.96   0.85   0.00
YOSA   0.86   0.94   0.94   0.00
ROAY   0.89   0.90   0.91   0.83   0.00
SORA   0.97   0.99   0.98   0.92   0.98   0.00
BJAO   0.90   0.91   0.95   0.92   0.89   0.85   0.00

Principal Coordinates Analysis: Chickadees at Feeder

[Figure: the seven chickadees (SCAO, AOPR, ARPO, YOSA, ROAY, SORA, BJAO) plotted on the 1st principal coordinate (-0.4 to 0.6) against the 2nd principal coordinate (-0.4 to 0.2), shown beside the association matrix]

Prin Coord   % explained   Cumulative   Eigenvalue
1            22.77         22.77        0.575
2            20.05         42.82        0.507
3            16.63         59.45        0.420
4            15.17         74.62        0.383
5            13.37         87.98        0.338
6            12.02         100.00       0.304
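The eigenvalue table can be explored with a short NumPy sketch applied to the chickadee association matrix. It assumes the distances are √(1-X), which is what the printed distance values correspond to, and it uses the usual double-centring form of principal coordinates; neither the variable names nor the tolerance come from the slides. Under these assumptions the eigenvalues sum to 17.69/7 ≈ 2.527, the same total as the slide's eigenvalue column, so the individual values and percentages can be compared with the table.

```python
import numpy as np

names = ["SCAO", "AOPR", "ARPO", "YOSA", "ROAY", "SORA", "BJAO"]
lower = [
    [1.00],
    [0.18, 1.00],
    [0.07, 0.27, 1.00],
    [0.26, 0.12, 0.12, 1.00],
    [0.21, 0.19, 0.18, 0.31, 1.00],
    [0.06, 0.02, 0.03, 0.15, 0.04, 1.00],
    [0.19, 0.17, 0.09, 0.16, 0.21, 0.28, 1.00],
]

# Rebuild the full symmetric association matrix from its lower triangle.
n = len(names)
S = np.zeros((n, n))
for i, row in enumerate(lower):
    S[i, : len(row)] = row
S = S + S.T - np.diag(np.diag(S))

# Assumed transform: distance = sqrt(1 - X), so squared distance = 1 - X.
D2 = 1.0 - S

# Double-centre and take eigenvalues, largest first.
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J
eigvals = np.linalg.eigvalsh(B)[::-1]

# Percent explained, computed over the positive eigenvalues only.
positive = eigvals[eigvals > 1e-9]
pct = 100 * positive / positive.sum()
for k, (ev, p, cum) in enumerate(zip(positive, pct, np.cumsum(pct)), start=1):
    print(f"{k}  {p:6.2f}  {cum:6.2f}  {ev:.3f}")
```

The percentage for each coordinate is simply its eigenvalue divided by the sum of the eigenvalues, which is also how the "% explained" column relates to the "Eigenvalue" column in the table.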

Correspondence Analysis

• Uses incidence matrix
– counts indexed by two factors
– e.g., Archaeology: tombs X artifacts
– e.g., Community ecology: sites X species

• Data matrix with counts and many zeros

Correspondence Analysis

• Distance between two species, i and j, over sites k=1,…,p is (“Chi-squared” measure):

$$D_{ij} = \sqrt{\sum_{k=1}^{p} \frac{\left(x_{ik}/r_i - x_{jk}/r_j\right)^2}{c_k}}$$

r_i species totals

c_k site totals

• {Difference in proportions of each species at each site}

Then do Principal Coordinates Analysis

Correspondence Analysis

• Distance between two species, i and j, over sites k=1,…,p is (“Chi-squared” measure):

$$D_{ij} = \sqrt{\sum_{k=1}^{p} \frac{\left(x_{ik}/r_i - x_{jk}/r_j\right)^2}{c_k}}$$

• Distance between two sites, k and l, over species i=1,…,n is:

$$D_{kl} = \sqrt{\sum_{i=1}^{n} \frac{\left(x_{ik}/c_k - x_{il}/c_l\right)^2}{r_i}}$$

Correspondence Analysis Example: Sperm Whale Haplotypes by Clan

mtDNA haplotype   Reg   Short   4-plus
#1                48    28      2
#2                8     27      11
#3                9     26      0
#4                0     0       3
#5                1     2       1
#6                1     0       5
#7                4     0       0
#8                0     4       1
#9                0     2       0
#11               3     0       0
#12               0     1       0
#13               4     1       0
#14               1     0       0
#15               1     0       0
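As a rough illustration of the "Chi-squared" distance, the sketch below applies it to the haplotype-by-clan counts above. It assumes x is the count in each cell, uses raw row and column totals as in the formulas, and takes the square root of the weighted sum; the function name `chi2_distances` is hypothetical. Passing the transposed table gives distances between clans over haplotypes, which corresponds to the "distance between two sites" form.

```python
import numpy as np

def chi2_distances(X):
    """Chi-squared distances between the rows of a count table X (a sketch)."""
    X = np.asarray(X, dtype=float)
    r = X.sum(axis=1)                       # row totals
    c = X.sum(axis=0)                       # column totals
    profiles = X / r[:, None]               # proportions within each row
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2 / c).sum(axis=-1))

clans = ["Reg", "Short", "4-plus"]
counts = np.array([
    [48, 28, 2],   # haplotype #1
    [8, 27, 11],   # #2
    [9, 26, 0],    # #3
    [0, 0, 3],     # #4
    [1, 2, 1],     # #5
    [1, 0, 5],     # #6
    [4, 0, 0],     # #7
    [0, 4, 1],     # #8
    [0, 2, 0],     # #9
    [3, 0, 0],     # #11
    [0, 1, 0],     # #12
    [4, 1, 0],     # #13
    [1, 0, 0],     # #14
    [1, 0, 0],     # #15
])

D_haplotypes = chi2_distances(counts)      # 14 x 14, between haplotypes
D_clans = chi2_distances(counts.T)         # 3 x 3, between clans
print(clans)
print(np.round(D_clans, 3))
```

Either distance matrix can then be passed to a principal coordinates analysis, as the slides describe.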

[Figure: Correspondence Plot. Dim(1) (eigenvalue 0.394) against Dim(2) (eigenvalue 0.205), both axes from -3 to 3, showing the haplotypes #1 to #15 together with the clans Reg, Short and 4-plus]

Multidimensional Scaling

• “Non-parametric version of principal coordinates analysis”

• Given an association matrix between units:
– tries to find a representation of the units in a given number of dimensions
– preserving the pattern/ordering in the association matrix

Multidimensional Scaling: How it works

1 Provide association matrix (similarity/dissimilarity)

2 Provide number of dimensions

3 Produce initial plot, perhaps using Principal Coordinates

4 Orders distances on plot, compares them with ordering of association matrix

5 Computes STRESS

6 Juggles points to reduce STRESS

7 Go to 4, until STRESS is stabilized

8 Output plot, STRESS

9 Perhaps repeat with new starting conditions
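The loop in steps 1 to 9 can be sketched with a deliberately simplified stand-in. The code below performs metric least-squares scaling with the SMACOF (Guttman transform) update: it fits the plot distances to the dissimilarities themselves rather than to their rank order, so it is not the non-metric procedure outlined above, only an illustration of the "compute STRESS, juggle points, repeat" cycle. Function names, the random starting plots, the tolerance and the number of restarts are all assumptions.

```python
import numpy as np

def plot_distances(X):
    """Euclidean distances between all pairs of points (rows of X)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(D, X):
    """Misfit between dissimilarities D and plot distances, scaled by D."""
    iu = np.triu_indices_from(D, k=1)
    resid = D[iu] - plot_distances(X)[iu]
    return np.sqrt((resid ** 2).sum() / (D[iu] ** 2).sum())

def metric_mds(D, n_dims=2, seed=0, max_iter=1000, tol=1e-9):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(size=(n, n_dims))           # step 3: initial plot (random)
    history = [stress(D, X)]                   # step 5: compute STRESS
    for _ in range(max_iter):
        d = plot_distances(X)                  # step 4: distances on plot
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(d > 0, D / d, 0.0)
        B = -ratio
        B[np.diag_indices(n)] = ratio.sum(axis=1)
        X = B @ X / n                          # step 6: move points
        history.append(stress(D, X))
        if history[-2] - history[-1] < tol:    # step 7: STRESS stabilised
            break
    return X, history                          # step 8: output plot, STRESS

# Steps 1 and 2: dissimilarity matrix (A, B, C, D from the earlier slides)
# and the number of dimensions.
D = np.array([
    [0, 5, 3, 5],
    [5, 0, 7, 4],
    [3, 7, 0, 4],
    [5, 4, 4, 0],
], dtype=float)

# Step 9: repeat from several starting plots and keep the lowest STRESS.
runs = [metric_mds(D, n_dims=2, seed=s) for s in range(5)]
best_X, best_history = min(runs, key=lambda run: run[1][-1])
print(f"final STRESS: {best_history[-1]:.3f}")
```

A non-metric version would replace the direct comparison with `D` by a monotone regression of plot distances on the ordering of the associations at each pass.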

Multidimensional Scaling

• STRESS:

$$\mathrm{STRESS} = \sqrt{\frac{\sum_{i,j} \left(d_{ij} - x_{ij}\right)^2}{\sum_{i,j} d_{ij}^2}}$$

• d_ij associations between i and j

• x_ij associations between i and j predicted using distances on plot (by regression)
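To make the roles of d_ij and x_ij concrete, here is a minimal sketch that computes STRESS for a given plot. It assumes a straight-line regression of the associations on the plot distances (the metric case) and includes the square root of the ratio; the function name `stress_from_plot`, the four-point configuration and the association values are all hypothetical.

```python
import numpy as np

def stress_from_plot(assoc, X):
    """STRESS of configuration X for association matrix assoc (a sketch).

    d = observed associations, x = associations predicted from the plot
    distances by simple linear regression.
    """
    iu = np.triu_indices_from(assoc, k=1)          # each pair i < j once
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))[iu]   # distances on plot
    d = assoc[iu]
    slope, intercept = np.polyfit(dist, d, 1)      # regress d on distance
    x = slope * dist + intercept                   # predicted associations
    return np.sqrt(((d - x) ** 2).sum() / (d ** 2).sum())

# Hypothetical plot of four units and hypothetical similarities between them.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
assoc = np.array([
    [1.0, 0.8, 0.5, 0.3],
    [0.8, 1.0, 0.4, 0.6],
    [0.5, 0.4, 1.0, 0.7],
    [0.3, 0.6, 0.7, 1.0],
])
print(f"STRESS = {stress_from_plot(assoc, X):.3f}")
```

When the associations are an exact linear function of the plot distances the predicted and observed values coincide and STRESS is zero; a non-metric analysis would use a monotone rather than a linear regression at this step.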

Multidimensional Scaling

• Iterative
– No unique solution
– Try with different starting positions

• Different possible definitions of STRESS

Multidimensional Scaling: Shepard Diagrams

Metric Scaling (Stress 23%): similar plots to Principal Coordinates

Non-metric Scaling (Stress 16%): easier to fit

[Figure: two Shepard diagrams, one for each method, plotting Distances (0 to 3) against Data, the association values (0.1 to 1.0)]

Genetic distances between sperm whale groups

[Figure: Metric MDS configuration, Dimension-1 against Dimension-2 (both -2 to 2), with groups 3, 21, 23, 24, 37, 39, 40, 41, 43, 44, 45, 46, 48 and 62. Stress 23%]

[Figure: Non-Metric 2-D MDS configuration, Dimension-1 against Dimension-2 (both -2 to 2), same groups. Stress 16%]

[Figure: Non-Metric 3-D MDS configuration, Dimension-1, Dimension-2 and Dimension-3, same groups. Stress 8%]

Principal coordinates: 13/14 eigenvalues negative, not a good representation

Multidimensional Scaling

• How many dimensions?
– STRESS <10% is “good representation”
– Scree diagram
– two (or three) dimensions for visual ease

• Metric or non-metric?
– Metric has few advantages over Principal Coordinates Analysis (unless many negative eigenvalues)
– Non-metric does better with fewer dimensions

Non-metric Multidimensional Scaling vs. Principal Coordinates Analysis

                            Principal Coordinates   MDSCAL
Scaling:                    Metric                  Non-metric
Input:                      Distance matrix         Association matrix
Matrix:                     Pos. Semi-Def.          -
Solution:                   Unique                  Iterative
Max. Units:                 100's                   25-100
Dimensions:                 More                    Fewer
Choose no. of dimensions:   Afterwards              Before
