lecture 5: ecological distance metrics; principal

Lecture5:Ecologicaldistancemetrics;PrincipalCoordinatesAnalysis

Univariate testingvs.communityanalysis

• Univariate testingdealswithhypothesesconcerningindividualtaxa– Isthistaxondifferentiallypresent/abundantindifferentsamples?– Isthistaxoncorrelatedwithagivencontinuousvariable?

• Whatifwewouldliketodrawconclusionsaboutthecommunityasawhole?

2

Usefulideasfrommodernstatistics

• Distancesbetweenanything(abundances,presence-absence,graphs,trees);

• Directhypothesesbasedondistances;• Decompositionsthroughiterativestructuration;• Projections;• Randomizationtests,probabilisticsimulations.

3

Dataà Distancesà Statistics0.54

0.54 0.34

0.60 0.17 0.10

0.01 0.75 0.17 0.72

0.24 0.34 0.74 0.89 0.58

0.08 0.43 0.82 0.91 0.69 0.64

0.57 0.06 0.59 0.03 0.69 0.89 0.07

0.88 0.68 0.59 0.49 0.19 0.64 0.54 0.26

0.24 0.71 0.58 0.52 0.93 0.94 0.46 0.70 0.30

0.92 0.46 0.17 0.21 0.17 0.02 0.81 0.22 0.45 0.59

0.51 0.23 0.74 0.36 0.15 1.00 0.95 0.71 0.88 0.94 0.65

0.29 0.30 0.75 0.28 0.04 0.45 0.15 0.44 0.59 0.89 0.07 0.19

0.57 0.63 0.27 0.72 0.34 0.70 0.19 0.17 0.85 0.07 0.97 0.24 0.45

0.95 0.53 0.86 0.01 0.10 0.63 0.40 0.10 0.51 1.00 0.20 0.41 0.11 0.14

0.83 0.43 0.12 0.39 0.43 0.87 0.42 0.17 0.62 0.57 0.03 0.58 0.54 0.48 0.83

4

S-1 S-2 … S-16

OTU-1

OTU-2

OTU-3

OTU-4

OTU-5

….

OTU-10,000

Visualization(Principalcoordinatesanalysis)Statisticalhypothesistesting(PERMANOVA)

Whatisadistancemetric?

• Scalarfunctiond(.,.)oftwoarguments• d(x,y)>=0,alwaysnonnegative;• d(x,x)=0,distancetoselfis0;• d(x,y)=d(y,x),distanceissymmetric;• d(x,y)<=d(x,z)+d(z,y),triangleinequality.

5

WHATARESOMEGOODDISTANCEMETRICS?

6

Usingdistancestocapturemultidimensionalheterogeneousinformation

• A“good”distancewillenableustoanalyzeanytypeofdatausefully

• Wecanbuildspecializeddistancesthatincorporatedifferenttypesofinformation(abundance,trees,geographicallocations,etc.)

• Wecanvisualizecomplexdataaslongasweknowthedistancesbetweenobjects(observations,variables)

• Wecancomputedistances(correlations)betweendistancestocomparethem

• WecandecomposethesourcesofvariabilitycontributingtodistancesinANOVA-likefashion

7

Distanceandsimilarity

• Sometimesitisconceptuallyeasiertotalkaboutsimilaritiesratherthandistances– E.g.sequencesimilarity

• Anysimilaritymeasurecanbeconvertedintoadistancemetric,e.g.– S– IfSis(0,1),D=1-S– IfS>0,D=1/SorD=exp(-S)

8

Afewusefuldistancesandsimilarityindices• Distances:

– Euclidean:(rememberPythagorastheorem)Σ(xi-yi)2– WeightedEuclidean:χ2 = Σ(ei - oi)2/ei– Hamming/L1,BrayCurtis=Σ 1{xi=yi}– Unifrac (later)– Jensen-Shannon:(D(X|M)+D(Y|M))/2,where

• M =(X +Y)/2• Kullback-Leibler divergence:D(X|Y)=Σ ln[xi/yi]xi

• Similarity:– Correlationcoefficient– Matchingcoefficient:(f11+f00)/(f11 +f10 +f01+f00)– Jaccard SimilarityIndex:f11/(f11 +f10 +f01)

x\y 1 0

1 f11 f100 f01 f00

9

Unifrac distance(Lozupone andKnight,2005)• Isadistancebetweengroupsoforganismsrelatedbyatree• Definition:Ratioofthesumofthelengthofthebranches

leadingtosampleXorY,butnotboth,tothesumofallbranchlengthsofthetree.

10

WeightedUnifrac (Lozupone etal.,2007)

� � � � � �

-RGSVTSVEXI�XE\E�EFYRHERGIW�ERH�TL]PSKIRIXMGXVII

Taxonomy Table taxonomyTableslots: .Data

OTU Abundanceclass: otuTableslots: .Data, speciesAreRows

Sample VariablessampleDataslots: .Data,names,row.names,.S3Class

Phylogenetic Treeclass: phyloslots: see ape package

matrix matrixdata.frame

phyloseqslots:otuTablesampleDatataxTabtre

Experiment-level data object:

Component data objects:

� � � � � �

9RMJVEG�(MWXERGI��0S^YTSRI�ERH�/RMKLX� �� MW�E�HMWXERGI�FIX[IIR�KVSYTW�SJ�SVKERMWQW�XLEX�EVI�VIPEXIHXS�IEGL�SXLIV�F]�E�XVII�7YTTSWI�[I�LEZI�XLI�389W�TVIWIRX�MR�WEQTPI��FPYI �ERHMR�WEQTPI��VIH �5YIWXMSR� (S�XLI�X[S�WEQTPIW�HMJJIV�TL]PSKIRIXMGEPP]#-X�MW�HI¿RIH�EW�XLI�VEXMS�SJ�XLI�WYQ�SJ�XLI�PIRKXLW�SJ�XLIFVERGLIW�PIEHMRK�XS�QIQFIVW�SJ�KVSYT�% SV�QIQFIVW�SJKVSYT�& FYX�RSX�FSXL�XS�XLI�XSXEP�FVERGL�PIRKXL�SJ�XLIXVII�

� � � � � �

;IMKLXIH�9RMJVEG�HMWXERGI % QSHM¿GEXMSR�SJ�9RM*VEG�[IMKLXIH�9RM*VEG�MW�HI¿RIH�MR��0S^YTSRI�IX�EP�� EW

R∑

M=1

FM × | %M%8

− &M&8 |

� R = RYQFIV�SJ�FVERGLIW�MR�XLI�XVII� FM !�PIRKXL�SJ�XLI�MXL�FVERGL� %M !�RYQFIV�SJ�HIWGIRHERXW�SJMXL�FVERGL�MR�KVSYT�%

� %8 !�XSXEP�RYQFIV�SJ�WIUYIRGIWMR�KVSYT�%

?#ASV�GER�FI�WIIR�EW�F]�TVSFEFMPMWXW�EW�XLI�;EWWIVWXIMRHMWXERGI��IEVXL�QSZIVW �?#A� � � � � � �

11

KeywarningsabouttheUnifrac familyofdistances.

• Thescalingofthetreeisimportant.• Therootingofthetreeisimportant.

12

Anoteofwarning!

• “Garbagein,garbageout”• Wrongnormalization=>wrongdistance=>wronganswer• However,giventhemanychoicesthereisn’tmuchbeyondpriorknowledge,experienceandintuitiontoguideinselectionofthedistance.

13

PRINCIPALCOORDINATES ANALYSIS-MULTIDIMENSIONALSCALING

14

v2

v1

v3

1

2

3

vvv

é ùê ú= ê úê úë û

v

15

Everymultivariatesamplecanberepresentedasavectorinsomevectorspace

16

4 5 6 7 8

12

34

56

7

2.02.5

3.03.5

4.04.5

iris$Sepal.Length

iris$Sepal.W

idth

iris$Petal.Length

●

●●

●●

●

●

●

●

●

●●

●● ●

●

●

●

●●

● ●●●●

●●

●

● ●●●

●●● ●●

●●

●

●

●

●

● ●

●

●●

●

●●

●● ●

●

●

●

●●●

●

●

●

●●●●

● ● ●

●

●

●

●

● ●

●

● ●

●

●

●

●

●

●●

●

●●

●

●

●●

●

●

●●

●●

●

● ●

●

●

●

●

●

●

●

●●

●

●●

●

● ●

●●

●●

●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

VectorBasis

• Abasisisasetoflinearlyindependent(dotproductiszero)vectorsthatspan thevectorspace.

• Spanning thevectorspace:Anyvectorinthisvectorspacemayberepresentedasalinearcombinationofthebasisvectors.

• Thevectorsformingabasisareorthogonaltoeachother.Ifallthevectorsareoflength1,thenthebasisiscalledorthonormal.

17

4 5 6 7 8

12

34

56

7

2.02.5

3.03.5

4.04.5

iris$Sepal.Length

iris$Sepal.W

idth

iris$Petal.Length

●

●●

●●

●

●

●

●

●

●●

●● ●

●

●

●

●●

● ●●●●

●●

●

● ●●●

●●● ●●

●●

●

●

●

●

● ●

●

●●

●

●●

●● ●

●

●

●

●●●

●

●

●

●●●●

● ● ●

●

●

●

●

● ●

●

● ●

●

●

●

●

●

●●

●

●●

●

●

●●

●

●

●●

●●

●

● ●

●

●

●

●

●

●

●

●●

●

●●

●

● ●

●●

●●

●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

18

4 5 6 7 8 9 10 11

−3−2

−1 0

1 2

3

−0.8−0.6

−0.4−0.2

0.0 0.2

0.4 0.6

0.8

pca$li$Axis1

pca$

li$Ax

is3

pca$

li$Ax

is2

●●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

● ●

●

●

● ●●

●

●

●

●

●

●

●●

●

●●

●●

●

●

●

●

●●

●●

●

●

●

●

●

●

● ●

●

●●

●

●

● ●

●

●

●

●●

●

●

●

●

●

●

●● ●

●● ●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●●

● ●●●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●

● ●

●

●

●

●●●

●●

●

●

●

●●

●

●

●

●

●

●● ●●●

Basicideaforanalysisofmultidimensionaldata

• Computedistances• Reducedimensions• EmbedinEuclideanspace• ThegeneralframeworkbehindthisprocessiscalledDualitydiagram:(Xnxp,Qpxp,Dnxn)– Xnxp (centered)datamatrix– Qpxp columnweights(weightsonvariables)– Dnxn rowweights(weightsonobservations)

19

Dualitydiagramdefinesthegeometryofmultivariateanalysis

S. Holmes/The French way. 2

which were translated into English. My PhD advisor, Yves Escoufier [8][10] pub-licized the method to biologists and ecologists presenting a formulation based onhis RV-coe�cient, that I will develop below. The first software implementation ofduality based methods described here were done in LEAS (1984) a Pascal programwritten for Apple II computers. The most recent implementation is the R pack-age ade-4 (see A for a review of various implementations of the methods describedhere).

2.1. Notation

The data are p variables measures on n observations. They are recorded in a matrixX with n rows (the observations) and p columns (the variables). D

n

is an nxnmatrix of weights on the ”observations”, most often diagonal. We will also use a”neighborhood” relation (thought of as a metric on the observations) defined bytaking a symmetric definite positive matrix Q. For example, to standardize thevariables Q can be chosen as

Q =

0

B

B

B

@

1�

21

0 0 0 ...

0 1�

22

0 0 ...

0 0 1�

23

0 ...

... ... ... 0 1�

2p

1

C

C

C

A

.

These three matrices form the essential ”triplet” (X,Q,D) defining a multivariatedata analysis. As the approach here is geometrical, it is important to see that Qand D define geometries or inner products in Rp and Rn respectively through

xtQy =< x, y >Q

x, y 2 Rp

xtDy =< x, y >D

x, y 2 Rn

From these definitions we see there is a close relation between this approach and

kernel based methods, for more details see [24]. Q can be seen as a linear function

from Rp to Rp⇤ = L(Rp), the space of scalar linear functions on Rp. D can be seen

as a linear function from Rn to Rn⇤ = L(Rn). Escoufier[8] proposed to associate to

a data set an operator from the space of observations Rp into the dual of the space

of variables Rn⇤. This is summarized in the following diagram [1] which is made

commutative by defining V and W as XtDX and XQXt respectively.

Rp⇤ ��!X

Rn

Q

x

?

?

?

?

?

?

y

V D

?

?

?

y

x

?

?

?

W

Rp ��Xt

Rn⇤

This is known as the duality diagram because knowledge of the eigendecompo-sition of XtDXQ = V Q leads to that of the dual operator XQXtD. The mainconsequence is an easy transition between principal components and principal axes

imsart-lnms ver. 2005/05/19 file: dfc.tex date: July 30, 2006

• V=XTDX• W=XQXT

• Duality:– Theeigen decompositionofVQleadstoeigen-decompositionofWD

• Inertia isequaltotrace(sumofthediagonalelements)ofVQorWD.

20

PrincipalComponentAnalysis(PCA)

• LetQ=IandD=1/nIandletXbecentered.• VQ=XTDXQ=1/nXTX.• TheinertiaTr(VQ)=sumofthevariances.• PCAdecomposesthevarianceofXintoindependentcomponents.• Todecomposetheinertiameanstofindtheeigen-systemofVQorequivalentlyWDmatrices.

• Eigenvaluesgivetheamountofinertiaexplainedincorrespondingdimension.

• Eigenvectorsgivethedimensionsofvariability.

21

ExamplePCAhead(USArrests)

Murder Assault UrbanPop RapeAlabama 13.2 236 58 21.2Alaska 10.0 263 48 44.5Arizona 8.1 294 80 31.0Arkansas 8.8 190 50 19.5California 9.0 276 91 40.6Colorado 7.9 204 78 38.7

−0.2 −0.1 0.0 0.1 0.2 0.3

−0.2

−0.1

0.0

0.1

0.2

0.3

Comp.1

Com

p.2

AlabamaAlaska

Arizona

Arkansas

California

ColoradoConnecticut

Delaware

Florida

Georgia

Hawaii

Idaho

Illinois

Indiana IowaKansas

KentuckyLouisiana

MaineMaryland

Massachusetts

Michigan

Minnesota

Mississippi

Missouri

Montana

Nebraska

Nevada

New Hampshire

New Jersey

New Mexico

New York

North Carolina

North Dakota

Ohio

Oklahoma

Oregon Pennsylvania

Rhode Island

South Carolina

South DakotaTennessee

Texas

Utah

Vermont

Virginia

Washington

West Virginia

Wisconsin

Wyoming

−5 0 5−5

05

Murder

Assault

UrbanPop

Rape

Comp.1 Comp.2 Comp.3 Comp.4

pc.cr

Variances

0.0

0.5

1.0

1.5

2.0

Screeplot:plotofinertia

Biplot

Loadings:Comp.1 Comp.2 Comp.3 Comp.4

Murder -0.536 0.418 -0.341 0.649Assault -0.583 0.188 -0.268 -0.743UrbanPop -0.278 -0.873 -0.378 0.134Rape -0.543 -0.167 0.818

22

Centering

• LetYbenotcentereddatamatrixwithnobservations(rows)andpvariables(columns)

• LetH =(I – 1/n1x1’)• ThenX=HYiscentered

23

FromEuclideandistancestoPCAtoPCoA

• NotethatifD isaEuclideandistance,then• XX’=1/nH D(2) H.• PCoA isageneralizationofPCAinthatknowledgeofXisnotrequired,allyouneedtorepresentthepointsisD,theinter-pointdistancematrix.

24

Representationof(arbitrary)distancesinEuclideanspace

• Theideaistousesingularvaluedecomposition(SVD)onthecenteredinterpoint distancematrixtoextractEuclideandimensions

• SVD:X=USV,whereSisdiagonalmatrixwithdiagonalelementss1,s2,…,sn,andUandVareunitmatrices(i.e.theirdeterminantis1andtheyspantheircorrespondingspaces)

25

PCoA details

• AlgorithmstartingfromD inter-pointdistances:– Centertherowsandcolumnsofthematrixofsquare(element-by-element)distances:S =-1/2H D(2)H

– ComputeSVDbydiagonalizing S,S =U Λ UT

– ExtractEuclideanrepresentations:X =UΛ1/2

• TherelativevaluesofdiagonalelementsofΛ givestheproportionofvariabilityexplainedbyeachoftheaxes.

• ThevaluesofΛ shouldalwaysbelookedatindecidinghowmanydimensionstoretain.

26

Beta-diversity;ordinationanalysis

27ISMEJ.2016Mar25.doi:10.1038/ismej.2016.37

Differentiationofmicrobiota betweendiabeticandnon-diabeticsubjectsandacrossbodysites

28

(a) Hands B. Feet C. Hands and Feet

!

Weighted Unifrac (PC1 [28.5 %] − PC2[12.4 %])

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

● ●

●

●

●

●●

●

●

● ●

● ●

●

●●

●

●

●

●

Control Diabetic

Weighted Unifrac (PC1 [54.0 %] − PC2[ 7.9 %])

●● ● ●

●●

●

●●●

●

●● ●●

●

● ●

●

●

●

●

●●

● ●●

●●

●●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

● Control Diabetic

Weighted Unifrac (PC1 [37.8 %] − PC2[ 7.4 %])

●●

●

●

●

●

●

●●●

●

●

●

●

●

●●

●

●●

●

●●

●●

●●●

● ●●●●

●●

●●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●

●●

●●

●●

●●

●

●

●

●●

●

●

●●

●

●●

● ●

●●

●

●

●●

●

●

●

●

●

Arm Control Foot Control

Arm Diabetic Foot Diabetic

Redel etal.JInfectDis.2013207(7):1105-14

Correspondenceanalysis• Isobtainedbyanalyzingtheeigen valuesofthechi-squaretransformedcountsY.

• 𝑌 = 𝑦%% 𝑦%& 𝑦%'𝑦&% 𝑦&& 𝑦&'𝑦'% 𝑦'& 𝑦''

, 𝑦)* =𝑦%% + 𝑦%& + 𝑦%'𝑦&% + 𝑦&& + 𝑦&'𝑦'% + 𝑦'& + 𝑦''

,

• 𝑦*, = 𝑦%% + 𝑦&% + 𝑦'%,… ,…

• 𝑄 = 𝑞), = 0120334013032033 013032�

• SVDanalysisofQresultsinprincipalcomponentsforcorrespondenceanalysis

• Correspondenceanalysispreservesthechi-squaredistance.

Withinclassanalysis

Withinclassanalysis

Suggestedreading/references

32+anyproof-basedlinearalgebratextbook.

Suggestedreading

• SusanHolmes“MultivariateDataAnalysis:TheFrenchWay”,IMSLectureNotes–MonographSeries,2006.

33

lecture 5: ecological distance metrics; principal

Documents