lecture 5: ecological distance metrics; principal

17
Lecture 5: Ecological distance metrics; Principal Coordinates Analysis Univariate testing vs. community analysis Univariate testing deals with hypotheses concerning individual taxa Is this taxon differentially present/abundant in different samples? Is this taxon correlated with a given continuous variable? What if we would like to draw conclusions about the community as a whole? 2

Upload: others

Post on 03-May-2022

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture 5: Ecological distance metrics; Principal

Lecture5:Ecologicaldistancemetrics;PrincipalCoordinatesAnalysis

Univariate testingvs.communityanalysis

• Univariate testingdealswithhypothesesconcerningindividualtaxa– Isthistaxondifferentiallypresent/abundantindifferentsamples?– Isthistaxoncorrelatedwithagivencontinuousvariable?

• Whatifwewouldliketodrawconclusionsaboutthecommunityasawhole?

2

Page 2: Lecture 5: Ecological distance metrics; Principal

Usefulideasfrommodernstatistics

• Distancesbetweenanything(abundances,presence-absence,graphs,trees);

• Directhypothesesbasedondistances;• Decompositionsthroughiterativestructuration;• Projections;• Randomizationtests,probabilisticsimulations.

3

Dataà Distancesà Statistics0.54

0.54 0.34

0.60 0.17 0.10

0.01 0.75 0.17 0.72

0.24 0.34 0.74 0.89 0.58

0.08 0.43 0.82 0.91 0.69 0.64

0.57 0.06 0.59 0.03 0.69 0.89 0.07

0.88 0.68 0.59 0.49 0.19 0.64 0.54 0.26

0.24 0.71 0.58 0.52 0.93 0.94 0.46 0.70 0.30

0.92 0.46 0.17 0.21 0.17 0.02 0.81 0.22 0.45 0.59

0.51 0.23 0.74 0.36 0.15 1.00 0.95 0.71 0.88 0.94 0.65

0.29 0.30 0.75 0.28 0.04 0.45 0.15 0.44 0.59 0.89 0.07 0.19

0.57 0.63 0.27 0.72 0.34 0.70 0.19 0.17 0.85 0.07 0.97 0.24 0.45

0.95 0.53 0.86 0.01 0.10 0.63 0.40 0.10 0.51 1.00 0.20 0.41 0.11 0.14

0.83 0.43 0.12 0.39 0.43 0.87 0.42 0.17 0.62 0.57 0.03 0.58 0.54 0.48 0.83

4

S-1 S-2 … S-16

OTU-1

OTU-2

OTU-3

OTU-4

OTU-5

….

OTU-10,000

Visualization(Principalcoordinatesanalysis)Statisticalhypothesistesting(PERMANOVA)

Page 3: Lecture 5: Ecological distance metrics; Principal

Whatisadistancemetric?

• Scalarfunctiond(.,.)oftwoarguments• d(x,y)>=0,alwaysnonnegative;• d(x,x)=0,distancetoselfis0;• d(x,y)=d(y,x),distanceissymmetric;• d(x,y)<=d(x,z)+d(z,y),triangleinequality.

5

WHATARESOMEGOODDISTANCEMETRICS?

6

Page 4: Lecture 5: Ecological distance metrics; Principal

Usingdistancestocapturemultidimensionalheterogeneousinformation

• A“good”distancewillenableustoanalyzeanytypeofdatausefully

• Wecanbuildspecializeddistancesthatincorporatedifferenttypesofinformation(abundance,trees,geographicallocations,etc.)

• Wecanvisualizecomplexdataaslongasweknowthedistancesbetweenobjects(observations,variables)

• Wecancomputedistances(correlations)betweendistancestocomparethem

• WecandecomposethesourcesofvariabilitycontributingtodistancesinANOVA-likefashion

7

Distanceandsimilarity

• Sometimesitisconceptuallyeasiertotalkaboutsimilaritiesratherthandistances– E.g.sequencesimilarity

• Anysimilaritymeasurecanbeconvertedintoadistancemetric,e.g.– S– IfSis(0,1),D=1-S– IfS>0,D=1/SorD=exp(-S)

8

Page 5: Lecture 5: Ecological distance metrics; Principal

Afewusefuldistancesandsimilarityindices• Distances:

– Euclidean:(rememberPythagorastheorem)Σ(xi-yi)2– WeightedEuclidean:χ2 = Σ(ei - oi)2/ei– Hamming/L1,BrayCurtis=Σ 1{xi=yi}– Unifrac (later)– Jensen-Shannon:(D(X|M)+D(Y|M))/2,where

• M =(X +Y)/2• Kullback-Leibler divergence:D(X|Y)=Σ ln[xi/yi]xi

• Similarity:– Correlationcoefficient– Matchingcoefficient:(f11+f00)/(f11 +f10 +f01+f00)– Jaccard SimilarityIndex:f11/(f11 +f10 +f01)

x\y 1 0

1 f11 f100 f01 f00

9

Unifrac distance(Lozupone andKnight,2005)• Isadistancebetweengroupsoforganismsrelatedbyatree• Definition:Ratioofthesumofthelengthofthebranches

leadingtosampleXorY,butnotboth,tothesumofallbranchlengthsofthetree.

10

Page 6: Lecture 5: Ecological distance metrics; Principal

WeightedUnifrac (Lozupone etal.,2007)

� � � � � �

-RGSVTSVEXI�XE\E�EFYRHERGIW�ERH�TL]PSKIRIXMGXVII

Taxonomy Table taxonomyTableslots: .Data

OTU Abundanceclass: otuTableslots: .Data, speciesAreRows

Sample VariablessampleDataslots: .Data,names,row.names,.S3Class

Phylogenetic Treeclass: phyloslots: see ape package

matrix matrixdata.frame

phyloseqslots:otuTablesampleDatataxTabtre

Experiment-level data object:

Component data objects:

� � � � � �

9RMJVEG�(MWXERGI��0S^YTSRI�ERH�/RMKLX� ���� MW�E�HMWXERGI�FIX[IIR�KVSYTW�SJ�SVKERMWQW�XLEX�EVI�VIPEXIHXS�IEGL�SXLIV�F]�E�XVII�7YTTSWI�[I�LEZI�XLI�389W�TVIWIRX�MR�WEQTPI����FPYI �ERHMR�WEQTPI���VIH �5YIWXMSR� (S�XLI�X[S�WEQTPIW�HMJJIV�TL]PSKIRIXMGEPP]#-X�MW�HI¿RIH�EW�XLI�VEXMS�SJ�XLI�WYQ�SJ�XLI�PIRKXLW�SJ�XLIFVERGLIW�PIEHMRK�XS�QIQFIVW�SJ�KVSYT�% SV�QIQFIVW�SJKVSYT�& FYX�RSX�FSXL�XS�XLI�XSXEP�FVERGL�PIRKXL�SJ�XLIXVII�

� � � � � �

;IMKLXIH�9RMJVEG�HMWXERGI % QSHM¿GEXMSR�SJ�9RM*VEG�[IMKLXIH�9RM*VEG�MW�HI¿RIH�MR��0S^YTSRI�IX�EP�� ���� �EW

R∑

M=1

FM × | %M%8

− &M&8 |

� R = RYQFIV�SJ�FVERGLIW�MR�XLI�XVII� FM !�PIRKXL�SJ�XLI�MXL�FVERGL� %M !�RYQFIV�SJ�HIWGIRHERXW�SJMXL�FVERGL�MR�KVSYT�%

� %8 !�XSXEP�RYQFIV�SJ�WIUYIRGIWMR�KVSYT�%

?#ASV�GER�FI�WIIR�EW�F]�TVSFEFMPMWXW�EW�XLI�;EWWIVWXIMRHMWXERGI��IEVXL�QSZIVW �?#A� � � � � � �

11

KeywarningsabouttheUnifrac familyofdistances.

• Thescalingofthetreeisimportant.• Therootingofthetreeisimportant.

12

Page 7: Lecture 5: Ecological distance metrics; Principal

Anoteofwarning!

• “Garbagein,garbageout”• Wrongnormalization=>wrongdistance=>wronganswer• However,giventhemanychoicesthereisn’tmuchbeyondpriorknowledge,experienceandintuitiontoguideinselectionofthedistance.

13

PRINCIPALCOORDINATES ANALYSIS-MULTIDIMENSIONALSCALING

14

Page 8: Lecture 5: Ecological distance metrics; Principal

v2

v1

v3

1

2

3

vvv

é ùê ú= ê úê úë û

v

15

Everymultivariatesamplecanberepresentedasavectorinsomevectorspace

16

4 5 6 7 8

12

34

56

7

2.02.5

3.03.5

4.04.5

iris$Sepal.Length

iris$Sepal.W

idth

iris$Petal.Length

●●

●●

●●

●● ●

●●

● ●●●●

●●

● ●●●

●●● ●●

●●

● ●

●●

●●

●● ●

●●●

●●●●

● ● ●

● ●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

Page 9: Lecture 5: Ecological distance metrics; Principal

VectorBasis

• Abasisisasetoflinearlyindependent(dotproductiszero)vectorsthatspan thevectorspace.

• Spanning thevectorspace:Anyvectorinthisvectorspacemayberepresentedasalinearcombinationofthebasisvectors.

• Thevectorsformingabasisareorthogonaltoeachother.Ifallthevectorsareoflength1,thenthebasisiscalledorthonormal.

17

4 5 6 7 8

12

34

56

7

2.02.5

3.03.5

4.04.5

iris$Sepal.Length

iris$Sepal.W

idth

iris$Petal.Length

●●

●●

●●

●● ●

●●

● ●●●●

●●

● ●●●

●●● ●●

●●

● ●

●●

●●

●● ●

●●●

●●●●

● ● ●

● ●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

18

4 5 6 7 8 9 10 11

−3−2

−1 0

1 2

3

−0.8−0.6

−0.4−0.2

0.0 0.2

0.4 0.6

0.8

pca$li$Axis1

pca$

li$Ax

is3

pca$

li$Ax

is2

●●●

●●

● ●

● ●●

●●

●●

●●

●●

●●

● ●

●●

● ●

●●

●● ●

●● ●

●●

●●

● ●●●

● ●

●●

●●

● ●

●●●

●●

●●

●● ●●●

Page 10: Lecture 5: Ecological distance metrics; Principal

Basicideaforanalysisofmultidimensionaldata

• Computedistances• Reducedimensions• EmbedinEuclideanspace• ThegeneralframeworkbehindthisprocessiscalledDualitydiagram:(Xnxp,Qpxp,Dnxn)– Xnxp (centered)datamatrix– Qpxp columnweights(weightsonvariables)– Dnxn rowweights(weightsonobservations)

19

Dualitydiagramdefinesthegeometryofmultivariateanalysis

S. Holmes/The French way. 2

which were translated into English. My PhD advisor, Yves Escoufier [8][10] pub-licized the method to biologists and ecologists presenting a formulation based onhis RV-coe�cient, that I will develop below. The first software implementation ofduality based methods described here were done in LEAS (1984) a Pascal programwritten for Apple II computers. The most recent implementation is the R pack-age ade-4 (see A for a review of various implementations of the methods describedhere).

2.1. Notation

The data are p variables measures on n observations. They are recorded in a matrixX with n rows (the observations) and p columns (the variables). D

n

is an nxnmatrix of weights on the ”observations”, most often diagonal. We will also use a”neighborhood” relation (thought of as a metric on the observations) defined bytaking a symmetric definite positive matrix Q. For example, to standardize thevariables Q can be chosen as

Q =

0

B

B

B

@

1�

21

0 0 0 ...

0 1�

22

0 0 ...

0 0 1�

23

0 ...

... ... ... 0 1�

2p

1

C

C

C

A

.

These three matrices form the essential ”triplet” (X,Q,D) defining a multivariatedata analysis. As the approach here is geometrical, it is important to see that Qand D define geometries or inner products in Rp and Rn respectively through

xtQy =< x, y >Q

x, y 2 Rp

xtDy =< x, y >D

x, y 2 Rn

From these definitions we see there is a close relation between this approach and

kernel based methods, for more details see [24]. Q can be seen as a linear function

from Rp to Rp⇤ = L(Rp), the space of scalar linear functions on Rp. D can be seen

as a linear function from Rn to Rn⇤ = L(Rn). Escoufier[8] proposed to associate to

a data set an operator from the space of observations Rp into the dual of the space

of variables Rn⇤. This is summarized in the following diagram [1] which is made

commutative by defining V and W as XtDX and XQXt respectively.

Rp⇤ ��!X

Rn

Q

x

?

?

?

?

?

?

y

V D

?

?

?

y

x

?

?

?

W

Rp ��Xt

Rn⇤

This is known as the duality diagram because knowledge of the eigendecompo-sition of XtDXQ = V Q leads to that of the dual operator XQXtD. The mainconsequence is an easy transition between principal components and principal axes

imsart-lnms ver. 2005/05/19 file: dfc.tex date: July 30, 2006

• V=XTDX• W=XQXT

• Duality:– Theeigen decompositionofVQleadstoeigen-decompositionofWD

• Inertia isequaltotrace(sumofthediagonalelements)ofVQorWD.

20

Page 11: Lecture 5: Ecological distance metrics; Principal

PrincipalComponentAnalysis(PCA)

• LetQ=IandD=1/nIandletXbecentered.• VQ=XTDXQ=1/nXTX.• TheinertiaTr(VQ)=sumofthevariances.• PCAdecomposesthevarianceofXintoindependentcomponents.• Todecomposetheinertiameanstofindtheeigen-systemofVQorequivalentlyWDmatrices.

• Eigenvaluesgivetheamountofinertiaexplainedincorrespondingdimension.

• Eigenvectorsgivethedimensionsofvariability.

21

ExamplePCAhead(USArrests)

Murder Assault UrbanPop RapeAlabama 13.2 236 58 21.2Alaska 10.0 263 48 44.5Arizona 8.1 294 80 31.0Arkansas 8.8 190 50 19.5California 9.0 276 91 40.6Colorado 7.9 204 78 38.7

−0.2 −0.1 0.0 0.1 0.2 0.3

−0.2

−0.1

0.0

0.1

0.2

0.3

Comp.1

Com

p.2

AlabamaAlaska

Arizona

Arkansas

California

ColoradoConnecticut

Delaware

Florida

Georgia

Hawaii

Idaho

Illinois

Indiana IowaKansas

KentuckyLouisiana

MaineMaryland

Massachusetts

Michigan

Minnesota

Mississippi

Missouri

Montana

Nebraska

Nevada

New Hampshire

New Jersey

New Mexico

New York

North Carolina

North Dakota

Ohio

Oklahoma

Oregon Pennsylvania

Rhode Island

South Carolina

South DakotaTennessee

Texas

Utah

Vermont

Virginia

Washington

West Virginia

Wisconsin

Wyoming

−5 0 5−5

05

Murder

Assault

UrbanPop

Rape

Comp.1 Comp.2 Comp.3 Comp.4

pc.cr

Variances

0.0

0.5

1.0

1.5

2.0

Screeplot:plotofinertia

Biplot

Loadings:Comp.1 Comp.2 Comp.3 Comp.4

Murder -0.536 0.418 -0.341 0.649Assault -0.583 0.188 -0.268 -0.743UrbanPop -0.278 -0.873 -0.378 0.134Rape -0.543 -0.167 0.818

22

Page 12: Lecture 5: Ecological distance metrics; Principal

Centering

• LetYbenotcentereddatamatrixwithnobservations(rows)andpvariables(columns)

• LetH =(I – 1/n1x1’)• ThenX=HYiscentered

23

FromEuclideandistancestoPCAtoPCoA

• NotethatifD isaEuclideandistance,then• XX’=1/nH D(2) H.• PCoA isageneralizationofPCAinthatknowledgeofXisnotrequired,allyouneedtorepresentthepointsisD,theinter-pointdistancematrix.

24

Page 13: Lecture 5: Ecological distance metrics; Principal

Representationof(arbitrary)distancesinEuclideanspace

• Theideaistousesingularvaluedecomposition(SVD)onthecenteredinterpoint distancematrixtoextractEuclideandimensions

• SVD:X=USV,whereSisdiagonalmatrixwithdiagonalelementss1,s2,…,sn,andUandVareunitmatrices(i.e.theirdeterminantis1andtheyspantheircorrespondingspaces)

25

PCoA details

• AlgorithmstartingfromD inter-pointdistances:– Centertherowsandcolumnsofthematrixofsquare(element-by-element)distances:S =-1/2H D(2)H

– ComputeSVDbydiagonalizing S,S =U Λ UT

– ExtractEuclideanrepresentations:X =UΛ1/2

• TherelativevaluesofdiagonalelementsofΛ givestheproportionofvariabilityexplainedbyeachoftheaxes.

• ThevaluesofΛ shouldalwaysbelookedatindecidinghowmanydimensionstoretain.

26

Page 14: Lecture 5: Ecological distance metrics; Principal

Beta-diversity;ordinationanalysis

27ISMEJ.2016Mar25.doi:10.1038/ismej.2016.37

Differentiationofmicrobiota betweendiabeticandnon-diabeticsubjectsandacrossbodysites

28

(a) Hands B. Feet C. Hands and Feet

!

Weighted Unifrac (PC1 [28.5 %] − PC2[12.4 %])

●●

●●

●●

● ●

●●

● ●

● ●

●●

Control Diabetic

Weighted Unifrac (PC1 [54.0 %] − PC2[ 7.9 %])

●● ● ●

●●

●●●

●● ●●

● ●

●●

● ●●

●●

●●

●●

●●

● Control Diabetic

Weighted Unifrac (PC1 [37.8 %] − PC2[ 7.4 %])

●●

●●●

●●

●●

●●

●●

●●●

● ●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

Arm Control Foot Control

Arm Diabetic Foot Diabetic

Redel etal.JInfectDis.2013207(7):1105-14

Page 15: Lecture 5: Ecological distance metrics; Principal

Correspondenceanalysis• Isobtainedbyanalyzingtheeigen valuesofthechi-squaretransformedcountsY.

• 𝑌 = 𝑦%% 𝑦%& 𝑦%'𝑦&% 𝑦&& 𝑦&'𝑦'% 𝑦'& 𝑦''

, 𝑦)* =𝑦%% + 𝑦%& + 𝑦%'𝑦&% + 𝑦&& + 𝑦&'𝑦'% + 𝑦'& + 𝑦''

,

• 𝑦*, = 𝑦%% + 𝑦&% + 𝑦'%,… ,…

• 𝑄 = 𝑞), = 0120334013032033 013032�

• SVDanalysisofQresultsinprincipalcomponentsforcorrespondenceanalysis

• Correspondenceanalysispreservesthechi-squaredistance.

Withinclassanalysis

Page 16: Lecture 5: Ecological distance metrics; Principal

Withinclassanalysis

Suggestedreading/references

32+anyproof-basedlinearalgebratextbook.

Page 17: Lecture 5: Ecological distance metrics; Principal

Suggestedreading

• SusanHolmes“MultivariateDataAnalysis:TheFrenchWay”,IMSLectureNotes–MonographSeries,2006.

33