lecture 5: ecological distance metrics; principal
TRANSCRIPT
Lecture5:Ecologicaldistancemetrics;PrincipalCoordinatesAnalysis
Univariate testingvs.communityanalysis
• Univariate testingdealswithhypothesesconcerningindividualtaxa– Isthistaxondifferentiallypresent/abundantindifferentsamples?– Isthistaxoncorrelatedwithagivencontinuousvariable?
• Whatifwewouldliketodrawconclusionsaboutthecommunityasawhole?
2
Usefulideasfrommodernstatistics
• Distancesbetweenanything(abundances,presence-absence,graphs,trees);
• Directhypothesesbasedondistances;• Decompositionsthroughiterativestructuration;• Projections;• Randomizationtests,probabilisticsimulations.
3
Dataà Distancesà Statistics0.54
0.54 0.34
0.60 0.17 0.10
0.01 0.75 0.17 0.72
0.24 0.34 0.74 0.89 0.58
0.08 0.43 0.82 0.91 0.69 0.64
0.57 0.06 0.59 0.03 0.69 0.89 0.07
0.88 0.68 0.59 0.49 0.19 0.64 0.54 0.26
0.24 0.71 0.58 0.52 0.93 0.94 0.46 0.70 0.30
0.92 0.46 0.17 0.21 0.17 0.02 0.81 0.22 0.45 0.59
0.51 0.23 0.74 0.36 0.15 1.00 0.95 0.71 0.88 0.94 0.65
0.29 0.30 0.75 0.28 0.04 0.45 0.15 0.44 0.59 0.89 0.07 0.19
0.57 0.63 0.27 0.72 0.34 0.70 0.19 0.17 0.85 0.07 0.97 0.24 0.45
0.95 0.53 0.86 0.01 0.10 0.63 0.40 0.10 0.51 1.00 0.20 0.41 0.11 0.14
0.83 0.43 0.12 0.39 0.43 0.87 0.42 0.17 0.62 0.57 0.03 0.58 0.54 0.48 0.83
4
S-1 S-2 … S-16
OTU-1
OTU-2
OTU-3
OTU-4
OTU-5
….
OTU-10,000
Visualization(Principalcoordinatesanalysis)Statisticalhypothesistesting(PERMANOVA)
Whatisadistancemetric?
• Scalarfunctiond(.,.)oftwoarguments• d(x,y)>=0,alwaysnonnegative;• d(x,x)=0,distancetoselfis0;• d(x,y)=d(y,x),distanceissymmetric;• d(x,y)<=d(x,z)+d(z,y),triangleinequality.
5
WHATARESOMEGOODDISTANCEMETRICS?
6
Usingdistancestocapturemultidimensionalheterogeneousinformation
• A“good”distancewillenableustoanalyzeanytypeofdatausefully
• Wecanbuildspecializeddistancesthatincorporatedifferenttypesofinformation(abundance,trees,geographicallocations,etc.)
• Wecanvisualizecomplexdataaslongasweknowthedistancesbetweenobjects(observations,variables)
• Wecancomputedistances(correlations)betweendistancestocomparethem
• WecandecomposethesourcesofvariabilitycontributingtodistancesinANOVA-likefashion
7
Distanceandsimilarity
• Sometimesitisconceptuallyeasiertotalkaboutsimilaritiesratherthandistances– E.g.sequencesimilarity
• Anysimilaritymeasurecanbeconvertedintoadistancemetric,e.g.– S– IfSis(0,1),D=1-S– IfS>0,D=1/SorD=exp(-S)
8
Afewusefuldistancesandsimilarityindices• Distances:
– Euclidean:(rememberPythagorastheorem)Σ(xi-yi)2– WeightedEuclidean:χ2 = Σ(ei - oi)2/ei– Hamming/L1,BrayCurtis=Σ 1{xi=yi}– Unifrac (later)– Jensen-Shannon:(D(X|M)+D(Y|M))/2,where
• M =(X +Y)/2• Kullback-Leibler divergence:D(X|Y)=Σ ln[xi/yi]xi
• Similarity:– Correlationcoefficient– Matchingcoefficient:(f11+f00)/(f11 +f10 +f01+f00)– Jaccard SimilarityIndex:f11/(f11 +f10 +f01)
x\y 1 0
1 f11 f100 f01 f00
9
Unifrac distance(Lozupone andKnight,2005)• Isadistancebetweengroupsoforganismsrelatedbyatree• Definition:Ratioofthesumofthelengthofthebranches
leadingtosampleXorY,butnotboth,tothesumofallbranchlengthsofthetree.
10
WeightedUnifrac (Lozupone etal.,2007)
� � � � � �
-RGSVTSVEXI�XE\E�EFYRHERGIW�ERH�TL]PSKIRIXMGXVII
Taxonomy Table taxonomyTableslots: .Data
OTU Abundanceclass: otuTableslots: .Data, speciesAreRows
Sample VariablessampleDataslots: .Data,names,row.names,.S3Class
Phylogenetic Treeclass: phyloslots: see ape package
matrix matrixdata.frame
phyloseqslots:otuTablesampleDatataxTabtre
Experiment-level data object:
Component data objects:
� � � � � �
9RMJVEG�(MWXERGI��0S^YTSRI�ERH�/RMKLX� ���� MW�E�HMWXERGI�FIX[IIR�KVSYTW�SJ�SVKERMWQW�XLEX�EVI�VIPEXIHXS�IEGL�SXLIV�F]�E�XVII�7YTTSWI�[I�LEZI�XLI�389W�TVIWIRX�MR�WEQTPI����FPYI �ERHMR�WEQTPI���VIH �5YIWXMSR� (S�XLI�X[S�WEQTPIW�HMJJIV�TL]PSKIRIXMGEPP]#-X�MW�HI¿RIH�EW�XLI�VEXMS�SJ�XLI�WYQ�SJ�XLI�PIRKXLW�SJ�XLIFVERGLIW�PIEHMRK�XS�QIQFIVW�SJ�KVSYT�% SV�QIQFIVW�SJKVSYT�& FYX�RSX�FSXL�XS�XLI�XSXEP�FVERGL�PIRKXL�SJ�XLIXVII�
� � � � � �
;IMKLXIH�9RMJVEG�HMWXERGI % QSHM¿GEXMSR�SJ�9RM*VEG�[IMKLXIH�9RM*VEG�MW�HI¿RIH�MR��0S^YTSRI�IX�EP�� ���� �EW
R∑
M=1
FM × | %M%8
− &M&8 |
� R = RYQFIV�SJ�FVERGLIW�MR�XLI�XVII� FM !�PIRKXL�SJ�XLI�MXL�FVERGL� %M !�RYQFIV�SJ�HIWGIRHERXW�SJMXL�FVERGL�MR�KVSYT�%
� %8 !�XSXEP�RYQFIV�SJ�WIUYIRGIWMR�KVSYT�%
?#ASV�GER�FI�WIIR�EW�F]�TVSFEFMPMWXW�EW�XLI�;EWWIVWXIMRHMWXERGI��IEVXL�QSZIVW �?#A� � � � � � �
11
KeywarningsabouttheUnifrac familyofdistances.
• Thescalingofthetreeisimportant.• Therootingofthetreeisimportant.
12
Anoteofwarning!
• “Garbagein,garbageout”• Wrongnormalization=>wrongdistance=>wronganswer• However,giventhemanychoicesthereisn’tmuchbeyondpriorknowledge,experienceandintuitiontoguideinselectionofthedistance.
13
PRINCIPALCOORDINATES ANALYSIS-MULTIDIMENSIONALSCALING
14
v2
v1
v3
1
2
3
vvv
é ùê ú= ê úê úë û
v
15
Everymultivariatesamplecanberepresentedasavectorinsomevectorspace
16
4 5 6 7 8
12
34
56
7
2.02.5
3.03.5
4.04.5
iris$Sepal.Length
iris$Sepal.W
idth
iris$Petal.Length
●
●●
●●
●
●
●
●
●
●●
●● ●
●
●
●
●●
● ●●●●
●●
●
● ●●●
●●● ●●
●●
●
●
●
●
● ●
●
●●
●
●●
●● ●
●
●
●
●●●
●
●
●
●●●●
● ● ●
●
●
●
●
● ●
●
● ●
●
●
●
●
●
●●
●
●●
●
●
●●
●
●
●●
●●
●
● ●
●
●
●
●
●
●
●
●●
●
●●
●
● ●
●●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
VectorBasis
• Abasisisasetoflinearlyindependent(dotproductiszero)vectorsthatspan thevectorspace.
• Spanning thevectorspace:Anyvectorinthisvectorspacemayberepresentedasalinearcombinationofthebasisvectors.
• Thevectorsformingabasisareorthogonaltoeachother.Ifallthevectorsareoflength1,thenthebasisiscalledorthonormal.
17
4 5 6 7 8
12
34
56
7
2.02.5
3.03.5
4.04.5
iris$Sepal.Length
iris$Sepal.W
idth
iris$Petal.Length
●
●●
●●
●
●
●
●
●
●●
●● ●
●
●
●
●●
● ●●●●
●●
●
● ●●●
●●● ●●
●●
●
●
●
●
● ●
●
●●
●
●●
●● ●
●
●
●
●●●
●
●
●
●●●●
● ● ●
●
●
●
●
● ●
●
● ●
●
●
●
●
●
●●
●
●●
●
●
●●
●
●
●●
●●
●
● ●
●
●
●
●
●
●
●
●●
●
●●
●
● ●
●●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
18
4 5 6 7 8 9 10 11
−3−2
−1 0
1 2
3
−0.8−0.6
−0.4−0.2
0.0 0.2
0.4 0.6
0.8
pca$li$Axis1
pca$
li$Ax
is3
pca$
li$Ax
is2
●●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●
● ●●
●
●
●
●
●
●
●●
●
●●
●●
●
●
●
●
●●
●●
●
●
●
●
●
●
● ●
●
●●
●
●
● ●
●
●
●
●●
●
●
●
●
●
●
●● ●
●● ●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
● ●●●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
● ●
●
●
●
●●●
●●
●
●
●
●●
●
●
●
●
●
●● ●●●
Basicideaforanalysisofmultidimensionaldata
• Computedistances• Reducedimensions• EmbedinEuclideanspace• ThegeneralframeworkbehindthisprocessiscalledDualitydiagram:(Xnxp,Qpxp,Dnxn)– Xnxp (centered)datamatrix– Qpxp columnweights(weightsonvariables)– Dnxn rowweights(weightsonobservations)
19
Dualitydiagramdefinesthegeometryofmultivariateanalysis
S. Holmes/The French way. 2
which were translated into English. My PhD advisor, Yves Escoufier [8][10] pub-licized the method to biologists and ecologists presenting a formulation based onhis RV-coe�cient, that I will develop below. The first software implementation ofduality based methods described here were done in LEAS (1984) a Pascal programwritten for Apple II computers. The most recent implementation is the R pack-age ade-4 (see A for a review of various implementations of the methods describedhere).
2.1. Notation
The data are p variables measures on n observations. They are recorded in a matrixX with n rows (the observations) and p columns (the variables). D
n
is an nxnmatrix of weights on the ”observations”, most often diagonal. We will also use a”neighborhood” relation (thought of as a metric on the observations) defined bytaking a symmetric definite positive matrix Q. For example, to standardize thevariables Q can be chosen as
Q =
0
B
B
B
@
1�
21
0 0 0 ...
0 1�
22
0 0 ...
0 0 1�
23
0 ...
... ... ... 0 1�
2p
1
C
C
C
A
.
These three matrices form the essential ”triplet” (X,Q,D) defining a multivariatedata analysis. As the approach here is geometrical, it is important to see that Qand D define geometries or inner products in Rp and Rn respectively through
xtQy =< x, y >Q
x, y 2 Rp
xtDy =< x, y >D
x, y 2 Rn
From these definitions we see there is a close relation between this approach and
kernel based methods, for more details see [24]. Q can be seen as a linear function
from Rp to Rp⇤ = L(Rp), the space of scalar linear functions on Rp. D can be seen
as a linear function from Rn to Rn⇤ = L(Rn). Escoufier[8] proposed to associate to
a data set an operator from the space of observations Rp into the dual of the space
of variables Rn⇤. This is summarized in the following diagram [1] which is made
commutative by defining V and W as XtDX and XQXt respectively.
Rp⇤ ��!X
Rn
Q
x
?
?
?
?
?
?
y
V D
?
?
?
y
x
?
?
?
W
Rp ��Xt
Rn⇤
This is known as the duality diagram because knowledge of the eigendecompo-sition of XtDXQ = V Q leads to that of the dual operator XQXtD. The mainconsequence is an easy transition between principal components and principal axes
imsart-lnms ver. 2005/05/19 file: dfc.tex date: July 30, 2006
• V=XTDX• W=XQXT
• Duality:– Theeigen decompositionofVQleadstoeigen-decompositionofWD
• Inertia isequaltotrace(sumofthediagonalelements)ofVQorWD.
20
PrincipalComponentAnalysis(PCA)
• LetQ=IandD=1/nIandletXbecentered.• VQ=XTDXQ=1/nXTX.• TheinertiaTr(VQ)=sumofthevariances.• PCAdecomposesthevarianceofXintoindependentcomponents.• Todecomposetheinertiameanstofindtheeigen-systemofVQorequivalentlyWDmatrices.
• Eigenvaluesgivetheamountofinertiaexplainedincorrespondingdimension.
• Eigenvectorsgivethedimensionsofvariability.
21
ExamplePCAhead(USArrests)
Murder Assault UrbanPop RapeAlabama 13.2 236 58 21.2Alaska 10.0 263 48 44.5Arizona 8.1 294 80 31.0Arkansas 8.8 190 50 19.5California 9.0 276 91 40.6Colorado 7.9 204 78 38.7
−0.2 −0.1 0.0 0.1 0.2 0.3
−0.2
−0.1
0.0
0.1
0.2
0.3
Comp.1
Com
p.2
AlabamaAlaska
Arizona
Arkansas
California
ColoradoConnecticut
Delaware
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana IowaKansas
KentuckyLouisiana
MaineMaryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon Pennsylvania
Rhode Island
South Carolina
South DakotaTennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming
−5 0 5−5
05
Murder
Assault
UrbanPop
Rape
Comp.1 Comp.2 Comp.3 Comp.4
pc.cr
Variances
0.0
0.5
1.0
1.5
2.0
Screeplot:plotofinertia
Biplot
Loadings:Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.536 0.418 -0.341 0.649Assault -0.583 0.188 -0.268 -0.743UrbanPop -0.278 -0.873 -0.378 0.134Rape -0.543 -0.167 0.818
22
Centering
• LetYbenotcentereddatamatrixwithnobservations(rows)andpvariables(columns)
• LetH =(I – 1/n1x1’)• ThenX=HYiscentered
23
FromEuclideandistancestoPCAtoPCoA
• NotethatifD isaEuclideandistance,then• XX’=1/nH D(2) H.• PCoA isageneralizationofPCAinthatknowledgeofXisnotrequired,allyouneedtorepresentthepointsisD,theinter-pointdistancematrix.
24
Representationof(arbitrary)distancesinEuclideanspace
• Theideaistousesingularvaluedecomposition(SVD)onthecenteredinterpoint distancematrixtoextractEuclideandimensions
• SVD:X=USV,whereSisdiagonalmatrixwithdiagonalelementss1,s2,…,sn,andUandVareunitmatrices(i.e.theirdeterminantis1andtheyspantheircorrespondingspaces)
25
PCoA details
• AlgorithmstartingfromD inter-pointdistances:– Centertherowsandcolumnsofthematrixofsquare(element-by-element)distances:S =-1/2H D(2)H
– ComputeSVDbydiagonalizing S,S =U Λ UT
– ExtractEuclideanrepresentations:X =UΛ1/2
• TherelativevaluesofdiagonalelementsofΛ givestheproportionofvariabilityexplainedbyeachoftheaxes.
• ThevaluesofΛ shouldalwaysbelookedatindecidinghowmanydimensionstoretain.
26
Beta-diversity;ordinationanalysis
27ISMEJ.2016Mar25.doi:10.1038/ismej.2016.37
Differentiationofmicrobiota betweendiabeticandnon-diabeticsubjectsandacrossbodysites
28
(a) Hands B. Feet C. Hands and Feet
!
Weighted Unifrac (PC1 [28.5 %] − PC2[12.4 %])
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
● ●
●
●
●
●●
●
●
● ●
● ●
●
●●
●
●
●
●
Control Diabetic
Weighted Unifrac (PC1 [54.0 %] − PC2[ 7.9 %])
●● ● ●
●●
●
●●●
●
●● ●●
●
● ●
●
●
●
●
●●
● ●●
●●
●●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
● Control Diabetic
Weighted Unifrac (PC1 [37.8 %] − PC2[ 7.4 %])
●●
●
●
●
●
●
●●●
●
●
●
●
●
●●
●
●●
●
●●
●●
●●●
● ●●●●
●●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●●
●●
●●
●●
●
●
●
●●
●
●
●●
●
●●
● ●
●●
●
●
●●
●
●
●
●
●
Arm Control Foot Control
Arm Diabetic Foot Diabetic
Redel etal.JInfectDis.2013207(7):1105-14
Correspondenceanalysis• Isobtainedbyanalyzingtheeigen valuesofthechi-squaretransformedcountsY.
• 𝑌 = 𝑦%% 𝑦%& 𝑦%'𝑦&% 𝑦&& 𝑦&'𝑦'% 𝑦'& 𝑦''
, 𝑦)* =𝑦%% + 𝑦%& + 𝑦%'𝑦&% + 𝑦&& + 𝑦&'𝑦'% + 𝑦'& + 𝑦''
,
• 𝑦*, = 𝑦%% + 𝑦&% + 𝑦'%,… ,…
• 𝑄 = 𝑞), = 0120334013032033 013032�
• SVDanalysisofQresultsinprincipalcomponentsforcorrespondenceanalysis
• Correspondenceanalysispreservesthechi-squaredistance.
Withinclassanalysis
Withinclassanalysis
Suggestedreading/references
32+anyproof-basedlinearalgebratextbook.
Suggestedreading
• SusanHolmes“MultivariateDataAnalysis:TheFrenchWay”,IMSLectureNotes–MonographSeries,2006.
33