Some slides from Prof. A. Alekseyenko (NYU) and Prof. S. Holmes (Stanford)

Lecture 2: Diversity, Distances, adonis

1

Lecture 2: Diversity, Distances, adonis

• “Diversity” -‐ alpha, beta (, gamma) • Beta-‐Diversity in practice: Ecological Distances • Unsupervised Learning: Clustering, etc • Ordination: e.g. PCA, UniFrac/PCoA, DPCoA

• Testing: Permutational Multivariate ANOVA

Alpha-Diversity

Alpha diversity definition(s)

• Alpha diversity describes the diversity of a single community (specimen).

• In statistical terms, it is a scalar statistic computed for a single observation (column) that represents the diversity of that observation.

• There are many statistics that can describe diversity: e.g. taxonomical richness, evenness, dominance, etc.

Rank abundance plots

Species richness

• Suppose we observe a community that can contain up to k ‘species’.

• The relative proportions of the species are $P = \{p_1, \ldots, p_k\}$.
• Richness is computed as

$R = \mathbf{1}(p_1) + \mathbf{1}(p_2) + \cdots + \mathbf{1}(p_k)$

where $\mathbf{1}(\cdot)$ is an indicator function, i.e. $\mathbf{1}(x) = 1$ if $x \neq 0$, and 0 otherwise.

• Higher $R$ means greater diversity.
• Richness is very dependent upon depth of sampling and sensitive to the presence of rare species.
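The indicator-sum definition of richness can be sketched in a few lines of base R. The vector `counts` and the function name `richness` are hypothetical, chosen only for illustration.

```r
# Observed richness: the number of taxa with non-zero abundance.
# `counts` is a made-up vector of read counts for one sample.
richness <- function(counts) {
  sum(counts > 0)
}

counts <- c(120, 30, 0, 5, 0, 1)
p <- counts / sum(counts)   # relative proportions p_1, ..., p_k

richness(counts)  # 4
richness(p)       # 4: counts and proportions give the same answer
```

Because only presence or absence matters, a single read of a rare taxon changes the result as much as a dominant taxon does.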

Rarefaction Curves

• Sanders 1968
• Non-parametric richness
• Estimate coverage

[Figure: rarefaction curves. y-axis: number of species; x-axis: number of observations (library size, number of reads, sample size)]

Sanders, H. L. (1968). Marine benthic diversity: a comparative study. American Naturalist.
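One simple way to approximate a rarefaction curve is repeated random subsampling without replacement. The sketch below uses that approach in base R; it is not necessarily the procedure behind the original figure. The data and the function name `rarefy_richness` are hypothetical.

```r
# Mean observed richness after subsampling `depth` reads without replacement.
rarefy_richness <- function(counts, depth, n_rep = 50) {
  # One entry per read, labelled by the index of its species
  reads <- rep(seq_along(counts), times = counts)
  mean(replicate(n_rep, length(unique(sample(reads, depth)))))
}

set.seed(1)
counts <- c(500, 200, 100, 50, 20, 10, 5, 2, 1, 1)   # made-up community
depths <- c(10, 50, 100, 500, sum(counts))
curve <- sapply(depths, function(d) rarefy_richness(counts, d))
curve  # richness rises with depth and reaches 10 at full depth
```

Plotting `curve` against `depths` gives the usual picture: rare species are only picked up at large sample sizes, which is why richness should be compared at a common depth.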

Shannon index

• Suppose we observe a community that can contain up to $k$ ‘species’.
• The relative proportions of the species are $P = \{p_1, \ldots, p_k\}$.
• The Shannon index is related to the notion of information content from information theory. It roughly measures how uncertain we are about the species that a random draw from $P$ will produce.
• When $p_i = p_j$ for all $i$ and $j$, we have no information about which species a random draw will result in. As the inequality becomes more pronounced, we gain more information about the possible outcome of the draw. The Shannon index captures this property of the distribution.
• The Shannon index is computed as

$S_k = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_k \log_2 p_k$

Note that as $p_i \to 0$, $\log_2 p_i \to -\infty$; we therefore define $p_i \log_2 p_i = 0$ when $p_i = 0$.

• Higher $S_k$ means higher diversity.

“Shannon entropy”: http://en.wikipedia.org/wiki/Entropy_(information_theory)
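A minimal base-R sketch of the Shannon index as defined above, using log base 2 and the convention that zero-abundance taxa contribute nothing. The function name `shannon` and the example vectors are hypothetical.

```r
# Shannon index in bits (log base 2), following the slide's definition.
shannon <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]            # p * log2(p) is defined as 0 when p = 0
  -sum(p * log2(p))
}

even   <- c(25, 25, 25, 25)   # made-up, perfectly even community
skewed <- c(97, 1, 1, 1)      # made-up, one dominant species

shannon(even)             # 2, i.e. log2(4), the maximum for k = 4
shannon(skewed)           # much smaller: the draw is nearly predictable
shannon(c(10, 0, 0, 0))   # 0: a single species, no uncertainty
```

Note that `vegan::diversity()` uses the natural logarithm by default, so its values differ from these by a constant factor unless `base = 2` is set.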

From Shannon to Evenness

• The Shannon index for a community of $k$ species has a maximum at $\log_2 k$.

• We can make different communities more comparable if we normalize by the maximum.

• The evenness index is computed as $E_k = S_k / \log_2 k$.

• $E_k = 1$ means total evenness.
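The normalization can be sketched as follows. One assumption to flag: here $k$ is taken as the length of the abundance vector (the number of species the community can contain, as in the slides), not the number of species actually observed; other conventions exist. The function names are hypothetical.

```r
# Shannon index in bits, with 0 * log2(0) treated as 0
shannon <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Evenness: Shannon index divided by its maximum, log2(k).
# Assumption: k = length(counts), zero-abundance species included.
evenness <- function(counts) {
  k <- length(counts)
  shannon(counts) / log2(k)
}

evenness(c(25, 25, 25, 25))  # 1: total evenness
evenness(c(97, 1, 1, 1))     # close to 0: one species dominates
```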

Simpson index

• Suppose we observe a community that can contain up to $k$ ‘species’.
• The relative proportions of the species are $P = \{p_1, \ldots, p_k\}$.
• The Simpson index is based on the probability of resampling the same species on two consecutive draws with replacement.
• Suppose on the first draw we picked species $i$. This event has probability $p_i$, hence the probability of drawing that species twice is $p_i \cdot p_i = p_i^2$. Summing over all species gives the probability of drawing the same species twice, $p_1^2 + p_2^2 + \cdots + p_k^2$.
• The Simpson index is usually computed as

$D = 1 - (p_1^2 + p_2^2 + \cdots + p_k^2)$

In this form, the index represents the probability that two individuals randomly selected from a sample will belong to different species.

• $D = 0$ means no diversity (one species is completely dominant).
• $D$ close to 1 means high diversity; the maximum for $k$ species, $1 - 1/k$, is reached when all species are equally abundant.
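A short base-R sketch of the Simpson index in the $1 - \sum p_i^2$ form given above. The function name `simpson` and the example vectors are hypothetical.

```r
# Simpson index: probability that two draws (with replacement)
# belong to different species.
simpson <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

simpson(c(10, 0, 0, 0))      # 0: one species is completely dominant
simpson(c(25, 25, 25, 25))   # 0.75, i.e. 1 - 1/4 for four even species
simpson(c(97, 1, 1, 1))      # small: most pairs of draws match
```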

Numbers equivalent diversity

• Often it is convenient to talk about alpha diversity in terms of equivalent units: how many equally abundant taxa would it take to get the same diversity as we see in a given community?
• For richness, there is no difference in the statistic.
• For Shannon, remember that $\log_2 k$ is the maximum, which is attained when all species have equal abundance. Hence the diversity in equivalent units is $2^{S_k}$.
• For Simpson, the equivalent-units measure of diversity is $1/(1 - D)$, sometimes called the “Inverse Simpson Index”.
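The three conversions can be checked numerically. In this base-R sketch the function name `numbers_equivalent` and the example communities are hypothetical; the formulas are the ones listed above.

```r
# Diversity in "equivalent units" for richness, Shannon and Simpson.
numbers_equivalent <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  S <- -sum(p * log2(p))   # Shannon index, base 2
  D <- 1 - sum(p^2)        # Simpson index
  c(richness = length(p), shannon = 2^S, simpson = 1 / (1 - D))
}

numbers_equivalent(c(25, 25, 25, 25))  # all three equal 4
numbers_equivalent(c(97, 1, 1, 1))     # richness 4, the other two near 1
```

For an even community the three measures agree. For a skewed community they diverge, with richness giving the most weight to rare species and inverse Simpson the least.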

Beta-Diversity

http://en.wikipedia.org/wiki/Beta_diversity

• Microbial ecologists typically use beta diversity as a broad umbrella term that can refer to any of several indices related to compositional differences (differences in species content between samples).

• For some reason this is contentious, and there appears to be ongoing (and pointless?) argument over the possible definitions

• For our purposes, and microbiome research, when you hear “beta-diversity”, you can probably think:

“Diversity of species composition”

Summary of diversity “types”

• α – diversity within a community, # of species only
• β – diversity between communities (differentiation); species identity is taken into account
• γ – (global) diversity of the site
• Theoretically, one would wish to use measures that result in γ = α × β.
• This is only possible if α and β are independent of each other.
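As a toy illustration of γ = α × β, the sketch below uses richness as the diversity measure, takes α as the mean richness per sample, and defines β as γ / α. This is only one of several possible partitions, and the abundance table `otu` is made up.

```r
# Rows = species, columns = samples (one observation per column).
otu <- cbind(
  s1 = c(10, 5, 0, 0),
  s2 = c(0, 3, 7, 0),
  s3 = c(4, 0, 0, 9)
)

alpha <- mean(colSums(otu > 0))   # mean richness within a sample
gamma <- sum(rowSums(otu) > 0)    # richness of the pooled site
beta  <- gamma / alpha            # multiplicative partition (assumed)

c(alpha = alpha, beta = beta, gamma = gamma)  # 2, 2, 4
```

Here each sample holds 2 of the 4 species, so the site contains the equivalent of two completely distinct communities.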

Beta-Diversity “in practice”

1. UniFrac or Bray-Curtis distance between samples
2. MDS (“PCoA”)
3. Plot first two axes
4. Admire clusters
5. Write paper
6. Choose new microbiomes
7. Return to Step 1, repeat

Why? Let’s back up. This is one option in an arsenal of dimensional reduction methods that come from “unsupervised learning” in “exploratory data analysis”.
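Steps 1 to 3 of the list above can be sketched in base R. The count table is invented, `bray_curtis` is a hand-written helper (not a library function), and `cmdscale()` from the `stats` package performs classical MDS, i.e. PCoA. UniFrac is not shown because it needs a phylogenetic tree.

```r
# Bray-Curtis dissimilarity between two count vectors
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Made-up counts: rows = samples, columns = taxa
counts <- rbind(
  a1 = c(10, 8, 0, 0),
  a2 = c(9, 10, 1, 0),
  b1 = c(0, 1, 12, 7),
  b2 = c(0, 0, 9, 11)
)

# Step 1: pairwise distance matrix
n <- nrow(counts)
d <- matrix(0, n, n, dimnames = list(rownames(counts), rownames(counts)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- bray_curtis(counts[i, ], counts[j, ])
  }
}

# Step 2: MDS / PCoA on the distances
pcoa <- cmdscale(as.dist(d), k = 2, eig = TRUE)

# Step 3: coordinates to plot, e.g. plot(pcoa$points)
pcoa$points
```

In this toy table the "a" and "b" samples share almost no taxa, so they fall on opposite sides of the first axis.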

Dimensional Reduction

[Figure panels: regress disc on weight; regress weight on disc]

Dimensional Reduction

Minimize the distance to the line in both directions. The purple line is the principal component line.
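The difference between the two regressions and the principal component line can be seen on simulated data. In this sketch `x` and `y` are hypothetical stand-ins for the two measured variables in the figure.

```r
set.seed(42)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200, sd = 0.5)

# Regress y on x: minimizes vertical distances to the line
slope_y_on_x <- unname(coef(lm(y ~ x))[2])

# Regress x on y: minimizes horizontal distances; inverted to a dy/dx slope
slope_x_on_y <- 1 / unname(coef(lm(x ~ y))[2])

# First principal component: minimizes perpendicular distances
pc1 <- prcomp(cbind(x, y))$rotation[, 1]
slope_pc1 <- unname(pc1["y"] / pc1["x"])

c(y_on_x = slope_y_on_x, pc1 = slope_pc1, x_on_y = slope_x_on_y)
```

With positively correlated variables the principal component slope lies between the two regression slopes, which is the sense in which it treats both directions symmetrically.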

Dimensional Reduction

Principal components are linear combinations of the ‘old’ variables. PCA looks for the projection that maximizes the area of the shadow; an equivalent measurement is the sum of squares of the distances between points in the projection. We want to see as much of the variation as possible, and that is what PCA does.
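Both statements can be checked with `prcomp()` from base R: the scores are linear combinations of the centered and scaled variables, and the components are ordered by the variance they capture. The data matrix `X` and its column names are simulated for illustration only.

```r
set.seed(7)
# Made-up measurements on 30 specimens
X <- cbind(
  length = rnorm(30, 100, 10),
  width  = rnorm(30, 80, 8),
  height = rnorm(30, 40, 4)
)
X[, "width"] <- X[, "width"] + 0.6 * (X[, "length"] - 100)  # induce correlation

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Each principal component is a linear combination of the 'old' variables:
Z <- scale(X)                           # centered and scaled variables
scores_by_hand <- Z %*% pca$rotation    # weights are the columns of `rotation`
all.equal(unname(scores_by_hand), unname(pca$x))  # TRUE

# Share of the total variance seen along each axis (the scree plot values)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
var_explained
```

The first entries of `var_explained` indicate how many axes are worth looking at before plotting an ordination.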

The PCA workflow

Ordination Using the Tree

1. UniFrac-PCoA
2. Double Principal Coordinates Analysis (DPCoA)

(Un)supervised Learning

Ordination Best Practice

1. Always look at the scree plot
2. Variables, samples
3. Biplot
4. Altogether (if readable)

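library(ade4)  # provides dudi.pca() and its scatter() method
# PCA on every column except the first; scannf = FALSE skips the interactive scree prompt, nf = 2 keeps two axes
pca.turtles <- dudi.pca(Turtles[, -1], scannf = FALSE, nf = 2)
scatter(pca.turtles)  # biplot of samples and variables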

(Un)supervised Learning

What did we “learn”? Depends on the data.

• How many axes are probably useful?
• Are there clusters? How many?
• Are there gradients?
• Are the patterns consistent with covariates (e.g. sample observations)?
• How might we test this?


• Are there clusters? How many?

Gap Statistic


• Are there gradients?

PCA regression


• Are the patterns consistent with covariates?
• How might we test this?

(Permutational) Multivariate ANOVA: vegan::adonis( )
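To show what a permutational multivariate ANOVA does, here is a minimal base-R sketch of a one-factor test on a distance matrix: compute a pseudo-F statistic from between-group and within-group sums of squared distances, then compare it with the values obtained after shuffling the group labels. This is a simplified illustration, not the vegan implementation. The function names `pseudo_F` and `permanova` and the simulated data are hypothetical.

```r
# Pseudo-F for one grouping factor, computed from pairwise distances
pseudo_F <- function(d, groups) {
  d2 <- as.matrix(d)^2
  N <- nrow(d2)
  groups <- factor(groups)
  a <- nlevels(groups)
  ss_total <- sum(d2[upper.tri(d2)]) / N
  ss_within <- sum(sapply(levels(groups), function(g) {
    idx <- which(groups == g)
    sub <- d2[idx, idx, drop = FALSE]
    sum(sub[upper.tri(sub)]) / length(idx)
  }))
  ss_among <- ss_total - ss_within
  (ss_among / (a - 1)) / (ss_within / (N - a))
}

# Permutation test: shuffle labels, recompute pseudo-F each time
permanova <- function(d, groups, n_perm = 999) {
  f_obs <- pseudo_F(d, groups)
  f_perm <- replicate(n_perm, pseudo_F(d, sample(groups)))
  list(F = f_obs, p = (1 + sum(f_perm >= f_obs)) / (1 + n_perm))
}

set.seed(123)
# Made-up data: two groups of 6 samples, shifted along 2 of 5 variables
X <- matrix(rnorm(12 * 5), nrow = 12)
groups <- rep(c("A", "B"), each = 6)
X[groups == "B", 1:2] <- X[groups == "B", 1:2] + 3

res <- permanova(dist(X), groups)
res$F
res$p
```

In practice the vegan package handles multiple terms, strata and other details; recent vegan releases provide `adonis2()` alongside the `adonis()` function named on the slide. With Euclidean distances on a single variable, the pseudo-F above reduces to the ordinary one-way ANOVA F statistic.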
