
Some slides from Prof. A. Alekseyenko, NYU, and Prof. S. Holmes, Stanford

Lecture 2: Diversity, Distances, adonis

• “Diversity”: alpha, beta (, gamma)
• Beta-diversity in practice: ecological distances
• Unsupervised learning: clustering, etc.
• Ordination: e.g. PCA, UniFrac/PCoA, DPCoA
• Testing: Permutational Multivariate ANOVA

Alpha-Diversity

Alpha diversity definition(s)

• Alpha diversity describes the diversity of a single community (specimen).
• In statistical terms, it is a scalar statistic computed for a single observation (column) that represents the diversity of that observation.
• There are many statistics that can describe diversity: e.g. taxonomic richness, evenness, dominance, etc.

Rank abundance plots

[Figure: rank abundance plots]

Species richness

• Suppose we observe a community that can contain up to $k$ ‘species’.
• The relative proportions of the species are $P = \{p_1, \dots, p_k\}$.
• Richness is computed as

$$R = \mathbf{1}(p_1) + \mathbf{1}(p_2) + \dots + \mathbf{1}(p_k)$$

where $\mathbf{1}(\cdot)$ is an indicator function, i.e. $\mathbf{1}(p_i) = 1$ if $p_i \neq 0$, and 0 otherwise.

• Higher $R$ means greater diversity.
• Richness depends strongly on the depth of sampling and is sensitive to the presence of rare species.
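The richness formula above can be sketched in a few lines of Python. The function name and the example community are made up for illustration; they are not from the slides.

```python
def richness(proportions):
    """Count the species whose relative abundance is non-zero (the indicator sum R)."""
    return sum(1 for p in proportions if p != 0)


# Hypothetical community with k = 5 possible species, two of them unobserved.
community = [0.5, 0.3, 0.2, 0.0, 0.0]
print(richness(community))  # 3
```

Because only the zero/non-zero status of each proportion matters, a single extra read from a rare species changes R by one, which is why the statistic is so sensitive to sampling depth.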

Rarefaction Curves

• Sanders 1968
• non-parametric richness
• estimate coverage

[Figure: rarefaction curves. y-axis: number of species; x-axis: # observations / library size / # reads / sample size]

Sanders, H. L. (1968). Marine benthic diversity: a comparative study. American Naturalist.
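One common way to draw a rarefaction curve is to compute the expected number of species in a random subsample of n reads taken without replacement. The sketch below uses the standard hypergeometric expectation; it is an illustration under that assumption, not necessarily the exact procedure behind the slide's figure, and the counts are invented.

```python
from math import comb


def rarefied_richness(counts, n):
    """Expected number of species in a subsample of n reads drawn without replacement."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the library size")
    denom = comb(total, n)
    # For each species: 1 minus the probability that none of its reads is drawn.
    return sum(1 - comb(total - c, n) / denom for c in counts if c > 0)


# Hypothetical read counts for one sample (library size 100).
counts = [50, 30, 15, 4, 1]
for depth in (1, 10, 25, 50, 100):
    print(depth, round(rarefied_richness(counts, depth), 2))
```

Plotting the printed values against depth gives one curve of the kind shown on the slide; a curve that is still rising at the full library size suggests that more species would be found with deeper sampling.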

Shannon index

• Suppose we observe a community that can contain up to $k$ ‘species’.
• The relative proportions of the species are $P = \{p_1, \dots, p_k\}$.
• The Shannon index is related to the notion of information content from information theory. It roughly represents how uncertain the outcome of a random draw from the distribution $P$ is.
• When $p_i = p_j$ for all $i$ and $j$, we have no information about which species a random draw will result in. As the inequality among the proportions becomes more pronounced, we gain more information about the possible outcome of the draw. The Shannon index captures this property of the distribution.
• The Shannon index is computed as

$$S_k = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \dots - p_k \log_2 p_k$$

Note that as $p_i \to 0$, $\log_2 p_i \to -\infty$; we therefore define $p_i \log_2 p_i = 0$ when $p_i = 0$.

• Higher $S_k$ means higher diversity.

“Shannon entropy”: http://en.wikipedia.org/wiki/Entropy_(information_theory)
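A minimal sketch of the Shannon index as defined above, using base-2 logarithms and the convention that terms with a zero proportion contribute nothing. The two example communities are hypothetical.

```python
from math import log2


def shannon(proportions):
    """Shannon index in bits; terms with p = 0 contribute 0 by convention."""
    return sum(-p * log2(p) for p in proportions if p > 0)


even = [0.25, 0.25, 0.25, 0.25]    # hypothetical, perfectly even community
skewed = [0.85, 0.05, 0.05, 0.05]  # hypothetical, one dominant species

print(shannon(even))    # 2.0, the maximum log2(4)
print(shannon(skewed))  # lower than 2.0
```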

From Shannon to Evenness

• The Shannon index for a community of $k$ species has a maximum at $\log_2 k$.
• We can make different communities more comparable if we normalize by the maximum.
• The evenness index is computed as $E_k = S_k / \log_2 k$.
• $E_k = 1$ means total evenness.
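The normalization can be sketched as follows. Here k is taken to be the length of the proportion vector, i.e. the number of species the community could contain, which is an assumption about how the input is laid out; the example values are invented.

```python
from math import log2


def shannon(proportions):
    """Shannon index in bits; terms with p = 0 contribute 0 by convention."""
    return sum(-p * log2(p) for p in proportions if p > 0)


def evenness(proportions):
    """Shannon index divided by its maximum, log2(k), with k = len(proportions)."""
    k = len(proportions)
    if k < 2:
        raise ValueError("evenness needs at least two possible species")
    return shannon(proportions) / log2(k)


print(evenness([0.25, 0.25, 0.25, 0.25]))  # 1.0
print(evenness([0.85, 0.05, 0.05, 0.05]))  # below 1
```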

Simpson index

• Suppose we observe a community that can contain up to $k$ ‘species’.
• The relative proportions of the species are $P = \{p_1, \dots, p_k\}$.
• The Simpson index is built from the probability of resampling the same species on two consecutive draws with replacement.
• Suppose on the first draw we picked species $i$; this event has probability $p_i$, hence the probability of drawing that species twice is $p_i \cdot p_i = p_i^2$. Summing over all species gives the probability of drawing any species twice.
• The Simpson index is usually computed as

$$D = 1 - (p_1^2 + p_2^2 + \dots + p_k^2)$$

In this form, the index represents the probability that two individuals randomly selected from a sample will belong to different species.

• $D = 0$ means no diversity (one species is completely dominant).
• $D$ close to 1 means high diversity; with $k$ equally abundant species, $D$ reaches its maximum of $1 - 1/k$.
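A short sketch of the Simpson index in the "1 minus" form given above. The function name and the example communities are illustrative only.

```python
def simpson(proportions):
    """Probability that two draws (with replacement) belong to different species."""
    return 1 - sum(p * p for p in proportions)


print(simpson([1.0, 0.0, 0.0]))            # 0.0, one species is completely dominant
print(simpson([0.25, 0.25, 0.25, 0.25]))   # 0.75, the maximum 1 - 1/k for k = 4
```

The second call shows why D can approach 1 only as the number of equally abundant species grows: for k even species the value is 1 - 1/k.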

Numbers equivalent diversity

• Often it is convenient to talk about alpha diversity in terms of equivalent units:
  – How many equally abundant taxa would it take to get the same diversity as we see in a given community?
• For richness, the statistic is unchanged.
• For Shannon, remember that $\log_2 k$ is the maximum, which is attained when all species have equal abundance. Hence the diversity in equivalent units is $2^{S_k}$.
• For Simpson, the equivalent-units measure of diversity is $1/(1-D)$, sometimes called the “Inverse Simpson Index”.
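As a rough check of the conversions above, the sketch below turns the Shannon and Simpson indices into equivalent numbers of equally abundant taxa. The helper names and the uneven example community are hypothetical.

```python
from math import log2


def shannon(proportions):
    """Shannon index in bits; terms with p = 0 contribute 0 by convention."""
    return sum(-p * log2(p) for p in proportions if p > 0)


def simpson(proportions):
    """Simpson index in the 1 - sum(p_i^2) form."""
    return 1 - sum(p * p for p in proportions)


def shannon_equivalent(proportions):
    """Number of equally abundant taxa with the same Shannon index: 2 ** S_k."""
    return 2 ** shannon(proportions)


def inverse_simpson(proportions):
    """Number of equally abundant taxa with the same Simpson index: 1 / (1 - D)."""
    return 1 / (1 - simpson(proportions))


even = [0.25, 0.25, 0.25, 0.25]
uneven = [0.7, 0.1, 0.1, 0.1]  # hypothetical community with one dominant taxon

print(shannon_equivalent(even), inverse_simpson(even))      # both 4.0
print(shannon_equivalent(uneven), inverse_simpson(uneven))  # both below 4
```

For the even community both conversions return the actual number of taxa; for the uneven one they fall below it, and the inverse Simpson value drops further because it weights the dominant taxon more heavily.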

Beta-Diversity

Beta-Diversity

http://en.wikipedia.org/wiki/Beta_diversity

• Microbial ecologists typically use beta diversity as a broad umbrella term that can refer to any of several indices related to compositional differences (differences in species content between samples).
• For some reason this is contentious, and there appears to be ongoing (and pointless?) argument over the possible definitions.
• For our purposes, and in microbiome research, when you hear “beta-diversity”, you can probably think: “Diversity of species composition”

Summary of diversity “types”

• α: diversity within a community, # of species only
• β: diversity between communities (differentiation); species identity is taken into account
• γ: (global) diversity of the site
• Theoretically, one would wish to use measures for which γ = α × β.
• This is only possible if α and β are independent of each other.

Beta-Diversity “in practice”

1. UniFrac or Bray-Curtis distance between samples
2. MDS (“PCoA”)
3. Plot first two axes
4. Admire clusters
5. Write paper
6. Choose new microbiomes
7. Return to Step 1, repeat

Why? Let’s back up. This is one option in an arsenal of dimensional reduction methods that come from “unsupervised learning” in “exploratory data analysis”.
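Step 1 of the recipe above starts from a sample-by-sample distance matrix. As an illustration, here is a minimal Bray-Curtis sketch on raw counts (UniFrac is not shown because it also needs a phylogenetic tree). The sample names and counts are invented.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors over the same taxa."""
    if len(x) != len(y):
        raise ValueError("both samples must cover the same taxa")
    total = sum(x) + sum(y)
    if total == 0:
        raise ValueError("at least one sample must be non-empty")
    return sum(abs(a - b) for a, b in zip(x, y)) / total


# Hypothetical count table: one row per sample, one column per taxon.
samples = {
    "s1": [10, 0, 5],
    "s2": [8, 1, 6],
    "s3": [0, 7, 0],
}

names = sorted(samples)
# Full pairwise distance matrix, the input to MDS/PCoA in step 2.
dist = [[bray_curtis(samples[a], samples[b]) for b in names] for a in names]
for name, row in zip(names, dist):
    print(name, [round(d, 3) for d in row])
```

The value is 0 for identical samples and 1 for samples that share no taxa, so s3 ends up at distance 1 from s1.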

Dimensional Reduction

[Figure: two regression panels, “Regress disc on weight” and “Regress weight on disc”]

Dimensional Reduction

Minimize the distance to the line in both directions. The purple line is the principal component line.

Dimensional Reduction

Principal components are linear combinations of the ‘old’ variables. PCA looks for the projection that maximizes the area of the shadow; an equivalent measure is the sum of squares of the distances between points in the projection. We want to see as much of the variation as possible, and that is what PCA does.
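For two variables the difference between the two regressions and the principal component line can be worked out by hand. The sketch below is only an illustration: the variable names echo the slide, the numbers are invented, and it assumes the two variables are correlated (non-zero cross product). It compares the slope that minimizes vertical distances, the slope that minimizes horizontal distances, and the first principal axis, which minimizes perpendicular distances.

```python
from math import atan2, tan


def line_slopes(xs, ys):
    """Return (slope of y regressed on x, slope of PC1, slope implied by x regressed on y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    y_on_x = sxy / sxx  # minimizes vertical distances
    x_on_y = syy / sxy  # minimizes horizontal distances, written as dy/dx
    # Direction of largest variance of the 2x2 scatter matrix [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    return y_on_x, tan(theta), x_on_y


# Hypothetical measurements, for illustration only.
weight = [1.0, 2.0, 3.0, 4.0, 5.0]
disc = [2.1, 2.9, 4.2, 4.8, 6.5]
print(line_slopes(weight, disc))
```

With positively correlated data the principal component slope falls between the two regression slopes, which matches the picture of a line fitted "in both directions".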

The PCA workflow

Ordination Using the Tree

1. UniFrac-PCoA
2. Double Principal Coordinates (DPCoA)

(Un)supervised Learning: Ordination Best Practice

1. Always look at the scree plot
2. Variables, samples
3. Biplot
4. Altogether (if readable)

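(Un)supervised Learning: Ordination Best Practice

library(ade4)  # provides dudi.pca() and its scatter() method
# Turtles is assumed to be a data frame loaded beforehand; its first column is dropped
pca.turtles <- dudi.pca(Turtles[, -1], scannf = FALSE, nf = 2)
scatter(pca.turtles)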

(Un)supervised Learning: What did we “learn”? Depends on the data.

• How many axes are probably useful?
• Are there clusters? How many?
• Are there gradients?
• Are the patterns consistent with covariates (e.g. sample observations)?
• How might we test this?

(Un)supervised Learning: What did we “learn”? Depends on the data.

• Are there clusters? How many? → Gap Statistic

(Un)supervised Learning: What did we “learn”? Depends on the data.

• Are there gradients? → PCA regression

(Un)supervised Learning: What did we “learn”? Depends on the data.

• Are the patterns consistent with covariates?
• How might we test this?

(Permutational) Multivariate ANOVA: vegan::adonis()
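To make the idea behind a permutational multivariate ANOVA concrete, here is a simplified one-factor sketch that works directly on a distance matrix: it computes a pseudo-F ratio from sums of squared distances and compares it with the values obtained after shuffling the group labels. This is an illustration of the principle only, not the implementation in vegan::adonis(); the function names, the toy data, and the default of 999 permutations are assumptions.

```python
import random
from itertools import combinations


def pseudo_f(dist, groups):
    """Pseudo-F for a one-factor design, computed from pairwise distances."""
    n = len(groups)
    labels = sorted(set(groups))
    a = len(labels)
    # Total sum of squares: squared distances over all pairs, divided by n.
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares: same quantity inside each group.
    ss_within = 0.0
    for g in labels:
        idx = [i for i in range(n) if groups[i] == g]
        ss_within += sum(dist[i][j] ** 2 for i, j in combinations(idx, 2)) / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(dist, groups, permutations=999, seed=0):
    """Return the observed pseudo-F and a permutation p-value."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # break any link between labels and distances
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical one-dimensional "communities" forming two well-separated groups.
points = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
groups = ["A"] * 4 + ["B"] * 4
dist = [[abs(p - q) for q in points] for p in points]

f_stat, p_value = permanova(dist, groups)
print(f_stat, p_value)
```

Any distance can be plugged in for `dist` (for example Bray-Curtis or UniFrac), which is what makes the permutation approach attractive for microbiome data where the usual MANOVA assumptions are hard to justify.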
