Statistical methods for microbial community comparison

James Robert White, April 2008

Advisor: Mihai Pop, CBCB.


Outline


• Brief background in metagenomics

• Introduce my problem

• Methods

• Applications

• Future work


Biology!

• Every microbe has a conserved gene called 16S rRNA.

• Easy to recognize and exists in all known microbes.

Bacillus anthracis

E. coli

Mycobacterium tuberculosis


Metagenomics

16S gene. . . TAGTCCATGACAGTACCGTACAAAA . . .


Prochlorococcus marinus


Census

Environment (radioactive waste)


Census

[Figure: pie chart of the taxa census for the environment (radioactive waste); slices: 75%, 10%, 6%, 5%, 3%, 1%]


The Problem

(Healthy colons) (Sick colons)

• How do two environments differ?

• Which organisms are differentially abundant?


Taxa abundance matrix (columns: subjects, healthy colons then sick colons; rows: taxa)

      p1   p2   p3   p4   p5   p6   p7
t1   243  300  120    0   43   21   66
t2    12   34   32    0    0    0    0
t3     0    3   10  200  140  134   70
t4    42    4   12   54   76   80   60
t5     2    0   10    4    6    0    0
t6     5    5    3   15   12    0   43


Differential abundance

• convert frequencies to relative proportions.

• compute sample means, variances.

      p1   p2   p3   p4   p5   p6   p7
t1   .37  .42  .35  0.0  .10  .05  .17
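The two steps on this slide can be sketched in a few lines of Python. This is only an illustration: the counts are the toy taxa abundance matrix shown earlier, the split of columns into healthy (p1 to p3) and sick (p4 to p7) is an assumption, and the function names are hypothetical. The proportions it produces are for the six rows shown only and are not meant to reproduce the t1 row above.

```python
from statistics import mean, variance

# Toy taxa abundance matrix: rows are taxa, columns are subjects p1..p7.
counts = {
    "t1": [243, 300, 120, 0, 43, 21, 66],
    "t2": [12, 34, 32, 0, 0, 0, 0],
    "t3": [0, 3, 10, 200, 140, 134, 70],
    "t4": [42, 4, 12, 54, 76, 80, 60],
    "t5": [2, 0, 10, 4, 6, 0, 0],
    "t6": [5, 5, 3, 15, 12, 0, 43],
}

# ASSUMPTION: the first three columns are healthy subjects, the last four sick.
groups = {"healthy": [0, 1, 2], "sick": [3, 4, 5, 6]}


def to_proportions(counts):
    """Divide each count by its subject (column) total."""
    n_subjects = len(next(iter(counts.values())))
    totals = [sum(row[j] for row in counts.values()) for j in range(n_subjects)]
    return {
        taxon: [row[j] / totals[j] for j in range(n_subjects)]
        for taxon, row in counts.items()
    }


def group_stats(props, groups):
    """Sample mean and sample variance of each taxon's proportion per group."""
    stats = {}
    for taxon, row in props.items():
        stats[taxon] = {}
        for name, idx in groups.items():
            values = [row[j] for j in idx]
            stats[taxon][name] = (mean(values), variance(values))
    return stats


props = to_proportions(counts)
stats = group_stats(props, groups)
for taxon, per_group in stats.items():
    print(taxon, {g: (round(m, 3), round(v, 4)) for g, (m, v) in per_group.items()})
```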


Hypothesis tests

• For each taxon Ti, we perform a hypothesis test of proportions:

• H0: μhealthy = μsick

• HA: μhealthy ≠ μsick

• We obtain a test statistic ti and a corresponding p-value.

• Reject or accept the null hypothesis?
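As a rough illustration of the test statistic ti, here is a two-sample t statistic in Python. The slides do not say which variant is used, so the unequal-variance (Welch) form here is an assumption, as is the grouping of the t1 proportions into three healthy and four sick subjects.

```python
from math import inf, sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Two-sample t statistic with unequal variances (Welch form)."""
    diff = mean(x) - mean(y)
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    if se == 0.0:
        # Degenerate case: both groups are constant.
        if diff == 0:
            return 0.0
        return inf if diff > 0 else -inf
    return diff / se


# Relative proportions of taxon t1 from the earlier slide.
# ASSUMPTION: p1..p3 are healthy subjects and p4..p7 are sick subjects.
healthy = [0.37, 0.42, 0.35]
sick = [0.0, 0.10, 0.05, 0.17]

t1_stat = welch_t(healthy, sick)
print(f"t statistic for t1: {t1_stat:.2f}")
```

A large positive value suggests the taxon is more abundant in the first group; the p-value attached to it is the subject of the next slides.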


Permuted p-values

[Figure sequence (animation build): observed statistics t1 ... tM; permuted statistics t1* ... tM*; simulated t's]
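One common way to obtain a permuted p-value for a single taxon is sketched below. It is not necessarily the exact procedure pictured on these slides: the label-shuffling scheme, the Welch-style statistic, the default of 1000 permutations, and the +1 correction in the estimate are all assumptions made for illustration.

```python
import random
from math import inf, sqrt
from statistics import mean, variance


def t_stat(x, y):
    """Two-sample t statistic with unequal variances."""
    diff = mean(x) - mean(y)
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    if se == 0.0:
        if diff == 0:
            return 0.0
        return inf if diff > 0 else -inf
    return diff / se


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Two-sided p-value from shuffling subjects between the two groups."""
    rng = random.Random(seed)
    observed = abs(t_stat(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Simulated statistic t* under the null of no group difference.
        t_star = abs(t_stat(pooled[:len(x)], pooled[len(x):]))
        if t_star >= observed:
            hits += 1
    # ASSUMPTION: add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


healthy = [0.37, 0.42, 0.35]
sick = [0.0, 0.10, 0.05, 0.17]
p_perm = permutation_p_value(healthy, sick)
print(f"permuted p-value: {p_perm:.3f}")
```

With only seven subjects there are few distinct relabelings, so the p-value cannot get much smaller no matter how many permutations are drawn.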


Multiple tests

• criteria for rejecting H0

• p-value: an estimate of significance

• choose a threshold α, and reject if p ≤ α

• if you reject all H0 with p ≤ 0.05, you expect 5% of all true H0 to be rejected (false positives).

• M = 10 tests? 1000 tests? 100,000 tests?

• Bonferroni correction
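The arithmetic behind this slide can be sketched as follows. The function names are hypothetical, and the numbers of tests are the ones asked about above, under the worst-case assumption that every null hypothesis is true.

```python
def expected_false_positives(m_true_null, alpha=0.05):
    """Expected number of true nulls rejected when testing each at level alpha."""
    return m_true_null * alpha


def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni correction: reject only if p <= alpha / M."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# ASSUMPTION: all M null hypotheses are true.
for m in (10, 1000, 100_000):
    print(m, "tests ->", expected_false_positives(m), "expected false positives;",
          "Bonferroni threshold:", 0.05 / m)

decisions = bonferroni_reject([0.001, 0.02, 0.04])
print(decisions)
```

The Bonferroni threshold shrinks in proportion to M, which is why it becomes very conservative when thousands of taxa are tested.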


FDR alternative


• False Discovery Rate - “the rate of significant features that are truly null.”

• The FDR analog of the p-value is the q-value.

• if you reject all H0 with p ≤ 0.05, you expect 5% of all true H0 to be rejected (false positives).

• if you reject all H0 with q ≤ 0.05, you expect 5% of all rejected H0 to be false positives.

• e.g. 10,000 tests.
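To make the q-value idea concrete, here is a sketch that computes Benjamini-Hochberg adjusted p-values. This is a stand-in, not necessarily the estimator used in this work: it corresponds to a q-value computed with the proportion of true nulls conservatively fixed at 1. The function name and the example p-values are made up for illustration.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical p-values for four taxa.
p_example = [0.01, 0.04, 0.03, 0.005]
q_example = bh_q_values(p_example)
print([round(v, 3) for v in q_example])
```

Rejecting every test with q ≤ 0.05 then targets a 5% share of false positives among the rejected tests, rather than among the true nulls.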


Multiple tests

             accept null   reject null   total
null true    MaT           MrT           MT
null false   MaF           MrF           MF
total        M-Mr          Mr            M
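In this notation, the two error rates contrasted on the FDR slide can be written as below. This is a sketch using the standard definitions; the max(Mr, 1) guard for the case of no rejections is a common convention and is not stated on the slide.

```latex
\[
\text{false positive rate} = \frac{\mathrm{E}[M_{rT}]}{M_T},
\qquad
\mathrm{FDR} = \mathrm{E}\!\left[\frac{M_{rT}}{\max(M_r,\,1)}\right]
\]
```

A p-value threshold bounds the first quantity (false positives relative to all true nulls), while a q-value threshold bounds the second (false positives relative to all rejections).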


Hedenfalk p values


Hedenfalk q values


Additional issues


• Low frequency taxa.

      p1   p2   p3   p4   p5   p6   p7
t1   243  300  120    0   43   21   66
t2    12   34   32    0    0    0    0
t3     0    3   10  200  140  134   70
t4    42    4   12   54   76   80   60
t5     2    0   10    4    6    0    0
t6     5    5    3   15   12    0   43

(columns: healthy colons, then sick colons)


heuristic

• N = total number of samples from treatment

• N*p ≥ 25 to use the t statistic

• equivalently, p ≥ 25/N

• e.g. N = 5000: p ≥ 25/5000 = 0.005
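The heuristic amounts to a simple cutoff, sketched here in Python. The function names are hypothetical; the constant 25 and the N = 5000 example come from the slide, and the fallback named in the code is the one the next slide proposes for small frequencies.

```python
def min_proportion_for_t(n_total, min_expected=25):
    """Smallest proportion p for which N * p reaches the required count."""
    return min_expected / n_total


def choose_test(p, n_total, min_expected=25):
    """Pick the test for one taxon according to the N * p >= 25 heuristic."""
    if n_total * p >= min_expected:
        return "t statistic"
    return "Fisher's exact test"


print(min_proportion_for_t(5000))
print(choose_test(0.01, 5000))
print(choose_test(0.001, 5000))
```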


small frequencies

• What about p < 25/N?

• If p is this small, it indicates small variance among subjects, so merge all samples into one large sample.

• Use Fisher’s exact test to find an appropriate p-value.


e.g.

     s1   s2   s3   s4   s5   s6   s7   s8   s9  s10
S     0    1    1    2    3    0    0    1    0    0
F    50   49   49   48   47   50   50   49   50   50

Merged by group (g1 = s1 to s5, g2 = s6 to s10):

     g1   g2
S     7    1
F   243  249
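The merged 2x2 table can be fed to Fisher's exact test. The sketch below implements the test in plain Python so it has no dependencies; the two-sided rule used here (summing the probabilities of all tables no more likely than the observed one) is one common convention and is an assumption, as is the function name.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def pmf(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = pmf(a)
    # Small relative tolerance so ties are not lost to floating-point rounding.
    p_value = sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p_value)


# Rows: S (taxon observed), F (taxon not observed); columns: g1, g2.
p_fisher = fisher_exact_two_sided(7, 1, 243, 249)
print(f"Fisher's exact test p-value: {p_fisher:.4f}")
```

Merging first matters here: no single sample has more than three occurrences of the taxon, so a per-subject t statistic would rest on very little data.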


Real 16S data


• Ley et al. 2006, Nature

• metagenomic study of differentially abundant taxa between human guts of obese (12) and lean (5) people

• found significant differences between two high level taxa: Bacteroidetes and Firmicutes

• we generated taxa abundance matrices from original data and tried to replicate their results.


Ley et al. data

[Figure: bar chart, Obese vs. Lean; y-axis: relative abundance (%), 0 to 100; taxa: Firmicutes, Bacteroidetes, Actinobacteria; bars show mean ± std. err; p ≤ 0.05]


human vs. mouse


• Two 16S distal gut studies:

• 5 lean control humans (Ley et al., 2006)

• 12 lean control mice (Ley et al., PNAS, 2005)

• 6,250 16S sequences.

• assigned using the RDP II Bayesian classifier


Metabolic profiles


• Dinsdale et al., Nature, 2008.

• Collected 87 microbial and viral metagenomes.

• 15 million shotgun sequences!

• subterranean, coral reefs, hypersaline, freshwater, animal guts, mosquito viruses.


[Figure: metabolic profiles (mean ± std. err; p ≤ 0.02)]


Timeline

• December
  • Consider statistical methodology given sampling issues.
  • Develop at least two methodologies to compare.
  • Design broad simulation to test q-values vs. p-values.

• January
  • Finish broad simulation.
  • Finalize statistical methodology.
  • Finish application of software to Ley data.

• February
  • Apply best method to additional metagenomic data.
  • Develop documentation for software.

• April
  • Complete final draft of report including edits from advisor.
  • Submit polished version of our software to BioConductor group.

• May
  • Deliver final report.
  • Final presentation.

• Beyond
  • Submit paper.
  • New data.
  • Correlations between taxa.


Acknowledgments

• Mihai Pop, CBCB

• Andrea Ottesen, PLSC

• Frank Siewerdt, ANSC

• Paul Smith, STAT

• Radu Balan, CSCAMM

• Aleksey Zimin, IPST


Questions?
