

Text Modeling meets the Microbiome - Stanford University · statweb.stanford.edu/~kriss1/microbiome_plvm_poster.pdf · 2017-10-01

Text Modeling meets the Microbiome
Kris Sankaran and Susan P. Holmes
Department of Statistics, Stanford University

Abstract

The human microbiome is a complex ecological system, and describing its structure and function under different environmental conditions is important from both basic scientific and medical perspectives. Viewed through a biostatistical lens, many microbiome analysis goals can be approached through latent variable modeling, for which a range of techniques are available. We develop the analogy between text modeling, where documents are approximated as mixtures of topics, and bacterial count modeling, where samples are approximated as mixtures of ecological states, focusing on applications of Latent Dirichlet Allocation (LDA), Nonnegative Matrix Factorization (NMF), and Dynamic Unigrams to the microbiome. We further illustrate and compare techniques using the data of Dethlefsen and Relman [2011], a study on the effects of antibiotics on bacterial community composition. A complete preprint is available [Sankaran and Holmes, 2017], and code and data for reproducing model estimates and figures can be found at https://github.com/krisrs1128/microbiome_plvm/.

Methods

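• LDA: Let $x_{dv}$ be the number of times word $v$ occurs in document $d$. Suppose the $k$th topic places weight $\beta_{vk}$ on the $v$th term, so that $\beta_k \in \mathcal{S}^{V-1}$, the probability simplex over the $V$ terms. Suppose there are $N_d$ words in the $d$th document. Then,

$$
\begin{aligned}
x_{d\cdot} \mid (\beta_k)_{k=1}^{K} &\overset{iid}{\sim} \mathrm{Mult}(N_d, B\theta_d) && \text{for } d = 1, \dots, D \\
\theta_d &\overset{iid}{\sim} \mathrm{Dir}(\alpha) && \text{for } d = 1, \dots, D \\
\beta_k &\overset{iid}{\sim} \mathrm{Dir}(\gamma) && \text{for } k = 1, \dots, K,
\end{aligned}
$$

where $B = (\beta_1, \dots, \beta_K)$.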

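• Dynamic Unigrams: In light of LDA's geometric interpretation, we might consider in some situations a model that identifies samples with a continuous curve on this $(V-1)$-dimensional simplex [Blei and Lafferty, 2006],

$$
\begin{aligned}
x_{d\cdot} \mid \mu_{t(d)} &\overset{iid}{\sim} \mathrm{Mult}\left(N_d, S\left(\mu_{t(d)}\right)\right) && \text{for } d = 1, \dots, D \\
\mu_t \mid \mu_{t-1} &\overset{iid}{\sim} \mathcal{N}\left(\mu_{t-1}, \sigma^2 I_V\right) && \text{for } t = 1, \dots, T \\
\mu_0 &\overset{iid}{\sim} \mathcal{N}\left(0, \sigma^2 I_V\right),
\end{aligned}
$$

where $S$ is the multilogit link

$$
[S(\mu)]_v = \frac{\exp(\mu_v)}{\sum_{v'} \exp(\mu_{v'})},
$$

and $t(d)$ maps document $d$ to the time it was sampled.

• NMF: We consider a Gamma-Poisson factorization model [Zhou and Carin, 2015] that models the $D \times V$ counts matrix $X$ by

$$
\begin{aligned}
X &\sim \mathrm{Poi}\left(\Theta B^{T}\right) \\
\Theta &\sim \mathrm{Gam}\left(a_0 1_{D \times K}, b_0 1_{D \times K}\right) \\
B &\sim \mathrm{Gam}\left(c_0 1_{V \times K}, d_0 1_{V \times K}\right),
\end{aligned}
$$

where our notation means that each entry in these matrices is sampled independently, with parameters given by the corresponding entry in the parameter matrix. In our simulation studies, we also consider a slight variant, which independently sends entries of $X$ to zero with probability $p_0$.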
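The three generative models above can be simulated directly, which is a useful way to check one's reading of the notation. The following is a minimal sketch using only the Python standard library, not the authors' implementation (their code lives in the repository linked in the abstract). It makes several simplifying assumptions that are not in the poster: every sample has the same depth N, the Dynamic Unigrams simulation has exactly one sample per time point so that t(d) = d, the Dirichlet parameters are symmetric, and the second Gamma parameter is read as a rate. All function names are hypothetical.

```python
import math
import random


def dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


def multinomial(n, probs, rng):
    """Draw Mult(n, probs) counts by tallying n categorical draws."""
    counts = [0] * len(probs)
    for v in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[v] += 1
    return counts


def multilogit(mu):
    """The link S: exponentiate and normalize (max subtracted for stability)."""
    m = max(mu)
    e = [math.exp(x - m) for x in mu]
    total = sum(e)
    return [x / total for x in e]


def poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate_lda(D, V, K, N, alpha, gamma, rng):
    """x_d ~ Mult(N, B theta_d), theta_d ~ Dir(alpha), beta_k ~ Dir(gamma)."""
    beta = [dirichlet([gamma] * V, rng) for _ in range(K)]
    theta = [dirichlet([alpha] * K, rng) for _ in range(D)]
    X = []
    for d in range(D):
        # Entry v of B theta_d: mixture of topic weights for species v.
        probs = [sum(theta[d][k] * beta[k][v] for k in range(K)) for v in range(V)]
        X.append(multinomial(N, probs, rng))
    return X, theta, beta


def simulate_dynamic_unigrams(T, V, N, sigma, rng):
    """Gaussian random walk mu_t pushed through S; assumes t(d) = d."""
    mu = [rng.gauss(0.0, sigma) for _ in range(V)]  # mu_0
    X = []
    for _ in range(T):
        mu = [m + rng.gauss(0.0, sigma) for m in mu]  # mu_t given mu_{t-1}
        X.append(multinomial(N, multilogit(mu), rng))
    return X


def simulate_gamma_poisson(D, V, K, a0, b0, c0, d0, p0, rng):
    """X ~ Poi(Theta B^T) with Gamma entries; b0 and d0 are treated as rates."""
    theta = [[rng.gammavariate(a0, 1.0 / b0) for _ in range(K)] for _ in range(D)]
    B = [[rng.gammavariate(c0, 1.0 / d0) for _ in range(K)] for _ in range(V)]
    X = []
    for d in range(D):
        row = []
        for v in range(V):
            rate = sum(theta[d][k] * B[v][k] for k in range(K))
            # Zero-inflated variant: send the entry to zero with probability p0.
            row.append(0 if rng.random() < p0 else poisson(rate, rng))
        X.append(row)
    return X
```

For example, `simulate_lda(D=5, V=6, K=2, N=50, alpha=1.0, gamma=1.0, rng=random.Random(0))` returns a 5 × 6 count matrix whose rows each sum to 50, together with the memberships and topics that generated it. Setting `p0=0.0` in `simulate_gamma_poisson` recovers the plain Gamma-Poisson model.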

Microbiome vs. Text Analysis

Methods popular in text analysis can be adapted to the microbiome setting, upon making the following connections.

• Document ⇐⇒ Biological Sample: The basic sampling units, over which conclusions are generalized, are documents (text analysis) and biological samples (microbiome analysis). It is of interest to highlight similarities and differences across these units, often through some variation on clustering or dimensionality reduction.

• Term ⇐⇒ Bacterial species: The fundamental features with which to describe samples are the counts of terms (text analysis) and bacterial species (microbiome analysis). More formally, by bacterial species, we mean Amplicon Sequence Variants [Callahan et al., 2017].

• Topic ⇐⇒ Community: For interpretation, it is common to imagine "prototypical" units which can be used as a point of reference for observed samples. In text analysis, these are called topics: for example, "business" or "politics" articles have their own specific vocabularies. In microbiome analysis, on the other hand, these are called "communities": different communities have different bacterial signatures.

• Word ⇐⇒ Sequencing Read: A "word" in text analysis refers to a single instance of a term in a document, not its total count. The analog in microbiome analysis is an individual read that has been mapped to a unique sequence variant, though this is rarely an object of intrinsic interest.
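The correspondences above can be made concrete by building the count matrix from individual reads, just as a document-term matrix is built from individual words. The sketch below is illustrative only: the sample and variant identifiers are hypothetical, and the function name `count_matrix` is not from the poster or its code repository.

```python
from collections import Counter


def count_matrix(reads):
    """Tally (sample, variant) reads into a samples-by-variants count matrix."""
    reads = list(reads)
    samples = sorted({s for s, _ in reads})    # "documents"
    variants = sorted({v for _, v in reads})   # "terms"
    tally = Counter(reads)
    X = [[tally[(s, v)] for v in variants] for s in samples]
    return samples, variants, X


# Hypothetical reads: each pair is one "word" (a read mapped to a variant).
reads = [
    ("day_01", "ASV_a"), ("day_01", "ASV_a"), ("day_01", "ASV_b"),
    ("day_02", "ASV_b"), ("day_02", "ASV_c"), ("day_02", "ASV_c"),
]
samples, variants, X = count_matrix(reads)
depths = [sum(row) for row in X]  # N_d, the number of reads per sample
```

Here `X` plays the role of the matrix with entries $x_{dv}$, and `depths` holds the per-sample totals $N_d$ that appear in the multinomial models.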

Case Study

We reanalyze the data of Dethlefsen and Relman [2011], a study of bacterial dynamics in response to antibiotic treatment. This study monitored three patients over time, with two antibiotic time courses introduced within small windows, in order to study the effect of antibiotic perturbations within the context of natural long-term dynamics.

Variation in bacterial signatures tends to be dominated by strong inter-subject effects [Eckburg et al., 2005], and with only three subjects, there is little reason to cluster across subjects. Instead, we focus on Subject F, who had been reported to exhibit incomplete recovery of the pre-antibiotic treatment bacterial community.

Figure 1: Boxplots represent approximate posteriors for estimated mixture memberships $\theta_d$, and their evolution over time. Each row of panels provides a different sequence of $\theta_{dk}$ for a single $k$, and different columns distinguish different phases of sampling. Note that the y-axis is on the g-scale, which is defined as a translated logit, $g(p) := \left(\log p_1 - \overline{\log p}, \dots, \log p_K - \overline{\log p}\right)$. The first and second antibiotic time courses result in meaningful shifts in these sequences, and there appear to be long-term effects of treatment among bacteria in Topic 3.
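The g-scale in the Figure 1 caption can be computed in a few lines. The sketch below assumes that the term subtracted from each $\log p_k$ is the average of the log memberships (the centered form of a translated logit); that reading, and the names `g` and `g_inverse`, are assumptions rather than something the poster spells out. It also assumes every $p_k$ is strictly positive.

```python
import math


def g(p):
    """Translated logit: log p_k minus the average of the log p_k."""
    logs = [math.log(x) for x in p]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def g_inverse(z):
    """Map g-scale values back to the simplex by exponentiating and normalizing."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]
```

Under this definition the transformed values sum to zero, and a sample with equal membership in every topic maps to the origin, so the sign of each coordinate shows whether a topic is over- or under-represented relative to the others.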

Figure 1 draws attention to the two antibiotic time courses, which took place between days 12-23 and 41-51. Topic 1 appears to represent the bacterial community filling niches left empty during the first time course, Topic 3 represents the bacteria that fail to recover after the first time course, and Topics 2 and 4 reflect those that thrive during antibiotic time courses, with different response times. To interpret topics in terms of their bacterial community fingerprints, we display the $\beta_k$ in Figure 2.

Figure 3 displays a subset of the results of the alternative Dynamic Unigrams approach. Differential responses to antibiotic treatment are reflected in the differential change in $\mu_{tv}$ across different $v$'s, as time is varied.

Figure 2: Each credible interval describes an approximate posterior for one $\beta_{vk}$. Coupled with Figure 1, this guides the interpretation of which bacterial taxa are more or less prevalent during antibiotic treatments. The x-axis indexes species, sorted according to phylogenetic relatedness, and the y-axis gives transformed values of the species probability under that topic. Only the 750 most abundant species are shown. Note the disappearance of otherwise abundant species within Topics 2, 4, and to some extent, 1.

Figure 3: Each posterior credible interval refers to one $\mu_{tv}$. The rows are a subset of times $t$ around the first antibiotic time course. This display is read in the same way as Figure 2. This view provides one way of smoothing abundance time series, to see how different species respond to antibiotic treatment.

Posterior Predictive Checks

Model assessment is important for qualifying interpretations, and can further guide refinements in subsequent analyses. Two example checks are provided in Figures 4 and 5. For LDA, the posterior predictive time series are on the appropriate scale with approximately the correct shape. However, for series with larger counts, the posterior predictive tends to oversmooth. For example, the drop to 0 in species 343 is not captured in any posterior predictive samples.

On the other hand, for the Dynamic Unigrams model, the posterior predictive distribution places most of its support close to the observed species series. This is reason for concern: there may be potential to produce more succinct summaries that still preserve the essential structure of the data.
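A check of the kind described above, asking whether any posterior predictive replicate reproduces an observed drop to zero, can be sketched as follows for LDA. This is an illustration under stated assumptions, not the authors' procedure: `draws` is a hypothetical list of posterior samples, each a pair (theta, beta) with theta a samples-by-topics list and beta a topics-by-species list, and `depths` holds the observed read totals per sample. How those draws are obtained is outside the scope of the sketch.

```python
import math
import random


def predictive_counts(theta_d, beta, depth, rng):
    """Simulate one sample: depth reads drawn from the mixture B theta_d."""
    V = len(beta[0])
    probs = [sum(t * b[v] for t, b in zip(theta_d, beta)) for v in range(V)]
    counts = [0] * V
    for v in rng.choices(range(V), weights=probs, k=depth):
        counts[v] += 1
    return counts


def zero_rate(draws, depths, d, v, rng):
    """Share of posterior predictive replicates with a zero count at (d, v)."""
    zeros = 0
    for theta, beta in draws:
        zeros += predictive_counts(theta[d], beta, depths[d], rng)[v] == 0
    return zeros / len(draws)


def asinh_series(counts):
    """asinh transform of counts, the scale on which abundances are plotted."""
    return [math.asinh(c) for c in counts]
```

If a species is observed at zero in sample `d` but `zero_rate` is 0 for that entry, no replicate captures the drop, which is the kind of oversmoothing noted for species 343.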

Figure 4: We can visualize simulated time series for a subset of species and compare them with the observed ones, as a posterior predictive check. Each panel represents one species. The black lines give observed asinh-transformed abundances for Subject F over time. The blue and purple dots give posterior predictive realizations for species over time, according to LDA and Dynamic Unigrams, respectively.

Figure 5: The PCA results computed on posterior predictive samples are aligned and overlaid here. The left pair of panels gives scores for each species, while the right pair provides loadings for each timepoint. The individual posterior samples have been smoothed into contours, while the posterior means are displayed as shaded text. The observed data PCA results, after alignment with posterior samples, are displayed as black text.

References

David M Blei and John D Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, pages 113–120. ACM, 2006.

Benjamin J Callahan, Paul J McMurdie, and Susan P Holmes. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. The ISME Journal, July 2017. doi: 10.1038/ismej.2017.119. URL https://doi.org/10.1038/ismej.2017.119.

Les Dethlefsen and David A Relman. Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation. Proceedings of the National Academy of Sciences, 108(Supplement 1):4554–4561, 2011.

Paul B Eckburg, Elisabeth M Bik, Charles N Bernstein, Elizabeth Purdom, Les Dethlefsen, Michael Sargent, Steven R Gill, Karen E Nelson, and David A Relman. Diversity of the human intestinal microbial flora. Science, 308(5728):1635–1638, 2005.

Kris Sankaran and Susan Holmes. Latent variable modeling for the microbiome. arXiv preprint arXiv:1706.04969, 2017.

Mingyuan Zhou and Lawrence Carin. Negative binomial process count and mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):307–320, 2015.