Order under uncertainty: probabilistic approaches to pseudotime
NIPS Machine Learning in Computational Biology
Kieran Campbell, University of Oxford
Outline
Introduction
A probabilistic model for pseudotime
Applications
Discussion
Pseudotime: artificial measure of a cell’s progression through some process
[Figure: unordered vs. ordered expression profiles of Gene A and Gene B, with cells ordered along pseudotime]
Cell ordering problem: assign each cell a pseudotime based on expression profile
• Genes differentially expressed across pseudotime
• Clusters of co-expression
Current method: Monocle
Proliferating cell
Differentiating myoblast
Interstitial mesenchymal cell
Trapnell, Cole, et al. "The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells." Nature biotechnology (2014).
Independent component analysis
Minimum spanning tree
Ordering
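The three steps listed above (dimensionality reduction, minimum spanning tree, ordering) can be sketched roughly as follows. This is a simplified illustration and not Monocle's implementation: PCA via SVD is swapped in for independent component analysis, the tree is built with a plain Prim's algorithm, and pseudotime is taken as the path length along the tree from one end of its longest path. All function names are hypothetical.

```python
import numpy as np

def reduce_dim(expr, n_components=2):
    """Project cells (rows) onto the top principal components.
    Assumption: PCA is used here as a stand-in for ICA."""
    centred = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def minimum_spanning_tree(points):
    """Prim's algorithm on Euclidean distances; returns an adjacency list."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_dist = dist[0].copy()          # cheapest known link into the tree
    best_parent = np.zeros(n, dtype=int)
    adjacency = {i: [] for i in range(n)}
    for _ in range(n - 1):
        # pick the cell outside the tree that is cheapest to attach
        j = int(np.argmin(np.where(in_tree, np.inf, best_dist)))
        p = int(best_parent[j])
        adjacency[p].append((j, dist[p, j]))
        adjacency[j].append((p, dist[p, j]))
        in_tree[j] = True
        closer = dist[j] < best_dist
        best_dist[closer] = dist[j][closer]
        best_parent[closer] = j
    return adjacency

def tree_distances(adjacency, root):
    """Path length from root to every cell along tree edges."""
    d = np.full(len(adjacency), np.nan)
    d[root] = 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        for neighbour, weight in adjacency[node]:
            if np.isnan(d[neighbour]):
                d[neighbour] = d[node] + weight
                stack.append(neighbour)
    return d

def pseudotime_from_mst(expr):
    """Point-estimate pseudotime in [0, 1] for each cell (row of expr)."""
    adjacency = minimum_spanning_tree(reduce_dim(expr))
    # the cell farthest from an arbitrary start is one end of the longest path
    root = int(np.argmax(tree_distances(adjacency, 0)))
    d = tree_distances(adjacency, root)
    return d / d.max()
```

The result is a single number per cell, with no indication of how far the ordering could move under a different tree or a different reduction.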
What about uncertainty?
All current methods give point estimates of pseudotime
Easy to say whether one cell precedes another
Could have large impact on downstream analyses
Gaussian Process Latent Variable Models
Gaussian processes - nonparametric prior on functions
GP latent variable models treat the input x as unknown; here the latent input is the pseudotime t
Behaviour defined entirely by covariance matrix between different t
Rasmussen & Williams 2006
Lawrence 2004
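As a small illustration of "behaviour defined entirely by the covariance matrix", the sketch below builds a covariance matrix over a set of pseudotimes and draws one function from the GP prior. The squared exponential kernel, the lengthscale and the variance are assumptions chosen for illustration; the slides do not state which kernel was used.

```python
import numpy as np

def squared_exponential(t, lengthscale=0.2, variance=1.0):
    """Covariance between function values at every pair of pseudotimes.
    Assumption: squared exponential kernel with illustrative parameters."""
    diff = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 1, size=50))   # made-up pseudotimes on [0, 1]
K = squared_exponential(t)

# Draw one smooth curve from the GP prior: f = L z, where K = L L^T.
# A small jitter on the diagonal keeps the Cholesky factorisation stable.
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(t)))
sample = L @ rng.standard_normal(len(t))
```

A shorter lengthscale makes distant pseudotimes less correlated and the sampled curve wigglier; a longer one makes it smoother.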
Probabilistic approaches to pseudotime
Want to learn pseudotime from reduced dimension representation
Bayesian GPLVM to learn probabilistic pseudotime in reduced space
Gives us posterior uncertainty to propagate through to functional analyses
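A minimal sketch of the quantity a GPLVM works with: the log-likelihood of the reduced-dimension data given a candidate pseudotime vector, with an independent zero-mean GP for each reduced dimension. The kernel, its parameters and the function name are assumptions. A Bayesian treatment would combine this with priors and sample from the posterior over t rather than evaluate it at a single value.

```python
import numpy as np

def gplvm_log_likelihood(X, t, lengthscale=0.2, signal_var=1.0, noise_var=0.1):
    """log p(X | t) for data X (cells x reduced dimensions).

    Assumptions: squared exponential kernel, Gaussian noise, and columns of X
    that are modelled as independent zero-mean GPs over pseudotime t.
    """
    n, d = X.shape
    diff = t[:, None] - t[None, :]
    K = signal_var * np.exp(-0.5 * (diff / lengthscale) ** 2)
    K = K + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)                    # K = L L^T
    alpha = np.linalg.solve(L, X)                # L^{-1} X, so sum(alpha^2) = sum_k x_k^T K^{-1} x_k
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log |K|
    return (-0.5 * np.sum(alpha ** 2)
            - 0.5 * d * log_det
            - 0.5 * n * d * np.log(2.0 * np.pi))

# Made-up example: a smooth 2-D trajectory scores higher under the true
# ordering than under a shuffled one.
rng = np.random.default_rng(0)
t_true = np.linspace(0, 1, 30)
X = np.column_stack([np.sin(2 * np.pi * t_true), np.cos(2 * np.pi * t_true)])
ll_true = gplvm_log_likelihood(X, t_true)
ll_shuffled = gplvm_log_likelihood(X, rng.permutation(t_true))
```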
Prior issues
Bayesian inference requires us to define a prior distribution on our parameters
How do we want our pseudotime to look?
• Pseudotime artificial - equivalent on any interval
• Would ideally like to ‘fill out’ on [0,1]
• Identifiability issues
What’s the best strategy?
Repulsive prior - low probability when adjacent cells are close
Wang, Ye, and David B. Dunson. "Probabilistic Curve Learning: Coulomb Repulsion and the Electrostatic Gaussian Process." arXiv preprint arXiv:1506.03768 (2015).
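One way such a repulsive prior can be written down is sketched below, assuming a sine-based pairwise repulsion of the kind used in Coulomb repulsive processes. The exact functional form and the repulsion strength r are assumptions, since the slides give no formula. Note that this form also treats pseudotimes near 0 and near 1 as close to each other.

```python
import numpy as np

def repulsive_log_prior(t, r=1.0):
    """Unnormalised log prior: sum over cell pairs of 2r * log sin(pi * |t_i - t_j|).

    Assumptions: pseudotimes t lie in [0, 1]; the sine form and the parameter r
    are illustrative. The value tends to -inf as any two pseudotimes coincide.
    """
    diff = np.abs(t[:, None] - t[None, :])
    i, j = np.triu_indices(len(t), k=1)     # each pair counted once
    return 2.0 * r * np.sum(np.log(np.sin(np.pi * diff[i, j])))

spread = np.linspace(0.05, 0.95, 10)               # cells filling out the interval
clumped = 0.5 + np.linspace(-0.01, 0.01, 10)       # cells bunched together
log_p_spread = repulsive_log_prior(spread)
log_p_clumped = repulsive_log_prior(clumped)
```

Spread-out pseudotimes receive a much higher prior density than clumped ones, which is what encourages the cells to fill out the interval.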
Applications to single-cell RNA-seq datasets
Trapnell et al. 2014 (Monocle) - differentiating myoblasts time series data
Shin et al. 2015 (Waterfall) - adult hippocampal neurogenesis
Burns et al. 2015 (Ear) - sensory epithelia in the inner ear
Low dimensional representations
Monocle: Laplacian eigenmaps representation
Ear: Laplacian eigenmaps representation
Waterfall: PCA representation
Uncertainty in posterior mean curve (trajectory)
Diffuseness of predictive data distribution
Posterior uncertainty in pseudotime
Four cells drawn from Monocle dataset (155 cells in total)
95% credible interval typically covers about 1/4 of the pseudotime range
Tell whether a cell is at the start, middle or end of a process
“This cell has a pseudotime of 0.12 and this one 0.14” doesn’t make sense
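The point about 0.12 versus 0.14 can be made concrete with posterior samples. The sketch below uses made-up Gaussian samples (not output from any of the datasets) for three hypothetical cells, computes 95% credible intervals, and estimates the posterior probability that one cell precedes another.

```python
import numpy as np

def credible_interval(samples, level=0.95):
    """Equal-tailed credible interval per cell from samples (draws x cells)."""
    tail = (1.0 - level) / 2.0
    return np.quantile(samples, tail, axis=0), np.quantile(samples, 1.0 - tail, axis=0)

def prob_precedes(samples, i, j):
    """Posterior probability that cell i has a smaller pseudotime than cell j."""
    return float(np.mean(samples[:, i] < samples[:, j]))

# Synthetic posterior draws for illustration only: two cells with nearby
# posterior means and one cell late in the process. The spread of 0.06 is an
# assumption chosen so that each 95% interval spans roughly a quarter of [0, 1].
rng = np.random.default_rng(1)
means = np.array([0.12, 0.14, 0.90])
samples = np.clip(rng.normal(means, 0.06, size=(4000, 3)), 0.0, 1.0)

lower, upper = credible_interval(samples)
p_01 = prob_precedes(samples, 0, 1)   # the two nearby cells: order is uncertain
p_02 = prob_precedes(samples, 0, 2)   # early vs late cell: order is clear
```

With intervals this wide, the order of the two nearby cells is only slightly better than a coin flip, while the early cell and the late cell are clearly separated.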
Posterior uncertainty in pseudotime
Approximating the false discovery rate
Inference gives us samples from the pseudotime posterior
Refit differential expression model for each gene for each sample
Compute p and q values for each sample for each gene
Compute proportion significant for each gene across all samples
Compare to point estimate: false positive if q < 0.05 but proportion significant < 0.95
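The comparison above can be sketched as follows, assuming the differential expression q-values have already been computed for the point estimate and for each posterior sample. Reporting the approximate FDR as the fraction of point-estimate discoveries flagged as false positives is an assumption about the definition; the function name and the toy q-values are made up.

```python
import numpy as np

def approximate_fdr(q_point, q_samples, q_threshold=0.05, min_proportion=0.95):
    """Approximate FDR of point-estimate discoveries.

    q_point:   q-value per gene from the point-estimate pseudotime.
    q_samples: q-values with shape (posterior samples, genes).
    A gene is a false positive if it is significant for the point estimate
    but significant in fewer than min_proportion of the posterior samples.
    """
    significant_point = q_point < q_threshold
    proportion_significant = np.mean(q_samples < q_threshold, axis=0)
    false_positive = significant_point & (proportion_significant < min_proportion)
    n_discoveries = int(significant_point.sum())
    afdr = false_positive.sum() / n_discoveries if n_discoveries else float("nan")
    return afdr, proportion_significant

# Toy q-values for four genes and four posterior samples (illustration only).
q_point = np.array([0.01, 0.01, 0.50, 0.02])
q_samples = np.array([
    [0.01, 0.01, 0.50, 0.02],
    [0.01, 0.20, 0.50, 0.02],
    [0.01, 0.30, 0.50, 0.02],
    [0.01, 0.01, 0.50, 0.02],
])
afdr, proportion_significant = approximate_fdr(q_point, q_samples)
```

In the toy data the second gene is called significant from the point estimate but is significant in only half of the posterior samples, so it counts as a false positive among the three discoveries.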
Approximate false discovery rates
AFDR varies from 4% to 16%
Variable between datasets
Up to around 3x the nominal 5%, so if you need robust differential expression use a probabilistic approach
Examining genes in pathways still valid
Effect of smoothing parameters
Covariance matrix for each dimension
Corresponds to arc-length
Set a hierarchical prior on λ to penalise longer curves
Need some prior expectation of how the pseudotime will look with respect to marker genes
Small levels of shrinkage lead to unstable fits (lumpy posteriors)
Any unsupervised learning of pseudotimes in single-cell genomics requires these smoothness considerations
Initial dimensionality reduction step
We lose some uncertainty in the initial dimensionality reduction step
But…
• Posterior already highly multi-modal using two (optimised) reduced dimensions
• Informative to visualise and understand representations with respect to clusters and marker genes
• Most methods involve some dimensionality reduction first - important to understand uncertainty
One solution: use (Bayesian) Hierarchical GPLVM
[Figure labels: Dimension D, 3, 2, 1]
Multiple representation learning
Likelihood conditionally independent across latent dimensions
Naturally extend to integrate different reduced dimension representations (multiview learning)
Framework for integrating heterogeneous data sources
Multiple representation learning (II)
Pseudotimes fit individually to each representation
Pseudotimes fit jointly for all representations
Take home messages
1. Don’t think of pseudotimes as point estimates
2. Use pseudotime as a rough guide for where a cell is through a biological process
3. Your FDR is probably higher than you think it is
4. If you need robust differential expression, use probabilistic methods
5. All pseudotime methods come with prior expectations about structure and smoothness
6. Don’t get caught up with a particular dimensionality reduction algorithm - they all work
Scater: github.com/davismcc/scater
Acknowledgements
Chris Yau
Caleb Webber
Chris Ponting & groups
Michalis Titsias
@kieranrcampbell kieranrcampbell.github.io