nips machine learning in computational biology presentation

Order under uncertainty: probabilistic approaches to pseudotime

NIPS Machine Learning in Computational Biology

Kieran Campbell, University of Oxford

[email protected]

Outline

Introduction

A probabilistic model for pseudotime

Applications

Discussion

Pseudotime: artificial measure of a cell’s progression through some process

Figure: unordered expression profiles (Gene A, Gene B) become ordered profiles once cells are ordered along pseudotime

Cell ordering problem: assign each cell a pseudotime based on expression profile

• Genes differentially expressed across pseudotime

• Clusters of co-expression

Current method: monocle

Proliferating cell

Differentiating myoblast

Interstitial mesenchymal cell

Trapnell, Cole, et al. "The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells." Nature biotechnology (2014).

Independent component analysis

Minimum spanning tree

Ordering
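The last two steps above (minimum spanning tree, then ordering) can be sketched as follows. This is a simplified illustration and not Monocle's implementation: the ICA step is omitted, the input is assumed to be an already reduced representation, and pseudotime is taken as the tree distance from one end of the longest path, rescaled to [0,1]. All function names are hypothetical.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns adjacency lists."""
    n = len(points)
    adj = {i: [] for i in range(n)}
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            adj[u].append((parent[u], best[u]))
            adj[parent[u]].append((u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return adj

def tree_distances(adj, source):
    """Path length from source to every node along the tree."""
    dist = {source: 0.0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + w
                stack.append(v)
    return dist

def mst_pseudotime(points):
    """Point-estimate pseudotime in [0, 1] for each cell (simplified)."""
    adj = minimum_spanning_tree(points)
    from_first = tree_distances(adj, 0)
    start = max(from_first, key=from_first.get)   # one end of the longest path
    dist = tree_distances(adj, start)
    longest = max(dist.values()) or 1.0           # guard against a single cell
    return [dist[i] / longest for i in range(len(points))]
```

Which end of the path counts as the start is arbitrary here; in practice the direction would be fixed using marker genes. Note that the output is a single number per cell, with no measure of uncertainty.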

What about uncertainty?

All current methods give point estimates of pseudotime

Easy to say whether one cell precedes another

Could have large impact on downstream analyses

Gaussian Process Latent Variable Models

Gaussian processes - nonparametric prior on functions

GP latent variable models treat the input x as unknown (here, the pseudotime t)

Behaviour defined entirely by covariance matrix between different t

Rasmussen & Williams 2006

Lawrence 2004
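To make "behaviour defined entirely by the covariance matrix" concrete, here is a minimal sketch that builds a covariance matrix over pseudotimes and draws one function from the GP prior. The squared exponential kernel exp(-lam * (t - t')^2) plus a diagonal noise term is an assumption for illustration; the slides do not state the kernel, and the function names are hypothetical.

```python
import math
import random

def covariance_matrix(ts, lam=1.0, noise_var=1e-4):
    """K[i][j] = exp(-lam * (t_i - t_j)^2), plus noise_var on the diagonal."""
    n = len(ts)
    return [[math.exp(-lam * (ts[i] - ts[j]) ** 2) + (noise_var if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def cholesky(K):
    """Lower-triangular L with K = L L^T (K must be positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L

def sample_gp_prior(ts, lam=1.0, noise_var=1e-4, seed=0):
    """One draw f ~ N(0, K) at the pseudotimes ts, computed as f = L z."""
    rng = random.Random(seed)
    L = cholesky(covariance_matrix(ts, lam, noise_var))
    z = [rng.gauss(0.0, 1.0) for _ in ts]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(ts))]
```

Cells that are close in pseudotime get highly correlated function values; a larger lam makes the correlation decay faster and the sampled curves more wiggly.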

Probabilistic approaches to pseudotime

Want to learn pseudotime from reduced dimension representation

Bayesian GPLVM to learn probabilistic pseudotime in reduced space

Gives us posterior uncertainty to propagate through to functional analyses

Prior issues

Bayesian inference requires us to define a prior distribution on our parameters

How do we want our pseudotime to look?

• Pseudotime artificial - equivalent on any interval

• Would ideally like to ‘fill out’ on [0,1]

• Identifiability issues

What’s the best strategy?

Repulsive prior - low probability when adjacent cells are close

Wang, Ye, and David B. Dunson. "Probabilistic Curve Learning: Coulomb Repulsion and the Electrostatic Gaussian Process." arXiv preprint arXiv:1506.03768 (2015).
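A minimal sketch of a repulsive prior on pseudotimes in [0,1]. The pairwise sin^(2r) form is in the spirit of the Coulomb repulsion idea cited above, but the exact functional form, the parameter name r and the function name are assumptions made for illustration.

```python
import math

def repulsive_log_prior(pseudotimes, r=1.0):
    """Unnormalised log density: sum over pairs of 2r * log|sin(pi * (t_i - t_j))|.

    The density tends to zero as any two pseudotimes approach each other,
    so configurations that spread out over [0, 1] are favoured.
    """
    total = 0.0
    n = len(pseudotimes)
    for i in range(n):
        for j in range(i + 1, n):
            s = abs(math.sin(math.pi * (pseudotimes[i] - pseudotimes[j])))
            if s == 0.0:
                return -math.inf   # coincident pseudotimes have zero density
            total += 2.0 * r * math.log(s)
    return total
```

Because sin(pi * d) also vanishes at d = 1, this form treats the endpoints 0 and 1 as neighbours; r controls how strongly cells repel each other.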

Applications to single-cell RNA-seq datasets

Trapnell et al. 2014 (Monocle) - differentiating myoblasts time series data

Shin et al. 2015 (Waterfall) - adult hippocampal neurogenesis

Burns et al. 2015 (Ear) - sensory epithelia in the inner ear

Low dimensional representations

Monocle: Laplacian eigenmaps representation

Ear: Laplacian eigenmaps representation

Waterfall: PCA representation

Uncertainty in posterior mean curve (trajectory)

Diffuseness of predictive data distribution

Posterior uncertainty in pseudotime

Four cells drawn from Monocle dataset (155 cells in total)

95% credible interval typically covers ~1/4 of the pseudotime range

Can tell whether a cell is at the start, middle or end of a process

“This cell has a pseudotime of 0.12 and this one 0.14” doesn’t make sense
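Given posterior samples of pseudotime, both statements above can be checked directly. The sketch below assumes the samples are available as plain lists, with one joint draw per index; the function names are hypothetical.

```python
import math

def credible_interval(samples, level=0.95):
    """Central credible interval from posterior samples (empirical quantiles)."""
    s = sorted(samples)
    n = len(s)
    lo_idx = math.floor((1.0 - level) / 2.0 * (n - 1))
    hi_idx = math.ceil((1.0 + level) / 2.0 * (n - 1))
    return s[lo_idx], s[hi_idx]

def prob_precedes(samples_a, samples_b):
    """Fraction of joint posterior draws in which cell a comes before cell b."""
    pairs = list(zip(samples_a, samples_b))
    return sum(a < b for a, b in pairs) / len(pairs)
```

If two cells have posterior means of 0.12 and 0.14 but intervals a quarter of the range wide, prob_precedes will sit close to 0.5, which is why comparing the two point estimates is not meaningful.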

Posterior uncertainty in pseudotime

Approximating the false discovery rate

Inference gives us samples from the pseudotime posterior

Refit differential expression model for each gene for each sample

Compute p and q values for each sample for each gene

Compute proportion significant for each gene across all samples

Compare to point estimate: false positive if q < 0.05 but proportion significant < 0.95
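The steps above can be sketched as follows, assuming the per-gene p-values have already been computed for the point estimate and for each posterior sample. Benjamini-Hochberg adjustment stands in for the q-value computation, and dividing false positives by the number of point-estimate discoveries is an assumed definition of the AFDR; function and argument names are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (used here in place of q-values)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

def approximate_fdr(point_pvalues, sample_pvalues, q_cut=0.05, prop_cut=0.95):
    """point_pvalues: per-gene p-values using the point-estimate pseudotime.
    sample_pvalues: one list of per-gene p-values per posterior sample."""
    m = len(point_pvalues)
    point_sig = [q < q_cut for q in bh_qvalues(point_pvalues)]
    counts = [0] * m
    for pvals in sample_pvalues:
        for g, q in enumerate(bh_qvalues(pvals)):
            counts[g] += q < q_cut
    prop_sig = [c / len(sample_pvalues) for c in counts]
    discoveries = sum(point_sig)
    # False positive: significant for the point estimate, but significant
    # in fewer than prop_cut of the posterior samples.
    false_pos = sum(1 for g in range(m) if point_sig[g] and prop_sig[g] < prop_cut)
    return false_pos / discoveries if discoveries else 0.0
```

The expensive part in practice is refitting the differential expression model for every gene under every posterior sample; this sketch only covers the bookkeeping afterwards.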

Approximate false discovery rates

AFDR varies from 4% to 16%

Variable between datasets

Up to around 3x expected, so if you need robust differential expression use a probabilistic approach

Examining genes in pathways still valid

Effect of smoothing parameters

Covariance matrix for each dimension

Corresponds to arc-length

Set a hierarchical prior on λ to penalise longer curves

Need some prior expectation of how the pseudotime will look with respect to marker genes

Small levels of shrinkage lead to unstable fits (lumpy posteriors)

Any unsupervised learning of pseudotimes in single-cell genomics requires these smoothness considerations

Initial dimensionality reduction step

We lose some uncertainty in the initial dimensionality reduction step

But…

• Posterior already highly multi-modal using two (optimised) reduced dimensions

• Informative to visualise and understand representations with respect to clusters and marker genes

• Most methods involve some dimensionality reduction first - important to understand uncertainty

One solution: use (Bayesian) Hierarchical GPLVM

Figure: dimensions D, 3, 2, 1

Multiple representation learning

Likelihood conditionally independent across latent dimensions

Naturally extend to integrate different reduced dimension representations (multiview learning)

Framework for integrating heterogeneous data sources
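Conditional independence means the joint log-likelihood is a sum of one GP term per latent dimension, so extra representations simply contribute extra terms that share the same pseudotimes. A minimal sketch, assuming a zero-mean GP with a squared exponential kernel and a single lam and noise_var shared by all dimensions (the slides do not specify these, and the function names are hypothetical):

```python
import math

def gp_log_likelihood(xs, ts, lam=1.0, noise_var=0.1):
    """log N(xs | 0, K(ts)) for one latent dimension, via Cholesky K = L L^T."""
    n = len(ts)
    K = [[math.exp(-lam * (ts[i] - ts[j]) ** 2) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(K[i][i] - s) if i == j else (K[i][j] - s) / L[j][j]
    # Forward substitution: solve L y = xs, so that xs^T K^-1 xs = y^T y.
    y = []
    for i in range(n):
        y.append((xs[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    quad = sum(v * v for v in y)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (quad + log_det + n * math.log(2.0 * math.pi))

def joint_log_likelihood(representations, ts, lam=1.0, noise_var=0.1):
    """representations: list of views; each view is a list of latent dimensions;
    each dimension is a list of per-cell coordinates. All share pseudotimes ts."""
    return sum(gp_log_likelihood(dim, ts, lam, noise_var)
               for view in representations for dim in view)
```

Fitting pseudotimes jointly then amounts to running inference on ts against joint_log_likelihood rather than against each view separately.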


Multiple representation learning (II)

Pseudotimes fit individually to each representation

Pseudotimes fit jointly for all representations

Take home messages

1. Don’t think of pseudotimes as point estimates

2. Use pseudotime as a rough guide for where a cell is through a biological process

3. Your FDR is probably higher than you think it is

4. If you need robust differential expression, use probabilistic methods

5. All pseudotime methods come with prior expectations about structure and smoothness

6. Don’t get caught up with a particular dimensionality reduction algorithm - they all work

Scater

http://github.com/davismcc/scater

Acknowledgements

Chris Yau

Caleb Webber

Chris Ponting & groups

Michalis Titsias

[email protected] @kieranrcampbell kieranrcampbell.github.io
