modelling gene expression dynamics with gaussian processes · di erential equations: pol-ii to mrna...
TRANSCRIPT
Modelling gene expression dynamicswith Gaussian processes
Regulatory Genomics and EpigenomicsMarch 10th 2016
Magnus RattrayFaculty of Life Sciences
University of Manchester
Talk Outline
Introduction to Gaussian process regression
Example 1. Hierarchical models: batches and clusters
Example 2. Branching models: perturbations and bifurcations
Example 3. Differential equations: Pol-II to mRNA dynamics
Gaussian processes: flexible non-parametric models
Probability distributions over functions
f (t) ∼ GP(mean(t), cov(t, t ′)
)Covariance function k = cov(t, t ′) defines typical properties,
I Static . . . Dynamic
I Smooth . . . Rough
I Stationary. . . non-Stationary
I Periodic. . . Chaotic
The covariance function has parameters tuning these properties
Bayesian Machine Learning perspective: Rasmussen & Williams“Gaussian Processes for Machine Learning” (MIT Press, 2006)
Gaussian processes
Samples from a 25-dimensional multivariate Gaussian distribution:
n
fn
5 10 15 20 25
−1
−0.5
0.5
1
1.5
2
n
m
5 10 15 20 25
5
10
15
20
25
0.2
0.4
0.6
0.8
[f1, f2, . . . , f25] ∼ N (0,C )
Learning and Inference in Computational Systems Biology, MIT Press
Gaussian processes
Take dimension →∞
−3 −2 −1 1 2 3
−2.5
−2
−1.5
−1
−0.5
0.5
1
1.5
2
2.5
−3 −2 −1 1 2 3
−3
−2
−1
1
2
3
f ∼ GP(0, k) k(t, t ′)
= exp
(−(t − t ′)2
l2
)
Learning and Inference in Computational Systems Biology, MIT Press
Gaussian processes for inference: Bayesian Regression
2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0
2
1
0
1
2
Figures from James Hensman
Regression example
2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0
2
1
0
1
2
Figures from James Hensman
Regression example
2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0
2
1
0
1
2
Figures from James Hensman
Regression example
2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0
2
1
0
1
2
Figures from James Hensman
Ex1. Hierarchical models: batches and clusters
0 5 10 15 2021012
0 5 10 15 20 0 5 10 15 20 0 5 10 15 20
0 5 10 15 2021012
0 5 10 15 20 0 5 10 15 20 0 5 10 15 20
Data from Kalinka et al. ”Gene expression divergence recapitulatesthe developmental hourglass model” Nature 2010
Joint work with James Hensman and Neil Lawrence
Usual processing options for time course batches
Lumped
0 5 10 15 20
2
1
0
1
2
Averaged
0 5 10 15 20
2
1
0
1
2
Hierarchical Gaussian process
gene:
g(t) ∼ GP(0, kg (t, t ′)
)replicate:
fi (t) ∼ GP(g(t), kf (t, t ′)
)
Hierarchical Gaussian process
3
2
1
0
1
2
J. Hensman, N.D. Lawrence, M.Rattray ”Hierarchical Bayesianmodelling of gene expression time series across irregularly sampledreplicates and clusters” BMC Bioinformatics 2013
Hierarchical Gaussian process for clustering
An extended hierarchy
h(t) ∼ GP(
0, kh(t, t ′))
cluster
gi (t) ∼ GP(h(t), kg (t, t ′)
)gene
fir (t) ∼ GP(gi (t), kf (t, t ′)
)replicate
Hierarchical Gaussian process for clustering
2
1
0
1
2
CG11786
2
1
0
1
2
CG12723
2
1
0
1
2
CG13196
2
1
0
1
2
CG13627
2
1
0
1
2
CG4386
2
1
0
1
2
Osi15
0 5 10 15 202
1
0
1
2
0 5 10 15 20 0 5 10 15 20 0 5 10 15 20 0 5 10 15 20 0 5 10 15 20 0 5 10 15 20 0 5 10 15 20 0 5 10 15 20
Time (hours)
Norm
alis
ed
gen
e e
xp
ress
ion
Hierarchical Gaussian process for clustering
Modifying an existing algorithm to include this model of replicateand cluster structure leads to more meaningful clustering
MF BP CC L N. clust.
agglomerative HGP 0.46 0.16 0.50 7360.8 50agglomerative GP 0.39 0.13 0.36 6203.7 128Mclust (concat.) 0.39 0.07 0.25 1324.0 26
Mclust (averaged) 0.40 0.08 0.24 -736.2 20
Variational Bayes algorithm is more efficient, allowing Bayesianclustering of >10K profiles with a Dirichlet Process prior
J. Hensman, M.Rattray, N.D. Lawrence “Fast non-parametricclustering of time-series data” IEEE TPAMI 2015
Clustering with a periodic covariance
Ex2. Branching models: perturbations and bifurcations
0 5 10 15Time (hrs)
−4
−3
−2
−1
0
1
2
3
4
5
Sim
ulated gene expression
Simulated control data
Simulated perturbed data
Joint work with Jing Yang, Chris Penfold and Murray Grant
Samples from a branching model
Sample 1
Samples from a branching model
Sample 2
Samples from a branching model
Sample 3
Samples from a branching model
Sample 4
Samples from a branching model
Sample 5
Joint distribution to two functions crossing at tp
f ∼ GP(0,K ) , g ∼ GP(0,K ) , g(tp) = f (tp)
Σ =
(Kff Kfg
Kgf Kgg
)=
(K (T,T)
K(T,tp)K(tp ,T)k(tp ,tp)
K(T,tp)K(tp ,T)k(tp ,tp)
K (T,T)
)(1)
Joint distribution of two datasets diverging at tp
y c(tn) ∼ N (f (tn), σ2)
yp(tn) ∼ N (f (tn), σ2) for tn ≤ tp
yp(tn) ∼ N (g(tn), σ2) for tn > tp
Posterior probability of the perturbation time tp
p(tp|y c(T), yp(T)) ' p(y c(T), yp(T)|tp)∑t=tmaxt=tmin
p(y c (T), yp(T)|t)
0 5 10 15Time (hrs)
−4
−3
−2
−1
0
1
2
3
4
5
Sim
ulated gene expression
Simulated control data
Simulated perturbed data
Comparison to DE thresholding approach
Alternative: use first point where some DE threshold is passedWe tested several DE metrics in the nsgp package
0 5 10 15 20
0
5
10
15
20
Esti
mate
d p
ert
urb
ati
on
tim
es
DEtime
Mean
Median
MAP
0 5 10 15 20
nsgp
EMLL0.5
EMLL1
Testing perturbation times
Investigating a plant’s response to bacterial challenge
Infection with virulent Pseudomonas syringage pv. tomato DC3000vs. disarmed strain DC3000hrpA
0 10 17
Times (hrs)
8
9
10
11
12
13
Gen
e e
xp
ressio
n
CATMA1c72124
Condition 1
Condition 2
0 10 17
Times (hrs)
CATMA1a09335
Condition 1
Condition 2
Yang et al. “Inferring the perturbation time from biological timecourse data” Bioinformatics (accepted) preprint: arxiv 1602.01743
Current work: modelling branching in single-cell data
12
48
1632
64start
5
10
15
20
25
30
35
40
1
2
3
4
5
6
7
with A. Boukouvalas, M. Zweissele, J. Hensman and N. Lawrence
Ex 3. Linking Pol-II activity to mRNA profiles
040
80160
320640
1280
t (min)
mRNA
a
b
c
d
e
f
040
80160
320640
1280
t (min)
Pol−II
0
2
4
6
Pol II
ENSG00000163659 (TIPARP)
0 40 80 160 320 64012800
50
100
mR
NA
(F
PK
M)
t (min)0 50 100
0
0.2
0.4
0.6
0.8
1
Delay (min)
Joint work with Antti Honkela,Jaakko Peltonen, Neil Lawrence
Linking Pol-II activity to mRNA profiles
dm(t)
dt= βp(t −∆)− αm(t)
I m(t) is mRNA concentration (RNA-Seq data)
I p(t) is mRNA production rate (3’ pol-II ChIP-Seq data)
I α is degradation rate (mRNA half-life t1/2 = 2/α)
I ∆ is processing delay
We model p(t) ∼ GP(0, kp) as a Gaussian process (GP)
Likelihood can be worked out exactly
Bayesian MCMC used to estimate parameters α, β, ∆ and GPcovariance (2) and noise variance (1) parameters
Linking Pol-II activity to mRNA profiles
dm(t)
dt= βp(t −∆)− αm(t)
I m(t) is mRNA concentration (RNA-Seq data)
I p(t) is mRNA production rate (3’ pol-II ChIP-Seq data)
I α is degradation rate (mRNA half-life t1/2 = 2/α)
I ∆ is processing delay
We model p(t) ∼ GP(0, kp) as a Gaussian process (GP)
Likelihood can be worked out exactly
Bayesian MCMC used to estimate parameters α, β, ∆ and GPcovariance (2) and noise variance (1) parameters
Example fits
040
80160
320640
1280
t (min)
mRNA
a
b
c
d
e
f
040
80160
320640
1280
t (min)
Pol−II a: Early pol-II, no production delay
0
2
4
Pol II
ENSG00000166483 (WEE1)
0 40 80 160 320 64012800
20
40
mR
NA
(F
PK
M)
t (min)0 50 100
0
0.2
0.4
0.6
0.8
1
Delay (min)
Example fits
040
80160
320640
1280
t (min)
mRNA
a
b
c
d
e
f
040
80160
320640
1280
t (min)
Pol−II b: Early pol-II, delayed production
0
1
2
3
Pol II
ENSG00000019549 (SNAI2)
0 40 80 160 320 64012800
2
4
6
8
mR
NA
(F
PK
M)
t (min)0 50 100
0
0.2
0.4
0.6
0.8
1
Delay (min)
Example fits
040
80160
320640
1280
t (min)
mRNA
a
b
c
d
e
f
040
80160
320640
1280
t (min)
Pol−II c: Later pol-II, delayed production
0
2
4
6
Pol II
ENSG00000163659 (TIPARP)
0 40 80 160 320 64012800
50
100
mR
NA
(F
PK
M)
t (min)0 50 100
0
0.2
0.4
0.6
0.8
1
Delay (min)
Example fits
040
80160
320640
1280
t (min)
mRNA
a
b
c
d
e
f
040
80160
320640
1280
t (min)
Pol−II d: Late pol-II, no delay
0
0.5
1
Pol II
ENSG00000162949 (CAPN13)
0 40 80 160 320 64012800
5
10
15
mR
NA
(F
PK
M)
t (min)0 50 100
0
0.2
0.4
0.6
0.8
1
Delay (min)
Example fits
040
80160
320640
1280
t (min)
mRNA
a
b
c
d
e
f
040
80160
320640
1280
t (min)
Pol−II e: Late pol-II, no delay
0
0.5
1
Pol II
ENSG00000117298 (ECE1)
0 40 80 160 320 64012800
20
40
mR
NA
(F
PK
M)
t (min)0 50 100
0
0.2
0.4
0.6
0.8
1
Delay (min)
Example fits
040
80160
320640
1280
t (min)
mRNA
a
b
c
d
e
f
040
80160
320640
1280
t (min)
Pol−II f: Late pol-II, delayed production
0
1
2
3
Pol II
ENSG00000181026 (AEN)
0 40 80 160 320 64012800
10
20
30
mR
NA
(F
PK
M)
t (min)0 50 100
0
0.2
0.4
0.6
0.8
1
Delay (min)
Large processing delays observed in 11% of genes
Delay (min)
# of
gen
es
0 40 80 >120
050
100
1500
1550
Delay linked with gene length and intron structure
0 20 40 60 80t (min)
0.00
0.05
0.10
0.15
0.20
0.25
0.30
Fra
ctio
n of
gen
es w
ith ∆
>t
02
46
810
12−
log 1
0(p−
valu
e)
m<10000(N=282)m>10000(N=1504)
p−value
0 20 40 60 80t (min)
0.00
0.05
0.10
0.15
0.20
0.25
Fra
ctio
n of
gen
es w
ith ∆
>t
02
46
810
−lo
g 10(p
−va
lue)
f<0.20(N=1443)f>0.20(N=343)
p−value
∆: delay m: gene length f: final intron length / gene length
Delayed genes show evidence of late-intron retention
DLX
3
0246
10 m
in
0
5
10
20 m
in
0102030
40 m
in
05
101520
80 m
in
02468
10
160
min
02468
320
min
10 20 30 40 50 60t (min)
−0.
010.
000.
010.
020.
030.
04M
ean
pre−
mR
NA
end
acc
umul
atio
n in
dex
01
23
45
67
−lo
g 10(p
−va
lue)
∆ > t∆ < tp−value
Left: density of RNA-Seq reads uniquely mapping to the introns inthe DLX3 gene
Right: Differences in the mean pre-mRNA accumulation index inlong delay genes (blue) and short delay genes (red)
Comparison of processing and transcription times
RNA processing delay ∆ (min)
Tran
scrip
tiona
l del
ay (
min
)
1 3 10 30 100
13
1030
100
Transcription time = length/velocityestimate from Danko Mol Cell 2013
Transcription time > processing delayin 87% of genes
Honkela et al. “Genome-wide modeling of transcription kineticsreveals patterns of RNA production delays” PNAS 2015
Extension - inferring time-varying degradation rates
dm(t)
dt= βp(t)− α(t)m(t)
0 2 4 6 8 10Times
0.5
0.0
0.5
1.0
1.5
2.0
2.5
Degra
dati
on r
ate
D
0 2 4 6 8 10Times
5
0
5
10
15
20
25
30
35
Gene e
xpre
ssio
n
mRNA measurementspre-mRNA measurements
Conclusion
I Gaussian process models provide a good mix of flexibility andtractability in temporal and spatial modelling:
I Hierarchical modelling, e.g. replicates, clusters, speciesI Periodic models without strong sinuisoidal assumptionsI Tractable under branching and bifurcationsI Tractable under linear operations on functions
I Easy addition and multiplication of kernels
I Good approximate inference algorithms for non-Gaussian data
I Codebase is growing making methods increasingly flexible andcomputationally efficient (e.g. GPy, GPflow and some R)
Funding: BBSRC (EraSysBio+ SYNERGY), EU FP7 (RADIANT)
Collaborators: Neil Lawrence, James Hensman, Antti Honkela,Jaakko Peltonen, Jing Yang, Alexis Boukouvalas, Max Zweissele
Our existential crisis