Riemannian Manifolds and Statistical Models
TRANSCRIPT
CSML Lunch Time Talk, Friday 23rd November 2012
Ben Calderhead, Research Fellow
CoMPLEX, University College London
Riemannian Manifolds and Statistical Models
The use of geometry in Markov chain Monte Carlo
“Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods”, Mark Girolami and Ben Calderhead, Journal of the Royal Statistical Society: Series B (with discussion)
www.ucl.ac.uk/statistics/research/csi
[Images: Bernhard Riemann, a manifold, William Hamilton, Paul Langevin, a casino...]
Differential Geometric MCMC Methods and Applications, Ben Calderhead, PhD Thesis, University of Glasgow (2011)
(google “ben calderhead thesis”)
Many statistical models have a natural geometric structure that is Riemannian in nature.
Can we use this geometric information to design better Markov chain Monte Carlo (MCMC) algorithms?
• Ben Calderhead, Differential Geometric MCMC Methods and Applications, PhD Thesis, University of Glasgow (2011)
• Ben Calderhead & Mark Girolami, Statistical Analysis of Nonlinear Dynamical Systems using Differential Geometric Sampling Methods, Journal of the Royal Society Interface Focus (2011)
• Mark Girolami & Ben Calderhead, Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods, Journal of the Royal Statistical Society: Series B (with discussion), (2011) Vol. 73(2), 123-214
• Ben Calderhead & Mark Girolami, Estimating Bayes Factors via Thermodynamic Integration and Population MCMC, Computational Statistics and Data Analysis, (2009), Elsevier Press, Vol. 53, 4028-4045
SOME MOTIVATION
HIGHLY NONLINEAR MODELS
• Often sparse, uncertain data with unobserved species
• Often multiple network topologies consistent with the known biology
MODELLING QUESTIONS
• Which parameters should we use for a given model?
• Which model structure is most appropriate to describe the system of interest?
[Figure: Model 2 - signalling network with an interaction between BCR‐ABL and JAK2. Species: BCR‐ABL, pBCR‐ABL, JAK2, pJAK2, STAT5, pSTAT5; inhibitors TKI and JAKI (forming the TKI_BCR‐ABL and JAKI_JAK2 complexes); growth factor input, self-phosphorylation, dephosphorylation and nuclear export; rate constants k1-k16.]
[Figure: Model 1 - signalling network with no interaction between BCR‐ABL and JAK2. Same species, inhibitors and reactions as above, with rate constants k1-k12.]
Nonlinear dynamics, correlation structure, identifiability... all create problems for standard MCMC.
BAYESIAN APPROACH
Posterior distribution characterises the uncertainty in the parameters
... and rational framework for making sense of the world around us, letting us explicitly state our assumptions and update our current knowledge in light of newly acquired data.

Probability theory has been around since the 18th century (16, 51) as a means of making inferences in light of incomplete information. The axiomatic formulation of probability theory by Kolmogorov (110), together with a derivation by Cox (39) from a set of postulates that satisfy the desirable properties we would wish to have in a system of reasoning, have made Bayesian methods arguably the preferred method for inductive inference. Recent contributions by Knuth and Skilling (180) add further support for the use of Bayesian probability; based on symmetry assumptions, they show that one is led to the probability calculus as the only logical and consistent calculus for reasoning under uncertainty.

Bayes' theorem is simply an expression based on conditional probability: it states the conditional probability of an event A given an event B in terms of the probability of A and the probability of B given A. In the context of a statistical model, the posterior distribution of the model parameters, $\theta = [\theta_1, \ldots, \theta_D]^T$, given the data, $y = [y_1, \ldots, y_N]^T$, is proportional to the prior distribution of the parameters multiplied by the likelihood of the data given the parameters,

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta)\, p(\theta)\, d\theta} \qquad (1.3)$$

Here the marginal likelihood in the denominator normalises the posterior density, such that it integrates to one and is a correctly defined probability distribution.

1.4 Monte Carlo Methods

For the purpose of making predictions, we often want to calculate expectations of a function with respect to the posterior distribution,

$$\mu_f = \mathbb{E}_{p(\theta \mid y)}\big[f(\theta)\big] = \int f(\theta)\, p(\theta \mid y)\, d\theta \qquad (1.4)$$

Since calculating an expectation is essentially just the same task as evaluating an integral, we could use quadrature methods and other numerical integration schemes.
We can consider an MCMC approach the “gold standard” as we can calculate quantities to arbitrary precision, given sufficient samples
The challenge: How do we do this most efficiently?
ODES AS STATISTICAL MODELS
We can define the log-likelihood as
Derivatives of the LL require the sensitivities
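As a sketch, assuming observations $y_n$ of the ODE solution $x(t_n; \theta)$ at times $t_n$ with i.i.d. Gaussian noise of variance $\sigma^2$ (the error model used later for the circadian example), the log-likelihood and its derivatives take the form

$$\mathcal{L}(\theta) = \log p(y \mid \theta) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \big( y_n - x(t_n; \theta) \big)^2 - \frac{N}{2}\log(2\pi\sigma^2)$$

$$\frac{\partial \mathcal{L}}{\partial \theta_i} = \frac{1}{\sigma^2} \sum_{n=1}^{N} \big( y_n - x(t_n; \theta) \big)\, \frac{\partial x(t_n; \theta)}{\partial \theta_i}$$

so the gradient of the log-likelihood is built from the sensitivities $\partial x(t_n; \theta)/\partial \theta_i$.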
SOLVING ODES
General form for an ODE and sensitivities
Sensitivities of an ODE may be computed
Sensitivity information is important for exploring this parameter space
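In standard form, writing the ODE as $\frac{dx}{dt} = f(x, \theta, t)$ with states $x$ and parameters $\theta$, the first-order sensitivities $S_i = \partial x / \partial \theta_i$ satisfy the auxiliary linear ODE obtained by differentiating through the dynamics,

$$\frac{dS_i}{dt} = \frac{\partial f}{\partial x}\, S_i + \frac{\partial f}{\partial \theta_i},$$

so they can be computed by integrating this augmented system alongside the original states.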
POSTERIOR DISTRIBUTIONS
• Strong correlation structure
• Possible heavy tails
• High dimensional?
QUICK REMINDER OF THE MCMC BASICS
REMINDER OF MCMC BASICS
The rate of convergence is independent of dimensionality, given independently drawn samples.
We can use a Monte Carlo estimator:
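With $N$ samples $\theta^{(n)}$ drawn from the posterior, the estimator takes the usual form

$$\hat{\mu}_f = \frac{1}{N} \sum_{n=1}^{N} f\big(\theta^{(n)}\big) \approx \mathbb{E}_{p(\theta \mid y)}\big[f(\theta)\big],$$

whose Monte Carlo standard error shrinks as $O(N^{-1/2})$ regardless of the dimension of $\theta$ - the dimension-independence referred to above.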
REMINDER OF MCMC BASICS
We may obtain samples from an ergodic Markov process with the required stationary distribution.
We can construct such a Markov chain that converges if it is ergodic and leaves the target invariant,

$$\pi(\theta') = \int K(\theta' \mid \theta)\, \pi(\theta)\, d\theta,$$

where $K(\theta' \mid \theta)$ is the transition kernel and $\pi(\theta) = p(\theta \mid y)$ is the required stationary distribution.
REMINDER OF MCMC BASICS
Detailed balance is a convenient, sufficient condition,
and chains that satisfy this are reversible.
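For a target $\pi(\theta)$ and transition kernel $K(\theta' \mid \theta)$, detailed balance is the condition

$$\pi(\theta)\, K(\theta' \mid \theta) = \pi(\theta')\, K(\theta \mid \theta'),$$

and integrating both sides over $\theta$ shows that $\pi$ is then a stationary distribution of the chain.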
REMINDER OF MCMC BASICS
Convergence is guaranteed...
as the number of samples tends to infinity.
Unfortunately we don’t have an infinite amount of time!
WHAT CAN GO WRONG?
If the output of the model depends on a combination of two parameters, the conditional distributions may be much more constrained than the marginal distributions.
THE MAIN ISSUES
• Global convergence (is our target multimodal?)
• Local mixing (once we’ve found the mode)
• Computational cost (we want distant moves that are accepted with high probability)
MCMC FLAVOURS
• “Vanilla” Metropolis-Hastings
• Slice Sampling
• Reversible Jump MCMC
• Adaptive MCMC
• Particle MCMC
• Metropolis-adjusted Langevin Algorithm
• Hamiltonian Monte Carlo
• Bridge Sampling/Simulated Tempering
• Differential Geometric MCMC
Introducing auxiliary variables is a common trick for developing new MCMC algorithms.
• Slice Sampling
• Hamiltonian Monte Carlo
• Bridge Sampling/Simulated Tempering
• Riemannian Manifold HMC
Estimating local covariance structure to improve proposals.
• Adaptive MCMC
• Particle MCMC
MCMC proposals based on physically inspired dynamics.
• Metropolis-adjusted Langevin Algorithm
• Hamiltonian Monte Carlo
METROPOLIS-ADJUSTED LANGEVIN DIFFUSION
The following stochastic differential equation defines a Langevin diffusion:

$$d\theta(t) = \tfrac{1}{2} \nabla_\theta \mathcal{L}\big(\theta(t)\big)\, dt + db(t)$$

An Euler discretisation gives us the following proposal mechanism,

$$\theta^* = \theta + \tfrac{\epsilon^2}{2} \nabla_\theta \mathcal{L}(\theta) + \epsilon z,$$

where $z \sim \mathcal{N}(0, I)$ and $\epsilon$ is the integration stepsize. We propose $\theta^* \sim q(\theta^* \mid \theta) = \mathcal{N}\big(\theta^* \mid \mu(\theta, \epsilon), \epsilon^2 I\big)$, where $\mu(\theta, \epsilon) = \theta + \tfrac{\epsilon^2}{2} \nabla_\theta \mathcal{L}(\theta)$, and accept with probability $\min\left\{1, \frac{p(\theta^*)\, q(\theta \mid \theta^*)}{p(\theta)\, q(\theta^* \mid \theta)}\right\}$.
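A minimal sketch of this proposal-and-accept loop in Python, assuming user-supplied callables log_post and grad_log_post (hypothetical names) for the log target density and its gradient:

```python
import numpy as np

def mala(log_post, grad_log_post, theta0, eps=0.1, n_iter=5000, rng=None):
    """Metropolis-adjusted Langevin algorithm (sketch).

    log_post, grad_log_post : callables giving the log target density and its
        gradient at a parameter vector (assumed supplied by the user).
    eps : integration stepsize of the Euler discretisation.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_iter, theta.size))

    def log_q(x_to, x_from):
        # Log density (up to a constant) of the Gaussian proposal centred
        # at one Euler step along the Langevin drift from x_from.
        mean = x_from + 0.5 * eps**2 * grad_log_post(x_from)
        return -0.5 * np.sum((x_to - mean) ** 2) / eps**2

    for i in range(n_iter):
        # Propose: Langevin drift plus Gaussian noise of scale eps.
        prop = (theta + 0.5 * eps**2 * grad_log_post(theta)
                + eps * rng.standard_normal(theta.size))
        # Metropolis-Hastings correction for the discretisation error.
        log_alpha = (log_post(prop) + log_q(theta, prop)
                     - log_post(theta) - log_q(prop, theta))
        if np.log(rng.uniform()) < log_alpha:
            theta = prop
        samples[i] = theta
    return samples
```

Dropping the drift term recovers a random-walk Metropolis proposal; the gradient is what steers proposals towards higher-density regions.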
DIFFERENTIAL GEOMETRY IN MCMC
1st order geometry can be useful... but sometimes misleading.
Instead of estimating local covariance structure, why not calculate it directly?
SOME INTUITION
Actual distances depend not only on location, but also on the geometry at that point.
• Expected Fisher Information defines a metric (symmetric, bilinear, positive-definite), such that the parameter space can be represented as a Riemannian manifold (Rao, 1945)
• Defines a local basis for the tangent space at each point and a distance measure at each point on the manifold - this links with the sensitivity of the statistical model
3.2 An Introduction to Riemannian Geometry

... intrinsic qualities rather than the extrinsic parameter-dependent description, which hints at the power of using a differential geometric approach in statistics.

For now let us focus on some manifold, M, whose points are given by the parameters of an unnormalised density function, which we can consider as a log-likelihood of some statistical model given some data. At a particular point $\theta$, the derivatives of the log-likelihood are tangent to the manifold and form a basis for the tangent space at $\theta$, denoted by $T_\theta M$. These tangent basis vectors are simply the score vectors at $\theta$,

$$\nabla_\theta \mathcal{L} = \left[ \frac{\partial \mathcal{L}}{\partial \theta_1}, \ldots, \frac{\partial \mathcal{L}}{\partial \theta_n} \right]^T \qquad (3.13)$$

The tangent space is a linear approximation of the manifold at a given point and it has the same dimensionality. A natural inner product for this vector space is given by the covariance of the basis score vectors, since the covariance function satisfies the same properties, namely symmetry, bilinearity, and positive-definiteness. This inner product then turns out simply to be the Expected Fisher Information

$$G_{i,j} = \mathrm{Cov}\!\left( \frac{\partial \mathcal{L}}{\partial \theta_i}, \frac{\partial \mathcal{L}}{\partial \theta_j} \right) \qquad (3.14)$$

$$\phantom{G_{i,j}} = \mathbb{E}_{p(x \mid \theta)}\!\left[ \frac{\partial \mathcal{L}}{\partial \theta_i}\, \frac{\partial \mathcal{L}}{\partial \theta_j} \right] \qquad (3.15)$$

which follows from the fact that the expectation of the score is zero,

$$\mathbb{E}_{p(x \mid \theta)}\!\left[ \frac{\partial \mathcal{L}}{\partial \theta_i} \right] = \int \frac{1}{p(x \mid \theta)}\, \frac{\partial}{\partial \theta_i} p(x \mid \theta)\; p(x \mid \theta)\, dx \qquad (3.16)$$

$$= \frac{\partial}{\partial \theta_i} \int p(x \mid \theta)\, dx \qquad (3.17)$$

$$= 0 \qquad (3.18)$$

The Expected Fisher Information can also be expressed in terms of second partial derivatives, which may be easier to compute for certain problems. This can be obtained by considering the expectation of the score function, ...
• Expected Fisher Information is equivalent to the covariance of the tangent vectors at a point
• Metric tensors transform using the Jacobian of any reparameterisation - squared distance is invariant!
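Concretely, if $\phi$ is any reparameterisation with $\theta = \theta(\phi)$ and Jacobian $J_{ij} = \partial \theta_i / \partial \phi_j$, the metric transforms as

$$\tilde{G}(\phi) = J^T G\big(\theta(\phi)\big)\, J,$$

so the squared length of a small displacement, $\delta\phi^T \tilde{G}(\phi)\, \delta\phi = \delta\theta^T G(\theta)\, \delta\theta$, is the same in either parameterisation.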
MANIFOLD MCMC
• Diffusion processes across the manifold
• Hamiltonian dynamics across the manifold
We can make MCMC proposals based on:
MANIFOLD MALA
The following stochastic differential equation defines a Langevin diffusion on a Riemannian manifold:

$$d\theta(t) = \tfrac{1}{2} \tilde{\nabla}_\theta \mathcal{L}\big(\theta(t)\big)\, dt + d\tilde{b}(t)$$

where the natural gradient is denoted by $\tilde{\nabla}_\theta \mathcal{L}(\theta) = G^{-1}(\theta)\, \nabla_\theta \mathcal{L}(\theta)$ and the Brownian motion on the manifold is

$$d\tilde{b}_i(t) = |G(\theta(t))|^{-1/2} \sum_j \frac{\partial}{\partial \theta_j} \Big[ G^{-1}(\theta(t))_{ij}\, |G(\theta(t))|^{1/2} \Big]\, dt + \Big[ \sqrt{G^{-1}(\theta(t))}\, db(t) \Big]_i$$

Discretising gives us the update step for a Markov chain. If we assume a locally constant metric tensor we obtain

$$\theta^* = \theta + \tfrac{\epsilon^2}{2} G^{-1}(\theta)\, \nabla_\theta \mathcal{L}(\theta) + \epsilon \sqrt{G^{-1}(\theta)}\, z,$$

which we can compare to a pre-conditioned MALA proposal with a fixed matrix $M$,

$$\theta^* = \theta + \tfrac{\epsilon^2}{2} M\, \nabla_\theta \mathcal{L}(\theta) + \epsilon \sqrt{M}\, z.$$

We may therefore use proposal $q(\theta^* \mid \theta) = \mathcal{N}\big(\theta^* \mid \mu(\theta, \epsilon), \epsilon^2 G^{-1}(\theta)\big)$ and acceptance probability $\min\left\{1, \frac{p(\theta^*)\, q(\theta \mid \theta^*)}{p(\theta)\, q(\theta^* \mid \theta)}\right\}$.
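A sketch of one step of the simplified (locally constant metric) version in Python, assuming a user-supplied function metric(theta) (hypothetical name) returning the expected Fisher information $G(\theta)$, alongside log_post and grad_log_post as before:

```python
import numpy as np

def simplified_mmala_step(theta, log_post, grad_log_post, metric, eps, rng):
    """One step of simplified manifold MALA (metric treated as locally constant)."""
    d = theta.size

    def mean_and_cov(x):
        # Position-dependent preconditioning by the inverse metric G(x)^{-1}.
        G_inv = np.linalg.inv(metric(x))
        mean = x + 0.5 * eps**2 * G_inv @ grad_log_post(x)
        cov = eps**2 * G_inv
        return mean, cov

    def log_q(x_to, x_from):
        # Full Gaussian log density N(x_to | mean(x_from), eps^2 G(x_from)^{-1});
        # the normalising terms no longer cancel because the metric varies.
        mean, cov = mean_and_cov(x_from)
        diff = x_to - mean
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (diff @ np.linalg.solve(cov, diff)
                       + logdet + d * np.log(2.0 * np.pi))

    mean, cov = mean_and_cov(theta)
    prop = rng.multivariate_normal(mean, cov)
    log_alpha = (log_post(prop) + log_q(theta, prop)
                 - log_post(theta) - log_q(prop, theta))
    return prop if np.log(rng.uniform()) < log_alpha else theta
```

Because the proposal covariance depends on the current position, the forward and reverse proposal densities differ, which is why the full Gaussian density (including the determinant terms) appears in the acceptance ratio.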
AN EXAMPLE
CIRCADIAN RHYTHM MODEL
Note the saturating reaction rates due to the Michaelis-Menten kinetic terms
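For reference, a Michaelis-Menten rate has the saturating form

$$v([S]) = \frac{V_{\max}\,[S]}{K_m + [S]},$$

which tends to $V_{\max}$ as the substrate concentration $[S]$ grows, rather than increasing without bound as a mass-action term would.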
CIRCADIAN ODE MODEL
We can define the log-likelihood with Gaussian error model as
Derivatives of the log-likelihood require the sensitivities
The metric tensor also requires the sensitivities
CALCULATING THE METRIC
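As a sketch, for a single observed species with the Gaussian error model above (noise variance $\sigma^2$, and ignoring any prior contribution to the metric), the expected Fisher information reduces to a sum of products of sensitivities,

$$G_{ij}(\theta) = \frac{1}{\sigma^2} \sum_{n=1}^{N} \frac{\partial x(t_n; \theta)}{\partial \theta_i}\, \frac{\partial x(t_n; \theta)}{\partial \theta_j},$$

which is why the metric tensor, like the gradient, requires the sensitivities.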
ESTIMATING THE METRIC
• There may be cases where we cannot calculate the metric tensor directly. For example, employing a robust Student-t error model renders the expected Fisher Information intractable
• In such cases we can employ the standard trick of extending the state space, and use a sampling scheme to estimate the metric tensor at each iteration
In particular, we may define the extended state-space to have a joint distribution
Given the current position we can propose a new state which we accept with probability
This is a reversible transition, and we may define the auxiliary distribution to be the likelihood function, so that the auxiliary variables represent samples of pseudodata, which we can use to obtain an empirical estimate of the expected Fisher Information at each iteration, simply by calculating the covariance of the tangent vectors, since the expected Fisher Information is exactly the covariance of the score vectors (equation 3.14).
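A short Python sketch of this empirical estimate, with sample_pseudodata and score as hypothetical stand-ins for the model-specific pieces (drawing one pseudodata set from the likelihood at the current parameters, and evaluating the corresponding score vector):

```python
import numpy as np

def empirical_fisher_info(theta, sample_pseudodata, score, n_samples=100, rng=None):
    """Estimate the expected Fisher information G(theta) by Monte Carlo.

    sample_pseudodata(theta, rng) : draws one pseudodata set from p(x | theta).
    score(x, theta)               : gradient of the log-likelihood at theta for data x.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Tangent (score) vectors evaluated at pseudodata drawn from the model itself.
    scores = np.array([score(sample_pseudodata(theta, rng), theta)
                       for _ in range(n_samples)])
    # The expected score is zero, so the Fisher information is the covariance
    # of the score vectors; np.cov centres the samples, which also works here.
    return np.cov(scores, rowvar=False)
```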
Results from the circadian ODE model with a Student-t likelihood for Metropolis-Hastings, MALA and mMALA.
CONCLUSIONS
• Riemannian geometry is extremely useful in MCMC
• Such algorithms can help us sample efficiently from high-dimensional and strongly correlated distributions by following the geometric structure of the manifold
SOME FURTHER THOUGHTS
• Currently developing software for a highly parallelised version of differential geometric MCMC for use on HECToR (the UK national supercomputer) - EPSRC grant in collaboration with NAG.