Riemannian Manifolds and Statistical Models
TRANSCRIPT
CSML Lunch Time Talk, Friday 23rd November 2012
Ben Calderhead, Research Fellow
CoMPLEX, University College London
Riemannian Manifolds and Statistical Models
The use of geometry in Markov chain Monte Carlo
“Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods”, Mark Girolami and Ben Calderhead, Journal of the Royal Statistical Society: Series B (with discussion)
www.ucl.ac.uk/statistics/research/csi
[Images: Bernhard Riemann, a manifold, William Hamilton, Paul Langevin, a casino...]
Differential Geometric MCMC Methods and Applications, Ben Calderhead, PhD Thesis, University of Glasgow (2011)
(google “ben calderhead thesis”)
Many statistical models have a natural geometric structure that is Riemannian in nature.
Can we use this geometric information to design better Markov chain Monte Carlo (MCMC) algorithms?
• Ben Calderhead, Differential Geometric MCMC Methods and Applications, PhD Thesis, University of Glasgow (2011)
• Ben Calderhead & Mark Girolami, Statistical Analysis of Nonlinear Dynamical Systems using Differential Geometric Sampling Methods, Journal of the Royal Society Interface Focus (2011)
• Mark Girolami & Ben Calderhead, Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods, Journal of the Royal Statistical Society: Series B (with discussion), (2011) Vol. 73(2), 123-214
• Ben Calderhead & Mark Girolami, Estimating Bayes Factors via Thermodynamic Integration and Population MCMC, Computational Statistics and Data Analysis, (2009), Elsevier Press, Vol. 53, 4028-4045
SOME MOTIVATION
HIGHLY NONLINEAR MODELS
• Often sparse, uncertain data with unobserved species
• Often multiple network topologies consistent with the known biology
MODELLING QUESTIONS
• Which parameters should we use for a given model?
• Which model structure is most appropriate to describe the system of interest?
[Figure: Model 2 - signalling network with an interaction between BCR‐ABL and JAK2. Species: BCR‐ABL, pBCR‐ABL, JAK2, pJAK2, STAT5, pSTAT5; inhibitors TKI and JAKI (forming the TKI_BCR‐ABL and JAKI_JAK2 complexes); growth factor input, self-phosphorylation, dephosphorylation and nuclear export; rate constants k1-k16.]
[Figure: Model 1 - signalling network with no interaction between BCR‐ABL and JAK2. Same species, inhibitors and reactions as above, with rate constants k1-k12.]
Nonlinear dynamics, correlation structure, identifiability... all create problems for standard MCMC.
BAYESIAN APPROACH
Posterior distribution characterises the uncertainty in the parameters
... and rational framework for making sense of the world around us, letting us explicitly state our assumptions and update our current knowledge in light of newly acquired data.

Probability theory has been around since the 18th century (16, 51) as a means of making inferences in light of incomplete information. The axiomatic formulation of probability theory by Kolmogorov (110), together with a derivation by Cox (39) from a set of postulates that satisfy the desirable properties we would wish to have in a system of reasoning, have made Bayesian methods arguably the preferred method for inductive inference. Recent contributions by Knuth and Skilling (180) add further support for the use of Bayesian probability; based on symmetry assumptions, they show that one is led to the probability calculus as the only logical and consistent calculus for reasoning under uncertainty.

Bayes' theorem is simply an expression based on conditional probability: it states the conditional probability of an event A given an event B in terms of the probability of A and the probability of B given A. In the context of a statistical model, the posterior distribution of the model parameters, $\theta = [\theta_1, \ldots, \theta_D]^T$, given the data, $y = [y_1, \ldots, y_N]^T$, is proportional to the prior distribution of the parameters multiplied by the likelihood of the data given the parameters,

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta)\, p(\theta)\, d\theta} \qquad (1.3)$$

Here the marginal likelihood in the denominator normalises the posterior density, such that it integrates to one and is a correctly defined probability distribution.

1.4 Monte Carlo Methods

For the purpose of making predictions, we often want to calculate expectations of a function with respect to the posterior distribution,

$$\mu_f = \mathbb{E}_{p(\theta \mid y)}\big[f(\theta)\big] = \int f(\theta)\, p(\theta \mid y)\, d\theta \qquad (1.4)$$

Since calculating an expectation is essentially just the same task as evaluating an integral, we could use quadrature methods and other numerical integration schemes.
We can consider an MCMC approach the “gold standard” as we can calculate quantities to arbitrary precision, given sufficient samples
The challenge: How do we do this most efficiently?
ODES AS STATISTICAL MODELS
We can define the log-likelihood as
Derivatives of the LL require the sensitivities
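As a sketch, assuming observations $y_n$ of the ODE solution $x(t_n; \theta)$ at times $t_n$ with i.i.d. Gaussian noise of variance $\sigma^2$ (the error model used later for the circadian example), the log-likelihood and its derivatives take the form

$$\mathcal{L}(\theta) = \log p(y \mid \theta) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \big( y_n - x(t_n; \theta) \big)^2 - \frac{N}{2}\log(2\pi\sigma^2)$$

$$\frac{\partial \mathcal{L}}{\partial \theta_i} = \frac{1}{\sigma^2} \sum_{n=1}^{N} \big( y_n - x(t_n; \theta) \big)\, \frac{\partial x(t_n; \theta)}{\partial \theta_i}$$

so the gradient of the log-likelihood is built from the sensitivities $\partial x(t_n; \theta)/\partial \theta_i$.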
SOLVING ODES
General form for an ODE and sensitivities
Sensitivities of an ODE may be computed
Sensitivity information is important for exploring this parameter space
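In standard form, writing the ODE as $\frac{dx}{dt} = f(x, \theta, t)$ with states $x$ and parameters $\theta$, the first-order sensitivities $S_i = \partial x / \partial \theta_i$ satisfy the auxiliary linear ODE obtained by differentiating through the dynamics,

$$\frac{dS_i}{dt} = \frac{\partial f}{\partial x}\, S_i + \frac{\partial f}{\partial \theta_i},$$

so they can be computed by integrating this augmented system alongside the original states.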
POSTERIOR DISTRIBUTIONS
• Strong correlation structure
• Possible heavy tails
• High dimensional?
QUICK REMINDER OF THE MCMC BASICS
REMINDER OF MCMC BASICS
The rate of convergence is independent of dimensionality, given independently drawn samples.
We can use a Monte Carlo estimator:
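With $N$ samples $\theta^{(n)}$ drawn from the posterior, the estimator takes the usual form

$$\hat{\mu}_f = \frac{1}{N} \sum_{n=1}^{N} f\big(\theta^{(n)}\big) \approx \mathbb{E}_{p(\theta \mid y)}\big[f(\theta)\big],$$

whose Monte Carlo standard error shrinks as $O(N^{-1/2})$ regardless of the dimension of $\theta$ - the dimension-independence referred to above.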
REMINDER OF MCMC BASICS
We may obtain samples from an ergodic Markov process with the required stationary distribution.
We can construct such a Markov chain that converges if it is ergodic and leaves the target invariant,

$$\pi(\theta') = \int K(\theta' \mid \theta)\, \pi(\theta)\, d\theta,$$

where $K(\theta' \mid \theta)$ is the transition kernel and $\pi(\theta) = p(\theta \mid y)$ is the required stationary distribution.
REMINDER OF MCMC BASICS
Detailed balance is a convenient, sufficient condition,
and chains that satisfy this are reversible.
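For a target $\pi(\theta)$ and transition kernel $K(\theta' \mid \theta)$, detailed balance is the condition

$$\pi(\theta)\, K(\theta' \mid \theta) = \pi(\theta')\, K(\theta \mid \theta'),$$

and integrating both sides over $\theta$ shows that $\pi$ is then a stationary distribution of the chain.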
REMINDER OF MCMC BASICS
Convergence is guaranteed...
as the number of samples tends to infinity.
Unfortunately we don’t have an infinite amount of time!
WHAT CAN GO WRONG?
If the output of the model depends on a combination of two parameters, the conditional distributions may be much more constrained than the marginal distributions.
THE MAIN ISSUES
• Global convergence (is our target multimodal?)
• Local mixing (once we’ve found the mode)
• Computational cost (we want distant moves that are accepted with high probability)
MCMC FLAVOURS
• “Vanilla” Metropolis-Hastings
• Slice Sampling
• Reversible Jump MCMC
• Adaptive MCMC
• Particle MCMC
• Metropolis-adjusted Langevin Algorithm
• Hamiltonian Monte Carlo
• Bridge Sampling/Simulated Tempering
• Differential Geometric MCMC
Introducing auxiliary variables is a common trick for developing new MCMC algorithms.
• Slice Sampling
• Hamiltonian Monte Carlo
• Bridge Sampling/Simulated Tempering
• Riemannian Manifold HMC
Estimating local covariance structure to improve proposals.
• Adaptive MCMC
• Particle MCMC
MCMC proposals based on physically inspired dynamics.
• Metropolis-adjusted Langevin Algorithm
• Hamiltonian Monte Carlo
METROPOLIS-ADJUSTED LANGEVIN DIFFUSION
The following stochastic differential equation defines a Langevin diffusion:

$$d\theta(t) = \tfrac{1}{2} \nabla_\theta \mathcal{L}\big(\theta(t)\big)\, dt + db(t)$$

An Euler discretisation gives us the following proposal mechanism,

$$\theta^* = \theta + \tfrac{\epsilon^2}{2} \nabla_\theta \mathcal{L}(\theta) + \epsilon z,$$

where $z \sim \mathcal{N}(0, I)$ and $\epsilon$ is the integration stepsize. We propose $\theta^* \sim q(\theta^* \mid \theta) = \mathcal{N}\big(\theta^* \mid \mu(\theta, \epsilon), \epsilon^2 I\big)$, where $\mu(\theta, \epsilon) = \theta + \tfrac{\epsilon^2}{2} \nabla_\theta \mathcal{L}(\theta)$, and accept with probability $\min\left\{1, \frac{p(\theta^*)\, q(\theta \mid \theta^*)}{p(\theta)\, q(\theta^* \mid \theta)}\right\}$.
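A minimal sketch of this proposal-and-accept loop in Python, assuming user-supplied callables log_post and grad_log_post (hypothetical names) for the log target density and its gradient:

```python
import numpy as np

def mala(log_post, grad_log_post, theta0, eps=0.1, n_iter=5000, rng=None):
    """Metropolis-adjusted Langevin algorithm (sketch).

    log_post, grad_log_post : callables giving the log target density and its
        gradient at a parameter vector (assumed supplied by the user).
    eps : integration stepsize of the Euler discretisation.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_iter, theta.size))

    def log_q(x_to, x_from):
        # Log density (up to a constant) of the Gaussian proposal centred
        # at one Euler step along the Langevin drift from x_from.
        mean = x_from + 0.5 * eps**2 * grad_log_post(x_from)
        return -0.5 * np.sum((x_to - mean) ** 2) / eps**2

    for i in range(n_iter):
        # Propose: Langevin drift plus Gaussian noise of scale eps.
        prop = (theta + 0.5 * eps**2 * grad_log_post(theta)
                + eps * rng.standard_normal(theta.size))
        # Metropolis-Hastings correction for the discretisation error.
        log_alpha = (log_post(prop) + log_q(theta, prop)
                     - log_post(theta) - log_q(prop, theta))
        if np.log(rng.uniform()) < log_alpha:
            theta = prop
        samples[i] = theta
    return samples
```

Dropping the drift term recovers a random-walk Metropolis proposal; the gradient is what steers proposals towards higher-density regions.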
DIFFERENTIAL GEOMETRY IN MCMC
1st order geometry can be useful... but sometimes misleading.
Instead of estimating local covariance structure, why not calculate it directly?
SOME INTUITION
Actual distances depend not only on location, but also on the geometry at that point.
• Expected Fisher Information defines a metric (symmetric, bilinear, positive-definite), such that the parameter space can be represented as a Riemannian manifold (Rao, 1945)
• Defines a local basis for the tangent space at each point and a distance measure at each point on the manifold - this links with the sensitivity of the statistical model
3.2 An Introduction to Riemannian Geometry

... intrinsic qualities rather than the extrinsic parameter-dependent description, which hints at the power of using a differential geometric approach in statistics.

For now let us focus on some manifold, M, whose points are given by the parameters of an unnormalised density function, which we can consider as a log-likelihood of some statistical model given some data. At a particular point $\theta$, the derivatives of the log-likelihood are tangent to the manifold and form a basis for the tangent space at $\theta$, denoted by $T_\theta M$. These tangent basis vectors are simply the score vectors at $\theta$,

$$\nabla_\theta \mathcal{L} = \left[ \frac{\partial \mathcal{L}}{\partial \theta_1}, \ldots, \frac{\partial \mathcal{L}}{\partial \theta_n} \right]^T \qquad (3.13)$$

The tangent space is a linear approximation of the manifold at a given point and it has the same dimensionality. A natural inner product for this vector space is given by the covariance of the basis score vectors, since the covariance function satisfies the same properties, namely symmetry, bilinearity, and positive-definiteness. This inner product then turns out simply to be the Expected Fisher Information

$$G_{i,j} = \mathrm{Cov}\!\left( \frac{\partial \mathcal{L}}{\partial \theta_i}, \frac{\partial \mathcal{L}}{\partial \theta_j} \right) \qquad (3.14)$$

$$\phantom{G_{i,j}} = \mathbb{E}_{p(x \mid \theta)}\!\left[ \frac{\partial \mathcal{L}}{\partial \theta_i}\, \frac{\partial \mathcal{L}}{\partial \theta_j} \right] \qquad (3.15)$$

which follows from the fact that the expectation of the score is zero,

$$\mathbb{E}_{p(x \mid \theta)}\!\left[ \frac{\partial \mathcal{L}}{\partial \theta_i} \right] = \int \frac{1}{p(x \mid \theta)}\, \frac{\partial}{\partial \theta_i} p(x \mid \theta)\; p(x \mid \theta)\, dx \qquad (3.16)$$

$$= \frac{\partial}{\partial \theta_i} \int p(x \mid \theta)\, dx \qquad (3.17)$$

$$= 0 \qquad (3.18)$$

The Expected Fisher Information can also be expressed in terms of second partial derivatives, which may be easier to compute for certain problems. This can be obtained by considering the expectation of the score function, ...
• Expected Fisher Information is equivalent to the covariance of the tangent vectors at a point
• Metric tensors transform using the Jacobian of any reparameterisation - squared distance is invariant!
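Concretely, if $\phi$ is any reparameterisation with $\theta = \theta(\phi)$ and Jacobian $J_{ij} = \partial \theta_i / \partial \phi_j$, the metric transforms as

$$\tilde{G}(\phi) = J^T G\big(\theta(\phi)\big)\, J,$$

so the squared length of a small displacement, $\delta\phi^T \tilde{G}(\phi)\, \delta\phi = \delta\theta^T G(\theta)\, \delta\theta$, is the same in either parameterisation.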
MANIFOLD MCMC
• Diffusion processes across the manifold
• Hamiltonian dynamics across the manifold
We can make MCMC proposals based on:
MANIFOLD MALA
The following stochastic differential equation defines a Langevin diffusion on a Riemannian manifold:

$$d\theta(t) = \tfrac{1}{2} \tilde{\nabla}_\theta \mathcal{L}\big(\theta(t)\big)\, dt + d\tilde{b}(t)$$

where the natural gradient is denoted by $\tilde{\nabla}_\theta \mathcal{L}(\theta) = G^{-1}(\theta)\, \nabla_\theta \mathcal{L}(\theta)$ and the Brownian motion on the manifold is

$$d\tilde{b}_i(t) = |G(\theta(t))|^{-1/2} \sum_j \frac{\partial}{\partial \theta_j} \Big[ G^{-1}(\theta(t))_{ij}\, |G(\theta(t))|^{1/2} \Big]\, dt + \Big[ \sqrt{G^{-1}(\theta(t))}\, db(t) \Big]_i$$

Discretising gives us the update step for a Markov chain. If we assume a locally constant metric tensor we obtain

$$\theta^* = \theta + \tfrac{\epsilon^2}{2} G^{-1}(\theta)\, \nabla_\theta \mathcal{L}(\theta) + \epsilon \sqrt{G^{-1}(\theta)}\, z,$$

which we can compare to a pre-conditioned MALA proposal with a fixed matrix $M$,

$$\theta^* = \theta + \tfrac{\epsilon^2}{2} M\, \nabla_\theta \mathcal{L}(\theta) + \epsilon \sqrt{M}\, z.$$

We may therefore use proposal $q(\theta^* \mid \theta) = \mathcal{N}\big(\theta^* \mid \mu(\theta, \epsilon), \epsilon^2 G^{-1}(\theta)\big)$ and acceptance probability $\min\left\{1, \frac{p(\theta^*)\, q(\theta \mid \theta^*)}{p(\theta)\, q(\theta^* \mid \theta)}\right\}$.
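A sketch of one step of the simplified (locally constant metric) version in Python, assuming a user-supplied function metric(theta) (hypothetical name) returning the expected Fisher information $G(\theta)$, alongside log_post and grad_log_post as before:

```python
import numpy as np

def simplified_mmala_step(theta, log_post, grad_log_post, metric, eps, rng):
    """One step of simplified manifold MALA (metric treated as locally constant)."""
    d = theta.size

    def mean_and_cov(x):
        # Position-dependent preconditioning by the inverse metric G(x)^{-1}.
        G_inv = np.linalg.inv(metric(x))
        mean = x + 0.5 * eps**2 * G_inv @ grad_log_post(x)
        cov = eps**2 * G_inv
        return mean, cov

    def log_q(x_to, x_from):
        # Full Gaussian log density N(x_to | mean(x_from), eps^2 G(x_from)^{-1});
        # the normalising terms no longer cancel because the metric varies.
        mean, cov = mean_and_cov(x_from)
        diff = x_to - mean
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (diff @ np.linalg.solve(cov, diff)
                       + logdet + d * np.log(2.0 * np.pi))

    mean, cov = mean_and_cov(theta)
    prop = rng.multivariate_normal(mean, cov)
    log_alpha = (log_post(prop) + log_q(theta, prop)
                 - log_post(theta) - log_q(prop, theta))
    return prop if np.log(rng.uniform()) < log_alpha else theta
```

Because the proposal covariance depends on the current position, the forward and reverse proposal densities differ, which is why the full Gaussian density (including the determinant terms) appears in the acceptance ratio.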
AN EXAMPLE
CIRCADIAN RHYTHM MODEL
Note the saturating reaction rates due to the Michaelis-Menten kinetic terms
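For reference, a Michaelis-Menten rate has the saturating form

$$v([S]) = \frac{V_{\max}\,[S]}{K_m + [S]},$$

which tends to $V_{\max}$ as the substrate concentration $[S]$ grows, rather than increasing without bound as a mass-action term would.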
CIRCADIAN ODE MODEL
We can define the log-likelihood with Gaussian error model as
Derivatives of the log-likelihood require the sensitivities
The metric tensor also requires the sensitivities
CALCULATING THE METRIC
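As a sketch, for a single observed species with the Gaussian error model above (noise variance $\sigma^2$, and ignoring any prior contribution to the metric), the expected Fisher information reduces to a sum of products of sensitivities,

$$G_{ij}(\theta) = \frac{1}{\sigma^2} \sum_{n=1}^{N} \frac{\partial x(t_n; \theta)}{\partial \theta_i}\, \frac{\partial x(t_n; \theta)}{\partial \theta_j},$$

which is why the metric tensor, like the gradient, requires the sensitivities.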
ESTIMATING THE METRIC
• There may be cases where we cannot calculate the metric tensor directly. For example, employing a robust Student-t error model renders the expected Fisher Information intractable
• In such cases we can employ the standard trick of extending the state space, and use a sampling scheme to estimate the metric tensor at each iteration
In particular, we may define the extended state-space to have a joint distribution
Given the current position we can propose a new state which we accept with probability
This is a reversible transition, and we may define the auxiliary distribution to be the likelihood function, so that the auxiliary variables represent samples of pseudodata, which we can use to obtain an empirical estimate of the expected Fisher Information at each iteration, simply by calculating the covariance of the tangent vectors, since the expected Fisher Information is exactly the covariance of the score vectors (equation 3.14).
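A short Python sketch of this empirical estimate, with sample_pseudodata and score as hypothetical stand-ins for the model-specific pieces (drawing one pseudodata set from the likelihood at the current parameters, and evaluating the corresponding score vector):

```python
import numpy as np

def empirical_fisher_info(theta, sample_pseudodata, score, n_samples=100, rng=None):
    """Estimate the expected Fisher information G(theta) by Monte Carlo.

    sample_pseudodata(theta, rng) : draws one pseudodata set from p(x | theta).
    score(x, theta)               : gradient of the log-likelihood at theta for data x.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Tangent (score) vectors evaluated at pseudodata drawn from the model itself.
    scores = np.array([score(sample_pseudodata(theta, rng), theta)
                       for _ in range(n_samples)])
    # The expected score is zero, so the Fisher information is the covariance
    # of the score vectors; np.cov centres the samples, which also works here.
    return np.cov(scores, rowvar=False)
```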
Results from the circadian ODE model with a Student-t likelihood for Metropolis-Hastings, MALA and mMALA.
CONCLUSIONS
• Riemannian geometry is extremely useful in MCMC
• Such algorithms can help us sample efficiently from high-dimensional and strongly correlated distributions by following the geometric structure of the manifold
SOME FURTHER THOUGHTS
• Currently developing software for a highly parallelised version of differential geometric MCMC for use on HECToR (the UK national supercomputer) - EPSRC grant in collaboration with NAG.