TRANSCRIPT
MCQMC 2012
From inference to modelling to algorithms and back again
Kerrie Mengersen, QUT Brisbane
Collaborative Centre for Data Analysis,
Modelling and Computation QUT, GPO Box 2434
Brisbane 4001, Australia
Acknowledgements: BRAG
Bayesian methods and models
+ Fast computation
+ Applications in environment, health, biology, industry
So what’s the problem?
Matchmaking 101
Study 1: Presence/absence models (Sama Low Choy, Mark Stanaway)
Plant biosecurity
• Observations and data
– Visual inspection symptoms
– Presence / absence data
– Space and time
• Dynamic invasion process
– Growth, spread
• Inference
– Map probability of extent over time
– Useful scale for managing trade / eradication
– Currently use informal qualitative approach
• Hierarchical Bayesian model to formalise the information
From inference to model
Hierarchical Bayesian model for plant pest spread
• Data Model: Pr(data | incursion process and data parameters)
– How data is observed given underlying pest extent
• Process Model: Pr(incursion process | process parameters)
– Potential extent given epidemiology / ecology
• Parameter Model: Pr(data and process parameters)
– Prior distribution to describe uncertainty in detectability, exposure, growth …
• The posterior distribution of the incursion process (and parameters) is related to the prior distribution and data by:
Pr(process, parameters | data) ∝ Pr(data | process, parameters) Pr(process | parameters) Pr(parameters)
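To make the factorisation concrete, here is a minimal sketch of how the three levels combine into an unnormalised log posterior for a presence/absence incursion model. Variable names, priors and the toy data are illustrative assumptions, not the Stanaway et al. model.

```python
import numpy as np
from scipy.stats import beta, bernoulli

# Hypothetical minimal sketch of the data/process/parameter factorisation for a
# presence/absence incursion model; names and priors are illustrative only.

def log_prior(detect_p, colonise_p):
    # Parameter model: Beta priors expressing uncertainty in detectability and exposure
    return beta.logpdf(detect_p, 2, 2) + beta.logpdf(colonise_p, 1, 9)

def log_process(extent, colonise_p):
    # Process model: each site independently colonised with probability colonise_p
    return bernoulli.logpmf(extent, colonise_p).sum()

def log_data(obs, extent, detect_p):
    # Data model: a positive inspection is only possible where the pest is present
    p_pos = detect_p * extent              # detection probability is 0 at unoccupied sites
    return bernoulli.logpmf(obs, p_pos).sum()

def log_posterior(obs, extent, detect_p, colonise_p):
    # Unnormalised posterior = data model * process model * parameter model
    return (log_data(obs, extent, detect_p)
            + log_process(extent, colonise_p)
            + log_prior(detect_p, colonise_p))

# Toy example: 10 sites, latent extent, imperfect detection
rng = np.random.default_rng(1)
extent = rng.binomial(1, 0.3, size=10)     # latent presence/absence
obs = rng.binomial(1, 0.7 * extent)        # observed detections
print(log_posterior(obs, extent, detect_p=0.7, colonise_p=0.3))
```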
Early warning surveillance
• Priors based on emergency plant pest characteristics:
– exposure rate for colonisation probability
– spread rates to link sites together for spatial analysis
• Add surveillance data
• Posterior evaluation:
– modest reduction in area freedom
– large reduction in estimated extent
– residual “risk” maps to target surveillance
Observation Parameter Estimates
• Taking into account the invasion process
• Hosts
– Host suitability
• Inspector efficiency
– Identify contributions
Study 2: Mixture models (Clair Alston)
CAT scanning sheep
• Finite mixture model: yi ~ Σj λj N(μj, σj²)
• Include spatial information
From inference to model
What proportions of the sheep carcase are muscle, fat and bone?
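As a rough illustration of the inference, the following sketch fits a three-component Gaussian mixture to synthetic voxel intensities with EM via scikit-learn. The intensity values, tissue labels and the use of sklearn are assumptions for the example; the actual analysis also uses spatial information.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Minimal sketch: classify CT voxel intensities into fat / muscle / bone with a
# 3-component Gaussian mixture fitted by EM. The intensities below are synthetic.
rng = np.random.default_rng(0)
intensity = np.concatenate([
    rng.normal(-80, 15, 5000),   # fat-like voxels
    rng.normal(50, 20, 8000),    # muscle-like voxels
    rng.normal(700, 150, 2000),  # bone-like voxels
]).reshape(-1, 1)

gm = GaussianMixture(n_components=3, random_state=0).fit(intensity)
labels = gm.predict(intensity)

# Estimated tissue proportions = mixture weights (or empirical label frequencies)
for k in np.argsort(gm.means_.ravel()):
    print(f"mean={gm.means_[k, 0]:7.1f}  weight={gm.weights_[k]:.3f}")
```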
Inside a sheep
Study 3: State space models (Nicole White)
Parkinson’s Disease
PD symptom data
• Current methods for PD subtype classification rely on a few criteria and do not permit uncertainty in subgroup membership.
• Alternative: finite mixture model (equivalent to a latent class analysis for multivariate categorical outcomes)
• Symptom data: Duration of diagnosis, early onset PD, gender, handedness, side of onset
1. Define a finite mixture model based on patient responses to Bernoulli and Multinomial questions.
2. Describe subgroups w.r.t. explanatory variables
3. Obtain patient’s probability of class membership
yij: ith subject's response to item j
From inference to model
PD: Symptom data
PD Signal data:“How will they respond?”
Inferential aims
Identify spikes and assign them to an unknown number of source neurons
Compare clusters between segments within a recording, and between recordings at different locations of the brain
Microelectrode recordings at 3 depths
Each recording was divided into 2.5 sec. segments
Discriminating features found via PCA
DP Model
yi | θi ~ p(yi | θi)
θi ~ G
G ~ DP(α, G0)
For the P retained PCs, yi = (yi1, .., yiP) ~ MVN(μi, Σi)
G0 = p(μ, Σ)
α ~ Ga(2, 2)
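A hedged sketch of the same idea: cluster spike waveforms via PCA features and a truncated, variational Dirichlet-process Gaussian mixture. Here scikit-learn's BayesianGaussianMixture stands in for the Gibbs-sampled DP model on the slide, and the waveforms are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture

# Synthetic spike waveforms from two hypothetical "neurons"
rng = np.random.default_rng(2)
waveforms = np.vstack([
    rng.normal(loc=m, scale=0.3, size=(200, 32))
    for m in (np.sin(np.linspace(0, 3, 32)), np.cos(np.linspace(0, 3, 32)))
])

scores = PCA(n_components=3).fit_transform(waveforms)   # discriminating features

# Truncated DP mixture fitted by variational inference (stand-in for the DP Gibbs sampler)
dpgmm = BayesianGaussianMixture(
    n_components=10,                                    # truncation level
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,                     # plays the role of alpha
    covariance_type="full",
    random_state=0,
).fit(scores)

labels = dpgmm.predict(scores)
print("occupied clusters:", np.unique(labels))
```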
From inference to model
Average waveforms
Study 4: Spatial dynamic factor models (Chris Strickland, Ian Turner)
What can we learn about landuse from MODIS data?
Differentiate land use with the SDFM
• 1st factor influences the temporal dynamics in the right half of the image (woodlands)
• 3rd factor influences the left half of the image (grasslands)
[Figure panels: 1st trend component, 2nd trend component, common cyclical component]
Matchmaking 101
Smart models
Example 1: Generalisation
Mixtures are great, but how do we choose k?
Judith Rousseau
True density: f0(x) = Σ_{j=1..k0} pj0 g(x; γj0)
Propose an overfitted model (k > k0).
Non-identifiable! For example, with one extra component, all values
θ = (p10, .., pk00, 0, γ10, .., γk00, γ)
and all values
θ = (p10, .., pj, .., pk00, pk0+1, γ10, .., γk00, γj0)
with pj + pk0+1 = pj0 fit equally well.
So what?
• Multiplicity of possible solutions => the MLE does not have stable asymptotic behaviour.
• Not important when f is the main object of interest, but important if we want to recover the component parameters.
• It thus becomes crucial to know whether the posterior distribution under overfitted mixtures gives interpretable results.
Possible alternatives to avoid overfitting
• Fruhwirth-Schnatter (2006): in an overfitted mixture, either one of the component weights is zero or two of the component parameters are equal.
• Choose priors that bound the posterior away from the unidentifiability sets.
• Choose priors that induce shrinkage for elements of the component parameters.
Problem: may not be able to fit the true model.
Our result
Assumptions:
– L1 consistency of the posterior
– The component model g is three times differentiable, regular, and integrable
– The prior on γ is continuous and positive, and the prior on (p1, .., pk) satisfies π(p) ∝ p1^(α1−1) … pk^(αk−1)
Our result - 1
• If max(αj) < d/2, where d = dim(γ), then asymptotically the posterior π(θ|x) concentrates on the subset of parameters for which fθ = f0, so the extra k − k0 components have weight 0.
• The reason for this stable behaviour, as opposed to the unstable behaviour of the maximum likelihood estimator, is that integrating out the parameter acts as a penalisation: the posterior essentially puts mass on the sparsest way to approximate the true density.
Our result - 2
• In contrast, if min(αj, j ≤ k) > d/2 and k > k0, then 2 or more components will tend to merge, each with non-negligible weight. This leads to less stable behaviour.
• In the intermediate case, if min(αj, j ≤ k) ≤ d/2 ≤ max(αj, j ≤ k), the situation varies depending on the αj's and on the difference between k and k0.
Implications: Model dimension
• When d/2 > max{αj, j = 1, .., k}, the quantity d·k0 + k0 − 1 + Σ_{j ≥ k0+1} αj appears as an effective dimension of the model.
• This is different from the number of parameters, d·k + k − 1, or from other “effective numbers of parameters”.
• Similar results are obtained for other situations
Example 1: yi ~ N(0,1); fit p N(μ1, 1) + (1−p) N(μ2, 1)
αi = 1 > d/2 (d = 1)
Example 2: yi ~ N(0,1)
Fit G = p N2(μ1, Σ1) + (1−p) N2(μ2, Σ2), Σj diagonal; d = 3; α1 = α2 = 1 < d/2
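The following minimal Gibbs sampler illustrates the Example 1 setting (it is not the authors' code): data from N(0,1) are fitted with an overfitted two-component location mixture under a symmetric Dirichlet(α) prior on the weights, with α < d/2 = 0.5, so the extra component's posterior weight should shrink towards 0. Priors, sample size and chain length are illustrative.

```python
import numpy as np

# Overfitted mixture demo: true density N(0,1), fitted p*N(mu1,1) + (1-p)*N(mu2,1)
# with a Dirichlet(alpha, alpha) prior, alpha = 0.2 < d/2 = 0.5.
rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, size=500)
n, alpha, prior_var = len(y), 0.2, 100.0

mu = np.array([-1.0, 1.0])
p = np.array([0.5, 0.5])
weight_trace = []

for it in range(5000):
    # 1) allocations z_i | mu, p
    logw = np.log(p) - 0.5 * (y[:, None] - mu[None, :]) ** 2
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    z = (rng.random(n) > w[:, 0]).astype(int)

    # 2) weights | z  ~ Dirichlet(alpha + n1, alpha + n2)
    counts = np.array([(z == 0).sum(), (z == 1).sum()])
    p = rng.dirichlet(alpha + counts)

    # 3) means | z, y  (conjugate normal update, sigma = 1 known, mu_j ~ N(0, prior_var))
    for j in range(2):
        nj = counts[j]
        ybar = (y[z == j].mean() if nj > 0 else 0.0)
        post_var = 1.0 / (nj + 1.0 / prior_var)
        mu[j] = rng.normal(post_var * nj * ybar, np.sqrt(post_var))

    if it > 1000:
        weight_trace.append(p.min())

print("posterior mean of the smaller weight:", np.mean(weight_trace))
```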
Conclusions
• The result validates the use of Bayesian estimation in mixture models with too many components.
• It is one of the few examples where the prior can actually have an impact asymptotically, even to first order (consistency), and where choosing a less informative prior leads to better results.
• It also shows that the penalisation effect of integrating out the parameter, as occurs in the Bayesian framework, is useful not only in model-choice or testing contexts but also in estimation contexts.
Example 2: Empirical likelihoods
• Sometimes the likelihood associated with the data is not completely known or cannot be computed in a manageable time (eg population genetic models, hidden Markov models, dynamic models), so traditional tools based on stochastic simulation (eg, regular MCMC) are unavailable or unreliable.
• Eg, biosecurity spread model.
Christian Robert
Model alternative: ELvIS
• Define parameters of interest as functionals of the cdf F (e.g. moments of F), then use Importance Sampling via the Empirical Likelihood.
• Select the F that maximises the likelihood of the data under the moment constraint.
• Given a constraint of the form E(h(Y)) = θ, the EL is defined as
Lel(θ|y) = max_F Π_{i=1:n} {F(yi) − F(yi−1)}
• For example, in the 1-D case when θ = E(Y), the empirical likelihood in θ is the maximum of Π_i pi over p1, .., pn under the constraint Σ_{i=1:n} pi yi = θ
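A minimal sketch of computing the empirical likelihood for the 1-D mean constraint via its dual (Lagrange-multiplier) form; the function name and data are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import brentq

# Empirical likelihood for the constraint E(Y) = theta.
# Dual form: p_i = 1 / (n * (1 + lam*(y_i - theta))), where the multiplier lam
# solves sum_i (y_i - theta) / (1 + lam*(y_i - theta)) = 0.
def log_empirical_likelihood(y, theta):
    z = y - theta
    if z.min() >= 0 or z.max() <= 0:
        return -np.inf                      # theta outside the convex hull of the data
    n = len(y)
    # lam must keep all 1 + lam*z_i > 0: bracket it just inside that region
    lo = -1.0 / z.max() + 1e-8
    hi = -1.0 / z.min() - 1e-8
    lam = brentq(lambda l: np.sum(z / (1.0 + l * z)), lo, hi)
    p = 1.0 / (n * (1.0 + lam * z))
    return np.sum(np.log(p))                # log L_el(theta | y), maximised at ybar

rng = np.random.default_rng(4)
y = rng.normal(2.0, 1.0, size=200)
for theta in (1.8, 2.0, float(y.mean())):
    print(theta, log_empirical_likelihood(y, theta))
```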
Quantile distributions
• A quantile distribution is defined by a closed-form quantile function F-1(p;) and generally has no closed form for the density function.
• Properties: very flexible, very fast to simulate (simple inversion of the uniform distribution).
• Examples: 3/4/5-parameter Tukey’s lambda distribution and generalisations; Burr family; g-and-k and g-and-h distributions.
g-and-k quantile distribution
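For concreteness, a sketch of the commonly used g-and-k quantile function (Rayner and MacGillivray's parameterisation with the conventional constant c = 0.8 is assumed here), together with simulation by inversion of uniforms.

```python
import numpy as np
from scipy.stats import norm

# g-and-k quantile function Q(u; A, B, g, k) with the conventional c = 0.8.
# Simulation is by inversion: draw u ~ U(0,1) and evaluate Q(u).
def gk_quantile(u, A, B, g, k, c=0.8):
    z = norm.ppf(u)                               # standard normal quantile
    skew = 1.0 + c * np.tanh(g * z / 2.0)         # skewness term driven by g
    tails = (1.0 + z ** 2) ** k                   # kurtosis/tail term driven by k
    return A + B * skew * tails * z

rng = np.random.default_rng(5)
u = rng.uniform(size=100_000)

x_normal = gk_quantile(u, A=0.0, B=1.0, g=0.0, k=0.0)   # reduces to N(0,1)
x_alling = gk_quantile(u, A=3.0, B=2.0, g=1.0, k=0.5)   # Allingham's choice

print(x_normal.mean(), x_normal.std())                  # approx 0, 1
print(np.percentile(x_alling, [5, 50, 95]))
```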
Methods for estimating a quantile distribution
• MLE using numerical approximation to the likelihood
• Moment matching
• Generalised bootstrap
• Location and scale-free functionals
• Percentile matching
• Quantile matching
• ABC
• Sequential MC approaches for multivariate extensions of the g-and-k
ELvIS in practice
• Two values of θ = (A, B, g, k):
– θ = (0, 1, 0, 0): standard normal distribution
– θ = (3, 2, 1, 0.5): Allingham's choice
• Two priors for θ:
– U(0, 5)^4
– A ~ U(−5, 5), B ~ U(0, 5), g ~ U(−5, 5), k ~ U(−1, 1)
• Two sample sizes: n = 100, n = 1000
ELvIS in practice: θ = (3, 2, 1, 0.5), n = 100
Matchmaking 101
A wealth of algorithms!
From model to algorithm
Models:
• Logistic regression
• Non-Gaussian state space models
• Spatial dynamic factor models
Evaluate:
• Computation time
• Maximum bias
• sd
• Inefficiency factor (IF)
• Acceptance rate
Chris Strickland
Logistic Regression
k = 2, 4, 8, 20 covariates; n=1000 observations
• Importance sampling (IS):
– E[h(θ)] = ∫ h(θ) [p(θ|y) / q(θ|y)] q(θ|y) dθ
with q(θ|y) ∝ exp(−0.5 (θ − θ*)ᵀ V⁻¹ (θ − θ*))
– θ* is the MLE (mode found by IRWLS); the candidate variance V is taken from the curvature of the log-likelihood at the mode, V = [−∂² log p(y|X,θ) / ∂θ ∂θᵀ]⁻¹ evaluated at θ = θ*
• Random walk Metropolis-Hastings (RWMH):
– same proposal distribution
• Adaptive RWMH (Garthwaite, Yan, Sisson)
– Only needs starting values (θ*) – easiest candidate!
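A sketch of the importance sampler just described, for simulated logistic-regression data with a flat prior. A general-purpose SciPy optimiser is used to find the mode rather than IRWLS, and all data and settings are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

# Importance sampling for a logistic regression posterior with a Gaussian candidate
# centred at the mode, covariance = inverse negative Hessian (Laplace-style).
rng = np.random.default_rng(6)
n, k = 1000, 4
X = rng.normal(size=(n, k))
beta_true = np.array([0.5, -1.0, 0.0, 1.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def neg_log_post(beta):
    eta = X @ beta
    return -(y @ eta - np.logaddexp(0.0, eta).sum())    # flat prior: just -loglik

opt = minimize(neg_log_post, np.zeros(k))               # theta* via numerical optimisation
mode = opt.x
p_hat = 1.0 / (1.0 + np.exp(-X @ mode))
H = X.T @ (X * (p_hat * (1.0 - p_hat))[:, None])        # negative Hessian of the loglik
V = np.linalg.inv(H)                                    # candidate covariance

q = multivariate_normal(mean=mode, cov=V)
draws = q.rvs(size=20_000, random_state=1)
log_w = np.array([-neg_log_post(b) for b in draws]) - q.logpdf(draws)
w = np.exp(log_w - log_w.max())
w /= w.sum()

post_mean = w @ draws                                   # self-normalised IS estimate
print("posterior mean estimate:", np.round(post_mean, 3))
print("effective sample size:", 1.0 / np.sum(w ** 2))
```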
Results
Algorithm   k    time   bias   sd     IF    acc. rate
IS          2    5.1    0.07   0.06   -     -
IS          8    5.6    0.18   0.09   -     -
IS          20   6.3    0.30   0.11   -     -
RWMH        2    8.1    0.08   0.07   12    0.56
RWMH        8    8.8    0.15   0.10   30    0.20
RWMH        20   9.2    0.20   0.12   120   0.03
ARWMH       2    10.5   0.08   0.07   10    0.24
ARWMH       8    10.8   0.20   0.07   35    0.23
ARWMH       20   12.1   0.30   0.12   50    0.21
So what?
• If k is small (e.g. 8), even a naïve candidate is OK.
• When k is larger (e.g. 20), we need something more intelligent, e.g. adaptive MH.
• As models become more complicated, we need to get more sophisticated.
Importance sampler vs MCMC vs Particle filter vs Laplace approximation (INLA)
Non-Gaussian state space model
Non-Gaussian state space model
• Importance sampler
√ General algorithms (Durbin & Koopman 2001 – global approximation) and tailored algorithms (Liesenfeld & Richard 2006 – local approximation)
√ Independence sampler – don't have to worry about correlated draws
√ Parallelisable – potentially much faster
× Difficult to come up with a good candidate distribution
× More difficult as the sample size and model complexity increase
× More complex than MCMC for obtaining non-standard expectations
Non-Gaussian state space model
• MCMC
√ Very flexible w.r.t. extensions to other models
√ The same algorithms can be used as the sample size and/or parameter dimension grows (with some provisos)
× Can be slow, and complicated to achieve good acceptance and mixing
× Single-move samplers perform poorly in terms of mixing (Pitt & Shephard; Kim, Shephard & Chib)
√ Very efficient (mixing) algorithms can be designed for specific problems, e.g. stochastic volatility (Kim, Shephard & Chib)
√ General approaches available – simple reparametrisation can lead to vastly improved simulation efficiency (Strickland et al.)
Non-Gaussian state space model
• Particle filters
√ Easy to implement – at least intuitively
√ Updating estimates as the sample size grows is possibly a lot simpler and cheaper than a full MCMC or IS sampler update
× Perhaps not as easy as it appears once parameter learning is needed
× The particle approximation degenerates – may need MCMC steps, or alternatives
× To do full MCMC updates, need to store the entire history of the particle approximation
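For concreteness, a minimal bootstrap particle filter for a hypothetical non-Gaussian state space model (Poisson observations driven by a Gaussian AR(1) state); parameters are assumed known and the model is illustrative rather than the one used in the study.

```python
import numpy as np
from scipy.stats import poisson

# Bootstrap particle filter for x_t = phi*x_{t-1} + sigma*eta_t, y_t ~ Poisson(exp(x_t)).
rng = np.random.default_rng(7)
phi, sigma, T, N = 0.9, 0.5, 200, 1000

# simulate data from the model
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + sigma * rng.normal()
y = rng.poisson(np.exp(x))

# bootstrap filter
particles = rng.normal(0.0, sigma / np.sqrt(1 - phi ** 2), size=N)  # stationary initialisation
loglik = 0.0
filtered_mean = np.zeros(T)
for t in range(T):
    # propagate through the state equation (the "bootstrap" proposal)
    particles = phi * particles + sigma * rng.normal(size=N)
    # weight by the observation density
    logw = poisson.logpmf(y[t], np.exp(particles))
    maxw = logw.max()
    w = np.exp(logw - maxw)
    loglik += maxw + np.log(w.mean())        # running log-likelihood estimate
    w /= w.sum()
    filtered_mean[t] = w @ particles
    # multinomial resampling to combat weight degeneracy
    particles = particles[rng.choice(N, size=N, p=w)]

print("estimated log-likelihood:", loglik)
```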
Non-Gaussian state space model
• Integrated Nested Laplace Approximation (INLA)
√ Extremely fast
√ If model complexity stays the same, then it can work for very large problems
√ R interface to the code
× Can only handle the small number of hyperparameters that can be forced into the GMRF approximation, so it is restrictive in the problems it can address
So what?
• Many algorithms: general versus specific, flexible versus tailored
• Pros and cons should be weighed against inferential aims, model and computational resources
• Blocking and reparametrisation are two good tricks, but we need to be clever about non-centred reparametrisation
Spatial Dynamic Factor Models
Yt = B ft + εt, εt ~ N(0, Σ), Σ diagonal
factor loadings bj ~ N(0, V(s))
• Spatial correlation:
– Lopes et al. use a GRF: O((p×k*)^3)
– Strickland et al. use a GMRF (images: large discrete spatial domain) + Krylov subspace methods to sample from the GMRF posterior; scales linearly, O(p×k*)
– Rue uses a Cholesky decomposition: more complex, O((p×k*)^(3/2))
So what?
Difference between (i) O((p×k*)^3) and (ii) O(p×k*):
If the data set becomes 100 times larger, (i) will take 1 million times longer, compared to 100 times longer for (ii). Instead of waiting 1 hour, you might have to wait more than 1 year…
Case study
• Spatial domain: 900 pixels
• Temporal: approx. 200 periods (every 16 days)
• Total 180,000 observations, ~3600 parameters
• 10 mins for 15,000 MCMC iterations (on a laptop!)
Conclusions
• The model is too complicated and the datasets too large for IS or PF.
• There are too many parameters for INLA.
• MCMC is thus desirable, but it is extremely important to choose good algorithms!
As the model becomes more complex, the choice of smart algorithms becomes more important and can make a large difference in computation and estimation.
Matchmaking 102
Smart algorithms
Hybrid algorithms
Tierney (1994)
Design an efficient algorithm that combines features of other algorithms, in order to overcome identified weaknesses in the component algorithms.
Comparing algorithms
• Accuracy
– bias (H)
• Efficiency
– rate of convergence, rate of acceptance (A)
– mixing speed (integrated autocorrelation time, τ(H))
• Applicability
– simplicity of set-up, flexibility of tailoring
• Implementation
– coding difficulty, memory storage
– computational demand:
• total number of iterations (T), burn-in (T0)
• correlation along the chain, measured by the effective sample size ESS_H = (T − T0) / τ(H)
Hybrid algorithms - 1
• Metropolis-Hastings algorithms (MHAs)
– Improve mixing speed via:
• parallel chains (MHA)
• repulsive proposal (MHA-RP, M&R), pinball sampler
• delayed rejection (DRA, Tierney & Mira; DRA-LP; DRA-Pinball)
– Improve applicability by:
• reversible jump (Green)
• Metropolis-adjusted Langevin (MALA; Tierney & Roberts, Besag & Green)
Kate Lee, Christian Robert, Ross McVinish
Simulation study
• Model
– mixture of 2-D normals, well separated:
π = 0.5 N([0,0]ᵀ, I2) + 0.5 N([5,5]ᵀ, I2)
– 100 replicated simulations
– Each result is obtained after running the algorithm for 1200 seconds using 10 particles
– Proposal variance = 4; value in brackets is MSE (similar results for variance = 2)
• Platform
– Matlab version 7.0.4
– run on an SGI Altix XE cluster containing 112 x 64-bit Intel Xeon cores
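A minimal MALA sketch targeting the two-component mixture above; the step size and chain length are illustrative, and, as the results below note, too small a step can trap the chain in the nearest mode.

```python
import numpy as np

# MALA targeting 0.5*N([0,0], I) + 0.5*N([5,5], I).
mu1, mu2 = np.array([0.0, 0.0]), np.array([5.0, 5.0])

def log_target(x):
    a = -0.5 * (x - mu1) @ (x - mu1)
    b = -0.5 * (x - mu2) @ (x - mu2)
    return np.logaddexp(a, b) + np.log(0.5)      # log of the mixture density (up to a constant)

def grad_log_target(x):
    d1, d2 = x - mu1, x - mu2
    w1 = np.exp(-0.5 * d1 @ d1)
    w2 = np.exp(-0.5 * d2 @ d2)
    return -(w1 * d1 + w2 * d2) / (w1 + w2)      # gradient of the log mixture

def mala(eps, n_iter=50_000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array([2.5, 2.5])
    chain = np.empty((n_iter, 2))
    for i in range(n_iter):
        mean_fwd = x + 0.5 * eps ** 2 * grad_log_target(x)
        prop = mean_fwd + eps * rng.normal(size=2)
        mean_bwd = prop + 0.5 * eps ** 2 * grad_log_target(prop)
        # MH ratio includes the asymmetric Langevin proposal densities
        log_acc = (log_target(prop) - log_target(x)
                   - ((x - mean_bwd) @ (x - mean_bwd)) / (2 * eps ** 2)
                   + ((prop - mean_fwd) @ (prop - mean_fwd)) / (2 * eps ** 2))
        if np.log(rng.random()) < log_acc:
            x = prop
        chain[i] = x
    return chain

chain = mala(eps=1.5)
print("mean of sampled points:", chain.mean(axis=0))   # near [2.5, 2.5] if both modes are visited
```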
Results
T = no. of simulations; A = acceptance rate; H = accuracy; σ²_H = var(H); τ(H) = autocorrelation time; ESS_H = effective sample size
Results
• MHA
– shortest CPU time per iteration and largest sample size
– need to tune the proposal variance to optimise performance
• MALA
– can get trapped in the nearest mode if the scaling parameter in the variance of the proposal is not sufficiently large
• MHA with RP
– induces a fast-mixing chain, but need to choose a tuning parameter
– expensive to compute
– in rare cases the algorithm can be unstable and get stuck
• DRA
– less correlated chains, higher acceptance rate
– higher computational demand
– Langevin proposal: improves mixing, but the loss in computational efficiency overwhelms the gain in statistical efficiency
– a normal random walk proposal is faster to compute and still improves mixing
Hybrid algorithms - 2
• Population Monte Carlo (PMC)
– Extension of IS, allowing the importance function to adapt to the target in an iterative manner (Cappe et al.)
– PMC with repulsive proposal: create 'holes' around existing particles
• Particle systems and MCMC
– IS + MCMC (M&R)
– parallel MCMC + SMC (Del Moral)
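A minimal sketch of a basic PMC iteration in the spirit of Cappe et al.: propose from Gaussian kernels centred at the current particles, reweight against the target, and resample. The repulsive variant is not shown; the target reuses the two-component mixture from the simulation study, and the kernel scale and particle number are illustrative.

```python
import numpy as np

# Basic Population Monte Carlo for the 2-D mixture 0.5*N([0,0], I) + 0.5*N([5,5], I).
mu1, mu2 = np.array([0.0, 0.0]), np.array([5.0, 5.0])

def log_target(x):                                  # x has shape (n_particles, 2)
    a = -0.5 * np.sum((x - mu1) ** 2, axis=1)
    b = -0.5 * np.sum((x - mu2) ** 2, axis=1)
    return np.logaddexp(a, b) + np.log(0.5)

rng = np.random.default_rng(8)
n, scale, n_iter = 500, 2.0, 30
particles = rng.normal(2.5, 3.0, size=(n, 2))       # dispersed initial population

for _ in range(n_iter):
    proposals = particles + scale * rng.normal(size=(n, 2))
    # importance weight: target / (Gaussian kernel centred at the previous particle)
    log_q = -0.5 * np.sum((proposals - particles) ** 2, axis=1) / scale ** 2
    log_w = log_target(proposals) - log_q
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # resample to form the next population
    particles = proposals[rng.choice(n, size=n, p=w)]

print("population mean:", particles.mean(axis=0))   # roughly midway between the modes
```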
Simulation study repeated
• 100 replicates, run for 500 seconds using 50 particles, with the first 100 iterations ignored
√ accuracy of estimation
√ fast exploration (significantly reduced integrated autocorrelation time)
√ no instability (unlike the MCMC algorithms)
√ less sensitivity to the importance function
√ repulsive effect improved mixing
Summary
Algorithm     EPM   CR   RC   CE   SP   FH   CP   Mode
MALA           1     1    1   -2   -1    0   -1    S
MHA-RP         0     1    0   -2   -1   -1   -2    B
DRA            1     1    0   -1   -1    0    0    B
DRA-LP         2     1    0   -2   -1    0    0    B
DRA-Pinball    2     1    0   -2   -1    0    0    B
PS             2     1    0   -3   -2   -1   -2    B
PMC-R          -     2    0   -1   -1   -1    0    B
Relative performance compared to MHA and PMC.
EPM = efficiency of proposal move; CR = correlation reduction of chain; RC = rate of convergence; CE = cost effectiveness; SP = simplicity of programming; FH = flexibility of hyperparameters; CP = consistency of performance; Mode = preference between a single mode and multimodal problem.
(Column groups on the original slide: Statistical Efficiency, Computation, Applicability.)
Hybrid algorithms - 3
ABCel: Unlike ABC, ABCel does not require:
Conclusions
1. Combining features of individual algorithms may lead to complicated characteristics in a hybrid algorithm.
2. Each individual algorithm may have a strong individual advantage with respect to a particular performance criterion, but this does not guarantee that the hybrid method will enjoy a joint benefit of these strategies.
3. The combination of algorithms may add complexity in set-up, programming and computational expense.
Implementing smart algorithms: PyMCMC
• Python package for fast MCMC
• Takes advantage of the Python libraries NumPy and SciPy
• Classes for Gibbs, M-H, orientational bias MC, slice samplers, etc.
• Models: linear (with stochastic search), logit, probit, log-linear, linear mixed model, probit mixed model, nonlinear mixed models, mixture, spatial mixture, spatial mixture with regressors, time series suite (including DFM and SDFM)
• Straightforward to optimise, extensible to C or Fortran, parallelisable (GPU)
Chris Strickland
PyMCMC
Matchmaking algorithm!
[Diagram: Model ↔ Algorithm, linked through Inference; cool problems, past and future]
Key References
• Lee, K., Mengersen, K., Robert, C.P. (2012) Hybrid models. In Case Studies in Bayesian Modelling. Eds Alston, Mengersen, Pettitt. Wiley, to appear.
• Stanaway, M., Reeves, R., Mengersen, K. (2010) Hierarchical Bayesian modelling of early detection surveillance for plant pest invasions. Environmental and Ecological Statistics.
• Strickland, C., Simpson, D., Denham, R., Turner, I., Mengersen, K. Fast methods for spatial dynamic factor models. Computational Statistics and Data Analysis.
• Strickland, C., Alston, C., Mengersen, K. (2011) PyMCMC. Journal of Statistical Software. Under review.
• White, N., Johnson, H., Silburn, P., Mengersen, K. Unsupervised sorting and comparison of extracellular spikes with Dirichlet Process Mixture Models. Annals of Applied Statistics. Under review.