TRANSCRIPT
MCQMC 2012
From inference to modelling to algorithms and back again
Kerrie Mengersen, QUT Brisbane
Collaborative Centre for Data Analysis,
Modelling and Computation QUT, GPO Box 2434
Brisbane 4001, Australia
Acknowledgements: BRAG
Bayesian methods and models
+ Fast computation
+ Applications in environment, health, biology, industry
So what’s the problem?
Matchmaking 101
Study 1: Presence/absence models (Sama Low Choy, Mark Stanaway)
Plant biosecurity
• Observations and data
– Visual inspection symptoms
– Presence / absence data
– Space and time
• Dynamic invasion process
– Growth, spread
• Inference
– Map probability of extent over time
– Useful scale for managing trade / eradication
– Currently use informal qualitative approach
• Hierarchical Bayesian model to formalise the information
From inference to model
Hierarchical Bayesian model for plant pest spread
• Data Model: Pr(data | incursion process and data parameters)
– How data is observed given underlying pest extent
• Process Model: Pr(incursion process | process parameters)
– Potential extent given epidemiology / ecology
• Parameter Model: Pr(data and process parameters)
– Prior distribution to describe uncertainty in detectability, exposure, growth …
• The posterior distribution of the incursion process (and parameters) is related to the prior distribution and data by:
Pr(process, parameters | data) ∝ Pr(data | process, parameters) Pr(process | parameters) Pr(parameters)
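To make the factorisation concrete, here is a minimal sketch of how the three levels combine into an unnormalised log posterior for a presence/absence incursion model. Variable names, priors and the toy data are illustrative assumptions, not the Stanaway et al. model.

```python
import numpy as np
from scipy.stats import beta, bernoulli

# Hypothetical minimal sketch of the data/process/parameter factorisation for a
# presence/absence incursion model; names and priors are illustrative only.

def log_prior(detect_p, colonise_p):
    # Parameter model: Beta priors expressing uncertainty in detectability and exposure
    return beta.logpdf(detect_p, 2, 2) + beta.logpdf(colonise_p, 1, 9)

def log_process(extent, colonise_p):
    # Process model: each site independently colonised with probability colonise_p
    return bernoulli.logpmf(extent, colonise_p).sum()

def log_data(obs, extent, detect_p):
    # Data model: a positive inspection is only possible where the pest is present
    p_pos = detect_p * extent              # detection probability is 0 at unoccupied sites
    return bernoulli.logpmf(obs, p_pos).sum()

def log_posterior(obs, extent, detect_p, colonise_p):
    # Unnormalised posterior = data model * process model * parameter model
    return (log_data(obs, extent, detect_p)
            + log_process(extent, colonise_p)
            + log_prior(detect_p, colonise_p))

# Toy example: 10 sites, latent extent, imperfect detection
rng = np.random.default_rng(1)
extent = rng.binomial(1, 0.3, size=10)     # latent presence/absence
obs = rng.binomial(1, 0.7 * extent)        # observed detections
print(log_posterior(obs, extent, detect_p=0.7, colonise_p=0.3))
```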
Early warning surveillance
• Priors based on emergency plant pest characteristics:
– exposure rate for colonisation probability
– spread rates to link sites together for spatial analysis
• Add surveillance data
• Posterior evaluation:
– modest reduction in area freedom
– large reduction in estimated extent
– residual “risk” maps to target surveillance
Observation Parameter Estimates
• Taking into account the invasion process
• Hosts
– Host suitability
• Inspector efficiency
– Identify contributions
Study 2: Mixture models (Clair Alston)
CAT scanning sheep
• Finite mixture model: yi ~ Σj λj N(μj, σj²)
• Include spatial information
From inference to model
What proportions of the sheep carcase are muscle, fat and bone?
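As a rough illustration of the inference, the following sketch fits a three-component Gaussian mixture to synthetic voxel intensities with EM via scikit-learn. The intensity values, tissue labels and the use of sklearn are assumptions for the example; the actual analysis also uses spatial information.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Minimal sketch: classify CT voxel intensities into fat / muscle / bone with a
# 3-component Gaussian mixture fitted by EM. The intensities below are synthetic.
rng = np.random.default_rng(0)
intensity = np.concatenate([
    rng.normal(-80, 15, 5000),   # fat-like voxels
    rng.normal(50, 20, 8000),    # muscle-like voxels
    rng.normal(700, 150, 2000),  # bone-like voxels
]).reshape(-1, 1)

gm = GaussianMixture(n_components=3, random_state=0).fit(intensity)
labels = gm.predict(intensity)

# Estimated tissue proportions = mixture weights (or empirical label frequencies)
for k in np.argsort(gm.means_.ravel()):
    print(f"mean={gm.means_[k, 0]:7.1f}  weight={gm.weights_[k]:.3f}")
```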
Inside a sheep
Study 3: State space models (Nicole White)
Parkinson’s Disease
PD symptom data
• Current methods for PD subtype classification rely on a few criteria and do not permit uncertainty in subgroup membership.
• Alternative: finite mixture model (equivalent to a latent class analysis for multivariate categorical outcomes)
• Symptom data: Duration of diagnosis, early onset PD, gender, handedness, side of onset
1. Define a finite mixture model based on patient responses to Bernoulli and Multinomial questions.
2. Describe subgroups w.r.t. explanatory variables
3. Obtain patient’s probability of class membership
yij: ith subject's response to item j
From inference to model
PD: Symptom data
PD Signal data:“How will they respond?”
Inferential aims
Identify spikes and assign them to an unknown number of source neurons
Compare clusters between segments within a recording, and between recordings at different locations of the brain
Microelectrode recordings at 3 depths
Each recording was divided into 2.5 sec. segments
Discriminating features found via PCA
DP Model
yi | θi ~ p(yi | θi)
θi ~ G
G ~ DP(α, G0)
For the P retained PCs, yi = (yi1, .., yiP) ~ MVN(μi, Σi)
G0 = p(μ, Σ)
α ~ Ga(2, 2)
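A hedged sketch of the same idea: cluster spike waveforms via PCA features and a truncated, variational Dirichlet-process Gaussian mixture. Here scikit-learn's BayesianGaussianMixture stands in for the Gibbs-sampled DP model on the slide, and the waveforms are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture

# Synthetic spike waveforms from two hypothetical "neurons"
rng = np.random.default_rng(2)
waveforms = np.vstack([
    rng.normal(loc=m, scale=0.3, size=(200, 32))
    for m in (np.sin(np.linspace(0, 3, 32)), np.cos(np.linspace(0, 3, 32)))
])

scores = PCA(n_components=3).fit_transform(waveforms)   # discriminating features

# Truncated DP mixture fitted by variational inference (stand-in for the DP Gibbs sampler)
dpgmm = BayesianGaussianMixture(
    n_components=10,                                    # truncation level
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,                     # plays the role of alpha
    covariance_type="full",
    random_state=0,
).fit(scores)

labels = dpgmm.predict(scores)
print("occupied clusters:", np.unique(labels))
```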
From inference to model
Average waveforms
Study 4: Spatial dynamic factor models (Chris Strickland, Ian Turner)
What can we learn about landuse from MODIS data?
Differentiate land use with the SDFM
• 1st factor influences the temporal dynamics in the right half of the image (woodlands)
• 3rd factor influences the left half of the image (grasslands)
[Figure panels: 1st trend component, 2nd trend component, common cyclical component]
Matchmaking 101
Smart models
Example 1: Generalisation
Mixtures are great, but how do we choose k?
Judith Rousseau
True density: f0(x) = Σ_{j=1..k0} pj0 g(x; γj0)
Propose an overfitted model (k > k0).
Non-identifiable! For example, with one extra component, all values
θ = (p10, .., pk00, 0, γ10, .., γk00, γ)
and all values
θ = (p10, .., pj, .., pk00, pk0+1, γ10, .., γk00, γj0)
with pj + pk0+1 = pj0 fit equally well.
So what?
• Multiplicity of possible solutions => the MLE does not have stable asymptotic behaviour.
• Not important when f is the main object of interest, but important if we want to recover the component parameters.
• It thus becomes crucial to know whether the posterior distribution under overfitted mixtures gives interpretable results.
Possible alternatives to avoid overfitting
• Fruhwirth-Schnatter (2006): in an overfitted mixture, either one of the component weights is zero or two of the component parameters are equal.
• Choose priors that bound the posterior away from the unidentifiability sets.
• Choose priors that induce shrinkage for elements of the component parameters.
Problem: may not be able to fit the true model.
Our result
Assumptions:
– L1 consistency of the posterior
– The component model g is three times differentiable, regular, and integrable
– The prior on γ is continuous and positive, and the prior on (p1, .., pk) satisfies π(p) ∝ p1^(α1−1) … pk^(αk−1)
Our result - 1
• If max(αj) < d/2, where d = dim(γ), then asymptotically the posterior π(θ|x) concentrates on the subset of parameters for which fθ = f0, so the extra k − k0 components have weight 0.
• The reason for this stable behaviour, as opposed to the unstable behaviour of the maximum likelihood estimator, is that integrating out the parameter acts as a penalisation: the posterior essentially puts mass on the sparsest way to approximate the true density.
Our result - 2
• In contrast, if min(αj, j ≤ k) > d/2 and k > k0, then 2 or more components will tend to merge, each with non-negligible weight. This leads to less stable behaviour.
• In the intermediate case, if min(αj, j ≤ k) ≤ d/2 ≤ max(αj, j ≤ k), the situation varies depending on the αj's and on the difference between k and k0.
Implications: Model dimension
• When d/2 > max{αj, j = 1, .., k}, the quantity d·k0 + k0 − 1 + Σ_{j ≥ k0+1} αj appears as an effective dimension of the model.
• This is different from the number of parameters, d·k + k − 1, or from other “effective numbers of parameters”.
• Similar results are obtained for other situations
Example 1: yi ~ N(0,1); fit p N(μ1, 1) + (1−p) N(μ2, 1)
αi = 1 > d/2 (d = 1)
Example 2: yi ~ N(0,1)
Fit G = p N2(μ1, Σ1) + (1−p) N2(μ2, Σ2), Σj diagonal; d = 3; α1 = α2 = 1 < d/2
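The following minimal Gibbs sampler illustrates the Example 1 setting (it is not the authors' code): data from N(0,1) are fitted with an overfitted two-component location mixture under a symmetric Dirichlet(α) prior on the weights, with α < d/2 = 0.5, so the extra component's posterior weight should shrink towards 0. Priors, sample size and chain length are illustrative.

```python
import numpy as np

# Overfitted mixture demo: true density N(0,1), fitted p*N(mu1,1) + (1-p)*N(mu2,1)
# with a Dirichlet(alpha, alpha) prior, alpha = 0.2 < d/2 = 0.5.
rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, size=500)
n, alpha, prior_var = len(y), 0.2, 100.0

mu = np.array([-1.0, 1.0])
p = np.array([0.5, 0.5])
weight_trace = []

for it in range(5000):
    # 1) allocations z_i | mu, p
    logw = np.log(p) - 0.5 * (y[:, None] - mu[None, :]) ** 2
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    z = (rng.random(n) > w[:, 0]).astype(int)

    # 2) weights | z  ~ Dirichlet(alpha + n1, alpha + n2)
    counts = np.array([(z == 0).sum(), (z == 1).sum()])
    p = rng.dirichlet(alpha + counts)

    # 3) means | z, y  (conjugate normal update, sigma = 1 known, mu_j ~ N(0, prior_var))
    for j in range(2):
        nj = counts[j]
        ybar = (y[z == j].mean() if nj > 0 else 0.0)
        post_var = 1.0 / (nj + 1.0 / prior_var)
        mu[j] = rng.normal(post_var * nj * ybar, np.sqrt(post_var))

    if it > 1000:
        weight_trace.append(p.min())

print("posterior mean of the smaller weight:", np.mean(weight_trace))
```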
Conclusions
• The result validates the use of Bayesian estimation in mixture models with too many components.
• It is one of the few examples where the prior can actually have an impact asymptotically, even to first order (consistency), and where choosing a less informative prior leads to better results.
• It also shows that the penalisation effect of integrating out the parameter, as occurs in the Bayesian framework, is useful not only in model-choice or testing contexts but also in estimation contexts.
Example 2: Empirical likelihoods
• Sometimes the likelihood associated with the data is not completely known or cannot be computed in a manageable time (eg population genetic models, hidden Markov models, dynamic models), so traditional tools based on stochastic simulation (eg, regular MCMC) are unavailable or unreliable.
• Eg, biosecurity spread model.
Christian Robert
Model alternative: ELvIS
• Define parameters of interest as functionals of the cdf F (e.g. moments of F), then use Importance Sampling via the Empirical Likelihood.
• Select the F that maximises the likelihood of the data under the moment constraint.
• Given a constraint of the form E(h(Y)) = θ, the EL is defined as
Lel(θ|y) = max_F Π_{i=1:n} {F(yi) − F(yi−1)}
• For example, in the 1-D case when θ = E(Y), the empirical likelihood in θ is the maximum of Π_i pi over p1, .., pn under the constraint Σ_{i=1:n} pi yi = θ
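A minimal sketch of computing the empirical likelihood for the 1-D mean constraint via its dual (Lagrange-multiplier) form; the function name and data are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import brentq

# Empirical likelihood for the constraint E(Y) = theta.
# Dual form: p_i = 1 / (n * (1 + lam*(y_i - theta))), where the multiplier lam
# solves sum_i (y_i - theta) / (1 + lam*(y_i - theta)) = 0.
def log_empirical_likelihood(y, theta):
    z = y - theta
    if z.min() >= 0 or z.max() <= 0:
        return -np.inf                      # theta outside the convex hull of the data
    n = len(y)
    # lam must keep all 1 + lam*z_i > 0: bracket it just inside that region
    lo = -1.0 / z.max() + 1e-8
    hi = -1.0 / z.min() - 1e-8
    lam = brentq(lambda l: np.sum(z / (1.0 + l * z)), lo, hi)
    p = 1.0 / (n * (1.0 + lam * z))
    return np.sum(np.log(p))                # log L_el(theta | y), maximised at ybar

rng = np.random.default_rng(4)
y = rng.normal(2.0, 1.0, size=200)
for theta in (1.8, 2.0, float(y.mean())):
    print(theta, log_empirical_likelihood(y, theta))
```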
Quantile distributions
• A quantile distribution is defined by a closed-form quantile function F-1(p;) and generally has no closed form for the density function.
• Properties: very flexible, very fast to simulate (simple inversion of the uniform distribution).
• Examples: 3/4/5-parameter Tukey’s lambda distribution and generalisations; Burr family; g-and-k and g-and-h distributions.
g-and-k quantile distribution
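For concreteness, a sketch of the commonly used g-and-k quantile function (Rayner and MacGillivray's parameterisation with the conventional constant c = 0.8 is assumed here), together with simulation by inversion of uniforms.

```python
import numpy as np
from scipy.stats import norm

# g-and-k quantile function Q(u; A, B, g, k) with the conventional c = 0.8.
# Simulation is by inversion: draw u ~ U(0,1) and evaluate Q(u).
def gk_quantile(u, A, B, g, k, c=0.8):
    z = norm.ppf(u)                               # standard normal quantile
    skew = 1.0 + c * np.tanh(g * z / 2.0)         # skewness term driven by g
    tails = (1.0 + z ** 2) ** k                   # kurtosis/tail term driven by k
    return A + B * skew * tails * z

rng = np.random.default_rng(5)
u = rng.uniform(size=100_000)

x_normal = gk_quantile(u, A=0.0, B=1.0, g=0.0, k=0.0)   # reduces to N(0,1)
x_alling = gk_quantile(u, A=3.0, B=2.0, g=1.0, k=0.5)   # Allingham's choice

print(x_normal.mean(), x_normal.std())                  # approx 0, 1
print(np.percentile(x_alling, [5, 50, 95]))
```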
Methods for estimating a quantile distribution
• MLE using numerical approximation to the likelihood
• Moment matching
• Generalised bootstrap
• Location and scale-free functionals
• Percentile matching
• Quantile matching
• ABC
• Sequential MC approaches for multivariate extensions of the g-and-k
ELvIS in practice
• Two values of θ = (A, B, g, k):
– θ = (0, 1, 0, 0): standard normal distribution
– θ = (3, 2, 1, 0.5): Allingham's choice
• Two priors for θ:
– U(0, 5)^4
– A ~ U(−5, 5), B ~ U(0, 5), g ~ U(−5, 5), k ~ U(−1, 1)
• Two sample sizes: n = 100, n = 1000
ELvIS in practice: θ = (3, 2, 1, 0.5), n = 100
Matchmaking 101
A wealth of algorithms!
From model to algorithm
Models:
• Logistic regression
• Non-Gaussian state space models
• Spatial dynamic factor models
Evaluate:
• Computation time
• Maximum bias
• sd
• Inefficiency factor (IF)
• Acceptance rate
Chris Strickland
Logistic Regression
k = 2, 4, 8, 20 covariates; n=1000 observations
• Importance sampling (IS):
– E[h(θ)] = ∫ h(θ) [p(θ|y) / q(θ|y)] q(θ|y) dθ
with q(θ|y) ∝ exp(−0.5 (θ − θ*)ᵀ V⁻¹ (θ − θ*))
– θ* is the MLE (mode found by IRWLS); the candidate variance V is taken from the curvature of the log-likelihood at the mode, V = [−∂² log p(y|X,θ) / ∂θ ∂θᵀ]⁻¹ evaluated at θ = θ*
• Random walk Metropolis-Hastings (RWMH):
– same proposal distribution
• Adaptive RWMH (Garthwaite, Yan, Sisson)
– Only needs starting values (θ*) – easiest candidate!
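A sketch of the importance sampler just described, for simulated logistic-regression data with a flat prior. A general-purpose SciPy optimiser is used to find the mode rather than IRWLS, and all data and settings are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

# Importance sampling for a logistic regression posterior with a Gaussian candidate
# centred at the mode, covariance = inverse negative Hessian (Laplace-style).
rng = np.random.default_rng(6)
n, k = 1000, 4
X = rng.normal(size=(n, k))
beta_true = np.array([0.5, -1.0, 0.0, 1.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def neg_log_post(beta):
    eta = X @ beta
    return -(y @ eta - np.logaddexp(0.0, eta).sum())    # flat prior: just -loglik

opt = minimize(neg_log_post, np.zeros(k))               # theta* via numerical optimisation
mode = opt.x
p_hat = 1.0 / (1.0 + np.exp(-X @ mode))
H = X.T @ (X * (p_hat * (1.0 - p_hat))[:, None])        # negative Hessian of the loglik
V = np.linalg.inv(H)                                    # candidate covariance

q = multivariate_normal(mean=mode, cov=V)
draws = q.rvs(size=20_000, random_state=1)
log_w = np.array([-neg_log_post(b) for b in draws]) - q.logpdf(draws)
w = np.exp(log_w - log_w.max())
w /= w.sum()

post_mean = w @ draws                                   # self-normalised IS estimate
print("posterior mean estimate:", np.round(post_mean, 3))
print("effective sample size:", 1.0 / np.sum(w ** 2))
```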
Results
Algorithm   k    time   bias   sd     IF    acc. rate
IS          2    5.1    0.07   0.06   -     -
IS          8    5.6    0.18   0.09   -     -
IS          20   6.3    0.30   0.11   -     -
RWMH        2    8.1    0.08   0.07   12    0.56
RWMH        8    8.8    0.15   0.10   30    0.20
RWMH        20   9.2    0.20   0.12   120   0.03
ARWMH       2    10.5   0.08   0.07   10    0.24
ARWMH       8    10.8   0.20   0.07   35    0.23
ARWMH       20   12.1   0.30   0.12   50    0.21
So what?
• If k is small (e.g. 8), even a naïve candidate is OK.
• When k is larger (e.g. 20), we need something more intelligent, e.g. adaptive MH.
• As models become more complicated, we need to get more sophisticated.
Importance sampler vs MCMC vs Particle filter vs Laplace approximation (INLA)
Non-Gaussian state space model
Non-Gaussian state space model
• Importance sampler
√ General algorithms (Durbin & Koopman 2001 – global approximation) and tailored algorithms (Liesenfeld & Richard 2006 – local approximation)
√ Independence sampler – don't have to worry about correlated draws
√ Parallelisable – potentially much faster
× Difficult to come up with a good candidate distribution
× More difficult as the sample size and model complexity increase
× More complex than MCMC for obtaining non-standard expectations
Non-Gaussian state space model
• MCMC
√ Very flexible w.r.t. extensions to other models
√ The same algorithms can be used as the sample size and/or parameter dimension grows (with some provisos)
× Can be slow, and complicated to achieve good acceptance and mixing
× Single-move samplers perform poorly in terms of mixing (Pitt & Shephard; Kim, Shephard & Chib)
√ Very efficient (mixing) algorithms can be designed for specific problems, e.g. stochastic volatility (Kim, Shephard & Chib)
√ General approaches available – simple reparametrisation can lead to vastly improved simulation efficiency (Strickland et al.)
Non-Gaussian state space model
• Particle filters
√ Easy to implement – at least intuitively
√ Updating estimates as the sample size grows is possibly a lot simpler and cheaper than a full MCMC or IS sampler update
× Perhaps not as easy as it appears once parameter learning is needed
× The particle approximation degenerates – may need MCMC steps, or alternatives
× To do full MCMC updates, need to store the entire history of the particle approximation
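For concreteness, a minimal bootstrap particle filter for a hypothetical non-Gaussian state space model (Poisson observations driven by a Gaussian AR(1) state); parameters are assumed known and the model is illustrative rather than the one used in the study.

```python
import numpy as np
from scipy.stats import poisson

# Bootstrap particle filter for x_t = phi*x_{t-1} + sigma*eta_t, y_t ~ Poisson(exp(x_t)).
rng = np.random.default_rng(7)
phi, sigma, T, N = 0.9, 0.5, 200, 1000

# simulate data from the model
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + sigma * rng.normal()
y = rng.poisson(np.exp(x))

# bootstrap filter
particles = rng.normal(0.0, sigma / np.sqrt(1 - phi ** 2), size=N)  # stationary initialisation
loglik = 0.0
filtered_mean = np.zeros(T)
for t in range(T):
    # propagate through the state equation (the "bootstrap" proposal)
    particles = phi * particles + sigma * rng.normal(size=N)
    # weight by the observation density
    logw = poisson.logpmf(y[t], np.exp(particles))
    maxw = logw.max()
    w = np.exp(logw - maxw)
    loglik += maxw + np.log(w.mean())        # running log-likelihood estimate
    w /= w.sum()
    filtered_mean[t] = w @ particles
    # multinomial resampling to combat weight degeneracy
    particles = particles[rng.choice(N, size=N, p=w)]

print("estimated log-likelihood:", loglik)
```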
Non-Gaussian state space model
• Integrated Nested Laplace Approximation (INLA)
√ Extremely fast
√ If model complexity stays the same, then it can work for very large problems
√ R interface to the code
× Can only handle the small number of hyperparameters that can be forced into the GMRF approximation, so it is restrictive in the problems it can address
So what?
• Many algorithms: general versus specific, flexible versus tailored
• Pros and cons should be weighed against inferential aims, model and computational resources
• Blocking and reparametrisation are two good tricks, but we need to be clever about non-centred reparametrisation
Spatial Dynamic Factor Models
Yt = B ft + εt, εt ~ N(0, Σ), Σ diagonal
factor loadings bj ~ N(0, V(s))
• Spatial correlation:
– Lopes et al. use a GRF: O((p×k*)^3)
– Strickland et al. use a GMRF (images: large discrete spatial domain) + Krylov subspace methods to sample from the GMRF posterior; scales linearly, O(p×k*)
– Rue uses a Cholesky decomposition: more complex, O((p×k*)^(3/2))
So what?
Difference between (i) O((p×k*)^3) and (ii) O(p×k*):
If the data set becomes 100 times larger, (i) will take 1 million times longer, compared to 100 times longer for (ii). Instead of waiting 1 hour, you might have to wait more than 1 year…
Case study
• Spatial domain: 900 pixels
• Temporal: approx. 200 periods (every 16 days)
• Total 180,000 observations, ~3600 parameters
• 10 mins for 15,000 MCMC iterations (on a laptop!)
Conclusions
• The model is too complicated and the datasets too large for IS or PF.
• There are too many parameters for INLA.
• MCMC is thus desirable, but it is extremely important to choose good algorithms!
As the model becomes more complex, the choice of smart algorithms becomes more important and can make a large difference in computation and estimation.
Matchmaking 102
Smart algorithms
Hybrid algorithms
Tierney (1994)
Design an efficient algorithm that combines features of other algorithms, in order to overcome identified weaknesses in the component algorithms.
Comparing algorithms
• Accuracy
– bias (H)
• Efficiency
– rate of convergence, rate of acceptance (A)
– mixing speed (integrated autocorrelation time, τ(H))
• Applicability
– simplicity of set-up, flexibility of tailoring
• Implementation
– coding difficulty, memory storage
– computational demand:
• total number of iterations (T), burn-in (T0)
• correlation along the chain, measured by the effective sample size ESS_H = (T − T0) / τ(H)
Hybrid algorithms - 1
• Metropolis-Hastings algorithms (MHAs)
– Improve mixing speed via:
• parallel chains (MHA)
• repulsive proposal (MHA-RP, M&R), pinball sampler
• delayed rejection (DRA, Tierney & Mira; DRA-LP; DRA-Pinball)
– Improve applicability by:
• reversible jump (Green)
• Metropolis-adjusted Langevin (MALA; Tierney & Roberts, Besag & Green)
Kate Lee, Christian Robert, Ross McVinish
Simulation study
• Model
– mixture of 2-D normals, well separated:
π = 0.5 N([0,0]ᵀ, I2) + 0.5 N([5,5]ᵀ, I2)
– 100 replicated simulations
– Each result is obtained after running the algorithm for 1200 seconds using 10 particles
– Proposal variance = 4; value in brackets is MSE (similar results for variance = 2)
• Platform
– Matlab version 7.0.4
– run on an SGI Altix XE cluster containing 112 x 64-bit Intel Xeon cores
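A minimal MALA sketch targeting the two-component mixture above; the step size and chain length are illustrative, and, as the results below note, too small a step can trap the chain in the nearest mode.

```python
import numpy as np

# MALA targeting 0.5*N([0,0], I) + 0.5*N([5,5], I).
mu1, mu2 = np.array([0.0, 0.0]), np.array([5.0, 5.0])

def log_target(x):
    a = -0.5 * (x - mu1) @ (x - mu1)
    b = -0.5 * (x - mu2) @ (x - mu2)
    return np.logaddexp(a, b) + np.log(0.5)      # log of the mixture density (up to a constant)

def grad_log_target(x):
    d1, d2 = x - mu1, x - mu2
    w1 = np.exp(-0.5 * d1 @ d1)
    w2 = np.exp(-0.5 * d2 @ d2)
    return -(w1 * d1 + w2 * d2) / (w1 + w2)      # gradient of the log mixture

def mala(eps, n_iter=50_000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array([2.5, 2.5])
    chain = np.empty((n_iter, 2))
    for i in range(n_iter):
        mean_fwd = x + 0.5 * eps ** 2 * grad_log_target(x)
        prop = mean_fwd + eps * rng.normal(size=2)
        mean_bwd = prop + 0.5 * eps ** 2 * grad_log_target(prop)
        # MH ratio includes the asymmetric Langevin proposal densities
        log_acc = (log_target(prop) - log_target(x)
                   - ((x - mean_bwd) @ (x - mean_bwd)) / (2 * eps ** 2)
                   + ((prop - mean_fwd) @ (prop - mean_fwd)) / (2 * eps ** 2))
        if np.log(rng.random()) < log_acc:
            x = prop
        chain[i] = x
    return chain

chain = mala(eps=1.5)
print("mean of sampled points:", chain.mean(axis=0))   # near [2.5, 2.5] if both modes are visited
```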
Results
T = no. of simulations; A = acceptance rate; H = accuracy; σ²_H = var(H); τ(H) = autocorrelation time; ESS_H = effective sample size
Results
• MHA
– shortest CPU time per iteration and largest sample size
– need to tune the proposal variance to optimise performance
• MALA
– can get trapped in the nearest mode if the scaling parameter in the variance of the proposal is not sufficiently large
• MHA with RP
– induces a fast-mixing chain, but need to choose a tuning parameter
– expensive to compute
– in rare cases the algorithm can be unstable and get stuck
• DRA
– less correlated chains, higher acceptance rate
– higher computational demand
– Langevin proposal: improves mixing, but the loss in computational efficiency overwhelms the gain in statistical efficiency
– a normal random walk proposal is faster to compute and still improves mixing
Hybrid algorithms - 2
• Population Monte Carlo (PMC)
– Extension of IS, allowing the importance function to adapt to the target in an iterative manner (Cappe et al.)
– PMC with repulsive proposal: create 'holes' around existing particles
• Particle systems and MCMC
– IS + MCMC (M&R)
– parallel MCMC + SMC (Del Moral)
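A minimal sketch of a basic PMC iteration in the spirit of Cappe et al.: propose from Gaussian kernels centred at the current particles, reweight against the target, and resample. The repulsive variant is not shown; the target reuses the two-component mixture from the simulation study, and the kernel scale and particle number are illustrative.

```python
import numpy as np

# Basic Population Monte Carlo for the 2-D mixture 0.5*N([0,0], I) + 0.5*N([5,5], I).
mu1, mu2 = np.array([0.0, 0.0]), np.array([5.0, 5.0])

def log_target(x):                                  # x has shape (n_particles, 2)
    a = -0.5 * np.sum((x - mu1) ** 2, axis=1)
    b = -0.5 * np.sum((x - mu2) ** 2, axis=1)
    return np.logaddexp(a, b) + np.log(0.5)

rng = np.random.default_rng(8)
n, scale, n_iter = 500, 2.0, 30
particles = rng.normal(2.5, 3.0, size=(n, 2))       # dispersed initial population

for _ in range(n_iter):
    proposals = particles + scale * rng.normal(size=(n, 2))
    # importance weight: target / (Gaussian kernel centred at the previous particle)
    log_q = -0.5 * np.sum((proposals - particles) ** 2, axis=1) / scale ** 2
    log_w = log_target(proposals) - log_q
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # resample to form the next population
    particles = proposals[rng.choice(n, size=n, p=w)]

print("population mean:", particles.mean(axis=0))   # roughly midway between the modes
```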
Simulation study repeated
• 100 replicates, run for 500 seconds using 50 particles, with the first 100 iterations ignored
√ accuracy of estimation
√ fast exploration (significantly reduced integrated autocorrelation time)
√ no instability (unlike the MCMC algorithms)
√ less sensitivity to the importance function
√ repulsive effect improved mixing
Summary
Algorithm     EPM   CR   RC   CE   SP   FH   CP   Mode
MALA           1     1    1   -2   -1    0   -1    S
MHA-RP         0     1    0   -2   -1   -1   -2    B
DRA            1     1    0   -1   -1    0    0    B
DRA-LP         2     1    0   -2   -1    0    0    B
DRA-Pinball    2     1    0   -2   -1    0    0    B
PS             2     1    0   -3   -2   -1   -2    B
PMC-R          -     2    0   -1   -1   -1    0    B
Relative performance compared to MHA and PMC.
EPM = efficiency of proposal move; CR = correlation reduction of chain; RC = rate of convergence; CE = cost effectiveness; SP = simplicity of programming; FH = flexibility of hyperparameters; CP = consistency of performance; Mode = preference between a single mode and multimodal problem.
(Column groups on the original slide: Statistical Efficiency, Computation, Applicability.)
Hybrid algorithms - 3
ABCel: Unlike ABC, ABCel does not require:
Conclusions
1. Combining features of individual algorithms may lead to complicated characteristics in a hybrid algorithm.
2. Each individual algorithm may have a strong individual advantage with respect to a particular performance criterion, but this does not guarantee that the hybrid method will enjoy a joint benefit of these strategies.
3. The combination of algorithms may add complexity in set-up, programming and computational expense.
Implementing smart algorithms: PyMCMC
• Python package for fast MCMC
• Takes advantage of the Python libraries NumPy and SciPy
• Classes for Gibbs, M-H, orientational bias MC, slice samplers, etc.
• Models: linear (with stochastic search), logit, probit, log-linear, linear mixed model, probit mixed model, nonlinear mixed models, mixture, spatial mixture, spatial mixture with regressors, time series suite (including DFM and SDFM)
• Straightforward to optimise, extensible to C or Fortran, parallelisable (GPU)
Chris Strickland
PyMCMC
Matchmaking algorithm!
[Diagram: Model ↔ Algorithm, linked through Inference; cool problems, past and future]
Key References
• Lee, K., Mengersen, K., Robert, C.P. (2012) Hybrid models. In Case Studies in Bayesian Modelling. Eds Alston, Mengersen, Pettitt. Wiley, to appear.
• Stanaway, M., Reeves, R., Mengersen, K. (2010) Hierarchical Bayesian modelling of early detection surveillance for plant pest invasions. Environmental and Ecological Statistics.
• Strickland, C., Simpson, D., Denham, R., Turner, I., Mengersen, K. Fast methods for spatial dynamic factor models. Computational Statistics and Data Analysis.
• Strickland, C., Alston, C., Mengersen, K. (2011) PyMCMC. Journal of Statistical Software. Under review.
• White, N., Johnson, H., Silburn, P., Mengersen, K. Unsupervised sorting and comparison of extracellular spikes with Dirichlet Process Mixture Models. Annals of Applied Statistics. Under review.