

Lancaster University

Computational Statistics for Big Data

Author: Jack Baker¹

Supervisors: Paul Fearnhead¹ and Emily Fox²

¹Lancaster University  ²The University of Washington

September 1, 2015

Abstract

The amount of data stored by organisations and individuals is growing at an astonishing rate. As statistical models grow in complexity and size, traditional machine learning algorithms are struggling to scale to the large datasets required for model fitting. Markov chain Monte Carlo (MCMC) is one algorithm that has been left behind, even though it has proven to be an invaluable tool for training complex statistical models. This report discusses a number of possible solutions that enable MCMC to scale more effectively to large datasets. We focus on two particular solutions to this problem: batch methods and stochastic gradient Monte Carlo methods.

Batch methods split the full dataset into disjoint subsets and run traditional MCMC on each subset. The difficulty of these methods is in recombining the MCMC output run on each subset; the idea is that the recombined sample will be a close approximation to the posterior using the full dataset. Stochastic gradient Monte Carlo approximately samples from the full posterior while using only a subsample of data at each iteration. It does this by combining two key ideas: stochastic optimization, an algorithm which finds the mode of the posterior using only a subset of the data at each iteration; and Hamiltonian Monte Carlo, a method which provides efficient Metropolis-Hastings proposals with high acceptance rates. After discussing the methods and important extensions, we perform a simulation study which compares the methods and examines how they are affected by various model properties.


Computational statistics for big data Jack Baker

Contents

1 Introduction
1.1 An overview of methods
1.2 Report outline

2 Batch methods
2.1 Introduction
2.2 Splitting the data
2.3 Efficiently sampling from products of Gaussian mixtures
2.4 Parametric recombination methods
2.5 Nonparametric methods
2.6 Semiparametric methods
2.7 Conclusion

3 Stochastic gradient methods
3.1 Introduction
3.2 Stochastic optimization
3.3 Hamiltonian Monte Carlo
3.4 Stochastic gradient Langevin Monte Carlo
3.5 Stochastic gradient Hamiltonian Monte Carlo
3.6 Conclusion

4 Simulation study
4.1 Introduction
4.2 Batch methods
4.3 Stochastic gradient methods
4.4 Conclusion

5 Future Work
5.1 Introduction
5.2 Further comparison of batch methods
5.3 Tuning guidance for stochastic gradient methods
5.4 Using batch methods to analyse complex hierarchical models


1 Introduction

As the amount of data stored by individuals and organisations grows, statistical models have advanced in complexity and size. Historically, much statistical methodology has focussed on fitting models with limited data. Now we face the opposite problem: we have so much data that traditional statistical methods struggle to cope and run exceptionally slowly. These problems have led to a rapidly evolving area of statistics and machine learning which develops algorithms that remain scalable as the size of data increases. The ‘size’ of data is generally used to mean one of two things: the dimensionality of the data or the number of observations. In this report we focus on methods which have been designed to be scalable as the number of observations increases. Data with a large number of observations is often referred to as tall data.

Currently, large scale machine learning models are trained mainly using optimization methods such as stochastic optimization. These algorithms are used chiefly for their speed: they are fast to train models even when a huge number of observations is available. This speed is due to the fact that at each iteration the algorithms only use a subset of all the available data. The downside is that these methods only find local maxima of the posterior distribution, meaning they only produce a point estimate, which can lead to overfitting.

A key appeal of Bayesian methods is that they produce a whole distribution of possible parameter values, which allows uncertainty to be quantified, reducing the risk of overfitting. While parameter uncertainty can be approximated using stochastic optimization, for complex models this approximation can be very poor. Generally the Bayesian posterior distribution is simulated from using statistical algorithms known as Markov chain Monte Carlo (MCMC). The problem is that these algorithms require calculations over the whole dataset at each iteration, meaning they are slow for large datasets. Therefore a next generation of MCMC algorithms which scale to large datasets needs to be developed.

1.1 An overview of methods

We begin this section with a more formal statement of the problem. Suppose we wish to train a model with probability density p(x|θ), where θ is an unknown parameter vector and x = (x_1, . . . , x_N) is the model data. Let the likelihood of the model be p(x|θ) = ∏_{i=1}^N p(x_i|θ) and the prior for the parameter be p(θ). Our interest is in the posterior p(θ|x) ∝ p(x|θ)p(θ), which quantifies the most likely values of θ given the data x.

Commonly we simulate from the posterior using the Metropolis-Hastings (MH) algorithm, arguably the most popular MCMC algorithm. At each iteration, given a current state θ, the algorithm proposes some new state θ′ from a proposal q(·). This new state is then accepted as part of the sample with probability

α = min{ 1, [q(θ) p(θ′|x)] / [q(θ′) p(θ|x)] } = min{ 1, [q(θ) p(x|θ′) p(θ′)] / [q(θ′) p(x|θ) p(θ)] }.

Notice that at each iteration, the MH algorithm requires calculation of the likelihood at the new state θ′. This requires a computation over the whole dataset, which is infeasibly slow when N is large. This is the key bottleneck in Metropolis-Hastings, and other MCMC algorithms, when they are being used with large datasets.
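As a concrete illustration of this bottleneck, the following sketch implements a random-walk MH sampler for the mean of a toy one-dimensional Gaussian model. The model, function names and tuning constants are our own, not taken from the report; note that `log_posterior` touches all N observations at every iteration.

```python
import numpy as np

def log_posterior(theta, x, sigma=1.0, prior_sd=10.0):
    # Gaussian likelihood with known sigma; the sum runs over ALL N observations.
    log_prior = -0.5 * theta**2 / prior_sd**2
    log_lik = -0.5 * np.sum((x - theta)**2) / sigma**2
    return log_prior + log_lik

def random_walk_mh(x, n_iters=2000, step=0.3, seed=0):
    rng = np.random.default_rng(seed)
    theta, samples = 0.0, np.empty(n_iters)
    for i in range(n_iters):
        prop = theta + step * rng.normal()      # symmetric proposal, so q cancels in alpha
        log_alpha = log_posterior(prop, x) - log_posterior(theta, x)
        if np.log(rng.uniform()) < log_alpha:   # accept/reject: one full-data likelihood per iteration
            theta = prop
        samples[i] = theta
    return samples
```

With N data points, each of the `n_iters` iterations costs O(N), which is the cost the scalable methods below try to avoid.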

A number of solutions have been proposed for this problem, and they can generally be divided into three categories, which we refer to as batch methods, stochastic gradient methods and subsampling methods. Batch methods aim to make use of recent hardware developments which make parallelisation of computational work more accessible. They split the dataset x into disjoint batches x_{B_1}, . . . , x_{B_S}. The structure of the posterior allows separate MCMC algorithms to be run on these batches in parallel in order to simulate from each subposterior p(θ|x_{B_s}) ∝ p(θ)^{1/S} p(x_{B_s}|θ). These simulations must then be combined in order to generate a sample which approximates the full posterior p(θ|x). This is where the main challenge lies.

Stochastic gradient methods make use of sophisticated proposals that have been suggested for MCMC. These methods use gradients of the log posterior to suggest new states which have very high acceptance rates. When the free constants of these proposals are tuned in a certain way, the acceptance rates can be so high that we can remove the acceptance step entirely and still sample from a good approximation to the posterior. However, the gradient calculation still requires a computation over the whole dataset, so the gradients of the log posterior are instead estimated using only a subsample of the data, which introduces extra noise.

Subsampling methods keep the MCMC algorithm largely as is, but use only a subset of the data in the acceptance step at each iteration. Certain methods allow this to be done while still sampling from the true posterior distribution; however, this advantage often comes at the cost of poor mixing. Other methods achieve the result by introducing controlled biases, and these methods often mix better.

1.2 Report outline

This report provides a review of the batch methods and stochastic gradient methods outlined in Section 1.1. The reviewed methods are then implemented and compared under a variety of scenarios.

In Section 2 we discuss batch methods, including the parametric contributions of Scott et al. (2013) and Neiswanger et al. (2013), the nonparametric and semiparametric methods introduced by Neiswanger et al. (2013), as well as more recent developments. Section 3 reviews stochastic gradient methods, including the stochastic gradient Langevin dynamics (SGLD) algorithm of Welling and Teh (2011) and the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm of Chen et al. (2014). We also consider the stochastic optimization methods currently employed to train models which rely on large datasets, and provide an introduction to Hamiltonian Monte Carlo, which is used to produce proposals for the SGHMC algorithm. Finally we examine the literature providing further theoretical results for these algorithms, as well as proposed improvements.

In Section 4 the algorithms reviewed in the report are compared; code for the implementations is available on GitHub: https://goo.gl/9ZGHP2. A relatively simple model, a multivariate t-distribution, is used for comparison; therefore, in order to really test the methods, the number of observations is kept small. First the effect of bandwidth choice for the nonparametric and semiparametric methods is investigated. The effect on performance of the number of observations and the dimensionality of the target is then compared for all the methods. The batch size for the batch methods, and the subsample size for the stochastic gradient methods, are considered too.

2 Batch methods

2.1 Introduction

In order to speed up MCMC, it is natural to consider parallelisation. Advances in hardware allow many jobs to be run in parallel over separate cores, and these advances have been used to speed up many other computationally intensive algorithms. Parallelising MCMC has proven difficult, however, since MCMC is inherently sequential in nature, while parallelisation works best with minimal communication between machines. A natural way to parallelise MCMC is to split the data into different subsets; MCMC for each subset is then run separately on different machines. In this case the main problem is how to recombine the MCMC samples from each subset while ensuring the final sample is as close as possible to the true posterior. In this section, we discuss parametric and nonparametric methods suggested to do this.

2.2 Splitting the data

Suppose we have N i.i.d. data points x = (x_1, . . . , x_N). We wish to investigate a model with probability density p(x|θ), where θ is an unknown parameter vector. Let the likelihood be p(x|θ) = ∏_{i=1}^N p(x_i|θ) and the prior we assign to θ be p(θ). Then the full posterior p(θ|x) is given by

p(θ|x) ∝ p(θ) p(x|θ). (2.1)

Let B_1, . . . , B_S be a partition of {1, . . . , N}, and let x_{B_s} = {x_i : i ∈ B_s} be the corresponding set of data points. We refer to x_{B_s} as the sth batch of data. We can rewrite (2.1) as

p(θ|x) ∝ p(θ) ∏_{s=1}^S p(x_{B_s}|θ) = ∏_{s=1}^S p(θ)^{1/S} p(x_{B_s}|θ).

For brevity we will write the S batches of data as x_1, . . . , x_S from now on. Define the subposterior p(θ|x_s) by

p(θ|x_s) ∝ p(θ)^{1/S} p(x_s|θ).

Therefore we have that p(θ|x) ∝ ∏_{s=1}^S p(θ|x_s). The idea of batch methods for big data is to run MCMC separately to sample from each subposterior. These samples are then combined in some way so that the final sample follows the full posterior p(θ|x) as closely as possible.
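The identity p(θ|x) ∝ ∏_{s=1}^S p(θ|x_s) can be checked numerically. The sketch below uses a toy Gaussian model of our own choosing (not the report's code) and verifies that the subposterior log densities sum to the full-posterior log density, the normalising constants aside.

```python
import numpy as np

def log_subposterior(theta, batch, S, sigma=1.0, prior_sd=5.0):
    # Subposterior: tempered prior p(theta)^(1/S) times the batch likelihood (unnormalised).
    log_prior = -0.5 * theta**2 / prior_sd**2
    log_lik = -0.5 * np.sum((batch - theta)**2) / sigma**2
    return log_prior / S + log_lik

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, 90)
batches = np.split(x, 3)                     # S = 3 disjoint batches of equal size

theta = 0.7
log_full = log_subposterior(theta, x, S=1)   # S = 1 recovers the full (unnormalised) posterior
log_parts = sum(log_subposterior(theta, b, S=3) for b in batches)
# log_full and log_parts agree: the tempered priors multiply back to p(theta)
```

The equality holds pointwise in θ because the three copies of p(θ)^{1/3} multiply back to the prior, while the batch likelihoods multiply to the full likelihood.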

2.3 Efficiently sampling from products of Gaussian mixtures

Before we outline recombination methods in more detail, we discuss certain important properties of the multivariate Normal distribution which will prove useful later.


Suppose we have S multivariate Normal densities N(θ|µ_s, Σ_s) for s ∈ {1, . . . , S}. Then Wu (2004) shows that their product can be written, up to a constant of proportionality, as

∏_{s=1}^S N(θ|µ_s, Σ_s) ∝ N(θ|µ, Σ),

where

Σ = (∑_{s=1}^S Σ_s^{-1})^{-1},   µ = Σ (∑_{s=1}^S Σ_s^{-1} µ_s). (2.2)
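Equation (2.2) translates directly into code as a precision-weighted combination. A minimal sketch (function and variable names are our own):

```python
import numpy as np

def combine_gaussians(mus, Sigmas):
    # Precision-weighted combination of Gaussian densities, following eq. (2.2).
    precisions = [np.linalg.inv(S) for S in Sigmas]
    Sigma = np.linalg.inv(np.sum(precisions, axis=0))
    mu = Sigma @ np.sum([P @ m for P, m in zip(precisions, mus)], axis=0)
    return mu, Sigma
```

For instance, two unit-variance Gaussians with means 0 and 2 combine to a Gaussian with mean 1 and variance 1/2.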

Now suppose we have a set of S Gaussian mixtures {p_s(θ)}_{s=1}^S,

p_s(θ) = ∑_{m=1}^M ω_{m,s} N(θ|µ_{m,s}, Σ_s),

where the ω_{m,s} denote the mixture weights. For simplicity we assume that the number of components in each mixture is the same, and that each Gaussian component in mixture s shares a common variance Σ_s which is diagonal.

We wish to sample from the product of these Gaussian mixtures,

p(θ) ∝ ∏_{s=1}^S p_s(θ). (2.3)

It can be shown using induction that

∏_{s=1}^S ∑_{m=1}^M ω_{m,s} N(θ|µ_{m,s}, Σ_s) = ∑_{l_1} · · · ∑_{l_S} ∏_{s=1}^S ω_{l_s,s} N(θ|µ_{l_s,s}, Σ_s),

where we label each component of the sum using L = (l_1, . . . , l_S), with l_s ∈ {1, . . . , M}. It follows from this, and the results above about products of Gaussians, that (2.3) is equivalent to a Gaussian mixture with M^S components. Therefore sampling from this product can be performed exactly in two steps: first we sample one of the M^S components of the mixture according to its weight, then we draw a sample from the corresponding Gaussian component (Ihler et al., 2004).

The parameters of the Lth Gaussian component can be calculated using (2.2) and are given by

Σ_L = (∑_{s=1}^S Σ_s^{-1})^{-1},   µ_L = Σ_L (∑_{s=1}^S Σ_s^{-1} µ_{l_s,s}).

The unnormalised weight of the Lth mixture component is given by (Ihler et al., 2004)

ω_L ∝ [∏_{s=1}^S ω_{l_s,s} N(θ|µ_{l_s,s}, Σ_s)] / N(θ|µ_L, Σ_L).


In order to use this exact method we need to calculate the normalising constant for the weights, Z = ∑_L ω_L. As M and S grow, this exact sampling method becomes computationally infeasible, since the calculation of Z and the drawing of a sample from p(·) both take O(M^S) time. This fact, along with memory requirements, means that sampling from p(θ) using the exact method quickly becomes impossible.
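The exact scheme can be sketched as follows: enumerate all M^S label vectors, compute each component's parameters via (2.2), and compute the unnormalised weights by evaluating the weight ratio at θ = µ_L (the ratio is constant in θ). The code below is our own illustration, and its cost is visibly O(M^S).

```python
import itertools
import numpy as np

def gauss_logpdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))

def product_mixture_components(mus, Sigmas, omegas):
    # mus[s][m]: component means of mixture s; Sigmas[s]: variance shared within mixture s;
    # omegas[s][m]: mixture weights. Returns normalised weights and parameters of all M^S components.
    S, M = len(mus), len(mus[0])
    precisions = [np.linalg.inv(Sigmas[s]) for s in range(S)]
    Sigma_L = np.linalg.inv(np.sum(precisions, axis=0))   # same for every L (shared variances)
    log_ws, mu_Ls = [], []
    for L in itertools.product(range(M), repeat=S):
        mu_L = Sigma_L @ np.sum([precisions[s] @ mus[s][L[s]] for s in range(S)], axis=0)
        # the weight ratio does not depend on theta, so evaluate it at theta = mu_L
        log_w = sum(np.log(omegas[s][L[s]]) + gauss_logpdf(mu_L, mus[s][L[s]], Sigmas[s])
                    for s in range(S))
        log_w -= gauss_logpdf(mu_L, mu_L, Sigma_L)
        log_ws.append(log_w)
        mu_Ls.append(mu_L)
    log_ws = np.array(log_ws)
    w = np.exp(log_ws - log_ws.max())             # stabilised normalisation of the weights
    return w / w.sum(), mu_Ls, Sigma_L
```

Already at M = 10 components and S = 10 batches this loop would visit 10^10 labels, which is why the approximate label samplers discussed next are needed.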

In cases where exact sampling from the mixture is infeasible, a number of methods have been proposed; for a review the reader is referred to Ihler et al. (2004). A common approach is in the style of Gibbs sampling. At each iteration, S − 1 of the labels l_i are fixed, while one label, call it l_j, is sampled from the corresponding conditional density p(θ|l_{−j}), where the notation l_{−j} refers to {l_i : i ∈ {1, . . . , S}, i ≠ j}. After a fixed number of new label values have been drawn, a sample is drawn from the mixture component indicated by the current label values. While this approach often produces good results, it can require a large number of samples before it accurately represents the true mixture density, due to multimodality. A number of suggestions have been made to improve this standard Gibbs sampling approach, for example using multiscale sampling (Ihler et al., 2004) and parallel tempering (Rudoy and Wolfe, 2007).

2.4 Parametric recombination methods

There are a number of methods proposed to recombine subposterior samples which exactly target the full posterior p(θ|x) when it is Normally distributed. We refer to these methods as parametric. Intuition for why this assumption might be valid for a large class of models comes from the Bernstein-von Mises Theorem (Le Cam, 2012), which is a central limit theorem for Bayesian statistics. Assuming suitable regularity conditions, and that the data are realised from a unique true parameter value θ_0, the theorem states that the posterior tends to a Normal distribution centred around θ_0. In particular, for large N the posterior is well approximated by N(θ_0, I^{-1}(θ_0)), where I(θ) is Fisher's information matrix. Since we are aiming to efficiently sample from models with large amounts of data, this approximation appears to be particularly relevant.

Neiswanger et al. (2013) propose to combine samples by approximating each subposterior using a Normal distribution, and then using the results for products of Gaussians to combine these approximations. Let µ̂_s and Σ̂_s denote the sample mean and sample variance of the MCMC output for batch s. Then we can approximate each subposterior by N(µ̂_s, Σ̂_s). Using (2.2), the full posterior can be estimated by simply multiplying these subposterior estimates together. It follows that the estimate will be multivariate Gaussian with mean µ̂ and variance Σ̂ given by

Σ̂ = (∑_{s=1}^S Σ̂_s^{-1})^{-1},   µ̂ = Σ̂ (∑_{s=1}^S Σ̂_s^{-1} µ̂_s). (2.4)
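A minimal sketch of this parametric recombination (illustrative code of our own, estimating the mean and covariance of each batch's MCMC output and applying (2.4)):

```python
import numpy as np

def parametric_combine(subposterior_samples):
    # Gaussian approximation to each subposterior, combined via eq. (2.4).
    precisions, weighted_means = [], []
    for samples in subposterior_samples:        # each entry is an (M, d) array of MCMC output
        d = samples.shape[1]
        mu_s = samples.mean(axis=0)
        Sigma_s = np.cov(samples, rowvar=False).reshape(d, d)
        P = np.linalg.inv(Sigma_s)              # subposterior precision estimate
        precisions.append(P)
        weighted_means.append(P @ mu_s)
    Sigma = np.linalg.inv(np.sum(precisions, axis=0))
    mu = Sigma @ np.sum(weighted_means, axis=0)
    return mu, Sigma
```

As a sanity check, combining two identical batches leaves the mean unchanged and halves the covariance, as the precision-sum in (2.4) predicts.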

Scott et al. (2013) propose a similar method, where samples are combined using averaging; their method is known as consensus Monte Carlo. Denote the jth sample from subposterior s by θ_{sj}, and suppose each subposterior is assigned a weight W_s (a matrix in the multivariate case). The jth draw θ_j from the consensus approximation to the full posterior is given by

θ_j = (∑_{s=1}^S W_s)^{-1} ∑_{s=1}^S W_s θ_{sj}.

When each subposterior is Normal, the full posterior is also Normal, and if we set the weights to be the subposterior precisions, W_s = Var(θ|x_s)^{-1}, then the θ_j are exact draws from the full posterior. The idea is that even when the subposteriors are non-Gaussian, the draws θ_j will still be a close approximation to the posterior. Motivated by the exact result in the Normal case, Scott et al. (2013) suggest using the inverse of the sample variance of each batch as the weights in practice.
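A single consensus draw is simply a matrix-weighted average. The sketch below is our own illustration; in practice each W_s would be the inverse sample covariance of batch s.

```python
import numpy as np

def consensus_draw(draws, weights):
    # Weighted average of one draw from each subposterior; weights W_s are d x d matrices.
    W_total_inv = np.linalg.inv(np.sum(weights, axis=0))
    return W_total_inv @ np.sum([W @ th for W, th in zip(weights, draws)], axis=0)
```

For example, combining draws 0 and 3 with (scalar) weights 2 and 1 gives (2·0 + 1·3)/3 = 1.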

Key advantages of the two approximations outlined above are that they are fast, and that they converge quickly when models are close to Gaussian. However, they only target the full posterior exactly if either each subposterior is Normally distributed or the size of each batch tends to infinity. Therefore the methods' performance on non-Gaussian targets should be explored, especially multi-modal ones, since the methods may conceivably struggle in these cases.

Rabinovich et al. (2015) suggest extending the consensus Monte Carlo algorithm of Scott et al. (2013) by relaxing the restriction that draws are aggregated by averaging. Suppose we pick a draw from each subposterior, θ_1, . . . , θ_S, and refer to the function used to aggregate these draws as F(θ_1, . . . , θ_S). In the case of consensus Monte Carlo we have

F(θ_1, . . . , θ_S) = (∑_{s=1}^S W_s)^{-1} ∑_{s=1}^S W_s θ_s.

Rabinovich et al. (2015) suggest adaptively choosing the best aggregation function F(·), motivated by the fact that the averaging function used in Scott et al. (2013) is only known to be exact for Gaussian posteriors. To adaptively choose F(·), they use variational Bayes. However, the method requires the introduction of an optimization step, and it would be interesting to weigh the relative improvement in the approximation against the increase in computation time.

2.5 Nonparametric methods

While the methods outlined above work relatively well when the subposteriors are approximately Gaussian, it is not clear how they behave when models are far from Gaussian, or when batch sizes are small. Neiswanger et al. (2013) therefore suggest an alternative method based on kernel density estimation, which can be shown to target the full posterior asymptotically as the number of samples drawn from each subposterior tends to infinity.

Let x_1, . . . , x_N be a sample from a distribution of dimension d with density f. Kernel density estimation provides an estimate f̂ of this density. The kernel density estimate of f at a point x is

f̂(x) = (1/N) ∑_{i=1}^N K_H(x − x_i),


where H is a d × d symmetric, positive-definite matrix known as the bandwidth and K is the unscaled kernel, a symmetric d-dimensional density. K_H is related to K by K_H(x) = |H|^{-1/2} K(H^{-1/2} x). Commonly the kernel K is chosen to be Gaussian, since this leads to smooth density estimates and simplifies mathematical analysis (Duong, 2004). The bandwidth is an important factor in determining the accuracy of a kernel density estimate, as it controls the smoothing of the estimate.
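For a Gaussian kernel with H = h²I, the estimate reduces to averaging isotropic Gaussian densities centred at the data points. A small illustrative sketch (code and names are our own):

```python
import numpy as np

def gaussian_kde_at(x, data, h):
    # KDE with Gaussian kernel and bandwidth matrix H = h^2 I, evaluated at a point x.
    data = np.atleast_2d(data)                  # (N, d) array of sample points
    d = data.shape[1]
    sq_dists = np.sum((data - np.atleast_1d(x))**2, axis=1)
    kernel_vals = np.exp(-0.5 * sq_dists / h**2) / (2 * np.pi * h**2)**(d / 2)
    return kernel_vals.mean()
```

With a single data point at the origin and h = 1, the estimate at 0 is just the standard normal density, 1/√(2π).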

Suppose we have a sample {θ_{m,s}}_{m=1}^M from each subposterior s ∈ {1, . . . , S}. Neiswanger et al. (2013) suggest approximating each subposterior using a kernel density estimate with Gaussian kernel and diagonal bandwidth matrix h²I, where I is the d-dimensional identity matrix. Denoting this estimate by p̂_s(θ), we can write it as

p̂_s(θ) = (1/M) ∑_{m=1}^M N(θ|θ_{m,s}, h²I),

where N(·|θ_{m,s}, h²I) denotes a d-dimensional Gaussian density with mean θ_{m,s} and variance h²I.

The estimate for the full posterior p(θ|x) is then defined to be the product of the estimates for each batch,

p̂(θ|x) = ∏_{s=1}^S p̂_s(θ) = (1/M^S) ∏_{s=1}^S ∑_{m=1}^M N(θ|θ_{m,s}, h²I). (2.5)

Therefore the estimate for the full posterior becomes a product of Gaussian mixtures, as discussed in Section 2.3. By introducing a similar labelling system L = (l_1, . . . , l_S), with l_s ∈ {1, . . . , M}, we can again derive an explicit expression for the resulting mixture. While Neiswanger et al. (2013) use a common variance h²I for each kernel, we suggest it might be better to use a general diagonal matrix Λ, since different parameters may differ considerably in variance. In either case, assuming a common, diagonal variance Λ across the kernel estimates for each batch, the weights in the product (2.5) simplify to

ω_L ∝ ∏_{s=1}^S N(θ_{l_s,s}|θ̄_L, Λ),   θ̄_L = (1/S) ∑_{s=1}^S θ_{l_s,s}. (2.6)

The Lth component of the mixture simplifies to N(θ|θ̄_L, Λ/S).

Given that this method is designed for use with large datasets, the number of components of the resulting Gaussian mixture will be very large, so sampling from it efficiently is an important issue. Neiswanger et al. (2013) recommend sampling from the full posterior estimate using a method similar to the Gibbs sampling approach outlined in Section 2.3. To avoid calculating the conditional distribution of the weights, however, they use a Metropolis-within-Gibbs approach as follows. Holding all labels except the current one, l_s, fixed, we randomly sample a new value for l_s. We then accept this new label with probability given by the ratio of the new and current weights. The full algorithm is detailed in Algorithm 1.

Algorithm 1: Combining batches using kernel density estimation.

Data: Samples {θ_{m,s}}_{m=1}^M from each subposterior s ∈ {1, . . . , S}.
Result: Sample from an estimate of the full posterior p(θ|x).
Draw an initial label L = (l_1, . . . , l_S) by simulating l_s ∼ Unif({1, . . . , M}) for each s ∈ {1, . . . , S}.
for i = 1 to T do
    h ← h(i)
    for s = 1 to S do
        Create a new label C := (c_1, . . . , c_S) by setting C ← L
        Draw a new value for index s in C, c_s ∼ Unif({1, . . . , M})
        Simulate u ∼ Unif(0, 1)
        if u < ω_C/ω_L then
            L ← C
        end
    end
    Simulate θ_i ∼ N(θ̄_L, (h²/S) I)
end

Notice that in the algorithm, h is changed as a function of the iteration i; in particular, Neiswanger et al. (2013) specify h(i) = i^{−1/(4+d)}. This causes the bandwidth to decrease at each iteration and is referred to as annealing. The properties of annealing are investigated further in Section 4. In their paper, Neiswanger et al. (2013) assume that the number of iterations is the same as the size of the sample from each subposterior. However, this is not necessary; in fact, when we are trying to sample from a mixture with a large number of components, we may need to simulate more than this in order to ensure the sample accurately represents the true KDE approximation.
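A compact sketch of the Metropolis-within-Gibbs label sampler follows (our own illustration, not the report's GitHub implementation; we take Λ = h²I, so the weight ratio in (2.6) reduces to squared distances from the label mean, with the Gaussian normalising constants cancelling):

```python
import numpy as np

def combine_kde(subposterior_samples, T, seed=0):
    # Metropolis-within-Gibbs over mixture labels, following the structure of Algorithm 1.
    rng = np.random.default_rng(seed)
    S, M, d = subposterior_samples.shape

    def log_weight(labels, h):
        # log omega_L of eq. (2.6) with Lambda = h^2 I, up to an additive constant
        pts = subposterior_samples[np.arange(S), labels]
        theta_bar = pts.mean(axis=0)
        return -0.5 * np.sum((pts - theta_bar)**2) / h**2

    labels = rng.integers(0, M, size=S)
    draws = np.empty((T, d))
    for i in range(T):
        h = (i + 1.0)**(-1.0 / (4 + d))          # annealed bandwidth h(i)
        for s in range(S):
            cand = labels.copy()
            cand[s] = rng.integers(0, M)         # propose a new label for batch s
            if np.log(rng.uniform()) < log_weight(cand, h) - log_weight(labels, h):
                labels = cand
        pts = subposterior_samples[np.arange(S), labels]
        draws[i] = pts.mean(axis=0) + (h / np.sqrt(S)) * rng.normal(size=d)
    return draws
```

Each outer iteration performs one Gibbs sweep over the S labels before drawing from the Gaussian component indicated by the current label vector.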

While this algorithm may improve results as models move away from Gaussianity, kernel density estimation is known to perform poorly in high dimensions, so the algorithm will deteriorate as the dimensionality of θ increases. The algorithm also suffers from the curse of dimensionality in the number of batches and the size of the MCMC sample simulated from each subposterior, suggesting that as the number of batches increases, the accuracy and mixing of the algorithm will be affected. Finally, the algorithm requires the user to choose a bandwidth; the sensitivity of the algorithm to different bandwidth choices would therefore be interesting to investigate.

In the original paper by Neiswanger et al. (2013), it is suggested to use a Gaussian kernel with bandwidth h²I. However, as mentioned earlier, different parameters may have different variances, and the algorithm would probably perform better using a more general diagonal matrix Λ, especially as this does not particularly increase the complexity of the algorithm. Using a common bandwidth parameter across batches eases computation, but it may negatively affect the performance of the algorithm; note that when discussing products of Gaussian mixtures in Section 2.3, the variances across different mixtures did not need to be assumed common. Therefore further improvements might be made by varying bandwidths across batches, though this would increase computational expense. Finally, improvements could be gained by using more sophisticated methods to sample from the product of kernel density estimates (Ihler et al., 2004; Rudoy and Wolfe, 2007).

A number of developments have been proposed for Algorithm 1. Wang and Dunson (2013) note that the algorithm performs poorly when samples from each subposterior do not overlap. To improve this, they suggest smoothing each subposterior using a Weierstrass transform, which simply takes the convolution of the density with a Gaussian function. The transformed function can be seen as a smoothed version of the original, which tends to increase the overlap between subposteriors. They then approximate the full posterior as a product of the Weierstrass transforms of each subposterior. However, since in general the approximation to each subposterior will be empirical, its Weierstrass transform corresponds to a kernel density estimator. Therefore this method is, for all intents and purposes, the same as the original algorithm by Neiswanger et al. (2013), and so still suffers from many of the same problems.

An alternative way to improve overlap between the supports of the subposteriors is to use heavier-tailed kernels in the kernel density estimation. Implementing this, however, will require some work in order to sample from the resulting product of mixtures, since the convenient properties of products of Gaussians may not hold for these heavier-tailed distributions, and alternative sampling methods would need to be developed.

Wang et al. (2015), rather than using kernel density estimation, use space partitioning to divide the parameter space into disjoint subsets and count the number of sample points contained in each. This produces an estimate of each subposterior akin to a multi-dimensional histogram. An estimate of the full posterior can then be made by multiplying the subposterior estimates together and normalising. This algorithm avoids the explosion of mixture components that affects Algorithm 1. Despite this, it will still suffer when the supports of the subposteriors do not overlap; moreover, it is more complicated to implement and will be affected by the choice of partitioning used.

Alternatively, there have been suggestions to introduce suitable metrics which allow summaries of a set of probability measures to be defined. This allows batches to be recombined in terms of these summaries. For example, Minsker et al. (2014) use a metric known as the Wasserstein distance in order to define the median posterior from a set of subposteriors. Similarly, Srivastava et al. (2015) also use the Wasserstein distance to calculate a summary of the subposteriors known as the barycenter. This allows them to produce an estimate for the full posterior which they refer to as the Wasserstein posterior, or WASP. However the statistical properties of these summaries are unclear and need to be investigated further.

2.6 Semiparametric methods

In order to account for the fact that the nonparametric method of Algorithm 1 is slow to converge, Neiswanger et al. (2013) suggest producing a semiparametric estimator (Hjort and Glad, 1995) of each subposterior. This estimator combines the parametric estimator characterised by (2.4) and the nonparametric estimator detailed by Algorithm 1. More specifically, each subposterior is estimated by (Hjort and Glad, 1995)

p̂_s(θ) = f_s(θ) r̂(θ),


Computational statistics for big data Jack Baker

where f_s(θ) = N(θ|µ_s, Σ_s) and r̂(θ) is a nonparametric estimator of the correction function r(θ) = p_s(θ)/f_s(θ).

Assuming a Gaussian kernel for r̂(θ), Neiswanger et al. (2013) write down an explicit expression for p̂_s(θ):

p̂_s(θ) = (1/M) Σ_{m=1}^{M} [N(θ | θ_{m,s}, h²I) N(θ | µ_s, Σ_s)] / f_s(θ_{m,s}) = (1/M) Σ_{m=1}^{M} [N(θ | θ_{m,s}, h²I) N(θ | µ_s, Σ_s)] / N(θ_{m,s} | µ_s, Σ_s).
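As a concrete one-dimensional sketch, the semiparametric estimator above can be evaluated numerically from a set of subposterior draws. The function names and toy data below are our own, used purely for illustration:

```python
import numpy as np

def normal_pdf(x, mu, var):
    # Density of N(mu, var) evaluated at x.
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def semiparametric_estimate(theta, draws, h):
    # Semiparametric subposterior estimate in one dimension: a Gaussian
    # kernel density estimate multiplied by the parametric correction
    # N(theta | mu_s, Sigma_s) / N(theta_m | mu_s, Sigma_s).
    mu_s, var_s = draws.mean(), draws.var()
    kernel = normal_pdf(theta, draws, h ** 2)        # N(theta | theta_m, h^2)
    correction = normal_pdf(theta, mu_s, var_s) / normal_pdf(draws, mu_s, var_s)
    return np.mean(kernel * correction)

rng = np.random.default_rng(0)
draws = rng.normal(2.0, 1.0, size=1000)   # stand-in for MCMC draws from one subposterior
density = semiparametric_estimate(2.0, draws, h=0.3)
```

When the parametric fit is accurate, as in this toy case, the correction term is close to constant and the estimator behaves like the parametric approximation itself.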

Similarly to the nonparametric method, we can produce an estimate for the full posterior p(θ|x) as the product of estimates for each subposterior. Once again this results in a mixture of Gaussians with M^S components. Using the label L = (l_1, . . . , l_S), the Lth mixture weight W_L and component c_L are given by

W_L ∝ ω_L N(θ̄_L | µ, Σ + (h²/S) I) / Π_{s=1}^{S} N(θ_{l_s,s} | µ_s, Σ_s),   c_L = N(θ | µ_L, Σ_L),

where ω_L and θ̄_L are as defined in (2.6), and the parameters of the mixture component are

Σ_L = ((S/h²) I + Σ^{-1})^{-1},   µ_L = Σ_L ((S/h²) I θ̄_L + Σ^{-1} µ),

where Σ and µ are as defined in (2.4). Sampling from this mixture can be performed using Algorithm 1, replacing weights and parameters where appropriate.
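The component parameters Σ_L and µ_L are simple to compute. A small sketch (the function and variable names are ours, and the parametric product parameters Σ and µ are taken as given):

```python
import numpy as np

def semiparametric_component(theta_bar_L, mu, Sigma, h, S):
    # Sigma_L = (S/h^2 I + Sigma^{-1})^{-1},
    # mu_L = Sigma_L (S/h^2 theta_bar_L + Sigma^{-1} mu).
    d = len(mu)
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_L = np.linalg.inv((S / h ** 2) * np.eye(d) + Sigma_inv)
    mu_L = Sigma_L @ ((S / h ** 2) * theta_bar_L + Sigma_inv @ mu)
    return mu_L, Sigma_L

# As h -> 0 the component mean approaches the nonparametric mean theta_bar_L.
mu_L, Sigma_L = semiparametric_component(
    np.array([1.0, 2.0]), np.zeros(2), np.eye(2), h=0.01, S=4)
```

The small-h behaviour illustrates the limiting property discussed below: for a tiny bandwidth the semiparametric component collapses onto the nonparametric one.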

As h → 0, the semiparametric component parameters Σ_L and µ_L approach the corresponding nonparametric component parameters. This motivates Neiswanger et al. (2013) to suggest an alternative semiparametric algorithm where the nonparametric component weights ω_L are used instead of W_L. Their reasoning is that the resulting algorithm may have a higher acceptance probability, and is still asymptotically exact as the batch size tends to infinity. As in Section 2.5, a bandwidth matrix with identical diagonal elements hI will not necessarily be the best choice for the bandwidth if different dimensions of the parameter have different scales or variances. However the algorithm can easily be extended to use a diagonal bandwidth matrix Λ, in a similar way to the nonparametric method.

While this method may solve the problem that the nonparametric method is slow to converge in high dimensions, the performance of the algorithm is not well understood. For example, it is unclear how the algorithm will perform as models tend away from Gaussianity, given that it includes this parametric term. Moreover the method still suffers from the curse of dimensionality in terms of the number of mixture components, and will also be affected by the choice of bandwidth.

2.7 Conclusion

In this section we outlined batch methods. Batch methods split a large dataset up into smaller subsets, run parallel MCMC on these subsets, and then combine the MCMC output to obtain an approximation to the full posterior. A couple of methods appealed to the Bernstein-von Mises theorem in order to approximate each subposterior by a Normal distribution.


The resulting approximation to the full posterior could be found using standard results for products of Gaussians. However these methods are only exact if each subposterior is Normal, or as the number of observations in each batch tends to infinity. Performance of the methods when these assumptions are violated needs to be investigated.

Alternative methods used kernel density estimation, or a mixture of a Normal estimate and a kernel density estimate, to approximate each subposterior. These estimates could then be combined using results for the product of mixtures of Gaussians. However the resulting approximation was a mixture of M^S components, which is difficult to sample from efficiently. Moreover kernel density estimation is known to deteriorate as dimensionality increases, and requires the choice of a bandwidth.

To conclude, each of the batch methods has either undesirable qualities or properties which are not well understood. These issues need reviewing before the methods can be used with confidence in practice. Batch methods are particularly suited to models which exhibit structure, for example hierarchical models.

3 Stochastic gradient methods

3.1 Introduction

Methods currently employed in large scale machine learning are generally optimization based. One method employed frequently in training machine learning models is stochastic optimization (Robbins and Monro, 1951). This method is used to optimize a likelihood function in a similar way to traditional gradient ascent. The key difference is that at each iteration, rather than using the whole dataset, only a subset is used. While the method produces impressive results at low computational cost, it has a number of downsides. Parameter uncertainty is not captured, since the method only produces a point estimate. Though uncertainty can be estimated using a Normal approximation, for more complex models this estimate may be poor. This means models fitted using stochastic optimization can suffer from overfitting. Also, since the method does not sample from the posterior as in traditional MCMC, the algorithm can get stuck in local maxima.

Methods outlined in this section aim to combine the subsampling approach of stochastic optimization with posterior sampling, which helps capture uncertainty in parameter estimates. The section begins by outlining stochastic optimization, before introducing stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Hamiltonian Monte Carlo (SGHMC), the two key algorithms for big data discussed in this section. Hamiltonian Monte Carlo (HMC), a technique used extensively by SGHMC, is also reviewed.

3.2 Stochastic optimization

Let x_1, . . . , x_N be data observed from a model with probability density function p(x|θ), where θ denotes an unknown parameter vector. Assigning a prior p(θ) to θ, as usual our interest is in the posterior

p(θ|x) ∝ p(θ) Π_{i=1}^{N} p(x_i|θ),

where we define p(x|θ) = Π_{i=1}^{N} p(x_i|θ) to be the likelihood.

Stochastic optimization (Robbins and Monro, 1951) aims to find the mode θ* of the posterior distribution, otherwise known as the MAP estimate of θ. The idea of finding the mode of the posterior rather than the likelihood is that the prior p(θ) regularizes the parameters: it acts as a penalty for model complexity, which helps prevent overfitting. At each iteration t, stochastic optimization takes a subset s_t of the data and updates the parameters as follows (Welling and Teh, 2011)

θ_{t+1} = θ_t + (ε_t/2) (∇log p(θ_t) + (N/n) Σ_{x_i ∈ s_t} ∇log p(x_i|θ_t)),

where ε_t is the stepsize at each iteration and |s_t| = n. The idea is that over the long run the noise from using a subset of the data averages out, and the algorithm tends towards standard gradient ascent. Clearly when the number of observations N is large, using only a subset of the data is much less computationally expensive. This is a key advantage of stochastic optimization.

Provided that

Σ_{t=1}^{∞} ε_t = ∞,   Σ_{t=1}^{∞} ε_t² < ∞,   (3.1)

and p(θ|x) satisfies certain technical conditions, this algorithm is guaranteed to converge to a local maximum.

A common extension of stochastic optimization, which will be needed later, is known as stochastic optimization with momentum. This is commonly employed when the likelihood surface exhibits a particular structure; one example where the method is employed extensively is the training of deep neural networks. In this case we introduce a variable ν, which is referred to as the velocity of the trajectory. The parameter updates then proceed as follows

ν_{t+1} = (1 − α)ν_t + η (ε_t/2) (∇log p(θ_t) + (N/n) Σ_{x_i ∈ s_t} ∇log p(x_i|θ_t)),
θ_{t+1} = θ_t + ν_{t+1},   (3.2)

where α and η are free parameters to be tuned.

While stochastic optimization is used frequently by large scale machine learning practitioners, it does not capture parameter uncertainty, since it only produces a point estimate of θ. This means that models fit using stochastic optimization can often suffer from overfitting and require some form of regularization. One common method to provide an approximation to the true posterior is to fit a Gaussian approximation at the point estimate.


Suppose θ_0 is the true mode of the posterior p(θ|x). Then using a Taylor expansion about θ_0, we find (Bishop, 2006)

log p(θ|x) ≈ log p(θ_0|x) + (θ − θ_0)^T ∇log p(θ_0|x) − (1/2)(θ − θ_0)^T H[log p(θ_0|x)](θ − θ_0)
           = log p(θ_0|x) − (1/2)(θ − θ_0)^T H[log p(θ_0|x)](θ − θ_0),

where H[g(·)] is the negative Hessian matrix of the function g(·), and we have used the fact that the gradient of the log posterior at θ_0 is 0.

Let us denote the negative Hessian by H[log p(θ|x)] := V^{-1}[θ]; then taking the exponential of both sides we find

p(θ|x) ≈ A exp{−(1/2)(θ − θ_0)^T V^{-1}[θ_0](θ − θ_0)},

where A is some constant. This is the kernel of a Gaussian density, suggesting an approximation to the posterior of the form N(θ*, V[θ*]), where θ* is an estimate of the mode to be found. This is often referred to as a Laplace approximation.

By the Bernstein-von Mises theorem, this approximation is expected to become increasingly accurate as the number of observations increases. However, since the approximation is based only on distributional aspects at one point, it can miss important properties of the distribution (Bishop, 2006). Moreover, multimodal distributions will be approximated very poorly. Therefore, while the approximation may work well for less complex distributions when plenty of data is available, it may struggle for more complex models. This motivates us to consider methods which aim to combine the performance of stochastic optimization with the ability to account for parameter uncertainty.

3.3 Hamiltonian Monte Carlo

Hamiltonian dynamics was originally developed as an important reformulation of Newtonian dynamics, and serves as a vital tool in statistical physics. More recently, Hamiltonian dynamics has been used to produce proposals for the Metropolis-Hastings algorithm which explore the parameter space rapidly and have very high acceptance rates. The acceptance calculation in the Metropolis-Hastings algorithm is computationally intensive when a lot of data is available. However, as outlined later, by combining ideas from stochastic optimization and Hamiltonian dynamics we are able to approximately simulate from the posterior distribution without using an acceptance calculation. In light of this, we review Hamiltonian Monte Carlo, a method which produces efficient proposals for the Metropolis-Hastings algorithm.

3.3.1 Hamiltonian dynamics

Hamiltonian dynamics was traditionally developed to describe the motion of objects under a system of forces. In two dimensions a common analogy used to visualise the dynamics is a frictionless puck sliding over a surface of varying height (Neal, 2010). The state of the system consists of the puck's position θ and its momentum (mass times velocity) r, both of which are two-dimensional vectors. The system is governed by its potential energy U(θ) and its kinetic energy K(r). If the puck is moving on a flat part of the space, it has constant velocity. However, as the puck begins to gain height, its kinetic energy decreases and its potential energy increases as it slows. If its kinetic energy reaches zero the puck moves back down the hill, its potential energy decreasing as its kinetic energy increases.

More formally, Hamiltonian dynamics is described by a Hamiltonian function H(r, θ), where r and θ are both d-dimensional. The Hamiltonian determines how r and θ change over time as follows

dθ_i/dt = ∂H/∂r_i,   dr_i/dt = −∂H/∂θ_i.   (3.3)

Hamiltonian dynamics has a number of properties which are crucial for its use in constructing MCMC proposals. Firstly, Hamiltonian dynamics is reversible, meaning that the mapping from the state (r(t), θ(t)) at time t to the state (r(t+s), θ(t+s)) at time t+s is one-to-one. A second property is that the dynamics keeps the Hamiltonian invariant, or conserved. This can easily be shown using (3.3) as follows

dH/dt = Σ_{i=1}^{d} (dθ_i/dt ∂H/∂θ_i + dr_i/dt ∂H/∂r_i) = Σ_{i=1}^{d} (∂H/∂r_i ∂H/∂θ_i − ∂H/∂θ_i ∂H/∂r_i) = 0.

In order to use Hamiltonian dynamics to simulate from a distribution, we need to translate the density function into a potential energy function, and introduce artificial momentum variables to go with the position variables of interest. A Markov chain can then be simulated where at each iteration we resample the momentum variables, simulate Hamiltonian dynamics for a number of steps, and then perform a Metropolis-Hastings acceptance step with the new variables obtained from the simulation.

In light of this, for Hamiltonian Monte Carlo we generally define the Hamiltonian H(r, θ) to be of the following form

H(r, θ) = U(θ) + K(r),

where θ is the vector we are simulating from and the momentum vector r is constructed artificially. Using the notation in Section 3.2, the potential energy is then defined to be

U(θ) = −log (p(θ) Π_{i=1}^{N} p(x_i|θ)) = −log p(θ) − Σ_{i=1}^{N} log p(x_i|θ).   (3.4)

The kinetic energy is defined as

K(r) = (1/2) r^T M^{-1} r,   (3.5)

where M is a symmetric, positive definite mass matrix.


3.3.2 Using Hamiltonian dynamics in MCMC

In order to relate the potential and kinetic energy functions to the distribution of interest, we can use the concept of a canonical distribution. Given some energy function E(x), defined over the states of x, the canonical distribution over the states of x is defined to be

P(x) = (1/Z) exp{−E(x)/(k_B T)},   (3.6)

where Z is a normalizing constant, k_B is Boltzmann's constant, and T is the temperature of the system. The Hamiltonian is an energy function defined over the joint state of r and θ, so we can write down the joint distribution it defines as

P(r, θ) ∝ exp{−H(r, θ)/(k_B T)}.

If we now assume the Hamiltonian is of the form described by (3.4) and (3.5), and that k_B T = 1, then we find that

P(r, θ) ∝ exp{−U(θ)} exp{−K(r)} ∝ p(θ|x) N(r|0, M).

So the distributions for r and θ defined by the Hamiltonian are independent, and the marginal distribution of θ is its posterior distribution.

This relationship enables us to describe Hamiltonian Monte Carlo (HMC), which can be used to simulate from continuous distributions whose density can be evaluated up to a normalizing constant. A requirement of HMC is that we can calculate the derivatives of the log of the target density. HMC samples from the joint distribution for (θ, r); therefore, by discarding the samples for r, we obtain a sample from the posterior p(θ|x). Generally we choose the components r_i of r to be independent, each with variance m_i. This allows us to write the kinetic energy as

K(r) = Σ_{i=1}^{d} r_i²/(2m_i).

In order to approximate Hamilton's equations computationally, we need to discretize time using a small stepsize ε. There are a number of ways to do this; however, in practice the leapfrog method often produces good results. The method works as follows:

1. r_i(t + ε/2) = r_i(t) − (ε/2) ∂U/∂θ_i(θ(t)),

2. θ_i(t + ε) = θ_i(t) + ε ∂K/∂r_i(r(t + ε/2)),

3. r_i(t + ε) = r_i(t + ε/2) − (ε/2) ∂U/∂θ_i(θ(t + ε)).

The leapfrog method has a number of desirable properties, including that it is reversible and volume preserving. An effect of this is that at the acceptance step the proposal distributions cancel, so that the acceptance probability is simply a ratio of the canonical distributions at the proposed and current states. Since we must discretize the equations in order to simulate from them, the posterior p(θ|x) is not invariant under the approximate dynamics. This is why the acceptance step is required, as it corrects for this error. As the stepsize ε tends to zero, the acceptance rate of the leapfrog method tends to 1, as the approximation moves closer to true Hamiltonian dynamics.

Now that we have described how to approximate Hamilton's equations, we can outline Hamiltonian Monte Carlo. HMC is performed in two steps as follows:

1. Simulate new values for the momentum variables r ∼ N(0,M).

2. Simulate Hamiltonian dynamics for L steps with stepsize ε using the leapfrog method. The momentum variables are then negated, and the new state (θ*, r*) is accepted with probability

min {1, exp{H(θ, r)−H(θ∗, r∗)}} .
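Putting the two steps together, here is a minimal HMC sampler for a one-dimensional standard Gaussian target; all names and tuning values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def U(theta):
    return 0.5 * theta ** 2                   # target N(0, 1), up to a constant

def grad_U(theta):
    return theta

def leapfrog(theta, r, eps, L):
    r = r - 0.5 * eps * grad_U(theta)
    for _ in range(L - 1):
        theta = theta + eps * r
        r = r - eps * grad_U(theta)
    theta = theta + eps * r
    r = r - 0.5 * eps * grad_U(theta)
    return theta, r

def hmc_step(theta, eps=0.2, L=10):
    r = rng.normal()                          # 1. resample momentum, M = 1
    theta_star, r_star = leapfrog(theta, r, eps, L)
    # 2. accept with probability min{1, exp(H(theta, r) - H(theta*, r*))}
    log_accept = (U(theta) + 0.5 * r ** 2) - (U(theta_star) + 0.5 * r_star ** 2)
    if np.log(rng.random()) < log_accept:
        return theta_star
    return theta

theta, samples = 0.0, []
for _ in range(5000):
    theta = hmc_step(theta)
    samples.append(theta)
```

The explicit negation of the momentum is omitted here since K(r) is symmetric in r, so it does not change the accept/reject calculation.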

3.3.3 Developments in HMC and tuning

HMC allows the state space to be explored rapidly and has high acceptance rates. However, in order to gain these benefits we need to ensure that L and ε are properly tuned. Generally it is recommended to use trial values for L and ε, and to use traceplots and autocorrelation plots to decide how quickly the resulting algorithm converges and how well it is exploring the state space. The presence of multiple modes can be an issue for HMC, and requires special treatment (Neal, 2010). It is therefore recommended that the algorithm is run from different starting points to ensure multimodality is not present.

Suppose we have an estimate of the variance matrix for θ. If the variables appear to be correlated, then HMC may not explore the parameter space effectively. One way to improve the performance of HMC in this case is to set M = Σ^{-1}, where Σ is our estimate of Var(θ|x).

The selection of the stepsize ε is very important in HMC: selecting a value that is too big will result in a low acceptance rate, while selecting one that is too small will result in slow exploration of the space. Selecting ε too large can be particularly problematic, as it can cause instability in the Hamiltonian error, which leads to very low acceptance. In situations where the mass matrix M is the identity matrix, the stability limit for ε is given by the width of the distribution in its most constrained direction. For a Gaussian distribution, this is the square root of the smallest eigenvalue of the covariance matrix of θ.

The value of L is also an important quantity to choose when tuning the HMC algorithm. Selecting L too small will mean that HMC explores the space with inefficient random walk behaviour, as the next state will still be correlated with the previous state. On the other hand, selecting L too large will waste computation and lower acceptance rates.

There have been a number of important developments to HMC. Girolami and Calderhead (2011) introduced Riemannian manifold Hamiltonian Monte Carlo, which simulates HMC in a Riemannian space rather than a Euclidean one. This effectively enables the use of position-dependent mass matrices M. As a result, the algorithm samples more efficiently from distributions where parameters of interest exhibit strong correlations. Hoffman and Gelman (2014) developed the 'No U-turn Sampler', which enables automatic and adaptive tuning of the stepsize ε and the trajectory length L. This is an important development, since the tuning of HMC algorithms is a non-trivial task.

Alternative methods to the leapfrog method for simulating Hamiltonian dynamics have been developed. These enable us to handle constraints on the variables, or to exploit partially analytic solutions (Neal, 2010). As mentioned earlier, HMC can have considerable difficulty moving between the modes of a distribution. A number of schemes have been developed to solve this problem, including tempered transitions (Neal, 1996) and annealed importance sampling (Neal, 2001).

3.4 Stochastic gradient Langevin Monte Carlo

A special case of HMC, known as Langevin Monte Carlo, arises when we use only a single leapfrog step to propose a new state. Its name comes from its similarity to the theory of Langevin dynamics in physics. Welling and Teh (2011) noticed that the discretized form of Langevin Monte Carlo has a structure comparable to that of stochastic optimization, outlined in Section 3.2. This motivated them to develop an algorithm based on Langevin Monte Carlo which uses only a subsample of the dataset to calculate the gradient of the potential energy ∇U. They show that by using a stepsize that decreases with time, the algorithm will smoothly transition from stochastic optimization to sampling approximately from the posterior distribution, without the need for an acceptance step. This result, along with the fact that only a subsample of the data is used at each iteration, means that the algorithm is scalable to large datasets.

3.4.1 Stochastic gradient Langevin Monte Carlo

Langevin Monte Carlo arises from HMC when we use only one leapfrog step in generating a new state (r, θ). In this case we can remove any explicit mention of momentum variables and propose a new value for θ as follows (Neal, 2010)

θ_{t+1} = θ_t − (a²/2) ∂U/∂θ(θ_t) + η,

where η ∼ N(0, a²I) and a is some constant. Using our particular expression (3.4) for the potential energy, we can write

θ_{t+1} = θ_t + (ε/2) (∇log p(θ_t) + Σ_{i=1}^{N} ∇log p(x_i|θ_t)) + η = θ_t − (ε/2) ∇U(θ_t) + η,   (3.7)

where ε = a².

While being a special case of Hamiltonian Monte Carlo, the properties of Langevin dynamics are somewhat different. We cannot typically set a very large, so the state space is normally explored much more slowly than with HMC. The proposal for Langevin Monte Carlo is a particular discretization of a stochastic differential equation (SDE) known as Langevin dynamics. Writing the dynamics as an SDE, we obtain

dθ = −(1/2) ∇U(θ) dt + dW = −(1/2) ∇U(θ) dt + N(0, dt),   (3.8)


where W is a Wiener process and we have informally written dW as N(0, dt). A Wiener process is a stochastic process with the following properties:

1. W(0) = 0 with probability 1;

2. W(t + h) − W(t) ∼ N(0, h) and is independent of W(τ) for τ ≤ t.

It can be shown that, under certain conditions, the posterior distribution p(θ|x) is the stationary distribution of (3.8). This motivates the Metropolis-adjusted Langevin algorithm (MALA), which uses (3.7) as a proposal for the Metropolis-Hastings algorithm.

When there are a large number of observations, ∇U(θ) is expensive to calculate at each iteration, since it requires evaluating the gradient of the full log likelihood. Welling and Teh (2011) therefore suggest introducing an unbiased estimator ∇Û(θ) of ∇U(θ), which uses only a subset s_t of the data at each iteration. The estimator is given by

∇Û(θ) = −∇log p(θ) − (N/n) Σ_{x_i ∈ s_t} ∇log p(x_i|θ).   (3.9)

We write

∇Û(θ) = ∇U(θ) + ν,   (3.10)

where ν is a noise term which we refer to as the stochastic gradient noise.

Using this estimator in place of ∇U(θ) in the Langevin Monte Carlo update, we obtain the following

θ_{t+1} = θ_t + (ε/2) (∇log p(θ_t) + (N/n) Σ_{x_i ∈ s_t} ∇log p(x_i|θ_t)) + η   (3.11)
        = θ_t − (ε/2) ∇Û(θ_t) + η = θ_t − (ε/2) ∇U(θ_t) − (ε/2) ν_t + η.

If we assume that the stochastic gradient noise ν_t has variance V(θ_t), then the term (ε/2)ν_t has variance (ε²/4)V(θ_t). Therefore for small ε the injected noise η, which has variance ε, will dominate. As we send ε → 0, (3.11) will approximate Langevin dynamics and sample approximately from p(θ|x), without the need for an acceptance step.

This result motivates Welling and Teh (2011) to suggest an algorithm that uses (3.11) to update θ_t, but decreases the stepsize ε to 0 as the number of iterations t increases, leading to the SGLD update

θ_{t+1} = θ_t + (ε_t/2) (∇log p(θ_t) + (N/n) Σ_{x_i ∈ s_t} ∇log p(x_i|θ_t)) + η_t,   (3.12)

where η_t ∼ N(0, ε_t).

Noting the similarity between (3.12) and stochastic optimization, they suggest decreasing ε_t according to the conditions (3.1), to ensure that the noise in the stochastic gradients averages out. The result is an algorithm that transitions smoothly between stochastic optimization and approximately sampling from the posterior using an increasingly accurate discretization of Langevin dynamics. Since the stepsize must decrease to zero, the mixing rate of the algorithm will slow as the number of iterations increases. Putting this all together, we outline the full SGLD procedure in Algorithm 2.

Algorithm 2: Stochastic gradient Langevin dynamics (SGLD).

Input: Initial estimate θ_1, stepsize function ε(t), subsample size |s_t| = n, likelihood and prior gradients ∇log p(x|θ) and ∇log p(θ).
Result: Approximate sample from the full posterior p(θ|x).
for t = 1 to T do
    ε ← ε(t)
    Sample s_t from the full dataset x
    η ∼ N(0, ε)
    θ ← θ + (ε/2) (∇log p(θ) + (N/n) Σ_{x_i ∈ s_t} ∇log p(x_i|θ)) + η
    if ε small enough then
        Store θ as part of the sample
    end
end
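A runnable sketch of Algorithm 2 for a toy model with x_i ∼ N(θ, 1) and prior θ ∼ N(0, 10); the tuning constants and names are our own, chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 1000, 100
x = rng.normal(1.0, 1.0, size=N)

theta, samples = 0.0, []
for t in range(1, 10001):
    eps = 1e-4 * (10 + t) ** (-1 / 3)        # polynomial stepsize, alpha = 1/3
    st = rng.choice(x, size=n, replace=False)
    grad = -theta / 10.0 + (N / n) * np.sum(st - theta)   # stochastic gradient estimate
    theta += 0.5 * eps * grad + rng.normal(0.0, np.sqrt(eps))   # eta ~ N(0, eps)
    if t > 5000:                              # keep draws once eps is small
        samples.append(theta)
# samples approximate the posterior, here close to N(mean(x), 1/N)
```

Note that no acceptance step appears anywhere: the decreasing stepsize is what controls the discretization error.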

3.4.2 Discussion and tuning

Teh et al. (2014) study SGLD theoretically and show that, given regularity conditions, estimators derived from an SGLD sample are consistent and satisfy a central limit theorem. They reveal that for polynomial stepsizes of the form ε_t = a(b + t)^{−α}, the optimal choice of α is 1/3. The rate of convergence of SGLD is shown to be T^{−1/3}, where T is the number of iterations of SGLD. This is slower than the traditional Monte Carlo rate of T^{−1/2}, and is due to the decreasing stepsizes.

In tuning the algorithm, the key constants that need to be chosen are those used in the stepsize, a and b, and the subsample size n. To avoid divergence it is important to keep the stochastic gradient noise under control, especially as N gets large. This can be done in two ways: one is to increase the subsample size n, the other is to keep the stepsize ε small. However, in order to keep the algorithm efficient the subsample size needs to be kept relatively small; Welling and Teh (2011) suggest keeping it in the hundreds. Therefore the main constant that needs to be considered in tuning is a. Set a too large and the stochastic gradient noise dominates for too long, so the algorithm never moves to posterior sampling. Set a too small and the parameter space is not explored efficiently enough.

Problems with this method include that it is important for the stepsizes to decrease to zero so that the acceptance step is not needed. However this means the mixing rate of the algorithm will slow down as the number of iterations increases. There are a few ways around this. One is to stop decreasing the stepsize once it falls below a threshold and the rejection rate is negligible; however, in this case the posterior will still be explored slowly. Another is to use this algorithm initially for burn-in, then switch to an alternative, more efficient MCMC method later. Both of these solutions require significant hand-tuning beforehand. The decelerating mixing rate also makes it less clear how the algorithm compares to other samplers: while it requires only a fraction of the dataset per iteration, this is offset by the fact that more iterations are required to reach the accuracy of other samplers (Bardenet et al., 2015).

Another problem with the method is that it often explores the state space inefficiently. This is because Langevin dynamics explores the state space less efficiently than more general HMC. This is the motivation for stochastic gradient HMC (Chen et al., 2014), which is discussed in Section 3.5.

Note that, similarly to HMC, certain parameters may have a much higher variance than others. In this case we can use a preconditioning matrix M to bring all the parameters onto a similar scale, allowing the algorithm to explore the space more efficiently. The algorithm including preconditioning can simply be written as

θ_{t+1} = θ_t + (ε_t/2) M (∇log p(θ_t) + (N/n) Σ_{x_i ∈ s_t} ∇log p(x_i|θ_t)) + η_t,

where η_t ∼ N(0, ε_t M).

Provided the size of the subset n is large enough, we can use the central limit theorem to approximate V(θ_t) by its empirical covariance

V(θ_t) ≈ (N²/n²) Σ_{x_i ∈ s_t} (y(x_i, θ_t) − ȳ(θ_t))(y(x_i, θ_t) − ȳ(θ_t))^T = (N²/n) V_s,   (3.13)

where y(x_i, θ_t) = ∇log p(x_i|θ_t) + (1/N) ∇log p(θ_t) and ȳ(θ_t) = (1/n) Σ_{x_i ∈ s_t} y(x_i, θ_t). From (3.13) we determine that the variance of a stochastic gradient step can be estimated by (ε_t² N²/4n) M V_s M (Welling and Teh, 2011). For the injected noise, which has variance ε_t M, to dominate, denoting the largest eigenvalue of M V_s M by λ, we require

α = (ε_t N²/4n) λ ≪ 1.

Therefore, using the fact that Fisher's information I ≈ N V_s, and that the posterior variance Σ_θ ≈ I^{-1} for large N, we can find the approximate stepsize at which the injected noise will dominate. Denoting the smallest eigenvalue of Σ_θ by λ_θ, this stepsize is given approximately by ε_t ≈ (4αn/N) λ_θ. This stepsize is generally small.

Suppose we have a sample θ_1, . . . , θ_T output from the algorithm. Since the mixing of the algorithm decelerates, standard Monte Carlo estimates will overemphasize parts of the sample where the stepsize is small. This increases the variance of the estimate, though it remains consistent. Therefore Welling and Teh (2011) suggest instead using the estimate

E(f(θ)) ≈ (Σ_{t=1}^{T} ε_t f(θ_t)) / (Σ_{t=1}^{T} ε_t),

which is also consistent.
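This stepsize-weighted estimator amounts to a weighted average of the stored draws. A tiny sketch with toy values and f taken to be the identity:

```python
import numpy as np

eps = np.array([0.100, 0.050, 0.025])     # stepsizes eps_t used at each stored draw
thetas = np.array([1.0, 2.0, 3.0])        # SGLD output theta_t (toy values)

# Weighted estimate of E[f(theta)] with f(theta) = theta:
estimate = np.sum(eps * thetas) / np.sum(eps)
# equivalently: np.average(thetas, weights=eps)
```

Early draws, taken with larger stepsizes, count for more, which compensates for the chain's slowing mixing.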

3.4.3 Further developments

A number of extensions to the original SGLD algorithm of Welling and Teh (2011) have been suggested. Ahn et al. (2012) aim to improve the mixing of the algorithm by appealing to the Bernstein-von Mises theorem. Their method samples from the Bernstein-von Mises Normal approximation to the posterior when the stepsizes are large. Fisher's information, used in the Normal approximation, is estimated from the data. When stepsizes are small, the method employs Langevin dynamics to sample from a non-Gaussian approximation to the posterior. The idea of this approach is to trade bias for computational gain, since the mixing rate of this algorithm will be higher. However theoretical guarantees for this algorithm are not well understood, and biases in the sample could be large when models are complex.

Ahn et al. (2014) propose a distributed or parallelised algorithm based on SGLD. This works by dividing batches of data across machines. SGLD is run on each worker for a number of iterations; the last observation of this trajectory is then passed to another worker, which carries on the trajectory using its own local dataset. In order to limit the amount of time spent waiting for the slowest workers, the number of iterations of SGLD run on each worker is set to depend on the speed of the worker. However the consistency results of Teh et al. (2014) may no longer hold for this method and this needs to be checked. Patterson and Teh (2013) develop an algorithm inspired by SGLD intended for models where the parameters of interest are discrete probability distributions over K items. Common models with these properties include, for example, latent Dirichlet allocation.

Sato and Nakagawa (2014) analyse the properties of SGLD with a constant stepsize and find that the algorithm is weakly convergent. Vollmer et al. (2015) note that while favourable results have been obtained for the SGLD algorithm, most assume that the stepsize decreases to zero, which is not true in practice. This motivates them to calculate biases explicitly, including their dependence on the stepsize and stochastic gradient noise. Using these results they propose a modified SGLD algorithm which reduces the bias in the original algorithm due to stochastic gradient noise.

3.5 Stochastic gradient Hamiltonian Monte Carlo

We have seen that Hamiltonian Monte Carlo provides an efficient proposal for the Metropolis-Hastings algorithm which has a high acceptance rate and explores the space rapidly. Welling and Teh (2011) propose combining Langevin Monte Carlo with stochastic optimization in order to develop a scalable MCMC algorithm which only uses a subset of the data at each iteration. However, due to the restriction of just one leapfrog step at each iteration, Langevin Monte Carlo can explore the state space inefficiently, and it would be beneficial to extend the result to enable subsampling in Hamiltonian Monte Carlo. This extension is non-trivial: Betancourt (2015) discusses how subsampling naively in Hamiltonian Monte Carlo can lead to unacceptable biases. Chen et al. (2014) discuss a potential solution to the problem by appealing to the dynamics of HMC itself, referred to as stochastic gradient Hamiltonian Monte Carlo (SGHMC). However, in doing so they make the assumption that the stochastic gradient noise, as defined in (3.10), is Gaussian. Bardenet et al. (2015) show that when this assumption is violated poor performance can result.


3.5.1 Stochastic gradient Hamiltonian Monte Carlo

SGHMC can be implemented naively by simply using a subset of the data to calculate the gradient of the potential energy ∇U at each iteration. This is considered in Chen et al. (2014). They use the same unbiased estimator of the potential energy gradient adopted by Welling and Teh (2011) and defined in (3.9).

The key assumption made by Chen et al. (2014) in developing their algorithm is that the stochastic gradient noise ν, defined in (3.10), is Normal. To argue the validity of their assumption, they appeal to the central limit theorem, though the use of this assumption needs further verification. They therefore write

∇Ũ(θ) ≈ ∇U(θ) + N(0, V(θ)),

where ∇Ũ denotes the stochastic estimate of ∇U, and V is the covariance matrix of the stochastic noise. This assumption allows Chen et al. (2014) to approximately write the dynamics of the naive approach as

dθ = M⁻¹r dt,   dr = −∇U(θ)dt + N(0, 2B dt),   (3.14)

where B = (ε/2)V(θ). For brevity we write B rather than B(θ), despite its dependence on θ.

These dynamics can then be discretized using the leapfrog method outlined in Section 3.3. Chen et al. (2014) show that the posterior distribution p(θ|x) is no longer invariant under the dynamics of (3.14). Further to this, Betancourt (2015) shows that naively subsampling in this way when implementing HMC can lead to unacceptable biases. This is due to the stochastic gradient noise now present in the dynamics. In order to try and limit this noise, Chen et al. (2014) therefore introduce a 'friction' term to the dynamics. The friction term involves adding the term −BM⁻¹r to the momentum dynamics. This leads to the full dynamics

dθ = M⁻¹r dt,   dr = −∇U(θ)dt − BM⁻¹r dt + N(0, 2B dt).

The friction term acts by reducing the energy H(θ, r), which in turn reduces the influence of the noise (Chen et al., 2014). However in practice we rarely know B analytically, and instead simply have an estimate B̂. In this case Chen et al. (2014) suggest introducing a freely chosen friction term C such that C ⪰ B̂, meaning that C − B̂ is positive semidefinite. They then introduce the following dynamics

dθ = M⁻¹r dt,
dr = −∇U(θ)dt − CM⁻¹r dt + N(0, 2(C − B̂)dt) + N(0, 2B̂ dt).   (3.15)

Chen et al. (2014) make two suggestions when discussing the value of B̂. One is to simply ignore the stochastic gradient noise and set B̂ = 0. While this is not technically correct, as the step size tends to 0 so will B, so eventually the terms involving C will dominate the dynamics. An alternative is to set B̂ = (ε/2)V̂, where V̂ is an estimate of the stochastic gradient noise found using an estimate of Fisher's information (Ahn et al., 2014).


3.5.2 Tuning SGHMC

While we now have an algorithm which does not depend on knowing the stochastic noise model B precisely, it is not obvious how to pick C when tuning the algorithm. In order to gain more insight into best practices for tuning the algorithm, Chen et al. (2014) appeal to the connection of SGHMC to stochastic optimization with momentum, outlined in Section 3.2. By setting ν = εM⁻¹r and discretizing the dynamics, we can rewrite (3.15) as

ν_{t+1} = ν_t − ε²M⁻¹∇Ũ(θ_t) − εM⁻¹Cν_t + N(0, 2ε³M⁻¹(C − B̂)M⁻¹),
θ_{t+1} = θ_t + ν_{t+1}.

Next, setting η = ε²M⁻¹, α = εM⁻¹C and β̂ = εM⁻¹B̂, we obtain

ν_{t+1} = (1 − α)ν_t − η∇Ũ(θ_t) + N(0, 2(α − β̂)η),
θ_{t+1} = θ_t + ν_{t+1}.   (3.16)

Notice the similarity between (3.2) and (3.16). In fact, when the noise is removed and C = B̂ = 0, SGHMC naturally reduces to stochastic optimization with momentum. Therefore Chen et al. (2014) suggest appealing to this similarity and choosing the constants η and α rather than the matrix C and the stepsize ε. This simplifies tuning since we can use results from the stochastic optimization with momentum literature (Sutskever et al., 2013). However it is not obvious to what extent these results are applicable to SGHMC.

Chen et al. (2014) recommend using their suggestions for B̂ when selecting β̂: either set β̂ = 0 or β̂ = ηV̂/2, where V̂ is an estimate of Fisher's information. This leaves three parameters to be tuned, which we refer to as follows: the learning rate η, the momentum decay α and the subsample size n. A key principle to keep in mind is that the stochastic gradient noise needs to be kept small, especially as N gets large. This can be done in one of two ways: using a larger subsample size n, or a smaller learning rate η. Clearly, to keep the speed of the algorithm we want to keep the subsample size small. In light of this, Chen et al. (2014) suggest keeping η small, since large values can cause the algorithm to diverge. More specifically, they suggest setting η = γ/N, where γ is some constant which we refer to as the batch learning rate, generally set to be around 0.1 or 0.01. They suggest, from previous implementations, keeping the subsample size in the hundreds, for example n = 500. Finally, by appealing to practices in SGD with momentum, they suggest setting α to be small, at around 0.01 or 0.1.

Now that we have some guidance on choosing the algorithm constants, we outline the full SGHMC procedure in Algorithm 3.

Algorithm 3: Stochastic gradient Hamiltonian Monte Carlo (SGHMC).

Input: Initial estimate θ_1, batch learning rate γ, momentum decay α, subsample size n, trajectory length L, likelihood and prior gradients ∇p(x|θ) and ∇p(θ).
Result: Approximate sample from the full posterior p(θ|x).
Set η ← γ/N
for t = 1 to T do
    Sample subset s_t from the full dataset x
    Generate new momentum variables r ∼ N(0, M)
    Reparameterise ν ← εM⁻¹r
    for l = 1 to L do
        a ∼ N(0, 2(α − β̂)η)
        ν ← (1 − α)ν − η∇Ũ(θ) + a
        θ ← θ + ν
    end
    Store θ as part of the sample
end
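A minimal sketch of Algorithm 3, written in the (η, α, β̂) parameterisation of (3.16), might look as follows (Python rather than the report's Julia; the toy Gaussian target, the helper grad_log_post_hat, and the choices β̂ = 0 and M = I are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def sghmc(theta, data, grad_log_post_hat, gamma=0.1, alpha=0.1, beta_hat=0.0,
          n=100, L=10, T=300):
    """Sketch of Algorithm 3 via update (3.16), with M = I and the stochastic
    gradient noise ignored (beta_hat = 0). grad_log_post_hat is a user-supplied
    unbiased estimate of the log-posterior gradient."""
    N = len(data)
    eta = gamma / N                      # batch learning rate heuristic eta = gamma / N
    sample = []
    for _ in range(T):
        idx = rng.choice(N, size=min(n, N), replace=False)
        # fresh momentum r ~ N(0, I), reparameterised as nu = eps * r with eps = sqrt(eta)
        nu = np.sqrt(eta) * rng.standard_normal(theta.size)
        for _ in range(L):
            noise = np.sqrt(2 * (alpha - beta_hat) * eta) * rng.standard_normal(theta.size)
            # potential U = -log posterior, hence the + sign on the gradient below
            nu = (1 - alpha) * nu + eta * grad_log_post_hat(theta, data[idx], N) + noise
            theta = theta + nu
        sample.append(theta.copy())
    return np.array(sample)

# Toy target: posterior mean of N(theta, 1) data with a vague N(0, 100) prior
data = rng.normal(loc=2.0, scale=1.0, size=500)
grad_hat = lambda th, sub, N: (N / len(sub)) * np.sum(sub - th) - th / 100.0
draws = sghmc(np.zeros(1), data, grad_hat)
```

Note how the tuning heuristics from the text appear directly: η = γ/N, a small momentum decay α, and a subsample size n in the hundreds.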

3.5.3 Discussion and extensions

This method can be seen as an extension to stochastic gradient Langevin dynamics. The key advantage the method adds is the efficient exploration of the state space inherent in Hamiltonian dynamics. However, while favourable theoretical results such as a central limit theorem and consistency have been found for the SGLD algorithm, in adding Hamiltonian dynamics to stochastic optimization Chen et al. (2014) have had to rely on assuming the stochastic gradient noise is Gaussian. Relying on this assumption can lead to arbitrarily poor performance when it is violated (Bardenet et al., 2015). Therefore the behaviour of the algorithm when simulating from complex models needs to be explored. An alternative to relying on a Gaussian noise assumption would be, rather than dispensing with an acceptance step completely, to use the results of Bardenet et al. (2015) and perform an acceptance step using only a subset of the data. Another problem with the algorithm is that there are a large number of parameters to be tuned, with few results discussing best practices for doing so. It follows that guidance for tuning the algorithm, or an SGHMC algorithm which tunes itself adaptively, should be developed.

More recently, Ma et al. (2015) proposed a general framework for producing stochastic gradient MCMC methods for which the target distribution is left invariant. They show that this proposed framework is complete, meaning that a large class of Markov processes with the desired stationary distribution can be written in terms of the framework, including HMC, SGLD and SGHMC. Using the framework, Ma et al. (2015) introduce a stochastic gradient Riemann Hamiltonian Monte Carlo (SGRHMC) algorithm. This algorithm combines the scalability of SGHMC with work by Girolami and Calderhead (2011) on adaptively tuning the mass matrix M in Hamiltonian Monte Carlo. Ma et al. (2015) identify that using the framework to choose dynamics well suited to the desired target is an important future direction.


A problem addressed by Ding et al. (2014) is that the stochastic noise B is not easy to estimate. Appealing to developments in molecular simulation, Ding et al. (2014) consider the theory of a canonical ensemble, which represents the possible states of a system in thermal equilibrium with a heat bath at fixed temperature T. The probability of states in a canonical ensemble follows the canonical distribution (3.6) outlined in Section 3.3. Ding et al. (2014) assert that a critical characteristic of the canonical ensemble is that the system temperature, defined as the mean kinetic energy, satisfies the equilibrium condition

k_B T / 2 = (1/n) E[K(r)],   (3.17)

where we have used the notation in Section 3.3.

In dynamics-based Monte Carlo methods (for example HMC), the canonical ensemble is approximated in order to generate samples. This approximation is outlined for HMC in Section 3.3. However, in order to correctly simulate from the canonical ensemble, the dynamics must maintain the thermal equilibrium condition (3.17) (Ding et al., 2014). While it can be shown that the dynamics of HMC maintain thermal equilibrium, due to the presence of the stochastic gradient noise in SGHMC, the algorithm no longer satisfies thermal equilibrium.

In order to account for this, Ding et al. (2014) introduce a thermostat, which controls the mean kinetic energy adaptively. To do this, they introduce a new variable ξ, and propose the following dynamics as an alternative to the SGHMC algorithm:

dθ = r dt,
dr = −∇Ũ(θ)dt − ξr dt + N(0, 2A dt),
dξ = ( (1/n) rᵀr − 1 ) dt,

where A is a constant to be chosen. Due to its similarity to the Nosé-Hoover thermostat in statistical physics, Ding et al. (2014) refer to this algorithm as the stochastic gradient Nosé-Hoover thermostat (SGNHT). They also introduce a more general method which is able to handle non-isotropic noise from ∇Ũ(θ). However, discretizing this system introduces biases which need to be studied.
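As a rough illustration, the SGNHT dynamics can be discretised with a naive Euler scheme (a sketch only: the stepsize, the discretisation and the artificially noisy gradient in the toy example are our own assumptions, and as just noted the discretisation itself introduces bias):

```python
import numpy as np

rng = np.random.default_rng(2)

def sgnht(theta, grad_log_post_hat, A=0.01, eps=0.01, T=20000):
    """Naive Euler discretisation of the SGNHT dynamics above.
    grad_log_post_hat returns a noisy estimate of the log-posterior gradient
    (= -grad U); the thermostat xi adapts to the unknown gradient noise."""
    d = theta.size
    r = rng.standard_normal(d)   # momentum
    xi = A                       # thermostat variable
    draws = np.empty((T, d))
    for i in range(T):
        r = (r + eps * grad_log_post_hat(theta) - eps * xi * r
             + np.sqrt(2 * A * eps) * rng.standard_normal(d))
        theta = theta + eps * r
        xi = xi + eps * (r @ r / d - 1.0)  # drive mean kinetic energy to its target
        draws[i] = theta
    return draws

# Toy target N(0, 1), with artificial Gaussian noise added to the gradient
noisy_grad = lambda th: -th + 0.5 * rng.standard_normal(th.size)
draws = sgnht(np.zeros(1), noisy_grad)
```

The interesting feature is that the friction ξ is not fixed in advance: it grows whenever the kinetic energy exceeds its equilibrium value, soaking up the unknown gradient noise.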

3.6 Conclusion

In this section we outlined stochastic gradient Monte Carlo. First we discussed stochastic optimization, a technique used to estimate the mode of the posterior distribution while using only a subset of the data at each iteration. A downside of this method is that it only produces a point estimate of θ, which can often lead to overfitting. Hamiltonian Monte Carlo, a method used to produce efficient proposals with high acceptance rates for a Metropolis-Hastings algorithm, was then discussed, along with complexities in tuning the algorithm.

With the machinery in place we were able to introduce stochastic gradient Langevin dynamics, which combines stochastic optimization with posterior sampling. The algorithm aims to improve upon the overfitting issues of stochastic optimization. However, the algorithm mixes slowly for two reasons. One is its decreasing stepsize; the other is that Langevin dynamics does not explore the state space efficiently, since it takes just one step at a time. The optimal convergence speed, along with results proving consistency and a central limit theorem, have been found for the algorithm. An optimal scaling result would be an important addition to this work.

Stochastic gradient Hamiltonian Monte Carlo aims to resolve the mixing issues of SGLD by extending the algorithm to use general Hamiltonian dynamics in order to approximately sample from the posterior. This in turn means that the parameter space is explored more efficiently. However, in doing so Chen et al. (2014) assumed that the stochastic gradient noise is Gaussian, and the effects of this assumption for complex models are not clear. The method relies on a number of parameters, and results on best tuning practices for these are limited.

4 Simulation study

4.1 Introduction

4.1.1 Overview

In this section, we investigate the performance of the discussed methods in various scenarios. The study is divided into two main parts. The first compares the performance of the batch methods; the second compares the performance of the stochastic gradient methods. The batch methods were generally coded from scratch using R, though some methods were implemented using the parallelMCMCcombine package developed by Miroshnikov and Conlon (2015). The stochastic gradient methods were all implemented from scratch using Julia. The reason for using Julia for the stochastic methods is its speed in iterative tasks where vectorization is not possible. Implementations of each method are available on GitHub: https://goo.gl/9ZGHP2.

In each study the target is a multivariate t-distribution with location θ, scale Σ and degrees of freedom ν. In all cases Σ and ν are assumed known, and the algorithms are used to estimate θ. Suppose the target has dimension d; then the true values of the parameters θ, Σ and ν are as follows:

θ = (0, . . . , 0)ᵀ,   Σ = diag{5, . . . , 5},   ν = 3,

where θ is a vector of length d and Σ is a d × d diagonal matrix. Since the model we use to test the methods is relatively simple, the number of observations simulated from the target is kept small. The idea behind this is to emulate more complex models where the proportion of parameters to estimate against the number of observations is high. When we are not investigating the dimensionality of the target, its dimension is kept at 2.
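For concreteness, observations from this target can be simulated using the standard normal/chi-squared scale-mixture representation of the multivariate t (a Python sketch; the report's own implementations are in R and Julia):

```python
import numpy as np

rng = np.random.default_rng(3)

def rmvt(size, theta, Sigma, nu):
    """Draw `size` observations from a multivariate t with location theta,
    scale Sigma and nu degrees of freedom, via x = theta + z * sqrt(nu / w)
    with z ~ N(0, Sigma) and w ~ chi-squared(nu)."""
    d = len(theta)
    z = rng.multivariate_normal(np.zeros(d), Sigma, size=size)
    w = rng.chisquare(nu, size=size)
    return theta + z * np.sqrt(nu / w)[:, None]

# The target used in the study, in dimension 2
x = rmvt(1000, theta=np.zeros(2), Sigma=np.diag([5.0, 5.0]), nu=3)
```

With ν = 3 the target has heavy tails (no finite third moment), which is what makes Gaussian approximations to the posterior interesting to stress-test.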

The batch methods section compares the performance of the algorithms as the following scenarios change:
• The choice of bandwidth (nonparametric/semiparametric);
• The number of observations;
• The size of the batches;
• The dimension of the target.

The stochastic gradient methods section compares performances for the following scenarios:
• The subsample size;
• The number of observations;
• The dimension of the target.

The choice of bandwidth aims to compare the nonparametric/semiparametric methods' sensitivity to the bandwidth choice, and to see if there appears to be a clear optimal bandwidth for the methods under this scenario. When studying how the algorithms behave as the number of observations varies, we are particularly interested in the behaviour of the parametric methods as the Bernstein-von Mises theorem no longer holds for the full posterior. Similarly, as the size of each batch varies, we are interested in how the parametric methods behave as the Bernstein-von Mises theorem no longer holds for each subposterior. Since the number of observations is fixed when the batch size is investigated, the number of batches also varies in this investigation. We can therefore examine how the semiparametric/nonparametric methods behave as the number of batches gets large.

The subsample size investigation will allow us to examine whether there is a point at which the performance improvement from using a larger subsample size is offset by the extra computational cost. Finally, when comparing how the different methods perform as dimension increases, we will be particularly interested in the semiparametric/nonparametric methods, since KDE is known to scale poorly with dimension.

4.1.2 Performance summaries

The performance of the algorithms is compared by calculating the Kullback-Leibler divergence between the empirical distribution of the approximate sample and the empirical distribution of a sample obtained from a standard MH algorithm. The KL divergence is a measure of distance between two distributions. Given two continuous distributions P and Q with densities p and q, the Kullback-Leibler divergence D_KL(P||Q) is defined to be

D_KL(P||Q) = ∫_{−∞}^{∞} p(x) log( p(x)/q(x) ) dx.

Clearly, since we only have samples from each distribution, the KL divergence needs to be approximated. We estimate it using a method introduced by Boltz et al. (2007) which is implemented in the R package FNN (Beygelzimer et al., 2013). The method works by calculating an empirical density function using k-nearest neighbours, and then comparing the KL divergence of these empirical densities. For reference, the estimated KL divergence between two samples from the same posterior calculated using an MH sampler was 7 × 10⁻⁴.
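To illustrate the idea, a simple 1-nearest-neighbour KL estimator of the same flavour can be written as follows (a generic nearest-neighbour construction, not the exact Boltz et al. (2007) method implemented in FNN):

```python
import numpy as np
from scipy.spatial import cKDTree

def kl_knn(p_sample, q_sample):
    """1-nearest-neighbour estimate of KL(P || Q) from samples.
    p_sample: (n, d) draws from P; q_sample: (m, d) draws from Q."""
    p_sample = np.atleast_2d(p_sample)
    q_sample = np.atleast_2d(q_sample)
    n, d = p_sample.shape
    m = q_sample.shape[0]
    # rho_i: distance from x_i to its nearest OTHER point in the P sample
    rho = cKDTree(p_sample).query(p_sample, k=2)[0][:, 1]
    # nu_i: distance from x_i to its nearest point in the Q sample
    nu = cKDTree(q_sample).query(p_sample, k=1)[0]
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

rng = np.random.default_rng(4)
a = rng.normal(size=(2000, 1))           # draws from P = N(0, 1)
b = rng.normal(loc=3.0, size=(2000, 1))  # draws from Q = N(3, 1)
# true KL(P || Q) = 3^2 / 2 = 4.5; the estimate should be of that order
```

The nearest-neighbour distances act as implicit local density estimates, which is the same principle behind the k-NN empirical densities described above.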

In each case, the KL divergence between a sample from the full posterior found using Metropolis-Hastings and a Laplace approximation calculated using stochastic optimization, which we refer to as a stochastic approximation, is plotted for comparative purposes. This serves as a baseline method, so we are particularly interested in the performance of the algorithms when the performance of the stochastic approximation is bad.


4.2 Batch methods

4.2.1 Choice of bandwidth

First we look at the effect the choice of bandwidth has on the quality of the sample obtained from the nonparametric and semiparametric methods. In this case we are interested in the effect that annealing the bandwidth has on the performance of the methods. We also wish to determine the sensitivity of each algorithm to the choice of bandwidth.

[Figure 1 appears here: KL divergence (y-axis, 0-3) against the ratio of bandwidth to the largest batch standard deviation (x-axis, 1-4), with separate panels for annealed and non-annealed bandwidths and lines for the methods Npara, Semipara and Semipara_full.]

Figure 1: Plots of the KL divergence from a standard MH sample for different batch MCMC methods against the ratio of bandwidth to the largest batch standard deviation.

1000 observations are simulated from the target distribution, which are then allocated randomly across 50 batches. 10^4 iterations of MCMC are run separately for the observations in each batch, and these samples are then combined using various nonparametric and semiparametric algorithms. The results are plotted in Figure 1. The plot compares three methods, abbreviated as follows:
• Npara: the nonparametric recombination method introduced in Section 2.5;
• Semipara_full: the semiparametric method introduced in Section 2.6;


• Semipara: the alternative semiparametric method introduced in Section 2.6. This method uses the same mixture weights as the nonparametric method, which increases the acceptance rate of the MCMC.

Annealed indicates that the bandwidth is set to tend closer to zero at each iteration, as opposed to using a constant bandwidth. In the case where the bandwidth is annealed, the x-axis indicates the ratio of the bandwidth at the first iteration of the algorithm to the maximum standard deviation of a batch.

Notice first that the performance of the nonparametric method shows considerable sensitivity to bandwidth choice, while the semiparametric methods only show sensitivity if the bandwidth is set too small. This sensitivity could be due to two things: the MCMC getting stuck in small modes when the bandwidth is too small, or the supports of each batch not overlapping well. To examine this further, we pick a bandwidth whose ratio to the maximum standard deviation of a batch is 0.4. We implement the Npara method in two ways. First, we implement it in the standard way, simulating from the mixture using MCMC. Second, we fit a KDE to each batch at a grid of points and then take the product of these estimates; we refer to this as the grid method. In this case we find that the problem is the MCMC getting stuck, since the grid method performs quite well at low bandwidths. At higher bandwidths, however, the grid method exhibits a similar poor fit to the standard method.
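The grid method just described can be sketched as follows for a one-dimensional parameter (an illustrative reconstruction with a Gaussian kernel, not the report's code):

```python
import numpy as np

def grid_product_kde(batches, grid, h):
    """Evaluate a Gaussian KDE for each batch's MCMC sample on a common grid,
    multiply the subposterior density estimates, and renormalise on the grid."""
    log_prod = np.zeros_like(grid)
    for sample in batches:
        # unnormalised Gaussian KDE for one batch, evaluated at every grid point
        dens = np.exp(-0.5 * ((grid[:, None] - sample[None, :]) / h) ** 2).mean(axis=1)
        log_prod += np.log(dens)
    dens = np.exp(log_prod - log_prod.max())    # guard against underflow
    return dens / (dens.sum() * (grid[1] - grid[0]))

# Two toy subposteriors N(1, 1) and N(-1, 1): their product is proportional to N(0, 1/2)
rng = np.random.default_rng(6)
grid = np.linspace(-4.0, 4.0, 401)
batches = [rng.normal(1.0, 1.0, 1000), rng.normal(-1.0, 1.0, 1000)]
dens = grid_product_kde(batches, grid, h=0.3)
```

Working on a fixed grid avoids running MCMC on the mixture altogether, which is why it separates MCMC mixing problems from genuine KDE misfit in the diagnosis above.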

When the bandwidth is annealed, the optimal bandwidth appears to be about 1.2 standard deviations (sds) for the nonparametric method, while it is about 0.7 sds when it is not. Other examples were tried and similar values for the optimal bandwidth were found.

The performance of the semiparametric methods seems quite similar, except at low bandwidths, when the semiparametric method using nonparametric weights seems to outperform the standard method. This is probably due to low MCMC acceptance for the method using the standard weights. The optimal bandwidth for the semiparametric methods seems to be about 1.5 sds when the bandwidth is not annealed and 2 sds when it is. There appears to be no harm in assigning a bandwidth that is quite large when using the semiparametric methods.

4.2.2 Number of observations

We investigate how the number of observations in each batch affects the performance of the different algorithms. We have particular interest in how the parametric methods perform as the Bernstein-von Mises theorem no longer applies and the posterior moves away from Gaussianity. This time the number of observations that we simulate from the target varies, and these are divided into 10 batches. 10^4 iterations of MCMC are run on each batch and samples are combined using the following batch methods:
• NeisNonparaAn: nonparametric method with annealing;
• NeisPara: parametric method introduced in Section 2.4;
• NeisSemiparaAn: semiparametric method with annealing, using nonparametric weights;
• NeisSemiparaFullAn: semiparametric method with annealing;
• Scott: parametric method introduced in Section 2.4;
• stochastic_approx: sample from a stochastic approximation.
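The idea behind the parametric methods can be sketched as a precision-weighted product of Gaussian approximations to each subposterior (an illustrative sketch of the general construction, not the report's exact implementation of Scott or NeisPara):

```python
import numpy as np

def combine_parametric(subposterior_samples):
    """Fit a Gaussian to each subposterior's MCMC sample and form the product
    of those Gaussians: the combined precision is the sum of the batch
    precisions, and the combined mean is the precision-weighted mean.
    subposterior_samples: list of (iterations, d) arrays."""
    precisions, weighted_means = [], []
    for s in subposterior_samples:
        cov = np.atleast_2d(np.cov(s, rowvar=False))
        prec = np.linalg.inv(cov)
        precisions.append(prec)
        weighted_means.append(prec @ s.mean(axis=0))
    full_cov = np.linalg.inv(sum(precisions))
    full_mean = full_cov @ sum(weighted_means)
    return full_mean, full_cov

# Two toy "subposterior" samples centred at (1, 1) and (-1, -1) with unit covariance;
# their Gaussian product is centred at the origin with covariance 0.5 I
rng = np.random.default_rng(7)
sub1 = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=5000)
sub2 = rng.multivariate_normal([-1.0, -1.0], np.eye(2), size=5000)
mean, cov = combine_parametric([sub1, sub2])
```

This makes clear why the Bernstein-von Mises theorem underpins the parametric methods: the combination is exact only when each subposterior really is Gaussian.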


Since it was found that annealing does not affect the performance of the algorithms too much, provided a good value is chosen for the bandwidth, only the annealed methods are included in the plot.

[Figure 2 appears here: KL divergence (y-axis, 0-1) against the total number of observations (x-axis, 50-200), with lines for NeisNonparaAn, NeisPara, NeisSemiparaAn, NeisSemiparaFullAn, Scott and stochastic_approx.]

Figure 2: Plots of the KL divergence from a standard MH sample for different batch MCMC methods against the total number of observations to be divided among 10 batches.

In this example, the best performance is attained by the parametric methods Scott and NeisPara. These methods even perform reasonably well when the stochastic approximation to the true posterior is poor. This comes as a surprise, since the theoretical justification for these methods relies on the Bernstein-von Mises theorem. The methods are only adversely affected in the extreme case of just 10 observations, so one observation per batch.

The worst performance in this example is by NeisSemiparaAn, whose KL divergence is very high for a small number of observations. Since there are only a few observations per batch in these cases, the batch means are very different, and the semiparametric method using the standard weights seems to get stuck at components far away from the true mean, probably because of its low acceptance rate.

Figure 2 shows that the stochastic optimization method performs poorly when there are few observations. This is because the method struggles to get close to the true mode of the posterior, suggesting the presence of multiple modes. This is good news for the parametric methods, as it was questionable how they would perform in the presence of multiple modes. All the methods except NeisSemiparaFullAn appear to outperform a stochastic approximation when there are a small number of observations. However, as n gets larger, the approximation begins to outperform the methods NeisNonparaAn and NeisSemiparaAn.

NeisSemiparaAn appears to perform somewhat worse than the other methods across the board. This is possibly due to using weights in the MCMC that are only asymptotically valid as h → 0. Since the methods are working with only a few observations, the bandwidth h is probably quite high in this case.

4.2.3 Batch size

We investigate how the size of each batch affects the performance of the different algorithms. 800 observations are simulated from the target; the data is then divided up randomly into sets of different numbers of batches. 10^4 iterations of MCMC are run on each batch and the samples are combined using the discussed batch methods. The results are plotted in Figure 3.

[Figure 3 appears here: KL divergence (y-axis, 0-0.5) against batch size (x-axis, 0-100), with lines for NeisNonparaAn, NeisPara, NeisSemiparaAn, NeisSemiparaFullAn, Scott and stochastic_approx.]

Figure 3: Plots of the KL divergence from a standard MH sample for different batch MCMC methods against different batch sizes.

Figure 3 shows that the semiparametric and parametric methods perform well across a variety of batch sizes. This again is somewhat surprising, since for non-Gaussian targets the parametric methods are only theoretically justified as the size of each batch tends to infinity. The fact that an approximation using stochastic optimization performs well also suggests it may be instructive to try a more complex example to see how the methods perform then. The nonparametric method performs very poorly when batch sizes are small. Comparing to the grid method, this is mainly due to inefficient MCMC, probably because the number of mixture components is large when there are a lot of batches. The grid approximation to the full posterior is not perfect, however, perhaps because the subposterior supports do not overlap well when batch sizes are small.

4.2.4 Dimensionality

We investigate how the dimensionality of the target affects the performance of the different algorithms. We simulate 800 observations from the target; the data is then divided up randomly into 40 batches, each with 20 observations. 10^4 iterations of MCMC are run on each batch and the samples are combined using the batch methods. The results are plotted in Figure 4.

[Figure 4 appears here: KL divergence (y-axis, 0-10) against the dimension of the target (x-axis, 5-20), with lines for NeisNonparaAn, NeisPara, NeisSemiparaAn, NeisSemiparaFullAn, Scott and stochastic_approx.]

Figure 4: Plots of the KL divergence from a sample obtained using a standard MH algorithm for different batch MCMC methods against the dimension of the target.

Once again, the best performance in this example is attained by the parametric methods Scott and NeisPara, despite the stochastic approximation performing badly in high dimensions. It was found that in high-dimensional cases the stochastic approximation performed poorly because the approximation of the posterior variance was inadequate. This is in contrast with the case of a low number of observations, where the approximation struggled to find the mode.

As expected, the nonparametric method based on kernel density estimation performs very badly in high dimensions. This will be due mainly to the poor performance of KDE in high dimensions. The efficiency of the nonparametric method might be improved by choosing a kernel with heavier tails. To a lesser extent, both semiparametric methods perform progressively worse as the dimensionality increases. Again this is probably due to the known poor performance of kernel density estimation in high dimensions.

4.3 Stochastic gradient methods

4.3.1 Tuning the algorithms

In the case of stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Hamiltonian Monte Carlo (SGHMC), we have certain free constants we may choose, and we need to make good choices for these constants in order for the algorithms to sample closely from the posterior distribution. In the case of the SGLD algorithm, as discussed earlier, Welling and Teh (2011) recommend keeping the subsample size n in the hundreds. The effect of a change in subsample size is investigated later. On the recommendations of Teh


et al. (2014), discussed in Section 3.4, we set the stepsize at iteration t to be of the form (a/N)(1 + t)^{-1/3}, where N is the total number of observations. Therefore the only constant that needs to be chosen when tuning the algorithm is a.

There is little in the literature on the choice of this constant; however, we had the luxury of being able to run a standard MH sampler on our chosen examples. Accordingly, we chose this constant by running the algorithm with a number of different values of a, and choosing the value of a which minimised the KL divergence between the SGLD sample and the sample from a standard MH sampler. Empirically, the factor which had the most effect on the choice of a was the number of observations. The subsample size n appeared to have some effect on the choice of a, while dimensionality appeared to have the least effect.
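To make this concrete, a stripped-down SGLD sampler with a stepsize decaying at this rate can be sketched as follows. The Gaussian toy target, the flat prior, the 1/N scaling of the stepsize and the default constants here are all illustrative assumptions, not the tuned settings from the study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy target: N observations from N(theta, 1) with a flat
# prior, so the gradient of the log-posterior is sum_i (x_i - theta).
N = 10_000
data = rng.normal(2.0, 1.0, size=N)

def sgld(data, a=1.0, n=200, iters=5000):
    """SGLD with a stepsize decaying as (1 + t)^(-1/3); the overall 1/N
    scaling (so that the O(N) gradient produces O(1) moves) and the
    default value of a are illustrative choices."""
    N = len(data)
    theta = 0.0
    samples = np.empty(iters)
    for t in range(iters):
        eps = (a / N) * (1.0 + t) ** (-1.0 / 3.0)
        batch = data[rng.integers(0, N, size=n)]  # subsample with replacement
        grad = (N / n) * np.sum(batch - theta)    # unbiased gradient estimate
        theta += 0.5 * eps * grad + rng.normal(0.0, np.sqrt(eps))
        samples[t] = theta
    return samples

samples = sgld(data)
```

The posterior here is approximately N(x̄, 1/N), so the chain should settle around the sample mean of the data.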

In the case of the SGHMC algorithm, as recommended by Chen et al. (2014), we appealed to its connection with stochastic gradient descent with momentum. We therefore reparameterised in terms of a learning rate η, a momentum decay α, the subsample size n and the trajectory length L. It was found that, provided it was set relatively high, the trajectory length L had limited effect on the quality of the sample, so we fixed L = 10. We set η = γ/N and tuned γ, known as the batch learning rate.
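A sketch of the SGHMC update in this SGD-with-momentum parameterisation (following the momentum form in Chen et al. (2014), with the friction noise set from α and η; the toy target and constant values are illustrative, not the tuned settings from the study):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative toy target: N observations from N(theta, 1), flat prior.
N = 10_000
data = rng.normal(2.0, 1.0, size=N)

def sghmc(data, gamma=0.5, alpha=0.1, L=10, n=200, iters=2000):
    """SGHMC in the SGD-with-momentum parameterisation: learning rate
    eta = gamma / N, momentum decay alpha, trajectory length L and
    subsample size n. A sketch, not the tuned version from the study."""
    Nobs = len(data)
    eta = gamma / Nobs
    theta, v = 0.0, 0.0
    samples = np.empty(iters)
    for it in range(iters):
        for _ in range(L):
            batch = data[rng.integers(0, Nobs, size=n)]
            grad_U = -(Nobs / n) * np.sum(batch - theta)  # minus grad log-posterior
            # Momentum update with friction (alpha) and compensating noise.
            v = (1.0 - alpha) * v - eta * grad_U \
                + rng.normal(0.0, np.sqrt(2.0 * alpha * eta))
            theta += v
        samples[it] = theta
    return samples

samples = sghmc(data)
```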

In their paper, Chen et al. (2014) recommend setting α quite small, to 0.1 or 0.01, and setting γ to 0.1 or 0.01 too. We checked this recommendation by varying α and γ and finding the constant choices which minimised the KL divergence between the resulting samples and a sample from a standard MH sampler. We found that for our examples the recommended choices for γ and α were not the best in general. In fact it was often appropriate to choose a value for α in the interval [1, 2] and for γ in the interval [0, 10]. Chen et al. (2014) recommend a subsample size n of approximately 500, and we explore the effect of choosing different subsample sizes later in the section.

4.3.2 Subsample size

In this investigation we compared the effect of different subsample sizes on sample quality. Our interest is whether there is a clear point at which obtaining a larger subsample is no longer worth the increase in computational expense. 10^5 observations were simulated from the target distribution. After a burn-in of 10^3 iterations, the SGLD and SGHMC algorithms were run for 10^4 steps with different subsample sizes n. For comparison, a Normal approximation calculated using stochastic optimization with the same subsample sizes is also included. The results are plotted in Figure 5.
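The KL divergences used for these comparisons are estimated from the two samples. One standard sample-based estimator is the nearest-neighbour estimator, in the spirit of Boltz et al. (2007) and the FNN package cited in the references; whether this matches the exact estimator used in the study is an assumption. A brute-force sketch:

```python
import numpy as np

def kl_knn(x, y):
    """Nearest-neighbour estimate of KL(p || q) from samples x ~ p and
    y ~ q. A sketch of the standard 1-NN estimator; pairwise distances
    are computed by brute force, which is fine for modest sample sizes."""
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)
    n, d = x.shape
    m = len(y)
    # Nearest-neighbour distance of each x_i among the other x's...
    dxx = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dxx, np.inf)  # exclude each point itself
    rho = dxx.min(axis=1)
    # ...and among the y's.
    dxy = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    nu = dxy.min(axis=1)
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))
```

An estimate near zero indicates the two samples are hard to distinguish; larger values indicate the approximate sampler has drifted from the MH benchmark.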

Something immediately obvious from Figure 5 is that there is a clear point at which larger subsample sizes provide limited extra value. In this case a subsample size of about 200–250 appears optimal for both methods. Notice that for small subsample sizes the performance of the SGLD algorithm appears to be somewhat poorer than that of SGHMC; in some cases it even performs worse than a stochastic approximation. This may be due to the large stochastic gradient noise when the subsample size is this small. In cases where the stochastic gradient noise is very large, it may be that the injected noise never has a chance to dominate the sampling, so that the posterior sample is not of good quality. While the same argument might be made for SGHMC, the presence of its friction term may be the reason it is less affected.


[Figure: KL divergence (0–0.25) plotted against subsample size (0–1000) for the methods sghmc, sgld and stochastic_approx.]

Figure 5: Plots of the KL divergence from a standard MH sample for different stochastic gradient Monte Carlo methods against different subsample sizes.

Both algorithms outperform the stochastic approximations for reasonable subsample sizes, even though the number of observations is high. Despite having no theoretical guarantees, SGHMC also generally appears to outperform SGLD. However, this comes at the cost of a slower run time and more constants to tune. In general it was found that SGHMC runs a little less than L times slower than SGLD, where L is the length of the trajectory. In this example, the best values for a when applying the SGLD algorithm ranged from 0.8, for a small subsample size of 10, to 40, for a larger subsample size of 1000. This suggests that as the subsample size increases relative to the number of observations, we are able to increase the stepsize and so make more confident moves.

The best values of α for the SGHMC algorithm ranged from 2 for small values of n to 0.5 for large values. The best values found for γ ranged from 6 for small values of n to 0.5 for large values. While Chen et al. (2014) suggest that as the number of observations N gets bigger, we can either set a small learning rate η or use a larger subsample size n, this example appears to suggest differently. As n increases while keeping the number of observations fixed, we would expect η to increase to compensate. However, in this case the opposite occurs, and the best constant choices appear to be to decrease η as we also decrease α.

4.3.3 Number of observations

We compare the effect of different numbers of observations on the performance of each of the stochastic gradient methods. Contrasting numbers of observations were simulated from the target distribution and, after a burn-in of 10^3 iterations, the SGLD and SGHMC algorithms were run for 10^4 iterations. In all cases the subsample size was fixed at 10; while this is small, it allowed us to test the methods at a very low number of observations. Results are plotted in Figure 6. The results for the batch method Scott, which performed particularly well, are also plotted for comparison.


[Figure: KL divergence (0–0.5) plotted against number of observations (50–200) for the methods Scott, sghmc, sgld and stochastic_approx.]

Figure 6: Plots of the KL divergence from a standard MH sample for different stochastic gradient Monte Carlo methods against different numbers of observations.

Notice that the sample quality of the SGLD and SGHMC algorithms is not affected by an excessively small number of observations in the way the batch methods are. The performance of the SGHMC algorithm in these cases is exceptional, with an estimated KL divergence of the order 0.001, outperforming all the batch methods. For a larger number of observations, the performance of the SGLD algorithm appears rather poor, as it is outperformed by the stochastic approximation and many of the batch methods. This poor performance is probably due to the excessively small subsample size of 10; as we saw in Section 4.3.2, the method is less resilient to large stochastic gradient noise than SGHMC.

The best choices for a when applying the SGLD algorithm in this case were found to range from 200 for small numbers of observations to 50 for larger numbers. Therefore in this case we find that as we increase the number of observations, while keeping the subsample size fixed, the corresponding stepsize should decrease. This occurs because as we increase the number of observations the stochastic gradient noise will increase. We therefore want to make less confident moves to compensate for this.

When applying the SGHMC algorithm, however, the best values for γ and α appeared to show no trend with the number of observations. That being said, the best values found were quite variable in magnitude: values for γ ranged from 0.01 to 6, while values for α ranged from 0.01 to 1.9. Choosing different parameters from these generally led to an estimated KL divergence of the order 0.1 rather than 0.001. This suggests that tuning this algorithm may be somewhat of a fine art, and that finding good rules for choosing γ and α will probably be difficult.

4.3.4 Dimensionality

We investigate how the dimensionality of the target affects the performance of the different stochastic gradient algorithms. We simulate 800 observations from the target distribution with varying dimensionality d. SGLD and SGHMC are then used to approximately sample


from the posterior with fixed subsample size n = 200. The results are plotted in Figure 7. Again the KL divergence of the batch method Scott is plotted for comparison.

[Figure: KL divergence (0–2) plotted against the dimension of the target (5–20) for the methods Scott, sghmc, sgld and stochastic_approx.]

Figure 7: Plots of the KL divergence from a sample obtained using a standard MH algorithm for different stochastic gradient Monte Carlo methods against the dimension of the target.

Once again the SGHMC algorithm substantially outperforms the SGLD algorithm at all dimensions. Both methods perform relatively well at high dimensions. At high dimensions, the SGLD algorithm performs about on par with the NeisPara algorithm, and slightly worse than the algorithm of Scott, while the SGHMC algorithm again appears to perform the best of all the methods considered in the investigation. The SGLD algorithm worsens with dimensionality quite quickly at first, but this then levels off.

The best values for a when applying the SGLD algorithm in this example ranged from 90 at low dimensions to 70 at higher dimensions. This suggests that at higher dimensions a slightly lower stepsize may be required in order for the algorithm to explore the space most effectively. Once again we find little trend in the best choices for α and γ with dimension: values for γ range from 0.1 to 9, while values for α range from 0.1 to 2.

4.4 Conclusion

We find that the parametric methods are surprisingly robust to a variety of scenarios for our relatively simple model, including small batch sizes, high dimensionality and a small number of observations. This encourages us to explore the methods' properties in more complex, multimodal models. The nonparametric and semiparametric methods did not perform so well. Many of these issues seemed to be the result of the MCMC simulation getting stuck, so more efficient ways of combining the subposteriors, or of simulating from the resulting mixture, are required. The nonparametric method was particularly poor at high dimensions; this may be improved by investigating heavier-tailed kernels.


Stochastic gradient methods seemed to be robust to a variety of scenarios. However, considerable time was spent tuning these algorithms, even with an MH sample of the full posterior available. Therefore guidance on tuning these methods is required. It was found that SGHMC was just under L times slower to run than SGLD, where L is the trajectory length. With three constants to tune, compared to one for SGLD, SGHMC also took longer to tune than SGLD. However, its performance was generally much better.

5 Future Work

5.1 Introduction

As this is a new field, avenues for research are growing rapidly. All of the methods mentioned in the report are relatively recent, so in many cases their theoretical properties have not been adequately explored. Biases often exist in the methods which might be improved upon; alternatively, the methods have flaws which need to be addressed. In this section we outline three areas of open research that have come to light from the review and simulation study, and that I am likely to pursue as part of the PhD project.

5.2 Further comparison of batch methods

Most of the batch methods either relied on potentially quite restrictive assumptions, or had undesirable properties. While the simulation study shed some light on the practical performance of the algorithms, the model used in the testing was relatively simple, so issues may have been missed. Moreover, the simulation study itself opened up a few more questions which it is important to answer. This leads us to our first area of research, which is to provide a more comprehensive comparison of the various batch methods. For the rest of this section, we outline what we are looking for in particular.

Parametric recombination methods relied heavily on the assumption of the Bernstein-von Mises theorem, a central limit theorem for Bayesian statistics. In particular, both methods are only exact when each subposterior is Gaussian, or as the size of each batch tends to infinity. This immediately raises the question of when this assumption is valid. The simulation study showed that the two parametric algorithms were surprisingly robust to a number of different scenarios. This included cases where the number of observations was small compared to the dimensionality of the target, a commonly occurring scenario in practice. Of particular interest was the fact that the methods would regularly outperform the stochastic approximation. There is no underlying theory to suggest why this might be, so working to develop theoretical results of this sort would be a valuable contribution.

The strong performance of the parametric methods may be a result of the relatively simple model being used for the comparison. Therefore an investigation into the performance of parametric methods when they are used to train a more complex target is required. The desire is for this to be a standard model from the machine learning literature which exhibits multimodality. Some common models which fit these properties include Bayesian logistic regression, neural networks and latent Dirichlet allocation.


Nonparametric and semiparametric recombination methods suffer from a number of disadvantages, which came to light during the simulation studies. The nonparametric algorithm performed poorly at high dimensions and small batch sizes, and was sensitive to bandwidth choice. Semiparametric methods performed poorly at high dimensions and when there were a small number of observations.

Many issues with the algorithms were a result of the MCMC used to simulate from the Gaussian mixture getting stuck. Work on recombining the subposteriors more efficiently is needed. As outlined in Section 2.3, efficient methods for simulating from Gaussian mixtures have been developed (Ihler et al., 2004; Rudoy and Wolfe, 2007). Potential improvements to the posterior estimate from using these methods to simulate from the Gaussian mixture could be studied.

By changing the kernel used in the KDE to be heavier tailed, the algorithms may work more effectively at high dimensions and at smaller batch sizes, since the subposterior densities are more likely to overlap. Implementing a different kernel will require an alternative strategy when recombining the density estimates: the fact that the product of two Gaussian densities is itself a Gaussian density is key to the nonparametric method introduced by Neiswanger et al. (2013).
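For completeness, the identity in question (see e.g. Wu (2004)) can be written out: the product of two univariate Gaussian densities is, up to a normalising constant, another Gaussian density with precision-weighted parameters,

```latex
\mathcal{N}(x;\mu_1,\sigma_1^2)\,\mathcal{N}(x;\mu_2,\sigma_2^2)
  \propto \mathcal{N}(x;\mu,\sigma^2),
\qquad
\sigma^2 = \left(\sigma_1^{-2} + \sigma_2^{-2}\right)^{-1},
\quad
\mu = \sigma^2\left(\frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}\right).
```

No comparable closed form exists for, say, products of Student-t densities, which is why swapping in a heavier-tailed kernel requires a new recombination strategy.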

Results regarding the choice of bandwidth when estimating each subposterior could be a useful direction. The estimation of the bandwidth of each subposterior is a balance between the precision of the estimate and the amount of subposterior overlap. Reviewing the methods' performance when targets are multimodal would also be useful; one might expect these methods to perform better than the parametric methods in this case.

To summarise, we aim to compare the methods on more complex models, with a particular interest in multimodal targets. We plan to review the performance of the nonparametric and semiparametric methods when heavier-tailed kernels are used. Theoretical results on the performance of the parametric methods compared with a stochastic approximation could be developed. Finally, developing more efficient ways to simulate from the nonparametric approximation to the posterior could be an area to pursue.

5.3 Tuning guidance for stochastic gradient methods

A particularly non-trivial part of implementing the stochastic gradient Monte Carlo methods SGLD and SGHMC was tuning them. The absence of an acceptance step means that tuning cannot simply be performed by looking at traceplots or by checking the acceptance rate. While we had the luxury of comparing with a sample produced using Metropolis-Hastings, in general this will clearly not be the case. While Teh et al. (2014) have found optimal convergence rates for SGLD, there are no results on the choice of the constant a. Similarly, there have been no results on good choices of α and γ for SGHMC. Therefore both methods require guidance on tuning.

Optimal scaling results for either algorithm would be useful. In order to find optimal scaling results for MCMC algorithms, a relevant measure of efficiency is required. In Roberts et al. (2001), the optimal scaling of various MH algorithms is reviewed. This includes the Metropolis adjusted Langevin algorithm (MALA), which uses the same proposal as SGLD, though without the decreasing stepsize. However, the measure of efficiency used in this case


is the reciprocal of the integrated autocorrelation time, which is found to be related to the acceptance rate. This measure is probably not applicable in the case of the SGLD algorithm, since there is no acceptance rate. A move which would normally be rejected by a MALA algorithm, and which would therefore produce high autocorrelation between the current state and the next, would simply be an overconfident move for the SGLD algorithm, resulting in low autocorrelation between the new and current states.
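For reference, the integrated autocorrelation time underlying this efficiency measure can be estimated from a single chain as follows; the simple truncation rule (stop at the first non-positive estimated autocorrelation) is an illustrative choice:

```python
import numpy as np

def iact(chain, max_lag=1000):
    """Estimate the integrated autocorrelation time
    tau = 1 + 2 * sum_{k>=1} rho(k) of a 1-d chain, truncating the sum
    at the first lag with non-positive estimated autocorrelation."""
    x = np.asarray(chain, float)
    x = x - x.mean()
    n = len(x)
    var = np.dot(x, x) / n
    tau = 1.0
    for k in range(1, min(max_lag, n - 1) + 1):
        rho = np.dot(x[:-k], x[k:]) / (n * var)  # lag-k autocorrelation
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return tau
```

An i.i.d. sample has tau ≈ 1, while a sticky chain has large tau; 1/tau is then the per-iteration efficiency referred to above.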

A good measure of efficiency for stochastic gradient based algorithms is therefore required in order to allow optimal scaling results to be obtained. Such an efficiency measure is also required for the development of methods which tune adaptively, similar to the 'No U-turn Sampler' developed by Hoffman and Gelman (2014) for HMC.

5.4 Using batch methods to analyse complex hierarchical models

A form of model which could benefit considerably from the scalability improvements introduced by batch methods is the hierarchical model. This type of model is very common across a wide range of applications, from computer science to ecology. In this case the structure of the model can be taken advantage of. Suppose we have a nested hierarchical model of the form

\[
y_{ij} \sim f(y \mid x_j), \qquad x_j \sim g(x_j \mid \theta),
\]
then we might split the data according to the value of j (Scott et al., 2013). Batch methods can then be used to estimate θ, and this estimate can then be used to simulate values of x_j in parallel. Similar methods can be used for more complex structures.

Clearly one key benefit of this approach is that it allows the problem to be parallelised naturally in groups. Another reason we might expect this approach to work particularly well is that generally only a few parameters need to be approximated by combining batches; the rest are estimated in parallel, given these parameters, using standard MCMC. In this section we outline a particular form of hierarchical model which could benefit from this approach.

A common statistical problem, especially in medical applications, is the need to model the relationship between a single time-independent response and a functional predictor. It is generally impossible to observe this functional predictor exactly; the data therefore tends to consist of noisy observations of the function. For example, we may wish to predict whether an individual is infected with a particular disease based on various measurements that have been recorded over time. A particular example is determining the relationship between magnetic resonance imaging (MRI) data and health outcomes (Goldsmith et al., 2012).

This type of problem is considered by Woodard et al. (2013). For each subject i, we assume we have noisy observations {W_i(x_{ik})}_{k=1}^K from a function f_i(x). In order to fully account for uncertainty in the estimation of each function, Woodard et al. (2013) introduce a hierarchical model. The model first produces an estimate of each function f_i(x) using the noisy observations. The response variable Y_i is then regressed against statistical summaries of these estimates.

Suppose that in producing an estimate of f_i(x) we are required to estimate a set of parameters ω_i for each subject, and that regressing Y_i on summaries of ω_i requires the estimation of a further set of parameters φ. Let the vector of noisy observations for each subject, {W_i(x_{ik})}_{k=1}^K,


be denoted by W_i, and the matrix of all observations be denoted by W. Woodard et al. (2013) take a Bayesian approach and aim to sample from the posterior distribution given by

\[
p(\{\omega_i\}_{i=1}^n, \phi \mid \mathbf{W}, \mathbf{Y}) \propto p(\phi) \prod_{i=1}^{n} p(\omega_i)\, p(Y_i \mid \omega_i, \phi)\, p(W_i \mid \omega_i),
\]

where Y is a vector of responses. Typically, accurate estimation of f_i(x) requires a large number of observations. Therefore, when there are a large number of subjects, this problem becomes infeasible due to computational expense.

In order to account for this, Woodard et al. (2013) suggest decomposing the posterior as follows:

\[
p(\{\omega_i\}_{i=1}^n, \phi \mid \mathbf{W}, \mathbf{Y}) = p(\{\omega_i\}_{i=1}^n \mid \mathbf{W}, \mathbf{Y})\, p(\phi \mid \{\omega_i\}_{i=1}^n, \mathbf{Y}).
\]

They then suggest, assuming independence across subjects i, applying an approximation known as modularization (Liu et al., 2009) to obtain

\[
p(\{\omega_i\}_{i=1}^n \mid \mathbf{W}, \mathbf{Y}) \approx p(\{\omega_i\}_{i=1}^n \mid \mathbf{W}) = \prod_{i=1}^{n} p(\omega_i \mid W_i).
\]

This approximation allows us to estimate the function f_i(x) for each subject in parallel. Functional estimation is the slower part of the algorithm, so this speeds up the algorithm considerably. However, splitting the posterior in this way loses valuable information.

An alternative approach would be to split the posterior distribution into subposteriors by subject, p(ω_i, φ | W_i, Y_i), as follows:

\[
p(\{\omega_i\}_{i=1}^n, \phi \mid \mathbf{W}, \mathbf{Y}) \propto \prod_{i=1}^{n} p^{1/n}(\phi)\, p(\omega_i)\, p(Y_i \mid \omega_i, \phi)\, p(W_i \mid \omega_i) = \prod_{i=1}^{n} p(\omega_i, \phi \mid W_i, Y_i).
\]

These subposteriors can then be simulated from in parallel using the methodology outlined in Woodard et al. (2013). Batch methods could then be used to produce an estimate for φ, the parameter of interest. This is a particularly suitable use for batch methods, since the dimensionality of φ is likely to be considerably lower than that of the ω_i. A review of the performance of this idea would be useful, especially if it is found to give a significant improvement over solutions which use modularization.


References

Ahn, S., Korattikara, A., and Welling, M. (2012). Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1591–1598.

Ahn, S., Shahbaba, B., and Welling, M. (2014). Distributed stochastic gradient MCMC. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1044–1052.

Bardenet, R., Doucet, A., and Holmes, C. (2015). On Markov chain Monte Carlo methods for tall data. arXiv preprint arXiv:1505.02827.

Betancourt, M. (2015). The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 533–540.

Beygelzimer, A., Kakadet, S., Langford, J., Arya, S., Mount, D., and Li, S. (2013). FNN: fast nearest neighbor search algorithms and applications. R package version 1.1.

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Boltz, S., Debreuve, E., and Barlaud, M. (2007). kNN-based high-dimensional Kullback-Leibler distance for tracking. In Eighth International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS-07), pages 16–16. IEEE.

Chen, T., Fox, E. B., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. arXiv preprint arXiv:1402.4102.

Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., and Neven, H. (2014). Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pages 3203–3211.

Duong, T. (2004). Bandwidth selectors for multivariate kernel density estimation. PhD thesis, University of Western Australia.

Girolami, M. and Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214.

Goldsmith, J., Bobb, J., Crainiceanu, C. M., Caffo, B., and Reich, D. (2012). Penalized functional regression. Journal of Computational and Graphical Statistics.

Hjort, N. L. and Glad, I. K. (1995). Nonparametric density estimation with a parametric start. The Annals of Statistics, pages 882–904.

Hoffman, M. D. and Gelman, A. (2014). The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. The Journal of Machine Learning Research, 15(1):1593–1623.


Ihler, A. T., Sudderth, E. B., Freeman, W. T., and Willsky, A. S. (2004). Efficient multiscale sampling from products of Gaussian mixtures. Advances in Neural Information Processing Systems, 16:1–8.

Le Cam, L. (2012). Asymptotic methods in statistical decision theory. Springer Science & Business Media.

Liu, F., Bayarri, M., Berger, J., et al. (2009). Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Analysis, 4(1):119–150.

Ma, Y.-A., Chen, T., and Fox, E. B. (2015). A complete recipe for stochastic gradient MCMC. arXiv preprint arXiv:1506.04696.

Minsker, S., Srivastava, S., Lin, L., and Dunson, D. (2014). Scalable and robust Bayesian inference via the median posterior. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1656–1664.

Miroshnikov, A. and Conlon, E. (2015). parallelMCMCcombine: Methods for combining independent subset MCMC posterior samples to estimate a full posterior density. R package version 1.0.

Neal, R. M. (1996). Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6(4):353–366.

Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2):125–139.

Neal, R. M. (2010). MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo. Chapman & Hall.

Neiswanger, W., Wang, C., and Xing, E. (2013). Asymptotically exact, embarrassingly parallel MCMC. arXiv preprint arXiv:1311.4780.

Patterson, S. and Teh, Y. W. (2013). Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pages 3102–3110.

Rabinovich, M., Angelino, E., and Jordan, M. I. (2015). Variational consensus Monte Carlo. arXiv preprint arXiv:1506.03074.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.

Roberts, G. O., Rosenthal, J. S., et al. (2001). Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367.

Rudoy, D. and Wolfe, P. J. (2007). Multi-scale MCMC methods for sampling from products of Gaussian mixtures. In Acoustics, Speech and Signal Processing, 2007 (ICASSP 2007), IEEE International Conference on, volume 3, pages III-1201. IEEE.


Sato, I. and Nakagawa, H. (2014). Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Ito process. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 982–990.

Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H., George, E., and McCulloch, R. (2013). Bayes and big data: The consensus Monte Carlo algorithm. In EFaBBayes 250 conference, volume 16.

Srivastava, S., Cevher, V., Tran-Dinh, Q., and Dunson, D. B. (2015). WASP: Scalable Bayes via barycenters of subset posteriors. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 912–920.

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1139–1147.

Teh, Y. W., Thiery, A., and Vollmer, S. (2014). Consistency and fluctuations for stochastic gradient Langevin dynamics. arXiv preprint arXiv:1409.0578.

Vollmer, S. J., Zygalakis, K. C., et al. (2015). (Non-)asymptotic properties of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1501.00438.

Wang, X. and Dunson, D. B. (2013). Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605.

Wang, X., Guo, F., Heller, K. A., and Dunson, D. B. (2015). Parallelizing MCMC with random partition trees. arXiv preprint arXiv:1506.03164.

Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688.

Woodard, D. B., Crainiceanu, C., and Ruppert, D. (2013). Hierarchical adaptive regression kernels for regression with functional predictors. Journal of Computational and Graphical Statistics, 22(4):777–800.

Wu, J. (2004). Some properties of the Gaussian distribution. Georgia Institute of Technology.
