
Particle move-reweighting schemes for online Bayesian inference

Reinaldo Marques and Geir Storvik
University of Oslo and Statistics for Innovation Centre, Oslo, Norway.

Abstract. Sequential Monte Carlo (SMC) methods are among the most important computational tools for dealing with intractability in complex statistical models. In these techniques, the distribution of interest is approximated by a set of properly weighted samples. One problem with SMC algorithms is weight degeneracy: either the weights have huge variability or there are high correlations between the particles. Updating the particles by a few MCMC steps has been suggested as an improvement in this case (the resample-move algorithm). The general setup is to first resample the particles so that all particles are given equal weight; the MCMC steps are then applied in order to make the identical samples diverge. In this work we consider an alternative strategy where the order of the MCMC updates and the resampling steps is switched, i.e. the MCMC updates are performed first. The main advantage of this approach is that, by performing MCMC updates, the weights can be updated simultaneously, making them less variable. We illustrate through simulation studies how our methodology can give improved results for online Bayesian inference in general state-space models.

Keywords: Approximate inference; Particle filter; MCMC moves; reweighting schemes; Sequential Monte Carlo methods; State space models


1. Introduction

In many real-world applications, observations arrive sequentially in time (Doucet et al., 2001; Candy, 2009; Zucchini and MacDonald, 2009). State-space models provide flexible representations for stochastic dynamical systems and can encapsulate many real problems (in time or space-time). Such models are applied in a wide range of fields: financial econometrics, ecology, geo- and neuroscience, engineering and machine learning (Doucet et al., 2001; Green et al., 2003; Murphy, 2012).

Sequential Monte Carlo (SMC) methods are an efficient class of algorithms for sampling from a probability distribution of interest. The corresponding algorithms, called particle filters (PF), are extensively used to deal with the computational intractability of general state-space models. By employing SMC methods, the target distribution is approximated by a set of weighted samples, generated sequentially from some proposal distributions (see Cappe et al., 2007; Fearnhead, 2008; Doucet and Johansen, 2009; Kantas et al., 2009; Tokdar and Kass, 2010, and the references therein).

A well-known problem in SMC methods is weight degeneracy, also called sample impoverishment. This problem is related to the increasing variance of the particle weights over time (Doucet et al., 2000, 2001). After a few iterations only a small number of particles are left which adequately characterize the target distribution, and consequently the Monte Carlo estimators gradually lose their good statistical properties.

Since the seminal paper on SMC methods by Gordon et al. (1993), the introduction of a resampling stage has been an inexpensive way to avoid the collapse of particle filter algorithms due to weight degeneracy. The idea of the resampling step is to sample the particles randomly, duplicating those with high weights and removing those with low weights; afterwards, the weights take equal values. Successive use of the resampling procedure can, however, significantly reduce the number of distinct particles, and consequently SMC methods tend to lose sample diversity.

In order to rejuvenate the particles after the resampling step, MCMC kernels were successfully introduced to build proposal distributions in particle filter algorithms (Gilks and Berzuini, 2001). Basically, after the resampling stage, the addition of MCMC moves allows the particles to move through the sample space and rejuvenate. Gilks and Berzuini (2001) and Chopin (2004) provide a formal justification for such schemes, including theoretical results on consistency and asymptotic normality. Later, Doucet et al. (2006) and Lin et al. (2013) presented alternative strategies to reduce weight degeneracy in particle filter algorithms. Other approaches, such as dynamic parameter estimation using sufficient statistics (Fearnhead, 2002; Storvik, 2002), can also be considered


as particle filters with MCMC moves.

In this paper, we propose a flexible strategy that allows for MCMC moves

without the need for a preliminary resampling step. Following the MCMC moves, we update the particle weights, taking into account both the diversification step and the fact that MCMC moves bring the particles closer to the target distribution. The effect of this is that resampling stages can be delayed. With this strategy, we are able to diversify the particles via an MCMC move and, at the same time, reduce the weight degeneracy. The validity of the approach rests on the commonly used trick of working on an artificial extended distribution having the target distribution as a marginal, combined with the use of backward kernels, as considered in Del Moral et al. (2006) for static problems and by Storvik (2011) in more general settings.

This paper is organized as follows. Section 2 gives a brief overview of sequential Monte Carlo methods and the particle filter algorithm with an MCMC move step. Section 3 presents the move-reweighting approach as an efficient way to update the weights and reweight the particles, and establishes its formal justification in Proposition 2; Section 3 also discusses the move-reweighting algorithm and its implementation issues. Section 4 demonstrates two applications of our methodology for sequential Bayesian inference: a non-Gaussian likelihood model with a time-varying parameter, and a non-linear Gaussian stochastic system. Finally, Section 5 closes with a discussion.

2. Sequential Inference in State Space Models

2.1. Preliminaries: notation and model description

Let $\{x_t\}_{t\in\mathbb{N}}$ and $\{y_t\}_{t\in\mathbb{N}}$ be discrete-time stochastic processes in which the latent process $x_t$ is indirectly observed through the measurement data $y_t$. As traditionally done in the literature, a generic state-space model is expressed in terms of a dynamic process model in combination with an observation model:

$$x_0 \sim \pi(x_0),$$
$$x_t \mid x_{1:t-1} \sim \pi(x_t \mid x_{t-1}), \qquad \text{(prior latent model)}$$
$$y_t \mid y_{1:t-1}, x_{1:t} \sim \pi(y_t \mid x_t), \qquad \text{(observation model)}$$

where here and in the following $\pi(\cdots)$ is used generically for distributions based on the assumed state-space model, with $\cdots$ specifying which variables are in question. In general the distributions will depend on some parameters $\theta$. These could be included as part of the latent variables, but throughout we assume the parameters are known and therefore do not include them in the notation. In the following we also use the notation $x_{1:t} = (x_1, x_2, \ldots, x_t) \in \mathcal{X}^t$ and $y_{1:t} = (y_1, y_2, \ldots, y_t) \in \mathcal{Y}^t$ to denote the first $t$ elements of the sequences of latent and observed variables.


Figure 1. State-space model described by graphical structure.

For simplicity of notation we assume each $x_t$ has a common sample space $\mathcal{X} \subset \mathbb{R}^{p_x}$, and we denote $\mathcal{X}^t = \prod_{k=1}^{t}\mathcal{X}$ and similarly $\mathcal{Y}^t = \prod_{k=1}^{t}\mathcal{Y}$ for $\mathcal{Y} \subset \mathbb{R}^{p_y}$. Figure 1 describes the dynamic hierarchical structure between the hidden Markov process and the observations using a graphical configuration.

Given the data $y_{1:t}$ up to time $t$, inference focuses on the posterior distribution
$$p(x_{1:t}) \equiv \pi(x_{1:t} \mid y_{1:t}) \propto \pi(x_1)\pi(y_1 \mid x_1)\prod_{k=2}^{t}\pi(x_k \mid x_{k-1})\pi(y_k \mid x_k). \qquad (1)$$

When $p(x_{1:t})$ is intractable, Monte Carlo sampling methods can be applied to carry out approximate inference. Although algorithms based on MCMC schemes are traditional stochastic techniques for sampling from high-dimensional distributions, in many cases, including sequential inference, these methods do not perform well when strong temporal correlations are present (Neal, 2001; Chopin, 2002; Del Moral et al., 2006). Within statistical Monte Carlo techniques, SMC methods (also combined with MCMC sampling) have become a powerful tool to perform static and dynamic inference in complex or high-dimensional models (see Fearnhead, 2008; Andrieu et al., 2010; Chopin et al., 2012, and the references therein).

In the next section, we present a basic review of such methods, in order to fix notation and provide the necessary background for explaining our methodology. For the rest of this paper, we assume that $g$ is a $p$-integrable function and $K$ an MCMC kernel which leaves $p$ invariant.


2.2. Sequential Monte Carlo methods

SMC methods are a broad class of Monte Carlo integration methods, based on importance sampling techniques, which decompose the target distribution into a sequence of low-dimensional distributions. For each time $t$, SMC methods represent the posterior (or some distribution of interest) defined in (1) using a finite collection of properly weighted samples $\{(x_t^i, w_t^i),\ i = 1, \ldots, N\}$ with $N \gg 1$, according to the following definition (Liu and Chen, 1998; Malefaki and Iliopoulos, 2008):

Definition 1. A set of weighted random samples $\{(x^i, w^i),\ i = 1, \ldots, N\}$ is called proper (or properly weighted) with respect to $\pi$ if, for any square-integrable function $g$,
$$E_q[g(x^i)w^i] = c\,E_\pi[g(x^i)], \qquad i = 1, \ldots, N,$$
for some normalizing constant $c$ common to all the $N$ samples generated from $q$.
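To make Definition 1 concrete, the following small sketch (not from the paper; the target, proposal and test function are arbitrary choices for illustration) draws samples from a Gaussian proposal, attaches the weights $p/q$, and checks that the self-normalized weighted average of a test function recovers its expectation under the target:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Illustrative choices: target p = N(1, 1), proposal q = N(0, 2^2), g(x) = x^2.
x = rng.normal(0.0, 2.0, size=N)                 # x^i ~ q
log_p = -0.5 * (x - 1.0) ** 2                    # log p up to an additive constant
log_q = -0.5 * (x / 2.0) ** 2 - np.log(2.0)      # log q, dropping the same constant
w = np.exp(log_p - log_q)                        # unnormalized weights w^i = p/q

g = x ** 2
print(np.sum(w * g) / np.sum(w))                 # approx E_p[g(x)] = 1^2 + 1 = 2
```

The common normalizing constant plays the role of $c$ in the definition and cancels in the self-normalized estimator.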

The samples in the population are called particles, and in the case where $x = x_{1:t}$ they are typically generated sequentially from some low-dimensional conditional distributions,
$$q(x_{1:t}) = q(x_1)\prod_{k=2}^{t} q(x_k \mid x_{k-1}),$$

where $q(\cdots)$ is generically used for proposal distributions, with $\cdots$ specifying which variables are generated. Assuming that $q(x_{1:t}) > 0$ for all $x_{1:t}$ with $p(x_{1:t}) > 0$, then based on sequential importance sampling ideas, the particle weights are defined as
$$w_t(x_{1:t}) \equiv \frac{p(x_{1:t})}{q(x_{1:t})} \propto w_{t-1}(x_{1:t-1})\,\frac{\pi(x_t \mid x_{t-1})\pi(y_t \mid x_t)}{q(x_t \mid x_{t-1})},$$
allowing for recursive weight updating. Estimation is usually based on normalized weights, making the proportionality constant unnecessary to compute. For time $t > 1$, when new data arrive, the particle filter algorithm provides an online estimate of $E[g(x_{1:t}) \mid y_{1:t}]$ through

$$\sum_{i=1}^{N} w_t^i\, g(x_{1:t}^i) \Big/ \sum_{i=1}^{N} w_t^i,$$
which is consistent with probability one (Doucet et al., 2001; Chopin, 2004). In addition, particle filters provide, as a natural by-product, an unbiased estimator


for the marginal likelihood (Del Moral, 2004; Pitt et al., 2010). Such likelihood estimates have been used in different contexts, with emphasis on parameter learning, model selection and building particle proposals for sequential inference (see e.g. Pitt et al., 2010; Andrieu et al., 2010; Poyiadjis et al., 2011; Chopin et al., 2012; Whiteley and Lee, 2012; Shephard and Doucet, 2012).

Taking the prior $\pi(x_t \mid x_{t-1})$ as the proposal distribution (the bootstrap filter of Gordon et al. (1993)), the incremental weight boils down to evaluating the likelihood of the observation. Although the bootstrap filter is easy to implement, better proposals can give substantial improvements in the efficiency of SMC methods. Doucet et al. (2000) show that the optimal proposal is the conditional distribution given the previous state and the current observation, i.e., $q^{\mathrm{opt}}(x_t \mid x_{t-1}, y_t) = \pi(x_t \mid x_{t-1}, y_t)$. However, the optimal choice is not available in many applications, making some approximation necessary (e.g. deterministic sampling methods or local linearisation proposals). Alternative choices of $q$ are presented in Chen and Liu (2000) and Pitt and Shephard (1999).

An efficient algorithm should be able to produce particle weights with low variance. Even though the use of clever proposals close to $q^{\mathrm{opt}}$ tends to decrease the variance, the weight degeneracy problem is unavoidable. The next proposition (see the proof in Liu, 2001) describes how an increase in dimension influences the variance of the weights.

Proposition 1. Let $\pi(v_1, v_2)$ and $q(v_1, v_2)$ be two probability densities, where the support of $\pi$ is a subset of the support of $q$. Then
$$\operatorname{Var}_q\!\left(\frac{\pi(v_1, v_2)}{q(v_1, v_2)}\right) \geq \operatorname{Var}_{q_1}\!\left(\frac{\pi_1(v_1)}{q_1(v_1)}\right),$$
where $\pi_1(v_1) = \int \pi(v_1, v_2)\,dv_2$ and $q_1(v_1) = \int q(v_1, v_2)\,dv_2$ are marginal densities.

A special case is to define $v_1 = x_{1:t-1}$ and $v_2 = x_t$, resulting in $\operatorname{Var}(w_t) \geq \operatorname{Var}(w_{t-1})$ for all $t \geq 2$. In practice, as $t$ increases, only a few particles (in the limit, only one) dominate the entire sample, implying an extremely poor representation of the target distribution.

In order to attenuate the degeneracy problem, Gordon et al. (1993) suggested adding resampling steps, a scheme commonly named sequential importance resampling (SIR). Resampling is not needed at all time steps, only when the samples are too degenerate. The need for resampling may be evaluated by assessing the quality of the samples using the effective sample size (Kong et al., 1994; Liu and Chen, 1995; Cappe et al., 2005).

Algorithm 1 provides a generic version of the ordinary SMC for filtering, assuming the static parameters are known.


Algorithm 1 Ordinary SMC for Filtering

At t = 1:
    Sample $x_1^i \sim q(x_1^i \mid y_1)$.
    Compute and normalize the weights
    $$w_1^i = \pi(x_1^i)\pi(y_1 \mid x_1^i)/q(x_1^i \mid y_1).$$
for t = 2, 3, ... do
    Propagate $x_t^i \sim q(x_t^i \mid x_{t-1}^i, y_t)$.
    Compute and normalize the weights
    $$w_t^i = w_{t-1}^i\,\frac{\pi(x_t^i \mid x_{t-1}^i)\pi(y_t \mid x_t^i)}{q(x_t^i \mid x_{t-1}^i, y_t)}.$$
    if the effective sample size is small then
        Resample to obtain N new equally-weighted particles.
end for
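As an illustration of Algorithm 1 (a minimal sketch, not code from the paper), the following Python function implements the ordinary SMC recursion for the special case where the prior $\pi(x_t \mid x_{t-1})$ is used as proposal, so that the incremental weight is the observation likelihood; the callable names, the 50% ESS threshold and the use of multinomial resampling are illustrative assumptions.

```python
import numpy as np

def ordinary_smc(y, sample_prior, sample_transition, log_obs_density,
                 N=1000, ess_threshold=0.5, rng=None):
    """Ordinary SMC for filtering (Algorithm 1) with the prior as proposal.

    sample_prior(N, rng)        -> array of N initial states
    sample_transition(x, rng)   -> array of N propagated states
    log_obs_density(y_t, x)     -> array of N log-likelihood values
    """
    rng = rng if rng is not None else np.random.default_rng()
    x = sample_prior(N, rng)                      # x_1^i ~ pi(x_1) (= proposal here)
    logw = log_obs_density(y[0], x)               # w_1^i proportional to pi(y_1 | x_1^i)
    for t in range(1, len(y)):
        x = sample_transition(x, rng)             # propagate x_t^i ~ pi(x_t | x_{t-1}^i)
        logw = logw + log_obs_density(y[t], x)    # recursive weight update
        w = np.exp(logw - np.max(logw))
        w /= np.sum(w)
        if 1.0 / np.sum(w ** 2) < ess_threshold * N:
            idx = rng.choice(N, size=N, p=w)      # multinomial resampling
            x = x[idx]
            logw = np.zeros(N)                    # equal weights after resampling
    return x, logw
```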

2.3. SMC with MCMC moves

Although the SIR algorithm is able to control the variance of the weights and is easy to implement, successive use of the resampling stage tends to impoverish the sample by reducing the number of distinct particles. In particular, when the target distribution differs significantly from one time point to the next, the variance of the Monte Carlo estimator may be affected negatively (Gilks and Berzuini, 2001; Berzuini and Gilks, 2003). Consequently, the efficiency of SMC algorithms may be poor, even when clever proposals are considered.

Gilks and Berzuini (2001) proposed adding MCMC moves after the resampling step to reduce the sample impoverishment. The approach is called the resample-move (RM) algorithm, and the main idea is to create greater diversity in the sample by rejuvenating the particles via a combination of sequential importance resampling and MCMC sampling steps. In this approach, a Markov kernel $K(x_{1:t}^\star \mid x_{1:t})$ with $p(x_{1:t})$ as its stationary distribution is designed to draw samples after the resampling steps.

Different Markov kernels can be used at each time step. Some features of the kernel $K$ are explored in Gilks and Berzuini (2001). In short, the Markov kernel does not need to be ergodic, since the particles target the correct distribution before the move step. For the same reason, no burn-in is required, and the number of Markov chain iterations for each particle is a tuning option. In case many Markov chain updates are required, parallel programming can compensate for the extra computational complexity due to the MCMC moves.


A simple version of the RM algorithm at time t is as follows:

(a) Run the particle filter with resampling to obtain $N$ new equally weighted particles;

(b) Move $x_{1:t} \to x_{1:t}^\star$ according to an MCMC kernel $K$ invariant with respect to $p(x_{1:t})$;

(c) Approximate $p(x_{1:t})$ by $N^{-1}\sum_{i=1}^{N} \delta_{x_{1:t}^{\star i}}(x_{1:t})$.

In any version of the particle filter algorithm, the RM approach can be easily implemented with a computational cost of $O(tNM)$ at each time step, where $M$ is the number of MCMC steps. A possibility is to choose $s$ such that at each time step we only move $x_{t-s+1:t}$, in which case the computational cost reduces to $O(sNM)$ (see, for instance, Polson et al., 2008); a sketch of such a windowed update is given below. Further, sophisticated MCMC sampling algorithms (Gilks et al., 1996; Robert and Casella, 2004; Liang et al., 2011) can be applied to increase the statistical performance.
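As a sketch of the windowed RM update with $s = 1$ (an illustration under our own assumptions: scalar states, a random-walk Metropolis kernel, and user-supplied log transition and observation densities; not the authors' code), one resample-move step can be written as follows. Since each particle is updated by a Metropolis-Hastings kernel targeting $\pi(x_t \mid x_{t-1})\pi(y_t \mid x_t)$ for its own history, the kernel leaves the target invariant and the particles keep equal weights after the move.

```python
import numpy as np

def resample_move_step(x_prev, x_t, logw, log_trans, log_obs, y_t,
                       n_mcmc=5, step=0.2, rng=None):
    """One resample-move update (window s = 1) for scalar states."""
    rng = rng if rng is not None else np.random.default_rng()
    N = len(x_t)
    w = np.exp(logw - np.max(logw))
    w /= np.sum(w)
    idx = rng.choice(N, size=N, p=w)                   # resample: equal weights
    x_prev, x_t = x_prev[idx], x_t[idx]
    log_target = log_trans(x_t, x_prev) + log_obs(y_t, x_t)
    for _ in range(n_mcmc):                            # random-walk Metropolis moves
        prop = x_t + step * rng.standard_normal(N)
        log_target_prop = log_trans(prop, x_prev) + log_obs(y_t, prop)
        accept = np.log(rng.uniform(size=N)) < log_target_prop - log_target
        x_t = np.where(accept, prop, x_t)
        log_target = np.where(accept, log_target_prop, log_target)
    return x_prev, x_t, np.zeros(N)                    # equal log-weights after the move
```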

3. Particle reweighting schemes

The current section introduces our approach to reweighting the particles after a move. In the suggested methodology, we perform the move before a (or without any preliminary) resampling stage, followed by a weight update. Many types of updates can give properly weighted samples, and by clever choices of such updates we can obtain less variability in the particle weights. This can subsequently lead to delayed resampling and better performance of the full algorithm.

3.1. Move by MCMC kernels

Assume $x_{1:t} \sim q(x_{1:t})$ is followed by a move $x_{1:t}^\star \sim K(x_{1:t}^\star \mid x_{1:t})$, where $K$ is invariant with respect to $p(x_{1:t})$. As suggested by Del Moral et al. (2006) and Storvik (2011), define an extended target distribution as $p(x_{1:t}^\star, x_{1:t}) \equiv p(x_{1:t}^\star)h(x_{1:t} \mid x_{1:t}^\star)$, where $h(x_{1:t} \mid x_{1:t}^\star)$ is an artificial density/backward kernel that integrates to one over $\mathcal{X}^t$. Following ordinary importance sampling theory on the enlarged space, we obtain the following result:

Proposition 2. Let $\{(x_{1:t}^i, w_t^i),\ i = 1, \ldots, N\}$ be a properly weighted sample with respect to $p(x_{1:t})$, where the particles are generated from $q(x_{1:t})$. Assume for each $i$ we make a move $x_{1:t}^i \to x_{1:t}^{\star i}$ by a transition kernel $K$ that is invariant with respect to $p(x_{1:t})$ and update the weights by
$$w_t^{\star i} \equiv w_t^\star(x_{1:t}^i; x_{1:t}^{\star i}) = w_t^i \times r_t^i, \qquad (2)$$


where
$$r_t^i \equiv r_t(x_{1:t}^i; x_{1:t}^{\star i}) = \frac{p(x_{1:t}^{\star i})\,h(x_{1:t}^i \mid x_{1:t}^{\star i})}{p(x_{1:t}^i)\,K(x_{1:t}^{\star i} \mid x_{1:t}^i)},$$
and $h(x_{1:t} \mid x_{1:t}^\star)$ is a density such that $\{(x_{1:t}^\star, x_{1:t}) : p(x_{1:t}^\star)h(x_{1:t} \mid x_{1:t}^\star) > 0\}$ is a subset of $\{(x_{1:t}^\star, x_{1:t}) : q(x_{1:t})K(x_{1:t}^\star \mid x_{1:t}) > 0\}$. Then $\{(x_{1:t}^{\star i}, w_t^{\star i}),\ i = 1, \ldots, N\}$ is proper with respect to $p(x_{1:t})$.

Also, define $q^\star(x_{1:t}^\star) = \int_{\mathcal{X}^t} q(x_{1:t})K(x_{1:t}^\star \mid x_{1:t})\,dx_{1:t}$. Then
$$E[w_t^\star(x_{1:t}; x_{1:t}^\star) \mid x_{1:t}^\star] = \frac{p(x_{1:t}^\star)}{q^\star(x_{1:t}^\star)} \equiv w_t^{\mathrm{opt}}(x_{1:t}^\star),$$
and
$$\operatorname{Var}[w_t^\star(x_{1:t}; x_{1:t}^\star)] \geq \operatorname{Var}[w_t^{\mathrm{opt}}(x_{1:t}^\star)].$$

For a proof, see Appendix A.

Proposition 2 gives us some useful elements. First, by adding the MCMC move and updating the particle weights, we have that, for any density $h$, $w_t^\star$ is a proper weight for $x_{1:t}^\star$ with respect to $p(x_{1:t})$. Even though we are working with an extended space, the MCMC move does not modify the unbiasedness property. The main point of this sampling scheme is to highlight the multiple choices of proper weight functions $w_t^\star(x_{1:t}; x_{1:t}^\star)$ for any given kernel $K$, in that there is great flexibility in the choice of $h$. In Subsection 3.2, we describe some of these possibilities. Note that Proposition 1 can give the impression that by extending the set of variables from $x_{1:t}$ to $(x_{1:t}, x_{1:t}^\star)$ the variance of the weights will increase. However, within this framework $x_{1:t}$ is no longer the variable of interest and only serves as an auxiliary variable for the generation of $x_{1:t}^\star$. An important point is then that the marginal target distribution for $x_{1:t}$ can deviate from $p(x_{1:t})$, and thereby Proposition 1 is no longer relevant.

Second, in case we wish to compute the conditional expectation of $g(x_{1:t})$, this quantity can be consistently estimated by
$$\sum_{i=1}^{N} w_t^{\star i}\, g(x_{1:t}^{\star i}) \Big/ \sum_{i=1}^{N} w_t^{\star i}, \qquad (3)$$
where we have used the notation $w_t^{\star i} \equiv w_t^\star(x_{1:t}^i; x_{1:t}^{\star i})$. Third and last, even though in most cases we are not able to evaluate $q^\star$, the updated particle weights are unbiased estimates of the optimal weights $w_t^{\mathrm{opt}}$.


Algorithm 2 Move-reweighting for Filtering with MCMC kernels

At time t do
    Propagate as in the ordinary SMC.
    Move $x_t^{\star i}$ via a $p$-invariant MCMC kernel $K(x_t^{\star i} \mid x_t^i)$.
    Update and normalize the weights $w_t^{\star i} \propto w_t^i \times r_t^i$, where
    $$r_t^i = \frac{p(x_{1:t}^{\star i})\,h(x_{1:t}^i \mid x_{1:t}^{\star i})}{p(x_{1:t}^i)\,K(x_{1:t}^{\star i} \mid x_{1:t}^i)}.$$
    if the effective sample size is small then
        Resample to obtain N new equally-weighted particles.

The stochastic term $r_t$ can be seen as a dynamic correction term that tends to rebalance the ordinary particle weights after an MCMC move. By updating the ordinary particle weights, we introduce an additional chance to reduce the cases where the distribution of the weights is extremely skewed. Since, via an MCMC move, $x_{1:t}^\star$ is closer to the target distribution than $x_{1:t}$, the updated weight should also be closer to $1/N$. For clever choices of $h$ we can obtain $\operatorname{Var}[w_t^\star] < \operatorname{Var}[w_t]$, allowing us to reduce the weight degeneracy.

The move-reweighting scheme is a simple combination of SMC and MCMC sampling whose computational complexity is comparable to that of the RM algorithm. As for the RM algorithm, parallel sampling may be applied to increase the computational performance. Since $x_{1:t}$ are auxiliary particles up to iteration $t$, we only need to store $x_{1:t}^{\star 1:N}$ and their respective proper weights to provide consistent estimation. As for the RM algorithm, any MCMC scheme can be applied. For example, see the appendix in Storvik (2011) for details on how to update the weights when proposals are generated via Metropolis-Hastings moves. Note that, after updating the particle weights, the degeneracy problem remains, and resampling stages may be required in most practical situations. By clever choices of move steps and weight updating schemes, the degeneracy problem can however be reduced, delaying the need for resampling.

Algorithm 2 describes our algorithm in this case. Similar to the RM algorithm, moves will typically only involve the last $s$ components of $x_{1:t}$, in which case the weight update reduces to
$$r_t^i = \frac{p(x_{t-s+1:t}^{\star i} \mid x_{1:t-s}^i)\,h(x_{t-s+1:t}^i \mid x_{1:t-s}^i, x_{t-s+1:t}^{\star i})}{p(x_{t-s+1:t}^i \mid x_{1:t-s}^i)\,K(x_{t-s+1:t}^{\star i} \mid x_{1:t}^i)}$$
and the computational complexity at each iteration reduces to $O(sNM)$.


3.2. Choices of the h functions

The choice of the conditional density $h$ should, in general, balance two aims: (i) to control the variability of the particle weights; and (ii) to find an artificial density that leads to weights that are easy to evaluate. Additionally, by importance sampling principles, $h$ must have small enough support (see the condition in Proposition 2).

Some special cases of $h$ and their respective particle weights are worth discussing. To explore their performance, we consider what kind of weights we obtain in three special cases:

c.1 If $q(x_{1:t}) = p(x_{1:t})$, we are drawing from the right distribution in the first place. Updating the particles with a $p$-invariant kernel $K$ also gives the right distribution for the updated/moved particles. We would therefore like the weights to be equal to 1 in this case.

c.2 If $K(x_{1:t}^\star \mid x_{1:t}) = p(x_{1:t}^\star)$, the move gets us directly to the right distribution, so also in this case we would like the weights to be equal to one.

c.3 If $K(x_{1:t}^\star \mid x_{1:t}) = q^\star(x_{1:t}^\star)$, that is, $x_{1:t}$ is not used in the update, we would like the weights to be equal to $p(x_{1:t}^\star)/q^\star(x_{1:t}^\star)$.

By Proposition 2, the optimal weight we can obtain at time $t$ is $w_t^{\mathrm{opt}}(x_{1:t}^\star)$. This corresponds to choosing $h^{\mathrm{opt}}(x_{1:t} \mid x_{1:t}^\star) = q(x_{1:t})K(x_{1:t}^\star \mid x_{1:t})/q^\star(x_{1:t}^\star)$, which is a legal density. In practice, however, the quantities involved will not be possible to calculate, making this an unrealistic choice.

Another option is
$$h_1(x_{1:t} \mid x_{1:t}^\star) \equiv \frac{p(x_{1:t})K(x_{1:t}^\star \mid x_{1:t})}{p(x_{1:t}^\star)} \;\Rightarrow\; w_t^\star(h_1) = w_t. \qquad (4)$$
This shows that when MCMC moves are applied, using the original weights still gives a properly weighted sample. An important special case is when resampling is applied before the move, so that $w_t^i = 1/N$ for all $i$; in this case we obtain the RM algorithm. This Monte Carlo scheme was originally proposed, outside the context of particle methods, by MacEachern et al. (1999), and it is also referred to as generalized importance sampling in Robert and Casella (2004). Note, however, that although easy to apply, these weights will typically differ from the optimal ones, indicating that better choices are possible. In case c.1, $w_t^\star(h_1) = w_t = 1$, giving a sensible result. In case c.2, however, $w_t^\star(h_1) = w_t = p(x_{1:t})/q(x_{1:t})$, not at all taking into account that we are in this case using a good (perfect) kernel for the move. A similar situation occurs for case c.3.

We can also define the $h$ function as a Markovian transition kernel,
$$h_2(x_{1:t} \mid x_{1:t}^\star) \equiv K(x_{1:t} \mid x_{1:t}^\star) \;\Rightarrow\; w_t^\star(h_2) = \frac{p(x_{1:t}^\star)K(x_{1:t} \mid x_{1:t}^\star)}{q(x_{1:t})K(x_{1:t}^\star \mid x_{1:t})}. \qquad (5)$$


Note that if $K$ satisfies detailed balance, this choice coincides with $h_1$, $w_t^\star(h_2) = w_t$, and the properties are similar to those of that choice.

Yet another possibility is to choose $h$ as the marginal proposal distribution,
$$h_3(x_{1:t} \mid x_{1:t}^\star) \equiv q(x_{1:t}) \;\Rightarrow\; w_t^\star(h_3) = \frac{p(x_{1:t}^\star)}{K(x_{1:t}^\star \mid x_{1:t})}. \qquad (6)$$

In both cases c.2 and c.3 we obtain the ideal weights, but in case c.1 we obtain $w_t^\star(h_3) = p(x_{1:t}^\star)/K(x_{1:t}^\star \mid x_{1:t})$, not taking into account that we started right in this case. This will therefore be a good choice if we have an efficient MCMC kernel, i.e. $K(x_{1:t}^\star \mid x_{1:t}) \approx p(x_{1:t}^\star)$, or a move not depending much on $x_{1:t}$. These weights will also be easy to compute, assuming the MCMC kernel densities are easily available.

Finally, any mixture of the previous $h$'s can also be considered. For example, for $\alpha \in [0, 1]$ we have
$$h(x_{1:t} \mid x_{1:t}^\star) \equiv \alpha\, q(x_{1:t}) + (1 - \alpha)\, K(x_{1:t} \mid x_{1:t}^\star) \;\Rightarrow\; w_t^\star(h) = \alpha\, w_t^\star(h_3) + (1 - \alpha)\, w_t^\star(h_2). \qquad (7)$$
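To indicate how Algorithm 2 can be implemented with a backward kernel of the $h_3$ type in (6), the sketch below (our own illustration, not the authors' code) treats the windowed case $s = 1$ of Section 3.1, moving only the last state with a kernel whose density can be evaluated and using the proposal density of the moved component as backward kernel; all callables are assumptions, and the move kernel is assumed to dominate the target for the moved component.

```python
import numpy as np

def move_reweight_step(x_prev, x_t, logw, sample_K, log_K,
                       log_trans, log_obs, log_q, y_t, rng=None):
    """One move-reweighting update (Algorithm 2, window s = 1) with a backward
    kernel of the h3 type (the proposal density of the moved component):

        r = pi(x*_t | x_{t-1}) pi(y_t | x*_t) q(x_t | x_{t-1}, y_t)
            / [ pi(x_t | x_{t-1}) pi(y_t | x_t) K(x*_t | x_{t-1}, x_t, y_t) ].
    """
    rng = rng if rng is not None else np.random.default_rng()
    x_star = sample_K(x_prev, x_t, y_t, rng)            # move the last state only
    log_r = (log_trans(x_star, x_prev) + log_obs(y_t, x_star)
             + log_q(x_t, x_prev, y_t)
             - log_trans(x_t, x_prev) - log_obs(y_t, x_t)
             - log_K(x_star, x_prev, x_t, y_t))
    logw_star = logw + log_r                            # w* = w x r, on the log scale
    w = np.exp(logw_star - np.max(logw_star))
    return x_star, logw_star, w / np.sum(w)
```

When the move kernel density cannot be evaluated pointwise (as for many Metropolis-Hastings kernels), the choice $h_1$ in (4), which simply keeps the original weights, remains available; see also the appendix of Storvik (2011) referenced above.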

3.3. General moves

In cases where the use of an MCMC kernel is difficult, alternative moves can be considered. For these situations, we first sample particles from a proposal $x_{1:t} \sim q(\cdot)$, followed by an arbitrary transition kernel, i.e. $x_{1:t}^\star \sim Q(\cdot \mid x_{1:t})$. The following statement is analogous to Proposition 2:

Proposition 3. Let $\{(x_{1:t}^i, w_t^i),\ i = 1, \ldots, N\}$ be a properly weighted sample with respect to $p$, where the particles are generated from $q$. Assume for each $i$ we move $x_{1:t}^i \to x_{1:t}^{\star i}$ by a transition kernel $Q(\cdot \mid x_{1:t}^i)$ and update the weights by
$$w_t^{\star i} \equiv w_t^\star(x_{1:t}^i; x_{1:t}^{\star i}) = w_t^i \times r_t^i, \qquad (8)$$
where
$$r_t^i \equiv r_t(x_{1:t}^i; x_{1:t}^{\star i}) = \frac{p(x_{1:t}^{\star i})\,h(x_{1:t}^i \mid x_{1:t}^{\star i})}{p(x_{1:t}^i)\,Q(x_{1:t}^{\star i} \mid x_{1:t}^i)},$$
and $h$ is a density such that $\{(x_{1:t}^\star, x_{1:t}) : p(x_{1:t}^\star)h(x_{1:t} \mid x_{1:t}^\star) > 0\}$ is a subset of $\{(x_{1:t}^\star, x_{1:t}) : q(x_{1:t})Q(x_{1:t}^\star \mid x_{1:t}) > 0\}$. Then $\{(x_{1:t}^{\star i}, w_t^{\star i}),\ i = 1, \ldots, N\}$ is proper with respect to $p$.


Algorithm 3 Move-reweighting for Filtering with arbitrary transitions

At time t do
    Propagate as in the ordinary SMC.
    Move $x_t^{\star i} \sim Q(x_t^{\star i} \mid x_t^i)$.
    Update and normalize the weights $w_t^{\star i} \propto w_t^i \times r_t^i$, where
    $$r_t^i = \frac{p(x_{1:t}^{\star i})\,h(x_{1:t}^i \mid x_{1:t}^{\star i})}{p(x_{1:t}^i)\,Q(x_{1:t}^{\star i} \mid x_{1:t}^i)}.$$
    if the effective sample size is small then
        Resample to obtain N new equally-weighted particles.

The proof of Proposition 3 is identical to the proof of Proposition 2, only replacing $K(x_{1:t}^\star \mid x_{1:t})$ with $Q(x_{1:t}^\star \mid x_{1:t})$.

When an MCMC kernel is not available, the use of some arbitrary transition kernel $Q$ to move and reweight the particles can be attractive. The choice of $Q$ can, for example, be an approximation to an MCMC kernel or a clever proposal taking into account that we have the samples generated from $q$ as auxiliary particles. For instance, we can choose $Q$ as a Langevin proposal, such as $Q(x_{1:t}^\star \mid \cdot) = \mathcal{N}\!\big(x_{1:t}^\star \,\big|\, x_{1:t} + \tfrac{\sigma^2}{2}\nabla \log p,\ \sigma^2 I\big)$, where $\sigma^2$ is some tuning parameter and $\nabla \log p$ denotes the gradient of the log of the unnormalized target density.

Algorithm 3 offers an alternative (analogous to Algorithm 2) for implementing the move-reweighting algorithm with arbitrary transitions for filtering.
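As an illustration of a move-reweighting step with such a non-MCMC transition (a minimal sketch under our own assumptions: d-dimensional particles stored row-wise, an unnormalized log target and its gradient supplied as callables, and incoming weights exactly equal to $p/q$ so that, with $h = q$, the updated weight simplifies to $p(x^\star)/Q(x^\star \mid x)$ up to a common constant), one Langevin move and the Proposition 3 reweighting can be written as:

```python
import numpy as np

def langevin_move_reweight(x, log_p, grad_log_p, sigma=0.3, rng=None):
    """One move with Q(x* | x) = N(x + (sigma^2/2) grad log p(x), sigma^2 I),
    followed by the Proposition 3 weight update with h = q, for which the new
    (unnormalized) weight is p(x*) / Q(x* | x); constants common to all
    particles are dropped since the weights are normalized afterwards."""
    rng = rng if rng is not None else np.random.default_rng()
    N, d = x.shape
    drift = x + 0.5 * sigma ** 2 * grad_log_p(x)            # Langevin drift
    x_star = drift + sigma * rng.standard_normal((N, d))    # x* ~ Q(. | x)
    log_Q = -0.5 * np.sum((x_star - drift) ** 2, axis=1) / sigma ** 2
    logw_star = log_p(x_star) - log_Q                       # log of p(x*)/Q(x*|x)
    w_star = np.exp(logw_star - np.max(logw_star))
    return x_star, w_star / np.sum(w_star)
```

Unlike a Metropolis-adjusted Langevin move, no accept/reject step is used here; the discrepancy between $Q$ and the target is absorbed entirely by the reweighting.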

3.4. Summary of particle move and (re)-weighting strategies

When we have a properly or equally weighted sample, there are several interesting SMC schemes that can be designed, depending on the type of move. In short, if we have a $p$-invariant MCMC kernel $K$, or some arbitrary transition kernel $Q$, to diversify the particles, we can identify at least four strategies for (re)-weighting them in an SMC framework:

s.1 Start with an equally weighted sample, move the particles via K and maintain the equal particle weights;

s.2 Start with a properly weighted sample, move the particles via K and maintain the same particle weights;

s.3 Start with a properly weighted sample, move the particles via K and update the particle weights; or


s.4 Start with a properly weighted sample, move the particles via Q and update the particle weights.

Strategy s.1 is the RM algorithm mentioned in Subsection 2.3, and s.2 is its generalization; both provide properly weighted samples with respect to $p$. Strategies s.3 and s.4 are cases where, through a diversification step, the weights get updated. These schemes are thus promising for making the particle weights less variable and, at the same time, reducing the sample impoverishment.

4. Numerical Illustrations

In this section we illustrate the move-reweighting approach with two examples. The first is a state-space model with a non-Gaussian likelihood, where we are able to move the particles directly from a Gibbs kernel. The second is a non-linear Gaussian stochastic system, where the rejuvenated particles are reweighted after a move with a general kernel.

As a standard measure for evaluating the sample impoverishment, we calculate the effective sample size (ESS, Liu, 1996) by
$$\mathrm{ESS}_t = \frac{\left(\sum_{i=1}^{N} w_t^i\right)^2}{\sum_{i=1}^{N} (w_t^i)^2}. \qquad (9)$$

The performance is also evaluated using two scoring rules (Gneiting and Raftery, 2007; Frei and Kunsch, 2012): the continuous ranked probability score (CRPS) and the root mean square error (RMSE). Both are negatively oriented, that is, the lower the better. The CRPS is defined as
$$\mathrm{CRPS}_{l,t} = \int_{-\infty}^{+\infty} \big(F_{l,t}(x) - 1\{x_{l,t} \leq x\}\big)^2\,dx, \qquad (10)$$
where $x_{l,t}$ is the true state variable $l$ at time $t$, $F_{l,t}$ is the marginal empirical cumulative distribution obtained through the SMC algorithm, and $\mu_{l,t}$ denotes the weighted mean of the particles.

The performance is also evaluated using the root mean square error (RMSE),
$$\mathrm{RMSE}_l = \sqrt{\frac{1}{T}\sum_{t=1}^{T}(\hat{x}_{l,t} - x_{l,t})^2}, \qquad (11)$$
where $\hat{x}_{l,t}$ is some estimate of state variable $l$ (e.g. the weighted particle mean $\mu_{l,t}$) and $x_{l,t}$ is obtained by SMC with a large number of particles ($N = 350000$).
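For completeness, the three performance measures can be computed from a weighted particle set as in the sketch below (our own illustration; for the CRPS of the weighted empirical distribution we use the equivalent representation $E|X - x| - \tfrac{1}{2}E|X - X'|$ rather than integrating the squared difference of distribution functions directly):

```python
import numpy as np

def ess(w):
    """Effective sample size (9); accepts normalized or unnormalized weights."""
    w = np.asarray(w, dtype=float)
    return np.sum(w) ** 2 / np.sum(w ** 2)

def crps(x, w, x_true):
    """CRPS (10) of the weighted empirical distribution of {(x^i, w^i)} at x_true."""
    w = np.asarray(w, dtype=float)
    w = w / np.sum(w)
    term1 = np.sum(w * np.abs(x - x_true))
    term2 = 0.5 * np.sum(np.outer(w, w) * np.abs(x[:, None] - x[None, :]))
    return term1 - term2

def rmse(estimates, reference):
    """RMSE (11) of a sequence of estimates against a reference trajectory."""
    d = np.asarray(estimates, dtype=float) - np.asarray(reference, dtype=float)
    return np.sqrt(np.mean(d ** 2))
```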

In the following illustrations, we assume all static parameters to be known.


Table 1. Gauss-Poisson model without resampling. Comparison of the ordinary SMC algorithm (O, Algorithm 1), the move-without-reweighting algorithm (M, Algorithm 2 using the backward kernel from equation (4)) and the move-reweight algorithm (MR, Algorithm 2 using the backward kernel from equation (6)), all without resampling. Results are based on one simulation of data from model (12) with parameters given in (13) and an average of 1000 repetitions of the algorithms. N = 5000 particles were used in all cases. CRPS1 is the average of CRPS_{1,t} over time, and similarly for the others.

Algorithm   ESS(%)   CRPS1    CRPS2    RMSE1    RMSE2
O           0.0284   0.1488   1.3998   0.1350   1.4878
M           0.0385   0.1615   0.4520   0.1897   0.3682
MR          4.9200   0.0842   0.2949   0.0275   0.0631

4.1. Gaussian process with Poisson distributed data

We start by illustrating our methodology on a Poisson-Gaussian state-space model. Hence, assume $x_t = (x_{1,t}, x_{2,t})'$ and
$$x_t = F x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \overset{\mathrm{ind}}{\sim} \mathcal{N}(0, \Sigma),$$
$$y_t \sim \mathrm{Poisson}\!\big(\exp(\lambda + b^{T} x_t)\big), \qquad (12)$$

where in our experiments we set
$$F = \begin{bmatrix} 0.20 & 0.95 \\ 0.90 & 0 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 0.1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad \lambda = 5, \qquad b = \begin{bmatrix} 1 & 0 \end{bmatrix}. \qquad (13)$$

Note in particular that the observation $y_t$ only depends on $x_{1,t}$.

Propagation is in this case applied by generating $x_{2,t} \sim \pi(x_{2,t} \mid x_{t-1})$ followed by $x_{1,t} \sim q_1(x_{1,t} \mid x_{2,t}, y_t, x_{t-1})$, where the latter is the Laplace approximation to $\pi(x_{1,t} \mid x_{2,t}, y_t, x_{t-1})$. The propagation is then followed by a move $(x_{1,t}, x_{2,t}) \to (x_{1,t}^\star, x_{2,t}^\star)$, where $x_{1,t}^\star = x_{1,t}$ while $x_{2,t}^\star \sim \pi(x_{2,t} \mid x_{t-1}, x_{1,t}, y_t) = \pi(x_{2,t} \mid x_{t-1}, x_{1,t})$. Hence, we move $x_{2,t}^\star$ directly from the Gibbs kernel, which is a Gaussian distribution. In addition, we consider $t = 1, 2, \ldots, 200$, and for reweighting the particles we compute $w_t^\star(h)$ with $h(x_{1:t} \mid x_{1:t}^\star) = q(x_{1:t})$, i.e. eq. (6). $N = 1000$ particles were used in the experiments.
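The data used in this example can be simulated directly from (12)-(13); the sketch below is our own illustration (the initial state is started at zero, an assumption, since $\pi(x_0)$ is not specified for this example):

```python
import numpy as np

def simulate_gauss_poisson(T=200, seed=1):
    """Simulate T steps of model (12) with the parameter values in (13)."""
    rng = np.random.default_rng(seed)
    F = np.array([[0.20, 0.95], [0.90, 0.0]])
    Sigma = np.diag([0.1, 1.0])
    lam, b = 5.0, np.array([1.0, 0.0])
    chol = np.linalg.cholesky(Sigma)
    x = np.zeros((T, 2))
    y = np.zeros(T, dtype=int)
    x_prev = np.zeros(2)                            # illustrative initial state
    for t in range(T):
        x[t] = F @ x_prev + chol @ rng.standard_normal(2)
        y[t] = rng.poisson(np.exp(lam + b @ x[t]))  # observation depends on x_{1,t} only
        x_prev = x[t]
    return x, y
```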

A first experiment was carried out without resampling, in order to explore the degeneracy problem; the results are reported in Table 1 and Figure 2.

In a second experiment, the influence of resampling was explored. Using the same simulated data set, we ran the move-reweighting (MR) approach with multinomial resampling added, in order to compare the resample-move (RM) algorithm with the move-reweight alternatives. Tables 2 and 3 and Figure 3 report the main results using these methods with N = 50 and N = 5000 particles.


Table 2. Gauss-Poisson model with resampling. Comparison of the ordinary SMC algorithm with resampling at each time-point (O), resample-move (RM), move-reweighting with resampling at each time-point (MR1) and move-reweighting with resampling if ESS/N < 50% (MR2). Results are based on one simulation of data from model (12) with parameters given in (13) and an average of 1000 repetitions of the algorithms.

N     Algorithm   ESS(%)   CRPS1    CRPS2    RMSE1    RMSE2
50    O           27.84    0.2461   0.0966   0.0569   0.1451
      RM          27.86    0.2201   0.0966   0.0552   0.0931
      MR1         95.64    0.2083   0.0846   0.0403   0.0599
      MR2         71.04    0.2123   0.0860   0.0394   0.0548
5000  O
      RM
      MR1
      MR2

Table 3. Gauss-Poisson model with resampling. Comparison of the ordinary SMC algorithm with resampling at each time-point (O), resample-move (RM), move-reweighting with resampling at each time-point (MR1) and move-reweighting with resampling if ESS/N < 50% (MR2). Results are based on one simulation of data from model (12) with parameters given in (13) and an average of 1000 repetitions of the algorithms.

N     Algorithm   q10(x1)   q90(x1)   q10(x2)   q90(x2)
50    O           0.1569    0.0339    0.3120    0.3378
      RM          0.1467    0.0269    0.2620    0.3290
      MR1         0.0705    0.0169    0.0612    0.0359
      MR2         0.0684    0.0406    0.0612    0.0466
5000  O
      RM
      MR1
      MR2


Figure 2. Effective sample size for the algorithms described in Table 1. Left panel: absolute ESS (move-reweighting algorithm in solid line; resample-move algorithm and ordinary SMC in dashed and dotted lines). Right panel: the relative ESS, scaled by the ordinary ESS.

We are still writing this section, since the experiment with N = 5000 described in Tables 2 and 3 is currently running.

4.2. A low-dimensional non-linear dynamical system

Consider now the low-dimensional non-linear dynamical model
$$x_{1,t} = x_{1,t-1} + h x_{2,t-1} + \sigma_1 \varepsilon_{1,t}, \qquad (14)$$
$$x_{2,t} = x_{2,t-1} + h\alpha(1 - x_{1,t-1}^2)x_{2,t-1} - h x_{1,t-1} + \sigma_2 \varepsilon_{2,t}, \qquad (15)$$

where $\{\varepsilon_{l,t}\}$ are independent white noise processes with standard normal marginals. This model was used in Gao and Zhang (2012) as a discretized version (with $h$ as the step size) of the Van der Pol oscillator, described through a second-order differential equation. Deviating somewhat from Gao and Zhang (2012), we assume the observation model $y_t \mid x_{2,t} \sim \mathcal{N}(x_{2,t}, \sigma_3^2)$, that is, only the second component is observed. The static parameter vector is $\theta = (h, \alpha, \sigma_1^2, \sigma_2^2, \sigma_3^2)'$, for which in our simulations we use $h = 0.1$, $\alpha = 1$, $\sigma_1^2 = 5 \times 10^{-2}$, $\sigma_2^2 = 8 \times 10^{-3}$ and $\sigma_3^2 = 1 \times 10^{-3}$.

For the initial state value, we sample from a zero-mean Gaussian distribution with low variance; specifically, each state component is sampled independently from $\mathcal{N}(0, 0.01)$.
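A direct simulation of (14)-(15) together with the Gaussian observation model, using the parameter values and initial distribution given above, is sketched below (our own illustration):

```python
import numpy as np

def simulate_oscillator(T=200, seed=2):
    """Simulate the discretized Van der Pol system (14)-(15) with
    y_t | x_{2,t} ~ N(x_{2,t}, sigma3^2) and initial states drawn from N(0, 0.01)."""
    rng = np.random.default_rng(seed)
    h, alpha = 0.1, 1.0
    s1, s2, s3 = np.sqrt(5e-2), np.sqrt(8e-3), np.sqrt(1e-3)
    x = np.zeros((T, 2))
    y = np.zeros(T)
    x1, x2 = rng.normal(0.0, 0.1, size=2)              # x_0 components ~ N(0, 0.01)
    for t in range(T):
        # both updates use the previous (x1, x2), as in (14)-(15)
        x1, x2 = (x1 + h * x2 + s1 * rng.standard_normal(),
                  x2 + h * alpha * (1 - x1 ** 2) * x2 - h * x1
                  + s2 * rng.standard_normal())
        x[t] = (x1, x2)
        y[t] = x2 + s3 * rng.standard_normal()          # only x_{2,t} is observed
    return x, y
```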


Figure 3. Effective sample size for the model and algorithms described in Table 3. Left panel: ESS obtained by SMC with N = 50; right panel: with N = 5000 (move-reweighting algorithm in solid line; resample-move algorithm and ordinary SMC in dashed and dotted lines).

First, we propagate the particles by extending $x_{1:t-1}$ with $x_t = (x_{1,t}, x_{2,t})'$, drawn from $\pi(x_t \mid x_{t-1}, y_t) = \pi(x_{1,t} \mid x_{t-1})\,\pi(x_{2,t} \mid x_{t-1}, y_t)$, where both factors are Gaussian. In this situation we consider moves of $(x_{1,t-1}, x_{1,t})$. In particular, we assume $(x_{1,t-1}^\star, x_{1,t}^\star) \sim q(x_{1,t-1}^\star \mid x_{t-2}, x_{2,t})\,\pi(x_{1,t} \mid x_{1,t-1}^\star, x_{2,t-1})$, where $q(x_{1,t-1}^\star \mid x_{t-2}, x_{2,t})$ is a Laplace approximation of $\pi(x_{1,t-1}^\star \mid x_{t-2}, x_{2,t})$. Note that in this case our move kernel is not invariant with respect to the target distribution, making weight updating according to Proposition 3 necessary. In particular, we use $h = q(x_{1:t})$, in which case $w_t^\star = p(x_{1:t}^\star)/q(x_{1:t}^\star \mid x_{1:t})$.

Even though the particles are not rejuvenated via an MCMC kernel, the empirical results again suggest that by reweighting the particles we can gain some improvement in SMC performance. As shown in Figure ??, a move by an arbitrary kernel can also help to avoid cases where a few extreme weights dominate the entire sample, resulting in high variance or a lack of accuracy as measured by the RMSE. However, for this illustration and following our simulation strategy, we did not find a great difference in the effective sample size (on average, 33% for the move-reweighting algorithm and 30% for the ordinary approach).


Table 4. Oscillator model with resampling at each time-point. Comparison of the ordinary SMC algorithm (O) and move-reweighting (MR). Results are based on one simulation of data from the model described in equations (14)-(15) with an average of 1000 repetitions of the algorithms.

N     Algorithm   ESS(%)   CRPS1    CRPS2    RMSE1    RMSE2
50    O           59.91    0.1921   0.0184   0.1336   0.0069
      MR          61.64    0.1898   0.0185   0.1167   0.0067
500   O           58.73    0.1803   0.0177   0.0461   0.0022
      MR          59.38    0.1801   0.0177   0.0416   0.0025
5000  O           58.79    0.1790   0.0176   0.0171   0.0007
      MR          58.88    0.1790   0.0176   0.0147   0.0012

Table 5. Oscillator model with resampling at each time-point. RMSE for quantiles of the ordinary SMC algorithm (O) and move-reweighting (MR). Results are based on one simulation of data from the model described in equations (14)-(15) with an average of 1000 repetitions of the algorithms.

N     Algorithm   q10(x1)   q90(x1)   q10(x2)   q90(x2)
50    O           0.1745    0.1454    0.0159    0.0057
      MR          0.1469    0.0807    0.0160    0.0061
500   O           0.2012    0.1797    0.0160    0.0051
      MR          0.1352    0.0752    0.0159    0.0049
5000  O           0.2036    0.1886    0.0161    0.0052
      MR          0.1303    0.0747    0.0159    0.0048


5. Discussion

In this paper, we have combined MCMC moves with SMC methods, providing the possibility to update the weights so as to reweight the particles. We have also shown a more general case in which the particles are rejuvenated by a non-MCMC kernel but can still be reweighted. In both cases, our approach allows great flexibility in reweighting the samples using a proper weight with respect to the right distribution. We have pointed out that the traditional framework for adding MCMC moves within particle filter algorithms (Gilks and Berzuini, 2001) is encapsulated as a special case of our approach.

In our simulation studies, we demonstrate that the move-reweighting algorithm can successfully reduce weight degeneracy and, at the same time, increase sample diversity. We have also shown the benefits of our methodology for a small number of particles. In the first experiment, we showed clearly how the variance of the weights is affected when we update them after an MCMC move. In short, the move-reweighting approach is promising for attenuating the asymmetry in the distribution of the particle weights. As a result, we obtain a satisfactory increase in the effective sample size, although it is not straightforward to compare effective sample sizes across algorithms due to the resampling steps. In the second illustration we indicated some benefits, in particular a better description of the target distribution, when the particles are reweighted after a move with an arbitrary transition kernel.

Although we have assumed the static parameters to be known, our approach can either be integrated with, or offer an alternative for building particle proposals within, recent particle methods (Kantas et al., 2009; Andrieu et al., 2010; Chopin et al., 2012) for drawing samples from the joint distribution $p(x_{1:t}, \theta)$. Furthermore, advanced MCMC move approaches (e.g. the adaptive methods of Fearnhead and Taylor (2013)) can also be merged with the reweighting schemes presented here in order to construct efficient algorithms. Lastly, in an attempt to address the problems caused by the curse of dimensionality in SMC methods (Snyder et al., 2008), the move-reweighting strategies may be useful as algorithms for high-dimensional systems.

Acknowledgments

We gratefully acknowledge financial support from CAPES-Brazil and the Statistics for Innovation Centre, Norway.


A. Proofs

Proof of Proposition 2. As $g$ is a $p$-integrable function,
$$
\begin{aligned}
E[g(x_{1:t}^\star)w_t^\star(x_{1:t}; x_{1:t}^\star)]
&= \int_{\mathcal{X}^t \times \mathcal{X}^t} g(x_{1:t}^\star)\,w_t^\star(x_{1:t}; x_{1:t}^\star)\,q(x_{1:t})K(x_{1:t}^\star \mid x_{1:t})\,dx_{1:t}^\star\,dx_{1:t} \\
&= \int_{\mathcal{X}^t \times \mathcal{X}^t} g(x_{1:t}^\star)\,p(x_{1:t}^\star)\,h(x_{1:t} \mid x_{1:t}^\star)\,dx_{1:t}^\star\,dx_{1:t} \\
&= \int_{\mathcal{X}^t} g(x_{1:t}^\star)\,p(x_{1:t}^\star)\int_{\mathcal{X}^t} h(x_{1:t} \mid x_{1:t}^\star)\,dx_{1:t}\,dx_{1:t}^\star \\
&= \int_{\mathcal{X}^t} g(x_{1:t}^\star)\,p(x_{1:t}^\star)\,dx_{1:t}^\star = E_p[g(x_{1:t}^\star)].
\end{aligned}
$$
Further, using that the conditional sampling distribution for $x_{1:t}$ given $x_{1:t}^\star$ is $q(x_{1:t})K(x_{1:t}^\star \mid x_{1:t})/q^\star(x_{1:t}^\star)$, we have
$$
\begin{aligned}
E[w_t^\star(x_{1:t}; x_{1:t}^\star) \mid x_{1:t}^\star]
&= \int_{\mathcal{X}^t} w_t^\star(x_{1:t}; x_{1:t}^\star)\,\frac{q(x_{1:t})K(x_{1:t}^\star \mid x_{1:t})}{q^\star(x_{1:t}^\star)}\,dx_{1:t} \\
&= \frac{p(x_{1:t}^\star)}{q^\star(x_{1:t}^\star)}\int_{\mathcal{X}^t} h(x_{1:t} \mid x_{1:t}^\star)\,dx_{1:t} \\
&= \frac{p(x_{1:t}^\star)}{q^\star(x_{1:t}^\star)} = w_t^{\mathrm{opt}}(x_{1:t}^\star).
\end{aligned}
$$
Finally, $\operatorname{Var}[w_t^\star(x_{1:t}; x_{1:t}^\star)] \geq \operatorname{Var}[w_t^{\mathrm{opt}}(x_{1:t}^\star)]$ is a direct consequence of Proposition 1, choosing $v_1 = x_{1:t}^\star$ and $v_2 = x_{1:t}$. □


References

Andrieu, C., A. Doucet, and R. Holenstein (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72(3), 269–342.

Berzuini, C. and W. Gilks (2003). Particle filtering methods for dynamic and static Bayesian problems. Oxford Statistical Science Series, 207–227.

Candy, J. (2009). Bayesian signal processing: classical, modern, and particle filtering methods, Volume 54. Wiley-Interscience.

Cappe, O., S. Godsill, and E. Moulines (2007). An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE 95(5), 899–924.

Cappe, O., E. Moulines, and T. Ryden (2005). Inference in hidden Markov models. Springer Verlag.

Chen, R. and J. S. Liu (2000). Mixture Kalman filters. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62, 493–508.

Chopin, N. (2002). A sequential particle filter method for static models. Biometrika 89(3), 539–552.

Chopin, N. (2004). Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. The Annals of Statistics 32(6), 2385–2411.

Chopin, N., P. E. Jacob, and O. Papaspiliopoulos (2012). SMC2: an efficient algorithm for sequential analysis of state space models. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Del Moral, P. (2004). Feynman-Kac Formulae, Genealogical and Interacting Particle Systems with Applications. New York: Springer-Verlag.

Del Moral, P., A. Doucet, and A. Jasra (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(3), 411–436.

Doucet, A., M. Briers, and S. Sénécal (2006). Efficient block sampling strategies for sequential Monte Carlo methods. Journal of Computational and Graphical Statistics 15(3), 693–711.

Doucet, A., N. de Freitas, and N. Gordon (2001). Sequential Monte Carlo methods. Springer-Verlag.


Doucet, A., S. Godsill, and C. Andrieu (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 10(3), 197–208.

Doucet, A. and A. M. Johansen (2009). A tutorial on particle filtering and smoothing: fifteen years later. Handbook of Nonlinear Filtering 12, 656–704.

Fearnhead, P. (2002). Markov Chain Monte Carlo, sufficient statistics, and particle filters. Journal of Computational and Graphical Statistics 11(4), 848–862.

Fearnhead, P. (2008). Computational methods for complex stochastic systems: a review of some alternatives to MCMC. Statistics and Computing 18, 151–171.

Fearnhead, P. and B. Taylor (2013). An adaptive sequential Monte Carlo sampler. Bayesian Analysis 8(1), 1–28.

Frei, M. and H. Kunsch (2012). Bridging the ensemble Kalman and particle filter. arXiv preprint arXiv:1208.0463.

Gao, M. and H. Zhang (2012). Sequential Monte Carlo methods for parameter estimation in nonlinear state-space models. Computers & Geosciences 44, 70–77.

Gilks, W. and C. Berzuini (2001). Following a moving target – Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(1), 127–146.

Gilks, W., S. Richardson, and D. Spiegelhalter (1996). Markov chain Monte Carlo in practice. Chapman & Hall/CRC.

Gneiting, T. and A. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102(477), 359–378.

Gordon, N., D. Salmond, and A. Smith (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. Radar and Signal Processing, IEE Proceedings F 140(2), 107–113.

Green, P., N. Hjort, and S. Richardson (2003). Highly structured stochastic systems, Volume 10. Oxford University Press, Oxford (United Kingdom).


Kantas, N., A. Doucet, S. Singh, and J. Maciejowski (2009). An overview of sequential Monte Carlo methods for parameter estimation in general state-space models. In Proceedings of the IFAC Symposium on System Identification (SYSID).

Kong, A., J. S. Liu, and W. H. Wong (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association 89(425), 278–288.

Liang, F., C. Liu, and R. Carroll (2011). Advanced Markov chain Monte Carlo methods: learning from past samples, Volume 714. Wiley.

Lin, M., R. Chen, and J. S. Liu (2013). Lookahead strategies for sequential Monte Carlo. Statistical Science 28(1), 69–94.

Liu, J. S. (1996). Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing 6(2), 113–119.

Liu, J. S. (2001). Monte Carlo strategies in scientific computing. Springer.

Liu, J. S. and R. Chen (1995). Blind deconvolution via sequential imputations. Journal of the American Statistical Association 90(430), 567–576.

Liu, J. S. and R. Chen (1998). Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association 93(443), 1032–1044.

MacEachern, S. N., M. Clyde, and J. S. Liu (1999). Sequential importance sampling for nonparametric Bayes models: The next generation. Canadian Journal of Statistics 27(2), 251–267.

Malefaki, S. and G. Iliopoulos (2008). On convergence of properly weighted samples to the target distribution. Journal of Statistical Planning and Inference 138(4), 1210–1225.

Murphy, K. P. (2012). Machine Learning: a probabilistic perspective. The MIT Press.

Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing 11(2), 125–139.

Pitt, M., R. Silva, P. Giordani, and R. Kohn (2010). Auxiliary particle filtering within adaptive Metropolis-Hastings sampling. arXiv preprint arXiv:1006.1914.


Pitt, M. K. and N. Shephard (1999). Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association 94(446), 590–599.

Polson, N. G., J. R. Stroud, and P. Muller (2008). Practical filtering with sequential parameter learning. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(2), 413–428.

Poyiadjis, G., A. Doucet, and S. Singh (2011). Particle approximations of the score and observed information matrix in state space models with application to parameter estimation. Biometrika 98(1), 65–80.

Robert, C. and G. Casella (2004). Monte Carlo statistical methods. Springer Verlag.

Shephard, N. and A. Doucet (2012). Robust inference on parameters via particle filters and sandwich covariance matrices. In Nuffield College Economics Working Papers. University of Oxford.

Snyder, C., T. Bengtsson, P. Bickel, and J. Anderson (2008). Obstacles to high-dimensional particle filtering. Monthly Weather Review 136(12), 4629–4640.

Storvik, G. (2002). Particle filters for state-space models with the presence of unknown static parameters. Signal Processing, IEEE Transactions on 50(2), 281–289.

Storvik, G. (2011). On the flexibility of Metropolis–Hastings acceptance probabilities in auxiliary variable proposal generation. Scandinavian Journal of Statistics 38(2), 342–358.

Tokdar, S. and R. Kass (2010). Importance sampling: a review. Wiley Interdisciplinary Reviews: Computational Statistics 2(1), 54–60.

Whiteley, N. and A. Lee (2012). Twisted particle filters. arXiv preprint arXiv:1210.0220.

Zucchini, W. and I. MacDonald (2009). Hidden Markov models for time series: an introduction using R, Volume 110. Chapman & Hall/CRC.