Adaptive Methods for Sequential Importance Sampling

Stat Comput (2008) 18: 461–480
DOI 10.1007/s11222-008-9089-4

Adaptive methods for sequential importance sampling with application to state space models

Julien Cornebise · Éric Moulines · Jimmy Olsson

Received: 28 February 2008 / Accepted: 11 July 2008 / Published online: 15 August 2008
© Springer Science+Business Media, LLC 2008

Abstract In this paper we discuss new adaptive proposal strategies for sequential Monte Carlo algorithms, also known as particle filters, relying on criteria evaluating the quality of the proposed particles. The choice of the proposal distribution is a major concern and can dramatically influence the quality of the estimates. Thus, we show how the long-used coefficient of variation (suggested by Kong et al. in J. Am. Stat. Assoc. 89:278–288, 1994) of the weights can be used for estimating the chi-square distance between the target and instrumental distributions of the auxiliary particle filter. As a by-product of this analysis we obtain an auxiliary adjustment multiplier weight type for which this chi-square distance is minimal. Moreover, we establish an empirical estimate of linear complexity of the Kullback-Leibler divergence between the involved distributions. Guided by these results, we discuss adaptive designing of the particle filter proposal distribution and illustrate the methods on a numerical example.

Keywords Adaptive Monte Carlo · Auxiliary particle filter · Coefficient of variation · Kullback-Leibler divergence · Cross-entropy method · Sequential Monte Carlo · State space models

This work was partly supported by the National Research Agency (ANR) under the program ANR-05-BLAN-0299.

J. Cornebise (✉) · É. Moulines
Institut des Télécoms, Télécom ParisTech, 46 Rue Barrault, 75634 Paris Cedex 13, France
e-mail: [email protected]

É. Moulines
e-mail: [email protected]

J. Olsson
Center of Mathematical Sciences, Lund University, Box 118, SE-22100 Lund, Sweden
e-mail: [email protected]

    1 Introduction

Easing the role of the user by tuning automatically the key parameters of sequential Monte Carlo (SMC) algorithms has been a long-standing topic in the community, notably through adaptation of the particle sample size or the way the particles are sampled and weighted. In this paper we focus on the latter issue and develop methods for adjusting adaptively the proposal distribution of the particle filter.

Adaptation of the number of particles has been treated by several authors. In Legland and Oudjane (2006) (and later Hu et al. 2008, Sect. IV) the size of the particle sample is increased until the total weight mass reaches a positive threshold, avoiding a situation where all particles are located in regions of the state space having zero posterior probability. Fearnhead and Liu (2007, Sect. 3.2) adjust the size of the particle cloud in order to control the error introduced by the resampling step. Another approach, suggested by Fox (2003) and refined in Soto (2005) and Straka and Simandl (2006), consists of increasing the sample size until the Kullback-Leibler divergence (KLD) between the true and estimated target distributions is below a given threshold.

Unarguably, setting an appropriate sample size is a key ingredient of any statistical estimation procedure, and there are cases where the methods mentioned above may be used for designing satisfactorily this size; however increasing the sample size only is far from being always sufficient for achieving efficient variance reduction. Indeed, as in any algorithm based on importance sampling, a significant discrepancy between the proposal and target distributions may


require an unreasonably large number of samples for decreasing the variance of the estimate under a specified value. For a very simple illustration, consider importance sampling estimation of the mean m of a normal distribution using as importance distribution another normal distribution having zero mean and same variance: in this case, the variance of the estimate grows like exp(m²)/N, N denoting the number of draws, implying that the sample size required for ensuring a given variance grows exponentially fast with m.
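This blow-up is easy to reproduce numerically. The sketch below is a toy experiment (not from the paper): it estimates the mean m of N(m, 1) by self-normalised importance sampling from a N(0, 1) proposal and shows the spread of the estimate exploding as m grows.

```python
import numpy as np

def is_mean_estimate(m, n_samples, rng):
    """Self-normalised IS estimate of the mean of N(m, 1),
    using N(0, 1) as the proposal distribution."""
    x = rng.standard_normal(n_samples)      # draws from the proposal q = N(0, 1)
    log_w = m * x - 0.5 * m**2              # log p(x)/q(x) for p = N(m, 1)
    w = np.exp(log_w - log_w.max())         # stabilised unnormalised weights
    return np.sum(w * x) / np.sum(w)

rng = np.random.default_rng(0)
for m in (0.5, 2.0, 4.0):
    est = [is_mean_estimate(m, 10_000, rng) for _ in range(200)]
    print(f"m={m}: empirical std of the estimate = {np.std(est):.3f}")
```

The printed spread deteriorates sharply as m increases, mirroring the exp(m²)/N growth of the theoretical variance.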

This points to the need for adapting the importance distribution of the particle filter, e.g., by adjusting at each iteration the particle weights and the proposal distributions; see e.g. Doucet et al. (2000), Liu (2004), and Fearnhead (2008) for reviews of various filtering methods. These two quantities are critically important, since the performance of the particle filter is closely related to the ability of proposing particles in state space regions where the posterior is significant. It is well known that sampling using as proposal distribution the mixture composed by the current particle importance weights and the prior kernel (yielding the classical bootstrap particle filter of Gordon et al. 1993) is usually inefficient when the likelihood is highly peaked or located in the tail of the prior.

In the sequential context, the successive distributions to be approximated (e.g. the successive filtering distributions) are the iterates of a nonlinear random mapping, defined on the space of probability measures; this nonlinear mapping may in general be decomposed into two steps: a prediction step which is linear and a nonlinear correction step which amounts to compute a normalisation factor. In this setting, an appealing way to update the current particle approximation consists of sampling new particles from the distribution obtained by propagating the current particle approximation through this mapping; see e.g. Hürzeler and Künsch (1998), Doucet et al. (2000), and Künsch (2005) (and the references therein). This sampling distribution guarantees that the conditional variance of the importance weights is equal to zero. As we shall see below, this proposal distribution enjoys other optimality conditions, and is in the sequel referred to as the optimal sampling distribution. However, sampling from the optimal sampling distribution is, except for some specific models, a difficult and time-consuming task (the in general costly auxiliary accept-reject developed and analysed by Künsch (2005) being most often the only available option).

To circumvent this difficulty, several sub-optimal schemes

have been proposed. A first type of approaches tries to mimic the behavior of the optimal sampling without suffering the sometimes prohibitive cost of rejection sampling. This typically involves localisation of the modes of the unnormalised optimal sampling distribution by means of some optimisation algorithm, and the fitting of over-dispersed Student's t-distributions around these modes; see for example Shephard and Pitt (1997), Doucet et al. (2001), and Liu (2004) (and the references therein). Except in specific cases, locating the modes involves solving an optimisation problem for every particle, which is quite time-consuming.

A second class of approaches consists of using some classical approximate non-linear filtering tools such as the extended Kalman filter (EKF) or the unscented transform Kalman filter (UT/UKF); see for example Doucet et al. (2001) and the references therein. These techniques assume implicitly that the conditional distribution of the next state given the current state and the observation has a single mode. In the EKF version of the particle filter, the linearisation of the state and observation equations is carried out for each individual particle. Instead of linearising the state and observation dynamics using Jacobian matrices, the UT/UKF particle filter uses a deterministic sampling strategy to capture the mean and covariance with a small set of carefully selected points (sigma points), which is also computed for each particle. Since these computations are most often rather involved, a significant computational overhead is introduced.

A third class of techniques is the so-called auxiliary particle filter (APF) suggested by Pitt and Shephard (1999), who proposed it as a way to build data-driven proposal distributions (with the initial aim of robustifying standard SMC methods to the presence of outlying observations); see e.g. Fearnhead (2008). The procedure comprises two stages: in the first stage, the current particle weights are modified in order to select preferentially those particles being most likely proposed in regions where the posterior is significant. Usually this amounts to multiplying the weights with so-called adjustment multiplier weights, which may depend on the next observation as well as the current position of the particle and (possibly) the proposal transition kernels. Most often, this adjustment weight is chosen to estimate the predictive likelihood of the next observation given the current particle position, but this choice is not necessarily optimal.

In a second stage, a new particle sample from the target distribution is formed using this proposal distribution and associating the proposed particles with weights proportional to the inverse of the adjustment multiplier weight.¹ APF procedures are known to be rather successful when the first-stage distribution is appropriately chosen, which is not always straightforward. The additional computational cost depends mainly on the way the first-stage proposal is designed. The APF method can be mixed with EKF and UKF leading

¹The original APF proposed by Pitt and Shephard (1999) features a second resampling procedure in order to end up with an equally weighted particle system. This resampling procedure might however severely reduce the accuracy of the filter: Carpenter et al. (1999) give an example where the accuracy is reduced by a factor of 2; see also Douc et al. (2008) for a theoretical proof.


to powerful but computationally involved particle filters; see, e.g., Andrieu et al. (2003).

None of the suboptimal methods mentioned above minimise any sensible risk-theoretic criterion and, more annoyingly, both theoretical and practical evidence shows that choices which seem to be intuitively correct may lead to performances even worse than that of the plain bootstrap filter (see for example Douc et al. 2008 for a striking example). The situation is even more unsatisfactory when the particle filter is driven by a state space dynamic different from that generating the observations, as happens frequently when, e.g., the parameters are not known and need to be estimated or when the model is misspecified.

Instead of trying to guess what a good proposal distribution should be, it seems sensible to follow a more risk-theoretically founded approach. The first step in such a construction consists of choosing a sensible risk criterion, which is not a straightforward task in the SMC context. A natural criterion for SMC would be the variance of the estimate of the posterior mean of a target function (or a set of target functions) of interest, but this approach does not lead to a practical implementation for two reasons. Firstly, in SMC methods, though a closed-form expression for the variance at any given current time-step of the posterior mean of any function is available, this variance depends explicitly on all the time steps before the current time. Hence, choosing to minimise the variance at a given time-step would require optimising all the simulations up to that particular time step, which is of course not practical. Because of the recursive form of the variance, the minimisation of the conditional variance at each iteration of the algorithm does not necessarily lead to satisfactory performance in the long run. Secondly, as for the standard importance sampling algorithm, this criterion is not function-free, meaning that a choice of a proposal can be appropriate for a given function, but inappropriate for another.

We will focus in the sequel on function-free risk criteria. A first criterion, advocated in Kong et al. (1994) and Liu (2004), is the chi-square distance (CSD) between the proposal and the target distributions, which coincides with the coefficient of variation (CV²) of the importance weights. In addition, as heuristically discussed in Kong et al. (1994), the CSD is related to the effective sample size, which estimates the number of i.i.d. samples equivalent to the weighted particle system.² In practice, the CSD criterion can be estimated, with a complexity that grows linearly with the number of particles, using the empirical CV², which can be shown to converge to the CSD as the number of particles tends to infinity. In this paper we show that a similar property

²In some situations, the estimated ESS value can be misleading: see the comments of Stephens and Donnelly (2000) for a further discussion of this.

still holds in the SMC context, in the sense that the CV² still measures a CSD between two distributions associated with the proposal and target distributions of the particle filter (see Theorem 1(ii)). Though this result does not come as a surprise, it provides an additional theoretical footing to an approach which is currently used in practice for triggering resampling steps.

Another function-free risk criterion to assess the performance of importance sampling estimators is the KLD between the proposal and the target distributions; see Cappé et al. (2005, Chap. 7). The KLD shares some of the attractive properties of the CSD; in particular, the KLD may be estimated using the negated empirical entropy E of the importance weights, whose computational complexity is again linear in the number of particles. In the SMC context, it is shown in Theorem 1(i) that E still converges to the KLD between the same two distributions associated with the proposal and the target distributions of the particle filter.

Our methodology to design appropriate proposal distributions is based upon the minimisation of the CSD and KLD between the proposal and the target distributions. Whereas these quantities (and especially the CSD) have been routinely used to detect sample impoverishment and trigger the resampling step (Kong et al. 1994), they have not been used for adapting the simulation parameters in SMC methods.

We focus here on the auxiliary sampling formulation of the particle filter. In this setting, there are two quantities to optimise: the adjustment multiplier weights (also called first-stage weights) and the parameters of the proposal kernel; together these quantities define the mixture used as instrumental distribution in the filter. We first establish a closed-form expression for the limiting value of the CSD and KLD of the auxiliary formulation of the proposal and the target distributions. Using these expressions, we identify a type of auxiliary SMC adjustment multiplier weights which minimise the CSD and the KLD for a given proposal kernel (Proposition 2). We then propose several optimisation techniques for adapting the proposal kernels, always driven by the objective of minimising the CSD or the KLD, in coherence with what is done to detect sample impoverishment (see Sect. 5). Finally, in the implementation section (Sect. 6), we use the proposed algorithms for approximating the filtering distributions in several state space models, and show that the proposed optimisation procedure improves the accuracy of the particle estimates and makes them more robust to outlying observations.

    2 Informal presentation of the results

    2.1 Adaptive importance sampling

    Before stating and proving rigorously the main results, wediscuss informally our findings and introduce the proposed


methodology for developing adaptive SMC algorithms. Before entering into the sophistication of sequential methods, we first briefly introduce adaptation of the standard (non-sequential) importance sampling algorithm.

Importance sampling (IS) is a general technique to compute expectations of functions with respect to a target distribution with density p(x) while only having samples generated from a different distribution, referred to as the proposal distribution, with density q(x) (implicitly, the dominating measure is taken to be the Lebesgue measure on ℝ^d). We sample {ξ_i}_{i=1}^N from the proposal distribution q and compute the unnormalised importance weights ω_i ≜ W(ξ_i), i = 1, …, N, where W(x) ≜ p(x)/q(x). For any function f, the self-normalised importance sampling estimator may be expressed as

$$\hat{\mu}_N^{\mathrm{IS}}(f) \triangleq \Omega_N^{-1} \sum_{i=1}^{N} \omega_i f(\xi_i), \quad \text{where } \Omega_N \triangleq \sum_{j=1}^{N} \omega_j.$$

As usual in applications of the IS methodology to Bayesian inference, the target density p is known only up to a normalisation constant; hence we will focus only on a self-normalised version of IS that solely requires the availability of an unnormalised version of p (see Geweke 1989). Throughout the paper, we call a set {ξ_i}_{i=1}^N of random variables, referred to as particles and taking values in Ξ, together with nonnegative weights {ω_i}_{i=1}^N, a weighted sample on Ξ. Here N is a (possibly random) integer, though we will take it fixed in the sequel. It is well known (see again Geweke 1989) that, provided that f is integrable with respect to p, i.e. ∫ |f(x)| p(x) dx < ∞, the estimator converges, as the number of samples tends to infinity, to the target value

$$\mathbb{E}_p[f(X)] \triangleq \int f(x)\, p(x)\,dx,$$

for any function f ∈ C, where C is the set of functions which are integrable with respect to the target distribution p. Under some additional technical conditions, this estimator is also asymptotically normal at rate √N; see Geweke (1989).
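For concreteness, the self-normalised estimator can be coded in a few lines; the log-domain stabilisation below is a standard implementation precaution, not part of the definition.

```python
import numpy as np

def self_normalised_is(f, log_w, xi):
    """Self-normalised IS estimate of E_p[f(X)] from particles xi drawn
    from q, with unnormalised log-weights log_w = log p - log q + const."""
    w = np.exp(log_w - np.max(log_w))     # stabilise before exponentiating
    return np.sum(w * f(xi)) / np.sum(w)

# Toy check: target p = N(1, 1), proposal q = N(0, 2^2).
rng = np.random.default_rng(0)
xi = 2.0 * rng.standard_normal(50_000)
log_w = -0.5 * (xi - 1.0) ** 2 + 0.5 * (xi / 2.0) ** 2 + np.log(2.0)
print(self_normalised_is(lambda x: x, log_w, xi))   # close to E_p[X] = 1
```

Note that only ratios of weights matter: any additive constant in log_w cancels in the self-normalisation.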

It is well known that IS estimators are sensitive to the choice of the proposal distribution. A classical approach consists of trying to minimise the asymptotic variance with respect to the proposal distribution q. This optimisation is in closed form and leads (when f is a non-negative function) to the optimal choice $q^*(x) = f(x)p(x) / \int f(x)p(x)\,dx$, which is, since the normalisation constant is precisely the quantity of interest, rather impractical. Sampling from this distribution can be done by using an accept-reject algorithm, but this does not solve the problem of choosing an appropriate proposal distribution. Note that it is possible to approach this optimal sampling distribution by using the cross-entropy method; see Rubinstein and Kroese (2004) and de Boer et al. (2005) and the references therein. We will discuss this point later on.

For reasons that will become clear in the sequel, this type of objective is impractical in the sequential context, since the expression of the asymptotic variance in this case is recursive and the optimisation of the variance at a given step is impossible. In addition, in most applications, the proposal density is expected to perform well for a range of typical functions of interest rather than for a specific target function f. We are thus looking for function-free criteria. The most often used criterion is the CSD between the proposal distribution q and the target distribution p, defined as

$$d_{\chi^2}(p \,\|\, q) \triangleq \int \frac{\{p(x) - q(x)\}^2}{q(x)}\,dx, \tag{2.1}$$
$$= \int W^2(x)\, q(x)\,dx - 1, \tag{2.2}$$
$$= \int W(x)\, p(x)\,dx - 1. \tag{2.3}$$

The CSD between p and q may be expressed as the variance of the importance weight function W under the proposal distribution, i.e.

$$d_{\chi^2}(p \,\|\, q) = \mathrm{Var}_q[W(X)].$$

This quantity can be estimated by computing the squared coefficient of variation of the unnormalised weights (Evans and Swartz 1995, Sect. 4):

$$\mathrm{CV}^2\left(\{\omega_i\}_{i=1}^N\right) \triangleq N\, \Omega_N^{-2} \sum_{i=1}^{N} \omega_i^2 - 1. \tag{2.4}$$

The CV² was suggested by Kong et al. (1994) as a means for detecting weight degeneracy. If all the weights are equal, then CV² is equal to zero. On the other hand, if all the weights but one are zero, then the coefficient of variation is equal to N − 1, which is its maximum value. From this it follows that using the estimated coefficient of variation for assessing accuracy is equivalent to examining the normalised importance weights to determine if any are relatively large.³

Kong et al. (1994) showed that the coefficient of variation of the weights CV²({ω_i}_{i=1}^N) is related to the effective sample size (ESS), which is used for measuring the overall efficiency of an IS algorithm:

$$N^{-1}\,\mathrm{ESS}\left(\{\omega_i\}_{i=1}^N\right) \triangleq \frac{1}{1 + \mathrm{CV}^2\left(\{\omega_i\}_{i=1}^N\right)} \approx \left[1 + d_{\chi^2}(p \,\|\, q)\right]^{-1}.$$

Heuristically, the ESS measures the number of i.i.d. samples (from p) equivalent to the N weighted samples. The smaller the CSD between the proposal and target distributions is,

³Some care should be taken for small sample sizes N; the CV² can be low because q samples only over a subregion where the integrand is nearly constant, which is not always easy to detect.


    the larger is the ESS. This is why the CSD is of particularinterest when measuring efficiency of IS algorithms.
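The estimator (2.4) and the associated ESS are straightforward to implement; the sketch below checks the two extreme cases discussed above (equal weights, and a single non-zero weight).

```python
import numpy as np

def cv2(weights):
    """Squared coefficient of variation (2.4) of unnormalised weights."""
    w = np.asarray(weights, dtype=float)
    n = w.size
    return n * np.sum(w**2) / np.sum(w)**2 - 1.0

def ess(weights):
    """Effective sample size: N / (1 + CV^2)."""
    return len(weights) / (1.0 + cv2(weights))

# Extreme cases discussed in the text:
print(cv2([1.0, 1.0, 1.0, 1.0]))   # equal weights -> 0.0
print(cv2([0.0, 0.0, 0.0, 5.0]))   # single non-zero weight -> N - 1 = 3.0
print(ess([1.0, 1.0, 1.0, 1.0]))   # -> 4.0 (all particles effective)
```

Because only weight ratios enter (2.4), rescaling all the weights by a common constant leaves both quantities unchanged.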

Another possible measure of fit of the proposal distribution is the KLD (also called relative entropy) between the proposal and target distributions, defined as

$$d_{\mathrm{KL}}(p \,\|\, q) \triangleq \int p(x) \log \frac{p(x)}{q(x)}\,dx, \tag{2.5}$$
$$= \int p(x) \log W(x)\,dx, \tag{2.6}$$
$$= \int W(x) \log[W(x)]\, q(x)\,dx. \tag{2.7}$$

This criterion can be estimated from the importance weights using the negated Shannon entropy E of the importance weights:

$$\mathcal{E}\left(\{\omega_i\}_{i=1}^N\right) \triangleq \Omega_N^{-1} \sum_{i=1}^{N} \omega_i \log\left(N\, \Omega_N^{-1} \omega_i\right). \tag{2.8}$$

The Shannon entropy is maximal when all the weights are equal and minimal when all weights are zero but one. In IS (and especially for the estimation of rare events), the KLD between the proposal and target distributions was thoroughly investigated by Rubinstein and Kroese (2004), and is central in the cross-entropy (CE) methodology.
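A minimal implementation of (2.8), mirroring the CV² estimator; the convention 0 · log 0 = 0 is an implementation choice.

```python
import numpy as np

def negated_entropy(weights):
    """Negated Shannon entropy (2.8) of the normalised importance weights:
    0 for equal weights, log(N) when a single weight carries all the mass."""
    w = np.asarray(weights, dtype=float)
    wn = w / w.sum()                  # normalised weights omega_i / Omega_N
    nz = wn > 0                       # convention: 0 * log(0) = 0
    return np.sum(wn[nz] * np.log(len(w) * wn[nz]))

print(negated_entropy([1.0, 1.0, 1.0, 1.0]))   # equal weights -> 0.0
print(negated_entropy([0.0, 0.0, 0.0, 5.0]))   # degenerate -> log(4)
```

The two printed extremes match the degeneracy behaviour described in the text.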

Classically, the proposal is chosen from a family of densities q_θ parameterised by θ. Here θ should be thought of as an element of Θ, which is a subset of ℝ^k. The most classical example is the family of Student's t-distributions parameterised by mean and covariance. More sophisticated parameterisations, like mixtures of multi-dimensional Gaussian or Student's t-distributions, have been proposed; see, e.g., Oh and Berger (1992, 1993), Evans and Swartz (1995), Givens and Raftery (1996), Liu (2004, Chap. 2, Sect. 2.6), and, more recently, Cappé et al. (2008) in this issue. In the sequential context, where computational efficiency is a must, we typically use rather simple parameterisations, so that the two criteria above can be (approximately) solved in a few iterations of a numerical minimisation procedure.

The optimal parameters for the CSD and the KLD are those minimising θ ↦ d_χ²(p ∥ q_θ) and θ ↦ d_KL(p ∥ q_θ), respectively. In the sequel, we denote by θ*_CSD and θ*_KLD these optimal values. Of course, these quantities cannot be computed in closed form (recall that even the normalisation constant of p is most often unknown; even if it is known, the evaluation of these quantities would involve the evaluation of most often high-dimensional integrals). Nevertheless, it is possible to construct consistent estimators of these optimal parameters. There are two classes of methods, detailed below.

The first uses the fact that the CSD d_χ²(p ∥ q_θ) and the KLD d_KL(p ∥ q_θ) may be approximated by (2.4) and (2.8), substituting in these expressions the importance weights by ω_i = W_θ(ξ_i), i = 1, …, N, where W_θ ∝ p/q_θ and {ξ_i}_{i=1}^N is a sample from q_θ. This optimisation problem formally shares some similarities with classical minimum chi-square or maximum likelihood estimation, but with the following important difference: the integrations in (2.1) and (2.5) are with respect to the proposal distribution q_θ and not the target distribution p. As a consequence, the particles {ξ_i}_{i=1}^N in the definition of the coefficient of variation (2.4) or the entropy (2.8) of the weights constitute a sample from q_θ and not from the target distribution p. As the estimation progresses, the samples used to approach the limiting CSD or KLD can, in contrast to standard estimation procedures, be updated (these samples could be kept fixed, but this is of course inefficient).

The computational complexity of these optimisation problems depends on the way the proposal is parameterised and how the optimisation procedure is implemented. Though the details of the optimisation procedure are in general strongly model dependent, some common principles for solving this optimisation problem can be outlined. Typically, the optimisation is done recursively, i.e. the algorithm defines a sequence θ_ℓ, ℓ = 0, 1, …, of parameters, where ℓ is the iteration number. At each iteration, the value of θ_ℓ is updated by computing a direction p_{ℓ+1} in which to step, a step length γ_{ℓ+1}, and setting

$$\theta_{\ell+1} = \theta_\ell + \gamma_{\ell+1}\, p_{\ell+1}.$$

The search direction is typically computed using either a Monte Carlo approximation of the finite difference or (when the quantities of interest are sufficiently regular) the gradient of the criterion. These quantities are used later in conjunction with classical optimisation strategies for computing the step size γ_{ℓ+1} or normalising the search direction. These implementation issues, detailed in Sect. 6, are model dependent. We denote by M_ℓ the number of particles used to obtain such an approximation at iteration ℓ. The number of particles may vary with the iteration index; heuristically there is no need for using a large number of simulations during the initial stage of the optimisation. Even a rather crude estimation of the search direction might suffice to drive the parameters towards the region of interest. However, as the iterations go on, the number of simulations should be increased to avoid zig-zagging when the algorithm approaches convergence. After L iterations, the total number of generated particles is equal to N = Σ_{ℓ=1}^{L} M_ℓ. Another solution, which is not considered in this paper, would be to use a stochastic approximation procedure, which consists of fixing M_ℓ = M and letting the stepsize γ_ℓ tend to zero. This appealing solution has been successfully used in Arouna (2004).

The computation of the finite difference or the gradient, these being defined as expectations of functions depending on θ, can be performed using two different approaches. Starting


from definitions (2.3) and (2.6), and assuming appropriate regularity conditions, the gradients of d_χ²(p ∥ q_θ) and d_KL(p ∥ q_θ) may be expressed as

$$G_{\mathrm{CSD}}(\theta) \triangleq \nabla_\theta\, d_{\chi^2}(p \,\|\, q_\theta) = \int p(x)\, \nabla_\theta W_\theta(x)\,dx = \int q_\theta(x)\, W_\theta(x)\, \nabla_\theta W_\theta(x)\,dx, \tag{2.9}$$

$$G_{\mathrm{KLD}}(\theta) \triangleq \nabla_\theta\, d_{\mathrm{KL}}(p \,\|\, q_\theta) = \int p(x)\, \nabla_\theta \log[W_\theta(x)]\,dx = \int q_\theta(x)\, \nabla_\theta W_\theta(x)\,dx. \tag{2.10}$$

These expressions lead immediately to the following approximations,

$$\hat{G}_{\mathrm{CSD}}(\theta) = M^{-1} \sum_{i=1}^{M} W_\theta(\xi_i)\, \nabla_\theta W_\theta(\xi_i), \tag{2.11}$$

$$\hat{G}_{\mathrm{KLD}}(\theta) = M^{-1} \sum_{i=1}^{M} \nabla_\theta W_\theta(\xi_i), \tag{2.12}$$

where {ξ_i}_{i=1}^M is a sample from q_θ.
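To make the recursion concrete, here is a toy sketch for a hypothetical one-dimensional example: the target is N(3, 1) known only up to a constant, and the proposal family is q_θ = N(θ, 1). Because p is unnormalised, the code uses a self-normalised variant of the gradient estimator (2.12); the model, step size, and schedule for M_ℓ are all illustrative assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3.0                                        # hypothetical target p = N(m, 1)

def log_p(x):                                  # unnormalised log target
    return -0.5 * (x - m) ** 2

def grad_kld_estimate(theta, M):
    """Self-normalised Monte Carlo estimate of the KLD gradient (2.12)
    for the location family q_theta = N(theta, 1)."""
    xi = theta + rng.standard_normal(M)            # sample from q_theta
    log_w = log_p(xi) + 0.5 * (xi - theta) ** 2    # log W_theta up to a constant
    w = np.exp(log_w - log_w.max())                # stabilised weights
    # grad_theta W_theta = -W_theta * grad_theta log q_theta = -W_theta * (xi - theta)
    return -np.sum(w * (xi - theta)) / np.sum(w)

theta, gamma = 0.0, 0.5
for ell in range(50):
    M_ell = 100 + 20 * ell                     # grow the sample size over iterations
    theta = theta - gamma * grad_kld_estimate(theta, M_ell)
print(theta)                                   # drifts towards the target mean m = 3.0
```

Here the KLD between N(m, 1) and N(θ, 1) is minimised at θ = m, so the recursion should settle near 3.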

There is another way to compute derivatives, which shares some similarities with pathwise derivative estimates. Recall that for any θ, one may choose F_θ so that the random variable F_θ(ε), where ε is a vector of independent uniform random variables on [0, 1]^d, is distributed according to q_θ. Therefore, we may express d_χ²(p ∥ q_θ) and d_KL(p ∥ q_θ) as the following integrals,

$$d_{\chi^2}(p \,\|\, q_\theta) = \int_{[0,1]^d} w_\theta^2(x)\,dx - 1,$$
$$d_{\mathrm{KL}}(p \,\|\, q_\theta) = \int_{[0,1]^d} w_\theta(x) \log[w_\theta(x)]\,dx,$$

where w_θ(x) ≜ W_θ ∘ F_θ(x). Assuming appropriate regularity conditions (i.e. that θ ↦ W_θ ∘ F_θ(x) is differentiable and that we can interchange the integration and the differentiation), the differentials of these quantities with respect to θ may be expressed as

$$G_{\mathrm{CSD}}(\theta) = \int_{[0,1]^d} \nabla_\theta\, w_\theta^2(x)\,dx,$$
$$G_{\mathrm{KLD}}(\theta) = \int_{[0,1]^d} \left\{ \nabla_\theta w_\theta(x) \log[w_\theta(x)] + \nabla_\theta w_\theta(x) \right\} dx.$$

For any given x, the quantity ∇_θ w_θ(x) is the pathwise derivative of the function θ ↦ w_θ(x). As a practical matter, we usually think of each x as a realization of the output of an ideal random generator. Each w_θ(x) is then the output of the simulation algorithm at parameter θ for the random number x. Each ∇_θ w_θ(x) is the derivative of the simulation output with respect to θ with the random numbers held fixed.

These two expressions, which of course coincide with (2.9) and (2.10), lead to the following estimators,

$$\hat{G}_{\mathrm{CSD}}(\theta) = M^{-1} \sum_{i=1}^{M} \nabla_\theta\, w_\theta^2(\epsilon_i),$$
$$\hat{G}_{\mathrm{KLD}}(\theta) = M^{-1} \sum_{i=1}^{M} \left\{ \nabla_\theta w_\theta(\epsilon_i) \log[w_\theta(\epsilon_i)] + \nabla_\theta w_\theta(\epsilon_i) \right\},$$

where each element of the sequence {ε_i}_{i=1}^M is a vector on [0, 1]^d of independent uniform random variables. It is worthwhile to note that if the number M_ℓ = M is kept fixed during the iterations and the uniforms {ε_i}_{i=1}^M are drawn once and for all (i.e. the same uniforms are used at the different iterations), then the iterative algorithm outlined above solves the following problems:

$$\min_\theta\, \mathrm{CV}^2\left(\{w_\theta(\epsilon_i)\}_{i=1}^M\right), \tag{2.13}$$
$$\min_\theta\, \mathcal{E}\left(\{w_\theta(\epsilon_i)\}_{i=1}^M\right). \tag{2.14}$$

From a theoretical standpoint, this optimisation problem is very similar to M-estimation, and convergence results for M-estimators can thus be used under rather standard technical assumptions; see for example Van der Vaart (1998). This is the main advantage of fixing the sample {ε_i}_{i=1}^M. We use this implementation in the simulations.
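As an illustration of (2.13), the sketch below fixes a set of uniforms once and for all, maps them through F_θ for a hypothetical Gaussian location family q_θ = N(θ, 1) with target N(2, 1), and minimises the resulting deterministic CV² objective by a simple grid search (all model choices are illustrative assumptions; a gradient scheme would replace the grid search in practice).

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
m = 2.0                                   # hypothetical target p = N(m, 1)
eps = rng.uniform(size=500)               # uniforms drawn once and for all
z = np.array([NormalDist().inv_cdf(u) for u in eps])   # standard normal quantiles

def w(theta):
    """w_theta(eps) = W_theta(F_theta(eps)), with F_theta(eps) = theta + Phi^{-1}(eps)."""
    x = theta + z
    return np.exp(-0.5 * (x - m) ** 2 + 0.5 * (x - theta) ** 2)

def cv2(vals):
    """Squared coefficient of variation (2.4) of the values w_theta(eps_i)."""
    n = len(vals)
    return n * np.sum(vals**2) / np.sum(vals)**2 - 1.0

# Deterministic objective (2.13): with the uniforms held fixed, CV^2 is an
# ordinary function of theta and can be minimised by any standard routine.
grid = np.linspace(-1.0, 4.0, 101)
best = grid[np.argmin([cv2(w(t)) for t in grid])]
print(best)   # the minimiser lies at the target mean m = 2.0, where CV^2 = 0
```

Because the same ε_i are reused for every θ, the objective is a smooth deterministic function, which is what makes the M-estimation analogy work.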

Under appropriate conditions, the sequences of estimators θ_{ℓ,CSD} and θ_{ℓ,KLD} of these criteria converge, as the number of iterations tends to infinity, to θ*_CSD and θ*_KLD, which minimise the criteria θ ↦ d_χ²(p ∥ q_θ) and θ ↦ d_KL(p ∥ q_θ), respectively; these theoretical issues are considered in a companion paper.

The second class of approaches considered in this paper is used for minimising the KLD (2.14) and is inspired by the cross-entropy method. This algorithm approximates the minimum KLD of (2.14) by a sequence of pairs of steps, where each step of each pair addresses a simpler optimisation problem. Compared to the previous method, this algorithm is derivative-free and does not require the selection of a step size. It is in general simpler to implement and avoids most of the common pitfalls of stochastic approximation. Denote by θ_0 an initial value. We define recursively the sequence {θ_ℓ}_{ℓ≥0} as follows. In a first step, we draw a sample {ξ_i^ℓ}_{i=1}^{M_ℓ} from q_{θ_ℓ} and evaluate the function

$$Q_\ell(\theta, \theta_\ell) \triangleq \sum_{i=1}^{M_\ell} W_{\theta_\ell}(\xi_i^\ell) \log q_\theta(\xi_i^\ell). \tag{2.15}$$

In a second step, we choose θ_{ℓ+1} to be the (or any, if there are several) value of θ that maximises Q_ℓ(θ, θ_ℓ). As above, the number of particles M_ℓ is increased during the successive iterations. This procedure resembles closely the


Monte Carlo EM (Wei and Tanner 1991) for maximum likelihood in incomplete data models. The advantage of this approach is that the solution of the maximisation problem θ_{ℓ+1} = argmax_θ Q_ℓ(θ, θ_ℓ) is often in closed form. In particular, this happens if the distribution q_θ belongs to an exponential family (EF) or is a mixture of distributions of NEF; see Cappé et al. (2008) for a discussion. The convergence of this algorithm can be established along the same lines as the convergence of the MCEM algorithm; see Fort and Moulines (2003). As the number of iterations ℓ increases, the sequence of estimators may be shown to converge to θ*_KLD. These theoretical results are established in a companion paper.
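For a Gaussian family the maximiser of (2.15) is indeed in closed form: the weighted mean and weighted standard deviation of the sampled particles. The sketch below iterates the two steps on a toy problem; the target, initialisation, and M_ℓ schedule are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):
    """Hypothetical unnormalised log target: N(3, 0.5^2) up to a constant."""
    return -0.5 * ((x - 3.0) / 0.5) ** 2

mu, sigma = 0.0, 2.0                       # initial proposal parameters theta_0
for ell in range(20):
    M = 200 + 100 * ell                    # grow M_ell over the iterations
    xi = mu + sigma * rng.standard_normal(M)          # step 1: sample from q_theta_ell
    log_w = log_p(xi) - (-0.5 * ((xi - mu) / sigma) ** 2 - np.log(sigma))
    w = np.exp(log_w - log_w.max())                   # stabilised weights W_theta_ell
    # Step 2: closed-form maximiser of Q(theta, theta_ell) for a Gaussian family,
    # i.e. the weighted mean and weighted standard deviation of the particles.
    mu = np.sum(w * xi) / np.sum(w)
    sigma = np.sqrt(np.sum(w * (xi - mu) ** 2) / np.sum(w))
print(mu, sigma)                           # should approach (3.0, 0.5)
```

No step size or derivative appears anywhere, which is precisely the practical appeal of the CE iteration noted above.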

    2.2 Sequential Monte Carlo methods

In the sequential context, where the problem consists of simulating from a sequence {p_k} of probability density functions, the situation is more difficult. Let Ξ_k denote the state space of distribution p_k and note that this space may vary with k, e.g. in terms of increasing dimensionality. In many applications, these densities are related to each other by a (possibly random) mapping, i.e. p_k = Ψ_{k−1}(p_{k−1}). In the sequel we focus on the case where there exists a nonnegative function l_{k−1} : (ξ, ξ̃) ↦ l_{k−1}(ξ, ξ̃) such that

$$p_k(\tilde{\xi}) = \frac{\int l_{k-1}(\xi, \tilde{\xi})\, p_{k-1}(\xi)\,d\xi}{\iint l_{k-1}(\xi, \tilde{\xi})\, p_{k-1}(\xi)\,d\xi\, d\tilde{\xi}}. \tag{2.16}$$

As an example, consider the following generic nonlinear dynamic system described in state space form:

State (system) model:
$$X_k = a(X_{k-1}, U_k) \quad \text{(transition density } q(X_{k-1}, X_k)\text{)}, \quad (2.17)$$

Observation (measurement) model:
$$Y_k = b(X_k, V_k) \quad \text{(observation density } g(X_k, Y_k)\text{)}. \quad (2.18)$$

By these equations we mean that each hidden state X_k and each observation Y_k is assumed to be generated by nonlinear functions a(·) and b(·), respectively, of the state and observation noises U_k and V_k. The state and observation noises {U_k}_{k≥0} and {V_k}_{k≥0} are assumed to be mutually independent sequences of i.i.d. random variables. The precise form of the functions and the assumed probability distributions of the state and observation noises U_k and V_k imply, via a change of variables, the transition probability density function q(x_{k-1}, x_k) and the observation probability density function g(x_k, y_k), the latter being referred to as the likelihood of the observation. With these definitions, the process {X_k}_{k≥0} is Markovian, i.e. the conditional probability density of X_k given the past states X_{0:k-1} ≜ (X_0, …, X_{k-1}) depends exclusively on X_{k-1}. This distribution is described by the density q(x_{k-1}, x_k). In addition, the conditional probability density of Y_k given the states X_{0:k} and the past observations Y_{0:k-1} depends exclusively on X_k, and this distribution is captured by the likelihood g(x_k, y_k). We assume further that the initial state X_0 is distributed according to a density function ν_0(x_0). Such nonlinear dynamic systems arise frequently in many areas of science and engineering, such as target tracking, computer vision, terrain-referenced navigation, finance, pollution monitoring, communications, and audio engineering, to list only a few.
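For concreteness, a trajectory of the model (2.17)–(2.18) can be simulated directly from the functional form; the linear choices of a and b and the noise scales below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_ssm(a, b, x0, T, rng):
    """Simulate X_k = a(X_{k-1}, U_k), Y_k = b(X_k, V_k) for k = 0..T."""
    xs = [x0]
    ys = [b(x0, rng.standard_normal())]
    for _ in range(T):
        x = a(xs[-1], rng.standard_normal())    # state transition (2.17)
        xs.append(x)
        ys.append(b(x, rng.standard_normal()))  # observation (2.18)
    return np.array(xs), np.array(ys)

rng = np.random.default_rng(0)
a = lambda x, u: 0.9 * x + 0.5 * u   # illustrative linear-Gaussian choices
b = lambda x, v: x + 0.2 * v
xs, ys = simulate_ssm(a, b, 0.0, 100, rng)
```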

Statistical inference for the general nonlinear dynamic system above involves computing the posterior distribution of a collection of state variables X_{s:s′} ≜ (X_s, …, X_{s′}) conditioned on a batch Y_{0:k} = (Y_0, …, Y_k) of observations. We denote this posterior distribution by φ_{s:s′|k}(X_{s:s′} | Y_{0:k}). Specific problems include filtering, corresponding to s = s′ = k, fixed-lag smoothing, where s = s′ = k − L, and fixed-interval smoothing, with s = 0 and s′ = k. Despite the apparent simplicity of the above problem, the posterior distributions can be computed in closed form only in very specific cases, principally the linear Gaussian model (where the functions a(·) and b(·) are linear and the state and observation noises {U_k}_{k≥0} and {V_k}_{k≥0} are Gaussian) and the discrete hidden Markov model (where X_k takes its values in a finite alphabet). In the vast majority of cases, nonlinearity or non-Gaussianity render analytic solutions intractable; see Anderson and Moore (1979), Kailath et al. (2000), Ristic et al. (2004), Cappé et al. (2005).

Starting with the initial, or prior, density function ν_0(x_0) and observations Y_{0:k} = y_{0:k}, the posterior density φ_{k|k}(x_k | y_{0:k}) can be obtained using the following prediction–correction recursion (Ho and Lee 1964):

Prediction:
$$\phi_{k|k-1}(x_k \mid y_{0:k-1}) = \int \phi_{k-1|k-1}(x_{k-1} \mid y_{0:k-1})\, q(x_{k-1}, x_k)\, dx_{k-1}, \quad (2.19)$$

Correction:
$$\phi_{k|k}(x_k \mid y_{0:k}) = \frac{g(x_k, y_k)\, \phi_{k|k-1}(x_k \mid y_{0:k-1})}{\phi_{k|k-1}(y_k \mid y_{0:k-1})}, \quad (2.20)$$

where φ_{k|k-1}(y_k | y_{0:k-1}) is the predictive distribution of Y_k given the past observations Y_{0:k-1}. For a fixed data realisation, this term is a normalising constant (independent of the state) and thus need not be computed in standard implementations of SMC methods.

By setting p_k = φ_{k|k}, p_{k-1} = φ_{k-1|k-1}, and

$$l_{k-1}(x, x') = g(x', y_k)\, q(x, x'),$$


we conclude that the sequence {φ_{k|k}}_{k≥1} of filtering densities can be generated according to (2.16).

The case of fixed-interval smoothing works entirely analogously: indeed, since

$$\phi_{0:k|k-1}(x_{0:k} \mid y_{0:k-1}) = \phi_{0:k-1|k-1}(x_{0:k-1} \mid y_{0:k-1})\, q(x_{k-1}, x_k)$$

and

$$\phi_{0:k|k}(x_{0:k} \mid y_{0:k}) = \frac{g(x_k, y_k)\, \phi_{0:k|k-1}(x_{0:k} \mid y_{0:k-1})}{\phi_{k|k-1}(y_k \mid y_{0:k-1})},$$

the flow {φ_{0:k|k}}_{k≥1} of smoothing distributions can be generated according to (2.16) by letting p_k = φ_{0:k|k}, p_{k-1} = φ_{0:k-1|k-1}, and replacing l_{k-1}(x_{0:k-1}, x′_{0:k}) dx′_{0:k} by g(x′_k, y_k) q(x_{k-1}, x′_k) dx′_k δ_{x_{0:k-1}}(dx′_{0:k-1}), where δ_a denotes the Dirac mass located at a. Note that this replacement is done formally, since the unnormalised transition kernel in question lacks a density in the smoothing mode; this is due to the fact that the Dirac measure is singular with respect to the Lebesgue measure. This is, however, handled by the measure-theoretic approach in Sect. 4, implying that all theoretical results presented in the following comprise also fixed-interval smoothing.

We now adapt the procedures considered in the previous section to the sampling of densities generated according to (2.16). Here we focus on a single time step and drop the dependence on k from the notation, since it is irrelevant at this stage. Moreover, set p_k = μ, p_{k-1} = ν, l_{k-1} = l, and assume that we have at hand a weighted sample {(ξ_{N,i}, ω_{N,i})}_{i=1}^N targeting ν, i.e., for any ν-integrable function f, Ω_N^{-1} Σ_{i=1}^N ω_{N,i} f(ξ_{N,i}), with Ω_N ≜ Σ_{i=1}^N ω_{N,i}, approximates the corresponding integral ∫ f(ξ) ν(ξ) dξ. A natural strategy for sampling from μ is to replace ν in (2.16) by its particle approximation, yielding

$$\mu_N(\xi') \triangleq \sum_{i=1}^N \frac{\omega_{N,i} \int l(\xi_{N,i}, \tilde\xi)\, d\tilde\xi}{\sum_{j=1}^N \omega_{N,j} \int l(\xi_{N,j}, \tilde\xi)\, d\tilde\xi}\; \frac{l(\xi_{N,i}, \xi')}{\int l(\xi_{N,i}, \tilde\xi)\, d\tilde\xi}$$

as an approximation of μ, and simulate M_N new particles from this distribution; however, in many applications direct simulation from μ_N is infeasible without the application of computationally expensive auxiliary accept–reject techniques introduced by Hürzeler and Künsch (1998) and thoroughly analysed by Künsch (2005). This difficulty can be overcome by simulating new particles {ξ̃_{N,i}}_{i=1}^{M_N} from the instrumental mixture distribution with density

$$\pi_N(\xi') \triangleq \sum_{i=1}^N \frac{\omega_{N,i}\, \psi_{N,i}}{\sum_{j=1}^N \omega_{N,j}\, \psi_{N,j}}\, r(\xi_{N,i}, \xi'),$$

where {ψ_{N,i}}_{i=1}^N are the so-called adjustment multiplier weights and r is a Markovian transition density function, i.e., r(ξ, ·) is a nonnegative function and, for any ξ, ∫ r(ξ, ξ′) dξ′ = 1. If one can guess, based on the new observation, which particles are most likely to contribute significantly to the posterior, the resampling stage may be anticipated by increasing (or decreasing) the importance weights. This is the purpose of the multiplier weights ψ_{N,i}. We associate these particles with the importance weights {μ_N(ξ̃_{N,i})/π_N(ξ̃_{N,i})}_{i=1}^{M_N}. In this setting, a new particle position is simulated from the transition proposal density r(ξ_{N,i}, ·) with probability proportional to ω_{N,i} ψ_{N,i}. Unfortunately, the importance weight μ_N(ξ̃_{N,i})/π_N(ξ̃_{N,i}) is expensive to evaluate, since this involves summing over N terms.

We thus introduce, as suggested by Pitt and Shephard (1999), an auxiliary variable corresponding to the selected particle, and target instead the probability density

$$\mu_N^{\mathrm{aux}}(i, \xi') \triangleq \frac{\omega_{N,i} \int l(\xi_{N,i}, \tilde\xi)\, d\tilde\xi}{\sum_{j=1}^N \omega_{N,j} \int l(\xi_{N,j}, \tilde\xi)\, d\tilde\xi}\; \frac{l(\xi_{N,i}, \xi')}{\int l(\xi_{N,i}, \tilde\xi)\, d\tilde\xi} \quad (2.21)$$

on the product space {1, …, N} × Ξ̃. Since μ_N is the marginal distribution of μ_N^aux with respect to the particle index i, we may sample from μ_N by simulating instead a set {(I_{N,i}, ξ̃_{N,i})}_{i=1}^{M_N} of indices and particle positions from the instrumental distribution

$$\pi_N^{\mathrm{aux}}(i, \xi') \triangleq \frac{\omega_{N,i}\, \psi_{N,i}}{\sum_{j=1}^N \omega_{N,j}\, \psi_{N,j}}\, r(\xi_{N,i}, \xi') \quad (2.22)$$

and assigning each draw (I_{N,i}, ξ̃_{N,i}) the weight

$$\tilde\omega_{N,i} \triangleq \frac{1}{\psi_{N,I_{N,i}}}\, \frac{l(\xi_{N,I_{N,i}}, \tilde\xi_{N,i})}{r(\xi_{N,I_{N,i}}, \tilde\xi_{N,i})} \propto \frac{\mu_N^{\mathrm{aux}}(I_{N,i}, \tilde\xi_{N,i})}{\pi_N^{\mathrm{aux}}(I_{N,i}, \tilde\xi_{N,i})}. \quad (2.23)$$

Hereafter, we discard the indices and let {(ξ̃_{N,i}, ω̃_{N,i})}_{i=1}^{M_N} approximate the target density μ. Note that setting, for all i ∈ {1, …, N}, ψ_{N,i} ≡ 1 yields the standard bootstrap particle filter presented by Gordon et al. (1993). In the sequel, we assume that each adjustment multiplier weight ψ_{N,i} is a function of the particle position, ψ_{N,i} = ψ(ξ_{N,i}), i ∈ {1, …, N}, and define

$$\Phi(\xi, \xi') \triangleq \psi^{-1}(\xi)\, \frac{l(\xi, \xi')}{r(\xi, \xi')}, \quad (2.24)$$

so that μ_N^aux(i, ξ′)/π_N^aux(i, ξ′) is proportional to Φ(ξ_{N,i}, ξ′). We will refer to the function Φ as the adjustment multiplier function.
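In code, one pass of the sampling scheme (2.22)–(2.23) amounts to a multinomial draw of indices, a proposal from r, and a reweighting; the Gaussian transition, likelihood and proposal below are illustrative assumptions (setting psi ≡ 1 and using the prior kernel as r recovers the bootstrap filter).

```python
import numpy as np

def apf_step(xi, omega, psi, r_sample, r_pdf, l_pdf, M, rng):
    """One auxiliary-particle-filter step following (2.22)-(2.23)."""
    probs = omega * psi
    probs = probs / probs.sum()
    idx = rng.choice(len(xi), size=M, p=probs)     # indices I_{N,i}
    xi_new = r_sample(xi[idx], rng)                # proposals from r(xi_I, .)
    # importance weights psi^{-1} * l / r of (2.23)
    w_new = l_pdf(xi[idx], xi_new) / (psi[idx] * r_pdf(xi[idx], xi_new))
    return xi_new, w_new

rng = np.random.default_rng(1)
norm = lambda x, m, s: np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
y = 0.5                                            # current observation
xi = rng.standard_normal(500)                      # previous particle cloud
omega = np.ones(500)
l_pdf = lambda x, xn: norm(xn, 0.9 * x, 0.5) * norm(y, xn, 0.2)
r_pdf = lambda x, xn: norm(xn, 0.9 * x, 0.5)       # prior kernel as proposal
r_sample = lambda x, g: 0.9 * x + 0.5 * g.standard_normal(x.shape)
xi_new, w_new = apf_step(xi, omega, np.ones(500), r_sample, r_pdf, l_pdf, 1000, rng)
```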


2.3 Risk minimisation for sequential adaptive importance sampling and resampling

We may expect that the efficiency of the algorithm described above depends highly on the choice of adjustment multiplier weights and proposal kernel.

In the context of state space models, Pitt and Shephard (1999) suggested to use as adjustment multiplier weights an approximation of the predictive likelihood, defined as the likelihood evaluated at the mean of the prior transition, i.e. ψ_{N,i} ≜ g(∫ x q(ξ_{N,i}, x) dx, y_k), where y_k is the current observation. Although this choice of weight outperforms the conventional bootstrap filter in many applications, as pointed out in Andrieu et al. (2003), this approximation of the predictive likelihood can be very poor, and lead to performance even worse than that of the conventional approach, if the dynamic model q(x_{k-1}, x_k) is quite scattered and the likelihood g(x_k, y_k) varies significantly over the prior q(x_{k-1}, x_k).

The optimisation of the adjustment multiplier weight was also studied by Douc et al. (2008) (see also Olsson et al. 2007), who identified adjustment multiplier weights for which the increase of asymptotic variance at a single iteration of the algorithm is minimal. Note, however, that this optimisation is done using a function-specific criterion, whereas we advocate here the use of function-free criteria.

In our risk-minimisation setting, this means that both the adjustment weights and the proposal kernels need to be adapted. As we will see below, these two problems are in general intertwined; however, in the following it will become clear that the two criteria, CSD and KLD, behave differently at this point. Because the criteria are rather involved, it is interesting to study their behaviour as the number of particles N grows to infinity. This is done in Theorem 1, which shows that the CSD d_{χ²}(μ_N^aux ‖ π_N^aux) and the KLD d_{KL}(μ_N^aux ‖ π_N^aux) converge to d_{χ²}(μ* ‖ π*) and d_{KL}(μ* ‖ π*), respectively, where

$$\mu^*(\xi, \xi') \triangleq \frac{\nu(\xi)\, l(\xi, \xi')}{\iint \nu(\xi)\, l(\xi, \xi')\, d\xi\, d\xi'}, \qquad \pi^*(\xi, \xi') \triangleq \frac{\nu(\xi)\, \psi(\xi)\, r(\xi, \xi')}{\iint \nu(\xi)\, \psi(\xi)\, r(\xi, \xi')\, d\xi\, d\xi'}. \quad (2.25)$$

The expressions (2.25) of the limiting distributions then allow for deriving the adjustment multiplier weight function ψ and the proposal density r minimising the corresponding discrepancy measures. In absence of constraints (when ψ and r can be chosen arbitrarily), the optimal solution for both the CSD and the KLD consists of setting ψ = ψ* and r = r*, where

$$\psi^*(\xi) \triangleq \int l(\xi, \xi')\, d\xi' = \int \frac{l(\xi, \xi')}{r(\xi, \xi')}\, r(\xi, \xi')\, d\xi', \quad (2.26)$$

$$r^*(\xi, \xi') \triangleq l(\xi, \xi')/\psi^*(\xi). \quad (2.27)$$

This choice coincides with the so-called optimal sampling strategy proposed by Hürzeler and Künsch (1998) and developed further by Künsch (2005), which thus turns out to be optimal (in absence of constraints) also in our risk-minimisation setting.

Remark 1 The limiting distributions μ* and π* have nice interpretations within the framework of state space models (see the previous section). In this setting, the limiting distribution μ* at time k is the joint distribution φ_{k:k+1|k+1} of the filtered couple X_{k:k+1}, that is, the distribution of X_{k:k+1} conditionally on the observation record Y_{0:k+1}; this can be seen as the asymptotic target distribution of our particle model. Moreover, the limiting distribution π* at time k is only slightly more intricate: its first marginal corresponds to the filtering distribution at time k reweighted by the adjustment function ψ, which is typically used for incorporating information from the new observation Y_{k+1}. The second marginal of π* is then obtained by propagating this weighted filtering distribution through the Markovian dynamics of the proposal kernel R; thus, π* describes completely the asymptotic instrumental distribution of the APF, and the two quantities d_{KL}(μ* ‖ π*) and d_{χ²}(μ* ‖ π*) reflect the asymptotic discrepancy between the true model and the particle model at the given time step.

In presence of constraints on the choice of ψ and r, the optimisation of the adjustment weight function and the proposal kernel density is intertwined. By the so-called chain rule for entropy (see Cover and Thomas 1991, Theorem 2.2.1), we have

$$d_{\mathrm{KL}}(\mu^* \,\|\, \pi^*) = \int \frac{\nu(\xi)\,\psi^*(\xi)}{\nu(\psi^*)}\,\log\frac{\psi^*(\xi)/\nu(\psi^*)}{\psi(\xi)/\nu(\psi)}\,d\xi + \int \frac{\nu(\xi)\,\psi^*(\xi)}{\nu(\psi^*)}\, d_{\mathrm{KL}}\big(r^*(\xi, \cdot)\,\|\,r(\xi, \cdot)\big)\,d\xi,$$

where ν(f) ≜ ∫ ν(ξ) f(ξ) dξ. Hence, if the adjustment function can be chosen freely, then, whatever the choice of the proposal kernel is, the best choice is still ψ*_{KL,r} = ψ*: the best that we can do is to choose ψ*_{KL,r} such that the two marginal distributions ∫ μ*(·, ξ′) dξ′ and ∫ π*(·, ξ′) dξ′ are identical. If the choices of the weight adjustment function and the proposal kernel are constrained (if, e.g., the weight should be chosen in a pre-specified family of functions or the proposal kernel belongs to a parametric family), the optimisations of ψ and r nevertheless decouple asymptotically for the KLD. The optimisation for the CSD does not lead to such a nice decoupling of the adjustment function and the proposal transition; nevertheless, an explicit expression for the adjustment multiplier weights can still be found in this case:

$$\psi^*_{\chi^2,r}(\xi) \triangleq \left(\int \frac{l^2(\xi, \xi')}{r(\xi, \xi')}\, d\xi'\right)^{1/2}$$


$$= \left(\int \frac{l^2(\xi, \xi')}{r^2(\xi, \xi')}\, r(\xi, \xi')\, d\xi'\right)^{1/2}. \quad (2.28)$$

Compared to (2.26), the optimal adjustment function for the CSD is the L²-norm (rather than the L¹-norm) of the ratio l(ξ, ·)/r(ξ, ·) with respect to r(ξ, ·). Since l(ξ, ξ′) = ψ*(ξ) r*(ξ, ξ′) (see the definitions (2.26) and (2.27)), we obtain, not surprisingly, ψ*_{χ²,r*}(ξ) = ψ*(ξ) if we set r = r*.

Using this risk-minimisation formulation, it is possible to select the adjustment weight function as well as the proposal kernel by minimising either the CSD or the KLD criterion. Of course, compared to the sophisticated adaptation strategies considered for adaptive importance sampling, we focus on elementary schemes, the computational burden quickly becoming a limiting factor in the SMC context.

To simplify the presentation, we consider in the sequel the adaptation of the proposal kernel only; as shown above, it is of course possible, and worthwhile, to jointly optimise the adjustment weights and the proposal kernel, but for clarity we prefer to postpone the presentation of such a technique to future work. The optimisation of the adjustment weight function is in general rather complex: indeed, as mentioned above, the computation of the optimal adjustment weight function requires the computation of an integral. This integral can be evaluated in closed form only for a rather limited number of models; otherwise, a numerical approximation (based on cubature formulae, Monte Carlo methods, etc.) is required, which may incur a quite substantial computational cost. If proper simplifications and approximations are not found (which are, most often, model specific), the gains in efficiency are not necessarily worth the extra cost. In state space (tracking) problems, simple and efficient approximations, based either on the EKF or the UKF (see for example Andrieu et al. 2003 or Shen et al. 2004), have been proposed for several models, but the validity of this sort of approximation cannot necessarily be extended to more general models.

In the light of the discussion above, a natural strategy for adaptive design of π_N^aux is to minimise the empirical estimate E (or CV²) of the KLD (or CSD) over all proposal kernels belonging to some parametric family {r_θ}_{θ∈Θ}. This can be done using straightforward adaptations of the two methods described in Sect. 2.1. We postpone a more precise description of the algorithms and implementation issues to Sect. 4, where more rigorous measure-theoretic notation is introduced and the main theoretical results are stated.

    3 Notation and definitions

To state the results precisely, we will now use measure-theoretic notation. In the following we assume that all random variables are defined on a common probability space (Ω, F, P) and let, for any general state space (Ξ, B(Ξ)), P(Ξ) and B(Ξ) denote the sets of probability measures on (Ξ, B(Ξ)) and of measurable functions from Ξ to R, respectively.

A kernel K from (Ξ, B(Ξ)) to some other state space (Ξ̃, B(Ξ̃)) is called finite if K(ξ, Ξ̃) < ∞ for all ξ ∈ Ξ.


Alternatively, we will sometimes say that the weighted sample in Definition 1 targets the measure ν.

Thus, suppose that we are given a weighted sample {(ξ_{N,i}, ω_{N,i})}_{i=1}^{M_N} targeting ν ∈ P(Ξ). We wish to transform this sample into a new weighted particle sample approximating the probability measure

$$\mu(A) \triangleq \frac{\nu L(A)}{\nu L(\tilde\Xi)} = \frac{\int \nu(d\xi)\, L(\xi, A)}{\int \nu(d\xi)\, L(\xi, \tilde\Xi)} \quad (3.4)$$

on some other state space (Ξ̃, B(Ξ̃)). Here L is a finite transition kernel from (Ξ, B(Ξ)) to (Ξ̃, B(Ξ̃)). As suggested by Pitt and Shephard (1999), we introduce an auxiliary variable corresponding to the selected stratum, and target the measure

$$\mu_N^{\mathrm{aux}}(\{i\} \times A) \triangleq \frac{\omega_{N,i}\, L(\xi_{N,i}, \tilde\Xi)}{\sum_{j=1}^{M_N} \omega_{N,j}\, L(\xi_{N,j}, \tilde\Xi)}\; \frac{L(\xi_{N,i}, A)}{L(\xi_{N,i}, \tilde\Xi)} \quad (3.5)$$

on the product space {1, …, M_N} × Ξ̃. Since μ_N is the marginal distribution of μ_N^aux with respect to the particle position, we may sample from μ_N by simulating instead a set {(I_{N,i}, ξ̃_{N,i})}_{i=1}^{M̃_N} of indices and particle positions from the instrumental distribution

$$\pi_N^{\mathrm{aux}}(\{i\} \times A) \triangleq \frac{\omega_{N,i}\, \psi_{N,i}}{\sum_{j=1}^{M_N} \omega_{N,j}\, \psi_{N,j}}\, R(\xi_{N,i}, A) \quad (3.6)$$

and assigning each draw (I_{N,i}, ξ̃_{N,i}) the weight

$$\tilde\omega_{N,i} \triangleq \psi_{N,I_{N,i}}^{-1}\, \frac{dL(\xi_{N,I_{N,i}}, \cdot)}{dR(\xi_{N,I_{N,i}}, \cdot)}(\tilde\xi_{N,i}),$$

being proportional to dμ_N^aux/dπ_N^aux(I_{N,i}, ξ̃_{N,i}); the formal difference with (2.23) lies only in the use of Radon–Nikodym derivatives of the two kernels rather than densities with respect to Lebesgue measure. Hereafter, we discard the indices and take {(ξ̃_{N,i}, ω̃_{N,i})}_{i=1}^{M̃_N} as an approximation of μ. The algorithm is summarised below.

Algorithm 1 Nonadaptive APF

Require: {(ξ_{N,i}, ω_{N,i})}_{i=1}^{M_N} targets ν.
1: draw {I_{N,i}}_{i=1}^{M̃_N} ∼ M(M̃_N, {ω_{N,j} ψ_{N,j} / Σ_{ℓ=1}^{M_N} ω_{N,ℓ} ψ_{N,ℓ}}_{j=1}^{M_N}),
2: simulate {ξ̃_{N,i}}_{i=1}^{M̃_N} ∼ ⊗_{i=1}^{M̃_N} R(ξ_{N,I_{N,i}}, ·),
3: set, for all i ∈ {1, …, M̃_N},
   ω̃_{N,i} ≜ ψ_{N,I_{N,i}}^{-1} dL(ξ_{N,I_{N,i}}, ·)/dR(ξ_{N,I_{N,i}}, ·)(ξ̃_{N,i}),
4: take {(ξ̃_{N,i}, ω̃_{N,i})}_{i=1}^{M̃_N} as an approximation of μ.

    4 Theoretical results

    Consider the following assumptions.

(A1) The initial sample {(ξ_{N,i}, ω_{N,i})}_{i=1}^{M_N} is consistent for (ν, C).

(A2) There exists a function ψ : Ξ → R⁺ such that ψ_{N,i} = ψ(ξ_{N,i}); moreover, ψ ∈ C ∩ L¹(Ξ, ν) and L(·, Ξ̃) ∈ C.

Under these assumptions we define for (ξ, ξ̃) ∈ Ξ × Ξ̃ the weight function

$$\Phi(\xi, \tilde\xi) \triangleq \psi^{-1}(\xi)\, \frac{dL(\xi, \cdot)}{dR(\xi, \cdot)}(\tilde\xi), \quad (4.1)$$

so that, for every index i, ω̃_{N,i} = Φ(ξ_{N,I_{N,i}}, ξ̃_{N,i}). The following result describes how the consistency property is passed through one step of the APF algorithm. A somewhat less general version of this result was also proved in Douc et al. (2008, Theorem 3.1).

Proposition 1 Assume (A1, A2). Then the weighted sample {(ξ̃_{N,i}, ω̃_{N,i})}_{i=1}^{M̃_N} is consistent for (μ, C̃), where C̃ ≜ {f ∈ L¹(Ξ̃, μ) : L(·, |f|) ∈ C}.

The result above is a direct consequence of Lemma 2 and the fact that the set C̃ is proper.

Let μ and π be two probability measures in P(Ξ̃) such that μ is absolutely continuous with respect to π. We then recall that the KLD and the CSD are, respectively, given by

$$d_{\mathrm{KL}}(\mu \,\|\, \pi) \triangleq \int \log\left[\frac{d\mu}{d\pi}(\tilde\xi)\right] \mu(d\tilde\xi), \qquad d_{\chi^2}(\mu \,\|\, \pi) \triangleq \int \left[\frac{d\mu}{d\pi}(\tilde\xi) - 1\right]^2 \pi(d\tilde\xi).$$

Define the two probability measures on the product space (Ξ × Ξ̃, B(Ξ) ⊗ B(Ξ̃)):

$$\mu^*(A) \triangleq \frac{\nu \otimes L(A)}{\nu L(\tilde\Xi)} = \frac{\iint \nu(d\xi)\, L(\xi, d\tilde\xi)\, \mathbb{1}_A(\xi, \tilde\xi)}{\iint \nu(d\xi)\, L(\xi, d\tilde\xi)}, \quad (4.2)$$

$$\pi^*(A) \triangleq \frac{\nu[\psi] \otimes R(A)}{\nu(\psi)} = \frac{\iint \nu(d\xi)\, \psi(\xi)\, R(\xi, d\tilde\xi)\, \mathbb{1}_A(\xi, \tilde\xi)}{\iint \nu(d\xi)\, \psi(\xi)\, R(\xi, d\tilde\xi)}, \quad (4.3)$$

where A ∈ B(Ξ) ⊗ B(Ξ̃) and the outer product of a measure and a kernel is defined in (3.2).

Theorem 1 Assume (A1, A2). Then the following holds as N → ∞.


(i) If L(·, |log Φ(·, ·)|) ∈ C ∩ L¹(Ξ, ν), then

$$d_{\mathrm{KL}}(\mu_N^{\mathrm{aux}} \,\|\, \pi_N^{\mathrm{aux}}) \xrightarrow{P} d_{\mathrm{KL}}(\mu^* \,\|\, \pi^*), \quad (4.4)$$

(ii) If L(·, Φ(·, ·)) ∈ C, then

$$d_{\chi^2}(\mu_N^{\mathrm{aux}} \,\|\, \pi_N^{\mathrm{aux}}) \xrightarrow{P} d_{\chi^2}(\mu^* \,\|\, \pi^*). \quad (4.5)$$

Additionally, E and CV², defined in (2.8) and (2.4) respectively, converge to the same limits.

Theorem 2 Assume (A1, A2). Then the following holds as N → ∞.

(i) If L(·, |log Φ(·, ·)|) ∈ C ∩ L¹(Ξ, ν), then

$$E(\{\tilde\omega_{N,i}\}_{i=1}^{\tilde M_N}) \xrightarrow{P} d_{\mathrm{KL}}(\mu^* \,\|\, \pi^*). \quad (4.6)$$

(ii) If L(·, Φ(·, ·)) ∈ C, then

$$\mathrm{CV}^2(\{\tilde\omega_{N,i}\}_{i=1}^{\tilde M_N}) \xrightarrow{P} d_{\chi^2}(\mu^* \,\|\, \pi^*). \quad (4.7)$$

Next, it is shown that the adjustment weight function can be chosen so as to minimise the right-hand sides of (4.4) and (4.5).

Proposition 2 Assume (A1, A2). Then the following holds.

(i) If L(·, |log Φ(·, ·)|) ∈ C ∩ L¹(Ξ, ν), then arg min_ψ d_{KL}(μ* ‖ π*) = ψ*_{KL,R}, where ψ*_{KL,R}(ξ) ≜ L(ξ, Ξ̃).

(ii) If L(·, Φ(·, ·)) ∈ C, then arg min_ψ d_{χ²}(μ* ‖ π*) = ψ*_{χ²,R}, where

$$\psi^*_{\chi^2,R}(\xi) \triangleq \left(\int \frac{dL(\xi, \cdot)}{dR(\xi, \cdot)}(\tilde\xi)\, L(\xi, d\tilde\xi)\right)^{1/2}.$$

It is worthwhile to notice that the optimal adjustment weights for the KLD do not depend on the proposal kernel R. The minimal value of the limiting KLD, obtained for ψ = ψ*_{KL,R}, is the conditional relative entropy between μ* and π*.

In both cases, letting R(ξ, A) = L(ξ, A)/L(ξ, Ξ̃) yields, as we may expect, the optimal adjustment multiplier weight function ψ*_{KL,R}(ξ) = ψ*_{χ²,R}(ξ) = L(ξ, Ξ̃), resulting in uniform importance weights ω̃_{N,i} ≡ 1.

It is possible to relate the asymptotic CSD (4.5) between μ_N^aux and π_N^aux to the asymptotic variance of the estimator Ω̃_N^{-1} Σ_{i=1}^{M̃_N} ω̃_{N,i} f(ξ̃_{N,i}) of an expectation μ(f) for a given integrable target function f. More specifically, suppose that M_N/M̃_N → ℓ ∈ [0, ∞] as N → ∞ and that the initial sample {(ξ_{N,i}, ω_{N,i})}_{i=1}^{M_N} satisfies, for all f belonging to a given class A of functions, the central limit theorem

$$a_N \Omega_N^{-1} \sum_{i=1}^{M_N} \omega_{N,i}\, [f(\xi_{N,i}) - \nu(f)] \xrightarrow{\mathcal{D}} N[0, \sigma^2(f)], \quad (4.8)$$

where the sequence {a_N}_N is such that a_N M̃_N^{-1/2} → β̃ ∈ [0, ∞) as N → ∞ and σ : A → R⁺ is a functional. Then the sample

{(ξ̃_{N,i}, ω̃_{N,i})}_{i=1}^{M̃_N} produced in Algorithm 1 is, as shown in Douc et al. (2008, Theorem 3.2), asymptotically normal for a class à of functions in the sense that, for all f ∈ Ã,

$$a_N \tilde\Omega_N^{-1} \sum_{i=1}^{\tilde M_N} \tilde\omega_{N,i}\, [f(\tilde\xi_{N,i}) - \mu(f)] \xrightarrow{\mathcal{D}} N\{0, \tilde\sigma^2[\psi](f)\},$$

where

$$\tilde\sigma^2[\psi](f) = \sigma^2\{L[\cdot, f - \mu(f)]\}/[\nu L(\tilde\Xi)]^2 + \tilde\beta^2\, \nu(\psi)\, \nu[\psi] \otimes R\{\cdot, \Phi^2 [f - \mu(f)]^2\}/[\nu L(\tilde\Xi)]^2$$

and, recalling the definition (3.3) of a modulated measure,

$$\nu(\psi)\, \nu[\psi] \otimes R\{\cdot, \Phi^2 [f - \mu(f)]^2\}/[\nu L(\tilde\Xi)]^2$$
$$= \mu^2(|f|)\, d_{\chi^2}\{\mu^*[|f|]/\mu^*(|f|) \,\|\, \pi^*\} - 2\mu(f)\,\mu^*(f_+^{1/2})\, d_{\chi^2}\{\mu^*[f_+^{1/2}]/\mu^*(f_+^{1/2}) \,\|\, \pi^*\}$$
$$\quad + 2\mu(f)\,\mu^*(f_-^{1/2})\, d_{\chi^2}\{\mu^*[f_-^{1/2}]/\mu^*(f_-^{1/2}) \,\|\, \pi^*\} + \mu^2(f)\, d_{\chi^2}(\mu^* \,\|\, \pi^*) + \mu^2(|f|) - \mu^2(f). \quad (4.9)$$

Here f_+ ≜ max(f, 0) and f_− ≜ max(−f, 0) denote the positive and negative parts of f, respectively, and μ*(|f|) refers to the expectation of the extended function |f| : (ξ, ξ̃) ↦ |f(ξ̃)| ∈ R⁺ under μ* (and similarly for μ*(f_±^{1/2})). From (4.9) we deduce that decreasing d_{χ²}(μ* ‖ π*) implies a decrease of asymptotic variance if the discrepancy between π* and the modulated measure μ*[|f|]/μ*(|f|) is not too large, that is, if we deal with a target function f with a regular behaviour on the support of μ*.

    5 Adaptive importance sampling

5.1 APF adaptation by minimisation of estimated KLD and CSD over a parametric family

Assume that there exists a random noise variable ε, having distribution λ on some measurable space (Λ, B(Λ)), and a family {F_θ}_{θ∈Θ} of mappings from Ξ × Λ to Ξ̃ such that we are able to simulate ξ̃ ∼ R_θ(ξ, ·), for θ ∈ Θ, by simulating ε ∼ λ and letting ξ̃ = F_θ(ξ, ε). We denote by Φ_θ the importance weight function associated with R_θ (see (4.1)) and set Φ_θ^F(ξ, ε) ≜ Φ_θ(ξ, F_θ(ξ, ε)).

Assume that (A1) holds and suppose that we have simulated, as in the first step of Algorithm 1, indices {I_{N,i}}_{i=1}^{M̃_N} and noise variables {ε_{N,i}}_{i=1}^{M̃_N} ∼ λ^{⊗M̃_N}. Now, keeping these indices and noise variables fixed, we can form an idea of how the KLD varies with θ via the mapping θ ↦ E({Φ_θ^F(ξ_{N,I_{N,i}}, ε_{N,i})}_{i=1}^{M̃_N}). Similarly, the CSD can be studied by using CV² instead of E. This suggests an algorithm in which the particles are reproposed using R_{θ*_N}, with

$$\theta^*_N = \arg\min_{\theta \in \Theta} E(\{\Phi_\theta^F(\xi_{N,I_{N,i}}, \epsilon_{N,i})\}_{i=1}^{\tilde M_N}).$$

This procedure is summarised in Algorithm 2, and its modification for minimisation of the empirical CSD is straightforward.

Algorithm 2 Adaptive APF

Require: (A1).
1: draw {I_{N,i}}_{i=1}^{M̃_N} ∼ M(M̃_N, {ω_{N,j} ψ_{N,j} / Σ_{ℓ=1}^{M_N} ω_{N,ℓ} ψ_{N,ℓ}}_{j=1}^{M_N}),
2: simulate {ε_{N,i}}_{i=1}^{M̃_N} ∼ λ^{⊗M̃_N},
3: compute θ*_N ← arg min_{θ∈Θ} E({Φ_θ^F(ξ_{N,I_{N,i}}, ε_{N,i})}_{i=1}^{M̃_N}),
4: set ξ̃_{N,i} ← F_{θ*_N}(ξ_{N,I_{N,i}}, ε_{N,i}) for all i ∈ {1, …, M̃_N},
5: update ω̃_{N,i} ← Φ_{θ*_N}(ξ_{N,I_{N,i}}, ξ̃_{N,i}) for all i ∈ {1, …, M̃_N},
6: let {(ξ̃_{N,i}, ω̃_{N,i})}_{i=1}^{M̃_N} approximate μ.
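A minimal sketch of the adaptation step, under two assumptions not fixed by this excerpt: the criterion E is taken to be the common empirical-KLD estimate Σᵢ ω̄ᵢ log(M ω̄ᵢ) with self-normalised weights (the paper's definition (2.8) is not reproduced here), and Θ indexes a scalar scale family F_θ(ξ, ε) = 0.9ξ + θε. Since the noise variables are drawn once and frozen, the criterion is a deterministic function of θ and can be minimised by a simple grid search.

```python
import numpy as np

def empirical_kld(w):
    """Assumed form of the estimate E: sum_i wbar_i * log(M * wbar_i) >= 0."""
    wbar = w / w.sum()
    return float(np.sum(wbar * np.log(len(w) * wbar)))

def adapt_scale(xi_sel, eps, l_pdf, r_pdf, f, thetas):
    """Grid-search version of step 3 of Algorithm 2 over candidate scales."""
    best_crit, best_theta = np.inf, None
    for th in thetas:
        xi_new = f(xi_sel, eps, th)                 # xi~ = F_theta(xi, eps)
        w = l_pdf(xi_sel, xi_new) / r_pdf(xi_sel, xi_new, th)
        crit = empirical_kld(w)
        if crit < best_crit:
            best_crit, best_theta = crit, th
    return best_theta

rng = np.random.default_rng(2)
norm = lambda x, m, s: np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
xi_sel = rng.standard_normal(1000)                  # resampled ancestors
eps = rng.standard_normal(1000)                     # frozen noise variables
l_pdf = lambda x, xn: norm(xn, 0.9 * x, 0.3)        # target kernel density
r_pdf = lambda x, xn, th: norm(xn, 0.9 * x, th)     # proposal family
f = lambda x, e, th: 0.9 * x + th * e
theta_star = adapt_scale(xi_sel, eps, l_pdf, r_pdf, f, [0.1, 0.3, 0.6])
```

In this toy setup the scale θ = 0.3 reproduces the target kernel exactly, so the weights are constant and the criterion attains its minimum there.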

Remark 2 A slight modification of Algorithm 2, lowering the added computational burden, is to apply the adaptation mechanism only when the estimated KLD (or CSD) is above a chosen threshold.

Remark 3 It is possible to establish a law of large numbers as well as a central limit theorem for the algorithm above, similarly to what has been done for the nonadaptive auxiliary particle filter in Douc et al. (2008) and Olsson et al. (2007).

More specifically, suppose again that (4.8) holds for similar (Ã, σ, {a_N}) and that M_N/M̃_N → ℓ ∈ [0, ∞] as N → ∞. Then the sample {(ξ̃_{N,i}, ω̃_{N,i})}_{i=1}^{M̃_N} produced in Algorithm 2 is asymptotically normal for a class à of functions in the sense that, for all f ∈ Ã,

$$a_N \tilde\Omega_N^{-1} \sum_{i=1}^{\tilde M_N} \tilde\omega_{N,i}\, [f(\tilde\xi_{N,i}) - \mu(f)] \xrightarrow{\mathcal{D}} N[0, \tilde\sigma_{\theta^*}^2(f)],$$

where

$$\tilde\sigma_{\theta^*}^2(f) \triangleq \tilde\beta^2\, \nu(\psi)\, \nu[\psi] \otimes R_{\theta^*}\{\cdot, \Phi_{\theta^*}^2 [f - \mu(f)]^2\}/[\nu L(\tilde\Xi)]^2 + \sigma^2(L\{\cdot, f - \mu(f)\})/[\nu L(\tilde\Xi)]^2$$

and θ* minimises the asymptotic KLD. The complete proof of this result is, however, omitted for brevity.

    5.2 APF adaptation by cross-entropy (CE) methods

Here we construct an algorithm which selects a proposal kernel from a parametric family in a way that minimises the KLD between the instrumental mixture distribution and the target mixture μ_N^aux (defined in (3.5)). Thus, recall that we are given an initial sample {(ξ_{N,i}, ω_{N,i})}_{i=1}^{M_N}; we then use IS to approximate the target auxiliary distribution μ_N^aux by sampling from the instrumental auxiliary distribution

$$\pi_{N,\theta}^{\mathrm{aux}}(\{i\} \times A) \triangleq \frac{\omega_{N,i}\, \psi_{N,i}}{\sum_{j=1}^{M_N} \omega_{N,j}\, \psi_{N,j}}\, R_\theta(\xi_{N,i}, A), \quad (5.1)$$

which is a straightforward modification of (3.6) where R is replaced by R_θ, that is, a Markovian kernel from (Ξ, B(Ξ)) to (Ξ̃, B(Ξ̃)) belonging to the parametric family {R_θ(ξ, ·) : ξ ∈ Ξ, θ ∈ Θ}.

We aim at finding the parameter θ* which realises the minimum of d_{KL}(μ_N^aux ‖ π_{N,θ}^aux) over the parameter space Θ, where

$$d_{\mathrm{KL}}(\mu_N^{\mathrm{aux}} \,\|\, \pi_{N,\theta}^{\mathrm{aux}}) = \sum_{i=1}^{M_N} \int \log\left[\frac{d\mu_N^{\mathrm{aux}}}{d\pi_{N,\theta}^{\mathrm{aux}}}(i, \tilde\xi)\right] \mu_N^{\mathrm{aux}}(\{i\}, d\tilde\xi). \quad (5.2)$$

Since the expectation in (5.2) is intractable in most cases, the key idea is to approximate it iteratively using IS from more and more accurate approximations; this idea has been successfully used in CE methods, see e.g. Rubinstein and Kroese (2004). At iteration ℓ, denote by θ_N^ℓ ∈ Θ the current fit of the parameter. Each iteration of the algorithm is split into two steps. In the first step we sample, following Algorithm 1 with M̃_N = M̃_N^ℓ and R = R_{θ_N^ℓ}, M̃_N^ℓ particles {(I_{N,i}^ℓ, ξ̃_{N,i}^ℓ)}_{i=1}^{M̃_N^ℓ} from π_{N,θ_N^ℓ}^aux. Note that the adjustment multiplier weights are kept constant during the iterations, a limitation which is however not necessary. The second step consists of computing the exact solution

$$\theta_N^{\ell+1} \triangleq \arg\min_{\theta \in \Theta} \sum_{i=1}^{\tilde M_N^\ell} \tilde\omega_{N,i}^\ell\, \log\left[\frac{d\mu_N^{\mathrm{aux}}}{d\pi_{N,\theta}^{\mathrm{aux}}}(I_{N,i}^\ell, \tilde\xi_{N,i}^\ell)\right] \quad (5.3)$$


to the problem of minimising the Monte Carlo approximation of (5.2). In the case where the kernels L and R_θ have densities, denoted by l and r_θ, respectively, with respect to a common reference measure on (Ξ̃, B(Ξ̃)), the minimisation program (5.3) is equivalent to

$$\theta_N^{\ell+1} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{\tilde M_N^\ell} \tilde\omega_{N,i}^\ell\, \log r_\theta(\xi_{N,I_{N,i}^\ell}, \tilde\xi_{N,i}^\ell). \quad (5.4)$$

This algorithm is helpful only in situations where the minimisation problem (5.3) is sufficiently simple to allow for closed-form minimisation; this happens, for example, if the objective function is a convex combination of concave functions, whose minimum either admits a (simple) closed-form expression or is straightforward to compute numerically. As mentioned in Sect. 2.1, this is generally the case when the function r_θ(ξ, ·) belongs to an exponential family for any ξ ∈ Ξ.

Since this optimisation problem closely resembles the Monte Carlo EM algorithm, all the implementation details of such algorithms can be readily transposed to this context; see for example Levine and Casella (2001), Eickhoff et al. (2004), and Levine and Fan (2004). Because we use very simple models, convergence occurs, as seen in Sect. 6, within only a few iterations. When choosing the successive particle sample sizes {M̃_N^ℓ}_{ℓ=1}^L, we are facing a trade-off between precision of the approximation (5.3) of (5.2) and computational cost. Numerical evidence typically shows that these sizes may, as high precision is less crucial here than when generating the final population from π_{N,θ_N^L}^aux, be relatively small compared to the final size M̃_N. Besides, it is possible (and even theoretically recommended) to increase the number of particles with the iteration index, since, heuristically, high accuracy is less required at the first steps. In the current implementation in Sect. 6, we will show that fixing a priori the total number of iterations and using the same number M̃_N^ℓ = M̃_N/L of particles at each iteration may provide satisfactory results in a first run.

    6 Application to state space models

For an illustration of our findings we return to the framework of state space models in Sect. 2.2 and apply the CE-adaptation-based particle method to filtering in nonlinear state space models of the type

$$X_{k+1} = m(X_k) + \sigma_w(X_k)\, W_{k+1}, \quad k \geq 0,$$
$$Y_k = X_k + \sigma_v V_k, \quad k \geq 0, \quad (6.1)$$

where the parameter σ_v and the R-valued functions (m, σ_w) are known, and {W_k}_{k≥1} and {V_k}_{k≥0} are mutually independent sequences of independent standard normally distributed

Algorithm 3 CE-based adaptive APF

Require: {(ξ_{N,i}, ω_{N,i})}_{i=1}^{M_N} targets ν.
1: choose an arbitrary θ_N^0 ∈ Θ,
2: for ℓ = 0, …, L − 1 do   ▹ more intricate stopping criteria are sensible
3:   draw {I_{N,i}^ℓ}_{i=1}^{M̃_N^ℓ} ∼ M(M̃_N^ℓ, {ω_{N,j} ψ_{N,j} / Σ_{n=1}^{M_N} ω_{N,n} ψ_{N,n}}_{j=1}^{M_N}),
4:   simulate {ξ̃_{N,i}^ℓ}_{i=1}^{M̃_N^ℓ} ∼ ⊗_{i=1}^{M̃_N^ℓ} R_{θ_N^ℓ}(ξ_{N,I_{N,i}^ℓ}, ·),
5:   update ω̃_{N,i}^ℓ ← Φ_{θ_N^ℓ}(ξ_{N,I_{N,i}^ℓ}, ξ̃_{N,i}^ℓ),
6:   compute, using the available closed form,
     θ_N^{ℓ+1} ← arg min_{θ∈Θ} Σ_{i=1}^{M̃_N^ℓ} ω̃_{N,i}^ℓ log[dμ_N^aux/dπ_{N,θ}^aux(I_{N,i}^ℓ, ξ̃_{N,i}^ℓ)],
7: end for
8: run Algorithm 1 with R = R_{θ_N^L}.

variables. In this setting, we wish to approximate the filter distributions {φ_{k|k}}_{k≥0}, defined in Sect. 2.2 as the posterior distributions of X_k given Y_{0:k} (recall that Y_{0:k} ≜ (Y_0, …, Y_k)), which in general lack closed-form expressions. For models of this type, the optimal weight and density defined in (2.26) and (2.27), respectively, can be expressed in closed form: at time k,

$$\psi_k^*(x) = N(Y_{k+1};\, m(x),\, \sigma_w^2(x) + \sigma_v^2), \quad (6.2)$$

where N(x; μ, σ²) ≜ exp(−(x − μ)²/(2σ²))/√(2πσ²), and

$$r_k^*(x, x') = N(x';\, \mu^*(x, Y_{k+1}),\, \sigma_*^2(x)), \quad (6.3)$$

with

$$\mu^*(x, Y_{k+1}) \triangleq \frac{\sigma_w^2(x)\, Y_{k+1} + \sigma_v^2\, m(x)}{\sigma_w^2(x) + \sigma_v^2}, \qquad \sigma_*^2(x) \triangleq \frac{\sigma_w^2(x)\, \sigma_v^2}{\sigma_w^2(x) + \sigma_v^2}.$$
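The identities (6.2)–(6.3) are easy to check numerically: ψ*ₖ(x) must equal ∫ l(x, x′) dx′ with l(x, x′) = q(x, x′) g(x′, Y_{k+1}), and the weight ψ*ₖ(x)⁻¹ l/r*ₖ must be identically one. The functions m, σ_w and the numerical values below are illustrative assumptions (the helper `norm_pdf` takes a standard deviation, not a variance).

```python
import numpy as np

norm_pdf = lambda x, mu, sig: np.exp(-0.5 * ((x - mu) / sig) ** 2) / (
    sig * np.sqrt(2.0 * np.pi))

m = lambda x: np.sin(x)                       # illustrative drift
sigma_w = lambda x: np.sqrt(0.3 + 0.1 * x ** 2)
sigma_v, Y, x = 0.4, 0.7, 1.3                 # observation and one particle

l = lambda xp: norm_pdf(xp, m(x), sigma_w(x)) * norm_pdf(Y, xp, sigma_v)

s2 = sigma_w(x) ** 2                          # closed forms (6.2)-(6.3)
psi_star = norm_pdf(Y, m(x), np.sqrt(s2 + sigma_v ** 2))
mu_star = (s2 * Y + sigma_v ** 2 * m(x)) / (s2 + sigma_v ** 2)
sig_star = np.sqrt(s2 * sigma_v ** 2 / (s2 + sigma_v ** 2))

grid = np.linspace(-5.0, 5.0, 100001)
h = grid[1] - grid[0]
quad = np.sum(l(grid)) * h                    # Riemann approximation of int l
weights = l(grid) / (psi_star * norm_pdf(grid, mu_star, sig_star))
```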

We may also compute the chi-square optimal adjustment multiplier weight function ψ*_{χ²,Q} when the prior kernel is used as proposal: at time k,

$$\psi^*_{\chi^2,Q}(x) = \left(\frac{1}{2\sigma_v\sqrt{\pi}}\, N\big(Y_{k+1};\, m(x),\, \sigma_w^2(x) + \sigma_v^2/2\big)\right)^{1/2}, \quad (6.4)$$

which follows from (2.28) by evaluating ∫ q(x, x′) g²(x′, Y_{k+1}) dx′ in closed form (with the convention N(x; μ, σ²) of (6.2)).


We recall from Proposition 2 that the optimal adjustment weight function for the KLD is given by ψ*_{KL,Q}(x) = ψ*ₖ(x).

In this intentionally simple example we consider, at each time step k, adaptation over the family

$$\{R_\lambda(x, \cdot) \triangleq N(\mu^*(x, Y_{k+1}),\, \lambda^2 \sigma_*^2(x)) : x \in \mathbb{R},\, \lambda > 0\} \quad (6.5)$$

of proposal kernels. In addition, we keep the adjustment weights constant, that is, ψ(x) ≡ 1.

The mode of each proposal kernel is centred at the mode of the optimal kernel, and the variance is proportional to the inverse of the Hessian of the negated log-density of the optimal kernel at the mode. Let r_λ(x, x′) ≜ N(x′; μ*(x, Y_{k+1}), λ²σ²_*(x)) denote the density of R_λ(x, ·) with respect to the Lebesgue measure. In this setting, at every time step k, a closed-form expression of the KLD between the target and proposal distributions is available:

$$d_{\mathrm{KL}}(\mu_N^{\mathrm{aux}} \,\|\, \pi_{N,\lambda}^{\mathrm{aux}}) = \sum_{i=1}^{M_N} \frac{\omega_{N,i}\, \psi^*_{N,i}}{\sum_{j=1}^{M_N} \omega_{N,j}\, \psi^*_{N,j}} \log\left(\frac{\psi^*_{N,i}\, \Omega_N}{\sum_{j=1}^{M_N} \omega_{N,j}\, \psi^*_{N,j}}\right) + \log \lambda + \frac{1}{2}\left(\frac{1}{\lambda^2} - 1\right), \quad (6.6)$$

where we set ψ*_{N,i} ≜ ψ*(ξ_{N,i}) and Ω_N ≜ Σ_{i=1}^{M_N} ω_{N,i}.

As we are scaling the optimal standard deviation, it is obvious that

θ*_N ≜ argmin_{θ>0} d_KL(μ^aux_N ‖ μ^aux_{N,θ}) = 1,   (6.7)

which may also be inferred by straightforward differentiation of (6.6) with respect to θ. This provides us with a reference to which the parameter values found by our algorithm can be compared. Note that the instrumental distribution μ^aux_{N,θ*_N} differs from the target distribution μ^aux_N through the adjustment weights used: recall that every instrumental distribution in the family considered has uniform adjustment weights, Ψ(x) ≡ 1, whereas the overall optimal proposal, being equal to the target distribution μ^aux_N, has the optimal weights defined in (6.2). This entails that

d_KL(μ^aux_N ‖ μ^aux_{N,θ*_N})

= Σ_{i=1}^{M_N} [ ω_{N,i} ψ*_{N,i} / Σ_{j=1}^{M_N} ω_{N,j} ψ*_{N,j} ] log( ψ*_{N,i} Ω_N / Σ_{j=1}^{M_N} ω_{N,j} ψ*_{N,j} ),   (6.8)

    which is zero if all the optimal weights are equal.
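Only the last two terms of (6.6) depend on θ; this θ-dependent part, log θ + (1/2)(1/θ² − 1), vanishes at θ = 1 and is positive elsewhere, in agreement with (6.7). A minimal numerical check (the helper name is ours, not from the paper):

```python
import math

def kld_scale_term(theta):
    """theta-dependent part of the KLD (6.6) when the proposal standard
    deviation is theta times the optimal one."""
    return math.log(theta) + 0.5 * (1.0 / theta ** 2 - 1.0)
```

Minimising this term over a grid of scales recovers the optimum θ = 1.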

The implementation of Algorithm 3 is straightforward, as the optimisation program (5.4) has the following closed-form solution:

θ^{ℓ+1}_N = [ Σ_{i=1}^{M_N} ω̃^ℓ_{N,i} ( ξ̃^ℓ_{N,i} − μ_{N,I^ℓ_{N,i}} )² / σ²_{N,I^ℓ_{N,i}} / Σ_{i=1}^{M_N} ω̃^ℓ_{N,i} ]^{1/2},   (6.9)

where μ_{N,i} ≜ μ(ξ_{N,i}, Y_{k+1}) and σ²_{N,i} ≜ σ²(ξ_{N,i}). This is a typical case where the family of proposal kernels allows for efficient minimisation. Richer families sharing this property may also be used, but here we deliberately keep this toy example as simple as possible.

We will study the following special case of the model (6.1):

m(x) ≡ 0,   σ_w(x) = √(β_0 + β_1 x²).

This is the classical Gaussian autoregressive conditional heteroscedasticity (ARCH) model observed in noise (see Bollerslev et al. 1994). In this case an experiment was conducted where we compared:

(i) A plain nonadaptive particle filter with Ψ ≡ 1, that is, the bootstrap particle filter of Gordon et al. (1993),

(ii) An auxiliary filter based on the prior kernel and chi-square optimal weights Ψ*_{χ²,Q},

(iii) Adaptive bootstrap filters with uniform adjustment multiplier weights using numerical minimisation of the empirical CSD and

(iv) the empirical KLD (Algorithm 2),

(v) An adaptive bootstrap filter using direct minimisation of d_KL(μ^aux_N ‖ μ^aux_{N,θ}), see (6.7),

(vi) A CE-based adaptive bootstrap filter, and, as a reference,

(vii) An optimal auxiliary particle filter, i.e. a filter using the optimal weight and proposal kernel defined in (6.2) and (6.3), respectively.
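To reproduce such an experiment one first needs draws from the noisily observed ARCH model. The sketch below is a hypothetical helper of ours (standard-library RNG, our function name) simulating the state and observation sequences of the special case above:

```python
import math
import random

def simulate_arch_in_noise(T, beta0=1.0, beta1=0.99, sigma_v2=10.0, seed=0):
    """Simulate X_{k+1} = sqrt(beta0 + beta1 * X_k^2) * U_{k+1} and
    Y_k = X_k + sigma_v * V_k, with U, V i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    x, xs, ys = 0.0, [], []
    for _ in range(T):
        x = math.sqrt(beta0 + beta1 * x * x) * rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(x + math.sqrt(sigma_v2) * rng.gauss(0.0, 1.0))
    return xs, ys
```

Fixing the seed makes the simulated records reproducible across filter runs.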

This experiment was conducted for the parameter set (β_0, β_1, σ_v²) = (1, 0.99, 10), yielding (since β_1


have the same mode as the optimal transition kernel, automatically adjust the variance of the proposals to that of the optimal kernel, and reach performance comparable to that of the optimal auxiliary filter.

For these observation records, Fig. 1 displays MSE estimates based on 500 filter means. Each filter used 5,000 particles. The reference values used for the MSE estimates were obtained using the optimal auxiliary particle filter with as many as 500,000 particles. This also provided a set from which the initial particles of every filter were drawn, hence allowing for initialisation at the filter distribution a few steps before the outlying observations.

The CE-based filter of Algorithm 3 was implemented in its simplest form, with the inner loop using a constant number of M_N = N/10 = 500 particles and only L = 5 iterations: a simple prefatory study of the model indicated that the Markov chain {θ^ℓ_N}_{ℓ≥0} stabilised around the value reached in the very first step. We set θ^0_N = 10 to avoid initialising at the optimal value.

It can be seen in Fig. 1a that using the CSD optimal weights combined with the prior kernel as proposal does not improve on the plain bootstrap filter, precisely because the observations were chosen in such a way that the prior kernel was helpless. On the contrary, Figs. 1a and 1b show that the adaptive schemes perform exactly like the optimal filter: they all succeed in finding the optimal scale of the standard deviation, and using uniform adjustment weights instead of optimal ones has little impact.

We observe clearly a change of regime, beginning at step 110, corresponding to the outlying constant observations. The adaptive filters recover from the changepoint in one timestep, whereas the bootstrap filter needs several. More importantly, the adaptive filters (as well as the optimal one) reduce, in the regime of the outlying observations, the MSE of the bootstrap filter by a factor 10.

Moreover, for a comparison with fixed simulation budget, we ran a bootstrap filter with 3N = 15,000 particles. This corresponds to the same simulation budget as the CE-based adaptive scheme with N particles, which is, in this setting, the fastest of our adaptive algorithms. In our setting, the CE-based filter is measured to expand the plain bootstrap runtime by a factor 3, although a basic study of algorithmic complexity shows that this factor should be closer to 1 + L·M_N/N = 1.5; the difference arises from Matlab benefitting from the vectorisation of the plain bootstrap filter, not from the iterative nature of the CE.

The conclusion drawn from Fig. 1b is that, for an equal runtime, the adaptive filter outperforms, by a factor 3.5, the bootstrap filter even though the latter uses three times more particles.

Acknowledgements The authors are grateful to Prof. Paul Fearnhead for encouragement and useful recommendations, and to the anonymous reviewers for insightful comments and suggestions that improved the presentation of the paper.

Fig. 1 Plot of MSE performances (on log-scale) on the ARCH model with (β_0, β_1, σ_v²) = (1, 0.99, 10). Reference filters common to both plots are: the bootstrap filter (continuous line), the optimal filter with weights ψ* and proposal kernel density r*, and a bootstrap filter using a proposal with parameter θ*_N minimising the current KLD (continuous line). The MSE values are computed using N = 5,000 particles, except for the reference bootstrap using 3N particles (dashed line), and 1,000 runs of each algorithm

    Appendix A: Proofs

A.1 Proof of Theorems 1 and 2

We preface the proofs of Theorems 1 and 2 with the following two lemmas.

Lemma 1 Assume (A2). Then the following identities hold.


(i) d_KL(μ* ‖ μ^aux) = μL{log[Φ μ(Ψ)/μL(𝟙)]}/μL(𝟙),

(ii) d_χ²(μ* ‖ μ^aux) = μ(Ψ) μL(Φ)/[μL(𝟙)]² − 1.

Proof We denote by q(ξ, ξ̃) the Radon–Nikodym derivative of the probability measure μL/μL(𝟙) with respect to μ ⊗ R (where the outer product μ ⊗ R of a measure μ and a kernel R is defined in (3.2)), that is,

q(ξ, ξ̃) ≜ dμL(ξ, ξ̃)/dμ⊗R(ξ, ξ̃) = [dL(ξ, ·)/dR(ξ, ·)](ξ̃) / ∫∫ μ(dξ′) L(ξ′, dξ̃′),   (A.1)

and by p(ξ) the Radon–Nikodym derivative of the probability measure μ(Ψ ·)/μ(Ψ) with respect to μ:

p(ξ) ≜ Ψ(ξ)/μ(Ψ).   (A.2)

Using the notation above and definition (4.1) of the weight function Φ, we have

Φ(ξ, ξ̃) μ(Ψ)/μL(𝟙) = [dL(ξ, ·)/dR(ξ, ·)](ξ̃)/Ψ(ξ) × μ(Ψ)/μL(𝟙) = p^{−1}(ξ) q(ξ, ξ̃).

This implies that

d_KL(μ* ‖ μ^aux) = ∫∫ μ(dξ) R(ξ, dξ̃) q(ξ, ξ̃) log[p^{−1}(ξ) q(ξ, ξ̃)]

= μL{log[Φ μ(Ψ)/μL(𝟙)]}/μL(𝟙),

which establishes assertion (i). Similarly, we may write

d_χ²(μ* ‖ μ^aux) = ∫∫ μ(dξ) R(ξ, dξ̃) p^{−1}(ξ) q²(ξ, ξ̃) − 1

= μ(Ψ) ∫∫ μ(dξ) R(ξ, dξ̃) {[dL(ξ, ·)/dR(ξ, ·)](ξ̃)}² Ψ^{−1}(ξ) / [μL(𝟙)]² − 1

= μ(Ψ) μL(Φ)/[μL(𝟙)]² − 1,

showing assertion (ii). □
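Assertion (ii) is exactly the quantity that the squared coefficient of variation of the importance weights estimates (cf. Theorem 2 (ii)). As a self-contained sanity check outside the paper's setting (a toy Gaussian pair of our own choosing), take target π = N(0, 1) and proposal q = N(0, σ²) with σ² = 2; the chi-square divergence then has the closed form σ/√(2 − 1/σ²) − 1, and the CV² of an i.i.d. weight sample recovers it:

```python
import math
import random

def cv2(weights):
    """Squared coefficient of variation of a weight sample; it estimates
    the chi-square divergence between target and proposal."""
    n, s, s2 = len(weights), sum(weights), sum(w * w for w in weights)
    return n * s2 / (s * s) - 1.0

# Target pi = N(0, 1), proposal q = N(0, sig2); weight w(x) = pi(x)/q(x).
rng = random.Random(1)
sig2 = 2.0
draws = [rng.gauss(0.0, math.sqrt(sig2)) for _ in range(100_000)]
w = [math.sqrt(sig2) * math.exp(-x * x / 2.0 + x * x / (2.0 * sig2)) for x in draws]
exact = math.sqrt(sig2) / math.sqrt(2.0 - 1.0 / sig2) - 1.0  # closed form
```

With 100,000 draws the Monte Carlo estimate cv2(w) agrees with the closed form to a few thousandths.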

Lemma 2 Assume (A1, A2) and let C ≜ {f ∈ B(Ξ × Ξ̃) : L(·, |f|) ∈ L¹(Ξ, μ)}. Then, for all f ∈ C, as N → ∞,

Ω̃_N^{−1} Σ_{i=1}^{M_N} ω̃_{N,i} f(ξ_{N,I_{N,i}}, ξ̃_{N,i}) →_P μL(f)/μL(𝟙),

where Ω̃_N ≜ Σ_{i=1}^{M_N} ω̃_{N,i}.

Proof It is enough to prove that

M_N^{−1} Σ_{i=1}^{M_N} ω̃_{N,i} f(ξ_{N,I_{N,i}}, ξ̃_{N,i}) →_P μL(f)/μ(Ψ),   (A.3)

for all f ∈ C; indeed, since the function f ≡ 𝟙 belongs to C under (A2), the result of the lemma will follow from (A.3) by Slutsky's theorem. Define the measure ν(A) ≜ μ(Ψ𝟙_A)/μ(Ψ), with A ∈ B(Ξ). By applying Theorem 1 in Douc and Moulines (2008) we conclude that the weighted sample {(ξ_{N,i}, ω_{N,i})}_{i=1}^{M_N} is consistent for ν. Moreover, by Theorem 2 in the same paper this is also true for the uniformly weighted sample {(ξ_{N,I_{N,i}}, 1)}_{i=1}^{M_N} (see the proof of Theorem 3.1 in Douc et al. 2008 for details). By definition, for f ∈ C, R(·, Φ(·, ·)|f(·, ·)|)Ψ(·) = L(·, |f|) ∈ L¹(Ξ, μ); now consider, for any C > 0, the decomposition

M_N^{−1} Σ_{i=1}^{M_N} E[ ω̃_{N,i} |f(ξ_{N,I_{N,i}}, ξ̃_{N,i})| 𝟙{ω̃_{N,i} |f(ξ_{N,I_{N,i}}, ξ̃_{N,i})| ≥ C} | F_N ]


≤ M_N^{−1} Σ_{i=1}^{M_N} R(ξ_{N,I_{N,i}}, Φ|f| 𝟙{Φ|f| ≥ C}) + 𝟙{M_N


Σ_{i=1}^{M_N} E^aux_N [ (dμ^aux_N/dπ^aux_N)(ξ_{N,I_{N,m}}, ξ̃_{N,m}) | I_{N,m} = i ] π^aux_N({i} × Ξ̃) →_P μ(Ψ) μL(Φ)/[μL(𝟙)]²,

which proves (A.8). This completes the proof of (ii).

Proof of Theorem 2 Applying directly Lemma 2 for f = log Φ (which belongs to C by assumption) and the limit (A.3) for f ≡ 𝟙 yields, by the continuous mapping theorem,

E({ω̃_{N,i}}_{i=1}^{M_N}) = Ω̃_N^{−1} Σ_{i=1}^{M_N} ω̃_{N,i} log ω̃_{N,i} + log(M_N Ω̃_N^{−1})

→_P μL(log Φ)/μL(𝟙) + log[μ(Ψ)/μL(𝟙)] = μL{log[Φ μ(Ψ)/μL(𝟙)]}/μL(𝟙).

Now, we complete the proof of assertion (i) by applying Lemma 1.

We turn to (ii). Since Φ belongs to C by assumption, we obtain, by applying Lemma 2 together with (A.3),

CV²({ω̃_{N,i}}_{i=1}^{M_N}) = M_N Ω̃_N^{−2} Σ_{i=1}^{M_N} ω̃²_{N,i} − 1 →_P μ(Ψ) μL(Φ)/[μL(𝟙)]² − 1.   (A.9)

From this (ii) follows via Lemma 1. □
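Assertion (i) can be illustrated numerically in the same spirit: the negated Shannon-entropy statistic E of the weights converges to the KLD between target and instrumental distribution. For a toy Gaussian pair of our own choosing, target N(0, 1) and proposal N(0, σ²), the limit has the closed form (1/2)(1/σ² + log σ² − 1):

```python
import math
import random

def kld_estimate(weights):
    """Negated-entropy statistic of Theorem 2 (i): estimates the KLD
    between target and instrumental distribution."""
    n, s = len(weights), sum(weights)
    return sum(w * math.log(w) for w in weights) / s + math.log(n / s)

# Target N(0, 1), proposal N(0, sig2); weight w(x) = pi(x)/q(x).
rng = random.Random(2)
sig2 = 2.0
draws = [rng.gauss(0.0, math.sqrt(sig2)) for _ in range(100_000)]
w = [math.sqrt(sig2) * math.exp(-x * x / 2.0 + x * x / (2.0 * sig2)) for x in draws]
exact = 0.5 * (1.0 / sig2 + math.log(sig2) - 1.0)  # d_KL(N(0,1) || N(0,2))
```

With 100,000 draws, kld_estimate(w) matches the closed-form divergence to within a few thousandths.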

A.2 Proof of Proposition 2

Define q(ξ) ≜ ∫ R(ξ, dξ̃) q(ξ, ξ̃), the marginal density with respect to μ of the probability measure B(Ξ) ∋ A ↦ ∫∫ 𝟙_A(ξ) μ(dξ) R(ξ, dξ̃) q(ξ, ξ̃). We denote by q(ξ̃ | ξ) ≜ q(ξ, ξ̃)/q(ξ) the associated conditional density. By the chain rule of the entropy (the entropy of a pair of random variables is the entropy of one plus the conditional entropy of the other), we may split the KLD as follows:

d_KL(μ^aux_N ‖ π^aux_N) = ∫ μ(dξ) q(ξ) log[p^{−1}(ξ) q(ξ)] + ∫∫ μ(dξ) R(ξ, dξ̃) q(ξ, ξ̃) log q(ξ̃ | ξ).

The second term in the RHS of the previous equation does not depend on the adjustment multiplier weight Ψ. The first term is cancelled if we set p = q, i.e. if

Ψ*(ξ)/μ(Ψ*) = ∫ R(ξ, dξ̃) q(ξ, ξ̃) = L(ξ, 𝟙) / ∫ μ(dξ′) L(ξ′, 𝟙),

which establishes assertion (i).

Consider now assertion (ii). Note first that

∫∫ μ(dξ) R(ξ, dξ̃) p^{−1}(ξ) q²(ξ, ξ̃) − 1 = ∫ μ(dξ) p^{−1}(ξ) g²(ξ) − 1

= μ²(g) [ ∫ μ(dξ) g²(ξ) / (p(ξ) μ²(g)) − 1 ] + μ²(g) − 1,   (A.10)

where

g²(ξ) ≜ ∫ R(ξ, dξ̃) q²(ξ, ξ̃).

The first term on the RHS of (A.10) is the CSD between the probability distributions associated with the densities g/μ(g) and Ψ/μ(Ψ) with respect to μ. The second term does not depend on Ψ, and the optimal value of the adjustment multiplier weight is obtained by cancelling the first term. This establishes assertion (ii). □

    References

Anderson, B.D.O., Moore, J.B.: Optimal Filtering. Prentice Hall, New York (1979)

Andrieu, C., Davy, M., Doucet, A.: Efficient particle filtering for jump Markov systems. Application to time-varying autoregressions. IEEE Trans. Signal Process. 51(7), 1762–1770 (2003)

Arouna, B.: Robbins–Monro algorithms and variance reduction in finance. J. Comput. Finance 7(2) (2004)

Bollerslev, T., Engle, R.F., Nelson, D.: ARCH models. In: Engle, R.F., McFadden, D. (eds.) Handbook of Econometrics, pp. 2959–3038. North-Holland, Amsterdam (1994)

Cappé, O., Moulines, E., Rydén, T.: Inference in Hidden Markov Models. Springer, Berlin (2005), http://www.tsi.enst.fr/~cappe/ihmm/

Cappé, O., Douc, R., Guillin, A., Marin, J.M., Robert, C.P.: Adaptive importance sampling in general mixture classes. Stat. Comput. (2008, this issue)

Carpenter, J., Clifford, P., Fearnhead, P.: An improved particle filter for non-linear problems. IEE Proc. Radar Sonar Navig. 146, 2–7 (1999)

Chen, M., Chen, G.: Geometric ergodicity of nonlinear autoregressive models with changing conditional variances. Can. J. Stat. 28(3), 605–613 (2000)

Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (1991)

de Boer, P.T., Kroese, D.P., Mannor, S., Rubinstein, R.Y.: A tutorial on the cross-entropy method. Ann. Oper. Res. 134, 19–67 (2005)

Douc, R., Moulines, E.: Limit theorems for weighted samples with applications to sequential Monte Carlo. Ann. Stat. 36 (2008, to appear), arXiv:math.ST/0507042

Douc, R., Moulines, E., Olsson, J.: On the auxiliary particle filter. Probab. Math. Stat. 28(2) (2008)


Doucet, A., Godsill, S., Andrieu, C.: On sequential Monte-Carlo sampling methods for Bayesian filtering. Stat. Comput. 10, 197–208 (2000)

Doucet, A., De Freitas, N., Gordon, N. (eds.): Sequential Monte Carlo Methods in Practice. Springer, New York (2001)

Eickhoff, J.C., Zhu, J., Amemiya, Y.: On the simulation size and the convergence of the Monte Carlo EM algorithm via likelihood-based distances. Stat. Probab. Lett. 67(2), 161–171 (2004)

Evans, M., Swartz, T.: Methods for approximating integrals in statistics with special emphasis on Bayesian integration problems. Stat. Sci. 10, 254–272 (1995)

Fearnhead, P.: Computational methods for complex stochastic systems: a review of some alternatives to MCMC. Stat. Comput. 18, 151–171 (2008)

Fearnhead, P., Liu, Z.: On-line inference for multiple changepoint problems. J. R. Stat. Soc. Ser. B 69(4), 590–605 (2007)

Fort, G., Moulines, E.: Convergence of the Monte Carlo expectation maximization for curved exponential families. Ann. Stat. 31(4), 1220–1259 (2003)

Fox, D.: Adapting the sample size in particle filters through KLD-sampling. Int. J. Rob. Res. 22(11), 985–1004 (2003)

Geweke, J.: Bayesian inference in econometric models using Monte-Carlo integration. Econometrica 57(6), 1317–1339 (1989)

Givens, G., Raftery, A.: Local adaptive importance sampling for multivariate densities with strong nonlinear relationships. J. Am. Stat. Assoc. 91(433), 132–141 (1996)

Gordon, N., Salmond, D., Smith, A.F.: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F Radar Signal Process. 140, 107–113 (1993)

Ho, Y.C., Lee, R.C.K.: A Bayesian approach to problems in stochastic estimation and control. IEEE Trans. Autom. Control 9(4), 333–339 (1964)

Hu, X.L., Schon, T.B., Ljung, L.: A basic convergence result for particle filtering. IEEE Trans. Signal Process. 56(4), 1337–1348 (2008)

Hürzeler, M., Künsch, H.R.: Monte Carlo approximations for general state-space models. J. Comput. Graph. Stat. 7, 175–193 (1998)

Kailath, T., Sayed, A., Hassibi, B.: Linear Estimation. Prentice Hall, New York (2000)

Kong, A., Liu, J.S., Wong, W.: Sequential imputation and Bayesian missing data problems. J. Am. Stat. Assoc. 89(425), 278–288 (1994)

Künsch, H.R.: Recursive Monte-Carlo filters: algorithms and theoretical analysis. Ann. Stat. 33(5), 1983–2021 (2005), arXiv:math.ST/0602211

Legland, F., Oudjane, N.: A sequential algorithm that keeps the particle system alive. Tech. rep., Rapport de recherche 5826, INRIA (2006), ftp://ftp.inria.fr/INRIA/publication/publi-pdf/RR/RR-5826.pdf

Levine, R.A., Casella, G.: Implementations of the Monte Carlo EM algorithm. J. Comput. Graph. Stat. 10(3), 422–439 (2001)

Levine, R.A., Fan, J.: An automated (Markov chain) Monte Carlo EM algorithm. J. Stat. Comput. Simul. 74(5), 349–359 (2004)

Liu, J.: Monte Carlo Strategies in Scientific Computing. Springer, Berlin (2004)

Oh, M.S., Berger, J.O.: Adaptive importance sampling in Monte Carlo integration. J. Stat. Comput. Simul. 41(3–4), 143–168 (1992)

Oh, M.S., Berger, J.O.: Integration of multimodal functions by Monte Carlo importance sampling. J. Am. Stat. Assoc. 88(422), 450–456 (1993)

Olsson, J., Moulines, E., Douc, R.: Improving the performance of the two-stage sampling particle filter: a statistical perspective. In: Proceedings of the IEEE/SP 14th Workshop on Statistical Signal Processing, Madison, USA, pp. 284–288 (2007)

Pitt, M.K., Shephard, N.: Filtering via simulation: auxiliary particle filters. J. Am. Stat. Assoc. 94(446), 590–599 (1999)

Ristic, B., Arulampalam, M., Gordon, A.: Beyond Kalman Filters: Particle Filters for Target Tracking. Artech House, Norwood (2004)

Rubinstein, R.Y., Kroese, D.P.: The Cross-Entropy Method. Springer, Berlin (2004)

Shen, C., van den Hengel, A., Dick, A., Brooks, M.J.: Enhanced importance sampling: unscented auxiliary particle filtering for visual tracking. In: AI 2004: Advances in Artificial Intelligence. Lecture Notes in Comput. Sci., vol. 3339, pp. 180–191. Springer, Berlin (2004)

Shephard, N., Pitt, M.: Likelihood analysis of non-Gaussian measurement time series. Biometrika 84(3), 653–667 (1997), erratum in 91, 249–250 (2004)

Soto, A.: Self adaptive particle filter. In: Kaelbling, L.P., Saffiotti, A. (eds.) Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, Scotland, pp. 1398–1406 (2005), http://www.ijcai.org/proceedings05.php

Stephens, M., Donnelly, P.: Inference in molecular population genetics. J. R. Stat. Soc. Ser. B Stat. Methodol. 62(4), 605–655 (2000), with discussion and a reply by the authors

Straka, O., Simandl, M.: Particle filter adaptation based on efficient sample size. In: Proceedings of the 14th IFAC Symposium on System Identification, Newcastle, Australia, pp. 991–996 (2006)

Van der Vaart, A.W.: Asymptotic Statistics. Cambridge University Press, Cambridge (1998)

Wei, G.C.G., Tanner, M.A.: A Monte-Carlo implementation of the EM algorithm and the poor man's Data Augmentation algorithms. J. Am. Stat. Assoc. 85, 699–704 (1991)