
7/29/2019 Bayesian Filtering, 1/69

MANUSCRIPT 1

Bayesian Filtering: From Kalman Filters to Particle Filters, and Beyond

    ZHE CHEN

Abstract: In this self-contained survey/review paper, we systematically investigate the roots of Bayesian filtering as well as its rich leaves in the literature. Stochastic filtering theory is briefly reviewed with emphasis on nonlinear and non-Gaussian filtering. Following the Bayesian statistics, different Bayesian filtering techniques are developed given different scenarios. Under the linear quadratic Gaussian circumstance, the celebrated Kalman filter can be derived within the Bayesian framework. Optimal/suboptimal nonlinear filtering techniques are extensively investigated. In particular, we focus our attention on the Bayesian filtering approach based on sequential Monte Carlo sampling, the so-called particle filters. Many variants of the particle filter as well as their features (strengths and weaknesses) are discussed. Related theoretical and practical issues are addressed in detail. In addition, some other (new) directions on Bayesian filtering are also explored.

Index Terms: Stochastic filtering, Bayesian filtering, Bayesian inference, particle filter, sequential Monte Carlo, sequential state estimation, Monte Carlo methods.

    The probability of any event is the ratio between thevalue at which an expectation depending on the happeningof the event ought to be computed, and the value of thething expected upon its happening.

    Thomas Bayes (1702-1761), [29]

Statistics is the art of never having to say you're wrong. Variance is what any two statisticians are at.

    C. J. Bradfield

Contents

I Introduction 2
 I-A Stochastic Filtering Theory 2
 I-B Bayesian Theory and Bayesian Filtering 2
 I-C Monte Carlo Methods and Monte Carlo Filtering 2
 I-D Outline of Paper 3

II Mathematical Preliminaries and Problem Formulation 4
 II-A Preliminaries 4
 II-B Notations 4
 II-C Stochastic Filtering Problem 4
 II-D Nonlinear Stochastic Filtering Is an Ill-posed Inverse Problem 5
  II-D.1 Inverse Problem 5
  II-D.2 Differential Operator and Integral Equation 6
  II-D.3 Relations to Other Problems 7
 II-E Stochastic Differential Equations and Filtering 7

III Bayesian Statistics and Bayesian Estimation 8
 III-A Bayesian Statistics 8
 III-B Recursive Bayesian Estimation 9

IV Bayesian Optimal Filtering 9
 IV-A Optimal Filtering 10
 IV-B Kalman Filtering 11
 IV-C Optimum Nonlinear Filtering 13
  IV-C.1 Finite-dimensional Filters 13

V Numerical Approximation Methods 14
 V-A Gaussian/Laplace Approximation 14
 V-B Iterative Quadrature 14
 V-C Multigrid Method and Point-Mass Approximation 14
 V-D Moment Approximation 15
 V-E Gaussian Sum Approximation 16
 V-F Deterministic Sampling Approximation 16
 V-G Monte Carlo Sampling Approximation 17
  V-G.1 Importance Sampling 18
  V-G.2 Rejection Sampling 19
  V-G.3 Sequential Importance Sampling 19
  V-G.4 Sampling-Importance Resampling 20
  V-G.5 Stratified Sampling 21
  V-G.6 Markov Chain Monte Carlo 22
  V-G.7 Hybrid Monte Carlo 23
  V-G.8 Quasi-Monte Carlo 24

VI Sequential Monte Carlo Estimation: Particle Filters 25
 VI-A Sequential Importance Sampling (SIS) Filter 26
 VI-B Bootstrap/SIR Filter 26
 VI-C Improved SIS/SIR Filters 27
 VI-D Auxiliary Particle Filter 28
 VI-E Rejection Particle Filter 29
 VI-F Rao-Blackwellization 30
 VI-G Kernel Smoothing and Regularization 31
 VI-H Data Augmentation 32
  VI-H.1 Data Augmentation Is an Iterative Kernel Smoothing Process 32
  VI-H.2 Data Augmentation as a Bayesian Sampling Method 33
 VI-I MCMC Particle Filter 33
 VI-J Mixture Kalman Filters 34
 VI-K Mixture Particle Filters 34
 VI-L Other Monte Carlo Filters 35
 VI-M Choices of Proposal Distribution 35
  VI-M.1 Prior Distribution 35
  VI-M.2 Annealed Prior Distribution 36
  VI-M.3 Likelihood 36
  VI-M.4 Bridging Density and Partitioned Sampling 37
  VI-M.5 Gradient-Based Transition Density 38
  VI-M.6 EKF as Proposal Distribution 38
  VI-M.7 Unscented Particle Filter 38
 VI-N Bayesian Smoothing 38
  VI-N.1 Fixed-point smoothing 38
  VI-N.2 Fixed-lag smoothing 39
  VI-N.3 Fixed-interval smoothing 39
 VI-O Likelihood Estimate 40
 VI-P Theoretical and Practical Issues 40
  VI-P.1 Convergence and Asymptotic Results 40
  VI-P.2 Bias-Variance 41
  VI-P.3 Robustness 43
  VI-P.4 Adaptive Procedure 46
  VI-P.5 Evaluation and Implementation 46

VII Other Forms of Bayesian Filtering and Inference 47
 VII-A Conjugate Analysis Approach 47
 VII-B Differential Geometrical Approach 47
 VII-C Interacting Multiple Models 48
 VII-D Bayesian Kernel Approaches 48
 VII-E Dynamic Bayesian Networks 48

VIII Selected Applications 49
 VIII-A Target Tracking 49
 VIII-B Computer Vision and Robotics 49
 VIII-C Digital Communications 49
 VIII-D Speech Enhancement and Speech Recognition 50
 VIII-E Machine Learning 50
 VIII-F Others 50
 VIII-G An Illustrative Example: Robot-Arm Problem 50

IX Discussion and Critique 51
 IX-A Parameter Estimation 51
 IX-B Joint Estimation and Dual Estimation 51
 IX-C Prior 52
 IX-D Localization Methods 52
 IX-E Dimensionality Reduction and Projection 53
 IX-F Unanswered Questions 53

X Summary and Concluding Remarks 55

The work is supported by the Natural Sciences and Engineering Research Council of Canada. Z. Chen was also partially supported by a Clifton W. Sherman Scholarship. The author is with the Communications Research Laboratory, McMaster University, Hamilton, Ontario, Canada L8S 4K1, e-mail: [email protected], Tel: (905) 525-9140 x27282, Fax: (905) 521-2922.

    I. Introduction

THE contents of this paper span three major scientific areas: stochastic filtering theory, Bayesian theory, and Monte Carlo methods. All of them are discussed closely around the subject of our interest: Bayesian filtering. In the course of telling this long story, some relevant theories are briefly reviewed for the purpose of providing the reader a complete picture, and mathematical preliminaries and background materials are provided in detail to keep the paper self-contained.

A. Stochastic Filtering Theory

Stochastic filtering theory was first established in the early 1940s through the pioneering work of Norbert Wiener [487], [488] and Andrey N. Kolmogorov [264], [265], and it culminated in 1960 with the publication of the classic Kalman filter (KF) [250] (and the subsequent Kalman-Bucy filter in 1961 [249]),1 though much credit is also due to earlier work by Bode and Shannon [46], Zadeh and Ragazzini [502], [503], Swerling [434], Levinson [297], and others. Without any exaggeration, it seems fair to say that the Kalman filter (and its numerous variants) has dominated adaptive filter theory for decades in the signal processing and control areas. Nowadays, Kalman filters have been applied in various engineering and scientific areas, including communications, machine learning, neuroscience, economics, finance, political science, and many others. Bearing in mind that the Kalman filter is limited by its assumptions, numerous nonlinear filtering methods along its line have been proposed and developed to overcome its limitations.

1 Another important event in 1960 was the publication of the celebrated least-mean-squares (LMS) algorithm [485]. However, the LMS filter is not discussed in this paper; the reader can refer to [486], [205], [207], [247] for more information.

    B. Bayesian Theory and Bayesian Filtering

Bayesian theory2 was originally discovered by the British researcher Thomas Bayes in a posthumous publication in 1763 [29]. The well-known Bayes theorem describes the fundamental probability law governing the process of logical inference. However, Bayesian theory did not gain its deserved attention in the early days until its modern form was rediscovered by the French mathematician Pierre-Simon de Laplace in Théorie analytique des probabilités.3 Bayesian inference [38], [388], [375], devoted to applying Bayesian statistics to statistical inference, has become one of the important branches in statistics, and has been applied successfully in statistical decision, detection and estimation, pattern recognition, and machine learning. In particular, the November 19, 1999 issue of Science magazine gave the Bayesian research boom four pages of special attention [320]. In many scenarios, the solutions gained through Bayesian inference are viewed as optimal. Not surprisingly, Bayesian theory was also studied in the filtering literature. One of the first explorations of iterative Bayesian estimation is found in the paper of Ho and Lee [212], in which they specified the principle and procedure of Bayesian filtering. Spragins [426] discussed the iterative application of Bayes' rule to sequential parameter estimation and called it Bayesian learning. Lin and Yau [301] and Chien and Fu [92] discussed the Bayesian approach to optimization of adaptive systems. Bucy [62] and Bucy and Senne [63] also explored the point-mass approximation method in the Bayesian filtering framework.

    C. Monte Carlo Methods and Monte Carlo Filtering

The early idea of Monte Carlo4 can be traced back to the problem of Buffon's needle, with which Buffon attempted in 1777 to estimate π (see e.g., [419]). But the modern formulation of Monte Carlo methods started in the 1940s in physics [330], [329], [393] and later, in the 1950s, in statistics [198]. During World War II, John von Neumann, Stanislaw Ulam, Nicholas Metropolis, and others initiated the Monte Carlo method at Los Alamos Laboratory. von Neumann also used the Monte Carlo method to calculate the elements of an inverse matrix, in which work the Russian roulette and splitting methods were redefined [472]. In recent decades, Monte Carlo techniques have been rediscovered independently in statistics, physics, and engineering. Many new Monte Carlo methodologies (e.g. Bayesian bootstrap, hybrid Monte Carlo, quasi-Monte Carlo) have been rejuvenated and developed. Roughly speaking, Monte Carlo

2 A generalized Bayesian theory is the so-called quasi-Bayesian theory (e.g. [100]), which is built on a convex set of probability distributions and a relaxed set of axioms about preferences; we do not discuss it in this paper.
3 An interesting history of Thomas Bayes and his famous essay is found in [110].
4 The method is named after the city in the Monaco principality, because of a roulette, a simple random number generator. The name was first suggested by Stanislaw Ulam.


technique is a kind of stochastic sampling approach aiming to tackle complex systems that are analytically intractable. The power of Monte Carlo methods is that they can attack difficult numerical integration problems. In recent years, sequential Monte Carlo approaches have attracted more and more attention from researchers in different areas, with many successful applications in statistics (see e.g. the March 2001 special issue of Annals of the Institute of Statistical Mathematics), signal processing (see e.g. the February 2002 special issue of IEEE Transactions on Signal Processing), machine learning, econometrics, automatic control, tracking, communications, biology, and many others (e.g., see [141] and the references therein). One of the attractive merits of sequential Monte Carlo approaches lies in the fact that they allow on-line estimation by combining the powerful Monte Carlo sampling methods with Bayesian inference, at the expense of reasonable computational cost. In particular, the sequential Monte Carlo approach has been used in parameter estimation and state estimation, for the latter of which it is sometimes called the particle filter.5 The basic idea of the particle filter is to use a number of independent random variables called particles,6 sampled directly from the state space, to represent the posterior probability, and to update the posterior by incorporating the new observations; the particle system is properly located, weighted, and propagated recursively according to the Bayesian rule. In retrospect, the earliest idea of the Monte Carlo method used in statistical inference is found in [200], [201], and later in [5], [6], [106], [433], [258], but the formal establishment of the particle filter seems fair to be credited to Gordon, Salmond and Smith [193], who introduced certain novel resampling techniques to the formulation. Almost at the same time, a number of statisticians also independently rediscovered and developed the sampling-importance-resampling (SIR) idea [414], [466], [303], which was originally proposed by Rubin [395], [397] in a non-dynamic framework.7 The rediscovery and renaissance of particle filters came in the mid-1990s (e.g. [259], [222], [229], [304], [307], [143], [40]), after a long dominant period, partially thanks to the ever-increasing computing power. Recently, a lot of work has been done to improve the performance of particle filters [69], [189], [428], [345], [456], [458], [357]. Also, many doctoral theses were devoted to Monte Carlo filtering and inference from different perspectives [191], [142], [162], [118], [221], [228], [35], [97], [465], [467], [86].
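The locate-weight-propagate recursion described above can be made concrete with a minimal bootstrap (SIR) particle filter sketch. The model functions, noise scales, and particle count below are illustrative assumptions, not taken from the paper: the state-transition prior serves as the proposal, particles are reweighted by the likelihood of each new observation, and multinomial resampling counters weight degeneracy.

```python
import numpy as np

def bootstrap_particle_filter(ys, f, g, q_std, r_std, n_particles=500, seed=0):
    """Minimal bootstrap/SIR particle filter for the (assumed) model
        x[n+1] = f(x[n]) + q,  q ~ N(0, q_std^2)
        y[n]   = g(x[n]) + r,  r ~ N(0, r_std^2).
    Returns the posterior-mean state estimate at each time step."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, n_particles)   # particles drawn from a prior p(x0)
    estimates = []
    for y in ys:
        # propagate each particle through the state transition (proposal = prior)
        x = f(x) + rng.normal(0.0, q_std, n_particles)
        # weight by the Gaussian likelihood p(y | x)
        w = np.exp(-0.5 * ((y - g(x)) / r_std) ** 2)
        w /= w.sum()
        estimates.append(np.sum(w * x))     # weighted posterior-mean estimate
        # multinomial resampling to combat weight degeneracy
        x = x[rng.choice(n_particles, n_particles, p=w)]
    return np.array(estimates)

# toy linear-Gaussian run, so tracking quality is easy to check
rng = np.random.default_rng(1)
truth, obs, x = [], [], 0.0
for _ in range(50):
    x = 0.9 * x + rng.normal(0.0, 0.5)
    truth.append(x)
    obs.append(x + rng.normal(0.0, 0.5))
est = bootstrap_particle_filter(np.array(obs), lambda x: 0.9 * x,
                                lambda x: x, q_std=0.5, r_std=0.5)
```

In this linear-Gaussian toy case the Kalman filter is exact, which makes it a convenient sanity check for the sampled approximation; the particle machinery pays off only once f or g is nonlinear or the noises are non-Gaussian.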

It is noted that the particle filter is not the only leaf in the Bayesian filtering tree, in the sense that Bayesian filtering can also be tackled with other techniques, such as the differential geometry approach, variational methods, or conjugate methods. Some potential future directions will be to consider combining these methods with Monte Carlo sampling techniques, as we will discuss in the paper. The attention of this paper, however, remains on the Monte Carlo methods and particularly sequential Monte Carlo estimation.

5 Many other terminologies also exist in the literature, e.g., SIS filter, SIR filter, bootstrap filter, sequential imputation, or the CONDENSATION algorithm (see [224] for many others), though they are addressed differently in different areas. In this paper, we treat them as different variants within the generic Monte Carlo filter family. Monte Carlo filters are not all sequential Monte Carlo estimation.
6 The particle filter is called normal if it produces i.i.d. samples; sometimes negative correlations are deliberately introduced among the particles for the sake of variance reduction.
7 The earliest idea of multiple imputation due to Rubin was published in 1978 [394].

    D. Outline of Paper

In this paper, we present a comprehensive review of stochastic filtering theory from the Bayesian perspective. [It happens to be almost three decades after the 1974 publication of Prof. Thomas Kailath's illuminating review paper "A view of three decades of linear filtering theory" [244]; we take this opportunity to dedicate this paper to him, who has greatly contributed to the literature of stochastic filtering theory.] With the tool of Bayesian statistics, it turns out that the celebrated Kalman filter is a special case of Bayesian filtering under the LQG (linear, quadratic, Gaussian) circumstance, a fact that was first observed by Ho and Lee [212]; particle filters are also essentially rooted in Bayesian statistics, in the spirit of recursive Bayesian estimation. Of particular interest to us are the nonlinear, non-Gaussian, and non-stationary situations that we mostly encounter in the real world. Generally, for nonlinear filtering no exact solution can be obtained, or the solution is infinite-dimensional,8 hence various numerical approximation methods come in to address the intractability. In particular, we focus our attention on the sequential Monte Carlo method, which allows on-line estimation from a Bayesian perspective. The historical roots of, and remarks on, Monte Carlo filtering are traced. Bayesian filtering approaches outside the Monte Carlo framework are also reviewed. Besides, we extend our discussion from Bayesian filtering to Bayesian inference, in the latter of which the well-known hidden Markov model (HMM) (a.k.a. the HMM filter), dynamic Bayesian networks (DBN), and Bayesian kernel machines are also briefly discussed.

Nowadays Bayesian filtering has become such a broad topic involving many scientific areas that a comprehensive survey and detailed treatment seems crucial to cater to the ever-growing demand for understanding of this important field among many novices, though it is noticed by the author that in the literature there exist a number of excellent tutorial papers on particle filters and Monte Carlo filters [143], [144], [19], [438], [443], as well as relevant edited volumes [141] and books [185], [173], [306], [82]. Unfortunately, as observed in our comprehensive bibliography, many papers were written by statisticians or physicists with special terminologies that might be unfamiliar to many engineers. Besides, the papers were written with different nomenclatures for different purposes (e.g. convergence and asymptotic results rarely receive attention in engineering but are important for statisticians). The author thus felt obligated to write a tutorial paper on this emerging and promising area for a readership of engineers, and to introduce to the reader many techniques developed in statistics

8 Or the sufficient statistics are infinite-dimensional.


and physics. For this purpose again, for the variety of particle filter algorithms, the basic ideas instead of the mathematical derivations are emphasized. Further details and experimental results are indicated in the references. Due to the dual tutorial/review nature of the current paper, only a few simple examples and simulations are presented to illustrate the essential ideas, and no comparative results are available at this stage (see another paper [88]); however, this does not prevent us from presenting new thoughts. Moreover, many graphical and tabular illustrations are presented. Since it is also a survey paper, extensive bibliographies are included in the references. But there is no claim that the bibliographies are complete, due to our knowledge limitations as well as the space allowance.

The rest of this paper is organized as follows: In Section II, some basic mathematical preliminaries of stochastic filtering theory are given, and the stochastic filtering problem is mathematically formulated. Section III presents the essential Bayesian theory, particularly Bayesian statistics and Bayesian inference. In Section IV, the Bayesian filtering theory is systematically investigated. Following the simplest LQG case, the celebrated Kalman filter is briefly derived, followed by the discussion of optimal nonlinear filtering. Section V discusses many popular numerical approximation techniques, with special emphasis on Monte Carlo sampling methods, which result in the various forms of particle filters in Section VI. In Section VII, some other (new) Bayesian filtering approaches beyond Monte Carlo sampling are also reviewed. Section VIII presents some selected applications and one illustrative example of particle filters. We give some discussions and critiques in Section IX and conclude the paper in Section X.

II. Mathematical Preliminaries and Problem Formulation

A. Preliminaries

Definition 1: Let S be a set and F be a family of subsets of S. F is a σ-algebra if (i) ∅ ∈ F; (ii) A ∈ F implies A^c ∈ F; (iii) A1, A2, · · · ∈ F implies ∪_{i=1}^∞ Ai ∈ F. A σ-algebra is closed under complement and under union of countably infinitely many sets.

Definition 2: A probability space is defined by the elements {Ω, F, P}, where F is a σ-algebra of Ω and P is a complete, σ-additive probability measure on all F. In other words, P is a set function whose arguments are random events (elements of F) such that the axioms of probability hold.

Definition 3: Let p(x) = dP(x)/dμ denote the Radon-Nikodym density of the probability distribution P(x) w.r.t. a measure μ. When x ∈ X is discrete and μ is a counting measure, p(x) is a probability mass function (pmf); when x is continuous and μ is a Lebesgue measure, p(x) is a probability density function (pdf).

Intuitively, the true distribution P(x) can be replaced by the empirical distribution given the simulated samples (see Fig. 1 for illustration)

P̂(x) = (1/Np) Σ_{i=1}^{Np} δ(x − x^(i)),

where δ(·) is a Radon-Nikodym density w.r.t. μ of the point-mass distribution concentrated at the point x. When x ∈ X is discrete, δ(x − x^(i)) is 1 for x = x^(i) and 0 elsewhere. When x ∈ X is continuous, δ(x − x^(i)) is a Dirac delta function: δ(x − x^(i)) = 0 for all x ≠ x^(i), and ∫_X dP(x) = ∫_X p(x)dx = 1.

[Fig. 1. Empirical probability distribution (density) function constructed from the discrete observations {x^(i)}.]
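Because the empirical distribution is a sum of point masses, integrals against P(x) reduce to plain sample averages. A small sketch, where the target N(2, 1) and the sample size are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# draw x^(i) ~ P(x) = N(2, 1); these play the role of the simulated samples
samples = rng.normal(loc=2.0, scale=1.0, size=100_000)

# under the empirical distribution, E[h(x)] = ∫ h(x) dP(x) ≈ (1/Np) Σ h(x^(i))
mean_est = samples.mean()               # approximates E[x] = 2
second_moment = (samples ** 2).mean()   # approximates E[x^2] = σ² + μ² = 5

# the empirical CDF at a point is the fraction of point masses at or below it
cdf_at_2 = (samples <= 2.0).mean()      # approximates P(x ≤ 2) = 0.5
```

The errors of these sample averages shrink at the usual O(Np^(-1/2)) Monte Carlo rate, independent of the dimension of x, which is precisely the appeal of the point-mass approximation.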

    B. Notations

Throughout this paper, bold font denotes a vector or matrix; the subscript t (t ∈ R+) denotes the index in the continuous-time domain, and n (n ∈ N) denotes the index in the discrete-time domain. p(x) denotes the pdf under a Lebesgue measure or the pmf under a counting measure. E[·] and Var[·] (Cov[·]) are the expectation and variance (covariance) operators, respectively. Unless specified elsewhere, expectations are taken w.r.t. the true pdf. The notations x0:n and y0:n9 denote the state and observation sets with elements collected from time step 0 up to n. The Gaussian (normal) distribution is denoted by N(μ, Σ). xn represents the true state at time step n, whereas x̂n (or x̂n|n) and x̂n|n−1 represent the filtered state and the predicted state of xn, respectively. f and g are used to represent the vector-valued state function and measurement function, respectively. f is denoted as a generic (vector- or scalar-valued) nonlinear function. Additional nomenclature will be given wherever necessary to avoid confusion.

For the reader's convenience, a complete list of notations used in this paper is summarized in Appendix G.

    C. Stochastic Filtering Problem

Before we run into the mathematical formulation of the stochastic filtering problem, it is necessary to clarify some basic concepts:

Filtering is an operation that involves the extraction of information about a quantity of interest at time t by using data measured up to and including t.

9 Sometimes it is also denoted by y1:n, which differs in the assumed order of the state and measurement equations.


Prediction is an a priori form of estimation. Its aim is to derive information about what the quantity of interest will be like at some time t + τ in the future (τ > 0) by using data measured up to and including time t. Unless specified otherwise, prediction refers to one-step-ahead prediction in this paper.

Smoothing is an a posteriori form of estimation, in that data measured after the time of interest are used for the estimation. Specifically, the smoothed estimate at time t′ is obtained by using data measured over the interval [0, t], where t′ < t.

Now, let us consider the following generic stochastic filtering problem in a dynamic state-space form [238], [422]:

ẋt = f(t, xt, ut, dt),   (1a)
yt = g(t, xt, ut, vt),   (1b)

where equations (1a) and (1b) are called the state equation and the measurement equation, respectively; xt represents the state vector, yt is the measurement vector, and ut represents the system input vector (as driving force) in a controlled environment; f : R^{Nx} → R^{Nx} and g : R^{Nx} → R^{Ny} are two vector-valued functions, which are potentially time-varying; dt and vt represent the process (dynamical) noise and the measurement noise, respectively, with appropriate dimensions. The above formulation is given in the continuous-time domain; in practice, however, we are more concerned with discrete-time filtering.10 In this context, the following practical filtering problem is considered:11

xn+1 = f(xn, dn),   (2a)
yn = g(xn, vn),   (2b)

where dn and vn can be viewed as white noise random sequences with unknown statistics in the discrete-time domain. The state equation (2a) characterizes the state transition probability p(xn+1|xn), whereas the measurement equation (2b) describes the probability p(yn|xn), which is further related to the measurement noise model.
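Sampling a trajectory from equations (2a)-(2b) amounts to drawing from the transition density p(xn+1|xn) at each step. A brief sketch, where the particular f, g, and additive Gaussian noise laws are chosen purely for illustration (the model class (2a)-(2b) is far more general):

```python
import numpy as np

def simulate(f, g, q_std, r_std, x0, n_steps, seed=0):
    """Draw one trajectory from the (assumed additive-noise) state-space model
        x[n+1] = f(x[n]) + d[n],   y[n] = g(x[n]) + v[n],
    with Gaussian white noises. Advancing x[n] to x[n+1] is exactly a
    draw from the transition density p(x[n+1] | x[n])."""
    rng = np.random.default_rng(seed)
    xs, ys = [x0], []
    for _ in range(n_steps):
        ys.append(g(xs[-1]) + rng.normal(0.0, r_std))   # measure, eq. (2b)
        xs.append(f(xs[-1]) + rng.normal(0.0, q_std))   # transition, eq. (2a)
    return np.array(xs[:-1]), np.array(ys)

# an illustrative nonlinear benchmark-style model (not from the paper)
f = lambda x: 0.5 * x + 25 * x / (1 + x ** 2)
g = lambda x: x ** 2 / 20
states, observations = simulate(f, g, q_std=1.0, r_std=1.0, x0=0.1, n_steps=100)
```

Note that the quadratic g makes the measurement a many-to-one map (x and −x give the same noiseless output), the very non-uniqueness discussed in Section II-D.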

The equations (2a)-(2b) reduce to the following special case when a linear Gaussian dynamic system is considered:12

xn+1 = Fn+1,n xn + dn,   (3a)
yn = Gn xn + vn,   (3b)

for which the analytic filtering solution is given by the Kalman filter [250], [253], in which the sufficient statistics13

10 The continuous-time dynamic system can always be converted to a discrete-time system by sampling the outputs and using zero-order holds on the inputs. Hence the derivative will be replaced by a difference, and the differential operator will become a matrix.
11 For simplicity of discussion, no driving force in the dynamic system (which is often referred to as the stochastic control problem) is considered in this paper. However, the extension to a driven system is straightforward.
12 An excellent and illuminating review of linear filtering theory is found in [244] (see also [385], [435], [61]); for a complete treatment of linear estimation theory, see the classic textbook [247].
13 Sufficient statistics refers to a collection of quantities that uniquely determine a probability density in its entirety.

[Fig. 2. A graphical model of the generic state-space model: inputs ut−1, ut, ut+1; hidden states xt−1, xt, xt+1 linked by the transition functions ft−1(·), ft(·); and measurements yt−1, yt, yt+1 produced by the measurement functions gt−1(·), gt(·), gt+1(·).]

of mean and state-error correlation matrix are calculated and propagated. In equations (3a) and (3b), Fn+1,n and Gn are called the transition matrix and the measurement matrix, respectively.
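For the model (3a)-(3b), propagating the sufficient statistics (the mean and the error covariance) takes only a few lines. A minimal sketch, in which the matrices, noise covariances, and the 1-D toy run are illustrative assumptions rather than a general-purpose implementation:

```python
import numpy as np

def kalman_filter(ys, F, G, Q, R, x0, P0):
    """Kalman filter for x[n+1] = F x[n] + d[n], y[n] = G x[n] + v[n],
    with d ~ N(0, Q) and v ~ N(0, R): the sufficient statistics
    (mean x̂ and covariance P) are corrected and predicted in turn."""
    x, P = x0, P0
    means = []
    for y in ys:
        # measurement update: correct with the innovation y - G x̂
        S = G @ P @ G.T + R                # innovation covariance
        K = P @ G.T @ np.linalg.inv(S)     # Kalman gain
        x = x + K @ (y - G @ x)
        P = P - K @ G @ P
        means.append(x.copy())
        # time update: predict through the transition matrix
        x = F @ x
        P = F @ P @ F.T + Q
    return np.array(means)

# 1-D toy run: a nearly constant state (true value 3.0) observed in noise
F = np.array([[1.0]]); G = np.array([[1.0]])
Q = np.array([[1e-4]]); R = np.array([[1.0]])
rng = np.random.default_rng(0)
ys = 3.0 + rng.normal(0.0, 1.0, size=(200, 1))
means = kalman_filter(ys, F, G, Q, R, x0=np.zeros(1), P0=np.eye(1))
```

Because only x̂ and P are stored, the whole posterior p(xn|y0:n) is carried by a finite-dimensional object, which is exactly what fails in the general nonlinear/non-Gaussian case discussed next.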

Described as a generic state-space model, the stochastic filtering problem can be illustrated by a graphical model (Fig. 2). Given the initial density p(x0), the transition density p(xn|xn−1), and the likelihood p(yn|xn), the objective of filtering is to estimate the optimal current state at time n given the observations up to time n, which in essence amounts to estimating the posterior density p(xn|y0:n) or p(x0:n|y0:n). Although the posterior density provides a complete solution of the stochastic filtering problem, the problem still remains intractable, since the density is a function rather than a finite-dimensional point estimate. We should also keep in mind that most physical systems are not finite-dimensional; thus an infinite-dimensional system can only be modeled approximately by a finite-dimensional filter, and in this sense the filter can only be suboptimal. Nevertheless, in the context of nonlinear filtering, it is still possible to formulate exact finite-dimensional filtering solutions, as we will discuss in Section IV.

In Table I, a brief and incomplete development history of stochastic filtering theory (from linear to nonlinear, Gaussian to non-Gaussian, stationary to non-stationary) is summarized. For more detailed reviews, see [244], [423], [247], [205].

D. Nonlinear Stochastic Filtering Is an Ill-posed Inverse Problem

    D.1 Inverse Problem

Stochastic filtering is an inverse problem: given the collected yn at discrete time steps (hence y0:n), and provided f and g are known, one needs to find the optimal or suboptimal xn. From another perspective, this problem can be interpreted as an inverse-mapping learning problem: find the inputs sequentially from a (composite) mapping function that yields the output data. In contrast to forward learning (given inputs, find outputs), which is a many-to-one mapping problem, the inversion learning problem is one-to-many, in the sense that the mapping from output space to input space is generally non-unique.

    A problem is said to be well-posed if it satisfies three con-


TABLE I
A Development History of Stochastic Filtering Theory.

author(s) (year) | method | solution | comment
Kolmogorov (1941) | innovations | exact | linear, stationary
Wiener (1942) | spectral factorization | exact | linear, stationary, infinite memory
Levinson (1947) | lattice filter | approximate | linear, stationary, finite memory
Bode & Shannon (1950) | innovations, whitening | exact | linear, stationary
Zadeh & Ragazzini (1950) | innovations, whitening | exact | linear, non-stationary
Kalman (1960) | orthogonal projection | exact | LQG, non-stationary, discrete
Kalman & Bucy (1961) | recursive Riccati equation | exact | LQG, non-stationary, continuous
Stratonovich (1960) | conditional Markov process | exact | nonlinear, non-stationary
Kushner (1967) | PDE | exact | nonlinear, non-stationary
Zakai (1969) | PDE | exact | nonlinear, non-stationary
Handschin & Mayne (1969) | Monte Carlo | approximate | nonlinear, non-Gaussian, non-stationary
Bucy & Senne (1971) | point-mass, Bayes | approximate | nonlinear, non-Gaussian, non-stationary
Kailath (1971) | innovations | exact | linear, non-Gaussian, non-stationary
Benes (1981) | Benes | exact solution of Zakai eqn. | nonlinear, finite-dimensional
Daum (1986) | Daum, virtual measurement | exact solution of FPK eqn. | nonlinear, finite-dimensional
Gordon, Salmond, & Smith (1993) | bootstrap, sequential Monte Carlo | approximate | nonlinear, non-Gaussian, non-stationary
Julier & Uhlmann (1997) | unscented transformation | approximate | nonlinear, (non)-Gaussian, derivative-free

    tions: existence, uniqueness and stability, otherwise it isaid to be ill posed [87]. In this context, stochastic filteringroblem is ill-posed in the following sense: (i) The ubiqui-ous presence of the unknown noise corrupts the state and

    measurement equations, given limited noisy observations,he solution is non-unique; (ii) Supposing the state equa-on is a diffeomorphism(i.e. differentiable and regular),14

    he measurement function is possibly a many-to-one map-

    ing function (e.g. g() = 2 or g() = sin(), see also theustrative example in Section VIII-G), which also violates

    he uniqueness condition; (iii) The filtering problem is pere a conditional posterior distribution (density) estimationroblem, which is known to be stochastically ill posed es-ecially in high-dimensional space [463], let alone on-linerocessing [412].
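The many-to-one nature of g in point (ii) is easy to see numerically: distinct states can produce identical noise-free measurements, so the observation alone cannot single out the state. A minimal sketch (the functions and state values are our own illustrative choices, not from the text):

```python
# Minimal illustration of the non-uniqueness in point (ii):
# a many-to-one measurement function maps distinct states to
# the same observation, so inverting g from y alone is ill-posed.
import math

def g_square(x):
    return x * x

def g_sine(x):
    return math.sin(x)

# Two distinct states ...
x_a, x_b = 1.5, -1.5
assert x_a != x_b
# ... yield exactly the same measurement under g(x) = x^2.
print(g_square(x_a) == g_square(x_b))  # True

# Likewise sin(x) aliases states separated by 2*pi.
x_c, x_d = 0.3, 0.3 + 2 * math.pi
print(abs(g_sine(x_c) - g_sine(x_d)) < 1e-12)  # True
```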

D.2 Differential Operator and Integral Equation

In what follows, we present a rigorous analysis of the stochastic filtering problem in the continuous-time domain. To simplify the analysis, we first consider the simple irregular stochastic differential equation (SDE):

    dx_t/dt = f(t, x_t) + \dot{w}_t,   t ∈ T,                                  (4)

where x_t is a second-order stochastic process, w_t = ∫_0^t \dot{w}_s ds is a Wiener process (Brownian motion), and \dot{w}_t can be regarded as a white noise. f : T × L^2(Ω, F, P) → L^2(Ω, F, P) is a mapping into a (Lebesgue square-integrable) Hilbert space L^2(Ω, F, P) with finite second-order moments. The solution of (4) is given by the stochastic integral

    x_t = x_0 + ∫_0^t f(s, x_s) ds + ∫_0^t dw_s,                               (5)

14 A diffeomorphism is a smooth, one-to-one mapping with a smooth inverse.

where the second integral is an Ito stochastic integral (named after the Japanese mathematician Kiyosi Ito [233]).15
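A common way to simulate the solution (5) on a computer is the Euler-Maruyama scheme, which replaces the Wiener increments dw_s by independent zero-mean Gaussian steps with variance equal to the step size. The sketch below is illustrative only; the drift f(t, x) = -x is our choice, not from the text:

```python
# Euler-Maruyama discretization of x_t = x_0 + \int f ds + \int dw_s:
# each step adds f(t, x) dt plus a Wiener increment ~ N(0, dt).
import math
import random

def euler_maruyama(f, x0, t_end, n_steps, seed=0):
    rng = random.Random(seed)
    dt = t_end / n_steps
    x, t = x0, 0.0
    path = [x0]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment over one step
        x = x + f(t, x) * dt + dw
        t += dt
        path.append(x)
    return path

# Example: mean-reverting drift f(t, x) = -x (an illustrative choice).
path = euler_maruyama(lambda t, x: -x, x0=1.0, t_end=5.0, n_steps=1000)
print(len(path))  # 1001
```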

Mathematically, the ill-posed nature of the stochastic filtering problem can be understood from operator theory.

Definition 4: [274], [87] Let A : Y → X be an operator from a normed space Y into X. The equation AY = X is said to be well-posed if A is bijective and the inverse operator A^{-1} : X → Y is continuous; otherwise the equation is called ill-posed.

Definition 5: [418] Suppose H is a Hilbert space and let A = A(ω), ω ∈ Ω, be a stochastic operator mapping H into H. Let X = X(ω) be a generalized random variable (or function) in H; then

    A(ω)Y = X(ω)                                                              (6)

is a generalized stochastic operator equation for the element Y ∈ H.

Since ω is an element of a measurable space (Ω, F) on which a complete probability measure P is defined, the stochastic operator equation (6) is a family of equations. The family of equations has a unique member when P is a Dirac measure. Suppose Y is a smooth functional with continuous first N derivatives; then (6) can be written as

    A(ω)Y(ω) = Σ_{k=0}^{N} a_k(t, ω) d^k Y / dt^k = X(ω),                      (7)

which can be represented in the form of stochastic integral equations of Fredholm type or Volterra type [418], with an

15 The Ito stochastic integral is defined as ∫_{t_0}^{t} σ(t) dw(t) = lim_{n→∞} Σ_{j=1}^{n} σ(t_{j-1}) Δw_j, where Δw_j = w(t_j) − w(t_{j-1}). The Ito calculus satisfies dw^2(t) = dt, dw(t) dt = 0, and dt^{N+1} = dw^{N+2}(t) = 0 (N > 1). See [387], [360] for a detailed background on Ito calculus and Ito SDEs.


appropriately defined kernel K:

    Y(t, ω) = X(t, ω) + ∫ K(t, τ, ω) Y(τ, ω) dτ,                              (8)

which takes a form similar to the continuous-time Wiener-Hopf equation (see e.g. [247]) when K is translation invariant.

Definition 6: [418] Any mapping Y(ω) : Ω → H which satisfies A(ω)Y(ω) = X(ω) for every ω ∈ Ω is said to be a wide-sense solution of (6).

The wide-sense solution is a stochastic solution if it is measurable w.r.t. P and Pr{ω : A(ω)Y(ω) = X(ω)} = 1. The existence and uniqueness conditions of the solution to the stochastic operator equation (6) are given by the probabilistic Fixed-Point Theorem [418]. The essential idea of the Fixed-Point Theorem is to prove that A(ω) is a stochastic contractive operator, which unfortunately is not always true for the stochastic filtering problem.

Let us turn our attention to the measurement equation in an integral form:

    y_t = ∫_0^t g(s, x_s) ds + v_t,                                           (9)

where g : R^{N_x} → R^{N_y}. For any φ(·) ∈ R^{N_x}, the optimal (in the mean-square sense) filter φ̂(x_t) is the one that attains the minimum mean-square error, as given by

    φ̂(x_t) ≜ arg min E[‖φ̂ − φ‖²] = ∫ φ(x_t) ρ(x_t|y_{0:t}) dx_t / ∫ ρ(x_t|y_{0:t}) dx_t,   (10)

where ρ(·) is an unnormalized filtering density. A common way to study the unnormalized filtering density is to treat it as a solution of the Zakai equation, as will be detailed in Section II-E.

D.3 Relations to Other Problems

It is conducive to a better understanding of the stochastic filtering problem to compare it with many other ill-posed problems that share some common features, viewed from different perspectives:

• System identification: System identification has much in common with stochastic filtering; both belong to statistical inference problems. Sometimes, identification is also meant as filtering in the stochastic control realm, especially with a driving force as input. However, the measurement equation can admit feedback of the previous output, i.e., y_n = g(x_n, y_{n-1}, v_n). Besides, identification is often more concerned with the parameter estimation problem instead of state estimation. We will revisit this issue in Section IX.

• Regression: From some perspective, filtering can be viewed as a sequential linear/nonlinear regression problem if the state equation reduces to a random walk. But regression differs from filtering in the following sense: regression aims to find a deterministic mapping between the input and output given a finite number of observation pairs {x_i, y_i}, and is usually performed off-line; whereas filtering aims to sequentially infer the signal or state process given some observations, assuming knowledge of the state and measurement models.

• Missing data problem: The missing data problem is well addressed in statistics; it is concerned with probabilistic inference or model fitting given limited data. Statistical approaches (e.g., the EM algorithm, data augmentation) serve this goal by assuming auxiliary missing variables (unobserved data) with tractable (on-line or off-line) inference.

• Density estimation: Density estimation shares some common ground with filtering in that both target a dependency estimation problem. Generally, filtering is nothing but learning the conditional probability distribution. However, density estimation is more difficult in the sense that it has no prior knowledge of the data (though sometimes assumptions are made, e.g., a mixture distribution), and it usually works directly on the state (i.e., the observation process is tantamount to the state process). Most density estimation techniques are off-line.

• Nonlinear dynamic reconstruction: Nonlinear dynamic reconstruction arises from physical phenomena (e.g., sea clutter) in the real world. Given some limited observations (possibly not continuously or evenly recorded), it is concerned with inferring the physically meaningful state information. In this sense, it is very similar to the filtering problem. However, it is much more difficult in that the nonlinear dynamics involving f are totally unknown (usually a nonparametric model is assumed for estimation) and potentially complex (e.g., chaotic), and the prior knowledge of the state equation is very limited; the problem is thereby severely ill-posed [87]. Likewise, dynamic reconstruction allows off-line estimation.

    E. Stochastic Differential Equations and Filtering

In the following, we will formulate the continuous-time stochastic filtering problem by SDE theory. Suppose {x_t} is a Markov process with an infinitesimal generator; rewriting the state-space equations (1a)-(1b) in the following form of Ito SDEs [418], [360]:

    dx_t = f(t, x_t) dt + σ(t, x_t) dw_t,                                     (11a)
    dy_t = g(t, x_t) dt + dv_t,                                               (11b)

where f(t, x_t) is often called the nonlinear drift and σ(t, x_t) the volatility or diffusion coefficient. Again, the noise processes {w_t, v_t; t ≥ 0} are two Wiener processes, with x_t ∈ R^{N_x} and y_t ∈ R^{N_y}. First, let us look at the state equation (a.k.a. diffusion equation). For all t ≥ 0, we define a backward diffusion operator L_t as16

    L_t = Σ_{i=1}^{N_x} f_t^i ∂/∂x_i + (1/2) Σ_{i,j=1}^{N_x} a_t^{ij} ∂²/(∂x_i ∂x_j),   (12)

16 L_t is a partial differential operator.


where a_t^{ij} = σ^i(t, x_t) σ^j(t, x_t). The operator L corresponds to an infinitesimal generator of the diffusion process {x_t, t ≥ 0}. The goal now is to deduce conditions under which one can find a recursive and finite-dimensional (closed-form) scheme to compute the conditional probability distribution p(x_t|Y_t), given the filtration Y_t17 produced by the observation process (1b).

Let us define an innovations process18

    e_t = y_t − ∫_0^t E[g(s, x_s)|Y_s] ds,                                    (13)

where E[g(s, x_s)|Y_s] is described as

    ĝ(x_t) = E[g(t, x_t)|Y_t] = ∫ g(x_t) p(x_t|Y_t) dx.                       (14)

For any test function φ ∈ R^{N_x}, the forward diffusion operator L̄_t is defined as

    L̄_t φ = − Σ_{i=1}^{N_x} ∂(f_t^i φ)/∂x_i + (1/2) Σ_{i,j=1}^{N_x} ∂²(a_t^{ij} φ)/(∂x_i ∂x_j),   (15)

which essentially is the Fokker-Planck operator. Given the initial condition p(x_0) at t = 0 as a boundary condition, it turns out that the pdf of the diffusion process satisfies the Fokker-Planck-Kolmogorov equation (FPK; a.k.a. the Kolmogorov forward equation, [387])19

    ∂p(x_t)/∂t = L̄_t p(x_t).                                                 (16)

By involving the innovations process (13) and assuming Cov[v_t] = Σ_{v,t}, we have the following Kushner equation (see e.g., [284]):

    dp(x_t|Y_t) = L̄_t p(x_t|Y_t) dt + p(x_t|Y_t) (g(t, x_t) − ĝ(x_t))^T Σ_{v,t}^{−1} de_t,   (t ≥ 0)   (17)

which reduces to the FPK equation (16) when there are no

observations or filtration Y_t. Integrating (17), we have

    p(x_t|Y_t) = p(x_0) + ∫_0^t L̄_s p(x_s|Y_s) ds
                 + ∫_0^t p(x_s|Y_s) (g(s, x_s) − ĝ(x_s))^T Σ_{v,s}^{−1} de_s.   (18)

17 One can imagine the filtration as a sort of information encoding the previous history of the state and measurement.
18 The innovations process is defined as a white Gaussian noise process. See [245], [247] for a detailed treatment.
19 The stochastic process is determined equivalently by the FPK equation (16) or the SDE (11a). The FPK equation can be interpreted as follows: the first term is the equation of motion for a cloud of particles whose distribution is p(x_t), each point of which obeys the equation of motion dx/dt = f(x_t, t); the second term describes the disturbance due to Brownian motion. The solution of (16) can be found exactly by Fourier transform. By inverting the Fourier transform, we can obtain

    p(x, t + Δt | x_0, t) = (1/√(2π σ² Δt)) exp( −(x − x_0 − f(x_0)Δt)² / (2σ²Δt) ),

which is a Gaussian distribution around a deterministic path.

Given the conditional pdf (18), suppose we want to calculate φ̂(x_t) = E[φ(x_t)|Y_t] for some nonlinear function φ ∈ R^{N_x}. By interchanging the order of integration, we have

    φ̂(x_t) = ∫ φ(x) p(x_t|Y_t) dx
            = ∫ φ(x) p(x_0) dx + ∫_0^t ∫ φ(x) L̄_s p(x_s|Y_s) dx ds
              + ∫_0^t ∫ φ(x) p(x_s|Y_s) (g(s, x) − ĝ(x_s))^T Σ_{v,s}^{−1} dx de_s
            = E[φ(x_0)] + ∫_0^t ∫ p(x_s|Y_s) L_s φ(x) dx ds
              + ∫_0^t ( ∫ φ(x) g(s, x) p(x_s|Y_s) dx − ĝ(x_s) ∫ φ(x) p(x_s|Y_s) dx )^T Σ_{v,s}^{−1} de_s.

The Kushner equation lends itself to a recursive form of the filtering solution, but the conditional mean requires all of the higher-order conditional moments and thus leads to an infinite-dimensional system.

On the other hand, under some mild conditions, the unnormalized conditional density of x_t given Y_t, denoted ρ(x_t|Y_t), is the unique solution of the following stochastic partial differential equation (PDE), the so-called Zakai equation (see [505], [238], [285]):

    dρ(x_t|Y_t) = L̄_t ρ(x_t|Y_t) dt + g(t, x_t) ρ(x_t|Y_t) dy_t              (19)

with the same L̄ defined in (15). The Zakai equation and the Kushner equation have a one-to-one correspondence, but the Zakai equation is much simpler;20 hence we usually turn to solving the Zakai equation instead of the Kushner equation. In the early history of nonlinear filtering, the common way was to discretize the Zakai equation to seek a numerical solution. Numerous efforts were devoted along this line [285], [286], e.g., separation of variables [114], adaptive local grids [65], and the particle (quadrature) method [66]. However, these methods are neither recursive nor computationally efficient.
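To make the discretize-and-solve idea concrete, the one-dimensional FPK equation (16) with constant diffusion can be advanced by an explicit central-difference scheme; probability mass is approximately conserved as long as the density stays away from the grid boundary. This is a toy sketch of our own, not one of the cited methods; the drift, grid, and step sizes are illustrative:

```python
# Explicit finite-difference step for the 1-D FPK equation (16):
#   dp/dt = -d/dx [f(x) p] + (D/2) d^2 p / dx^2
# using central differences; zero boundary values are assumed.
def fpk_step(p, xs, f, D, dt):
    dx = xs[1] - xs[0]
    n = len(p)
    new_p = [0.0] * n
    for i in range(1, n - 1):
        flux = (f(xs[i + 1]) * p[i + 1] - f(xs[i - 1]) * p[i - 1]) / (2 * dx)
        diff = (p[i + 1] - 2 * p[i] + p[i - 1]) / (dx * dx)
        new_p[i] = p[i] + dt * (-flux + 0.5 * D * diff)
    return new_p

# Initial density: a discrete spike at x = 0; drift f(x) = -x.
xs = [-5.0 + 0.1 * i for i in range(101)]
p = [0.0] * 101
p[50] = 1.0 / 0.1          # unit probability mass on the grid
for _ in range(200):       # dt * D / dx^2 = 0.1, within the stability bound
    p = fpk_step(p, xs, lambda x: -x, D=1.0, dt=0.001)

mass = sum(pi * 0.1 for pi in p)
print(abs(mass - 1.0) < 1e-2)  # True: mass approximately conserved
```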

    III. Bayesian Statistics and Bayesian Estimation

    A. Bayesian Statistics

Bayesian theory (e.g., [38]) is a branch of mathematical probability theory that allows people to model uncertainty about the world and the outcomes of interest by incorporating prior knowledge and observational evidence.21 Bayesian analysis, interpreting the probability as a

20 This is true because (19) is linear w.r.t. ρ(x_t|Y_t), whereas (17) involves certain nonlinearity. We do not extend the discussion here due to space constraints.
21 Within the circle of statistics, there are slightly different treatments of probability. The frequentists condition on a hypothesis of choice and put the probability distribution on the data, either observed or not;


conditional measure of uncertainty, is one of the popular methods for solving inverse problems. Before proceeding to Bayesian inference and Bayesian estimation, we first introduce some fundamental Bayesian statistics.

Definition 7: (Bayesian Sufficient Statistics) Let p(x|Y) denote the probability density of x conditioned on measurements Y. A statistic, T(x), is said to be sufficient if the distribution of x conditioned on T does not depend on Y. In other words, p(x|Y) = p(x|Y′) for any two sets Y and Y′ s.t. T(Y) = T(Y′).
The sufficient statistic T(x) contains all of the information brought by x about Y. The Rao-Blackwell Theorem says that when an estimator is evaluated under a convex loss, the optimal procedure depends only on the sufficient statistic. The Sufficiency Principle and the Likelihood Principle are two axiomatic principles in Bayesian inference [388].

There are three types of intractable problems inherently related to Bayesian statistics:

• Normalization: Given the prior p(x) and the likelihood p(y|x), the posterior p(x|y) is obtained by the product of prior and likelihood divided by a normalizing factor:

    p(x|y) = p(y|x) p(x) / ∫_X p(y|x) p(x) dx.                                (20)

• Marginalization: Given the joint posterior p(x, z|y), the marginal posterior is

    p(x|y) = ∫_Z p(x, z|y) dz;                                                (21)

as shown later, marginalization and factorization play an important role in Bayesian inference.

• Expectation: Given the conditional pdf, some averaged statistics of interest can be calculated:

    E_{p(x|y)}[f(x)] = ∫_X f(x) p(x|y) dx.                                    (22)
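For densities on a small discrete grid, all three operations reduce to trivial finite sums; it is only in continuous, high-dimensional spaces that they become intractable. A minimal numerical sketch of (20) and (22), with the prior, likelihood, and observed value chosen by us purely for illustration:

```python
# Normalization (20) and expectation (22) on a discrete grid,
# where the integrals reduce to finite sums.
import math

xs = [i * 0.1 for i in range(-50, 51)]              # grid over x
prior = [math.exp(-0.5 * x * x) for x in xs]        # unnormalized N(0,1) prior
y = 1.2                                             # observed value (illustrative)
lik = [math.exp(-0.5 * (y - x) ** 2) for x in xs]   # N(x,1) likelihood

# Normalization: posterior = prior * likelihood / evidence.
joint = [pr * l for pr, l in zip(prior, lik)]
evidence = sum(joint)
posterior = [j / evidence for j in joint]
assert abs(sum(posterior) - 1.0) < 1e-12            # sums to one by construction

# Expectation: posterior mean of f(x) = x; analytically y/2 = 0.6 here.
post_mean = sum(x * p for x, p in zip(xs, posterior))
print(round(post_mean, 2))  # 0.6
```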

In Bayesian inference, all uncertainties (including states, parameters that are either time-varying or fixed but unknown, and priors) are treated as random variables.22 The inference is performed within the Bayesian framework given all of the available information, and the objective of Bayesian inference is to use priors and causal knowledge, quantitatively and qualitatively, to infer the conditional probability given finite observations. There are usually three levels of probabilistic reasoning in Bayesian analysis (so-called hierarchical Bayesian analysis): (i) starting with model selection given the data and assumed priors; (ii) estimating the parameters to fit the data given the model and

only one hypothesis is regarded as true, and they regard probability as frequency. The Bayesians condition only on the observed data and consider probability distributions on the hypotheses, given some priors; probability is not viewed as equivalent to frequency. See [388], [38], [20] for more information.
22 This is the true spirit of Bayesian estimation, which differs from other estimation schemes (e.g., least-squares) where the unknown parameters are usually regarded as deterministic.

priors; (iii) updating the hyperparameters of the prior. Optimization and integration are the two fundamental numerical problems arising in statistical inference. Bayesian inference can be illustrated by a directed graph: a Bayesian network (or belief network) is a probabilistic graphical model with a set of vertices and edges (or arcs), where the probability dependency is described by a directed arrow between two nodes that represent random variables. Graphical models also allow the possibility of constructing more complex hierarchical statistical models [239], [240].

    B. Recursive Bayesian Estimation

In the following, we present a detailed derivation of recursive Bayesian estimation, which underlies the principle of sequential Bayesian filtering. Two assumptions are used to derive the recursive Bayesian filter: (i) the states follow a first-order Markov process, p(x_n|x_{0:n-1}) = p(x_n|x_{n-1}); (ii) the observations are independent given the states. For notational simplicity, we denote by Y_n the set of observations y_{0:n} := {y_0, ..., y_n}, and let p(x_n|Y_n) denote the conditional pdf of x_n. From Bayes' rule we have

    p(x_n|Y_n) = p(Y_n|x_n) p(x_n) / p(Y_n)
               = p(y_n, Y_{n-1}|x_n) p(x_n) / p(y_n, Y_{n-1})
               = p(y_n|Y_{n-1}, x_n) p(Y_{n-1}|x_n) p(x_n) / (p(y_n|Y_{n-1}) p(Y_{n-1}))
               = p(y_n|Y_{n-1}, x_n) p(x_n|Y_{n-1}) p(Y_{n-1}) p(x_n) / (p(y_n|Y_{n-1}) p(Y_{n-1}) p(x_n))
               = p(y_n|x_n) p(x_n|Y_{n-1}) / p(y_n|Y_{n-1}).                  (23)

As shown in (23), the posterior density p(x_n|Y_n) is described by three terms:

• Prior: The prior p(x_n|Y_{n-1}) defines the knowledge of the model:

    p(x_n|Y_{n-1}) = ∫ p(x_n|x_{n-1}) p(x_{n-1}|Y_{n-1}) dx_{n-1},            (24)

where p(x_n|x_{n-1}) is the transition density of the state.

• Likelihood: The likelihood p(y_n|x_n) essentially determines the measurement noise model in equation (2b).

• Evidence: The denominator involves an integral

    p(y_n|Y_{n-1}) = ∫ p(y_n|x_n) p(x_n|Y_{n-1}) dx_n.                        (25)

Calculation or approximation of these three terms is the essence of Bayesian filtering and inference.
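The prediction-correction cycle of (23)-(25) can be exercised numerically by carrying the posterior on a discrete grid: equation (24) becomes a sum against the transition density, and (23) a pointwise multiplication by the likelihood followed by renormalization via (25). A minimal sketch with an illustrative random-walk model (all model choices, noise levels, and observations are ours):

```python
# Grid-based recursive Bayesian filter: predict with (24), update with (23)/(25).
import math

def gauss(u, s):
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2 * math.pi))

xs = [i * 0.1 for i in range(-60, 61)]
dx = 0.1
post = [gauss(x, 1.0) for x in xs]                  # p(x_0): N(0,1) prior

def bayes_step(post, y, q=0.5, r=0.5):
    # Prediction (24): convolve with random-walk transition N(x_{n-1}, q^2).
    prior = [sum(gauss(x - xp, q) * pp for xp, pp in zip(xs, post)) * dx
             for x in xs]
    # Correction (23): multiply by likelihood N(y; x, r^2), renormalize by (25).
    joint = [gauss(y - x, r) * pr for x, pr in zip(xs, prior)]
    evidence = sum(j * dx for j in joint)
    return [j / evidence for j in joint]

for y in [0.4, 0.9, 1.1]:                           # a short observation record
    post = bayes_step(post, y)

mean = sum(x * p * dx for x, p in zip(xs, post))
print(abs(sum(p * dx for p in post) - 1.0) < 1e-6)  # True: posterior normalized
```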

    IV. Bayesian Optimal Filtering

Bayesian filtering is aimed at applying Bayesian statistics and Bayes' rule to probabilistic inference problems, and specifically to the stochastic filtering problem. To our knowledge, Ho and Lee [212] were among the first authors to


discuss iterative Bayesian filtering, in which they discussed in principle the sequential state estimation problem and included the Kalman filter as a special case. In the past few decades, numerous authors have investigated Bayesian filtering in a dynamic state-space framework [270], [271], [21], [424], [372], [480]-[484].

A. Optimal Filtering

An optimal filter is said to be optimal only in some specific sense [12]; in other words, one should define a criterion which measures the optimality. For example, some potential criteria for measuring optimality are:

1. Minimum mean-squared error (MMSE): It can be defined in terms of the prediction or filtering error (or equivalently the trace of the state-error covariance)

    E[‖x_n − x̂_n‖² | y_{0:n}] = ∫ ‖x_n − x̂_n‖² p(x_n|y_{0:n}) dx_n,

which is aimed at finding the conditional mean x̂_n = E[x_n|y_{0:n}] = ∫ x_n p(x_n|y_{0:n}) dx_n.

2. Maximum a posteriori (MAP): It is aimed at finding the mode of the posterior probability p(x_n|y_{0:n}),23 which is equal to minimizing the loss function

    E = E[1 − I_{{x̂_n : ‖x̂_n − x_n‖ ≤ ζ}}(x_n)],

where I(·) is an indicator function and ζ is a small scalar.

3. Maximum likelihood (ML): This reduces to a special case of MAP where the prior is neglected.24

4. Minimax: This is to find the median of the posterior p(x_n|y_{0:n}). See Fig. 3 for an illustration of the difference between mode, mean, and median.

5. Minimum conditional inaccuracy:25 Namely, minimize

    E_{p(x,y)}[− log p̂(x|y)] = ∫ p(x, y) log (1/p̂(x|y)) dx dy.

6. Minimum conditional KL divergence [276]: The conditional KL divergence is given by

    KL = ∫ p(x, y) log [ p(x, y) / (p̂(x|y) p(x)) ] dx dy.

7. Minimum free energy:26 It is a lower bound of the maximum log-likelihood, which is aimed at minimizing

    F(Q; P) ≜ E_{Q(x)}[− log P(x|y)]
            = E_{Q(x)}[log (Q(x)/P(x|y))] − E_{Q(x)}[log Q(x)],

23 When the mode and the mean of the distribution coincide, the MAP estimate is correct; however, for multimodal distributions, the MAP estimate can be arbitrarily bad. See Fig. 3.
24 This can be viewed as using a least-informative prior with a uniform distribution.
25 It is a generalization of Kerridge's inaccuracy for the case of i.i.d. data.
26 Free energy is a variational approximation of ML that minimizes its upper bound. This criterion is usually used in off-line Bayesian estimation.

Fig. 3. Left: An illustration of three optimality criteria that seek different solutions for a skewed unimodal distribution p(x|y), in which the mean, mode, and median do not coincide. Right: MAP is misleading for a multimodal distribution where multiple modes (maxima) exist.

where Q(x) is an arbitrary distribution of x. The first term is the Kullback-Leibler (KL) divergence between the distributions Q(x) and P(x|y); the second term is the entropy w.r.t. Q(x). The minimization of the free energy can be implemented iteratively by the expectation-maximization (EM) algorithm [130]:

    Q_{n+1} ← arg max_Q F(Q, x_n),
    x_{n+1} ← arg max_x F(Q_{n+1}, x).

    Remarks:

• The above criteria are valid not only for state estimation but also for parameter estimation (by viewing x as unknown parameters).

• Both MMSE and MAP methods require estimation of the posterior distribution (density), but MAP doesn't require calculation of the denominator (integration) and is thereby less computationally expensive; whereas the former requires full knowledge of the prior, likelihood, and evidence. Note, however, that the MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass. A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state (or parameter) belonging to it is small. Hence, the width of the mode is more important than its height in the high-dimensional case.

• The last three criteria are all ML oriented: they minimize the negative log-likelihood − log p̂(x|y) and take the expectation w.r.t. a fixed or variational pdf. Criterion 5 takes the expectation w.r.t. the joint pdf p(x, y); when Q(x) = p(x, y), it is equivalent to Criterion 7; Criterion 6 is a modified version of the upper bound of Criterion 5.
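The difference between the mean (MMSE), mode (MAP), and median (minimax) estimates illustrated in Fig. 3 can be reproduced on any skewed density; below we use a log-normal-shaped density purely as an example of our own choosing:

```python
# Mean, mode, and median of a skewed (log-normal-shaped) density on a grid:
# the three Bayesian point estimates genuinely differ (cf. Fig. 3, left).
import math

xs = [0.01 * i for i in range(1, 1000)]
dens = [math.exp(-0.5 * math.log(x) ** 2) / x for x in xs]  # unnormalized
z = sum(dens)
w = [d / z for d in dens]

mean = sum(x * wi for x, wi in zip(xs, w))
mode = xs[max(range(len(w)), key=w.__getitem__)]
# Median: smallest x where the cumulative weight reaches one half.
cum, median = 0.0, None
for x, wi in zip(xs, w):
    cum += wi
    if cum >= 0.5:
        median = x
        break

print(mode < median < mean)  # True for this right-skewed density
```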

The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE.27 Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all of the available information expressed by probabilities (assuming they are quantitatively correct). However, as time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal

27 For a discussion of the difference between Bayesian risk and frequentist risk, see [388].


Fig. 4. Schematic illustration of the Kalman filter's update as a predictor-corrector: a time update (one-step prediction ŷ_n of the measurement) followed by a measurement update (correction to the state estimate x̂_n).

solution, except in some special cases (e.g., the linear Gaussian or conjugate family case). Hence, in general, we can only seek a suboptimal or locally optimal solution.

B. Kalman Filtering

Kalman filtering, in the spirit of the Kalman filter [250], [53] or the Kalman-Bucy filter [249], consists of an iterative prediction-correction process (see Fig. 4). In the prediction step, the time update is taken, where the one-step-ahead prediction of the observation is calculated; in the correction step, the measurement update is taken, where the correction to the estimate of the current state is calculated. In a stationary situation, the matrices A_n, B_n, C_n, D_n in (3a) and (3b) are constant, and the Kalman filter is precisely the Wiener filter for stationary least-squares smoothing. In other words, the Kalman filter is a time-variant Wiener filter [1], [12]. Under the LQG circumstance, the Kalman filter was originally derived with the orthogonal projection method. In the late 1960s, Kailath [245] used the innovations approach developed by Wold and Kolmogorov to reformulate the Kalman filter, with the tool of martingale theory.28 From the innovations point of view, the Kalman filter is a whitening filter.29 The Kalman filter is also optimal in the sense that it is unbiased, E[x̂_n] = E[x_n], and is a minimum-variance estimate. A detailed history of the Kalman filter and its many variants can be found in [385], [244], [246], [247], [238], [12], [23], [96], [195].

The Kalman filter has a very nice Bayesian interpretation [12], [497], [248], [366]. In the following, we will show that the celebrated Kalman filter can be derived within a Bayesian framework; more specifically, it reduces to a MAP solution. The derivation is somewhat similar to the ML solution given by [384]. For presentation simplicity, we assume that the dynamic and measurement noises are both Gaussian distributed with zero mean and constant covariance. The derivation of the Kalman filter in the linear Gaussian scenario is based on the following assumptions:

• E[d_n d_m^T] = Σ_d δ_{mn}; E[v_n v_m^T] = Σ_v δ_{mn}.
• The state and process noise are mutually independent: E[x_n d_m^T] = 0 for n ≤ m; E[x_n v_m^T] = 0 for all n, m.

28 The martingale process was first introduced by Doob and discussed in detail in [139].
29 The innovations concept can be used straightforwardly in nonlinear filtering [7]. From the innovations point of view, one criterion to justify the optimality of the solution to a nonlinear filtering problem is to check how white the pseudo-innovations are: the whiter, the more optimal.

• The process noise and measurement noise are mutually independent: E[d_n v_m^T] = 0 for all n, m.

Let x̂_n^{MAP} denote the MAP estimate of x_n that maximizes p(x_n|Y_n), or equivalently log p(x_n|Y_n). By using Bayes' rule, we may express p(x_n|Y_n) as

    p(x_n|Y_n) = p(x_n, Y_n) / p(Y_n) = p(x_n, y_n, Y_{n-1}) / p(y_n, Y_{n-1}),   (26)

where the joint pdf in the numerator is further expressed by

    p(x_n, y_n, Y_{n-1}) = p(y_n|x_n, Y_{n-1}) p(x_n, Y_{n-1})
                         = p(y_n|x_n, Y_{n-1}) p(x_n|Y_{n-1}) p(Y_{n-1})
                         = p(y_n|x_n) p(x_n|Y_{n-1}) p(Y_{n-1}).              (27)

The third step is based on the fact that v_n does not depend on Y_{n-1}. Substituting (27) into (26), we obtain

    p(x_n|Y_n) = p(y_n|x_n) p(x_n|Y_{n-1}) p(Y_{n-1}) / p(y_n, Y_{n-1})
               = p(y_n|x_n) p(x_n|Y_{n-1}) p(Y_{n-1}) / (p(y_n|Y_{n-1}) p(Y_{n-1}))
               = p(y_n|x_n) p(x_n|Y_{n-1}) / p(y_n|Y_{n-1}),                  (28)

which shares the same form as (23). Under the Gaussian assumption on the process and measurement noises, the mean and covariance of p(y_n|x_n) are calculated by

    E[y_n|x_n] = E[G_n x_n + v_n] = G_n x_n                                   (29)

and

    Cov[y_n|x_n] = Cov[v_n|x_n] = Σ_v,                                        (30)

respectively. And the conditional pdf p(y_n|x_n) can be further written as

    p(y_n|x_n) = A_1 exp( −(1/2) (y_n − G_n x_n)^T Σ_v^{−1} (y_n − G_n x_n) ),   (31)

where A_1 = (2π)^{−N_y/2} |Σ_v|^{−1/2}.

Consider the conditional pdf p(x_n|Y_{n-1}); its mean and covariance are calculated by

    E[x_n|Y_{n-1}] = E[F_{n,n-1} x_{n-1} + d_{n-1} | Y_{n-1}] = F_{n,n-1} x̂_{n-1} = x̂_{n|n-1},   (32)

and

    Cov[x_n|Y_{n-1}] = Cov[x_n − x̂_{n|n-1}] = Cov[e_{n,n-1}],                (33)

respectively, where x̂_{n|n-1} ≡ x̂(n|Y_{n-1}) represents the state estimate at time n given the observations up to time n−1, and


e_{n,n-1} is the state-error vector. Denoting the covariance of e_{n,n-1} by P_{n,n-1}, by the Gaussian assumption we may obtain

    p(x_n|Y_{n-1}) = A_2 exp( −(1/2) (x_n − x̂_{n|n-1})^T P_{n,n-1}^{−1} (x_n − x̂_{n|n-1}) ),   (34)

where A_2 = (2π)^{−N_x/2} |P_{n,n-1}|^{−1/2}. By substituting equations (31) and (34) into (26), it further follows that

    p(x_n|Y_n) ∝ A exp( −(1/2) (y_n − G_n x_n)^T Σ_v^{−1} (y_n − G_n x_n)
                        −(1/2) (x_n − x̂_{n|n-1})^T P_{n,n-1}^{−1} (x_n − x̂_{n|n-1}) ),   (35)

where A = A_1 A_2 is a constant. Since the denominator is a normalizing constant, (35) can be regarded as an unnormalized density; this fact doesn't affect the following derivation. Since the MAP estimate of the state is defined by the condition

    ∂ log p(x_n|Y_n) / ∂x_n |_{x_n = x̂_n^{MAP}} = 0,                         (36)

substituting equation (35) into (36) yields

    x̂_n^{MAP} = ( G_n^T Σ_v^{−1} G_n + P_{n,n-1}^{−1} )^{−1} ( P_{n,n-1}^{−1} x̂_{n|n-1} + G_n^T Σ_v^{−1} y_n ).

By using the matrix inversion lemma,30 it is simplified as

    x̂_n^{MAP} = x̂_{n|n-1} + K_n (y_n − G_n x̂_{n|n-1}),                     (37)

where K_n is the Kalman gain, as defined by

    K_n = F_{n+1,n} P_{n,n-1} G_n^T ( G_n P_{n,n-1} G_n^T + Σ_v )^{−1}.       (38)

Observing that

    e_{n,n-1} = x_n − x̂_{n|n-1}
              = F_{n,n-1} x_{n-1} + d_{n-1} − F_{n,n-1} x̂_{n-1}^{MAP}
              = F_{n,n-1} e_{n-1}^{MAP} + d_{n-1},                            (39)

and by virtue of P_{n-1} = Cov[e_{n-1}^{MAP}], we have

    P_{n,n-1} = Cov[e_{n,n-1}] = F_{n,n-1} P_{n-1} F_{n,n-1}^T + Σ_d.         (40)

Since

    e_n = x_n − x̂_n^{MAP} = x_n − x̂_{n|n-1} − K_n (y_n − G_n x̂_{n|n-1}),   (41)

30 For A = B^{−1} + C D^{−1} C^T, it follows from the matrix inversion lemma that A^{−1} = B − B C (D + C^T B C)^{−1} C^T B.

and noting that e_{n,n-1} = x_n − x̂_{n|n-1} and y_n = G_n x_n + v_n, we further have

    e_n = e_{n,n-1} − K_n (G_n e_{n,n-1} + v_n) = (I − K_n G_n) e_{n,n-1} − K_n v_n,   (42)

and it further follows that

    P_n = Cov[e_n^{MAP}] = (I − K_n G_n) P_{n,n-1} (I − K_n G_n)^T + K_n Σ_v K_n^T.

Rearranging the above equation, it reduces to

    P_n = P_{n,n-1} − F_{n,n+1} K_n G_n P_{n,n-1}.                            (43)

Thus far, the Kalman filter has been completely derived from the MAP principle; the expression for x̂_n^{MAP} is exactly the same solution as that derived within the innovations framework (or others).

The above procedure can easily be extended to the ML case without much effort [384]. Suppose we want to maximize the marginal likelihood p(x_n|Y_n), which is equivalent to maximizing the log-likelihood

    log p(x_n|Y_n) = log p(x_n, Y_n) − log p(Y_n),                            (44)

and the optimal estimate near the solution should satisfy

    ∂ log p(x_n|Y_n) / ∂x_n |_{x_n = x̂_n^{ML}} = 0.                          (45)

Substituting (35) into (45), we actually want to minimize the cost function of two combined Mahalanobis norms:31

    E = ‖y_n − G_n x_n‖²_{Σ_v^{−1}} + ‖x_n − x̂_{n|n-1}‖²_{P_{n,n-1}^{−1}}.   (46)

Taking the derivative of E with respect to x_n and setting it to zero, we obtain the same solution as (37).
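That the minimizer of the quadratic cost (46) coincides with (37) can be checked numerically in the scalar case: the closed-form regularized least-squares solution and the gain-form update agree. A small check of our own (all numbers are illustrative; the gain here is the standard form without the transition factor of (38)):

```python
# Scalar check that the minimizer of (46),
#   E = (y - G*x)^2 / r + (x - x_pred)^2 / P,
# equals the gain-form MAP update (37).
G, r = 2.0, 0.5            # measurement model and noise variance (illustrative)
P, x_pred, y = 0.8, 1.0, 2.5

# Closed form from setting dE/dx = 0:
x_ls = (G * y / r + x_pred / P) / (G * G / r + 1.0 / P)

# Gain-form update (37) with K = P*G / (G*P*G + r):
K = P * G / (G * P * G + r)
x_map = x_pred + K * (y - G * x_pred)

print(abs(x_ls - x_map) < 1e-12)  # True
```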

    Remarks:

• The derivation of the Kalman-Bucy filter [249] was rooted in SDE theory [387], [360]; it can also be derived within the Bayesian framework [497], [248].

• The optimal filtering solution described by the Wiener-Hopf equation is achieved by the spectral factorization technique [487]. By admitting a state-space formulation, the Kalman filter elegantly overcomes the stationarity assumption and provides a fresh look at the filtering problem. The signal process (i.e., the state) is regarded as a linear stochastic dynamical system driven by white noise, and the optimal filter thus has a stochastic differential structure which makes recursive estimation possible. Spectral factorization is replaced by the solution of an ordinary differential equation (ODE) with known initial conditions. The Wiener filter doesn't treat the difference between white and colored noises, and it also permits infinite-dimensional systems; whereas the Kalman filter works for

31 The Mahalanobis norm is defined as a weighted norm: ‖A‖²_B = A^T B A.


finite-dimensional systems under the white noise assumption.

• The Kalman filter is an unbiased minimum-variance estimator under the LQG circumstance. When the Gaussian assumption on the noise is violated, the Kalman filter is still optimal in a mean-square sense, but the estimate doesn't produce the conditional mean (i.e., it is biased), nor the minimum variance. The Kalman filter is not robust, because of its underlying assumption on the noise density model.

• The Kalman filter provides an exact solution for the linear Gaussian prediction and filtering problem. Concerning the smoothing problem, the off-line estimation version of the Kalman filter is given by the Rauch-Tung-Striebel (RTS) smoother [384], which consists of a forward filter in the form of a Kalman filter and a backward recursive smoother. The RTS smoother is more computationally efficient than the optimal smoother [206].

• The conventional Kalman filter is a point-valued filter; it can also be extended to set-valued filtering [39], [339], [80].

In the literature, there exist many variants of the Kalman filter, e.g., the covariance filter, the information filter, and square-root Kalman filters. See [205], [247] for more details and [403] for a unifying review.

    C. Optimum Nonlinear Filtering

In practice, the use of the Kalman filter is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world. Hence, since the publication of the Kalman filter, numerous efforts have been devoted to the generic filtering problem, mostly in the Kalman filtering framework. A number of pioneers, including Zadeh [503], Bucy [61], [60], Wonham [496], Zakai [505], Kushner [282]-[285], and Stratonovich [430], [431], investigated the nonlinear filtering problem. See also the papers seeking optimal nonlinear filters [420], [89], [209]. In general, the nonlinear filtering problem per se consists in finding the conditional probability distribution (or density) of the state given the observations up to the current time [420]. In particular, the solution of the nonlinear filtering problem using the theory of conditional Markov processes [430], [431] is very attractive from the Bayesian perspective and has a number of advantages over the other methods. The recursive transformations of the posterior measures are characteristic of this theory. Strictly speaking, the number of variables replacing the density function is infinite, but not all of them are of equal importance. Thus it is advisable to select the important ones and reject the remainder.

The solutions of the nonlinear filtering problem fall into two categories: global methods and local methods. In the global approach, one attempts to solve a PDE instead of an ODE as in the linear case, e.g. the Zakai equation or the Kushner-Stratonovich equation, which are mostly analytically intractable. Hence numerical approximation techniques are needed to solve the equation. In special scenarios (e.g. the exponential family) with some assumptions, nonlinear filtering can admit tractable solutions. In the local approach, finite-sum approximation (e.g. the Gaussian sum filter) or linearization techniques (i.e. the EKF) are usually used. In the EKF, by defining

$$\mathbf{F}_{n+1,n} = \left.\frac{d f(\mathbf{x})}{d\mathbf{x}}\right|_{\mathbf{x}=\hat{\mathbf{x}}_n}, \qquad \mathbf{G}_n = \left.\frac{d g(\mathbf{x})}{d\mathbf{x}}\right|_{\mathbf{x}=\hat{\mathbf{x}}_{n|n-1}},$$

the equations (2a)(2b) can be linearized into (3a)(3b), and the conventional Kalman filtering technique is further employed. The details of the EKF can be found in many books, e.g. [238], [12], [96], [80], [195], [205], [206]. Because the EKF always approximates the posterior p(xn|y0:n) as a Gaussian, it works well for some types of nonlinear problems, but it may provide poor performance in cases where the true posterior is non-Gaussian (e.g. heavily skewed or multimodal). Gelb [174] provided an early overview of the uses of the EKF. It is noted that the estimate given by the EKF is usually biased since in general $E[f(\mathbf{x})] \neq f(E[\mathbf{x}])$.
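As a minimal one-dimensional sketch of the EKF recursion just outlined: the state is propagated through f, the covariance through the Jacobian, and the measurement update linearizes g around the predicted state. The model functions and noise variances below are illustrative assumptions, not from the paper.

```python
import numpy as np

# Minimal 1-D EKF sketch: state model x_n = f(x_{n-1}) + d_n,
# measurement y_n = g(x_n) + v_n; F and G are the derivatives
# (Jacobians) evaluated at the current estimates.
f = lambda x: 0.9 * x + 0.1 * np.sin(x)   # illustrative dynamics
g = lambda x: x**2 / 2.0                  # illustrative measurement
df = lambda x: 0.9 + 0.1 * np.cos(x)      # F_{n+1,n}
dg = lambda x: x                          # G_n
q, r = 0.01, 0.1                          # process / measurement noise variances

def ekf_step(x_est, p_est, y):
    # Predict: propagate the mean through f, the variance through the Jacobian
    x_pred = f(x_est)
    p_pred = df(x_est)**2 * p_est + q
    # Update: linearize g around the predicted state
    G = dg(x_pred)
    k = p_pred * G / (G**2 * p_pred + r)  # Kalman gain
    x_new = x_pred + k * (y - g(x_pred))
    p_new = (1.0 - k * G) * p_pred
    return x_new, p_new

x_est, p_est = 1.0, 1.0
x_est, p_est = ekf_step(x_est, p_est, y=0.6)
```

Note how the posterior is forced to be Gaussian (a mean and a variance), which is exactly the limitation discussed above for skewed or multimodal posteriors.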

In summary, a number of methods have been developed for nonlinear filtering problems:

• Linearization methods: first-order Taylor series expansion (i.e. EKF), and higher-order filters [20], [437].
• Approximation by finite-dimensional nonlinear filters: the Benes filter [33], [34], the Daum filter [111]-[113], and the projection filter [202], [55].
• Classic PDE methods, e.g. [282], [284], [285], [505], [496], [497], [235].
• Spectral methods [312].
• Neural filter methods, e.g. [209].
• Numerical approximation methods, as to be discussed in Section V.

    C.1 Finite-dimensional Filters

The on-line solution of the FPK equation can be avoided if the unnormalized filtered density admits a finite-dimensional sufficient statistic. Benes [33], [34] first explored the exact finite-dimensional filter32 in the nonlinear filtering scenario. Daum [111] extended the framework to a more general case and included the Kalman filter and Benes filter as special cases [113]. Some new development of the Daum filter with virtual measurement was summarized in [113]. The recently proposed projection filters [202], [53]-[57] also belong to the finite-dimensional filter family.

In [111], starting from SDE filtering theory, Daum introduced a gradient function

$$r(t, x) = \frac{\partial}{\partial x} \ln \psi(t, x),$$

where $\psi(t, x)$ is the solution of the FPK equation of (11a), with the form

$$\frac{\partial \psi(t,x)}{\partial t} = -\frac{\partial \psi(t,x)}{\partial x} f - \psi\,\mathrm{tr}\!\left(\frac{\partial f}{\partial x}\right) + \frac{1}{2}\,\mathrm{tr}\!\left(A\,\frac{\partial^2 \psi}{\partial x\,\partial x^T}\right),$$

with an appropriate initial condition (see [111]), and $A = \sigma(t, x_t)\sigma(t, x_t)^T$. When the measurement equation (11b) is

32 Roughly speaking, a finite-dimensional filter is one that can be implemented by integrating a finite number of ODEs, or one that has sufficient statistics with finitely many variables.


linear with Gaussian noise (recalling the discrete-time version (3b)), the Daum filter admits a finite-dimensional solution

$$p(x_t|\mathcal{Y}_t) \propto \psi^s(x_t)\exp\Big\{-\frac{1}{2}(x_t - m_t)^T P_t^{-1}(x_t - m_t)\Big\},$$

where $s$ is a real number in the interval $0 < s < 1$ defined in the initial condition, and $m_t$ and $P_t$ are two sufficient statistics that can be computed recursively.33 The calculation of $\psi^s(x_t)$ can be done off-line, since it does not rely on the measurement, whereas $m_t$ and $P_t$ are computed on-line using numerical methods. See [111]-[113] for more details.

The problem of the existence of a finite-dimensional filter is concerned with the necessary and sufficient conditions. In [167], a necessary condition is that the observations and the filtering densities belong to the exponential class. In particular, we have the Generalized Fisher-Darmois-Koopman-Pitman Theorem:

Theorem 1: (e.g. [388], [112]) For smooth nowhere-vanishing densities, a fixed finite-dimensional filter exists if and only if the unnormalized conditional density is from an exponential family

$$\bar{p}(x_n|y_{0:n}) = \theta(x_n)\exp\big[\psi^T(x_n)\,\phi(y_{0:n})\big], \qquad (47)$$

where $\psi(\cdot)$ is a sufficient statistic, and $\theta(\cdot)$ is a function in $\mathcal{X}$ (which turns out to be the solution of specific PDEs).

Nonlinear finite-dimensional filtering is usually performed with the conjugate approach, where the prior and posterior are assumed to come from some parametric probability function family in order to admit an exact and analytically tractable solution. We will come back to this topic in Section VII. On the other hand, for the general nonlinear filtering problem, no exact solution can be obtained, and various numerical approximations are hence needed. In the next section, we briefly review some popular numerical approximation approaches in the literature and focus our attention on the sequential Monte Carlo technique.

    V. Numerical Approximation Methods

A. Gaussian/Laplace Approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability. By assuming the posterior to be Gaussian, nonlinear filtering can be performed with the EKF method.

The Laplace approximation method approximates the integral of a function $\int f(x)dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$, and further computing the volume under the Gaussian [319]:

$$\int f(x)\,dx \approx (2\pi)^{N_x/2} f(\hat{x})\,\big|{-\nabla\nabla\log f(\hat{x})}\big|^{-1/2}. \qquad (48)$$

The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$. It is also used to approximate the posterior distribution with a Gaussian centered at

33 They degenerate into the mean and error covariance when (11a) is linear Gaussian, and the filter reduces to the Kalman-Bucy filter.

the MAP estimate, which is partially justified by the fact that under certain regularity conditions the posterior distribution asymptotically approaches a Gaussian distribution as the number of samples increases to infinity. The Laplace approximation is useful in the MAP or ML framework; it usually works for unimodal distributions but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces. Some new developments of the Laplace approximation can be found in MacKay's paper [319].
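A one-dimensional sketch of (48), using a finite-difference second derivative of $\log f$ in place of the analytic Hessian. The test function is illustrative; for an unnormalized Gaussian the approximation is exact.

```python
import numpy as np

# Laplace approximation of the integral of f (eq. (48), 1-D case):
# fit a Gaussian at the maximum x_hat; the curvature of log f at x_hat
# gives the variance of the fitted Gaussian.
def laplace_integral(f, x_hat, h=1e-4):
    # second derivative of log f at x_hat via central differences
    logf = lambda x: np.log(f(x))
    hess = (logf(x_hat + h) - 2 * logf(x_hat) + logf(x_hat - h)) / h**2
    return (2 * np.pi) ** 0.5 * f(x_hat) * (-hess) ** -0.5

f = lambda x: np.exp(-0.5 * x**2)        # unnormalized Gaussian, peak at 0
approx = laplace_integral(f, x_hat=0.0)  # exact value is sqrt(2*pi)
```

For a genuinely multimodal f, fitting a single Gaussian at one mode would miss the mass under the other modes, which is the failure mode described above.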

    B. Iterative Quadrature

Iterative quadrature is an important numerical approximation method, which was widely used in computer graphics and physics in the early days. One of the popular quadrature methods is Gaussian quadrature [117], [377]. In particular, a finite integral is approximated by a weighted sum of samples of the integrand based on some quadrature formula

$$\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^{m} c_k f(x_k), \qquad (49)$$

where $p(x)$ is treated as a weighting function, and $x_k$ is a quadrature point. For example, $x_k$ can be the $k$-th zero of the $m$-th order orthogonal Hermite polynomial $H_m(x)$,34 for which the weights are given by

$$c_k = \frac{2^{m-1} m!\,\sqrt{\pi}}{m^2\,\big(H_{m-1}(x_k)\big)^2}.$$

The approximation is good if $f(x)$ is a polynomial of degree not greater than $2m - 1$. The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.35 This method can produce a good approximation if the nonlinear function is smooth. Quadrature methods, alone or combined with other methods, were used in nonlinear filtering (e.g. [475], [287]). The quadrature formulae are used after a centering about the current estimate of the conditional mean and rescaling according to the current estimate of the covariance.

C. Multigrid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(zn|y0:n) (to discriminate from the continuous-valued state x, we denote the discrete-valued state as z from now on). Suppose the discrete state z consists of a finite number of distinct discrete states {1, 2, ..., Nz}. For the state space at n-1, let $w_{n-1|n-1}^i$ denote the conditional probability of each $z_{n-1}^i$ given measurements up to

34 Other orthogonal approximation techniques can also be considered.
35 The Fundamental Theorem of Gaussian Quadrature states that the abscissas of the m-point Gaussian quadrature formula are precisely the roots of the orthogonal polynomial for the same interval and weighting function.


n-1, i.e. $p(z_{n-1} = z^i | y_{0:n-1}) = w_{n-1|n-1}^i$. Then the posterior pdf at n-1 can be represented as

$$p(z_{n-1}|y_{0:n-1}) = \sum_{i=1}^{N_z} w_{n-1|n-1}^i\,\delta(z_{n-1} - z_{n-1}^i), \qquad (50)$$

and the prediction and filtering equations are further derived as

$$p(z_n|y_{0:n-1}) = \sum_{i=1}^{N_z} w_{n|n-1}^i\,\delta(z_n - z_n^i), \qquad (51)$$

$$p(z_n|y_{0:n}) = \sum_{i=1}^{N_z} w_{n|n}^i\,\delta(z_n - z_n^i), \qquad (52)$$

where

$$w_{n|n-1}^i = \sum_{j=1}^{N_z} w_{n-1|n-1}^j\, p(z_n^i|z_{n-1}^j), \qquad (53)$$

$$w_{n|n}^i = \frac{w_{n|n-1}^i\, p(y_n|z_n^i)}{\sum_{j=1}^{N_z} w_{n|n-1}^j\, p(y_n|z_n^j)}. \qquad (54)$$

If the state space is continuous, the approximate-grid based method can be similarly derived (e.g. [19]). Namely, we can always discretize the state space into Nz discrete cell states, and then a grid-based method can be further used to approximate the posterior density. The grid must be sufficiently dense to obtain a good approximation, especially when the dimensionality Nx is high; however, increasing Nz increases the computational burden dramatically. If the state space is not finite, then the accuracy of grid-based methods is not guaranteed. As we will discuss in Section VII, the HMM filter is well suited to grid-based methods. A disadvantage of the grid-based method is that it requires that the state space cannot be partitioned unevenly to give greater resolution to the states with higher density [19]. Some adaptive grid-based methods were proposed to overcome this drawback [65]. Given the predefined grid, different methods were used to approximate the functions and carry out the dynamic Bayesian estimation and forecasting [62], [258], [271], [424], [373], [372].
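One prediction/update step of the grid-based recursion, eqs. (53)-(54), can be sketched as follows; the two-state transition matrix and likelihood values are illustrative.

```python
import numpy as np

# One step of the grid-based filter (eqs. (53)-(54)): the posterior over a
# finite state set is a weight vector, updated by the transition matrix and
# the measurement likelihood.
def grid_filter_step(w_prev, trans, lik):
    # trans[j, i] = p(z_n = i | z_{n-1} = j); lik[i] = p(y_n | z_n = i)
    w_pred = trans.T @ w_prev          # eq. (53): prediction weights
    w_post = w_pred * lik              # eq. (54): numerator
    return w_pred, w_post / w_post.sum()

trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])         # illustrative transition probabilities
lik = np.array([0.5, 0.05])            # illustrative likelihoods of y_n
w_pred, w_post = grid_filter_step(np.array([0.5, 0.5]), trans, lik)
```

Both weight vectors remain normalized probability distributions; the measurement concentrates the posterior mass on the state with the larger likelihood.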

In studying nonlinear filtering, Bucy [62] and Bucy and Senne [63] introduced the point-mass method, which is a global function approximation method. Such a method uses a simple rectangular grid, spline basis, or step function, and quadrature methods are used to determine the grid points [64], [475], [271]; the number of grid points is prescribed to provide an adequate approximation. The density is assumed to be represented by a set of point masses which carry the information about the data; the mesh grid and directions are given in terms of eigenvalues and eigenvectors of the conditional error covariance; the floating grid is centered at the current mean estimate and rotated from the state coordinate frame into the principal axes of the error ellipsoid (covariance); the grid along the axes is chosen to extend over a sufficient distance to cover the true state. For a multimodal density, it is suggested to define a grid for each mode

Fig. 5. Illustration of non-Gaussian distribution approximation: (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.

rather than for the entire density. Even so, the computation of the multigrid-based point-mass approximation method is nontrivial and its complexity is high (see [271]).

Another sophisticated approximation method, based on a piecewise constant approximation of the density, was proposed in [271], [258]. The method is similar but not identical to the point-mass approximation. It defines a simple grid based on tiling the state space with a number of identical parallelepipeds, over each of which the density approximation is constant, and the integration is replaced by a discrete linear convolution problem. The method also allows error propagation analysis along the calculation [271].

    D. Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments. The approximation of the first two moments is widely used in filtering [367]. Generally, we can empirically use the sample moments to approximate the true moments, namely

$$m_k = E[x^k] = \int_{\mathcal{X}} x^k p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} \big(x^{(i)}\big)^k,$$

where $m_k$ denotes the $k$-th order moment and the $x^{(i)}$ are samples from the true distribution. Among many, the Gram-Charlier and Edgeworth expansions are two popular higher-order moment approximation approaches. Due to space constraints, we cannot go into the details here, and refer the reader to [ ] for more information. Applications of higher-order moment approximation to nonlinear filters are found in [427]. However, the computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
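A minimal sketch of the sample-moment approximation above; the distribution and sample size are illustrative.

```python
import numpy as np

# Sample-moment approximation: the k-th moment E[x^k] is estimated by
# averaging x^k over i.i.d. samples drawn from the distribution.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=200_000)  # samples from N(0,1)

def sample_moment(x, k):
    return np.mean(x**k)

m2 = sample_moment(x, 2)   # true second moment of N(0,1) is 1
m4 = sample_moment(x, 4)   # true fourth moment of N(0,1) is 3
```

The variance of the estimate grows with the moment order, which hints at why higher-order moment approximations become expensive and unstable, especially in high dimensions.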


E. Gaussian Sum Approximation

Different from the linearized EKF or second-order approximation filters that only concentrate on the vicinity of the mean estimate, the Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$$p(\mathbf{x}) = \sum_{j=1}^{m} c_j\, \mathcal{N}(\bar{\mathbf{x}}_j, \Sigma_j), \qquad (55)$$

where the weighting coefficients $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$. The approximation is motivated by the observation that any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities, which admits a tractable solution by calculating the individual first and second order moments. The Gaussian sum filter [421], [8] essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate. The following theorem states the underlying principle:

Theorem 2: [12] Suppose in equations (2a)(2b) the noise vectors $\mathbf{d}_n$ and $\mathbf{v}_n$ are white Gaussian noises with zero mean and covariances $\Sigma_d$ and $\Sigma_v$, respectively. If $p(\mathbf{x}_n|\mathbf{y}_{0:n-1}) = N(\mathbf{x}_n; \hat{\mathbf{x}}_{n|n-1}, \Sigma_{n|n-1})$, then for fixed $g(\cdot)$, $\hat{\mathbf{x}}_{n|n-1}$ and $\Sigma_v$, the filtered density $p(\mathbf{x}_n|\mathbf{y}_{0:n}) = c_n\, p(\mathbf{x}_n|\mathbf{y}_{0:n-1})\, p(\mathbf{y}_n|\mathbf{x}_n)$ (where $c_n$ is a normalizing constant) converges uniformly to $N(\mathbf{x}_n; \hat{\mathbf{x}}_{n|n}, \Sigma_{n|n})$ as $\Sigma_{n|n-1} \to 0$. If $p(\mathbf{x}_n|\mathbf{y}_{0:n}) = N(\mathbf{x}_n; \hat{\mathbf{x}}_{n|n}, \Sigma_{n|n})$, then for fixed $f(\cdot)$, $\hat{\mathbf{x}}_{n|n}$ and $\Sigma_d$, the predicted density $p(\mathbf{x}_{n+1}|\mathbf{y}_{0:n}) = \int p(\mathbf{x}_{n+1}|\mathbf{x}_n)\, p(\mathbf{x}_n|\mathbf{y}_{0:n})\, d\mathbf{x}_n$ converges uniformly to $N(\mathbf{x}_{n+1}; \hat{\mathbf{x}}_{n+1|n}, \Sigma_{n+1|n})$ as $\Sigma_{n|n} \to 0$.

Some new developments of the Gaussian sum filter (as well as the Gaussian-quadrature filter) can be found in [235], [234], where the recursive Bayesian estimation is performed and no Jacobian matrix evaluation is needed (similar to the unscented transformation technique discussed below).
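As an illustration of the mixture form (55), a minimal one-dimensional sketch; the weights, means, and variances are illustrative, and the check simply verifies that a valid mixture integrates to one.

```python
import numpy as np

# Gaussian mixture density (eq. (55)) in one dimension: a weighted sum of
# Gaussian pdfs with positive weights summing to one.
def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu)**2 / var) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x, weights, mus, variances):
    return sum(c * gaussian_pdf(x, m, v)
               for c, m, v in zip(weights, mus, variances))

weights = [0.3, 0.7]                  # c_j > 0, summing to 1
mus, variances = [-2.0, 1.5], [1.0, 0.5]
xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]
total = mixture_pdf(xs, weights, mus, variances).sum() * dx  # ~1
```

A Gaussian sum filter would attach one EKF to each component and update the weights $c_j$ from the measurement likelihoods, which is the "bank of EKFs" picture described above.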

F. Deterministic Sampling Approximation

The deterministic sampling approximation we discuss below is a kind of method called the unscented transformation (UT).36 It can be viewed as a special numerical method to approximate the sufficient statistics of mean and covariance. The intuition of the UT is somewhat similar to the point-mass approximation discussed above: it uses so-called sigma points with additional skewed parameters to cover and propagate the information of the data. Based on the UT, the so-called unscented Kalman filter (UKF) was derived. The most mentionable advantage of the UKF over the EKF is its derivative-free estimation (no need to calculate Jacobians and Hessians), though its computational complexity is a little higher than the EKF's. There are also other derivative-free estimation techniques available. In [355], a polynomial approximation using an interpolation formula was developed and subsequently applied to nonlinear Kalman filtering, under the name nprKF. The nprKF filtering technique was also used to train neural networks [166].

36 The name is somewhat ad hoc and the word "unscented" does not imply its original meaning (private communication with S. Julier).

The idea of derivative-free state estimation is as follows: in order to estimate the state information (mean, covariance, and higher-order moments) after a nonlinear transformation, it is favorable to approximate the probability distribution directly instead of approximating the nonlinear function (by linear localization) and applying the Kalman filter in the transformed domain. The derivative-free UKF overcomes the drawback by using a deterministic sampling approach to calculate the mean and covariance. In particular, $(2N_x + 1)$ sigma points are generated and propagated through the true nonlinearity, and the weighted mean and covariance are further calculated [242], [474]. Compared with the EKF's first-order accuracy, the estimation accuracy of the UKF is improved to third order for Gaussian data and at least second order for non-Gaussian data [242], [474].
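A minimal sketch of the unscented transformation step described above, assuming the basic (non-scaled) sigma-point rule with a single spread parameter kappa; the function names and the linear-map check are illustrative.

```python
import numpy as np

# Unscented transformation sketch: 2*Nx + 1 sigma points are generated from
# the mean and covariance, pushed through the nonlinearity, and the weighted
# mean/covariance are recomputed.
def unscented_transform(mean, cov, f, kappa=1.0):
    nx = mean.size
    L = np.linalg.cholesky((nx + kappa) * cov)         # matrix square root
    sigma = np.vstack([mean, mean + L.T, mean - L.T])  # (2*nx + 1, nx)
    w = np.full(2 * nx + 1, 0.5 / (nx + kappa))
    w[0] = kappa / (nx + kappa)
    y = np.array([f(s) for s in sigma])                # propagate through f
    y_mean = w @ y
    y_cov = (w[:, None] * (y - y_mean)).T @ (y - y_mean)
    return y_mean, y_cov

# Sanity check: for a linear map the UT is exact (mean -> A m, cov -> A P A^T).
A = np.array([[1.0, 0.5], [0.0, 1.0]])
m, P = np.zeros(2), np.eye(2)
ym, yc = unscented_transform(m, P, lambda x: A @ x, kappa=1.0)
```

No Jacobian of f appears anywhere, which is the derivative-free property emphasized above.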

However, the UT and UKF often encounter the ill-conditioned37 problem of the covariance matrix in practice (though it is theoretically positive semi-definite), although the regularization trick and the square-root UKF [460] can alleviate this. To enhance the numerical robustness, we propose another derivative-free KF based on singular value decomposition (SVD).

The SVD-based KF is close in spirit to the UKF; it differs only in that the UT is replaced by SVD and the sigma-point covariance becomes an eigen-covariance matrix, in which the pairwise ($\pm$) eigenvectors are stored in the column vectors of the new covariance matrix. The number

of eigen-points to store is the same as the number of sigma points in the UT. The idea behind SVD is simple: we assume the covariance matrix is characterized by a set of eigenvectors which correspond to a set of eigenvalues.38 For a symmetric covariance matrix C, eigen-decomposition (ED) and SVD are equivalent, and the eigenvalues are identical to the singular values. We prefer to calculate the SVD instead of the eigen-decomposition because the former is more numerically robust. The geometrical interpretation of SVD compared with the UT is illustrated in Fig. 6. Take the SVD of the square root of the covariance matrix C:

$$C^{1/2} = U \begin{bmatrix} S & 0 \\ 0 & 0 \end{bmatrix} V^T, \qquad (56)$$

where $C^{1/2} = \mathrm{chol}(C)$ and chol represents the Cholesky factorization; S is a diagonal matrix $S = \mathrm{diag}\{s_1, \cdots, s_k\}$; when $C^{1/2}$ is symmetric, $U = V$. Thus the eigenvalues are $\lambda_k = s_k^2$, and the eigenvectors of C are represented by the column vectors of the matrix $UU^T$. A Monte Carlo sampling of a two-dimensional Gaussian distribution passing through a Gaussian nonlinearity is shown in Fig. 6. As shown, the sigma points and eigen-points can both approximately characterize the structure of the transformed covariance

37 Namely, the condition number of the covariance matrix is very large.
38 By assuming that, we actually assume that the sufficient statistics of the underlying data are second-order, which is quite not true.


Fig. 6. SVD against Cholesky factorization in the UT. Left: 1,000 data points are generated from a two-dimensional Gaussian distribution. The small red circles linked by two thin lines are sigma points using the UT (parameters set to 1, 2, and 0, respectively; see the paper [ ] for notations); the two black arrows are the eigenvectors multiplied by a factor of 1.4; the ellipses from inside to outside correspond to the scaling factors 1, 1.4, 2, 3. Middle: After the samples pass through a Gaussian nonlinearity, the sigma points and eigen-points are calculated again for the transformed covariance. Right: SVD-based derivative-free estimation block diagram.

matrix. For the state space equations (2a)(2b) with additive noise, the SVD-based derivative-free KF algorithm for state estimation is summarized in Table X in Appendix E.
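The eigen-point construction can be sketched as follows; this is a rough illustration of (56) and the paired eigenvectors described above, not the paper's Table X algorithm, and the covariance values are made up.

```python
import numpy as np

# Eigen-point sketch: take the SVD of the Cholesky square root of the
# covariance; the singular vectors and values give paired +/- eigen-points
# around the mean, analogous to sigma points in the UT.
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])            # illustrative covariance
mean = np.zeros(2)

C_half = np.linalg.cholesky(C)        # C^{1/2}, with C_half @ C_half.T = C
U, s, Vt = np.linalg.svd(C_half)      # eq. (56): C^{1/2} = U S V^T

# Paired eigen-points along each singular direction
eigen_points = np.vstack([mean + s[k] * U[:, k] for k in range(2)] +
                         [mean - s[k] * U[:, k] for k in range(2)])
```

Note that the squared singular values of $C^{1/2}$ recover the eigenvalues of C ($\lambda_k = s_k^2$), matching the text; using SVD rather than an explicit eigen-decomposition is the numerically robust choice advocated above.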

G. Monte Carlo Sampling Approximation

Monte Carlo methods use statistical sampling and estimation techniques to evaluate the solutions to mathematical problems. Monte Carlo methods have three categories: (i) Monte Carlo sampling, which is devoted to developing efficient (variance-reduction oriented) sampling techniques for estimation; (ii) Monte Carlo calculation, which aims to design various random or pseudo-random number generators; and (iii) Monte Carlo optimization, which is devoted to applying the Monte Carlo idea to optimize some (nonconvex or non-differentiable) functions, to name a few, simulated annealing [257], dynamic weighting [494], [309], [98], and genetic algorithms. In recent decades, modern Monte Carlo techniques have attracted more and more attention and have been developed in different areas, as we will briefly overview in this subsection. Only Monte Carlo sampling methods are discussed. A detailed background on Monte Carlo methods can be found in the books [168], [389], [306], [386] and the survey papers [197], [318].
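As a minimal sketch of the Monte Carlo sampling estimate discussed next: draw i.i.d. samples from the distribution and average the integrand over them. The integrand and distribution here are illustrative.

```python
import numpy as np

# Monte Carlo estimate of an expectation: average f over i.i.d. samples
# drawn from P; the estimator is unbiased and its standard error shrinks
# as 1/sqrt(Np).
rng = np.random.default_rng(42)

def mc_estimate(f, sampler, n_samples):
    x = sampler(n_samples)
    return np.mean(f(x))

# Example: E[cos(x)] for x ~ N(0,1); the exact value is exp(-1/2)
est = mc_estimate(np.cos, lambda n: rng.standard_normal(n), 100_000)
```

This brute-force estimator is exactly the form (57) below; variance-reduction techniques refine how the samples are drawn, not the averaging itself.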

The underlying mathematical concept of Monte Carlo approximation is simple. Consider the statistical problem of estimating a Lebesgue-Stieltjes integral:

$$\int_{\mathcal{X}} f(x)\,dP(x),$$

where $f(x)$ is an integrable function in a measurable space. As a brute-force technique, Monte Carlo sampling uses a number of (independent) random variables in a probability space $(\Omega, \mathcal{F}, P)$ to approximate the true integral. Provided one draws a sequence of $N_p$ i.i.d. random samples $\{x^{(1)}, \cdots, x^{(N_p)}\}$ from the probability distribution $P(x)$, then the Monte Carlo estimate of $f(x)$ is given by

$$\hat{f}_{N_p} = \frac{1}{N_p}\sum_{i=1}^{N_p} f\big(x^{(i)}\big), \qquad (57)$$

for which $E[\hat{f}_{N_p}] = E[f]$ and $\mathrm{Var}[\hat{f}_{N_p}] = \frac{1}{N_p}\mathrm{Var}[f] = \frac{\sigma^2}{N_p}$ (see Appendix A for a general proof). By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), $\hat{f}_{N_p}(x)$ converges to $E[f(x)]$ almost surely (a.s.) and its convergence rate is assessed by the Central Limit Theorem

    Np(f