Bayesian Estimation of Time-Varying Systems



Written material for the course held in Spring 2011. Version 1.2 (January 22, 2011). Copyright (C) Simo Särkkä, 2009–2011. All rights reserved.

BAYESIAN ESTIMATION OF TIME-VARYING SYSTEMS:

    Discrete-Time Systems

    Simo Särkkä


    Contents

1 Introduction
  1.1 Why Bayesian Approach?
  1.2 What is Optimal Filtering?
      1.2.1 Applications of Optimal Filtering
      1.2.2 Origins of Bayesian Optimal Filtering
      1.2.3 Optimal Filtering and Smoothing as Bayesian Inference
      1.2.4 Algorithms for Optimal Filtering and Smoothing

2 From Bayesian Inference to Bayesian Optimal Filtering
  2.1 Bayesian Inference
      2.1.1 Philosophy of Bayesian Inference
      2.1.2 Connection to Maximum Likelihood Estimation
      2.1.3 The Building Blocks of Bayesian Models
      2.1.4 Bayesian Point Estimates
      2.1.5 Numerical Methods
  2.2 Batch and Recursive Estimation
      2.2.1 Batch Linear Regression
      2.2.2 Recursive Linear Regression
      2.2.3 Batch vs. Recursive Estimation
  2.3 Towards Bayesian Filtering
      2.3.1 Drift Model for Linear Regression
      2.3.2 Kalman Filter for Linear Model with Drift

3 Optimal Filtering
  3.1 Formal Filtering Equations and Exact Solutions
      3.1.1 Probabilistic State Space Models
      3.1.2 Optimal Filtering Equations
      3.1.3 Kalman Filter
  3.2 Extended and Unscented Filtering
      3.2.1 Linearization of Non-Linear Transforms
      3.2.2 Extended Kalman Filter (EKF)
      3.2.3 Statistical Linearization of Non-Linear Transforms
      3.2.4 Statistically Linearized Filter
      3.2.5 Unscented Transform
      3.2.6 Unscented Kalman Filter (UKF)
  3.3 Gaussian Filtering
      3.3.1 Gaussian Moment Matching
      3.3.2 Gaussian Filter
      3.3.3 Gauss-Hermite Integration
      3.3.4 Gauss-Hermite Kalman Filter (GHKF)
      3.3.5 Spherical Cubature Integration
      3.3.6 Cubature Kalman Filter (CKF)
  3.4 Monte Carlo Approximations
      3.4.1 Principles and Motivation of Monte Carlo
      3.4.2 Importance Sampling
  3.5 Particle Filtering
      3.5.1 Sequential Importance Sampling
      3.5.2 Sequential Importance Resampling
      3.5.3 Rao-Blackwellized Particle Filter

4 Optimal Smoothing
  4.1 Formal Equations and Exact Solutions
      4.1.1 Optimal Smoothing Equations
      4.1.2 Rauch-Tung-Striebel Smoother
  4.2 Extended and Unscented Smoothing
      4.2.1 Extended Rauch-Tung-Striebel Smoother
      4.2.2 Statistically Linearized RTS Smoother
      4.2.3 Unscented Rauch-Tung-Striebel (URTS) Smoother
  4.3 General Gaussian Smoothing
      4.3.1 General Gaussian Rauch-Tung-Striebel Smoother
      4.3.2 Gauss-Hermite Rauch-Tung-Striebel (GHRTS) Smoother
      4.3.3 Cubature Rauch-Tung-Striebel (CRTS) Smoother
  4.4 Fixed-Point and Fixed-Lag Gaussian Smoothing
      4.4.1 General Fixed-Point Smoother Equations
      4.4.2 General Fixed-Lag Smoother Equations
  4.5 Monte Carlo Based Smoothers
      4.5.1 Sequential Importance Resampling Smoother
      4.5.2 Rao-Blackwellized Particle Smoother

A Additional Material
  A.1 Properties of Gaussian Distribution

References

Index


    Chapter 1

    Introduction

    1.1 Why Bayesian Approach?

The mathematical treatment of the models and algorithms in this document is Bayesian, which means that all the results are treated as being approximations to certain probability distributions or their parameters. Probability distributions are used both for modeling the uncertainties in the models and for modeling the physical randomness. The theory of non-linear optimal filtering is formulated in terms of Bayesian inference, and both the classical and recent filtering algorithms are derived using the same Bayesian notation and formalism.

The reason for selecting the Bayesian approach is more a practical engineering decision than a philosophical one. It simply is easier to develop a consistent, practically applicable theory of recursive inference under the Bayesian philosophy than under, for example, the least squares or maximum likelihood philosophy. Another useful consequence of selecting the Bayesian approach is that least squares, maximum likelihood and many other philosophically different results can be obtained as special cases or re-interpretations of the Bayesian results. Of course, quite often the same thing also applies the other way around.

Modeling uncertainty as randomness is a very “engineering” way of modeling the world. It is exactly the approach also chosen in statistical physics as well as in financial analysis. The Bayesian approach to optimal filtering is also far from new (see, e.g., Ho and Lee, 1964; Lee, 1964; Jazwinski, 1966; Stratonovich, 1968; Jazwinski, 1970), because the theory already existed at the time the seminal article of Kalman (1960b) was published. The Kalman filter was derived from the least squares point of view, but the non-linear filtering theory has been Bayesian from the beginning (see, e.g., Jazwinski, 1970).

One should not take the Bayesian way of modeling unknown parameters as random variables too literally. It does not imply that one believes that there really is something random in the parameters; it is just a convenient way of representing uncertainty under the same formalism that is used for representing randomness. Likewise, the random or stochastic processes appearing in the mathematical models are


not necessarily really random in the physical sense; instead, the randomness is just a mathematical trick for taking into account the uncertainty in a dynamic phenomenon.

But it does not matter whether the randomness is interpreted as physical randomness or as a representation of uncertainty, as long as the randomness-based models succeed in modeling the real world. In this engineering philosophy the controversy between so-called “frequentists” and “Bayesians” is simply silly – it is quite much equivalent to the unnecessary controversy about the interpretations of quantum mechanics, that is, whether, for example, the Copenhagen interpretation or the many-worlds interpretation is the correct one. The philosophical interpretation does not matter as long as we get meaningful predictions from the theory.

    1.2 What is Optimal Filtering?

Optimal filtering refers to methodology that can be used for estimating the state of a time-varying system which is indirectly observed through noisy measurements. The state of the system refers to the collection of dynamic variables, such as position, velocities and accelerations or orientation and rotational motion parameters, which describe the physical state of the system. The noise in the measurements means that the measurements are uncertain; that is, even if we knew the true system state, the measurements would not be deterministic functions of the state, but would have a certain distribution of possible values. The time evolution of the state is modeled as a dynamic system which is perturbed by a certain process noise. This noise is used for modeling the uncertainties in the system dynamics, and in most cases the system is not truly stochastic; the stochasticity is only used for representing the model uncertainties.

1.2.1 Applications of Optimal Filtering

Phenomena which can be modeled as time-varying systems of the above type are very common in engineering applications. These kinds of models can be found, for example, in navigation, aerospace engineering, space engineering, remote surveillance, telecommunications, physics, audio signal processing, control engineering, finance and several other fields. Examples of such applications are the following:

• The global positioning system (GPS) (Kaplan, 1996) is a widely used satellite navigation system, where the GPS receiver unit measures arrival times of signals from several GPS satellites and computes its position based on these measurements. The GPS receiver typically uses an extended Kalman filter or some other optimal filtering algorithm for computing the position and velocity such that the measurements and the assumed dynamics (laws of physics) are taken into account. Also the ephemeris information, which is the satellite reference information transmitted from the satellites to the GPS receivers, is typically generated using optimal filters.


Figure 1.1: In the GPS system, the measurements are time delays of satellite signals, and the optimal filter (e.g., EKF) computes the position and the accurate time.

• Target tracking (Bar-Shalom et al., 2001; Crassidis and Junkins, 2004) refers to the methodology where a set of sensors, such as active or passive radars, radio frequency sensors, acoustic arrays, infrared sensors and other types of sensors, is used for determining the position and velocity of a remote target. When this tracking is done continuously, the dynamics of the target and the measurements from the different sensors are most naturally combined using an optimal filter. The target in this (single) target tracking case can be, for example, a robot, a satellite, a car or an airplane.

Figure 1.2: In target tracking, a sensor generates measurements (e.g., angle measurements) of the target, and the purpose is to determine the target trajectory.

• Multiple target tracking (Bar-Shalom and Li, 1995; Blackman and Popoli, 1999; Stone et al., 1999; Särkkä et al., 2007b) systems are used for remote surveillance in cases where there are multiple targets moving at the same time in the same geographical area. This gives rise to the problem of data association (which measurement was from which target?) and the problem of estimating the number of targets. Multiple target tracking systems are


switch to pure inertial navigation when the GPS receiver is unable to compute its position (i.e., has no fix) for some reason. This happens, for example, indoors, in tunnels and in other cases when there is no direct line-of-sight between the GPS receiver and the satellites.

• The spread of infectious diseases (Anderson and May, 1991) can often be modeled with differential equations for the numbers of susceptible, infected and recovered/dead individuals. When uncertainties are introduced into the dynamic equations, and when the measurements are not perfect, the estimation of the spread of the disease can be formulated as an optimal filtering problem.

• Biological processes (Murray, 1993) such as population growth, predator-prey models and several other dynamic processes in biology can also be modeled as (stochastic) differential equations. The estimation of the states of these processes from inaccurate measurements can be formulated as an optimal filtering problem.

• Telecommunications is also a field where optimal filters are traditionally used. For example, optimal receivers, signal detectors and phase locked loops can be interpreted to contain optimal filters (Van Trees, 1968, 1971) as components. Also the celebrated Viterbi algorithm (Viterbi, 1967) can be interpreted as a combination of optimal filtering and optimal smoothing of the underlying hidden Markov model.

• Audio signal processing applications such as audio restoration (Godsill and Rayner, 1998) and audio signal enhancement (Fong et al., 2002) often use TVAR (time-varying autoregressive) models as the underlying audio signal models. These kinds of models can be efficiently estimated using optimal filters and smoothers.

• Stochastic optimal control (Maybeck, 1982b; Stengel, 1994) considers the control of time-varying stochastic systems. Stochastic controllers can typically be found in, for example, airplanes, cars and rockets. Here optimality means, in addition to statistical optimality, that the control signal is constructed to minimize a performance cost, such as the expected time to reach a predefined state, the amount of fuel consumed or the average distance from a desired position trajectory. Optimal filters are typically used for estimating the states of the stochastic system, and a deterministic optimal controller is constructed independently from the filter such that it uses the estimate of the filter as the known state. In theory, the optimal controller and optimal filter are not completely decoupled, and the problem of constructing optimal stochastic controllers is far more challenging than constructing optimal filters and (deterministic) optimal controllers separately.

• Learning systems or adaptive systems can often be mathematically formulated in terms of optimal filters. The theory of stochastic differential equations


were formulating the linear theory in the United States, Stratonovich was doing the pioneering work on the probabilistic (Bayesian) approach in Russia (Stratonovich, 1968; Jazwinski, 1970).

As discussed in the book of West and Harrison (1997), in the sixties, Kalman-filter-like recursive estimators were also used in the Bayesian community, and it is not clear whether the theory of Kalman filtering or the theory of dynamic linear models (DLM) came first. Although these theories were originally derived from slightly different starting points, they are equivalent. Because of the Kalman filter's useful connection to the theory and history of stochastic optimal control, this document approaches the Bayesian filtering problem from the Kalman filtering point of view.

Although the original derivation of the Kalman filter was based on the least squares approach, the same equations can be derived from pure probabilistic Bayesian analysis. The Bayesian analysis of Kalman filtering is well covered in the classical book of Jazwinski (1970) and more recently in the book of Bar-Shalom et al. (2001). Kalman filtering, mostly because of its least squares interpretation, has widely been used in stochastic optimal control. A practical reason for this is that the inventor of the Kalman filter, Rudolph E. Kalman, also made several contributions (Kalman, 1960a) to the theory of linear quadratic Gaussian (LQG) regulators, which are fundamental tools of stochastic optimal control (Stengel, 1994; Maybeck, 1982b).

    1.2.3 Optimal Filtering and Smoothing as Bayesian Inference

Optimal Bayesian filtering (see, e.g., Jazwinski, 1970; Bar-Shalom et al., 2001; Doucet et al., 2001; Ristic et al., 2004) considers statistical inversion problems, where the unknown quantity is a vector valued time series (x_1, x_2, ...) which is observed through noisy measurements (y_1, y_2, ...) as illustrated in Figure 1.4. An example of this kind of time series is shown in Figure 1.5. The process shown is actually a discrete-time noisy resonator with a known angular velocity. The state x_k = (x_k, ẋ_k)^T is two dimensional and consists of the position of the resonator x_k and its time derivative ẋ_k. The measurements y_k are scalar observations of the resonator position (signal) and they are corrupted by measurement noise.

observed:  y_1   y_2   y_3   y_4
hidden:    x_1 → x_2 → x_3 → x_4 → ...

Figure 1.4: In discrete-time filtering a sequence of hidden states x_k is indirectly observed through noisy measurements y_k.
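The resonator model described above can be made concrete with a short simulation. The sketch below is a minimal illustration rather than code from the text: the angular velocity, time step and noise variances are arbitrary assumed values.

```python
import numpy as np

def simulate_resonator(T=100, w=0.5, dt=0.1, q=0.01, r=0.1, seed=0):
    """Simulate a discrete-time noisy resonator with known angular velocity w.

    State x_k = (position, velocity)^T; measurements y_k are noisy positions.
    """
    rng = np.random.default_rng(seed)
    # Exact discretization of the harmonic oscillator dynamics.
    A = np.array([[np.cos(w * dt), np.sin(w * dt) / w],
                  [-w * np.sin(w * dt), np.cos(w * dt)]])
    x = np.array([0.1, 0.0])
    states, measurements = [], []
    for _ in range(T):
        # Dynamic model: x_k = A x_{k-1} + process noise.
        x = A @ x + rng.normal(0.0, np.sqrt(q), size=2)
        # Measurement model: y_k = position component + measurement noise.
        y = x[0] + rng.normal(0.0, np.sqrt(r))
        states.append(x.copy())
        measurements.append(y)
    return np.array(states), np.array(measurements)

states, ys = simulate_resonator()
```

Plotting `ys` against the first column of `states` reproduces the qualitative picture of Figure 1.5: a smooth hidden signal observed through scattered noisy measurements.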



Figure 1.5: An example of a time series, which models a discrete-time resonator. The actual resonator state (signal) is hidden and only observed through the noisy measurements.

The purpose of the statistical inversion at hand is to estimate the hidden states {x_1, ..., x_T} given the observed measurements {y_1, ..., y_T}, which means that in the Bayesian sense (Bernardo and Smith, 1994; Gelman et al., 1995) all we have to do is to compute the joint posterior distribution of all the states given all the measurements. This can be done by a straightforward application of Bayes' rule:

  p(x_1, ..., x_T | y_1, ..., y_T)
      = p(y_1, ..., y_T | x_1, ..., x_T) p(x_1, ..., x_T) / p(y_1, ..., y_T),    (1.1)

    where

• p(x_1, ..., x_T) is the prior defined by the dynamic model,
• p(y_1, ..., y_T | x_1, ..., x_T) is the likelihood model for the measurements,
• p(y_1, ..., y_T) is the normalization constant defined as

  p(y_1, ..., y_T) = ∫ p(y_1, ..., y_T | x_1, ..., x_T) p(x_1, ..., x_T) d(x_1, ..., x_T).    (1.2)

Unfortunately, this full posterior formulation has the serious disadvantage that each time we obtain a new measurement, the full posterior distribution would have to be recomputed. This is particularly a problem in dynamic estimation (which is exactly the problem we are solving here!), because there the measurements are typically obtained one at a time and we would want to compute the best possible estimate


after each measurement. When the number of time steps increases, the dimensionality of the full posterior distribution also increases, which means that the computational complexity of a single time step increases. Thus after a sufficient number of time steps the computations will become intractable, independently of the available computational resources. Without additional information or harsh approximations, there is no way of getting around this problem in the full posterior computation.

However, the above problem only arises when we want to compute the full posterior distribution of the states at each time step. If we are willing to relax this a bit and be satisfied with selected marginal distributions of the states, the computations become orders of magnitude lighter. In order to achieve this, we also need to restrict the class of dynamic models to probabilistic Markov sequences, which sounds like a more severe restriction than it really is. The model for the states and measurements will be assumed to be of the following type:

• The initial distribution specifies the prior distribution p(x_0) of the hidden state x_0 at the initial time step k = 0.

• The dynamic model describes the system dynamics and their uncertainties as a Markov sequence, defined in terms of the transition distribution p(x_k | x_{k−1}).

• The measurement model describes how the measurement y_k depends on the current state x_k. This dependence is modeled by specifying the distribution of the measurement given the state, p(y_k | x_k).

Because computing the full joint distribution of the states at all time steps is computationally very inefficient and unnecessary in real-time applications, in optimal (Bayesian) filtering the following marginal distributions are considered instead:

• Filtering distributions are the marginal distributions of the current state x_k given the previous measurements {y_1, ..., y_k}:

  p(x_k | y_1, ..., y_k),   k = 1, ..., T.    (1.3)

• Prediction distributions are the marginal distributions of the future states, n steps after the current time step:

  p(x_{k+n} | y_1, ..., y_k),   k = 1, ..., T,   n = 1, 2, ...    (1.4)

• Smoothing distributions are the marginal distributions of the states x_k given a certain interval {y_1, ..., y_T} of measurements with T > k:

  p(x_k | y_1, ..., y_T),   k = 1, ..., T.    (1.5)
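For the linear Gaussian special case, the filtering distributions (1.3) can be computed in closed form with the Kalman filter discussed later in this text. As a preview, here is a minimal sketch for a hypothetical scalar random-walk model; the model and all parameter values are illustrative assumptions, not taken from the text.

```python
import numpy as np

def kalman_filter_1d(ys, q=0.01, r=0.25, m0=0.0, p0=1.0):
    """Compute the Gaussian filtering distributions p(x_k | y_1..y_k)
    for the model x_k = x_{k-1} + w_k, y_k = x_k + v_k."""
    m, p = m0, p0
    means, variances = [], []
    for y in ys:
        # Prediction step: propagate through the dynamic model.
        p = p + q
        # Update step: condition on the new measurement y_k.
        s = p + r          # innovation variance
        k = p / s          # Kalman gain
        m = m + k * (y - m)
        p = (1.0 - k) * p
        means.append(m)
        variances.append(p)
    return np.array(means), np.array(variances)

rng = np.random.default_rng(1)
signal = np.cumsum(rng.normal(0.0, 0.1, 50))   # latent random walk
ys = signal + rng.normal(0.0, 0.5, 50)         # noisy measurements
means, variances = kalman_filter_1d(ys)
```

Each pair `(means[k], variances[k])` is the exact Gaussian filtering distribution at step k; the filter estimates track the hidden signal considerably more closely than the raw measurements do.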


Figure 1.6: State estimation problems can be divided into optimal prediction, filtering and smoothing depending on the time span of measurements available with respect to the time span of the estimated state.


Figure 1.7: The result of computing the filtering distributions for the discrete-time resonator model. The estimates are the posterior means of the filtering distributions and the quantiles are the 95% quantiles of the filtering distributions.

    1.2.4 Algorithms for Optimal Filtering and Smoothing

There exist a few classes of filtering and smoothing problems which have closed-form solutions:

• The Kalman filter (KF) is a closed-form solution to the discrete-time linear filtering problem. Due to the linear Gaussian model assumptions the posterior distribution is exactly Gaussian and no numerical approximations are needed.

Figure 1.8: The result of computing the smoothing distributions for the discrete-time resonator model. The estimates are the posterior means of the smoothing distributions and the quantiles are the 95% quantiles of the smoothing distributions. The smoothing distributions are actually the marginal distributions of the full state posterior distribution.

• The Rauch-Tung-Striebel smoother (RTSS) is the corresponding closed-form smoother for linear Gaussian state space models.

• Grid filters and smoothers are solutions to Markov models with finite state spaces.

But because the Bayesian optimal filtering and smoothing equations are generally computationally intractable, many kinds of numerical approximation methods have been developed, for example:

• The extended Kalman filter (EKF) approximates the non-linear and non-Gaussian measurement and dynamic models by linearization, that is, by forming a Taylor series expansion at the nominal (or maximum a posteriori, MAP) solution. This results in a Gaussian approximation to the filtering distribution.

    • Extended Rauch-Tung-Striebel smoother (ERTSS) is the approximate non-linear smoothing algorithm corresponding to EKF.

• The unscented Kalman filter (UKF) approximates the propagation of densities through the non-linearities of the measurement and noise processes using the unscented transform. This also results in a Gaussian approximation.

  • 8/11/2019 Bayesian Estimation of Time Varying Systems

    16/115

    12 Introduction

    • Unscented Rauch-Tung-Striebel smoother (URTSS) is the approximate non-linear smoothing algorithm corresponding to UKF.

• Sequential Monte Carlo methods, or particle filters and smoothers, represent the posterior distribution as a weighted set of Monte Carlo samples.

• The unscented particle filter (UPF) and local linearization based methods use UKFs and EKFs, respectively, for approximating the importance distributions in sequential importance sampling.

• Rao-Blackwellized particle filters and smoothers use closed-form integration (e.g., Kalman filters and RTS smoothers) for some of the state variables and Monte Carlo integration for the others.

• Interacting multiple models (IMM) and other multiple model methods approximate the posterior distributions with Gaussian mixture approximations.

• Grid based methods approximate the distribution as a discrete distribution defined on a finite grid.

• Other methods also exist, for example those based on series expansions, describing functions, basis function expansions, the exponential family of distributions, variational Bayesian methods, batch Monte Carlo (e.g., MCMC), Galerkin approximations, etc.
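To give one of the approximations above a concrete shape, the following is a minimal sketch of a bootstrap (sequential importance resampling) particle filter for a hypothetical scalar random-walk model. The model, its noise parameters and the particle count are illustrative assumptions, not details from the text.

```python
import numpy as np

def bootstrap_particle_filter(ys, n_particles=1000, q=0.1, r=0.5, seed=0):
    """Bootstrap particle filter: propagate particles through the dynamic
    model, weight by the measurement likelihood, and resample."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(0.0, 1.0, n_particles)  # samples from p(x_0)
    estimates = []
    for y in ys:
        # Propagate through the dynamic model p(x_k | x_{k-1}).
        particles = particles + rng.normal(0.0, np.sqrt(q), n_particles)
        # Weight by the Gaussian measurement likelihood p(y_k | x_k).
        log_w = -0.5 * (y - particles) ** 2 / r
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # Weighted mean approximates the filtering mean E[x_k | y_1..y_k].
        estimates.append(np.sum(w * particles))
        # Multinomial resampling to counteract weight degeneracy.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
    return np.array(estimates)

rng = np.random.default_rng(42)
signal = np.cumsum(rng.normal(0.0, np.sqrt(0.1), 40))
ys = signal + rng.normal(0.0, np.sqrt(0.5), 40)
est = bootstrap_particle_filter(ys)
```

For this linear Gaussian toy model a Kalman filter would be exact; the particle filter is shown only because, unlike the KF, the same loop works unchanged for non-linear and non-Gaussian models.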


    Chapter 2

From Bayesian Inference to Bayesian Optimal Filtering

    2.1 Bayesian Inference

This section provides a brief presentation of the philosophical and mathematical foundations of Bayesian inference. The connections to classical statistical inference are also briefly discussed.

    2.1.1 Philosophy of Bayesian Inference

The purpose of Bayesian inference (Bernardo and Smith, 1994; Gelman et al., 1995) is to provide a mathematical machinery that can be used for modeling systems where the uncertainties of the system are taken into account and the decisions are made according to rational principles. The tools of this machinery are probability distributions and the rules of probability calculus.

If we compare the so-called frequentist philosophy of statistical analysis to Bayesian inference, the difference is that in Bayesian inference the probability of an event does not mean the proportion of the event in an infinite number of trials, but the uncertainty of the event in a single trial. Because models in Bayesian inference are formulated in terms of probability distributions, the probability axioms and computation rules of probability theory (see, e.g., Shiryaev, 1996) also apply in Bayesian inference.

    2.1.2 Connection to Maximum Likelihood Estimation

Consider a situation where we know the conditional distribution p(y_k | θ) of conditionally independent random variables (measurements) y_1, ..., y_n, but the parameter θ ∈ R^d is unknown. The classical statistical method for estimating the parameter is the maximum likelihood method (Milton and Arnold, 1995), where we maximize the joint probability of the measurements, also called the likelihood


function

  L(θ) = ∏_k p(y_k | θ).    (2.1)

The maximum of the likelihood function with respect to θ gives the maximum likelihood estimate (ML-estimate)

  θ̂ = arg max_θ L(θ).    (2.2)
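As a simple instance of (2.1)–(2.2), assume the measurements are Gaussian with unknown mean θ and unit variance; the likelihood is then maximized by the sample mean. The sketch below checks this numerically on synthetic data; the true parameter value and the grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0
ys = rng.normal(theta_true, 1.0, 200)  # y_k ~ N(theta, 1), independent

def neg_log_likelihood(theta, ys):
    """-log L(theta) for the unit-variance Gaussian model,
    up to an additive constant."""
    return 0.5 * np.sum((ys - theta) ** 2)

# Maximize L(theta), i.e. minimize -log L(theta), over a dense grid.
grid = np.linspace(0.0, 4.0, 4001)
nll = np.array([neg_log_likelihood(t, ys) for t in grid])
theta_ml = grid[np.argmin(nll)]
```

The grid minimizer agrees with the closed-form ML-estimate `ys.mean()` up to the grid resolution.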

The difference between Bayesian inference and the maximum likelihood method is that the starting point of Bayesian inference is to formally consider the parameter θ as a random variable. The posterior distribution of the parameter θ can then be computed by using Bayes' rule

  p(θ | y_1, ..., y_n) = p(y_1, ..., y_n | θ) p(θ) / p(y_1, ..., y_n),    (2.3)

where p(θ) is the prior distribution, which models the prior beliefs about the parameter before we have seen any data, and p(y_1, ..., y_n) is a normalization term, which is independent of the parameter θ. This normalization constant is often left out and, if the measurements y_1, ..., y_n are conditionally independent given θ, the posterior distribution of the parameter can be written as

  p(θ | y_1, ..., y_n) ∝ p(θ) ∏_k p(y_k | θ).    (2.4)

Because we are dealing with a distribution, we might now choose the most probable value of the random variable (MAP-estimate), which is given by the maximum of the posterior distribution. However, a better estimate in the mean squared error sense is the posterior mean of the parameter (MMSE-estimate). There are an infinite number of other ways of choosing the point estimate from the distribution, and the best way depends on the assumed loss function (or utility function). The ML-estimate can be considered as a MAP-estimate with a uniform prior on the parameter θ.
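The distinction between the MAP and MMSE estimates can be made concrete with a skewed posterior, for which the posterior mode and the posterior mean differ. The Gamma-shaped density below is a purely illustrative choice (for it, the mode is 2 and the mean is 3).

```python
import numpy as np

# A skewed unnormalized posterior on a grid: Gamma(shape=3, rate=1),
# i.e. p(theta | y) proportional to theta^2 * exp(-theta).
theta = np.linspace(1e-6, 20.0, 20001)
unnorm = theta ** 2 * np.exp(-theta)

# Normalize numerically (a Riemann sum stands in for the evidence term).
dx = theta[1] - theta[0]
post = unnorm / (unnorm.sum() * dx)

theta_map = theta[np.argmax(post)]        # posterior mode (MAP-estimate)
theta_mmse = (theta * post).sum() * dx    # posterior mean (MMSE-estimate)
```

Here the two point estimates disagree by a full unit, which illustrates why the choice between them is a genuine modeling (loss function) decision; for a symmetric unimodal posterior they would coincide.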

One can also interpret Bayesian inference as a convenient method for including regularization terms into maximum likelihood estimation. The basic ML-framework does not have a self-consistent method for including regularization terms or prior information into statistical models. However, this regularization interpretation of Bayesian inference is not entirely right, because Bayesian inference is much more than this.

2.1.3 The Building Blocks of Bayesian Models

The basic blocks of a Bayesian model are the prior model containing the preliminary information on the parameter and the measurement model determining the stochastic mapping from the parameter to the measurements. Using the combination rules, namely Bayes' rule, it is possible to infer an estimate of the


parameters from the measurements. The distribution of the parameters, which is conditional on the observed measurements, is called the posterior distribution, and it is the distribution representing the state of knowledge about the parameters when all the information in the observed measurements and the model is used. The predictive posterior distribution is the distribution of the new (not yet observed) measurements when all the information in the observed measurements and the model is used.

• Prior model
  The prior information consists of subjective experience-based beliefs about the possible and impossible parameter values and their relative likelihoods before anything has been observed. The prior distribution is a mathematical representation of this information:

  p(θ) = information on parameter θ before seeing any observations.    (2.5)

  The lack of prior information can be expressed by using a non-informative prior. The non-informative prior distribution can be selected in various different ways (Gelman et al., 1995).

• Measurement model
  Between the true parameters and the measurements there often is a causal, but inaccurate or noisy relationship. This relationship is mathematically modeled using the measurement model:

  p(y | θ) = distribution of observation y given the parameters θ.    (2.6)

• Posterior distribution
  The posterior distribution is the conditional distribution of the parameters, and it represents the information we have after the measurement y has been obtained. It can be computed by using Bayes' rule:

  p(θ | y) = p(y | θ) p(θ) / p(y) ∝ p(y | θ) p(θ),    (2.7)

  where the normalization constant is given as

  p(y) = ∫_{R^d} p(y | θ) p(θ) dθ.    (2.8)

  In the case of multiple measurements y_1, ..., y_n, if the measurements are conditionally independent, the joint likelihood of all measurements is the product of the individual likelihoods and the posterior distribution is

  p(θ | y_1, ..., y_n) ∝ p(θ) ∏_k p(y_k | θ),    (2.9)

  where the normalization term can be computed by integrating the right hand side over θ. If the random variable is discrete, the integration reduces to summation.


    • Predictive posterior distribution
      The predictive posterior distribution is the distribution of new measurements y_{n+1}:

      p(y_{n+1} | y_1, …, y_n) = ∫_{ℝ^d} p(y_{n+1} | θ) p(θ | y_1, …, y_n) dθ.  (2.10)

      After obtaining the measurements y_1, …, y_n, the predictive posterior distribution can be used for computing the probability distribution of the (n + 1):th measurement, which has not been observed yet.

    In the case of tracking, we could imagine that the parameter is the sequence of dynamic states of a target, where the state contains the position and velocity. Or, in the continuous-discrete setting, the parameter would be an infinite-dimensional random function describing the trajectory of the target on a given time interval. In both cases the measurements could be, for example, noisy distance and direction measurements produced by a radar.
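    As a concrete illustration of these building blocks, the prior, likelihood, and posterior can be evaluated numerically on a grid for a scalar parameter. The following sketch combines a Gaussian prior with conditionally independent Gaussian measurements via the Bayes' rule (2.7)–(2.9); the prior, noise variance, and measurement values are assumptions chosen purely for illustration:

    ```python
    import numpy as np

    # Grid over a scalar parameter theta; all model values below are
    # assumptions chosen for illustration.
    theta = np.linspace(-5.0, 5.0, 2001)
    dtheta = theta[1] - theta[0]

    def gauss(x, m, s2):
        return np.exp(-0.5 * (x - m) ** 2 / s2) / np.sqrt(2.0 * np.pi * s2)

    prior = gauss(theta, 0.0, 4.0)            # p(theta) = N(theta | 0, 4)
    y = np.array([1.2, 0.9, 1.4])             # assumed measurements, y_k = theta + noise
    sigma2 = 0.5                              # assumed measurement noise variance

    # Joint likelihood of conditionally independent measurements, Eq. (2.9)
    likelihood = np.ones_like(theta)
    for yk in y:
        likelihood *= gauss(yk, theta, sigma2)

    # Bayes' rule (2.7); the normalization (2.8) reduces to a sum on the grid
    unnormalized = prior * likelihood
    posterior = unnormalized / (np.sum(unnormalized) * dtheta)

    post_mean = np.sum(theta * posterior) * dtheta
    ```

    For this conjugate Gaussian setup the grid-based posterior mean agrees with the closed-form result, which makes the sketch easy to verify.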

    2.1.4 Bayesian Point Estimates

    The distributions as such have no direct use in applications; in Bayesian computations finite-dimensional summaries (point estimates) are also needed. This selection of a point from the parameter space based on observed values of random variables is a statistical decision, and therefore the selection procedure is most naturally formulated in terms of statistical decision theory (Berger, 1985; Bernardo and Smith, 1994; Raiffa and Schlaifer, 2000).

    Definition 2.1 (Loss Function). A loss function L(θ, a) is a scalar valued function which determines the loss of taking the action a when the true parameter value is θ. The action (or control) is the statistical decision to be made based on the currently available information.

    Instead of loss functions it is also possible to work with utility functions U(θ, a), which determine the reward from taking the action a with parameter values θ. Loss functions can be converted to utility functions and vice versa by defining U(θ, a) = −L(θ, a).

    If the value of the parameter θ is not known, but the knowledge of the parameter can be expressed in terms of the posterior distribution p(θ | y_1, …, y_n), then the natural choice is the action which gives the minimum (maximum) of the expected loss (utility) (Berger, 1985):

    E[L(θ, a) | y_1, …, y_n] = ∫_{ℝ^d} L(θ, a) p(θ | y_1, …, y_n) dθ.  (2.11)

    Commonly used loss functions are the following:


    • Quadratic error loss: If the loss function is quadratic,

      L(θ, a) = (θ − a)^T (θ − a),  (2.12)

      then the optimal choice a_o is the posterior mean of the distribution of θ:

      a_o = ∫_{ℝ^d} θ p(θ | y_1, …, y_n) dθ.  (2.13)

      This posterior mean based estimate is often called the minimum mean squared error (MMSE) estimate of the parameter θ. The quadratic loss is the most commonly used loss function, because it is easy to handle mathematically and because in the case of a Gaussian posterior distribution the MAP estimate and the median coincide with the posterior mean.

    • Absolute error loss: The loss function of the form

      L(θ, a) = ∑_i |θ_i − a_i|  (2.14)

      is called an absolute error loss, and in this case the optimal choice is the median of the distribution (i.e., the medians of the marginal distributions in the multidimensional case).

    • 0-1 loss: If the loss function is of the form

      L(θ, a) = 0, if θ = a,
                1, if θ ≠ a,  (2.15)

      then the optimal choice is the maximum of the posterior distribution, that is, the maximum a posteriori (MAP) estimate of the parameter.
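    The three loss functions lead to three different point estimates, which can be computed directly from a sample-based representation of the posterior. In the following minimal sketch the posterior is stood in for by samples from a skewed distribution chosen for illustration:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-in posterior: Monte Carlo samples from a skewed distribution
    # (Gamma(3, 1), chosen purely for illustration).
    samples = rng.gamma(shape=3.0, scale=1.0, size=100_000)

    mmse = samples.mean()        # minimizer of the quadratic loss (2.12)
    med = np.median(samples)     # minimizer of the absolute error loss (2.14)

    # Crude MAP estimate: the midpoint of the highest histogram bin
    counts, edges = np.histogram(samples, bins=200)
    i = np.argmax(counts)
    map_est = 0.5 * (edges[i] + edges[i + 1])
    ```

    For the Gamma(3, 1) stand-in the three estimates differ (mean 3, median about 2.67, mode 2), which illustrates that for skewed posteriors the choice of loss function matters.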

    2.1.5 Numerical Methods

    In principle, Bayesian inference provides the equations for computing the posterior distributions and point estimates for any model once the model specification has been set up. However, the practical problem is that the computation of the integrals involved in the equations can rarely be performed analytically, and numerical methods are needed. Here we shall briefly describe numerical methods which are also applicable in higher dimensional problems: Gaussian approximations, multi-dimensional quadratures, Monte Carlo methods, and importance sampling.

    • Very common types of approximations are Gaussian approximations (Gelman et al., 1995), where the posterior distribution is approximated with a Gaussian distribution

      p(θ | y_1, …, y_n) ≈ N(θ | m, P).  (2.16)

      The mean m and covariance P of the Gaussian approximation can be computed either by matching the first two moments of the posterior distribution, or by using the maximum of the distribution as the mean estimate and approximating the covariance with the curvature of the posterior at the mode.


    • Multi-dimensional quadrature or cubature integration methods such as Gauss-Hermite quadrature can also often be used if the dimensionality of the integral is moderate. In those methods the idea is to deterministically form a representative set of sample points Θ = {θ^(i) : i = 1, …, N} (sometimes called sigma points) and to form the approximation of the integral as the weighted average

      E[g(θ) | y_1, …, y_n] ≈ ∑_{i=1}^N W^(i) g(θ^(i)),  (2.17)

      where the numerical values of the weights W^(i) are determined by the algorithm. The sample points and weights can be selected, for example, to give exact answers for polynomials up to a certain degree or to account for the moments up to a certain degree.
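    For instance, the Gauss-Hermite rule readily gives the weights W^(i) and sigma points for a one-dimensional Gaussian (approximate) posterior. A sketch, where the mean, variance, and test function are assumed values:

    ```python
    import numpy as np

    # Gauss-Hermite nodes and weights for a 10-point rule
    nodes, weights = np.polynomial.hermite.hermgauss(10)

    # Assumed Gaussian (approximate) posterior N(m, s2) and test function g
    m, s2 = 1.0, 0.25

    def g(theta):
        return theta ** 2

    # Change of variables theta = m + sqrt(2 s2) x maps N(m, s2)
    # onto the Gauss-Hermite weight exp(-x^2); cf. Eq. (2.17)
    sigma_points = m + np.sqrt(2.0 * s2) * nodes
    expectation = np.sum(weights * g(sigma_points)) / np.sqrt(np.pi)
    ```

    A 10-point rule is exact for polynomials up to degree 19, so here the result matches E[θ²] = m² + s2 up to rounding.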

    • In direct Monte Carlo methods a set of N samples is randomly drawn from the posterior distribution,

      θ^(i) ∼ p(θ | y_1, …, y_n),  i = 1, …, N,  (2.18)

      and the expectation of any function g(·) can then be approximated as the sample average

      E[g(θ) | y_1, …, y_n] ≈ (1/N) ∑_i g(θ^(i)).  (2.19)

    Another interpretation of this is that Monte Carlo methods form an approximation of the posterior density of the form

      p(θ | y_1, …, y_n) ≈ (1/N) ∑_{i=1}^N δ(θ − θ^(i)),  (2.20)

      where δ(·) is the Dirac delta function. The convergence of the Monte Carlo approximation is guaranteed by the central limit theorem (CLT) (see, e.g., Liu, 2001) and the error term is, at least in theory, independent of the dimensionality of θ.

    • Efficient methods for generating non-independent Monte Carlo samples are the Markov chain Monte Carlo (MCMC) methods (see, e.g., Gilks et al., 1996). In MCMC methods, a Markov chain is constructed such that it has the target distribution as its stationary distribution. By simulating the Markov chain, samples from the target distribution can be generated.
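    A random-walk Metropolis algorithm is perhaps the simplest MCMC method; in the following sketch the target and the proposal step size are illustrative assumptions. Proposed moves are accepted or rejected so that the chain's stationary distribution is the target:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    def log_target(theta):
        # Assumed unnormalized log posterior: standard Gaussian for illustration
        return -0.5 * theta ** 2

    n_samples, step = 50_000, 1.0
    chain = np.empty(n_samples)
    theta = 0.0
    for i in range(n_samples):
        proposal = theta + step * rng.standard_normal()
        # Accept with probability min(1, target(proposal) / target(theta));
        # the symmetric proposal makes the Hastings correction vanish
        if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
            theta = proposal
        chain[i] = theta
    ```

    After discarding a burn-in period, the (correlated) samples in `chain` can be used in the Monte Carlo average (2.19).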

    • Importance sampling (see, e.g., Liu, 2001) is a simple algorithm for generating weighted samples from the target distribution. The difference from direct Monte Carlo sampling and from MCMC is that each of the particles carries a weight, which corrects for the difference between the actual target


    distribution and the approximation obtained from an importance distribution π(·).

      The importance sampling estimate can be formed by drawing N samples from the importance distribution,

      θ^(i) ∼ π(θ | y_1, …, y_n),  i = 1, …, N.  (2.21)

      The importance weights are then computed as

      w^(i) = p(θ^(i) | y_1, …, y_n) / π(θ^(i) | y_1, …, y_n),  (2.22)

      and the expectation of any function g(·) can then be approximated as

      E[g(θ) | y_1, …, y_n] ≈ ∑_{i=1}^N w^(i) g(θ^(i)) / ∑_{i=1}^N w^(i).  (2.23)
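    The steps (2.21)–(2.23) can be sketched as follows for a one-dimensional target. The target, the importance distribution π, and the use of the self-normalized form (which lets the normalization constants cancel) are choices made for illustration:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    # Assumed target: N(1, 0.5^2); importance distribution pi: wider N(0, 2^2).
    # Both appear only through unnormalized log densities, since the
    # normalization constants cancel in the self-normalized estimate (2.23).
    def log_target(theta):
        return -0.5 * (theta - 1.0) ** 2 / 0.25

    def log_pi(theta):
        return -0.5 * theta ** 2 / 4.0

    N = 200_000
    theta = 2.0 * rng.standard_normal(N)        # draws from pi, Eq. (2.21)
    logw = log_target(theta) - log_pi(theta)    # log importance weights, Eq. (2.22)
    w = np.exp(logw - logw.max())               # subtract max for numerical stability

    # Self-normalized estimate of E[theta], Eq. (2.23) with g(theta) = theta
    estimate = np.sum(w * theta) / np.sum(w)
    ```

    Working with log weights and subtracting the maximum before exponentiating is a standard precaution against underflow when the importance ratios vary over many orders of magnitude.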

    2.2 Batch and Recursive Estimation

    In order to understand the meaning and applicability of optimal filtering and its relationship with recursive estimation, it is useful to go through an example, where we solve a simple and familiar linear regression problem in a recursive manner. After that we shall generalize this concept to include a dynamic model in order to illustrate the differences between dynamic and batch estimation.

    2.2.1 Batch Linear Regression

    Consider the linear regression model

    y_k = θ_1 + θ_2 t_k + ε_k,  (2.24)

    where we assume that the measurement noise is zero mean Gaussian with a given variance, ε_k ∼ N(0, σ²), and the prior distribution of the parameters is Gaussian with known mean and covariance, θ ∼ N(m_0, P_0). In the classical linear regression problem we want to estimate the parameters θ = (θ_1  θ_2)^T from a set of measurement data D = {(y_1, t_1), …, (y_K, t_K)}. The measurement data and the true linear function used in the simulation are illustrated in Figure 2.1.

    In compact probabilistic notation the linear regression model can be written as

    p(y_k | θ) = N(y_k | H_k θ, σ²)
    p(θ) = N(θ | m_0, P_0),  (2.25)

    where we have introduced the matrix H_k = (1  t_k) and N(·) denotes the Gaussian probability density function (see Appendix A.1). The likelihood of y_k is, of course, conditional also on the regressor t_k (or equivalently H_k), but we will not


    Figure 2.1: The underlying truth and the measurement data in the simple linear regression problem.

    denote this dependence explicitly in order to simplify the notation; from now on this dependence is assumed to be understood from the context.

    The batch solution to this linear regression problem can be obtained by a straightforward application of the Bayes' rule:

    p(θ | y_1:K) ∝ p(θ) ∏_k p(y_k | θ)
                = N(θ | m_0, P_0) ∏_k N(y_k | H_k θ, σ²).

    In the posterior distribution above we also assume the conditioning on t_k and H_k, but do not denote it explicitly. Thus the posterior distribution is denoted to be conditional on y_1:K = {y_1, …, y_K}, and not on the data set D containing the regressor values t_k also. The reason for this simplification is that the simplified notation also works in more general filtering problems, where there is no natural way of defining the associated regressor variables.

    Because the prior and likelihood are Gaussian, the posterior distribution will also be Gaussian:

    p(θ | y_1:K) = N(θ | m_K, P_K).  (2.26)

    The mean and covariance can be obtained by completing the quadratic form in the


    exponent, which gives:

    m_K = [P_0^{−1} + (1/σ²) H^T H]^{−1} [(1/σ²) H^T y + P_0^{−1} m_0]
    P_K = [P_0^{−1} + (1/σ²) H^T H]^{−1},  (2.27)

    where H_k = (1  t_k) and

    H = ⎡ H_1 ⎤   ⎡ 1  t_1 ⎤
        ⎢  ⋮  ⎥ = ⎢ ⋮   ⋮  ⎥ ,    y = ⎡ y_1 ⎤
        ⎣ H_K ⎦   ⎣ 1  t_K ⎦           ⎢  ⋮  ⎥
                                       ⎣ y_K ⎦ .   (2.28)

    Figure 2.2 shows the result of batch linear regression, where the posterior mean parameter values are used as the linear regression parameters.
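    Equations (2.26)–(2.27) can be evaluated with a few lines of linear algebra. In the following sketch the true parameter values, noise variance, and prior used to simulate the data are assumptions for illustration:

    ```python
    import numpy as np

    rng = np.random.default_rng(3)

    # Simulated data; the true parameters and noise level are assumptions.
    K = 50
    t = np.linspace(0.0, 1.0, K)
    theta_true = np.array([1.0, 0.5])              # intercept and slope
    sigma2 = 0.01
    y = theta_true[0] + theta_true[1] * t + np.sqrt(sigma2) * rng.standard_normal(K)

    H = np.column_stack([np.ones(K), t])           # rows H_k = (1  t_k), Eq. (2.28)
    m0 = np.zeros(2)
    P0 = np.eye(2)                                 # mildly regularizing prior

    # Posterior mean and covariance from Eq. (2.27)
    P0_inv = np.linalg.inv(P0)
    PK = np.linalg.inv(P0_inv + H.T @ H / sigma2)
    mK = PK @ (H.T @ y / sigma2 + P0_inv @ m0)
    ```

    With an informative data set the posterior mean lands close to the parameters used in the simulation, while the prior contributes only a slight regularization.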

    Figure 2.2: The result of simple linear regression with a slight regularization prior used for the regression parameters. For simplicity, the variance was assumed to be known.

    2.2.2 Recursive Linear Regression

    The recursive solution to the regression problem (2.25) can be obtained by assuming that we have already obtained the posterior distribution conditioned on the previous measurements 1, …, k − 1:

    p(θ | y_1:k−1) = N(θ | m_{k−1}, P_{k−1}).


    Now assume that we have obtained a new measurement y_k and we want to compute the posterior distribution of θ given the old measurements y_1:k−1 and the new measurement y_k. According to the model specification the new measurement has the likelihood

    p(y_k | θ) = N(y_k | H_k θ, σ²).

    Using the batch version equations such that we interpret the previous posterior as the prior, we can calculate the distribution

    p(θ | y_1:k) ∝ p(y_k | θ) p(θ | y_1:k−1) ∝ N(θ | m_k, P_k),  (2.29)

    where the Gaussian distribution parameters are

    m_k = [P_{k−1}^{−1} + (1/σ²) H_k^T H_k]^{−1} [(1/σ²) H_k^T y_k + P_{k−1}^{−1} m_{k−1}]
    P_k = [P_{k−1}^{−1} + (1/σ²) H_k^T H_k]^{−1}.  (2.30)

    By using the matrix inversion lemma, the covariance calculation can be written as

    P_k = P_{k−1} − P_{k−1} H_k^T [H_k P_{k−1} H_k^T + σ²]^{−1} H_k P_{k−1}.

    By introducing the temporary variables S_k and K_k, the calculation of the mean and covariance can be written in the form

    S_k = H_k P_{k−1} H_k^T + σ²
    K_k = P_{k−1} H_k^T S_k^{−1}
    m_k = m_{k−1} + K_k [y_k − H_k m_{k−1}]
    P_k = P_{k−1} − K_k S_k K_k^T.  (2.31)

    Note that S_k = H_k P_{k−1} H_k^T + σ² is a scalar, because the measurements are scalar, and thus no matrix inversion is required.

    The equations above are actually special cases of the Kalman filter update equations. Only the update part of the equations is required, because the estimated parameters are assumed to be constant, that is, there is no a priori stochastic dynamics model for the parameters θ. Figure 2.3 illustrates the convergence of the means and variances of the parameters during the recursive estimation.
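    The update form (2.31) is straightforward to implement. The sketch below (with assumed data and prior) processes the measurements one at a time and, for a static parameter, reproduces the batch posterior (2.27) exactly:

    ```python
    import numpy as np

    rng = np.random.default_rng(4)

    # Assumed data for illustration
    K, sigma2 = 30, 0.04
    t = np.linspace(0.0, 1.0, K)
    y = 1.0 + 0.5 * t + np.sqrt(sigma2) * rng.standard_normal(K)

    m = np.zeros(2)                 # prior mean m_0
    P = np.eye(2)                   # prior covariance P_0
    for k in range(K):
        Hk = np.array([[1.0, t[k]]])            # 1 x 2 measurement matrix
        S = (Hk @ P @ Hk.T).item() + sigma2     # scalar innovation variance, Eq. (2.31)
        Kk = P @ Hk.T / S                       # 2 x 1 gain
        m = m + (Kk * (y[k] - (Hk @ m).item())).ravel()
        P = P - Kk @ Kk.T * S                   # P_{k-1} - K_k S_k K_k^T

    # Batch posterior (2.27) with the same prior, for comparison
    H = np.column_stack([np.ones(K), t])
    P_batch = np.linalg.inv(np.eye(2) + H.T @ H / sigma2)
    m_batch = P_batch @ (H.T @ y / sigma2)
    ```

    Because S_k is scalar here, the gain involves only a division, exactly as the text notes.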

    2.2.3 Batch vs. Recursive Estimation

    In this section we shall generalize the recursion idea used in the previous section to general probabilistic models. The underlying idea is simply that at each measurement we treat the posterior distribution of the previous time step as the prior for the current time step. This way we can compute the same solution in a recursive manner


    that we would obtain by direct application of the Bayes' rule to the whole (batch) data set.

    The batch Bayesian solution to a statistical estimation problem can be formulated as follows:

    1. Specify the likelihood model of the measurements p(y_k | θ) given the parameter θ. Typically the measurements y_k are assumed to be conditionally independent such that

       p(y_1:K | θ) = ∏_k p(y_k | θ).

    2. The prior information about the parameter θ is encoded into the prior distribution p(θ).

    3. The observed data set is D = {(t_1, y_1), …, (t_K, y_K)}, or if we drop the explicit conditioning on t_k, the data is D = y_1:K.

    4. The batch Bayesian solution to the statistical estimation problem can be computed by applying the Bayes' rule:

       p(θ | y_1:K) = (1/Z) p(θ) ∏_k p(y_k | θ).

    For example, the batch solution of the above kind to the linear regression problem (2.25) was given by Equations (2.26) and (2.27).

    The recursive Bayesian solution to the above statistical estimation problem can be formulated as follows:

    1. The distribution of the measurements is again modeled by the likelihood function p(y_k | θ) and the measurements are assumed to be conditionally independent.

    2. At the beginning of estimation (i.e., at step 0), all the information we have about the parameter θ is the prior distribution p(θ).

    3. The measurements are assumed to be obtained one at a time, first y_1, then y_2 and so on. At each step we use the posterior distribution from the previous time step as the current prior distribution:

       p(θ | y_1) = (1/Z_1) p(y_1 | θ) p(θ)
       p(θ | y_1:2) = (1/Z_2) p(y_2 | θ) p(θ | y_1)
       p(θ | y_1:3) = (1/Z_3) p(y_3 | θ) p(θ | y_1:2)
         ⋮
       p(θ | y_1:K) = (1/Z_K) p(y_K | θ) p(θ | y_1:K−1).


    It is easy to show that the posterior distribution at the final step above is exactly the posterior distribution obtained by the batch solution. Also, reordering of the measurements does not change the final solution.

    For example, Equations (2.29) and (2.30) give the one-step update rule for the linear regression problem in Equation (2.25).
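    The update chain above can be sketched on a discrete parameter grid, where each normalization constant Z_k is a simple sum; the flat prior, the Gaussian likelihood, and the measurements are assumed for illustration. The final posterior is independent of the processing order:

    ```python
    import numpy as np

    rng = np.random.default_rng(5)

    theta = np.linspace(-3.0, 3.0, 601)        # discrete grid over the parameter

    def likelihood(yk, s2=0.5):
        # Assumed Gaussian measurement model y_k = theta + noise
        return np.exp(-0.5 * (yk - theta) ** 2 / s2)

    y = 0.7 + rng.standard_normal(5)           # assumed measurements

    def recursive_posterior(ys):
        p = np.ones_like(theta)                # flat prior on the grid
        for yk in ys:
            p = likelihood(yk) * p             # previous posterior becomes the prior
            p = p / p.sum()                    # normalization, the 1/Z_k factor
        return p

    post = recursive_posterior(y)
    post_reordered = recursive_posterior(y[::-1])
    ```

    Running the recursion forwards and backwards over the same measurements yields the same posterior, which numerically illustrates the order-independence noted above.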

    The recursive formulation of Bayesian estimation has many useful properties:

    • The recursive solution can be considered as the online learning solution to the Bayesian learning problem. That is, the information about the parameters is updated in an online manner using new pieces of information as they arrive.

    • Because each step in the recursive estimation is a full Bayesian update step, batch Bayesian inference is a special case of recursive Bayesian inference.

    • Due to the sequential nature of the estimation we can also model the effect of time on the parameters. That is, we can build a model of what happens to the parameter θ between the measurements – this is actually the basis of filtering theory, where the time behavior is modeled by assuming the parameter to be a time-dependent stochastic process θ(t).

    2.3 Towards Bayesian Filtering

    Now that we are able to solve the static linear regression problem in a recursive manner, we can proceed towards Bayesian filtering by allowing the parameters to change between the measurements. By generalizing this idea, we encounter the Kalman filter, which is the workhorse of dynamic estimation.

    2.3.1 Drift Model for Linear Regression

    Assume that we have a similar linear regression model as in Equation (2.25), but the parameter θ is allowed to perform a Gaussian random walk between the measurements:

    p(y_k | θ_k) = N(y_k | H_k θ_k, σ²)
    p(θ_k | θ_{k−1}) = N(θ_k | θ_{k−1}, Q)
    p(θ_0) = N(θ_0 | m_0, P_0),  (2.32)

    where Q is the covariance of the random walk. Now, given the distribution

    p(θ_{k−1} | y_1:k−1) = N(θ_{k−1} | m_{k−1}, P_{k−1}),

    the joint distribution of θ_k and θ_{k−1} is¹

    p(θ_k, θ_{k−1} | y_1:k−1) = p(θ_k | θ_{k−1}) p(θ_{k−1} | y_1:k−1).

    ¹ Note that this formula is correct only for Markovian dynamic models, where p(θ_k | θ_{k−1}, y_1:k−1) = p(θ_k | θ_{k−1}).


    2.3.2 Kalman Filter for Linear Model with Drift

    The linear model with drift in the previous section had the disadvantage that the covariates t_k occurred explicitly in the model specification. The problem with this is that as we get more and more measurements, the covariate t_k grows without bound. Thus the conditioning of the problem also gets worse in time. For practical reasons it would also be desirable to have a time-invariant model, that is, a model which does not depend on the absolute time, but only on the relative positions of states and measurements in time.

    The alternative state space formulation of the linear model with drift, without explicit covariates, can be done as follows. Let us denote the time difference between consecutive times as Δt_{k−1} = t_k − t_{k−1}. The idea is that if the underlying phenomenon (signal, state, parameter) x_k were exactly linear, the difference between adjacent time points could be written exactly as

    x_k − x_{k−1} = ẋ Δt_{k−1},  (2.35)

    where ẋ is the derivative, which is constant in the exactly linear case. The divergence from the exactly linear function can be modeled by assuming that the above equation does not hold exactly, but that there is a small noise term on the right hand side. The derivative can also be assumed to perform a small random walk and thus not be exactly constant. This model can be written as follows:

    x_{1,k} = x_{1,k−1} + Δt_{k−1} x_{2,k−1} + w_1
    x_{2,k} = x_{2,k−1} + w_2
    y_k = x_{1,k} + e,  (2.36)

    where the signal is the first component x_{1,k} of the state and the derivative is the second component x_{2,k}. The noises are e ∼ N(0, σ²) and (w_1; w_2) ∼ N(0, Q). The model can also be written in the form

    p(y_k | x_k) = N(y_k | H x_k, σ²)
    p(x_k | x_{k−1}) = N(x_k | A_{k−1} x_{k−1}, Q),  (2.37)

    where

    A_{k−1} = ⎡ 1  Δt_{k−1} ⎤ ,    H = (1  0).
              ⎣ 0      1     ⎦

    With a suitable Q this model is actually equivalent to the model (2.32), but in this formulation we explicitly estimate the state of the signal (a point on the regression line) instead of the linear regression parameters.

    We could now explicitly derive the recursion equations in the same manner as we did in the previous sections. However, we can also use the Kalman filter, which is a readily derived recursive solution to generic linear Gaussian models of the form

    p(y_k | x_k) = N(y_k | H_k x_k, R_k)
    p(x_k | x_{k−1}) = N(x_k | A_{k−1} x_{k−1}, Q_{k−1}).


    Our alternative linear regression model in Equation (2.36) can be seen to be a special case of these models. The Kalman filter equations are often expressed as prediction and update steps as follows:

    1. Prediction step:

       m_k^− = A_{k−1} m_{k−1}
       P_k^− = A_{k−1} P_{k−1} A_{k−1}^T + Q_{k−1}.

    2. Update step:

       S_k = H_k P_k^− H_k^T + R_k
       K_k = P_k^− H_k^T S_k^{−1}
       m_k = m_k^− + K_k [y_k − H_k m_k^−]
       P_k = P_k^− − K_k S_k K_k^T.
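    The prediction and update steps can be sketched for the locally linear model (2.36)–(2.37) as follows; the time step, the noise covariances, and the simulated sine measurements are illustrative assumptions, not the values used in the figures:

    ```python
    import numpy as np

    rng = np.random.default_rng(6)

    # Locally linear state space model (2.36)-(2.37); the time step,
    # noise covariances and data below are illustrative assumptions.
    dt, n_steps = 0.01, 200
    A = np.array([[1.0, dt], [0.0, 1.0]])
    H = np.array([[1.0, 0.0]])
    Q = np.diag([1e-4, 0.2])                   # process noise covariance
    R = np.array([[0.01]])                     # measurement noise covariance

    tt = dt * np.arange(n_steps)
    true_signal = np.sin(2.0 * np.pi * tt)
    y = true_signal + 0.1 * rng.standard_normal(n_steps)

    m = np.zeros(2)
    P = np.eye(2)
    filtered = np.empty(n_steps)
    for k in range(n_steps):
        # Prediction step
        m = A @ m
        P = A @ P @ A.T + Q
        # Update step
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        m = m + K @ (y[k] - H @ m)
        P = P - K @ S @ K.T
        filtered[k] = m[0]
    ```

    The first state component tracks the noisy sine signal while the second component estimates its local slope; tuning Q trades responsiveness against smoothness.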

    The result of tracking the sine signal with the Kalman filter is shown in Figure 2.5. All the mean and covariance calculation equations given in this document so far have been special cases of the above equations, including the batch solution to the scalar measurement case (which is a one-step solution). The Kalman filter recursively computes the mean and covariance of the posterior distributions of the form

    p(x_k | y_1, …, y_k) = N(x_k | m_k, P_k).

    Note that the estimates of x_k derived from this distribution are non-anticipative in the sense that they are only conditional on measurements obtained before and at the time step k. However, after we have obtained the measurements y_1, …, y_k, we could compute estimates of x_{k−1}, x_{k−2}, …, which are also conditional on the measurements after the corresponding state time steps. Because more measurements and more information are available to the estimator, these estimates can be expected to be more accurate than the non-anticipative estimates computed by the filter.

    The above mentioned problem of computing estimates of the state by conditioning not only on previous measurements but also on future measurements is called optimal smoothing, as already mentioned in Section 1.2.3. The optimal smoothing solution to linear Gaussian state space models is given by the Rauch-Tung-Striebel smoother. The full Bayesian theory of optimal smoothing as well as the related algorithms will be presented in Chapter 4.

    It is also possible to predict the time behavior of the state in the future, which we have not yet measured. This procedure is called optimal prediction. Because optimal prediction can always be done by iterating the prediction step of the optimal filter, no specialized algorithms are needed for it.

    The non-linear generalizations of optimal prediction, filtering and smoothing can be obtained by replacing the Gaussian distributions and linear functions in


    model (2.37) with non-Gaussian and non-linear ones. The Bayesian dynamic estimation theory described in this document can be applied to generic non-linear filtering models of the following form:

    y_k ∼ p(y_k | x_k)
    x_k ∼ p(x_k | x_{k−1}).

    To understand the generality of this model, it is useful to note that if we dropped the time-dependence from the state, we would get the model

    y_k ∼ p(y_k | x)
    x ∼ p(x).

    Because x denotes an arbitrary set of parameters or hyper-parameters of the system, all static Bayesian models are special cases of this model. Thus in the dynamic estimation context we extend the static models by allowing a Markov model for the time behavior of the (hyper)parameters.

    The Markovianity is also less of a restriction than it sounds, because what we have is a vector valued Markov process, not a scalar one. The reader may recall from elementary calculus that differential equations of arbitrary order can always be transformed into vector valued differential equations of first order. In an analogous manner, Markov processes of arbitrary order can be transformed into vector valued first order Markov processes.
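    The order reduction works for discrete-time processes in exactly the same way. The sketch below rewrites an assumed second-order autoregression in companion (first-order vector) form and checks that the two formulations produce the same sequence:

    ```python
    import numpy as np

    rng = np.random.default_rng(7)

    # Assumed second-order process x_k = a1 x_{k-1} + a2 x_{k-2} + w_k
    a1, a2, q, n = 0.5, 0.2, 0.1, 50
    w = np.sqrt(q) * rng.standard_normal(n)

    # First-order vector form: z_k = (x_k, x_{k-1}) with a companion matrix
    F = np.array([[a1, a2],
                  [1.0, 0.0]])
    z = np.zeros(2)
    x_vec = np.zeros(n)
    for k in range(n):
        z = F @ z + np.array([w[k], 0.0])
        x_vec[k] = z[0]

    # Direct second-order recursion with the same noise, for comparison
    x_dir = np.zeros(n)
    for k in range(n):
        xm1 = x_dir[k - 1] if k >= 1 else 0.0
        xm2 = x_dir[k - 2] if k >= 2 else 0.0
        x_dir[k] = a1 * xm1 + a2 * xm2 + w[k]
    ```

    The stacked state z_k = (x_k, x_{k−1}) depends only on z_{k−1}, so the augmented process is first-order Markov even though the scalar process is not.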


    Figure 2.3: (a) Convergence of the recursive linear regression means. The final value is exactly the same as that obtained with batch linear regression. Note that time has been scaled to 1 at k = K. (b) Convergence of the variances plotted on a logarithmic scale. As can be seen, every measurement brings more information and the uncertainty decreases monotonically. The final values are the same as the variances obtained from the batch solution.


    Figure 2.4: Example of tracking a sine signal with the linear model with drift, where the parameters are allowed to vary according to a Gaussian random walk model.

    Figure 2.5: Example of tracking a sine signal with the locally linear state space model. The result differs a bit from the random walk parameter model because of a slightly different choice of process noise. It could be made equivalent if desired.


    Chapter 3

    Optimal Filtering

    In this chapter we first present the classical formulation of discrete-time optimal filtering as recursive Bayesian inference. Then the classical Kalman filters, extended Kalman filters and statistical linearization based filters are presented in terms of the general theory. In addition to the classical algorithms, the unscented Kalman filter, general Gaussian filters, Gauss-Hermite Kalman filters and cubature Kalman filters are also presented. Sequential importance resampling based particle filtering, as well as Rao-Blackwellized particle filtering, are also covered.

    For more information, the reader is referred to the various articles and books cited in the appropriate sections. The following books also contain useful information on the subject:

    • Classic books: Lee (1964); Bucy and Joseph (1968); Meditch (1969); Jazwinski (1970); Sage and Melsa (1971); Gelb (1974); Anderson and Moore (1979); Maybeck (1979, 1982a)

    • More recent books on linear and non-linear Kalman filtering: Bar-Shalom et al. (2001); Grewal and Andrews (2001); Crassidis and Junkins (2004)

    • Recent books that also cover particle filters: Ristic et al. (2004); Candy (2009)

    3.1 Formal Filtering Equations and Exact Solutions

    3.1.1 Probabilistic State Space Models

    Before going into the practical non-linear filtering algorithms, the theory of probabilistic (Bayesian) filtering is presented in the next sections. The Kalman filtering equations, which are the closed form solution to the linear Gaussian discrete-time optimal filtering problem, are also derived.

    Definition 3.1 (State space model). A discrete-time state space model or probabilistic non-linear filtering model is a recursively defined probabilistic model of the


    form

    x_k ∼ p(x_k | x_{k−1})
    y_k ∼ p(y_k | x_k),  (3.1)

    where

    • x_k ∈ ℝ^n is the state of the system on the time step k,

    • y_k ∈ ℝ^m is the measurement on the time step k,

    • p(x_k | x_{k−1}) is the dynamic model, which models the stochastic dynamics of the system. The dynamic model can be a probability density, a counting measure or a combination of them, depending on whether the state x_k is continuous, discrete or hybrid.

    • p(y_k | x_k) is the measurement model, which models the distribution of the measurements given the state.

    The model is assumed to be Markovian, which means that it has the following two properties:

    Property 3.1 (Markov property of states). The states {x_k : k = 1, 2, …} form a Markov sequence (or Markov chain if the state is discrete). This Markov property means that x_k (and actually the whole future x_{k+1}, x_{k+2}, …) given x_{k−1} is independent of anything that has happened in the past:

    p(x_k | x_1:k−1, y_1:k−1) = p(x_k | x_{k−1}).  (3.2)

    Also the past is independent of the future given the present:

    p(x_{k−1} | x_k:T, y_k:T) = p(x_{k−1} | x_k).  (3.3)

    Property 3.2 (Conditional independence of measurements). The measurement y_k given the state x_k is conditionally independent of the measurement and state histories:

    p(y_k | x_1:k, y_1:k−1) = p(y_k | x_k).  (3.4)

    A simple example of a Markovian sequence is the Gaussian random walk. When this is combined with noisy measurements, we obtain an example of a probabilistic state space model:

    When this is combined with noisy measurements, we obtain an example of a prob-abilistic state space model:

    Example 3.1 (Gaussian random walk). The Gaussian random walk model can be written as

    x_k = x_{k−1} + w_{k−1},  w_{k−1} ∼ N(0, q)
    y_k = x_k + e_k,          e_k ∼ N(0, r),  (3.5)
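    The model (3.5) is easy to simulate; in the following sketch the variances q and r, the initial state x_0 = 0, and the sequence length are assumed values:

    ```python
    import numpy as np

    rng = np.random.default_rng(8)

    # Simulation of the Gaussian random walk model (3.5);
    # q, r and the initial state x_0 = 0 are assumed values.
    q, r, n = 0.1, 0.5, 10_000
    w = np.sqrt(q) * rng.standard_normal(n)
    x = np.cumsum(w)                             # x_k = x_{k-1} + w_{k-1}
    y = x + np.sqrt(r) * rng.standard_normal(n)  # y_k = x_k + e_k
    ```

    The increments of x have variance q and the measurement errors have variance r, which gives a quick sanity check on the simulation.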


    3.1.2 Optimal Filtering Equations

    The purpose of optimal filtering is to compute the marginal posterior distribution of the state x_k on the time step k given the history of the measurements up to the time step k:

    p(x_k | y_1:k).  (3.10)

    The fundamental equations of the Bayesian filtering theory are given by the following theorem:

    Theorem 3.1 (Bayesian optimal filtering equations). The recursive equations for computing the predicted distribution p(x_k | y_1:k−1) and the filtering distribution p(x_k | y_1:k) on the time step k are given by the following Bayesian filtering equations:

    • Initialization. The recursion starts from the prior distribution p(x_0).

    • Prediction. The predictive distribution of the state x_k on time step k given the dynamic model can be computed by the Chapman-Kolmogorov equation

      p(x_k | y_1:k−1) = ∫ p(x_k | x_{k−1}) p(x_{k−1} | y_1:k−1) dx_{k−1}.  (3.11)

    • Update. Given the measurement y_k on time step k, the posterior distribution of the state x_k can be computed by the Bayes' rule

      p(x_k | y_1:k) = (1/Z_k) p(y_k | x_k) p(x_k | y_1:k−1),  (3.12)

      where the normalization constant Z_k is given as

      Z_k = ∫ p(y_k | x_k) p(x_k | y_1:k−1) dx_k.  (3.13)

    If some of the components of the state are discrete, the corresponding integrals are replaced with summations.

    Proof. The joint distribution of x_k and x_{k-1} given y_{1:k-1} can be computed as

        p(x_k, x_{k-1} | y_{1:k-1}) = p(x_k | x_{k-1}, y_{1:k-1}) p(x_{k-1} | y_{1:k-1})
                                    = p(x_k | x_{k-1}) p(x_{k-1} | y_{1:k-1}),    (3.14)

    where the disappearance of the measurement history y_{1:k-1} is due to the Markov property of the sequence {x_k, k = 1, 2, ...}. The marginal distribution of x_k given y_{1:k-1} can be obtained by integrating the distribution (3.14) over x_{k-1}, which gives the Chapman-Kolmogorov equation

        p(x_k | y_{1:k-1}) = ∫ p(x_k | x_{k-1}) p(x_{k-1} | y_{1:k-1}) dx_{k-1}.    (3.15)


    Figure 3.1: Visualization of the prediction step: the prediction propagates the state distribution of the previous measurement step through the dynamic model such that the uncertainties (stochastic terms) in the dynamic model are taken into account.

    Figure 3.2: Visualization of the update step: (a) Prior distribution from prediction and the likelihood of measurement just before the update step. (b) The posterior distribution after combining the prior and likelihood by Bayes' rule.

    If x_{k-1} is discrete, the above integral is replaced with a sum over x_{k-1}. The distribution of x_k given y_k and y_{1:k-1}, that is, given y_{1:k}, can be computed by Bayes' rule

        p(x_k | y_{1:k}) = (1/Z_k) p(y_k | x_k, y_{1:k-1}) p(x_k | y_{1:k-1})
                         = (1/Z_k) p(y_k | x_k) p(x_k | y_{1:k-1}),    (3.16)


    where the normalization constant is given by Equation (3.13). The disappearance of the measurement history y_{1:k-1} in Equation (3.16) is due to the conditional independence of y_k from the measurement history, given x_k.
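When the state space is finite, the integrals of Theorem 3.1 reduce to sums and one recursion of the filter can be written out directly. The following sketch is not from the text; the two-state transition and likelihood values are hypothetical, chosen only to illustrate equations (3.11)-(3.13).

```python
import numpy as np

def bayes_filter_step(prior, trans, lik):
    """One recursion of Theorem 3.1 for a discrete state space.
    prior[i]    = p(x_{k-1} = i | y_{1:k-1})
    trans[i, j] = p(x_k = j | x_{k-1} = i)
    lik[j]      = p(y_k | x_k = j)
    Returns the predicted and filtering distributions."""
    pred = prior @ trans        # Chapman-Kolmogorov sum, eq. (3.11)
    unnorm = lik * pred         # numerator of Bayes' rule, eq. (3.12)
    Z = unnorm.sum()            # normalization constant Z_k, eq. (3.13)
    return pred, unnorm / Z

# Hypothetical two-state example.
prior = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
lik = np.array([0.7, 0.1])
pred, post = bayes_filter_step(prior, trans, lik)
```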

    3.1.3 Kalman Filter

    The Kalman filter (Kalman, 1960b) is the closed form solution to the optimal filtering equations of the discrete-time filtering model where the dynamic and measurement models are linear Gaussian:

        x_k = A_{k-1} x_{k-1} + q_{k-1}
        y_k = H_k x_k + r_k,    (3.17)

    where x_k ∈ R^n is the state, y_k ∈ R^m is the measurement, q_{k-1} ~ N(0, Q_{k-1}) is the process noise, r_k ~ N(0, R_k) is the measurement noise, and the prior distribution is Gaussian, x_0 ~ N(m_0, P_0). The matrix A_{k-1} is the transition matrix of the dynamic model and H_k is the measurement model matrix. In probabilistic terms the model is

        p(x_k | x_{k-1}) = N(x_k | A_{k-1} x_{k-1}, Q_{k-1})
        p(y_k | x_k)     = N(y_k | H_k x_k, R_k).    (3.18)

    Algorithm 3.1 (Kalman filter). The optimal filtering equations for the linear filtering model (3.17) can be evaluated in closed form and the resulting distributions are Gaussian:

        p(x_k | y_{1:k-1}) = N(x_k | m_k^-, P_k^-)
        p(x_k | y_{1:k})   = N(x_k | m_k, P_k)
        p(y_k | y_{1:k-1}) = N(y_k | H_k m_k^-, S_k).    (3.19)

    The parameters of the distributions above can be computed with the following Kalman filter prediction and update steps:

    • The prediction step is

          m_k^- = A_{k-1} m_{k-1}
          P_k^- = A_{k-1} P_{k-1} A_{k-1}^T + Q_{k-1}.    (3.20)

    • The update step is

          v_k = y_k − H_k m_k^-
          S_k = H_k P_k^- H_k^T + R_k
          K_k = P_k^- H_k^T S_k^{-1}
          m_k = m_k^- + K_k v_k
          P_k = P_k^- − K_k S_k K_k^T.    (3.21)


    The initial state has a given Gaussian prior distribution x_0 ~ N(m_0, P_0), which also defines the initial mean and covariance.

    The Kalman filter equations can be derived as follows:

    1. By Lemma A.1 on page 103, the joint distribution of x_k and x_{k-1} given y_{1:k-1} is

           p(x_{k-1}, x_k | y_{1:k-1}) = p(x_k | x_{k-1}) p(x_{k-1} | y_{1:k-1})
                                       = N(x_k | A_{k-1} x_{k-1}, Q_{k-1}) N(x_{k-1} | m_{k-1}, P_{k-1})
                                       = N( (x_{k-1}, x_k) | m', P' ),    (3.22)

       where

           m' = ( m_{k-1}
                  A_{k-1} m_{k-1} ),

           P' = ( P_{k-1}           P_{k-1} A_{k-1}^T
                  A_{k-1} P_{k-1}   A_{k-1} P_{k-1} A_{k-1}^T + Q_{k-1} ),    (3.23)

       and the marginal distribution of x_k is by Lemma A.2

           p(x_k | y_{1:k-1}) = N(x_k | m_k^-, P_k^-),    (3.24)

       where

           m_k^- = A_{k-1} m_{k-1},    P_k^- = A_{k-1} P_{k-1} A_{k-1}^T + Q_{k-1}.    (3.25)

    2. By Lemma A.1, the joint distribution of y_k and x_k is

           p(x_k, y_k | y_{1:k-1}) = p(y_k | x_k) p(x_k | y_{1:k-1})
                                   = N(y_k | H_k x_k, R_k) N(x_k | m_k^-, P_k^-)
                                   = N( (x_k, y_k) | m'', P'' ),    (3.26)

       where

           m'' = ( m_k^-
                   H_k m_k^- ),

           P'' = ( P_k^-         P_k^- H_k^T
                   H_k P_k^-     H_k P_k^- H_k^T + R_k ).    (3.27)

    3. By Lemma A.2 the conditional distribution of x_k is

           p(x_k | y_k, y_{1:k-1}) = p(x_k | y_{1:k}) = N(x_k | m_k, P_k),    (3.28)

       where

           m_k = m_k^- + P_k^- H_k^T (H_k P_k^- H_k^T + R_k)^{-1} [y_k − H_k m_k^-]
           P_k = P_k^- − P_k^- H_k^T (H_k P_k^- H_k^T + R_k)^{-1} H_k P_k^-,    (3.29)

       which can also be written in the form (3.21).
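The prediction and update steps (3.20)-(3.21) translate into NumPy almost line by line. The sketch below is an illustration added here, not part of the original text, and makes no claims about numerical robustness; the 1x1 example values are arbitrary.

```python
import numpy as np

def kf_predict(m, P, A, Q):
    """Kalman filter prediction step, eq. (3.20)."""
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    return m_pred, P_pred

def kf_update(m_pred, P_pred, y, H, R):
    """Kalman filter update step, eq. (3.21)."""
    v = y - H @ m_pred                    # innovation v_k
    S = H @ P_pred @ H.T + R              # innovation covariance S_k
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain K_k
    m = m_pred + K @ v
    P = P_pred - K @ S @ K.T
    return m, P

# One step of a scalar model written with 1x1 matrices.
m, P = np.array([0.0]), np.array([[1.0]])
A = H = np.array([[1.0]])
Q = R = np.array([[1.0]])
m_pred, P_pred = kf_predict(m, P, A, Q)
m, P = kf_update(m_pred, P_pred, np.array([2.0]), H, R)
```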


    The functional form of the Kalman filter equations given here is not the only possible one. From the point of view of numerical stability it would be better to work with matrix square roots of the covariances instead of plain covariance matrices. The theory and implementation details of such methods are well covered, for example, in the book of Grewal and Andrews (2001).

    Example 3.2 (Kalman filter for Gaussian random walk). Assume that we are observing measurements y_k of the Gaussian random walk model given in Example 3.1 and we want to estimate the state x_k at each time step. The information obtained up to time step k−1 is summarized by the Gaussian filtering density

        p(x_{k-1} | y_{1:k-1}) = N(x_{k-1} | m_{k-1}, P_{k-1}).    (3.30)

    The Kalman filter prediction and update equations are now given as

        m_k^- = m_{k-1}
        P_k^- = P_{k-1} + q
        m_k = m_k^- + (P_k^- / (P_k^- + r)) (y_k − m_k^-)
        P_k = P_k^- − (P_k^-)^2 / (P_k^- + r).    (3.31)

    Figure 3.3: Simulated signal and measurements of the Kalman filtering example (Example 3.2).


    Figure 3.4: Signal, measurements and filtering estimate of the Kalman filtering example (Example 3.2).

    3.2 Extended and Unscented Filtering

    In practical applications the dynamic and measurement processes are often not linear, and the Kalman filter cannot be applied as such. Nevertheless, the filtering distributions of such processes can often be approximated with Gaussian distributions. In this section, three types of methods for forming these Gaussian approximations are considered: the Taylor series based extended Kalman filters, the statistical linearization based statistically linearized filters, and the unscented transform based unscented Kalman filters.

    3.2.1 Linearization of Non-Linear Transforms

    Consider the following transformation of a Gaussian random variable x into another random variable y:

        x ~ N(m, P)
        y = g(x),    (3.32)

    where x ∈ R^n, y ∈ R^m, and g : R^n → R^m is a general non-linear function. Formally, the probability density of the random variable y is¹ (see, e.g., Gelman et al., 1995)

        p(y) = |J(y)| N(g^{-1}(y) | m, P),    (3.33)

    ¹This actually only applies to invertible g(·), but it can be easily generalized to the non-invertible case.


    where |J(y)| is the determinant of the Jacobian matrix of the inverse transform g^{-1}(y). However, it is not generally possible to handle this distribution directly, because it is non-Gaussian for all but linear g.

    A first order Taylor series based Gaussian approximation to the distribution of y can be formed as follows. If we let x = m + δx, where δx ~ N(0, P), we can form the Taylor series expansion of the function g(·) as follows:

        g(x) = g(m + δx) = g(m) + G_x(m) δx + Σ_i (1/2) δx^T G_xx^{(i)}(m) δx e_i + ...,    (3.34)

    where G_x(m) is the Jacobian matrix of g with elements

        [G_x(m)]_{j,j'} = ∂g_j(x) / ∂x_{j'} |_{x=m},    (3.35)

    and G_xx^{(i)}(m) is the Hessian matrix of g_i(·) evaluated at m:

        [G_xx^{(i)}(m)]_{j,j'} = ∂²g_i(x) / (∂x_j ∂x_{j'}) |_{x=m}.    (3.36)

    Here e_i = (0 ··· 0 1 0 ··· 0)^T is a vector with 1 at position i and zeros elsewhere, that is, the unit vector in the direction of coordinate axis i.

    The linear approximation is obtained by approximating the function by the first two terms of the Taylor series:

        g(x) ≈ g(m) + G_x(m) δx.    (3.37)

    Computing the expected value with respect to x gives

        E[g(x)] ≈ E[g(m) + G_x(m) δx]
                = g(m) + G_x(m) E[δx]
                = g(m).    (3.38)

    The covariance can then be approximated as

        E[(g(x) − E[g(x)]) (g(x) − E[g(x)])^T]
            ≈ E[(g(x) − g(m)) (g(x) − g(m))^T]
            ≈ E[(g(m) + G_x(m) δx − g(m)) (g(m) + G_x(m) δx − g(m))^T]
            = E[(G_x(m) δx) (G_x(m) δx)^T]
            = G_x(m) E[δx δx^T] G_x^T(m)
            = G_x(m) P G_x^T(m).    (3.39)
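The resulting approximation E[y] ≈ g(m), Cov[y] ≈ G_x(m) P G_x^T(m) can be coded generically. The sketch below is an added illustration; the polar-to-Cartesian transform and the numerical values are hypothetical, not taken from the text.

```python
import numpy as np

def linearized_moments(g, jac_g, m, P):
    """First order Taylor approximation to the moments of y = g(x),
    x ~ N(m, P): E[y] ≈ g(m), Cov[y] ≈ G P G^T (eqs. 3.38-3.39)."""
    G = jac_g(m)                 # Jacobian G_x evaluated at the mean
    return g(m), G @ P @ G.T

# Hypothetical example: (range, bearing) -> Cartesian coordinates.
g = lambda x: np.array([x[0] * np.cos(x[1]), x[0] * np.sin(x[1])])
jac = lambda x: np.array([[np.cos(x[1]), -x[0] * np.sin(x[1])],
                          [np.sin(x[1]),  x[0] * np.cos(x[1])]])
mu, S = linearized_moments(g, jac, np.array([1.0, 0.0]), 0.01 * np.eye(2))
```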


    Furthermore, in filtering models where the process noise is not additive, we often need to approximate transformations of the form

        x ~ N(m, P)
        q ~ N(0, Q)
        y = g(x, q),    (3.46)

    where x and q are uncorrelated random variables. The mean and covariance can now be computed by substituting the augmented vector (x, q) for the vector x in Equation (3.41). The joint Jacobian matrix can then be written as G_{x,q} = (G_x  G_q), where G_q is the Jacobian matrix of g(·) with respect to q and both Jacobian matrices are evaluated at x = m, q = 0. The approximations to the mean and covariance of the augmented transform as in Equation (3.41) are then given as

        E[g̃(x, q)] ≈ ( m
                        g(m, 0) )

        Cov[g̃(x, q)] ≈ ( I        0      ) ( P  0 ) ( I        0      )^T
                        ( G_x(m)   G_q(m) ) ( 0  Q ) ( G_x(m)   G_q(m) )

                      = ( P           P G_x^T(m)
                          G_x(m) P    G_x(m) P G_x^T(m) + G_q(m) Q G_q^T(m) ).    (3.47)

    The approximation above can be formulated as the following algorithm:

    Algorithm 3.3 (Linear approximation of non-additive transform). The linear approximation based Gaussian approximation to the joint distribution of x and the transformed random variable y = g(x, q), when x ~ N(m, P) and q ~ N(0, Q), is given as

        ( x )  ~  N( ( m    ),  ( P       C_L )
        ( y )        ( µ_L )    ( C_L^T   S_L ) ),    (3.48)

    where

        µ_L = g(m, 0)
        S_L = G_x(m) P G_x^T(m) + G_q(m) Q G_q^T(m)
        C_L = P G_x^T(m),    (3.49)

    G_x(m) is the Jacobian matrix of g with respect to x, evaluated at x = m, q = 0, with elements

        [G_x(m)]_{j,j'} = ∂g_j(x, q) / ∂x_{j'} |_{x=m, q=0},    (3.50)

    and G_q(m) is the corresponding Jacobian matrix with respect to q:

        [G_q(m)]_{j,j'} = ∂g_j(x, q) / ∂q_{j'} |_{x=m, q=0}.    (3.51)
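Algorithm 3.3 amounts to a few matrix products once the Jacobians at (m, 0) are available. In this added sketch they are passed in as precomputed matrices, and the one-dimensional model y = x + q with its example values is hypothetical.

```python
import numpy as np

def nonadditive_linear_approximation(g, Gx, Gq, m, P, Q):
    """Algorithm 3.3: Gaussian approximation to the joint distribution of
    x ~ N(m, P) and y = g(x, q) with q ~ N(0, Q), Jacobians at (m, 0)."""
    mu_L = g(m, np.zeros(Q.shape[0]))       # mean mu_L = g(m, 0)
    S_L = Gx @ P @ Gx.T + Gq @ Q @ Gq.T     # covariance of y
    C_L = P @ Gx.T                          # cross-covariance of x and y
    return mu_L, S_L, C_L

# Hypothetical scalar model y = x + q (written with 1x1 matrices).
g = lambda x, q: x + q
mu, S, C = nonadditive_linear_approximation(
    g, Gx=np.eye(1), Gq=np.eye(1),
    m=np.array([1.0]), P=np.array([[1.0]]), Q=np.array([[2.0]]))
```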


    3.2.2 Extended Kalman Filter (EKF)

    The extended Kalman filter (EKF) (see, e.g., Jazwinski, 1970; Maybeck, 1982a; Bar-Shalom et al., 2001; Grewal and Andrews, 2001) is an extension of the Kalman filter to non-linear optimal filtering problems. If the process and measurement noises can be assumed to be additive, the EKF model can be written as

        x_k = f(x_{k-1}) + q_{k-1}
        y_k = h(x_k) + r_k,    (3.52)

    where x_k ∈ R^n is the state, y_k ∈ R^m is the measurement, q_{k-1} ~ N(0, Q_{k-1}) is the Gaussian process noise, r_k ~ N(0, R_k) is the Gaussian measurement noise, f(·) is the dynamic model function and h(·) is the measurement model function. The functions f and h may also depend on the step number k, but for notational convenience this dependence is not denoted explicitly.

    The idea of the extended Kalman filter is to form Gaussian approximations

        p(x_k | y_{1:k}) ≈ N(x_k | m_k, P_k)    (3.53)

    to the filtering densities. In the EKF this is done by utilizing linear approximations to the non-linearities, and the result is the following algorithm:

    Algorithm 3.4 (Extended Kalman filter I). The prediction and update steps of the first order additive noise extended Kalman filter (EKF) are:

    • Prediction:

          m_k^- = f(m_{k-1})
          P_k^- = F_x(m_{k-1}) P_{k-1} F_x^T(m_{k-1}) + Q_{k-1}.    (3.54)

    • Update:

          v_k = y_k − h(m_k^-)
          S_k = H_x(m_k^-) P_k^- H_x^T(m_k^-) + R_k
          K_k = P_k^- H_x^T(m_k^-) S_k^{-1}
          m_k = m_k^- + K_k v_k
          P_k = P_k^- − K_k S_k K_k^T.    (3.55)

    These filtering equations can be derived by repeating the same steps as in the derivation of the Kalman filter in Section 3.1.3 and by applying Taylor series approximations at the appropriate steps:

    1. The joint distribution of x_k and x_{k-1} is non-Gaussian, but we can form a Gaussian approximation to it by applying the approximation Algorithm 3.2 to the function

           f(x_{k-1}) + q_{k-1},    (3.56)

       which results in the Gaussian approximation

           p(x_{k-1}, x_k | y_{1:k-1}) ≈ N( (x_{k-1}, x_k) | m', P' ),    (3.57)

       where

           m' = ( m_{k-1}
                  f(m_{k-1}) ),

           P' = ( P_{k-1}        P_{k-1} F_x^T
                  F_x P_{k-1}    F_x P_{k-1} F_x^T + Q_{k-1} ),    (3.58)

       and the Jacobian matrix F_x of f(x) is evaluated at x = m_{k-1}. The marginal mean and covariance of x_k are thus

           m_k^- = f(m_{k-1})
           P_k^- = F_x P_{k-1} F_x^T + Q_{k-1}.    (3.59)

    2. The joint distribution of y_k and x_k is also non-Gaussian, but we can again approximate it by applying Algorithm 3.2 to the function

           h(x_k) + r_k.    (3.60)

       We get the approximation

           p(x_k, y_k | y_{1:k-1}) ≈ N( (x_k, y_k) | m'', P'' ),    (3.61)

       where

           m'' = ( m_k^-
                   h(m_k^-) ),

           P'' = ( P_k^-         P_k^- H_x^T
                   H_x P_k^-     H_x P_k^- H_x^T + R_k ),    (3.62)

       and the Jacobian matrix H_x of h(x) is evaluated at x = m_k^-.

    3. By Lemma A.2 the conditional distribution of x_k is approximately

           p(x_k | y_k, y_{1:k-1}) ≈ N(x_k | m_k, P_k),    (3.63)

       where

           m_k = m_k^- + P_k^- H_x^T (H_x P_k^- H_x^T + R_k)^{-1} [y_k − h(m_k^-)]
           P_k = P_k^- − P_k^- H_x^T (H_x P_k^- H_x^T + R_k)^{-1} H_x P_k^-.    (3.64)
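One cycle of Algorithm 3.4 can be sketched generically as below. This is an added illustration; a useful sanity check, used in the usage example, is that with linear f and h the EKF reduces exactly to the Kalman filter.

```python
import numpy as np

def ekf_step(m, P, y, f, F_jac, h, H_jac, Q, R):
    """One prediction + update cycle of the first order additive-noise
    EKF, eqs. (3.54)-(3.55)."""
    # Prediction: propagate mean through f, covariance through the Jacobian.
    F = F_jac(m)
    m_pred = f(m)
    P_pred = F @ P @ F.T + Q
    # Update: linearize h around the predicted mean.
    H = H_jac(m_pred)
    v = y - h(m_pred)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    return m_pred + K @ v, P_pred - K @ S @ K.T

# Sanity check: linear f and h reproduce the Kalman filter result.
I = np.eye(1)
m, P = ekf_step(np.array([0.0]), I.copy(), np.array([2.0]),
                f=lambda x: x, F_jac=lambda x: I,
                h=lambda x: x, H_jac=lambda x: I,
                Q=I, R=I)
```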

    A more general non-additive noise EKF filtering model can be written as

        x_k = f(x_{k-1}, q_{k-1})
        y_k = h(x_k, r_k),    (3.65)

    where q_{k-1} ~ N(0, Q_{k-1}) and r_k ~ N(0, R_k) are the Gaussian process and measurement noises, respectively. Again, the functions f and h may also depend on the step number k.


    Algorithm 3.5 (Extended Kalman filter II). The prediction and update steps of the (first order) extended Kalman filter (EKF) in the non-additive noise case are:

    • Prediction:

          m_k^- = f(m_{k-1}, 0)
          P_k^- = F_x(m_{k-1}) P_{k-1} F_x^T(m_{k-1}) + F_q(m_{k-1}) Q_{k-1} F_q^T(m_{k-1}).    (3.66)

    • Update:

          v_k = y_k − h(m_k^-, 0)
          S_k = H_x(m_k^-) P_k^- H_x^T(m_k^-) + H_r(m_k^-) R_k H_r^T(m_k^-)
          K_k = P_k^- H_x^T(m_k^-) S_k^{-1}
          m_k = m_k^- + K_k v_k
          P_k = P_k^- − K_k S_k K_k^T,    (3.67)

    where the matrices F_x(m), F_q(m), H_x(m), and H_r(m) are the Jacobian matrices of f and h with respect to the state and noise, with elements

        [F_x(m)]_{j,j'} = ∂f_j(x, q) / ∂x_{j'} |_{x=m, q=0}    (3.68)

        [F_q(m)]_{j,j'} = ∂f_j(x, q) / ∂q_{j'} |_{x=m, q=0}    (3.69)

        [H_x(m)]_{j,j'} = ∂h_j(x, r) / ∂x_{j'} |_{x=m, r=0}    (3.70)

        [H_r(m)]_{j,j'} = ∂h_j(x, r) / ∂r_{j'} |_{x=m, r=0}.    (3.71)

    These filtering equations can be derived by repeating the same steps as in the derivation of the extended Kalman filter above, but using Algorithm 3.3 instead of Algorithm 3.2 for computing the approximations.

    The advantage of the EKF over other non-linear filtering methods is its relative simplicity compared to its performance. Linearization is a very common engineering way of constructing approximations to non-linear systems, and thus it is easy to understand and apply. A disadvantage is that because the EKF is based on a local linear approximation, it will not work in problems with considerable non-linearities. The filtering model is also restricted in the sense that only Gaussian noise processes are allowed, and thus the model cannot contain, for example, discrete valued random variables. The Gaussian restriction also prevents handling of hierarchical models or other models where significantly non-Gaussian distribution models would be needed.

    The EKF is also the only filtering algorithm presented in this document that formally requires the measurement model and dynamic model functions to be differentiable. This in itself may be a restriction, and in some cases it may simply be impossible to compute the required Jacobian matrices, which renders the use of the EKF impossible. Even when the Jacobian matrices exist and could be computed, actually deriving and programming them can be quite error prone and hard to debug.

    In the so called second order EKF the non-linearity is approximated by retaining the second order terms of the Taylor series expansion:

    Algorithm 3.6 (Extended Kalman filter III). The prediction and update steps of the second order extended Kalman filter are:

    • Prediction:

          m_k^- = f(m_{k-1}) + (1/2) Σ_i e_i tr[ F_xx^{(i)}(m_{k-1}) P_{k-1} ]

          P_k^- = F_x(m_{k-1}) P_{k-1} F_x^T(m_{k-1})
                  + (1/2) Σ_{i,i'} e_i e_{i'}^T tr[ F_xx^{(i)}(m_{k-1}) P_{k-1} F_xx^{(i')}(m_{k-1}) P_{k-1} ]
                  + Q_{k-1}.    (3.72)

    • Update:

          v_k = y_k − h(m_k^-) − (1/2) Σ_i e_i tr[ H_xx^{(i)}(m_k^-) P_k^- ]

          S_k = H_x(m_k^-) P_k^- H_x^T(m_k^-)
                + (1/2) Σ_{i,i'} e_i e_{i'}^T tr[ H_xx^{(i)}(m_k^-) P_k^- H_xx^{(i')}(m_k^-) P_k^- ]
                + R_k

          K_k = P_k^- H_x^T(m_k^-) S_k^{-1}
          m_k = m_k^- + K_k v_k
          P_k = P_k^- − K_k S_k K_k^T,    (3.73)

    where the matrices F_x(m) and H_x(m) are given by Equations (3.68) and (3.70). The matrices F_xx^{(i)}(m) and H_xx^{(i)}(m) are the Hessian matrices of f_i and h_i, respectively:

        [F_xx^{(i)}(m)]_{j,j'} = ∂²f_i(x) / (∂x_j ∂x_{j'}) |_{x=m}    (3.74)

        [H_xx^{(i)}(m)]_{j,j'} = ∂²h_i(x) / (∂x_j ∂x_{j'}) |_{x=m}.    (3.75)
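The trace-correction terms in the prediction step (3.72) are easy to get wrong on paper. As an added sanity check (not from the text), the sketch below implements only the prediction step and tests it on the scalar function f(x) = x², for which the exact moments of a Gaussian input are known: for x ~ N(1, 1) we have E[x²] = 2 and Var[x²] = 6.

```python
import numpy as np

def ekf2_predict(m, P, Q, f, F_jac, hessians):
    """Second order EKF prediction step, eq. (3.72).
    hessians(m) returns a list of Hessians, one per component of f."""
    F = F_jac(m)
    Hs = hessians(m)
    # Mean: first order term plus the 1/2 tr(F_xx P) correction per component.
    m_pred = f(m) + 0.5 * np.array([np.trace(Hi @ P) for Hi in Hs])
    # Covariance: linear term, pairwise trace corrections, process noise.
    P_pred = F @ P @ F.T + Q
    for i, Hi in enumerate(Hs):
        for j, Hj in enumerate(Hs):
            P_pred[i, j] += 0.5 * np.trace(Hi @ P @ Hj @ P)
    return m_pred, P_pred

# Scalar check: f(x) = x^2, x ~ N(1, 1); exact moments are E = 2, Var = 6.
m_pred, P_pred = ekf2_predict(
    np.array([1.0]), np.array([[1.0]]), np.array([[0.0]]),
    f=lambda x: x**2,
    F_jac=lambda x: np.array([[2.0 * x[0]]]),
    hessians=lambda x: [np.array([[2.0]])])
```

For quadratic functions of a Gaussian variable the second order prediction is in fact exact, which is why the check recovers the true moments.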


    The non-additive version can be derived in an analogous manner, but due to its complicated appearance it is not presented here.

    3.2.3 Statistical Linearization of Non-Linear Transforms

    In the statistically linearized filter (Gelb, 1974) the Taylor series approximation used in the EKF is replaced by statistical linearization. Recall the transformation problem considered in Section 3.2.1, which was stated as

        x ~ N(m, P)
        y = g(x).

    In statistical linearization we form a linear approximation to the transformation as follows:

        g(x) ≈ b + A δx,    (3.76)

    where δx = x − m, such that the mean squared error is minimized:

        MSE(b, A) = E[(g(x) − b − A δx)^T (g(x) − b − A δx)].    (3.77)

    Setting the derivatives with respect to b and A to zero gives

        b = E[g(x)]
        A = E[g(x) δx^T] P^{-1}.    (3.78)

    In this approximation of the transform g(x), b is now exactly the mean, and the approximate covariance is given as

        E[(g(x) − E[g(x)]) (g(x) − E[g(x)])^T] ≈ A P A^T
            = E[g(x) δx^T] P^{-1} E[g(x) δx^T]^T.    (3.79)

    We may now apply this approximation to the augmented function g̃(x) = (x; g(x)) in Equation (3.40) of Section 3.2.1, which gives the approximation

        E[g̃(x)] ≈ ( m
                     E[g(x)] )

        Cov[g̃(x)] ≈ ( P                E[g(x) δx^T]^T
                       E[g(x) δx^T]     E[g(x) δx^T] P^{-1} E[g(x) δx^T]^T )
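Unlike the Taylor series version, statistical linearization needs the expectations E[g(x)] and E[g(x) δx^T], which are often unavailable in closed form. A Monte Carlo sketch of (3.78), added here for illustration with the hypothetical choice g(x) = sin x, is:

```python
import numpy as np

def statistical_linearization_mc(g, m, P, n=100_000, seed=0):
    """Monte Carlo estimate of the statistical linearization (3.78):
    b = E[g(x)], A = E[g(x) dx^T] P^{-1}, for x ~ N(m, P)."""
    rng = np.random.default_rng(seed)
    xs = rng.multivariate_normal(m, P, size=n)
    dx = xs - m
    gx = np.array([g(x) for x in xs])       # samples of g(x), shape (n, dim_y)
    b = gx.mean(axis=0)                     # estimate of E[g(x)]
    A = (gx.T @ dx / n) @ np.linalg.inv(P)  # estimate of E[g(x) dx^T] P^{-1}
    return b, A

# For g(x) = sin(x) and x ~ N(0, 1) the exact values are
# b = 0 and A = exp(-1/2) ≈ 0.607.
b, A = statistical_linearization_mc(
    lambda x: np.array([np.sin(x[0])]), np.array([0.0]), np.array([[1.0]]))
```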