
CMOP Technical Report TR-07-001 © Copyright 2007 NSF Science and Technology Center for Coastal Margin Observation and Prediction

Sequential Data Assimilation with Sigma-point Kalman Filter on Low-dimensional Manifold

April 23, 2007

Zhengdong Lu, Todd K. Leen and Rudolph van der Merwe
Department of Computer Science & Electrical Engineering, OGI School of Science & Engineering, Oregon Health & Science University, Portland, OR 97006, USA

Sergey Frolov and Antonio M. Baptista
Department of Environmental & Biomolecular Systems, OGI School of Science & Engineering, Oregon Health & Science University, Portland, OR 97006, USA

Abstract

In order to address the highly nonlinear dynamics in estuary flow, we propose a novel data assimilation system based on components designed to accurately reflect nonlinear dynamics. The core of the system is a sigma-point Kalman filter coupled to a fast, neural network emulator for the flow dynamics. In order to be computationally feasible, the entire system operates on a low-dimensional subspace obtained by principal component projection. Our probabilistic latent state space analysis properly accounts for noise induced by the dimensionality reduction and by errors in the emulator for the flow dynamics. Experiments on a benchmark estuary problem show that our data assimilation method can significantly reduce prediction errors.

1 Introduction

The strongly non-linear dynamics encountered in estuarine flow present a significant challenge to data assimilation systems. The nonlinear flow dynamics, together with the desire to build portable data assimilation technology that can be applied to different problems without substantial model-building overhead, led us to construct the novel data assimilation system presented here.

We present an efficient data assimilation system with a rigorous probabilistic foundation that deals naturally and accurately with nonlinear flow. Our research is part of the CORIE (citation) environmental observation and forecasting system (EOFS) for the Columbia River estuary and near ocean. CORIE integrates a real-time network, a data management system, and advanced numerical models such as ELCIRC (Zhang and Baptista, 2005) or SELFE (Zhang et al., 2004). Through this integration, we seek to characterize and predict complex circulation and mixing processes in a system encompassing the lower river, the estuary and the near-ocean.

Figure 1: CORIE is a pilot environmental observation and forecasting system (EOFS) for the Columbia River. It integrates a real-time network, a data management system, advanced numerical models such as ELCIRC or SELFE, and an advanced data assimilation framework. [Diagram: bathymetry, external forcings (short waves, ocean tides & circulation, river discharges, atmospheric forcings such as wind, pressure and heat exchange) and sensor data (remote sensing satellite data, in-situ data from CORIE sensor platforms) feed the numerical codes (ELCIRC, SELFE) and the data assimilation / probabilistic sensor fusion components, which produce data products such as daily forecasts, historic hindcasts, and bio-constituents.]

Although variational assimilation is an icon of formal rigor with widespread successful application (Bennett, 1998), two features of oceanographic problems suggest a different approach. First, the strong non-linearity of the CORIE estuarine dynamics suggested that iterative inference (Kalman filtering) may be better suited. Secondly, development of an adjoint numerical code for CORIE represents a significant development overhead that is not portable to unrelated assimilation problems. In contrast, the algorithm we propose in this paper is based on off-the-shelf neural network and Kalman filtering algorithms, and thus can be customized to other data assimilation tasks with little effort.

Another difficulty posed by the CORIE system is the intimidating dimensionality of the state. For the whole CORIE domain, the 3D grid used by ELCIRC includes over $10^6$ vertices. The variables of interest include the elevation, salinity, temperature and velocities at each vertex in the grid, which gives a number of degrees of freedom of order $10^7$. The huge state dimension makes direct use of the Kalman filter intractable, since we would need to deal with a covariance matrix of size around $10^7 \times 10^7$. To make the computation tractable, we need to significantly reduce the model size. In this paper, we use the principal component analysis (PCA) technique (Jolliffe, 1986; Berkooz et al., 1993; Sirovich, 1987) to achieve a much simpler model with state dimension $20 \sim 60$.

To deal with the strong non-linearity of the dynamics, we use the Sigma-point Kalman filter (SPKF) (van der Merwe and Wan, 2003, 2004), a sampling-based technique originally developed for control and navigation. This particular Kalman filter implementation requires $2N+1$ ensemble states, where $N$ is the dimension of the Kalman filter state vector (for the problem in this paper, $N$ is several hundred). Efficient evaluation and propagation of such a large ($>500$) state ensemble demands an ultra-fast simulator of the numerical model. Towards this end, we developed a neural network surrogate for the dynamics of the reduced system, which runs at least 30 times faster than the numerical model (ELCIRC) and 1000 times faster than real time (van der Merwe and Leen, 2005).

The combination of dimension reduction, a neural network surrogate and the Sigma-point Kalman filter (SPKF) is specially designed for extra-large non-linear models like CORIE and, to our knowledge, is unique in data assimilation practice. Much work has been done on taming the high dimensionality of the state space. A large body of it focuses on reducing the covariance matrix, realized either by explicit low-rank approximation of the error covariance matrix (Pham et al., 1998; Hoteit et al., 2002, 2001; Verlaan and Heemink, 2001), or by representing the error statistics with an ensemble of states (Evensen, 2003, 2002; Heemink et al., 2001). All these models deal with full states and typically require propagating multiple states through the model operator during each Kalman filtering iteration, which makes them expensive for huge models with slow simulation. The idea of directly working on a reduced model has been pursued by several authors (Cane et al., 1996; Hoteit and Pham, 2003) prior to our work. However, their models require direct reduction of the model dynamics in its explicit form, which is feasible only for linear systems or relatively small non-linear models. We circumvent this difficulty by training a neural net to mimic the non-linear dynamics in the reduced system. Although neural nets have long been used as fast surrogates for complex physical models (Principe et al., 1992; Grzeszczuk et al., 1998; Bishop et al., 1995), they have never been integrated into a data assimilation system. In the data assimilation community, the non-linearity of dynamics is usually handled either by the extended Kalman filter (EKF) (Chui and Chen, 1999; Hoteit and Pham, 2003), a method based on local linearization, or by random sampling, for example the Ensemble Kalman filter (EnKF) (Evensen, 2003, 2002). It is known (van der Merwe and Wan, 2003, 2004) that SPKF can capture the nonlinearity more accurately than EKF with the same computational complexity. Unlike EnKF, SPKF employs a more sensible deterministic sampling strategy, which enables SPKF to achieve good performance with a relatively small number of ensemble states. Although SPKF is widely known for its success in control and navigation, it is relatively new to oceanographic applications.

Another contribution of this paper is a rigorous probabilistic framework for data assimilation in a low-dimensional subspace. This probabilistic interpretation allows us to analyze different noise terms with techniques developed in the machine learning community (Bishop, 1995; Tipping and Bishop, 1997). For example, we can effectively estimate the extra observation noise caused by the dimension reduction, which is consistent with the estimate given by Cane et al. (1996).

The paper is organized as follows. In section 2, we introduce the probabilistic foundation for model reduction. In section 3, we give the framework of data assimilation with Kalman filtering. In section 4, we apply our approach to an estuary benchmark problem. Section 5 summarizes the paper and points out directions for future research.

2 Model Dimension Reduction

In this section, we discuss the model dimensionality reduction for the dynamic system used in data assimilation. In sections 2.1 and 2.2, we introduce the dimension reduction technique based on principal component analysis (PCA) and its probabilistic interpretation. In section 2.3, we analyze the dynamics of the reduced model. In section 2.4, we give the observation form for the reduced model.

2.1 Principal Component Analysis

Principal component analysis (PCA) is a conventional model reduction technique for dynamic systems (Berkooz et al., 1993; Sirovich, 1987). Use $\mathcal{X} \doteq \mathbb{R}^m$ to denote the state space of the full numerical model; e.g., for CORIE, $m \approx 10^7$. The basic idea is to take a set of snapshots $X = \{x^1, x^2, \ldots, x^N\}$,¹ $x^i \in \mathcal{X}$, randomly sampled from the trajectory of the numerical model, and find the first $d$ principal components of $X$. We assume here that the trajectories of the system occupy some significantly lower-dimensional (non-linear) manifold and that a principal component analysis is capable of accurately identifying a compact linear subspace that contains that manifold. This assumption is justified by the fact that the PCA subspace captures a high percentage (over 98% in our example) of the variance exhibited in the full numerical model.

For simplicity, we assume the samples in $X$ have zero mean: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x^i = 0$. Let $\Sigma$ be the covariance matrix of the state vector estimated from $X$. The first $d$ principal components of $X$, denoted $\{\phi^1, \phi^2, \ldots, \phi^d\}$, are the $d$ eigenvectors of $\Sigma$ with the largest eigenvalues.² Use $\mathcal{X}_S$ to denote the PCA subspace spanned by the $d$ vectors:

$$\mathcal{X}_S = \mathrm{span}\{\phi^1, \phi^2, \cdots, \phi^d\}.$$

Clearly, $\mathcal{X}_S$ is $d$-dimensional. The dimension reduction is achieved by the mapping $\Pi : \mathbb{R}^m \to \mathbb{R}^d$ defined as follows:

$$\Pi x \doteq \begin{bmatrix} (\phi^1)^T(x - \bar{x}) \\ (\phi^2)^T(x - \bar{x}) \\ \vdots \\ (\phi^d)^T(x - \bar{x}) \end{bmatrix} = \pi x, \qquad (1)$$

where $\pi = [\phi^1\ \phi^2\ \cdots\ \phi^d]^T$ is a $d \times m$ matrix. It is easy to verify that the operator $\Pi$ is length-preserving for vectors in $\mathcal{X}_S$.

¹In this paper, a number used as a superscript is an index of elements in a finite set; a number used as a subscript is an index of the time step, as will be defined in this section.

²In practice, the PCA is implemented with the singular value decomposition (SVD), which yields the same set of eigenvectors but is numerically superior to direct diagonalization of $\Sigma$ (Golub and van Loan, 1996).
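As a concrete illustration, here is a minimal NumPy sketch of fitting $\pi$ and applying the projection in equation (1), using the SVD route mentioned in footnote 2. The function names and the snapshot-matrix layout are ours, not part of the CORIE code.

```python
import numpy as np

def fit_pca_projection(X, d):
    """Fit pi = [phi^1 ... phi^d]^T of Eq. (1) from snapshots.

    X : (N, m) array whose rows are the snapshots x^i.
    d : subspace dimension. Returns (pi, mean), pi of shape (d, m).
    """
    mean = X.mean(axis=0)
    # Right singular vectors of the centered snapshots are the
    # eigenvectors of Sigma, sorted by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return Vt[:d], mean

def project(pi, mean, x):
    # Pi x in Eq. (1): coordinates of x in the PCA subspace X_S.
    return pi @ (x - mean)

def reembed(pi, mean, x_pi):
    # Pi^+ x_pi = pi^T x_pi (section 2.2): back to the full space.
    return pi.T @ x_pi + mean
```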

2.2 Latent Space Formula

Using the probabilistic interpretation of PCA proposed by Tipping and Bishop (1997), we can give a rigorous treatment of the subspace $\mathcal{X}_S$. First we assume that the dynamics are confined to a $d$-dimensional latent space $\mathcal{S}$, which is mapped to $\mathcal{X}$ through a linear transform:

$$x = Ws + \mu + \epsilon, \quad s \in \mathcal{S}, \qquad (2)$$

where $W$ is an $m \times d$ matrix (of rank $d$), $\mu \in \mathbb{R}^m$ is the mean, and $\epsilon$ is Gaussian noise: $\epsilon \sim N(0, \sigma^2 I_{m \times m})$. We further assume that the PCA subspace $\mathcal{X}_S$ is the image of the latent space $\mathcal{S}$ under the isomorphic transformation:

$$x^s = Ws + \mu, \qquad (3)$$

which is essentially the mapping in equation (2) before the noise is added. We can represent $x^s$ more concisely using the coordinates of $\mathcal{X}_S$, which can be conveniently implemented using the mapping $\Pi$ defined from the PCA:

$$x^\pi \doteq \Pi x^s = \Pi W s + \Pi\mu. \qquad (4)$$

Note that in equations (3) and (4), $x^s$ ($\in \mathbb{R}^m$) and $x^\pi$ ($\in \mathbb{R}^d$) are two different representations of the same point. $x^\pi$ can be re-embedded into the full space with the conjugate transpose of $\Pi$, denoted $\Pi^+$:

$$\Pi^+ x^\pi \doteq \pi^T x^\pi.$$

The projection operator $P_\Pi = \Pi^+\Pi$ maps any vector $x \in \mathcal{X}$ to its orthogonal projection in $\mathcal{X}_S$. In the remainder of the paper, the $d$-dimensional vector $x^\pi$ will be referred to as the subspace state, as opposed to the full-space states, by which we denote the $m$-dimensional states in $\mathcal{X}$, such as $x$ or $x^s$.

There are two things we should bear in mind. First, the noise $\epsilon$ explains the deviation of samples from the subspace $\mathcal{X}_S$. This deviation is the reconstruction error when we approximate a state $x$ with its projection $P_\Pi x$ on $\mathcal{X}_S$. Second, the noise $\epsilon$ also causes the difference between $x^\pi$ and $\Pi x$, assuming $x$ and $x^\pi$ are associated with the same $s$ in equations (2) and (4). Indeed, from equations (2), (3) and (4),

$$\Pi x - x^\pi = \Pi(x - x^s) = \Pi\epsilon. \qquad (5)$$

Figure 2 shows the relation between $x$, $x^s$ and $\Pi x$ in a three-dimensional illustration. In data assimilation, the true subspace state $x^\pi$ is unknown, so we use $\Pi x$ as its approximation. The quality of this approximation can be measured by:

$$E_{x^\pi, x}(\|\Pi x - x^\pi\|^2) = E_\epsilon(\|\Pi\epsilon\|^2) = d\sigma^2. \qquad (6)$$

In sections 2.4 and 3.1, we will see that this approximation introduces some extra error. However, as will be shown in section 2.4, this error is negligible due to the extremely small value of $\sigma^2$.

2.3 The Dynamics in Subspace

Our central assumption is that the dynamics in $\mathcal{X}$ are merely the noisy image of the dynamics in $\mathcal{S}$ after the mapping in equation (2). According to this assumption, we can simplify our task by studying the dynamics in $\mathcal{S}$, whose dimension ($d$) is typically much smaller than the dimension of the numerical model ($m$). However, the dynamics in $\mathcal{S}$ are not directly accessible since we do not know the concrete form of the mapping in equation (2). Instead, we can study their isomorphic image on $\mathcal{X}_S$. This isomorphism between the two dynamic systems is illustrated in Figure 3 with a two-dimensional example.


Figure 2: A geometric illustration of several variables: $x$, $x^s$, $x^\pi$ and $\Pi x$. (a) Latent space $\mathcal{S}$ and a latent variable $s$. (b) Full-state space $\mathcal{X}$ and the images of $s$ after the mappings in equations (2) and (3). (c) PCA subspace $\mathcal{X}_S$. Note the variables in (c) are subspace states.

We consider the evolution of the system state at discrete times $x_k \doteq x(t_0 + k\tau)$, where $t_0$ is the origin of the time axis and $\tau$ is the time interval used in data assimilation. Note $\tau$ can be different from the "internal" time interval used by the full numerical code (ELCIRC in our examples). In the remainder of the paper, we will use subscripts to index the time step, which also applies to other time-dependent variables like $x^\pi$, $s$ and $\epsilon$. To facilitate the training of the neural network surrogate (as will be seen in section 3.1), we consider an adjusted form of the dynamics in which the state $s_k$ is determined not only by the state $s_{k-1}$ and the driving force $u_{k-1}$, but also by the past states and forces up to $T$ time steps ago:

$$s_k = f_L(s_{k-1}, \cdots, s_{k-T}, u_{k-1}, \cdots, u_{k-T}), \qquad (7)$$

where $s_k$ and $u_k$ are the latent space state and driving force at discrete time index $k$, and $f_L$ is a non-linear function modeling the dynamics.

The isomorphism between $\mathcal{S}$ and $\mathcal{X}_S$ is given in equation (4), which can be inverted (it is easy to prove that $\Pi W$ is a non-singular $d \times d$ matrix):

$$s = (\Pi W)^{-1}(x^\pi - \Pi\mu). \qquad (8)$$

Using equations (7) and (8), we get the corresponding dynamics in $\mathcal{X}_S$ as follows:

$$x^\pi_k = \Pi W f_L\big((\Pi W)^{-1}(x^\pi_{k-1} - \Pi\mu), \cdots, (\Pi W)^{-1}(x^\pi_{k-T} - \Pi\mu), u_{k-1}, \cdots, u_{k-T}\big) + \Pi\mu, \qquad (9)$$

which can be rewritten in the following concise form:

$$x^\pi_k = f(x^\pi_{k-1}, \cdots, x^\pi_{k-T}, u_{k-1}, \cdots, u_{k-T}). \qquad (10)$$

Equation (10) gives the dynamics in the subspace $\mathcal{X}_S$. As mentioned earlier, we will use a neural network surrogate to mimic the dynamic system in equation (10), as discussed in section 3.1.
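To make the conjugation in equations (8)-(9) concrete, here is a small sketch, under our own naming, of how the latent dynamics $f_L$ would induce the subspace dynamics $f$; in practice $f$ is never formed this way but is learned directly by the surrogate of section 3.1.

```python
import numpy as np

def subspace_dynamics(f_latent, PiW, Pi_mu):
    """Conjugate latent dynamics f_L by the isomorphism of Eq. (4),
    yielding the subspace dynamics f of Eqs. (9)-(10)."""
    PiW_inv = np.linalg.inv(PiW)          # (Pi W)^{-1}, a d x d matrix
    def f(x_pi_history, u_history):
        # Map each past subspace state back to the latent space (Eq. 8) ...
        s_history = [PiW_inv @ (xp - Pi_mu) for xp in x_pi_history]
        # ... advance with f_L, then map forward again (Eq. 9).
        return PiW @ f_latent(s_history, u_history) + Pi_mu
    return f
```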


Figure 3: The dynamics in the latent space $\mathcal{S}$ are isomorphically mapped to $\mathcal{X}_S$. Note that states and vectors in the subspace $\mathcal{X}_S$ are represented in subspace coordinates, while states in $\mathcal{X}$ are represented in full-space coordinates.

2.4 Observation Noise for the Reduced Model

2.4.1 Observation Form

Our observations usually come from sensors on stations or from cruise data, both of which can be formulated as:

$$y_k = Hx_k + w^m_k, \qquad (11)$$

where $H$ is a linear operator and $w^m_k$ is the measurement noise. Equation (11) is not directly applicable in Kalman filtering since $x_k \in \mathcal{X}$ is the full-space state, while the dynamics, as expressed in equation (10), are in subspace coordinates. We need to find the observation operator that relates subspace states to observations. We first notice that

$$x_k = x^s_k + \epsilon_k \qquad (12)$$

$$= \Pi^+ x^\pi_k + \epsilon_k. \qquad (13)$$

Combining equations (11) and (13), it is not hard to see

$$y_k = H(\Pi^+ x^\pi_k + \epsilon_k) + w^m_k \qquad (14)$$

$$= H\Pi^+ x^\pi_k + (H\epsilon_k + w^m_k) \qquad (15)$$

$$= H\Pi^+ x^\pi_k + w^o_k, \qquad (16)$$

where $w^o_k = H\epsilon_k + w^m_k$. Therefore, for the dynamics in $\mathcal{X}_S$ expressed in equation (10), the observation operator is $H\Pi^+$, and the observation noise $w^o_k$ consists of the measurement noise $w^m_k$ and the noise $H\epsilon_k$. We assume $w^m_k$ and $\epsilon_k$ are both stationary white Gaussian noise and independent of each other. We further assume the variance of $w^m$ is known to be $R$.


2.4.2 Analysis and Estimation of Observation Noise

The observation noise $w^o_k$ can be further decomposed into three parts:

$$w^o_k = H\epsilon_k + w^m_k \qquad (17)$$

$$= H(x_k - P_\Pi x_k) + H(P_\Pi x_k - x^s_k) + w^m_k. \qquad (18)$$

The first term $H(x_k - P_\Pi x_k)$ comes from the reconstruction error $x_k - P_\Pi x_k$, which is the part of the noise $\epsilon_k$ orthogonal to $\mathcal{X}_S$; the second term comes from the part of $\epsilon_k$ that lies within $\mathcal{X}_S$. It is straightforward to show that the two terms are independent of each other. Therefore the covariance of $w^o$ consists of the following three parts:

$$\mathrm{cov}(w^o_k) = \mathrm{cov}(H(x_k - P_\Pi x_k)) + \mathrm{cov}(H(P_\Pi x_k - x^s_k)) + \mathrm{cov}(w^m_k). \qquad (19)$$

The first term of equation (19) is the covariance of the reconstruction error, which can be estimated from $X = \{x^1, x^2, \cdots, x^N\}$:

$$\mathrm{cov}(H(x_k - P_\Pi x_k)) \approx \frac{1}{N}\, H \sum_{i=1}^{N} (P_\Pi x^i - x^i)(P_\Pi x^i - x^i)^T H^T; \qquad (20)$$

The second term of equation (19) is

$$\mathrm{cov}(H(P_\Pi x_k - x^s_k)) = \sigma^2 H U \begin{bmatrix} I_{d \times d} & 0 \\ 0 & 0 \end{bmatrix} U^T H^T, \qquad (21)$$

where $U$ is an $m \times m$ orthogonal matrix with each column an eigenvector of $\Sigma$. The maximum likelihood estimate of $\sigma^2$ from $X$ is given by (Tipping and Bishop, 1997):

$$\sigma^2_{ML} = \frac{1}{N(m-d)} \sum_{i=1}^{N} (P_\Pi x^i - x^i)^T (P_\Pi x^i - x^i). \qquad (22)$$

Note that $\frac{1}{N}\sum_{i=1}^{N}(P_\Pi x^i - x^i)^T(P_\Pi x^i - x^i)$ is the estimated variance lost in dimension reduction. For our problem the lost variance is usually less than 2% of the total variance. Considering that $m$ is usually greater than $10^5$, we have $\sigma^2_{ML}$ less than 0.0001% of the total variance. In practice, we therefore neglect the second term in equation (19), and estimate the covariance of the observation noise $w^o_k$ as:

$$\mathrm{cov}(w^o_k) \approx H\left(\frac{1}{N}\sum_{i=1}^{N}(P_\Pi x^i - x^i)(P_\Pi x^i - x^i)^T\right)H^T + R. \qquad (23)$$

Correspondingly,

$$w^o_k \approx H(x_k - P_\Pi x_k) + w^m_k. \qquad (24)$$

We notice that the squared norm of the vector $(x_k - P_\Pi x_k)$ is the reconstruction error for $x_k$. It follows from equation (24) that when the reconstruction error is high, we are likely to get large observation noise.
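A minimal NumPy sketch of the estimate in equation (23), assuming zero-mean snapshots as in section 2.1; the variable names are ours:

```python
import numpy as np

def observation_noise_cov(X, pi, H, R):
    """Estimate cov(w^o) as in Eq. (23) from snapshots X (rows x^i).

    pi : (d, m) projection matrix, H : (p, m) observation operator,
    R : (p, p) known measurement-noise covariance.
    """
    # Reconstruction residuals (x^i - P_Pi x^i), with P_Pi = pi^T pi.
    residuals = X - (X @ pi.T) @ pi       # shape (N, m)
    HR = residuals @ H.T                  # rows H(x^i - P_Pi x^i), shape (N, p)
    # Sample covariance of the projected residuals, plus measurement noise.
    return (HR.T @ HR) / X.shape[0] + R
```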


3 Kalman Filtering for Data Assimilation

Based on the probabilistic dimension reduction model proposed in section 2, we obtain a reduced dynamic model with manageable size and a proper observation noise form. In this section we build the framework of sequential data assimilation with the Kalman filter. First, in section 3.1, we discuss using neural networks as ultra-fast surrogates for the dynamics in $\mathcal{X}_S$. Then in section 3.2, we combine the material from all previous sections and give the complete Kalman filter equations. In section 3.3, we briefly introduce the Sigma-point Kalman filter technique that will be used in our data assimilation system. Finally, in section 3.4, we discuss how to obtain data-assimilated state estimates from the Kalman filtering result.

3.1 Neural Network Surrogate

As mentioned in the introduction, our Sigma-point Kalman filter requires evaluating $2N+1$ ensemble states at each time step, with $N$ the dimension of the Kalman filter state vector. For our problem, $N$ is usually several hundred to over one thousand. However, the numerical model (ELCIRC) typically runs only $1.5 \sim 10$ times faster than real time. This makes the evaluation and propagation of such a large ($>500$) state ensemble prohibitively expensive, and an ultra-fast simulator is needed. Towards this end, we developed a neural network surrogate for the dynamics of the reduced system, more specifically, an approximator of the non-linear function $f$ in equation (10). The trained neural network surrogate runs at least 30 times faster than the numerical model (ELCIRC) and 1000 times faster than real time (van der Merwe and Leen, 2005). Our neural network surrogates are non-linear feed-forward multi-layer perceptrons (Bishop, 1995). Such networks are very well equipped for modeling non-linear relations among high-dimensional inputs and outputs where large datasets are available for model fitting (or training). Where there is significant non-linearity, their performance exceeds that of traditional linear models such as ARMA models and GLMs (Bishop, 1995). In fact, for the prediction problem at hand, we found that linear predictors, fit using standard robust least-squares regression techniques, were inherently unstable, with poles lying outside the unit circle (van der Merwe and Leen, 2005). This results in unbounded exponential growth of the network's response over time, which renders such predictors useless for data assimilation.

The neural network is a standard multi-layer perceptron (MLP) with a single hidden layer with hyperbolic tangent activation functions and a linear output layer. This standard structure is often used for general non-linear regression problems. The sizes of the network input and output layers are dictated by the dimension and embedding length of the subspace state variables and forcings. The size of the hidden layer can be set in order to control the total model complexity (number of free parameters). Typically these 'hyperparameters' can be set using some form of cross-validation. We chose not to constrain the size of the hidden layer too severely, but rather to control model overfitting through the use of weight-decay regularization (Bishop, 1995). The training of the neural network is based on the full-space states $\{x_k\}$ sampled from the trajectory of the numerical model. Since we do not know the corresponding true subspace states $\{x^\pi_k\}$, we use the dimension-reduced states $\{\Pi x_k\}$ instead. As mentioned in section 2.2, $\Pi x_k$ deviates from $x^\pi_k$ by $\Pi\epsilon_k$. The expected magnitude of this deviation is

$$E_{x_k, x^\pi_k}(\|\Pi x_k - x^\pi_k\|^2) = E_{\epsilon_k}(\|\Pi\epsilon_k\|^2) = d\sigma^2.$$


Since $\sigma^2$ is extremely small, as established in section 2.3, this deviation can be safely neglected. The neural network prediction is formulated as:

$$x^\pi_{k+1} = f_{NN}(x^\pi_k, \cdots, x^\pi_{k-T+1}, u_k, \cdots, u_{k-T+1}) + w^x_k, \qquad (25)$$

where $f_{NN}$ is the neural network predictor and $w^x_k$ is the prediction error (process noise) at the $k$th step. Again we assume the process noise is white, $w^x_k \sim N(0, Q)$. The covariance matrix $Q$ can be estimated from the residuals of the surrogate predictions:

$$Q = \frac{1}{N_t} \sum_{k=1}^{N_t} (\hat{x}^\pi_k - \Pi x_k)(\hat{x}^\pi_k - \Pi x_k)^T, \qquad (26)$$

where $N_t$ is the number of time steps used for covariance matrix estimation, and $\hat{x}^\pi_k$ is the surrogate prediction of $x^\pi_k$:

$$\hat{x}^\pi_k = f_{NN}(\Pi x_{k-1}, \cdots, \Pi x_{k-T}, u_{k-1}, \cdots, u_{k-T}). \qquad (27)$$

The prediction error may originate from different sources. First, there is the inherent error of the neural network in function approximation (Bishop, 1995). Second, our latent space assumption may be deficient: the dynamics cannot be accurately described by $d$-dimensional variables, so information important for state prediction may be lost in dimension reduction. Finally, even if our assumption of the low-dimensional subspace is correct, the true subspace could be significantly different from the one we estimate from a finite set of trajectory snapshots.
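A sketch of how such a surrogate could be fit with off-the-shelf tools. The lagged-input layout follows equation (27); the hidden-layer size (50) and the weight-decay strength (`alpha`) are illustrative assumptions, not values from the report, which would set them by cross-validation or regularization as described above.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def build_lagged_dataset(Z, U, T):
    """Stack T past subspace states Z[k] (= Pi x_k) and forcings U[k]
    into regression pairs for Eqs. (25) and (27)."""
    inputs, targets = [], []
    for k in range(T, len(Z)):
        past = [np.atleast_1d(Z[k - i]) for i in range(1, T + 1)]
        past += [np.atleast_1d(U[k - i]) for i in range(1, T + 1)]
        inputs.append(np.concatenate(past))
        targets.append(Z[k])
    return np.array(inputs), np.array(targets)

# Single tanh hidden layer with weight-decay (L2) regularization, as in
# the text; hidden size and alpha here are illustrative guesses.
surrogate = MLPRegressor(hidden_layer_sizes=(50,), activation='tanh',
                         alpha=1e-3, max_iter=2000)
# inputs, targets = build_lagged_dataset(Z, U, T=24)
# surrogate.fit(inputs, targets)
# Q is then estimated from the prediction residuals as in Eq. (26).
```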

3.2 Kalman Filtering Equations

In modeling the dynamics of the ocean, besides the process noise $w^x$ in equation (25), we need to consider another important source of uncertainty: the error in measuring or modeling the driving force $u$. In this paper, we model this inaccuracy with a perturbed driving force $\bar{u}_k$, formulated as follows:

$$\bar{u}_k \doteq u_k(v^1_k, v^2_k, \cdots, v^q_k), \qquad (28)$$

where $\{v^1_k, v^2_k, \cdots, v^q_k\}$ are $q$ different sources of noise at time $k$. Note equation (28) is a fairly general form and it subsumes the noise in our benchmark problem (introduced in section 4.1) as a particular case. We generally assume each $v^i$ is colored noise, modeled as:

$$v^i_k = g^i(v^i_{k-1}, v^i_{k-2}, \cdots, v^i_{k-T^i}) + w^i_k, \quad i = 1, \cdots, q, \qquad (29)$$

where $w^i_k$ is the driving white noise for the $i$th colored noise at time step $k$. Without loss of generality, we assume $T^i \geq T^x$, $i = 1, 2, \cdots, q$.

With the uncertainty in the driving force incorporated, our basic equations are summarized as:

$$x^\pi_k = f_{NN}(x^\pi_{k-1}, \cdots, x^\pi_{k-T^x}, \bar{u}_{k-1}, \cdots, \bar{u}_{k-T^x}) + w^x_{k-1} \qquad (30)$$

$$y_k = H\pi^T x^\pi_k + w^o_k. \qquad (31)$$


The Kalman filter for data assimilation will be based on these two equations. To deal with the colored noise formulated in equation (29) and the time-delayed states in equation (30), we employ the Kalman filter formulation proposed by Gibson et al. (1991). We take an extended state vector:

$$\mathbf{x}_k = [(x^\pi_k)^T, \cdots, (x^\pi_{k-T^x+1})^T, v^1_k, \cdots, v^1_{k-T^1+1}, v^2_k, \cdots, v^2_{k-T^2+1}, \cdots, v^q_k, \cdots, v^q_{k-T^q+1}]^T. \qquad (32)$$

The extended state vector consists of two parts: the current and past (up to $T^x$ time steps earlier) subspace states $x^\pi$, and the current and past (up to $T^i$ time steps earlier) noise from source $i$, for $i = 1, \cdots, q$. The length of the extended vector is $d_E = T^x d + T^1 + \cdots + T^q$. Using the extended vector allows us to write a dynamic state-space model of the following form:

$$\mathbf{x}_k = \mathbf{f}(\mathbf{x}_{k-1}) + \mathbf{w}^p_{k-1} \qquad (33)$$

$$y_k = \mathbf{H}\mathbf{x}_k + \mathbf{w}^o_k. \qquad (34)$$

The extended process noise vector $\mathbf{w}^p_k$ consists of the process noise for the subspace state prediction and the driving white noise for the colored noise models:

$$\mathbf{w}^p_k = [(w^x_k)^T, 0, \cdots, 0, w^1_k, 0, \cdots, 0, \cdots, w^q_k, 0, \cdots, 0]^T, \qquad (35)$$

and the extended observation noise vector $\mathbf{w}^o_k$ is

$$\mathbf{w}^o_k = [(w^o_k)^T, 0, \cdots, 0]^T. \qquad (36)$$

Note both the extended process noise $\mathbf{w}^p_k$ and the extended observation noise $\mathbf{w}^o_k$ are white Gaussian. Equation (33) is the dynamic equation for the Kalman filter, which expands into:

$$\begin{bmatrix} x^\pi_k \\ \vdots \\ x^\pi_{k-T^x+1} \\ v^1_k \\ \vdots \\ v^1_{k-T^1+1} \\ \vdots \\ v^q_k \\ \vdots \\ v^q_{k-T^q+1} \end{bmatrix} = \begin{bmatrix} f_u(\cdot) & 0 & \cdots & 0 \\ \begin{smallmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{smallmatrix}\; 0 & 0 & \cdots & 0 \\ 0 & A^1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A^q \end{bmatrix} \begin{bmatrix} x^\pi_{k-1} \\ \vdots \\ x^\pi_{k-T^x} \\ v^1_{k-1} \\ \vdots \\ v^1_{k-T^1} \\ \vdots \\ v^q_{k-1} \\ \vdots \\ v^q_{k-T^q} \end{bmatrix} + \begin{bmatrix} w^x_{k-1} \\ \vdots \\ 0 \\ w^1_{k-1} \\ \vdots \\ 0 \\ \vdots \\ w^q_{k-1} \\ \vdots \\ 0 \end{bmatrix}, \qquad (37)$$

where

$$f_u(x^\pi_{k-1}, \cdots, x^\pi_{k-T^x}, v^1_{k-1}, \cdots, v^1_{k-T^1}, \cdots, v^q_{k-1}, \cdots, v^q_{k-T^q}) = f_{NN}(x^\pi_{k-1}, \cdots, x^\pi_{k-T^x}, \bar{u}_{k-1}, \cdots, \bar{u}_{k-T^x}), \qquad (38)$$


and each matrix $A^i$ models the colored noise $v^i$:

$$A^i = \begin{bmatrix} g^i(\cdot) & & & \\ 1 & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & 1 & 0 \end{bmatrix}, \quad i = 1, 2, \cdots, q. \qquad (39)$$

Equation (34) is the observation equation, which expands into

$$y_k = \begin{bmatrix} H\pi^T & 0 & \cdots & 0 \end{bmatrix} \begin{bmatrix} x^\pi_k \\ \vdots \\ x^\pi_{k-T^x+1} \\ v^1_k \\ \vdots \\ v^1_{k-T^1+1} \\ \vdots \\ v^q_k \\ \vdots \\ v^q_{k-T^q+1} \end{bmatrix} + \begin{bmatrix} w^o_k \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \qquad (40)$$
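For illustration, here is a sketch of one step of the extended dynamics in equation (37): the state and noise histories are shifted down by one slot, the surrogate produces the new subspace state, and each AR model produces the new noise value. The layout helpers and names are ours, not from the report.

```python
import numpy as np

def extended_transition(x_ext, f_nn, g_ar, dims):
    """One step of the extended dynamics in Eq. (37).

    x_ext : extended state as laid out in Eq. (32).
    f_nn  : surrogate f_NN acting on (state history, noise histories);
            it rebuilds the perturbed forcings internally (Eqs. 28, 38).
    g_ar  : list of predictors g^i for the colored noise sources.
    dims  : (d, Tx, [T_1, ..., T_q]).
    """
    d, Tx, Ts = dims
    n_state = d * Tx
    states = x_ext[:n_state].reshape(Tx, d)   # x_pi_{k-1}, ..., x_pi_{k-Tx}
    noises, pos = [], n_state
    for Ti in Ts:
        noises.append(x_ext[pos:pos + Ti])    # v^i_{k-1}, ..., v^i_{k-Ti}
        pos += Ti
    new_state = f_nn(states, noises)
    # Companion-form update (Eq. 39): new value on top, history shifted.
    new_noises = [np.concatenate([[g(v)], v[:-1]]) for g, v in zip(g_ar, noises)]
    return np.concatenate([new_state, states[:-1].ravel()] + new_noises)
```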

3.3 Implementing Sigma-point Kalman Filter

Once the system dynamic model $\mathbf{f}(\cdot)$ and observation operator $\mathbf{H}$ are known, the data assimilation task is to find the maximum-likelihood estimate of the state $\mathbf{x}_k$ given the noisy observations $\{y_k, y_{k-1}, \cdots\}$:

$$\hat{\mathbf{x}}_{k|k} = \hat{\mathbf{x}}_{k|k-1} + K_k(y_k - \hat{y}_{k|k-1}) \qquad (41)$$

$$P^x_{k|k} = P^x_{k|k-1} - K_k P^{\tilde{y}}_{k|k-1} K_k^T, \qquad (42)$$

where $\hat{\mathbf{x}}_{k|k-1}$ is the optimal prediction of the state at time $k$ conditioned on all of the observed information up to and including time $k-1$, and $\hat{y}_{k|k-1}$ is the optimal prediction of the observation at time $k$. $P^x_{k|k-1}$ is the covariance of $\hat{\mathbf{x}}_{k|k-1}$ and $P^{\tilde{y}}_{k|k-1}$ is the covariance of $\tilde{y}_k = y_k - \hat{y}_{k|k-1}$, termed the innovation. The optimal terms in this recursion are given by

$$\hat{\mathbf{x}}_{k|k-1} = E[\mathbf{f}(\hat{\mathbf{x}}_{k-1|k-1}) + \mathbf{w}^p_{k-1}] \qquad (43)$$

$$\hat{y}_{k|k-1} = E[\mathbf{H}\hat{\mathbf{x}}_{k|k-1} + \mathbf{w}^o_k] \qquad (44)$$

$$K_k = P^{xy}_{k|k-1}\big(P^{\tilde{y}}_{k|k-1}\big)^{-1} \qquad (45)$$

$$= E[(\mathbf{x}_k - \hat{\mathbf{x}}_{k|k-1})(y_k - \hat{y}_{k|k-1})^T]\; E[(y_k - \hat{y}_{k|k-1})(y_k - \hat{y}_{k|k-1})^T]^{-1}, \qquad (46)$$

where the optimal prediction $\hat{\mathbf{x}}_{k|k-1}$ corresponds to the expectation of a non-linear function of the random variables $\hat{\mathbf{x}}_{k-1|k-1}$ and $\mathbf{w}^p_{k-1}$ (see equation (43)). A similar interpretation holds for the optimal prediction of the observation $\hat{y}_{k|k-1}$ in equation (44). The optimal gain term


$K_k$ is expressed as a function of the posterior covariance matrices in equations (45) and (46). Note these terms also require taking expectations of a non-linear function of the prior state estimate random variables. This recursion provides the optimal minimum mean-square error (MMSE) linear estimator of $\mathbf{x}_k$, assuming all the relevant random variables in the system can be efficiently and consistently modeled by maintaining their first and second order moments, i.e., they can be accurately modeled by Gaussian random variables (GRVs). We need not assume linearity of the system model $\mathbf{f}(\cdot)$.

The Sigma-point Kalman filter (SPKF) (van der Merwe and Wan, 2003, 2004) addresses the estimation of the expectations in equations (43)-(46). In SPKF, the state distribution is approximated by a Gaussian distribution, but it is represented using a minimal set of carefully chosen weighted sample points, termed sigma-points. The number of sigma-points is $2d_E + 1$, where $d_E$ is the dimension of $\mathbf{x}_k$ as defined in section 3.2. The sigma-points completely capture the true mean and covariance of the Gaussian distribution, and when propagated through the true non-linear system, they capture the posterior mean and covariance accurately to second order for any non-linearity. The extended Kalman filter (EKF), in contrast, only achieves first-order accuracy. Also, the computational complexity of the SPKF is of the same order as that of the EKF. The code in this paper makes use of R. van der Merwe's ReBEL toolbox, which is available at http://choosh.ece.ogi.edu/rebel/.

The implementation of SPKF for our problem can be summarized as follows. Consider propagating the random variable $\mathbf{x}_{k-1}$ through the nonlinear function $\mathbf{f}(\cdot)$. According to the previous estimation step, the variable $\mathbf{x}_{k-1}$ has mean $\hat{\mathbf{x}}_{k-1|k-1}$ and covariance $P^x_{k-1|k-1}$. To calculate the statistics of $\mathbf{x}_k$, we form a set of $2d_E + 1$ sigma-points $\{\mathcal{X}^i_{k-1} : i = 0, \ldots, 2d_E\}$, where $\mathcal{X}^i_{k-1} \in \mathbb{R}^{d_E}$. The sigma-points are calculated using the following general selection scheme:

$$\mathcal{X}^0_{k-1} = \hat{\mathbf{x}}_{k-1|k-1} \qquad (47)$$

$$\mathcal{X}^i_{k-1} = \hat{\mathbf{x}}_{k-1|k-1} + \lambda\big(\sqrt{P^x_{k-1|k-1}}\big)_i, \quad i = 1, \ldots, d_E \qquad (48)$$

$$\mathcal{X}^i_{k-1} = \hat{\mathbf{x}}_{k-1|k-1} - \lambda\big(\sqrt{P^x_{k-1|k-1}}\big)_{i-d_E}, \quad i = d_E + 1, \ldots, 2d_E, \qquad (49)$$

where $\lambda$ is a scalar scaling factor that determines the spread of the sigma-points around $\hat{\mathbf{x}}_{k-1|k-1}$, and $\big(\sqrt{P^x_{k-1|k-1}}\big)_i$ denotes the $i$th column of the matrix square root of the covariance matrix $P^x_{k-1|k-1}$. Once the sigma-points are calculated from the prior statistics as shown above, they are propagated through the non-linear function,

$$\mathcal{X}^i_{k|k-1} = \mathbf{f}(\mathcal{X}^i_{k-1}), \quad i = 0, \ldots, 2d_E. \qquad (50)$$

The mean of $\mathbf{x}_k$ (before the observation) is approximated using a weighted sample mean,

$$\hat{\mathbf{x}}_{k|k-1} \approx \sum_{i=0}^{2d_E} w^m_i \mathcal{X}^i_{k|k-1}, \qquad (51)$$

and the covariance of $\mathbf{x}_k$ (before the observation) is approximated as

$$P^x_{k|k-1} \approx \sum_{i=0}^{2d_E} \sum_{j=0}^{2d_E} w^c_{ij}\, \mathcal{X}^i_{k|k-1}\big(\mathcal{X}^j_{k|k-1}\big)^T + P^p_{k-1}, \qquad (52)$$


where $P^p_{k-1}$ is the known covariance matrix of the additive process noise $\mathbf{w}^p_{k-1}$, and the coefficients $w^m_i$ and $w^c_{ij}$ are non-negative scalar weights. The sigma-points for the observation are defined as

$$\mathcal{Y}^i_{k|k-1} = \mathbf{H}\mathcal{X}^i_{k|k-1}, \quad i = 0, \ldots, 2d_E, \qquad (53)$$

where $\mathbf{H}$ is the observation operator (as in equation (34)). The terms $\hat{y}_{k|k-1}$, $P^{\tilde{y}}_{k|k-1}$ and $P^{xy}_{k|k-1}$ (as they appear in equations (41)-(46)) can be approximated as:

$$\hat{y}_{k|k-1} \approx \sum_{i=0}^{2d_E} w^m_i \mathcal{Y}^i_{k|k-1} \qquad (54)$$

$$P^{\tilde{y}}_{k|k-1} \approx \sum_{i=0}^{2d_E} \sum_{j=0}^{2d_E} w^c_{ij}\, \mathcal{Y}^i_{k|k-1}\big(\mathcal{Y}^j_{k|k-1}\big)^T + P^o_k \qquad (55)$$

$$P^{xy}_{k|k-1} \approx \sum_{i=0}^{2d_E} \sum_{j=0}^{2d_E} w^c_{ij}\, \mathcal{X}^i_{k|k-1}\big(\mathcal{Y}^j_{k|k-1}\big)^T, \qquad (56)$$

where $P^o_k$ is the known covariance matrix of the additive observation noise $\mathbf{w}^o_k$. For more complicated situations, e.g. when the observation noise is not additive, see (van der Merwe and Wan, 2003, 2004).

In our experiment we set

$$w^c_{ij} = 0, \quad i \neq j.$$

The specific values of the weights $w$ and the scaling factor $\lambda$ depend on the type of sigma-point approach used. In this experiment, we use a variant of the SPKF: the central-difference Kalman filter.
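For illustration, here is a minimal sketch of the sigma-point construction (equations (47)-(49)) and the time update (equations (50)-(52)). It uses a Cholesky factor as the matrix square root and the standard deviation form of the sample covariance with diagonal weights $w^c_{ii}$; the central-difference variant actually used in the report differs in its specific weight and $\lambda$ definitions.

```python
import numpy as np

def sigma_points(mean, cov, lam):
    """Sigma-points of Eqs. (47)-(49): 2*dE + 1 points around the mean."""
    dE = mean.size
    S = np.linalg.cholesky(cov)          # a matrix square root of P^x
    pts = [mean]
    pts += [mean + lam * S[:, i] for i in range(dE)]
    pts += [mean - lam * S[:, i] for i in range(dE)]
    return np.array(pts)                 # shape (2*dE + 1, dE)

def spkf_predict(mean, cov, f, Pp, lam, w_m, w_c):
    """Time update of Eqs. (50)-(52) with diagonal weights w^c_ii.
    f : extended transition function; Pp : process-noise covariance."""
    X = sigma_points(mean, cov, lam)
    Xp = np.array([f(x) for x in X])     # propagate through f, Eq. (50)
    mean_p = w_m @ Xp                    # weighted sample mean, Eq. (51)
    diffs = Xp - mean_p
    cov_p = (diffs * w_c[:, None]).T @ diffs + Pp   # Eq. (52)
    return mean_p, cov_p
```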

3.4 Estimation of Full State

Naturally, we can obtain two estimates of $x^\pi_k$ from the KF analysis states: the "0-lag" estimate

$$x^\pi_{k|k} = \begin{pmatrix} I_{d \times d} & 0_{d \times (d_E - d)} \end{pmatrix} \hat{\mathbf{x}}_{k|k}, \qquad (57)$$

where $d_E$ is the dimension of the extended state, and the "full-lag" estimate

$$x^\pi_{k|k+T^x-1} = \begin{pmatrix} 0_{d \times (T^x-1)d} & I_{d \times d} & 0_{d \times (T^1 + \cdots + T^q)} \end{pmatrix} \hat{\mathbf{x}}_{k+T^x-1|k+T^x-1}. \qquad (58)$$

It is clear from equations (57) and (58) that the 0-lag estimate $x^\pi_{k|k}$ has observations up to time step $k$ incorporated, while the full-lag estimate $x^\pi_{k|k+T^x-1}$ is the optimal estimate after the future observations $\{y_{k+T^x-1}, y_{k+T^x-2}, \cdots, y_{k+1}\}$ become available. Hence the full-lag estimate $x^\pi_{k|k+T^x-1}$ is the result of fixed-lag Kalman smoothing with lag equal to $T^x - 1$.

We use $x^\pi_{k|k'}$ to denote the optimal estimate of $x^\pi_k$ after seeing observations $\{y_{k'}, y_{k'-1}, \cdots\}$. The corresponding full-space state, denoted $x_{k|k'}$, is $x^\pi_{k|k'}$ re-embedded into the full space $\mathcal{X}$:

$$x_{k|k'} \doteq \Pi^+ x^\pi_{k|k'} = \pi^T x^\pi_{k|k'}. \qquad (59)$$
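In code, these selections are simple slices of the extended analysis state; a sketch with our own function names:

```python
import numpy as np

def zero_lag_estimate(x_ext, d):
    # Eq. (57): the first d entries of the extended analysis state x_{k|k}.
    return x_ext[:d]

def full_lag_estimate(x_ext, d, Tx):
    # Eq. (58): the oldest state block of the analysis at time k + Tx - 1.
    return x_ext[(Tx - 1) * d : Tx * d]

def full_space_estimate(pi, x_pi):
    # Eq. (59): re-embed the subspace estimate into the full space.
    return pi.T @ x_pi
```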


To evaluate the data assimilation result, we calculate the difference between the 0-lag full-state estimate $x_{k|k}$ and the true full-space state $x_k$. It is easy to see that the residual can be decomposed into two orthogonal parts:

$$x_k - x_{k|k} = (x_k - \pi^T\pi x_k) \oplus (\pi^T\pi x_k - \pi^T x^\pi_{k|k}). \qquad (60)$$

The first part $(x_k - \pi^T\pi x_k)$ is orthogonal to $\mathcal{X}_S$ and therefore independent of the Kalman filtering result. The second part $(\pi^T\pi x_k - \pi^T x^\pi_{k|k})$ is the difference between the estimated subspace state $x^\pi_{k|k}$ and the projection of $x_k$ on $\mathcal{X}_S$. Accordingly, the square error between $x_k$ and $x_{k|k}$, termed the DA error, can be viewed as the sum of two terms:

$$\|x_k - x_{k|k}\|^2 = \|x_k - \pi^T\pi x_k\|^2 + \|\pi x_k - x^\pi_{k|k}\|^2. \qquad (61)$$

The first term on the right side of equation (61) is the reconstruction error, and the second term will be referred to as the subspace DA error. Since the reconstruction error does not depend on the Kalman filtering, it provides a natural lower bound on the DA error.

4 Experiment on Benchmark Problem

4.1 Estuary Benchmark Description

In this section, we apply the data assimilation approach described in section 3 to a simulated estuary benchmark. Figure 4 gives a graphical representation of the physical layout of the estuary benchmark simulation. The simplified estuary is 8 km long, with an 8 km wide ocean mouth and a 2 km wide river inlet. The depth varies from 10 m at the ocean side to 5 m at the river inlet. The ELCIRC code uses a 3D grid with 84,214 vertices. The variables considered in this model include the elevation, salinity and velocity at each vertex, which gives a total of 174,979 degrees of freedom.

In the simplified model, the tidal forcing is applied uniformly across the boundary on the ocean side. It has a simple periodic form:

$$u^t(t_0 + k\tau') = A\cos\left(\frac{2\pi}{T}(t_0 + k\tau') + \phi\right), \qquad (62)$$

where $t_0$ is the origin of the time axis and $\tau' = 15$ minutes is the time interval used in ELCIRC. The river flux is a constant: $u^r = B$.

We consider the amplitude ($A$) and the phase ($\phi$) to be perturbed by independent colored noise:

$$\bar{A}(t_0 + k\tau') = A + v^A(t_0 + k\tau') \qquad (63)$$

$$\bar{\phi}(t_0 + k\tau') = \phi + v^\phi(t_0 + k\tau'). \qquad (64)$$

The colored noises $v^A$ and $v^\phi$ are generated using the following moving average (MA) model:

$$v^A(t_0 + k\tau') = \sum_{i=0}^{n^A} \alpha^A_i w^A(t_0 + (k-i)\tau') \qquad (65)$$

$$v^\phi(t_0 + k\tau') = \sum_{i=0}^{n^\phi} \alpha^\phi_i w^\phi(t_0 + (k-i)\tau'), \qquad (66)$$


Figure 4: Simple tidally-influenced estuary benchmark simulation with a constant river flux feeding the system. [Diagram: model bathymetry with periodic tidal forcing at the ocean boundary and constant river flux at the inlet, with a salinity transect marked.]

where $w^A$ and $w^\phi$ are independent white Gaussian noises. The perturbed form of the tidal forcing is

$$\bar{u}^t(t_0 + k\tau') = \bar{A}(t_0 + k\tau')\cos\left(\frac{2\pi}{T}(t_0 + k\tau') + \bar{\phi}(t_0 + k\tau')\right). \qquad (67)$$
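A small sketch of the MA noise generation in equations (65)-(66), assuming given coefficients $\alpha_i$ and a white-noise standard deviation (both placeholder parameters of ours):

```python
import numpy as np

def ma_colored_noise(alpha, n_steps, sigma, seed=0):
    """Colored noise from the MA model of Eqs. (65)-(66).
    alpha : coefficients alpha_0..alpha_n; sigma : white-noise std."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, sigma, size=n_steps + len(alpha) - 1)
    # v(k) = sum_i alpha_i * w(k - i): a convolution of white noise.
    return np.convolve(w, alpha, mode='valid')
```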

Figure 5 gives an instance of the perturbed amplitude and phase. Since the river flux is held constant through the simulation, the tidal forcing is the sole source of perturbation. In data assimilation, we consider the discretization in time with an interval $\tau = 4\tau' = 1$ hour. The notation rule is the same as described in section 2.3; for example, $u^t_k = u^t(t_0 + k\tau)$.

It is desirable to have samples of the system trajectory under a broad variety of driving forces, based on which we can have better coverage of both the manifold and the dynamics. Intuitively, this will directly benefit the model reduction and the training of the neural network surrogates. To increase the variety of driving forces, we can either collect system trajectories of longer duration, or create virtual examples of the system trajectory with artificially designed forcings. In the benchmark problem, we implement this idea by having multiple runs of the system with different perturbations of the tidal forcing. We have one ELCIRC run with noise-free tidal forcing (Run 0) and three runs with different instantiations of colored noise on both amplitude and phase (Run 1, Run 2, and Run 3), as formulated through equations (63)-(67), for two weeks.³ The noise is added starting from the first day of the second week (day 8), so the first-week simulations of all four runs are the same. We use Run 0 as the reference run, with the forcing in equation (62) as our best guess of the forcing prior to any observation of the state vector. Besides the reference run, we also assume we have simulation results with forcing perturbed in different ways (Run 1 and Run 2), and the samples from the three runs are used in model reduction and neural network training.

³Actually, we have Run 0 and Run 3 for three weeks. The data from the third week do not enter the training phase, but the data assimilation is from day 8 to day 20.


Figure 5: An example of the amplitude and phase perturbed with colored noise. [Two panels over days 8-14: amplitude (meters) and phase (degrees), each showing the perturbed and original curves.]

Run 3 will be considered as the unknown ground truth, with noisy observations available. The goal of data assimilation is thus to recover the states in Run 3 by incorporating the observations into the dynamics we learned from Run 0, Run 1 and Run 2.

Observations used in data assimilation are given by three stations located as shown in Figure 6. Each station provides the measurement of the elevation at its horizontal location, and the salinity and velocity at each vertex in the vertical line. The number of valid observations is not constant: when a vertex is above the sea surface or below the sea bottom (a dry node), there is no valid observation on it. The average number of valid observations over the whole simulation process is around 50.

Figure 6: The white crosses are the locations of the data assimilation stations.

We perform PCA over data collected from Runs 0, 1 and 2, and keep the first 20 principal components, which retain over 98% of the variance. We train the linear surrogate (GLM) with a 12-hour history (both for the state $x^\pi$ and the force $u$). For the non-linear surrogate (MLP), we use a 24-hour history.

4.2 The Implementation of Kalman Filter

In data assimilation, we estimate not only the state $x^\pi$, but also the noise added to the amplitude and phase: $v^A$ and $v^\phi$. For the evolution of the subspace state, we have

$$x^\pi_k = f_{NN}(x^\pi_{k-1}, \cdots, x^\pi_{k-T^x}, \bar{u}^t_{k-1}, \cdots, \bar{u}^t_{k-T^x}) + w^x_{k-1}. \qquad (68)$$

The generation of the colored noise on amplitude and phase is approximated with an auto-regressive (AR) model,⁴ as a special case of equation (29). The noise model can be written as:

$$v^A_k = \sum_{i=1}^{T^A} a^A_i v^A_{k-i} + w^A_{k-1} \qquad (69)$$

$$v^\phi_k = \sum_{i=1}^{T^\phi} a^\phi_i v^\phi_{k-i} + w^\phi_{k-1}. \qquad (70)$$

The AR coefficients $a^A_i$ and $a^\phi_i$ are estimated from the colored noise samples from Runs 1 and 2.

As shown in section 3.2, we can write the dynamic state-space model in the following concise form:

$$\mathbf{x}_k = \mathbf{f}_k(\mathbf{x}_{k-1}) + \mathbf{w}^p_{k-1} \qquad (71)$$

$$y_k = \mathbf{H}\mathbf{x}_k + \mathbf{w}^o_k, \qquad (72)$$

with extended state vector

$$\mathbf{x}_k = [(x^\pi_k)^T, \cdots, (x^\pi_{k-T^x+1})^T, v^A_k, \cdots, v^A_{k-T^A+1}, v^\phi_k, \cdots, v^\phi_{k-T^\phi+1}]^T,$$

and extended process noise vector

$$\mathbf{w}^p_k = [(w^x_k)^T, 0, \cdots, 0, w^A_k, 0, \cdots, 0, w^\phi_k, 0, \cdots, 0]^T.$$

For the linear surrogate, the dimension of the Kalman filter state $\mathbf{x}$ is 264, so the SPKF needs around 500 sigma-points; for the non-linear case, due to the longer history considered in the neural network predictor, we have a bigger state vector (528-dimensional) and thus more sigma-points (around 1000). We can obtain an alternative to the previously described colored noise model by assuming the noise on both the amplitude and the phase is white, which leads to

$$a^A_i = 0, \quad i = 1, 2, \ldots, T^A$$

$$a^\phi_i = 0, \quad i = 1, 2, \ldots, T^\phi$$

in equations (69) and (70).

⁴Note the noise was generated with an MA model and with a different time interval.


4.3 Tuning Model Noise Covariance with Validation Stations

The covariance of the process noise $w^x$ is estimated with equation (26). In reality, this estimate is often not the most appropriate description of the process noise. So instead of using the original $Q$, we use $\alpha Q$ as the covariance matrix and try to tune the shrinkage factor $\alpha$ according to some ground truth. For our benchmark problem, we have two more stations as validation stations, as shown in Figure 7. Using $H^V$ to denote the observation operator associated with those validation stations, the validation error at time $k$ is defined as

$$\|y^V_k - H^V \Pi^+ x^\pi_{k|k}\|^2,$$

where $y^V_k$ is the observation from the validation stations at time $k$. The value of $\alpha$ can then be tuned to achieve a smaller average validation error.
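A sketch of this tuning loop as a simple grid search; `run_filter` and `validation_error` are our own placeholders for the assimilation run with covariance $\alpha Q$ and for the error measure defined above:

```python
import numpy as np

def tune_shrinkage(alphas, run_filter, validation_error):
    """Grid search over the shrinkage factor alpha of section 4.3.
    run_filter(alpha) returns the state estimates obtained with alpha*Q;
    validation_error scores them against the validation stations."""
    scores = [np.mean(validation_error(run_filter(a))) for a in alphas]
    return alphas[int(np.argmin(scores))]
```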

Figure 7: Validation stations. The white crosses are the locations of the data assimilation stations; the black crosses are the locations of the validation stations.

It is not surprising that the $\alpha$ that gives the minimum validation error does not necessarily lead to the minimum DA error. In evaluating a given Kalman filter setting, we therefore report the DA error with $\alpha$ tuned on the validation stations, as well as the minimum DA error achieved with the optimal $\alpha$.

4.4 Comparison of Different Data Assimilation Settings

As discussed in section 3.1, we can choose either a linear surrogate or a non-linear (neural network) surrogate in data assimilation. The linear predictor has the marginal advantage of faster propagation but often leads to an unstable system. In this benchmark problem, the trained linear model has multiple poles with real part greater than one, which indicates the predictor is potentially unstable. Those unstable components, if not sufficiently suppressed by the observations during the analysis step, can lead to quickly diverging state estimates. Another choice we need to make is the model of the noise added to $A$ and $\phi$. We may consider the correlation of the noise in the time domain and model it as colored noise. When the temporal correlation is not well justified or hard to capture, we can assume it to be white noise. In section 3.2, we provide the Kalman filter equations under the two different noise assumptions. Intuitively, the colored noise model is closer to the physical reality we simulated in the benchmark problem and, if modeled properly, should yield a better data assimilation result.


Next we compare four data assimilation settings with different types of surrogate and noise model. As suggested in section 3.4, the data assimilation performance at time $k$ can be evaluated by the DA error, which is defined as the square error between the true state $x_k$ and the data-assimilated estimate $x_{k|k} = \Pi^+ x^\pi_{k|k}$. As shown in equations (60) and (61), the DA error is the sum of two parts: the reconstruction error $\|x_k - \Pi^+\Pi x_k\|^2$ and the subspace DA error $\|\Pi x_k - x^\pi_{k|k}\|^2$. Since all the data assimilation models in this paper share the same PCA subspace $\mathcal{X}_S$, they yield the same reconstruction error. Therefore it is enough to compare their subspace DA errors. We consider the subspace DA error averaged over day 8 to day 20:

$$\frac{1}{k_1 - k_0 + 1} \sum_{k=k_0}^{k_1} \|\Pi x_k - x^\pi_{k|k}\|^2,$$

where $k_0$ and $k_1$ are respectively the times at which data assimilation starts and ends. As suggested in section 4.3, for each data assimilation setting, we estimate two shrinkage factors: one based on the DA error and one based on the validation error.

Table 1 summarizes the performance of the different data assimilation settings. From Table 1, the comparison between the linear and the non-linear surrogate is relatively simple. Apparently the non-linear surrogate gives a substantially smaller subspace DA error than the linear surrogate, regardless of whether the noise model is colored or white. Moreover, the best shrinkage factor $\alpha$ according to the validation stations ($= 1$) and the true optimal $\alpha$ ($= 0.3$) suggest a less radical change of the original process noise covariance. The comparison of the noise model settings, however, is more complicated. For the linear surrogate, the colored noise model is apparently better than the white noise model, while for the non-linear model this superiority is marginal. We speculate that when using the non-linear surrogate, the Kalman filter relies less on the estimation of the colored noise than it does with the linear surrogate. To get a better understanding of this, we examine the noise estimates with the linear and non-linear surrogates. Figure 8 shows the estimated noise⁵ on both amplitude and phase for the two different types of surrogate with optimal $\alpha$. Apparently, the estimate given by the non-linear surrogate is significantly worse than that given by the linear surrogate.

method        sf (min val)   mse (min val)      sf (min mse)   min mse
GLM (white)   ≈ 0            637.97 (0.0066)    ≈ 0            637.97 (0.0066)
GLM (color)   8 × 10⁻³       335.35 (0.0035)    5 × 10⁻⁴       281.85 (0.0029)
MLP (white)   1              256.39 (0.0027)    0.3            229.86 (0.0024)
MLP (color)   1              256.29 (0.0027)    0.3            229.39 (0.0024)

Table 1: Comparison of different settings. The column "sf (min val)" is the shrinkage factor $\alpha$ minimizing the validation error; the column "mse (min val)" is the subspace DA error achieved with the $\alpha$ of least validation error; the column "sf (min mse)" is the shrinkage factor $\alpha$ minimizing the DA error; the column "min mse" is the least subspace DA error achieved with the optimal $\alpha$. The numbers in parentheses are the normalized subspace error, i.e., the subspace DA error divided by the state variance.

⁵Like the subspace state $x^\pi$, we can obtain estimates of the noise with different fixed-lag smoothing. Here we show the estimate with a 12-hour lag.


Figure 8: Colored noise estimation with the linear and non-linear surrogates. [Two panels over days 8-20: amplitude noise (meters) and phase noise (degrees), comparing the true noise with the GLM and MLP estimates.]

4.5 Data Assimilation Results

In this section, we give a detailed analysis of the data assimilation results given by the non-linear surrogate and the colored noise model. First of all, we need to show that this data assimilation is actually better than our best guess without observations, the reference run (Run 0). To show this, we measure the square error between the reference run state and the true state at each time step $k$: $\|x_k - \tilde{x}_k\|^2$, called the reference error, where $x$ and $\tilde{x}$ stand respectively for the true state (from Run 3) and the reference state (from Run 0). The reference error thus provides a baseline for comparison to the data assimilation results.

For evaluating the data assimilation performance at time $k$, we considered the DA error $\|x_k - x_{k|k}\|^2$, the reconstruction error $\|x_k - \pi^T\pi x_k\|^2$ and the subspace DA error $\|\pi x_k - x^\pi_{k|k}\|^2$. Figure 9 shows the time series of the reference error, the DA error, the subspace DA error and the reconstruction error. It is clear from Figure 9 that data assimilation achieves a significantly reduced error compared to the reference run. Indeed, the average reference error over the two-week run is 1665.2 while the DA error is 465.06. Another important observation is that the reconstruction error is a substantial part of the DA error: the average reconstruction error is 243.6, over half of the average DA error. Also, we notice that there is an apparent correlation between the reconstruction error and the DA error in the subspace. This is understandable since, as established in section 2.4, a higher reconstruction error at time $k$ usually means larger observation noise $w^o_k$, and thus poorer recovery of the state $x_k$.

Figure 9: Time series of the data assimilation result. We show the square root of the errors. The green curve is the reference error; the red curve is the DA error; the blue curve is the subspace DA error; the black curve is the reconstruction error.

Recalling that each full-space state consists of measurements of elevation, salinity and velocity, we find it meaningful to discuss the data assimilation results for the three variables separately. Take salinity for example: we can simply compare the entries corresponding to salinity in $x_k$ with those entries in $x_{k|k}$. The square difference will be called the DA error on salinity. In the same way, we can also calculate the reference error for the three variables. We find that data assimilation achieves a significant reduction of error on elevation and velocity, while the salinity measurement does not benefit much from the data assimilation. In fact, the averaged DA error on salinity (156.11) is about the same as the averaged reference error (158.96). The reason for this inutility may partially lie in the fact that we lost too much information on salinity in the dimension reduction. In fact, around 1.7% of the variance of salinity is lost in dimension reduction, while for elevation and velocity the figures are 0.1% and 0.2% respectively. This high reconstruction error on salinity has two consequences. First, we may not be able to model the dynamics of salinity in the subspace as accurately as those of elevation and velocity. Second, the observations may not provide enough information for salinity, due to the high observation noise on the salinity variables.

and those entries inxk|k. The square difference will be called DA error on salinity. In thesame way, we can also calculate the reference error for the three variables. We find that dataassimilation achieves a significant reduction of error on elevation and velocity, while salinitymeasurement does not benefit much from the data assimilation. In fact, the averaged DA erroron salinity (156.11) is about the same with averaged reference error (158.96). The reason of thisinutility may partially lie in the fact we lost too much information on salinity in the dimensionreduction. In fact around1.7% of variance of salinity is lost in dimension reduction, while forelevation and velocity, the figure is0.1% and0.2% respectively. This high reconstruction erroron salinity has two consequences. First, we may not be able tomodel the dynamic of salinityaccurately in subspace, comparing to elevation and velocity. Second, the observation may notprovide enough information for salinity due to the high observation noise on salinity variables.

For comparison to SPKF, we also tried ensemble Kalman filter (EnKF) (Evensen, 2003),which is widely used in data assimilation for oceanography,and particle filter (Arulampalamet al., 2002). Unlike previous work (Lisaeter et al., 2003; Evensen, 2002), in which EnKF isused in full space, we apply EnKF in the subspace due to the extremely high dimension of thesystem state. We set the number of ensemble states to be the same with the number of statesin SPKF, so the two types of Kalman filter yield roughly the same computation burden. Thedata assimilation result given by EnKF is usually slightly worse than that given by SPKF. It isreasonable to assume this difference is due to SPKF’s bettersampling strategy.

Particle filters do not work well for our problem. The reason,as we speculate, is the highdimensionality of Kalman filter state (264 for the linear surrogate and 528 for the non-linearsurrogate). Since each particle follows a very high dimensional Gaussian distribution, the massof posterior probability can easily concentrate on one or two particles, which is what we ob-served in practice. This situation can be ameliorated by using a better sampling strategy, whichis worth further exploration.


Figure 10: Time series of DA error for the different variables (square root of errors shown; three panels: errors in salinity, elevation and velocity, each comparing the reference and analysis errors). On elevation, the average reference error is 91.14 and the average DA error is 6.03; on salinity, the average reference error is 108.01 and the average DA error is 110.56; on velocity, the average reference error is 1446.1 and the average DA error is 348.47.

5 Conclusion and Future Research

We have built a framework for sequential data assimilation for oceanographic systems based on the Kalman filter. Our approach starts with a greatly reduced model obtained via principal component analysis and an ultra-fast neural network surrogate trained to mimic the reduced model. We propose a probabilistic latent state space interpretation and, based on it, give a modified observation form. We employ the Sigma-point Kalman filter, a state-of-the-art technique, to handle the non-linearity in the system dynamics. The experiments on a benchmark estuary problem show that our data assimilation method can significantly reduce the error of model prediction by incorporating observations from sparsely located stations.

Despite the promising results on the benchmark problem, this technique needs to be tested on real-world problems, which may include more complex dynamics and even larger model dimensions. Moreover, we may need to face several issues not encountered in the benchmark problem. For example, our numerical model may be substantially biased from the truth, or the noise on the driving force may be unknown and difficult to estimate.

Several important components of the current method can be further improved. Firstly, in the model reduction phase, we may want to use non-linear dimension reduction approaches to achieve a smaller reconstruction error with a fixed subspace dimension. Available techniques include the alignment of local PCA models (Teh and Roweis, 2003) and the regularized principal manifold (Smola et al., 2001), both of which can provide a coordinate system on the low-dimensional manifold. Secondly, no effort is currently made to preserve the dynamics of the original system. Important components of the system dynamics, if not of high variance themselves, are likely to be discarded during the PCA. As a result, in the reduced system after projection, although the states are preserved with high fidelity (in terms of mean squared error), the dynamics are hard to recover due to the lack of certain important components. One symptom of this is the high prediction error of the neural network surrogate. A more sensible dimension reduction technique would seek a good tradeoff between the reconstruction error of the states and the preservation of the dynamics.
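One possible form of such a tradeoff, offered purely as an illustrative sketch (the weighting `gamma` and the one-step-increment criterion are our assumptions here, not part of the method above):

```python
import numpy as np

def basis_score(X_t, X_tp1, basis, gamma=1.0):
    """Heuristic score for a candidate reduced basis: reconstruction error of
    the states plus a penalty for the part of the dynamics the subspace
    cannot represent.

    X_t, X_tp1: (N, m) states at consecutive times
    basis:      (d, m) candidate basis with orthonormal rows
    gamma:      weight on dynamics preservation vs. reconstruction fidelity
    """
    P = basis.T @ basis                              # projector onto the subspace
    recon_err = np.mean(np.sum((X_t - X_t @ P)**2, axis=1))
    dX = X_tp1 - X_t                                 # one-step dynamics increments
    dyn_err = np.mean(np.sum((dX - dX @ P)**2, axis=1))
    return recon_err + gamma * dyn_err
```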

Another issue ignored in the current data assimilation is the conservation of important physical quantities (mass, salinity, etc.). Intuitively, the conservation laws should be enforced in each stage of data assimilation, which includes the dimension reduction, the surrogate prediction and the Kalman filtering. We generally do not expect those processes to satisfy the conservation laws automatically. Indeed, preliminary research indicates (not surprisingly) that at least salinity is not conserved even in the dimension reduction stage.
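Such violations can be quantified directly. The sketch below measures the drift in total salt content introduced by the projection alone; `vol_weights` (cell volumes used to integrate salinity over the domain) and `sal_idx` (indices of the salinity entries) are hypothetical names.

```python
import numpy as np

def conservation_drift(x, basis, sal_idx, vol_weights):
    """Relative change in total salt content caused by the PCA projection
    alone, before any surrogate prediction or filtering.

    x:           (m,) full model state
    basis:       (d, m) PCA basis with orthonormal rows
    sal_idx:     indices of the salinity entries in the state
    vol_weights: (len(sal_idx),) cell volumes for the domain integral
    """
    x_proj = basis.T @ (basis @ x)                   # project onto subspace and back
    before = vol_weights @ x[sal_idx]
    after = vol_weights @ x_proj[sal_idx]
    return (after - before) / before
```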

Acknowledgements

This work was funded by NSF grant OCI-0121475. We thank Joseph Zhang for helpful discussions and for the preparation of simulation results.

References

Arulampalam, S., Maskell, S., Gordon, N., and Clapp, T. (2002). A tutorial on particle filters for on-line nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2).

Bennett, A. (1998). Inverse Modeling of the Ocean and Atmosphere. Cambridge University Press.

Berkooz, G., Holmes, P., and Lumley, J. (1993). The proper orthogonal decomposition in the analysis of turbulent flows. Annual Review of Fluid Mechanics, pages 539–575.

Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Bishop, C., Haynes, P., Smith, M., Todd, T., and Trotman, D. (1995). Real-time control of tokamak plasma using neural networks. Neural Computation, 7(1):206–217.

Cane, M., Kaplan, A., Miller, R., Tang, B., Hackert, E., and Busalacchi, A. (1996). Mapping tropical Pacific sea level: Data assimilation via a reduced state space Kalman filter. Journal of Geophysical Research, 101(C10):22,599–22,617.

Chui, C. and Chen, G. (1999). Kalman Filtering for Real Time Applications. Springer-Verlag.

Evensen, G. (2002). Sequential Data Assimilation for Nonlinear Dynamics: The Ensemble Kalman Filter. Springer-Verlag, Berlin Heidelberg.

Evensen, G. (2003). The ensemble Kalman filter: Theoretical formulation and practical implementation. Ocean Dynamics, 53:343–367.

Gibson, J., Koo, B., and Gray, S. (1991). Filtering of colored noise for speech enhancement and coding. IEEE Transactions on Signal Processing, 39(8).

Golub, G. and van Loan, C. (1996). Matrix Computations, third edition. The Johns Hopkins University Press, London.

Grzeszczuk, R., Terzopoulos, D., and Hinton, G. (1998). Fast neural network emulation of dynamical systems for computer animation. Advances in Neural Information Processing Systems, 11:882–888.

Heemink, A., Verlaan, M., and Segers, A. (2001). Variance reduced ensemble Kalman filter. Monthly Weather Review, 129:1718–1728.

Hoteit, I. and Pham, D. (2003). Evolution of reduced state space and data assimilation schemes based on the Kalman filter. Journal of the Meteorological Society of Japan, 81:21–39.

Hoteit, I., Pham, D., and Blum, J. (2001). A semi-evolutive partially local filter for data assimilation. Marine Pollution Bulletin, 43:164–174.

Hoteit, I., Pham, D., and Blum, J. (2002). A simplified reduced order Kalman filtering and application to altimetric data assimilation. Journal of Marine Systems, 36:101–127.

Jolliffe, I. (1986). Principal Component Analysis. Springer-Verlag.

Lisaeter, K., Rosanova, J., and Evensen, G. (2003). Assimilation of ice concentration in a coupled ice-ocean model, using the ensemble Kalman filter. Ocean Dynamics, 53:368–388.

Pham, D., Verron, J., and Roubaud, M. (1998). A singular evolutive extended Kalman filter for data assimilation in oceanography. Journal of Marine Systems, 16:323–340.

Principe, C., Rathie, A., and Kuo, J. (1992). Prediction of chaotic time series with neural networks and the issue of dynamic modeling. International Journal of Bifurcation and Chaos, 2(4):989–996.

Sirovich, L. (1987). Turbulence and the dynamics of coherent structures. Part I: Coherent structures. Quarterly of Applied Mathematics, 45(3):561–571.

Smola, A., Mika, S., Schölkopf, B., and Williamson, R. (2001). Regularized principal manifolds. Journal of Machine Learning Research, 1:179–209.

Teh, Y. and Roweis, S. (2003). Automatic alignment of local representations. In Advances in Neural Information Processing Systems, volume 15.

Tipping, M. and Bishop, C. (1997). Probabilistic principal component analysis. Technical Report NCRG/97/010, Neural Computing Research Group, Aston University.

van der Merwe, R. and Leen, T. (2005). Fast neural network surrogates for very high dimensional physics-based models in computational oceanography. In preparation.

van der Merwe, R. and Wan, E. (2003). Sigma-point Kalman filters for probabilistic inference in dynamic state-space models. In Proceedings of the Workshop on Advances in Machine Learning.

van der Merwe, R. and Wan, E. (2004). Sigma-point Kalman filters for integrated navigation. In Proceedings of the 60th Annual Meeting of The Institute of Navigation.

Verlaan, M. and Heemink, A. (2001). Tidal flow forecasting using reduced rank square root filters. Stochastic Hydrology and Hydraulics, 11:346–368.

Zhang, Y. and Baptista, A. (2005). A semi-implicit finite element ocean circulation model. Part I: Formulations and benchmarks. International Journal for Numerical Methods in Fluids.

Zhang, Y., Baptista, A., and Myers, E. (2004). A cross-scale model for 3D baroclinic circulation in estuary-plume-shelf systems: I. Formulations and skill assessment. Continental Shelf Research, 24:2187–2214.

Appendix: Estimation of cov(Hε)

From equation (2),
\[
\mathrm{cov}(\epsilon) = E_{x,x^s}\,(x - x^s)(x - x^s)^T.
\]

Here we define the projection operator $P$ (an $m \times m$ matrix) for the subspace $\mathcal{X}^s$ by

• $\forall x \in F$: $Px \in \mathcal{X}^s$ and $(x - Px) \perp \mathcal{X}^s$;

• its connection to the projection matrix $\pi$ (a $d \times m$ matrix) is $P = \pi^T \pi$.

We can rewrite $E_{x,x^s}(x - x^s)(x - x^s)^T$ as
\begin{align*}
E_{x,x^s}(x - x^s)(x - x^s)^T
&= E_{x,x^s}\big((x^s - Px) + (Px - x)\big)\big((x^s - Px) + (Px - x)\big)^T \\
&= E_{x,x^s}(x^s - Px)(x^s - Px)^T + E_{x,x^s}(x^s - Px)(Px - x)^T \\
&\quad + E_{x,x^s}(Px - x)(x^s - Px)^T + E_{x,x^s}(Px - x)(Px - x)^T.
\end{align*}

Proof that $E_{x,x^s}(Px - x)(x^s - Px)^T = 0$

It is easy to see that
\begin{align*}
E_{x,x^s}(Px - x)(x^s - Px)^T
&= E_{x^s} E_{x|x^s}(Px - x)(x^s - Px)^T \\
&= E_{x^s} \int_{\mathcal{X}} P(x|x^s)\,(Px - x)(x^s - Px)^T\, dx.
\end{align*}
Since $x^s - Px \in \mathcal{X}^s$ and $Px - x \perp \mathcal{X}^s$, there is an orthogonal matrix $U$ ($UU^T = I$) such that
\[
x^s - Px = U\,[z_1, \ldots, z_d, 0, \ldots, 0]^T
\quad \text{and} \quad
Px - x = U\,[0, \ldots, 0, z_{d+1}, \ldots, z_m]^T.
\]
Let $z = [z_1, z_2, \ldots, z_m]^T = U^T(x^s - x)$. Changing variables from $x$ to $z$ (using $|\det U| = 1$),
\[
\int_{\mathcal{X}} P(x|x^s)(Px - x)(x^s - Px)^T\, dx
= U \int_{\mathbb{R}^m} P(z)
\begin{bmatrix}
0 & \cdots & 0 & z_1 z_{d+1} & \cdots & z_1 z_m \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & z_d z_{d+1} & \cdots & z_d z_m \\
z_{d+1} z_1 & \cdots & z_{d+1} z_d & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
z_m z_1 & \cdots & z_m z_d & 0 & \cdots & 0
\end{bmatrix}
dz\; U^T,
\]
where
\[
P(x|x^s) = P(z) \propto \exp\!\left(-\frac{z_1^2 + z_2^2 + \cdots + z_m^2}{2\sigma^2}\right).
\]
It is easy to prove that
\[
\int_{\mathbb{R}^m} P(z)\, z_i z_j\, dz = 0 \quad \text{if } i \neq j.
\]
It is then obvious that
\[
\int_{\mathcal{X}} P(x|x^s)(Px - x)(x^s - Px)^T\, dx = 0.
\]

So we have
\[
E_{x,x^s}(x^s - x)(x^s - x)^T = E_{x,x^s}(x^s - Px)(x^s - Px)^T + E_{x,x^s}(Px - x)(Px - x)^T.
\]

First term $E_{x,x^s}(x^s - Px)(x^s - Px)^T$

\begin{align*}
E_{x,x^s}(x^s - Px)(x^s - Px)^T
&= E_{x^s} E_{x|x^s}(x^s - Px)(x^s - Px)^T \\
&= E_{x^s} \int_{\mathcal{X}} P(x|x^s)(x^s - Px)(x^s - Px)^T\, dx.
\end{align*}
Using the same orthogonal transformation introduced previously,
\[
x^s - Px = U\,[z_1, \ldots, z_d, 0, \ldots, 0]^T,
\]
we have
\[
\int_{\mathcal{X}} P(x|x^s)(x^s - Px)(x^s - Px)^T\, dx
= U \int_{\mathbb{R}^m} P(z)
\begin{bmatrix}
z_1 z_1 & \cdots & z_1 z_d & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
z_d z_1 & \cdots & z_d z_d & 0 & \cdots & 0 \\
0 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & 0 & \cdots & 0
\end{bmatrix}
dz\; U^T
= U
\begin{bmatrix}
\sigma^2 I_{d \times d} & 0 \\
0 & 0
\end{bmatrix}
U^T.
\]
We then have
\[
E_{x,x^s}(x^s - Px)(x^s - Px)^T
= E_{x^s}\, U
\begin{bmatrix}
\sigma^2 I_{d \times d} & 0 \\
0 & 0
\end{bmatrix}
U^T
= U
\begin{bmatrix}
\sigma^2 I_{d \times d} & 0 \\
0 & 0
\end{bmatrix}
U^T.
\]

Second term $E_{x,x^s}(Px - x)(Px - x)^T$

It is obvious that
\[
E_{x,x^s}(Px - x)(Px - x)^T
= E_x(Px - x)(Px - x)^T
\approx \frac{1}{N} \sum_{i=1}^{N} (Px_i - x_i)(Px_i - x_i)^T.
\]
We then have
\[
\mathrm{cov}(\epsilon) \approx U
\begin{bmatrix}
\sigma^2 I_{d \times d} & 0 \\
0 & 0
\end{bmatrix}
U^T
+ \frac{1}{N} \sum_{i=1}^{N} (Px_i - x_i)(Px_i - x_i)^T. \tag{73}
\]

From equations (13) and (14), the covariance of the observation noise is
\[
\mathrm{cov}(w_t^o) \approx R + HU
\begin{bmatrix}
\sigma^2 I_{d \times d} & 0 \\
0 & 0
\end{bmatrix}
U^T H^T
+ H\, \frac{1}{N} \sum_{i=1}^{N} (Px_i - x_i)(Px_i - x_i)^T H^T, \tag{74}
\]
where
\[
H\, \frac{1}{N} \sum_{i=1}^{N} (Px_i - x_i)(Px_i - x_i)^T H^T
= \frac{1}{N} \sum_{i=1}^{N} (HPx_i - Hx_i)(HPx_i - Hx_i)^T
\]
is very easy to calculate.

Estimating σ

We know
\begin{align*}
E_{x,x^s}(Px - x)^T(Px - x)
&= E_{x^s} E_{x|x^s}(Px - x)^T(Px - x) \\
&= E_{x^s} \int_{\mathcal{X}} P(x|x^s)(x - Px)^T(x - Px)\, dx.
\end{align*}
Using the same orthogonal transformation introduced previously,
\[
Px - x = U\,[0, \ldots, 0, z_{d+1}, \ldots, z_m]^T,
\]
we have
\begin{align*}
\int_{\mathcal{X}} P(x|x^s)(Px - x)^T(Px - x)\, dx
&= \int_{\mathbb{R}^m} P(z)\,[0, \ldots, 0, z_{d+1}, \ldots, z_m]\, U^T U\, [0, \ldots, 0, z_{d+1}, \ldots, z_m]^T\, dz \\
&= \int_{\mathbb{R}^m} P(z)\,(z_{d+1}^2 + z_{d+2}^2 + \cdots + z_m^2)\, dz \\
&= (m - d)\,\sigma^2.
\end{align*}
We then have
\[
E_{x,x^s}(Px - x)^T(Px - x)
= E_{x^s} \int_{\mathcal{X}} P(x|x^s)(x - Px)^T(x - Px)\, dx
= E_{x^s}\,(m - d)\,\sigma^2 = (m - d)\,\sigma^2.
\]

On the other hand,
\[
E_{x,x^s}(Px - x)^T(Px - x) \approx \frac{1}{N} \sum_{i=1}^{N} (Px_i - x_i)^T(Px_i - x_i).
\]
We then get the estimate of σ:
\[
\sigma^2 = \frac{1}{N(m - d)} \sum_{i=1}^{N} (Px_i - x_i)^T(Px_i - x_i).
\]
In our case, $m - d$ is a huge number ($\approx 2 \times 10^5$). As a result,
\[
\sigma^2 = \frac{1}{N(m - d)} \sum_{i=1}^{N} (Px_i - x_i)^T(Px_i - x_i) \approx 0.
\]
In practice, we therefore use the approximation
\[
\mathrm{cov}(w_t^o) \approx R + H\, \frac{1}{N} \sum_{i=1}^{N} (Px_i - x_i)(Px_i - x_i)^T H^T. \tag{75}
\]
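A minimal sketch of how equation (75) can be evaluated from a snapshot archive, assuming the projection matrix $\pi$ has orthonormal rows so that $P = \pi^T \pi$; all names are illustrative.

```python
import numpy as np

def obs_noise_cov(X, basis, H, R):
    """Estimate cov(w_t^o) per equation (75): R plus the observation-space
    covariance of the PCA truncation error.

    X:     (N, m) snapshot matrix of full states x_i
    basis: (d, m) projection matrix pi with orthonormal rows
    H:     (p, m) observation matrix
    R:     (p, p) instrument-noise covariance
    """
    E = X - X @ basis.T @ basis      # rows are (x_i - P x_i); sign is irrelevant
    HE = E @ H.T                     # (N, p): residuals mapped to observation space
    return R + HE.T @ HE / X.shape[0]
```

Note that, as remarked after equation (74), only the observation-space residuals $HPx_i - Hx_i$ are ever formed, so the $m \times m$ outer products never need to be built explicitly.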