
Ocean Dynamics (2016) 66:691–702. DOI 10.1007/s10236-016-0945-z

OpenDA-NEMO framework for ocean data assimilation

Nils van Velzen 1 · Muhammad Umer Altaf 1 · Martin Verlaan 1

Received: 29 September 2015 / Accepted: 2 March 2016 / Published online: 22 March 2016
© The Author(s) 2016. This article is published with open access at Springerlink.com

Abstract Data assimilation methods provide a means to handle the modeling errors and uncertainties in sophisticated ocean models. In this study, we have created an OpenDA-NEMO framework unlocking the data assimilation tools available in OpenDA for use with NEMO models. This includes data assimilation methods, automatic parallelization, and a recently implemented automatic localization algorithm that removes spurious correlations in the model based on uncertainties in the computed Kalman gain matrix. We have set up a twin experiment where we assimilate sea surface height (SSH) satellite measurements. From the experiments, we can conclude that the OpenDA-NEMO framework performs as expected and that the automatic localization significantly improves the performance of the data assimilation algorithm by successfully removing spurious correlations. Based on these results, it looks promising to extend the framework with new kinds of observations and to work on improving the computational speed of the automatic localization technique such that it becomes feasible to include a large number of observations.

Keywords NEMO · OpenDA · Data assimilation · Localization techniques · Double-gyre ocean model

Responsible Editor: Lars Nerger

This article is part of the Topical Collection on the 47th International Liège Colloquium on Ocean Dynamics, Liège, Belgium, 4–8 May 2015

Nils van Velzen, [email protected]

1 Delft University of Technology, Delft, The Netherlands

1 Introduction

With present-day computational resources, it is possible to build sophisticated ocean models. There are always unavoidable inaccuracies in the observations and in the models; therefore, a realistic oceanographic system must be capable of modeling these uncertainties. Data assimilation methods provide a good way of reducing uncertainties in ocean dynamic systems.

Data assimilation (DA) is usually used to enhance model predictions by constraining their outputs with available observations. DA methods are usually divided into two main categories: variational methods based on least-squares fitting (Bennet 1992), and sequential methods based on the so-called Kalman filter (Evensen 2003). Recently, sequential ensemble Kalman filter methods have become popular in geophysical applications as they are easy to implement, robust, and computationally reasonable. Different ensemble Kalman filter (EnKF) variants were developed in recent years depending on whether or not the observations are perturbed before assimilation (Tippett et al. 2003; Burgers et al. 1998; Verlaan and Heemink 1997; Houtekamer and Mitchell 1998; Anderson 2001; Bishop et al. 2001; Whitaker and Hamill 2002; Hoteit et al. 2002).

EnKFs are often used with relatively small ensembles compared to the numerical dimension of the system state (e.g., ensembles are often on the order of O(10–100), while the dimension of the state vector of an ocean model can easily be in the millions). This means that the estimation error covariance matrix is always singular and can only represent the model errors in a low-dimensional space. A small ensemble size can lead to systematically underestimated variances and spuriously large cross covariances in the sample error covariance matrix (Hamill et al. 2001). In order to negate these effects, two auxiliary techniques


are used. These are covariance inflation (Anderson and Anderson 1999) and localization (Hamill et al. 2001). Covariance inflation tackles the underestimation of the variances, while covariance localization addresses the problems of singularity and overestimation of the cross covariances.

Implementing a correct and efficient DA method for a model can be an elaborate task. There are various software packages that provide standard implementations and take away the need to develop custom implementations for each simulation model. All these packages have their strong and weak points, and the package that suits a particular application best depends on many factors. Among these packages are PDAF (Nerger and Hiller 2013), a toolbox written in FORTRAN 90 that contains highly optimized implementations of LETKF and LSEIK for usage in strongly parallel environments. Sequoia (Le Henaff and Marsaleix 2009) contains a collection of FORTRAN 95 routines that allow users to build their own data assimilation system. DART (Raeder et al. 2012) is a framework containing ensemble-based data assimilation algorithms and tools. The NERSC repository contains implementations of various EnKFs. OAK (Barth et al. 2008) is a toolbox containing various filters and tools that interact with the model using NetCDF files. SESAM (Brasseur 2006) is a set of data assimilation building blocks that are combined using NetCDF files. OpenDA is a data assimilation framework written in Java, with some parts in C. Models can be coupled to OpenDA in two ways: an in-memory coupling through a software library, or a black box coupling in which all interaction between the data assimilation method and the model goes through input and output files. OpenDA is described in more detail in Section 4. A thorough description of all the frameworks mentioned above can be found in Heemink et al. (2012).

In this work, we generalize the OpenDA framework to the specific needs of ocean applications. We have coupled the NEMO (Madec 2014) ocean model to the OpenDA toolbox using the black box approach. Several synthetic experiments, assimilating simulated altimetry into a double-gyre NEMO ocean model, are performed with the objective to investigate the impact of different EnKF setups on the quality of the analysis. These setups mainly differ in the observation distribution, the ensemble size, and the localization length scale. Furthermore, attention is paid to investigating the efficiency and usefulness of localization. Presently, OpenDA only supports distance-based localization for in-memory coupling. The use of the automatic localization technique, which is part of the OpenDA toolbox, is not limited to in-memory coupled models and can easily be applied to the NEMO model using the black box coupling. The development of automatic localization techniques is still new and limited to a few studies, e.g., (Anderson and Anderson 1999; Zhang and Oliver 2011).

Of these, only the latter is suitable for operational usage due to its relatively limited additional computational costs. It is interesting to investigate whether automatic localization is able to improve the performance of the EnKF when it is applied to a medium-sized model like the double-gyre NEMO configuration.

The double-gyre NEMO ocean model corresponds to an idealized configuration of the NEMO model: a square, 5000-m-deep flat bottom ocean at mid latitudes (the so-called square-box or SQB configuration). The domain extends from 24° N to 44° N, over 30° in longitude (60°–30° W), with 11 vertical levels between 152 and 4613 m in depth. The minimum horizontal resolution of the model is 1/4°. This setup is known as the SEABASS configuration (Bouttier et al. 2012).

The paper is organized as follows. Section 2 presents an overview of the EnKF implemented in this study. Section 3 describes the automatic localization technique. A brief overview of the OpenDA data assimilation toolbox with its new features, parallelization of ensembles, and automatic localization is given in Section 4. The NEMO ocean model, its configuration, and the assimilation setup are explained in Section 5. In Section 6, assimilation results of the EnKF are presented and discussed. Concluding remarks follow in Section 7.

2 Ensemble Kalman filter

Consider the following discrete dynamical system

$$x^t_k = M_{k,k-1}\, x^t_{k-1} + \eta_k, \qquad (1)$$

where $x^t_k$ denotes the vector representing the true state of the system at time $k$, $M_{k,k-1}$ is the state transition operator that takes as input the state at time $k-1$ and outputs the state at time $k$, and $\eta_k$ is the system noise with covariance matrix $Q_k$. At time $k$, the observed data vector is given by

$$y_k = H_k x^t_k + \varepsilon_k. \qquad (2)$$

Here, $y_k$ is the vector with observed values and $\varepsilon_k$ is the observational noise with covariance matrix $R_k$. If both the dynamical and observation systems are linear, the state estimation problem is solved by the Kalman filter (Kalman 1960). The main procedures of the Kalman filter involve recursively computing the means and covariance matrices of the system state $x_k$ (conditioned on available observations), as outlined below.

– Forecast step: At time instant $(k-1)$, propagate the state mean (also called analysis) $x^a_{k-1}$ and the associated error covariance matrix $P^a_{k-1}$ to obtain the forecast state (also referred to as background) $x^f_k$ and the associated covariance $P^f_k$ at the next time $k$, respectively, as

$$x^f_k = M_{k,k-1}\, x^a_{k-1}, \qquad (3)$$
$$P^f_k = M_{k,k-1}\, P^a_{k-1}\, M^T_{k,k-1} + Q_k. \qquad (4)$$

– Analysis step: When a new observation $y_k$ becomes available at time $k$, update $x^f_k$ and $P^f_k$ to their analysis counterparts, $x^a_k$ and $P^a_k$, with the Kalman update formula:

$$x^a_k = x^f_k + K_k\,\bigl(y_k - H_k x^f_k\bigr), \qquad (5)$$
$$P^a_k = P^f_k - K_k H_k P^f_k, \qquad (6)$$
$$K_k = P^f_k H^T_k\,\bigl(H_k P^f_k H^T_k + R_k\bigr)^{-1}, \qquad (7)$$

with $K_k$ being the Kalman gain.

Geophysical fluid dynamical (and sometimes observational) models are nonlinear, and thus the conventional Kalman filter cannot be directly applied for data assimilation into these systems. To account for nonlinearities, modifications are introduced into the original Kalman filter using a Monte Carlo approach, usually referred to as ensemble Kalman filtering (EnKF) (Evensen 1994). The EnKF algorithm uses an ensemble of system states to represent the model error. The ensemble size $n$ is typically much smaller than the dimension $m_x$ of the state vector. Suppose that an $n$-member analysis ensemble $X^a_{k-1,i}$, $i = 1, 2, \ldots, n$, is available at the $(k-1)$th filtering step; we take the set $X^f_k$ as the forecast ensemble at instant $k$, and the forecast sample mean $x^f_k$ and covariance $P^f_k$ are given by

$$x^f_k = \frac{1}{n}\sum_{i=1}^{n} x^f_{k,i}; \qquad P^f_k = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(x^f_{k,i} - x^f_k\bigr)\bigl(x^f_{k,i} - x^f_k\bigr)^T. \qquad (8)$$

In practice, $P^f_k$ need not be calculated. The Kalman gain $K_k$ is then approximated by

$$K_k = P^f_k H^T_k\,\bigl(H_k P^f_k H^T_k + R_k\bigr)^{-1}. \qquad (9)$$

When the observation $y_k$ is made available, one updates each member of the forecast ensemble by

$$x^a_{k,i} = x^f_{k,i} + K_k\,\bigl(y^s_{k,i} - H_k x^f_{k,i}\bigr), \qquad i = 1, 2, \ldots, n, \qquad (10)$$

where $y^s_{k,i}$ are the “surrogate” (i.e., perturbed) observations generated by drawing $n$ samples from the normal distribution with mean $y_k$ and covariance $R_k$ (Burgers et al. 1998). Propagating $X^a_k$ forward to the next time instant, one starts a new assimilation cycle, and so on.
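To make the algorithm concrete, the following minimal NumPy sketch implements one stochastic analysis step corresponding to Eqs. 8–10 with perturbed observations; the array shapes, function name, and random-number handling are illustrative assumptions, not part of the OpenDA implementation.

```python
import numpy as np

def enkf_analysis(Xf, y, H, R, rng=None):
    """One stochastic EnKF analysis step with perturbed observations (Eqs. 8-10).

    Xf : (m, n) forecast ensemble, one state vector per column
    y  : (p,)   observation vector
    H  : (p, m) linear observation operator
    R  : (p, p) observation error covariance
    """
    rng = rng or np.random.default_rng()
    m, n = Xf.shape
    A = Xf - Xf.mean(axis=1, keepdims=True)      # ensemble anomalies
    HA = H @ A                                   # anomalies in observation space
    S = (HA @ HA.T) / (n - 1) + R                # H Pf H^T + R from the sample covariance (Eq. 8)
    PfHt = (A @ HA.T) / (n - 1)                  # Pf H^T from the sample covariance
    K = np.linalg.solve(S, PfHt.T).T             # Kalman gain (Eq. 9), using that S is symmetric
    # Perturbed ("surrogate") observations drawn from N(y, R) (Burgers et al. 1998)
    Ys = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R, size=n).T
    Xa = Xf + K @ (Ys - H @ Xf)                  # member-wise update (Eq. 10)
    return Xa, K
```

For a nonlinear observation operator, the product H @ Xf would be replaced by the ensemble of model-predicted observation equivalents; the structure of the update stays the same.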

Most of the computational time of an ensemble-based data assimilation method is dominated by the forward model simulations. The ensemble members can propagate the model state forward in time in parallel, but due to limited computational resources, we often still need to use relatively small ensemble sizes. The use of a small ensemble may lead to filter divergence due to spurious correlations and unrealistic updates of the ensemble members (Hamill et al. 2001). Distance-dependent localization techniques are a popular way to compensate for a small ensemble size. The distance-based localization techniques determine the weighting coefficients $\beta$ between model variables and measurement locations. These weights $\beta \in [0, 1]$ are determined based on the physical distance between the variables. Weights will be close to zero for elements with no correlation and near one for elements where we expect strong correlations. Localization can be applied directly to the covariance matrices or to the computed gain matrix. In this paper, we only consider the latter and update the model using a localized gain matrix according to

$$K_{Loc} = K \circ \beta_{xy}, \qquad (11)$$

where $K$ denotes the gain matrix, $K_{Loc}$ the localized gain matrix, and $\beta_{xy}$ the weighting coefficients between the observation locations and the model state elements.

The localization weights are often determined using a correlation function with local support that takes the distance as input. This function is non-zero for distances up to a given threshold, like the fifth-order piecewise rational function defined in Gaspari and Cohn (1999). Non-smooth functions assigning only 0 and 1 weights can also be used (Anderson and Anderson 1999), but these have been shown to be inferior (Anderson 2004). Although distance is assumed to be a good measure for correlation, distances cannot always be properly defined for some sources of observations (Anderson 2004), or the correlation scales can be time dependent. Hence, this method often needs quite some tuning in order to find the optimal distances.
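As an illustration of this distance-based approach, the sketch below evaluates the fifth-order piecewise rational function of Gaspari and Cohn (1999) and applies it to a gain matrix as in Eq. 11; the distance array and length scale are hypothetical inputs.

```python
import numpy as np

def gaspari_cohn(d, c):
    """Fifth-order piecewise rational correlation function of Gaspari and Cohn (1999).

    d : array of distances between observation and state locations
    c : localization length scale; weights vanish beyond 2*c
    """
    r = np.abs(d) / c
    w = np.zeros_like(r)
    inner = r <= 1.0
    outer = (r > 1.0) & (r <= 2.0)
    w[inner] = (-0.25 * r[inner]**5 + 0.5 * r[inner]**4 + 0.625 * r[inner]**3
                - 5.0 / 3.0 * r[inner]**2 + 1.0)
    w[outer] = (1.0 / 12.0 * r[outer]**5 - 0.5 * r[outer]**4 + 0.625 * r[outer]**3
                + 5.0 / 3.0 * r[outer]**2 - 5.0 * r[outer] + 4.0
                - 2.0 / (3.0 * r[outer]))
    return w

def localize_gain(K, dist, length_scale):
    """Apply K_loc = K * beta_xy (Eq. 11), element-wise, with distance-based weights.

    K    : (m, p) gain matrix (state elements x observations)
    dist : (m, p) physical distances between state elements and observations
    """
    return K * gaspari_cohn(dist, length_scale)
```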

3 Automatic localization

A step forward in fully generic data assimilation methods is to handle the localization in an automatic way, avoiding manual configuration, modeling, and tuning of the localization distances. Anderson (2004) suggests a hierarchical ensemble filter. This method tries to tackle the heuristic issues and determine localization weights on a theoretical basis. The idea is to split the ensemble members into $N_g$ groups of $N_e$ members each (an ensemble of ensembles). The gain matrices $K_j$, $j = 1, \ldots, N_g$, for all the groups are computed. The elements $k_{i,j}$ of the gain matrices are assumed to be randomly drawn from the same distribution that contains the true value of $k_i$. The localization weight $\alpha_i$ is selected such that it minimizes

$$\sqrt{\sum_{j=1}^{N_g}\;\sum_{k=1,\,k \neq j}^{N_g}\bigl(\alpha_i k_{i,k} - k_{i,j}\bigr)^2}. \qquad (12)$$

This method requires a large ensemble size and is therefore computationally inefficient compared to a traditional implementation using distance-based localization. However, it can be used to derive localization statistics, which can produce a high-quality localization strategy for production runs.

Another approach, suggested by Zhang and Oliver (2011), uses a bootstrap technique to identify spurious correlations in the gain matrix and then generates weights accordingly. From the original ensemble $x_i$, $i = 1, \ldots, N_e$, $N_B$ ensembles of size $N_e$ are generated by randomly drawing ensemble members from the original ensemble. A bootstrapped ensemble will contain some of the members of the original ensemble several times. From each bootstrapped ensemble, we generate a gain matrix $K_k$, $k \in [1, \ldots, N_B]$. For each element $k_{i,j}$ of the gain matrix we compute the estimator of the variance as

$$\sigma^2_{k_{i,j}} = \frac{\sum_{m=1}^{N_B}\bigl(k^*_{i,j,m} - \bar{k}_{i,j}\bigr)^2}{N_B}, \qquad (13)$$

where $\bar{K}$ denotes the mean of the bootstrapped gain matrices $K_k$. The squared ratio between the standard deviation and the mean is defined as

$$C^2_{v_{i,j}} = \frac{\sigma^2_{k_{i,j}}}{\bar{k}^2_{i,j}}. \qquad (14)$$

The localization weights are then defined by

$$\beta_{i,j} = \frac{1}{1 + \bigl(1 + 1/\sigma^2_{\alpha}\bigr)\, C^2_{v_{i,j}}}. \qquad (15)$$

The parameter $\sigma^2_{\alpha}$ can be used to balance between the deletion of spurious correlations and the possible deletion of real correlations. Zhang and Oliver (2010) investigated the best possible value and came up with an optimal value of $\sigma^2_{\alpha} = 0.36$.

This method can easily be incorporated into an existing implementation of an EnKF. The analysis step becomes roughly $N_B$ times more expensive computationally, which might be acceptable for assimilating a moderate amount of observations but is not practical for assimilating a large number of observations. A possible solution for assimilating large amounts of observations is to define a sufficiently large region around each observation such that it is safe to assume there is no correlation between the observation and the points outside this region. The automatic localization then only needs to be applied to the points inside this region.
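A minimal sketch of how the bootstrap weights of Eqs. 13–15 could be computed is given below. It assumes a helper `gain_func` that returns the Kalman gain for a given ensemble (for instance built along the lines of the EnKF sketch in Section 2); all names and the guard against division by zero are illustrative assumptions.

```python
import numpy as np

def bootstrap_localization_weights(Xf, H, R, gain_func, NB=50, sigma_alpha2=0.36,
                                   rng=None):
    """Automatic localization weights following Zhang and Oliver (2011), Eqs. 13-15.

    Xf        : (m, n) forecast ensemble
    gain_func : callable (Xf, H, R) -> (m, p) Kalman gain (assumed helper)
    NB        : number of bootstrap replicates
    """
    rng = rng or np.random.default_rng()
    m, n = Xf.shape
    K = gain_func(Xf, H, R)                      # gain of the original ensemble
    boots = np.empty((NB,) + K.shape)
    for b in range(NB):
        idx = rng.integers(0, n, size=n)         # resample members with replacement
        boots[b] = gain_func(Xf[:, idx], H, R)
    Kbar = boots.mean(axis=0)
    var = ((boots - Kbar) ** 2).mean(axis=0)     # Eq. 13
    cv2 = var / np.maximum(Kbar ** 2, 1e-30)     # Eq. 14, guarded against division by zero
    beta = 1.0 / (1.0 + cv2 * (1.0 + 1.0 / sigma_alpha2))   # Eq. 15
    return beta, K * beta                        # weights and localized gain
```

With $\sigma^2_{\alpha} = 0.36$, gain elements whose bootstrap coefficient of variation is large, i.e., elements that are poorly determined by the ensemble, are strongly damped, while well-determined elements are left almost unchanged.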

4 OpenDA data assimilation toolbox

OpenDA is an open-source toolbox for developing and applying data assimilation and calibration algorithms to dynamic models (OpenDA 2013). The design philosophy of OpenDA is to break down data assimilation into a set of building blocks using the principles of object-oriented programming. The OpenDA toolbox consists of two parts: a description of the building blocks with their interfaces, and a library of default implementations of those building blocks. It stems from the merger of two data assimilation initiatives: COSTA (van Velzen and Segers 2010; Altaf et al. 2009; Heemink et al. 2010; Altaf et al. 2011) and DATools (Weerts et al. 2010).

OpenDA is written in Java. The object-oriented nature of Java fits well with the modular design of OpenDA and allows easy extension of functionality. The system is set up in such a way that end users can set up experiments without any coding or compiling. While Java was historically considered slow, its performance has improved significantly through many improvements in the Java virtual machine, such as just-in-time compilation and adaptive optimization (Sestoft 2010). In order to get a performance similar to data assimilation algorithms written purely in languages like C and Fortran, some computationally demanding parts are written in C.

Once a model is prepared for OpenDA, it is possible to try out various off-the-shelf data assimilation algorithms without any further programming. It has been shown that the performance is close to that of dedicated implementations (van Velzen and Verlaan 2007), and OpenDA can furthermore automatically propagate the ensemble of models either serially or concurrently.

Fig. 1 OpenDA defines three building blocks: method (the data assimilation or calibration algorithms), observations (the stochastic observer for handling the observations), and the model


4.1 OpenDA components

As illustrated in Fig. 1, OpenDA defines three major building blocks: method (the data assimilation or calibration algorithms), observations (the stochastic observer for handling the observations), and the model. The end user can make arbitrary combinations of these building blocks to set up an experiment without requiring any programming. The system is configured using an XML schema where a user defines options regarding: the algorithm type (EnSR, EnKF, etc.) and ensemble size; the variables to assimilate and their grid location within the model; and the associated uncertainties of the model, forcing data, parameters, and observations. Using the default implementations, observations can be stored in various formats like NetCDF, CSV, NOOS, or in an SQL database. OpenDA can produce output in various formats such as ASCII, Matlab, Python, and NetCDF to easily analyze the filtering performance and to plot useful graphs.

4.2 Coupling a model to OpenDA

The OpenDA model interface has already been implemented for over 20 dynamic model codes (van Velzen et al. 2012). There are two approaches to link a model to OpenDA. The simplest, and often the best, choice is a so-called black box coupling. In a black box coupling, OpenDA interacts with the model only by writing model input files, running the model executable, and interpreting the model output files. The coupling can be realized without access or changes to the model code.

There are situations where the black box coupling is not the best choice, especially when the overhead of restarting the model is high or when the model lacks proper restarting functionality. In these situations, it is possible to set up an “in-memory coupling.” The model must be made available as a software library that implements a specified list of routines, called the OpenDA model interface. OpenDA supports in-memory coupling for models written in various programming languages, e.g., Fortran, C, C#, and Python. The in-memory coupling is the most efficient computationally and potentially unlocks all model data. This allows easy creation of complex interpolation operators and noise processes. The amount of work needed to create an in-memory coupling is very model dependent. It can be small when the model itself is available as a library with functionality to access the model state, e.g., models that implement the OpenMI interface (Gregersen et al. 2007; Gijsbers et al. 2010). In such an ideal situation, there is no need to have access to the model source code; writing some bridging code will be sufficient (Ridler et al. 2014). On the other hand, realizing an in-memory coupling can be a serious amount of work, especially for some large legacy simulation codes.

OpenDA contains a black box coupling tool that minimizes the effort needed to create a highly configurable black box coupling, as illustrated in Fig. 2. The user must write wrapper code for reading and writing the input and output files of the model. Depending on the type of file (input, output, or both), the wrapper code must be able to read and/or write the relevant data from/to the model files. This is the only programming effort required to realize the black box

Fig. 2 The OpenDA black box coupling tool simplifies coupling a model, and arbitrary models can be run in parallel. The user only needs to provide a small amount of model-specific code to read and write the model output and input files in order to couple the model. The parallelization is handled by an interface model that handles all parallel interaction between the data assimilation method and the models


coupling. All other configuration and specification is doneusing black box configuration files.

The black box configuration consists of three layers:

– Wrapper layer: contains the specification of the location of (template) input files, the names of the executables needed to perform a model run, and the input and output files that have to be processed using the wrapper code.

– Model layer: contains a list of the variables from the required input and output files. In this layer, it is possible to rename the IDs of the variables in order to ensure uniqueness.

– StochModel layer: here the model as presented to the data assimilation method is constructed. The state vector is composed by listing the associated variables and arrays, noise models on model forcings can be specified, and information is provided about matching model predictions to observations.

The advantage of this approach is that it offers a lot of flexibility without the need for any additional programming. The end user can easily add values to, or remove them from, the part of the state that is updated by assimilation and can experiment with various sources of observations or noise models.
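The actual OpenDA wrapper code is written in Java against the OpenDA I/O interfaces; purely as an illustration of what the wrapper layer conceptually has to provide for a NetCDF restart file, a Python-style sketch is given below. The file path and variable name are hypothetical, and the class is not part of OpenDA.

```python
import numpy as np
from netCDF4 import Dataset  # assumes the netCDF4 package is available

class RestartFileWrapper:
    """Conceptual sketch of the wrapper role in a black box coupling: expose the
    variables stored in a model restart file so that the data assimilation method
    can read them into the state vector and write updated values back."""

    def __init__(self, path):
        self.path = path  # e.g. path to a NEMO NetCDF restart file (hypothetical)

    def read_variable(self, name):
        # read one variable, e.g. a sea surface height field, into an array
        with Dataset(self.path, "r") as nc:
            return np.array(nc.variables[name][:])

    def write_variable(self, name, values):
        # write the analysis values back so the next model run starts from them
        with Dataset(self.path, "a") as nc:
            nc.variables[name][:] = values
```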

4.3 Parallel computing

To reduce the computational time, the model runs in ensemble-based data assimilation methods can be executed in parallel. The setup of OpenDA allows automatic execution of the models in a natural way. Parallelization is handled using a so-called interface model. This is a model that extends the functionality of an arbitrary OpenDA model; in this case, it handles and hides the parallelization of ensemble members. Other examples of interface models in OpenDA are models for automatic bias correction and Kalman smoothing. These interface models create an interface between the actual model and the data assimilation method. The interface model supports both parallelization of models using threads and running the models on remote machines. It works for arbitrary models since it only

Fig. 3 A snapshot of SSH (m) for the SQB model configuration (April 23, 2009)

Fig. 4 SSH observations for the assimilation period of September 18, 2009

makes use of the same methods from the model interface as the algorithm does, as illustrated by the shape of the puzzle pieces in Fig. 2.

4.4 Localization in OpenDA

The current release of OpenDA (2.1) supports localization using the Schur product on the Kalman gain matrix. The user provides a function that takes the meta-information on the observations as input and produces the weights βxy. For numerical models with a fixed, regular computational grid, such an implementation is not difficult. However, OpenDA has been coupled to many simulation suites, e.g., for modeling water quality, rivers, and groundwater. The computational grids are often irregular, and the state vector contains various quantities depending on the user settings. Writing a generic function for determining the localization weights, based on some parameters, is an elaborate task for these models. Moreover, some of the end users are novice data assimilation users and require a more black box approach.

The method suggested by Zhang and Oliver (2011) is part of OpenDA and is conceptually very suitable. It does not need any configuration except for the bootstrap size NB and adds at least some necessary form of localization to all models in OpenDA. OpenDA still provides the option to use a custom localization function when needed.

Fig. 5 Root mean square errors (RMSE) in SSH (m) between the free model run and the truth, taken at every 2-day period for the year 2009


Fig. 6 Snapshots of sea surface height (m) at analysis steps 1, 50, and 90 for, top: forecast using EnKF, second row: analysis using EnKF, third row: truth, and bottom: free run without data assimilation

5 NEMO ocean model

The model setup used in this study is based on an idealized configuration of the NEMO primitive equation ocean model (Cosme et al. 2010). The domain is a square box (between 25° and 45° N) of a 5000-m-deep flat bottom ocean at mid latitudes, referred to as the SQB configuration. The model domain is closed with lateral boundaries. A double-gyre circulation is created with constant zonal wind forcing blowing westward in the northern and southern parts of the basin and eastward in the middle part of the basin. Figure 3 shows a snapshot of the model sea surface height (SSH). The flow is dominated by chaotic mesoscale dynamics, with large eddies that are about 100 km wide (Beckers et al. 2012), to which correspond velocities of about 1 m/s and dynamic height differences of about 1 m. These properties are quite similar in shape and magnitude to the Gulf Stream (North Atlantic).

The primitive equations of the NEMO ocean model are discretized on an Arakawa C grid, with a horizontal resolution of 1/4° × 1/4° cos(φ) and 11 z-coordinate levels along the vertical. The model uses a “free surface” formulation, with an additional term in the momentum equation to damp the fastest external gravity waves, and a bi-harmonic diffusion operator. The model time integration is based on a leap-frog scheme and an Asselin filter (Cosme et al. 2010).

5.1 Assimilation setup

For the present SQB configuration, a time step of 15 min is chosen. The model is spun up for a period of 40 years. The model calendar is idealized to 30 days per month. The model state vector is multivariate, containing 2 time instances of 17 variables. The two time instances are due to the leap-frog scheme. The user can select which variables of the state are updated by the data assimilation algorithm by selecting them in the OpenDA configuration file. In order to keep the leap-frog scheme intact, the user should always select both time instances of a variable to be updated.

The ensemble Kalman filter (EnKF) is used for all the experiments. The assimilation is performed during 1 year (a 360-day calendar) with an analysis frequency of 2 days. The model output of year 75 is taken as the truth. Observations of sea surface height (SSH) are generated from the truth by adding spatially uncorrelated Gaussian noise with a standard deviation of 6 cm. The observation grid is generated from the ENVISAT and Jason-1 flight parameters: a 35-day cycle with 1002 passes for ENVISAT and a 10-day cycle with 254 passes for Jason-1. Figure 4 shows the SSH observations for one assimilation period. In our experiments, we update the sea surface height variables of the state vector.

Fig. 7 RMSE in SSH (m) between EnKF and truth at analysis steps using ensembles of sizes 15 and 100


Fig. 8 Rank histograms of the ensemble forecasts at steps 20, 70, and 140 (from left to right) using EnKF. The top row corresponds to an ensemble size of 15 and the bottom row to an ensemble size of 100. The U shape can be an indication that the ensembles suffer from conditional biases or lack of variability

The pass numbers at the beginning of the year 2009 are taken for both satellites. Since the observation grid is different from the model grid, bi-linear interpolation is used to map SSH values from the model grid to the observation grid when generating the SSH observations (Yan et al. 2014). Figure 5 presents the root mean square errors (RMSE) in SSH (m) between the truth and the free model run at intervals of 2 days (analysis steps). The errors are in the range between 0.15 and 0.30 m.
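The generation of the synthetic observations can be summarized in a few lines; the sketch below assumes the true SSH field and the grid and track coordinates are available as arrays, and all names are illustrative.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def synthetic_ssh_obs(ssh_true, lon, lat, track_lon, track_lat,
                      sigma_obs=0.06, rng=None):
    """Generate twin-experiment SSH observations: bilinear interpolation of the
    true field onto the satellite track, plus spatially uncorrelated Gaussian
    noise with a 6 cm standard deviation.

    ssh_true : (len(lat), len(lon)) true SSH field on the model grid
    """
    rng = rng or np.random.default_rng()
    interp = RegularGridInterpolator((lat, lon), ssh_true, method="linear")
    ssh_on_track = interp(np.column_stack([track_lat, track_lon]))
    return ssh_on_track + rng.normal(0.0, sigma_obs, size=ssh_on_track.shape)
```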

The initial ensemble of 100 members is generated from a 10-year interval (years 41–50) of a free model simulation with outputs every 30 days; the last 100 outputs are used to initialize the ensemble. The 10-year interval is selected far from the true state year, for a better evaluation of the assimilation performance. From these 100 ensemble members, four sets of different sizes are picked (15, 30, 50, and 100). This is helpful in evaluating the results with respect to ensemble size and localization.

The computational time is dominated by the NEMO model runs. Depending on the number of ensemble members, only 1–3 % of the time is consumed by the classical EnKF implementation in OpenDA when 8 NEMO models are run in parallel.

6 Results and discussion

As a first step, a set of four assimilation experiments is performed with different ensemble sizes (15, 30, 50, and 100).

Fig. 9 Snapshots of sea surface height errors from the truth at analysis steps 20, 70, and 140 for, top: using EnKF, middle: EnKF with localization, and bottom: free run without data assimilation


Fig. 10 Snapshots of errors in SSH from the truth at analysis steps 20, 70, and 140 using the localized EnKF with, top: ensemble size 15, middle: ensemble size 30, and bottom: ensemble size 100

Figure 6 presents the snapshots of SSH for an assimilation experiment with an ensemble of 100 members. Comparison of the forecast and analysis with the truth and the free run clearly shows that the assimilation experiments have improved the model results significantly and that the analysis at each step brings the model closer to the truth. Keeping in mind that we are only assimilating and updating SSH, the forecast results look reasonable as well.

To see the filter performance with fewer ensemble members, the RMSE in SSH over the whole model domain (at each analysis step) is compared for different ensemble sizes. We have plotted these results in Fig. 7 for 15 and 100 ensemble members. Decreasing the number of ensemble members has a negative effect on the performance of the EnKF. The results demonstrate that the RMSE is lower for the 100-member case, especially in the later part of the assimilation. The results also demonstrate that a small ensemble size still gives reasonable assimilation results. One should also note that the data assimilation process has improved the model results significantly: the range of errors with no data assimilation is between 0.15 and 0.30 m, which decreases to between 0.05 and 0.15 m with data assimilation (Fig. 7).

So far, we have examined the EnKF performance based on the RMSE, which is the most commonly used measure to evaluate filter performance. A rank histogram is another common diagnostic for measuring ensemble reliability, obtained by repeatedly tallying the rank of the verifying observations within the ensemble at each assimilation step (Hamill 2000). A reliable ensemble will show a flat rank histogram. Figure 8 presents the rank histograms of the ensemble forecasts at assimilation steps 20, 70, and 140 for ensemble sizes of 15 and 100. U-shaped histograms are observed for both ensemble sizes and can be an indication that the ensembles suffer from conditional biases or lack of variability (Hamill 2000).
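For reference, the rank statistic tallied in Fig. 8 can be computed as in the following sketch, where each verifying observation is ranked within the ensemble of predicted values at its location (ties are ignored; names are illustrative).

```python
import numpy as np

def rank_histogram(ens_pred, obs):
    """Rank of each verifying observation within the ensemble (Hamill 2000).

    ens_pred : (n, p) ensemble of model-predicted observation equivalents
    obs      : (p,)   verifying observations
    Returns counts over the n+1 possible ranks, accumulated over all p observations.
    """
    n, p = ens_pred.shape
    # rank = number of ensemble members below the observation (0 .. n)
    ranks = (ens_pred < obs[None, :]).sum(axis=0)
    return np.bincount(ranks, minlength=n + 1)
```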

The EnKF showed reasonable improvements in the assimilation results and demonstrated that the black box coupling between the NEMO model and OpenDA works quite well. The next step is to apply and test the automatic localization strategy explained in Section 3 within the NEMO-OpenDA framework using the same experimental setup. We have selected the bootstrap size NB = 50, which is approximately the average ensemble size in our experiments.

Figure 9 shows snapshots of the SSH errors taken at the 20th, 70th, and 140th analysis steps with an ensemble of 100 members. These error plots are shown for the EnKF, the EnKF with localization, and the free run without data assimilation. It is evident from these plots that the localization has improved the EnKF results. Figure 10 presents the differences between the truth and the localized EnKF using different ensemble sizes (15, 30, and 100). These error snapshots are taken at the 20th, 70th, and 140th analysis steps. Increasing the ensemble size greatly influences the simulation results, as evident from Fig. 10. As an example, consider the first column of Fig. 10 (analysis step 20): it can be seen that the scale of these differences is notably reduced and large differences in the upper domain have almost vanished as we increase the ensemble size from 15 to 100. This is true for all three analysis steps shown here.

Fig. 11 RMSE in SSH (m) between EnKF and truth at analysis steps with and without automatic localization, using an ensemble of 100 members


Fig. 12 Illustration of the impact of automatic localization for a single row of the K matrix for SSH for various ensemble sizes. The top row is created for an ensemble size of 15 and the lower row with an ensemble size of 100. The K matrices are generated for the same observation on analysis step 60. The first column shows the localized K matrix, the second column the K matrix without localization, and the last column shows the automatically computed correction factors ρ. The correction factors are not distance based; rather, they are a measure of the uncertainty in the computed elements of K. This is different from factors determined using classical localization techniques. However, the method is well capable of removing structures in the K matrix which are expected to be spurious, sharpening the important structures in the K matrix

To see the overall improvements at different analysis steps, Fig. 11 presents the RMSE of SSH over the whole model domain with and without localization for the 100-member case. The RMSE clearly demonstrates that the filter performance has improved with the introduction of localization. From the experimental results, we see that the automatic localization has a positive impact on the performance of the filter. To get some idea of what the automatic localization does, we have made plots of the localized and non-localized gain matrix and the correction factor ρ for the SSH correction. Figure 12 shows these plots for an arbitrary observation at analysis step 60, created with ensemble sizes 15 and 100. Note that we cannot make a direct comparison between the gains of the two ensemble sizes since they are created with different ensembles.

We can clearly see that the localization enhances the structure in the gain matrix by removing some of the small structures which are likely created by spurious correlations. At the same time, we see that the non-localized gain matrix created with 15 ensemble members shows more spurious correlations than the gain created with 100 members. The correction factors are very different from what we would expect from classical localization techniques. The corrections are not location based but are based on the uncertainty of the gain elements. As a result, the correction factors can also be large at locations where the gain value is small with high probability.

Fig. 13 RMSE in SSH (m) between the truth and the free model run (red line), RMSE sawtooth patterns with the localized EnKF using 100 ensemble members (blue line), and forecast with the localized EnKF using 100 ensemble members (red line)


Finally, the RMSE is computed for both the forecast (before assimilation) and the analysis (after assimilation). Both forecast and analysis RMSE are plotted to produce sawtooth diagrams. Figure 13 presents the RMSE in SSH (m) between the truth and the free model run, together with the RMSE sawtooth patterns of the forecast-analysis with the localized EnKF as we progress in time, using 100 ensemble members. It is evident that not only the analysis but also the forecast results are improved by a significant margin. One important point to notice here is that we are only assimilating SSH observations, while the model consists of 17 variables including temperature, salinity, etc. For a realistic case, we expect different sources of observations to be part of the assimilation process, which will further improve our analysis and forecast results.
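The sawtooth curves of Fig. 13 are simply the forecast and analysis RMSE interleaved in time; a short sketch of how such a diagram can be assembled from hypothetical RMSE arrays is:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_sawtooth(times, rmse_forecast, rmse_analysis):
    """Interleave forecast and analysis RMSE at each analysis time to obtain the
    characteristic sawtooth: the error grows during the forecast and drops at analysis."""
    t = np.repeat(times, 2)
    r = np.column_stack([rmse_forecast, rmse_analysis]).ravel()
    plt.plot(t, r, "b-")
    plt.xlabel("Time [Days]")
    plt.ylabel("RMSE (m)")
    plt.show()
```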

7 Conclusions and future work

We have created an OpenDA-NEMO ocean data assimilation framework using the OpenDA black box coupling. With limited programming effort, we have managed to read and write the NEMO NetCDF restart file and configuration file. With the present framework, a set of twin experiments was performed for the NEMO model in which we assimilate satellite SSH values using the EnKF. We have used the automatic parallelization functionality of OpenDA to run the ensemble members in parallel.

From the results of these experiments, we see that a run without data assimilation has an RMSE in SSH that varies around 23 cm, which is quite large since the SSH varies between −1 and 1 m. The EnKF, assimilating SSH, improves the model predictions over the whole domain. Increasing the ensemble size does improve the performance of the EnKF, but only to some extent. The experiments were repeated using the automatic localization technique, which is now a part of OpenDA. We used the automatic localization technique because the distance-based localization technique was not yet available for the black box coupling; it will, however, be present in a future version of OpenDA. The automatic localization method does not determine weights based on distances but on the uncertainty of elements in the gain matrix. The introduction of automatic localization significantly improved the performance of the EnKF for all ensemble sizes. A visual inspection shows that this method yields very different localization weights from those we are used to seeing in distance-based localization. The method is, however, capable of keeping the important structures in the gain matrix intact while reducing artifacts that are likely caused by spurious correlations. The performance of the method justifies more research. In future work, we should compare the automatic localization technique to distance-based localization and find out which method performs best for various types of observations and situations.

The automatic localization technique is very easy to use but increases the computational time of the analysis significantly. Using a bootstrap size of 50, the computational time of the analysis increased by a factor of 10 in our experiments. This makes the method feasible only for a moderate number of observations. We have to look for ways to reduce the computational time of the bootstrapping technique. Possible solutions are to perform the bootstrapping in parallel and to combine the automatic localization method with a local analysis strategy such that the automatic localization only needs to be computed for a limited computational domain.

Acknowledgments This work is supported by SANGOMA, a European FP7-SPACE-2011 project, Grant 283580.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References

Altaf MU, Heemink AW, Verlaan M (2009) Inverse shallow-water flow modelling using model reduction. Int J Multiscale Comput Eng 7:577–596
Altaf MU, Verlaan M, Heemink AW (2011) Efficient identification of uncertain parameters in a large scale tidal model of European continental shelf by proper orthogonal decomposition. Int J Numer Meth Fluids. doi:10.1002/fld.2511
Anderson JL (2001) An ensemble adjustment Kalman filter for data assimilation. Mon Wea Rev 129:2884–2903
Anderson JL (2004) A hierarchical ensemble filter for data assimilation. http://www.image.ucar.edu/staff/jla/filter.pdf
Anderson JL, Anderson SL (1999) A Monte Carlo implementation of the nonlinear filtering problem to produce ensemble assimilations and forecasts. Mon Wea Rev 127:2741–2758
Barth A, Alvera-Azcarate A, Weisberg RH (2008) Assimilation of high-frequency radar currents in a nested model of the West Florida shelf. J Geophys Res 113:C08033
Beckers JM, Barth A, Heemink A, Velzen N, Verlaan M, Altaf MU, Van Leeuwen PJ, Nerger L, Brasseur P, Brankart JM, de Mey P, Bertino L (2012) Deliverable 4.1: Benchmark definitions. http://www.data-assimilation.net/Documents/sangomaDL4.1.pdf
Bennet A (1992) Inverse methods in physical oceanography. Cambridge Univ. Press
Bishop CH, Etherton BJ, Majumdar SJ (2001) Adaptive sampling with the ensemble transform Kalman filter. Part I: theoretical aspects. Mon Wea Rev 129:420–436
Bouttier P-A, Blayo E, Brankart JM, Brasseur P, Cosme E, Verron J, Vidard A (2012) Towards variational assimilation of altimetric data in a high resolution ocean model. Mercator Ocean Newsletter, 2012
Brasseur P, Verron J (2006) The SEEK filter method for data assimilation in oceanography: a synthesis. Ocean Dyn 56(5-6):650–661
Burgers G, van Leeuwen PJ, Evensen G (1998) On the analysis scheme in the ensemble Kalman filter. Mon Wea Rev 126:1719–1724
Cosme E, Brankart J, Verron J, Brasseur P, Krysta A (2010) Implementation of a reduced rank square root smoother for high resolution ocean data assimilation. Ocean Modell 33:87–100
Evensen G (1994) Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. J Geophys Res 99(C5):10143–10162
Evensen G (2003) The ensemble Kalman filter: theoretical formulation and practical implementation. Ocean Dyn 53:343–367
Gaspari G, Cohn SE (1999) Construction of correlation functions in two and three dimensions. Q J R Meteorol Soc 125:723–757
Gijsbers P, Hummel S, Vanecek S, Groos J, Harper A, Knapen R, Gregersen J, Schade P, Antonello A, Donchyts G (2010) From OpenMI 1.4 to 2.0. In: Swayne DA, Yang W, Voinov AA, Rizzoli A, Filatova T (eds) International Environmental Modelling and Software Society (iEMSs) 2010 International Congress on Environmental Modelling and Software: Modelling for Environment's Sake, Fifth Biennial Meeting, Ottawa, Canada. http://www.iemss.org/iemss2010/index.php?n=Main.Proceedings
Gregersen J, Gijsbers P, Westen S (2007) OpenMI: open modelling interface. J Hydroinformatics 9(3):175–191
Hamill TM (2000) Interpretation of rank histograms for verifying ensemble forecasts. Mon Wea Rev 129:550–560
Hamill TM, Whitaker JS, Snyder C (2001) Distance-dependent filtering of background error covariance estimates in an ensemble Kalman filter. Mon Wea Rev 129:2776–2790
Heemink AW, Hanea RG, Sumihar J, Roest M, Velzen N, Verlaan M (2010) Data assimilation algorithms for numerical models. In: Koren B, Vuik K (eds) Advanced computational methods in science and engineering. Lecture notes in computational science and engineering. Springer, pp 107–142. ISBN 978-3-642-03343-8
Heemink A, van Velzen N, Verlaan M, Altaf MU, Beckers JM, Barth A, Van Leeuwen PJ, Nerger L, Brasseur P, Brankart JM, de Mey P, Bertino L (2012) Deliverable 1.1: first identification of common tools; SANGOMA: stochastic assimilation for the next generation ocean model applications. http://www.data-assimilation.net/Documents/sangomaDL1.1.pdf
Hoteit I, Pham DT, Blum J (2002) A simplified reduced order Kalman filtering and application to altimetric data assimilation in Tropical Pacific. J Mar Syst 36:101–127
Houtekamer PL, Mitchell HL (1998) Data assimilation using an ensemble Kalman filter technique. Mon Wea Rev 126:796–811
Kalman R (1960) A new approach to linear filtering and prediction problems. Trans ASME, Ser D, J Basic Eng 82:35–45
Le Henaff M, De Mey P, Marsaleix P (2009) Assessment of observational networks with the representer matrix spectra method—application to a 3-D coastal model of the Bay of Biscay. Ocean Dyn 59:3–20
Madec G (2014) NEMO ocean engine. Note du Pôle de modélisation, Institut Pierre-Simon Laplace (IPSL), France, No 27, ISSN 1288–1619
Nerger L, Hiller W (2013) Software for ensemble-based data assimilation systems—implementation strategies and scalability. Comput Geosci 55:110–118
OpenDA (2013) The OpenDA data-assimilation toolbox. http://www.openda.org
Raeder K, Anderson JL, Collins N, Hoar TJ, Kay JE, Lauritzen PH, Pincus R (2012) DART/CAM: an ensemble data assimilation for CESM atmospheric models. J Clim 25:6304–6317
Ridler ME, van Velzen N, Sandholt I, Falk AK, Heemink A, Madsen H (2014) Data assimilation framework: linking an open data assimilation library (OpenDA) to a widely adopted model interface (OpenMI). Environ Model Softw 57:76–89
Sestoft P (2010) Numeric performance in C, C# and Java. http://www.itu.dk/sestoft/papers/numericperformance.pdf
Tippett MK, Anderson JL, Bishop CH, Hamill TM, Whitaker JS (2003) Ensemble square root filters. Mon Wea Rev 131:1485–1490
van Velzen N, Segers AJ (2010) A problem-solving environment for data assimilation in air quality modelling. Environ Model Softw 25(3):277–288. doi:10.1016/j.envsoft.2009.08.008
van Velzen N, Verlaan M (2007) COSTA: a problem solving environment for data assimilation applied for hydrodynamical modelling. Meteorol Z 16(6):777–793
van Velzen N, Verlaan M, Bos E (2012) The OpenDA Association annual report. http://www.openda.org/association/Annual report 2012 v6.pdf
Verlaan M, Heemink AW (1997) Tidal flow forecasting using reduced rank square root filters. Stoch Hydrol Hydraul 11:349–368
Weerts AH, El Serafy GY, Hummel S, Dhondia J, Gerritsen H (2010) Application of generic data assimilation tools (DATools) for flood forecasting purposes. Comput Geosci 36(4):453–463
Whitaker JS, Hamill TM (2002) Ensemble data assimilation without perturbed observations. Mon Wea Rev 130:1913–1924
Yan Y, Barth A, Beckers JM (2014) Comparison of different assimilation schemes in a sequential Kalman filter assimilation system. Ocean Modell 73:123–137
Zhang Y, Oliver DS (2010) Improving the ensemble estimate of the Kalman gain by bootstrap sampling. Math Geosci 42(3):327–345
Zhang Y, Oliver DS (2011) Evaluation and error analysis: Kalman gain regularisation versus covariance regularisation. Comput Geosci 15:489–508