Learning Stochastic Closures Using Ensemble Kalman Inversion

TAPIO SCHNEIDER†, ANDREW M. STUART‡, JIN-LONG WU§
California Institute of Technology, Pasadena, CA 91125, USA

†Email: [email protected]   ‡Email: [email protected]   §Email: [email protected]

[Received on 17 April 2020]

Although the governing equations of many systems, when derived from first principles, may be viewed as known, it is often too expensive to numerically simulate all the interactions within the first principles description. Therefore researchers often seek simpler descriptions that describe complex phenomena without numerically resolving all the interacting components. Stochastic differential equations (SDEs) arise naturally as models in this context. The growth in data acquisition, both through experiment and through limited first principles simulations, provides an opportunity for the systematic derivation of SDE models in many disciplines. However, inconsistencies between SDEs and real data at small time scales often cause problems when standard statistical methodology is applied to parameter estimation. The incompatibility between SDEs and real data can be addressed by deriving sufficient statistics from the time-series data and learning parameters of SDEs based on these. Following this approach, we formulate the fitting of SDEs to sufficient statistics from real data as an inverse problem and demonstrate that this inverse problem can be solved by using ensemble Kalman inversion (EKI). Furthermore, we create a framework for non-parametric learning of drift and diffusion terms by introducing hierarchical, refineable parameterizations of unknown functions, using Gaussian process regression. We demonstrate the proposed methodology for the fitting of SDE models, first in a simulation study with a noisy Lorenz '63 model, and then in other applications, including dimension reduction starting from various deterministic chaotic systems arising in the atmospheric sciences, large-scale pattern modeling in climate dynamics, and simplified models for key observables arising in molecular dynamics. The results confirm that the proposed methodology provides a robust and systematic approach to fitting SDE models to real data.

Keywords: Stochastic differential equation; inverse problem; ensemble Kalman inversion; Gaussian process regression; hierarchical parameterization

1. Introduction

1.1 Overview and Literature Review

The goal of this paper is to describe a straightforward ensemble-based methodology that facilitates parameter estimation in ergodic stochastic differential equations (SDEs), using statistics derived from time-series data. SDEs arise naturally as models in many disciplines, and the wish to describe complex phenomena without explicitly representing all of the interacting components within the system makes them of widespread interest. Additionally, the increasing acquisition of data provides opportunities for the learning of stochastic models in previously unforeseen application domains. However, SDE models, while often accurate at providing statistical predictions, may not be compatible with available data at the small scales where the Itô calculus model asserts very strong almost sure properties rarely found in real data. For this reason, standard statistical methodology for parameter estimation in SDEs is often not suited to fitting models to real data. On the other hand, practical experience in the applied sciences demonstrates the effectiveness of fitting Markovian stochastic processes to sufficient statistics, typically derived from a long time series by averaging procedures. Ensemble methods have demonstrable success in the solution of inverse problems, based on the use of interacting particle systems driven by the parameter-to-observable map; furthermore, empirically they are robust to noisy evaluations of the map. Choosing ergodic averages as observables, and viewing finite-time averages, or averages over different initializations, as noisy evaluations of the ergodic average, puts us in a setting where we may apply ensemble methods to effectively estimate parameters in SDEs. These ensemble methods are derivative-free, side-stepping thorny technical and computational issues arising in the parameter-to-solution map for SDEs; they are also inherently parallelizable and scale well to the learning of high-dimensional parameters, making them a very attractive computational tool.

Data-driven methods for extracting simplified models are now ubiquitous (Brunton & Kutz, 2019; Coifman et al., 2008; Ferguson et al., 2011; Froyland et al., 2014; Klus et al., 2018; Giannakis, 2019). A guiding example that motivates the need for reduced models of various kinds is the Navier–Stokes equation for fluid motion. It is often too expensive to numerically simulate all the interactions within the true systems (for example, turbulence around an aircraft or cloud convection in the Earth's atmosphere), and reduced, often stochastic, descriptions become important (see, e.g., Majda & Harlim, 2012). Stochastic models are widespread in applications, including in biology (see, e.g., Goel & Richter-Dyn, 2016; Wilkinson, 2018), chemistry (see, e.g., Leimkuhler & Reich, 2004; Tuckerman, 2010; Boninsegna et al., 2018), engineering (see, e.g., Maybeck, 1982), the geophysical sciences (see, e.g., Majda & Kramer, 1999; Palmer, 2001; Arnold et al., 2013) and the social sciences (see, e.g., Diekmann & Mitter, 2014); the text of Gardiner (2009) provides a methodological overview aimed at applications in the physical and social sciences.

The statistics literature concerning parameter estimation from continuous time series often starts from the premise that the diffusion coefficient may be read off from small increments (see, e.g., Kutoyants, 2013), a property that only holds if the data are consistent with an SDE at arbitrarily fine scales. The fact that data may be inconsistent with the SDE at small scales was understood in the finance literature in the early part of this century, and new models were introduced to address this incompatibility (see, e.g., Zhang et al., 2005). In the papers Pavliotis & Stuart (2007) and Papavasiliou et al. (2009), the setting of multiscale SDEs was used to elucidate similar phenomena arising from models in the physical sciences. Subsequent to these works, new methods were introduced to tackle the inconsistency at small scales, based around subsampling and other multiscale uses of the data (see, e.g., Zhang et al., 2006; Papaspiliopoulos et al., 2012; Pokern et al., 2009); see Pavliotis et al. (2012) for an overview of this work. The potential applicability of problems of this type goes beyond the applications in econometrics, finance and molecular dynamics, highlighted in the preceding references, and into problems in the geophysical sciences (Cotter et al., 2009; Kwasniok & Lohmann, 2009; Ying et al., 2019). A completely different approach to circumvent the use of fine-scale properties of the time series is to compute sufficient statistics from the time series, and use these to learn parameters. This approach was pioneered as a systematic methodology, and then applied, in the series of papers Krumscheid et al. (2013, 2015) and Kalliadasis et al. (2015), based on statistics computed from multiple realizations and multiple initial conditions; alternatively one can use a single very long time series. The use of statistics of time series (e.g., moments or autocorrelation functions) for estimating model parameters is common for discrete time series models, such as autoregressive (AR) or autoregressive moving average (ARMA) models (see, e.g., Lutkepohl, 2013; Brockwell et al., 1991). Another approach to parameter estimation in stochastic dynamical systems, which avoids the incompatibility at small scales, is to augment the state to include parameters, and then use state estimation techniques such as the particle filter (see, e.g., Smith, 2013) or the unscented Kalman filter (see, e.g., Julier et al., 2000; Albers et al., 2017); however, this approach is notoriously sensitive in many practical settings (see, e.g., Doucet et al., 2001) and we do not pursue it here. In this paper we build on the approach pioneered in Krumscheid et al. (2013) and combine it with the ensemble Kalman approach to inverse problems, a methodology introduced in the oil industry (see, e.g., Chen & Oliver, 2002; Emerick & Reynolds, 2013) and formulated as a general methodology for inverse problems in Iglesias et al. (2013) and Albers et al. (2019); we refer to the methodology as ensemble Kalman inversion (EKI). Within this parameter estimation methodology, we employ ideas from Gaussian process regression (GPR) (see, e.g., Rasmussen & Williams, 2006) to parameterize unknown functions; this refineable, hierarchical approach to function representation leads to novel nonparametric methods for function learning. Our approach builds on preceding, nonhierarchical approaches to inversion using GPR such as that described in Xiao et al. (2016). The concept of learning the values at some fixed nodes of a Gaussian process, known as the pilot point method (see, e.g., Doherty et al., 2010), has also been extensively explored by the groundwater modeling community in the context of inverse problems.

1.2 Our Contributions

Our contributions in this paper are as follows:

• we formulate parameter estimation in ergodic SDEs as a classical inverse problem for the parameter-to-data map, which is available to us through noisy evaluations, and we apply the EKI to this formulation;

• within the EKI we demonstrate the utility of hierarchical parameterizations of unknown functions, using GPR;

• we demonstrate the methodology on a simulation study that employs a noisy (SDE) version of the Lorenz '63 model;

• we study reduction of the Lorenz '63 ODE model to a two-dimensional SDE in coordinates computed by applying PCA to the ODE data;

• we study the multiscale Lorenz '96 ODE model, seeking an SDE closure in the slow variables alone;

• we apply the methodology to fit a stochastic delay differential equation (SDDE) to El Niño–Southern Oscillation data;

• we apply the methodology to fit an SDE that describes fluctuations in the dihedral angle of a butane molecule.

In considering these simulation studies and real-data examples, we demonstrate the effectiveness of the methodology to find parameters, and we evaluate the accuracy, or lack of fidelity, of various stochastic models derived from data. In section 2, we formulate the inverse problem of interest, and introduce four example problems to which we will apply our methodology. Section 3 describes the ensemble Kalman based methodology we employ to solve the inverse problem, as well as a discussion of the novel hierarchical Gaussian process based representation that we employ to represent, and learn, unknown functions. In section 4, we describe numerical results relating to each of the four example problems. We conclude in section 5.


2. Problem Formulation

The aim of this work is to estimate parameters θ ∈ Θ in the SDE

\[
\frac{dx}{dt} = f(x;\theta) + \sqrt{\Sigma(x;\theta)}\,\frac{dW}{dt}, \tag{2.1}
\]

where x ∈ R^n, f : R^n × Θ → R^n and Σ : R^n × Θ → R^{n×n}. For all of the examples Θ ⊆ R^p, but the algorithms we use extend to include the nonparametric setting in which p = ∞. We showcase a GPR-based method for hierarchical function representation which is refineable. We assume that the SDE is ergodic and let E denote expectation with respect to the stationary process resulting from this ergodicity. If x(·;θ) ∈ X := C(R₊; R^n) denotes a solution of the SDE started in a statistically stationary state and F : X → R^J is a function on the space of solution trajectories, then define 𝒢 : Θ → R^J by

\[
\mathcal{G}(\theta) = \mathbb{E}\,F(x(\cdot;\theta)).
\]

We wish to solve the inverse problem of finding θ from noisy evaluations of 𝒢(θ). The noisy evaluations arise from the fact that we will use finite-length trajectories to approximate the expectation defining 𝒢(θ); averages over initial conditions and/or realizations of the noise could also be used. The following remark highlights the various observables F that we will use in this paper.

REMARK 2.1 We will use mth moments of the vector x at time t = 0:

\[
F_m(x(\cdot)) = \prod_{j \in M} x_j(0),
\]

where M is a subset of cardinality m comprising indices (repetition allowed) from {1, ..., n}, leading to the ergodic average 𝒢_m. We will also use F_ac(x(·)) = x(t) ⊗ x(0), leading through ergodic averaging to the auto-correlation function 𝒢_ac of the stationary process. And finally we will use 𝒢_psd to denote parameters of a polynomial fit to the logarithm of the power spectral density; recall that the power spectral density is the Fourier transform of the auto-correlation function of the stationary process. All of the functions 𝒢_m, 𝒢_ac and 𝒢_psd can be approximated by time-averaging.
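To make the remark concrete, the sketch below (our illustration, not code from the paper) shows one way to approximate 𝒢_m, 𝒢_ac and 𝒢_psd by time-averaging a sampled trajectory. The crude periodogram estimate of the power spectral density, the lag grid and all variable names are assumptions made for illustration only.

```python
import numpy as np

def moment_observables(x):
    """Approximate first- and second-moment ergodic averages G_m by
    time-averaging a trajectory x of shape (T, n)."""
    first = x.mean(axis=0)                              # E[x_i]
    second = np.einsum('ti,tj->ij', x, x) / x.shape[0]  # E[x_i x_j]
    iu = np.triu_indices(x.shape[1])                    # distinct index pairs
    return np.concatenate([first, second[iu]])

def autocorrelation_observable(x, lags):
    """Approximate G_ac at the given integer lags for a scalar
    stationary time series x of shape (T,)."""
    x = x - x.mean()
    return np.array([np.mean(x[lag:] * x[:len(x) - lag]) for lag in lags])

def log_psd_poly_observable(x, dt, degree=2):
    """Fit a polynomial of the given degree to the logarithm of a
    periodogram estimate of the power spectral density; the fitted
    coefficients play the role of G_psd."""
    x = x - x.mean()
    freqs = np.fft.rfftfreq(len(x), d=dt)
    psd = np.abs(np.fft.rfft(x)) ** 2 * dt / len(x)
    return np.polyfit(freqs[1:], np.log(psd[1:]), degree)  # drop zero frequency
```

Longer trajectories reduce the sampling noise in these quantities, which is precisely the noise that the inverse-problem formulation below treats as Gaussian.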

We now describe four examples that will be used to illustrate the methodology.

EXAMPLE 2.1 (Lorenz 63 System) The Lorenz equations (see, e.g., Lorenz, 1963) are a system of three ordinary differential equations taking the form

\[
\dot{x} = f(x), \tag{2.2}
\]

where x = [x₁, x₂, x₃]ᵀ and f : R^3 → R^3 is given by

\[
\begin{aligned}
f_1(x) &= \alpha(x_2 - x_1),\\
f_2(x) &= x_1(\rho - x_3) - x_2,\\
f_3(x) &= x_1 x_2 - \beta x_3.
\end{aligned} \tag{2.3}
\]

We are interested in the noisy version of these equations, in the form

\[
\dot{x} = f(x) + \sqrt{\sigma}\,\dot{W}. \tag{2.4}
\]

This SDE will be used in a simulation study to illustrate our methodology.
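For instance, a long trajectory of (2.4) can be generated with a simple Euler–Maruyama discretization; the sketch below is our illustration, with the step size, trajectory length and initial condition chosen arbitrarily.

```python
import numpy as np

def lorenz_drift(x, alpha=10.0, rho=28.0, beta=8.0 / 3.0):
    """The right-hand side f(x) of (2.2)-(2.3)."""
    return np.array([
        alpha * (x[1] - x[0]),
        x[0] * (rho - x[2]) - x[1],
        x[0] * x[1] - beta * x[2],
    ])

def simulate_noisy_lorenz(sigma=10.0, dt=1e-3, n_steps=500_000, seed=0):
    """Euler-Maruyama for (2.4):
    x_{m+1} = x_m + f(x_m) dt + sqrt(sigma dt) xi_m, with xi_m ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    x = np.ones(3)
    traj = np.empty((n_steps, 3))
    for m in range(n_steps):
        x = x + lorenz_drift(x) * dt + np.sqrt(sigma * dt) * rng.standard_normal(3)
        traj[m] = x
    return traj  # discard an initial transient before time-averaging
```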


We will also use the Lorenz 63 model (2.2), (2.3) written in a new coordinate system computed by means of PCA, and introduced in Selten (1995) and Palmer (2001). This amounts to introducing new coordinates a = Ax, in which we obtain

\[
\dot{a} = g(a), \tag{2.5}
\]

where a = [a₁, a₂, a₃]ᵀ and g : R^3 → R^3 is given by

\[
\begin{aligned}
g_1(a) &= 2.3 a_1 - 6.2 a_3 - 0.49 a_1 a_2 - 0.57 a_2 a_3,\\
g_2(a) &= -62 - 2.7 a_2 + 0.49 a_1^2 - 0.49 a_3^2 + 0.14 a_1 a_3,\\
g_3(a) &= -0.63 a_1 - 13 a_3 + 0.43 a_1 a_2 + 0.49 a_2 a_3.
\end{aligned} \tag{2.6}
\]

The coordinate a₃ contains around 4% of the total variance of the system. In Palmer (2001), this was used as an argument to seek a stochastic dimension reduction in the model, in discrete time, in the variables a₁, a₂ alone. We reinterpret this idea in continuous time and use our methodology to evaluate the idea, using data from (2.5), (2.6) to study the fidelity possible when fitting an SDE of the form

\[
\begin{aligned}
\dot{a}_1 &= 2.3 a_1 - 0.49 a_1 a_2 + \psi_1(a_2) + \sqrt{\sigma_1(a_2)}\,\dot{W},\\
\dot{a}_2 &= -62 - 2.7 a_2 + 0.49 a_1^2 + \psi_2(a_1) + \sqrt{\sigma_2(a_1)}\,\dot{W},
\end{aligned} \tag{2.7}
\]

where the functions ψ₁(·), ψ₂(·), σ₁(·), σ₂(·) are represented by Gaussian process regression as described in detail in subsection 3.2.

EXAMPLE 2.2 (Lorenz 96 System) The Lorenz 96 multiscale system (see, e.g., Lorenz, 1996) describes the evolution of two sets of variables, denoted by x_k (slow variables) and y_{j,k} (fast variables):

\[
\begin{aligned}
\frac{dx_k}{dt} &= -x_{k-1}(x_{k-2} - x_{k+1}) - x_k + F - h c \bar{y}_k, \qquad k \in \{1, \dots, K\},\\
\frac{1}{c}\frac{dy_{j,k}}{dt} &= -b\, y_{j+1,k}(y_{j+2,k} - y_{j-1,k}) - y_{j,k} + \frac{h}{J} x_k, \qquad (j,k) \in \{1, \dots, J\} \times \{1, \dots, K\},\\
x_{k+K} &= x_k, \qquad y_{j,k+K} = y_{j,k}, \qquad y_{j+J,k} = y_{j,k+1}.
\end{aligned} \tag{2.8}
\]

The coupling term hcȳ_k describes the impact of the fast dynamics on the slow dynamics, with ȳ_k being the average

\[
\bar{y}_k = \frac{1}{J} \sum_{j=1}^{J} y_{j,k}. \tag{2.9}
\]

We work with the parameter choices as in Schneider et al. (2017). Specifically, we choose K = 36, J = 10, h = 1, and F = c = b = 10. By assuming a spatially homogeneous (with respect to j) equilibrium in the fast dynamics, fixing k and x_k as in Fatkullin & Vanden-Eijnden (2004), we obtain

\[
\bar{y}_k = \frac{h}{J} x_k. \tag{2.10}
\]

We will seek a stochastic closure for the slow variables {x_k} encapsulating systematic deviations from, and random fluctuations around, this simple balance:

\[
\begin{aligned}
\dot{X}_k &= -X_{k-1}(X_{k-2} - X_{k+1}) - X_k + F - \frac{h^2 c}{J} X_k + \psi(X_k) + \sqrt{\sigma}\,\dot{W},\\
X_{k+K} &= X_k.
\end{aligned} \tag{2.11}
\]

Using data from (2.8) we will fit a parameterized ψ(·) and the noise level σ .
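The target of this fit can be visualized directly from simulation data of (2.8): the part of the coupling that ψ(X_k) (plus noise) must capture is the residual between the true coupling term and the mean-field balance (2.10). A sketch of this diagnostic (ours; the array shapes and argument names are assumptions), which produces the kind of scatter data shown later in Fig. 15:

```python
import numpy as np

def closure_residual(x, y, h=1.0, c=10.0):
    """Given slow trajectories x of shape (T, K) and fast trajectories
    y of shape (T, J, K) sampled from (2.8), return the pointwise residual

        r = -h c ybar_k + (h^2 c / J) x_k,

    i.e. the deviation of the true coupling from the balance (2.10), which
    psi(X_k) (plus noise) must represent in the closure (2.11)."""
    J = y.shape[1]
    ybar = y.mean(axis=1)                       # (T, K): average over j
    return -h * c * ybar + (h ** 2 * c / J) * x
```

Scatter-plotting the residual against x then indicates how much of the closure is a deterministic function of X_k and how much must be ascribed to noise.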


EXAMPLE 2.3 (El Niño–Southern Oscillation (ENSO)) ENSO (see, e.g., Rayner et al., 2003) is a well-documented phenomenon which describes recurring pattern changes in central and eastern tropical Pacific Ocean temperatures. Figure 1 (time series and histogram) shows the true sea surface temperature (SST) data (the mean value has been removed) from year 1870 to 2019. It is postulated that the mechanism of ENSO can be illustrated by a delayed oscillator model with two time delays determined by properties of the SST in the central tropical Pacific Ocean. Specifically, the two time delays correspond to the eastward traveling Kelvin wave and the westward traveling Rossby wave (which also becomes an eastward traveling Kelvin wave after being reflected by the western coastline of the Pacific Ocean) (see, e.g., Tziperman et al., 1994). Thus the time delays can be viewed as known and estimated quantitatively from the corresponding wave speed.

FIG. 1: Illustration of ENSO data from year 1870 to 2019 (the time interval between two adjacent data points is one month). (a) Time series; (b) histogram of SST with probability density.

The aim is to fit a stochastic delay differential equation (SDDE) (see, e.g., Buckwar, 2000; Erneux, 2009) to the ENSO data shown in Figure 1 (time series and histogram). To be concrete, we will fit a model of the following form:

\[
\frac{dx}{dt}(t) = a \tanh(x(t - \tau_1)) - b \tanh(x(t - \tau_2)) - c\,x(t) + \sqrt{\sigma}\,\frac{dW}{dt}(t). \tag{2.12}
\]

Here delay τ₁ represents the effect of the eastward traveling Kelvin wave, and delay τ₂ represents the effect of the westward traveling Rossby wave. Since good estimates for τ₁ and τ₂ exist, we will simply fit the four parameters a, b, c, σ to data from the left-hand panel in Figure 1.
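Although the delays make (2.12) non-Markovian in x alone, simulating it only requires carrying a buffer of past states. A minimal Euler–Maruyama sketch (ours; the zero initial history and the step size are arbitrary choices):

```python
import numpy as np

def simulate_sdde(a, b, c, sigma, tau1, tau2, dt=0.01, n_steps=200_000, seed=0):
    """Euler-Maruyama for the SDDE (2.12), keeping enough history to
    evaluate the delayed terms x(t - tau1) and x(t - tau2)."""
    rng = np.random.default_rng(seed)
    d1, d2 = int(round(tau1 / dt)), int(round(tau2 / dt))
    lag = max(d1, d2)
    x = np.zeros(lag + n_steps)  # constant zero initial history
    for m in range(lag, lag + n_steps - 1):
        drift = a * np.tanh(x[m - d1]) - b * np.tanh(x[m - d2]) - c * x[m]
        x[m + 1] = x[m] + drift * dt + np.sqrt(sigma * dt) * rng.standard_normal()
    return x[lag:]
```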

EXAMPLE 2.4 (Butane Molecule Dihedral Angle) In many problems arising in molecular dynamics, it is of interest to determine reduced models describing the behaviour of certain functionals derived from the molecular conformation. An example of such a functional is the dihedral angle in a simple model of a butane molecule (see, e.g., Schlick, 2010). Fig. 2 shows how the dihedral angle is defined from the four atoms which comprise the butane molecule. Fig. 3 shows the time series and histogram for this angle, derived from a molecular dynamics model which we now describe. The model comprises a system of twelve second-order SDEs for the 4 atomic positions in R^3, of Langevin type:

\[
m_0 \frac{d^2 x}{dt^2} + \gamma_0 \frac{dx}{dt} + \nabla V(x) = \sqrt{\frac{2\gamma_0}{\beta_0}}\,\frac{dW}{dt}. \tag{2.13}
\]

The potential V, mass m₀, damping γ₀ and inverse temperature β₀ characterize the molecule. From the time series of x ∈ C(R₊; R^12) generated by this model we fit a simplified model for the dihedral angle. This simplified model takes the form

\[
\frac{d^2\phi}{dt^2} + \gamma(\phi)\frac{d\phi}{dt} + \nabla\Psi(\phi) = \sqrt{2\sigma\gamma(\phi)}\,\frac{dW}{dt}. \tag{2.14}
\]

We will fit parameterized versions of γ and Ψ to data for φ generated by studying the time series for the dihedral angle defined by x from (2.13). Related work, fitting SDE models for the dihedral angle, may be found in Papaspiliopoulos et al. (2012) and Pokern et al. (2009). For a broader introduction to the subject of finding Markov models for biomolecular dynamics see, e.g., Djurdjevac et al. (2010); Ferguson et al. (2011); Zhang et al. (2017); Schutte & Sarich (2013).
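Rewriting (2.14) as a first-order system in (φ, dφ/dt) makes it straightforward to simulate. The sketch below (our illustration, with γ and Ψ′ supplied as callables and all numerical choices arbitrary) uses a simple Euler-type discretization:

```python
import numpy as np

def simulate_dihedral_sde(gamma, dpsi, sigma, dt=1e-3, n_steps=500_000, seed=0):
    """Simulate (2.14) written as the first-order system
        dphi = v dt,
        dv   = (-gamma(phi) v - dpsi(phi)) dt + sqrt(2 sigma gamma(phi) dt) xi,
    with xi ~ N(0, 1); returns the phi trajectory."""
    rng = np.random.default_rng(seed)
    phi, v = 0.0, 0.0
    out = np.empty(n_steps)
    for m in range(n_steps):
        g = gamma(phi)
        noise = np.sqrt(2.0 * sigma * g * dt) * rng.standard_normal()
        v += (-g * v - dpsi(phi)) * dt + noise
        phi += v * dt
        out[m] = phi
    return out
```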

FIG. 2: The definition of the butane molecule dihedral angle.

FIG. 3: Illustration of the true butane dihedral angle data (the time interval between two adjacent data points is 10⁻⁶ ns). (a) Time series; (b) histogram of the dihedral angle (rad) with probability density.


3. Algorithms

In subsection 3.1, we describe the ensemble-based derivative-free optimization method that we use to fit parameters: the EKI algorithm. In subsection 3.2, we discuss how we use GPR to design hierarchical, refineable parameterizations of unknown functions that we wish to learn from data.

3.1 Ensemble Kalman Inversion (EKI)

Recall that the data we are given, y ∈ R^J, is a noisy evaluation of the function 𝒢(θ). We use the notation G(θ) to denote this noisy evaluation; recall that we typically envision the noisy evaluation coming from time-averaging. Appealing to central limit theorem type results that quantify rates of convergence towards ergodic averages, we assume that 𝒢(θ) − G(θ) is Gaussian. Then the inverse problem can be formulated as follows: given y ∈ R^J, find θ ∈ Θ so that

\[
y = G(\theta) + \eta, \qquad \eta \sim N(0, \Gamma). \tag{3.1}
\]

Estimating Γ may be achieved by using time series of different lengths, as explained in Cleary et al. (2020) in the specific case of G derived from moments, F_m.
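One way to realize such an estimate of Γ (a sketch under our own assumptions, consistent with the central limit theorem viewpoint above) is to evaluate the time-averaged observable on disjoint batches of a long trajectory and take the empirical covariance of the batch values:

```python
import numpy as np

def estimate_gamma(traj, observable, n_batches=20):
    """Estimate the noise covariance Gamma in (3.1). `observable` maps a
    trajectory segment of shape (T_b, n) to a vector in R^J. The covariance
    of the batch evaluations estimates Gamma for averages over one batch
    length; for an average over the full trajectory, divide by n_batches
    (the CLT scaling of the variance of a time average)."""
    batches = np.array_split(traj, n_batches, axis=0)
    g_vals = np.stack([observable(b) for b in batches])  # (n_batches, J)
    return np.cov(g_vals, rowvar=False)
```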

The ensemble Kalman inversion (EKI) algorithm we employ is as described in Iglesias et al. (2013) and Albers et al. (2019). The algorithm propagates a set of J parameter estimates {θ_n^{(j)}}_{j=1}^J through algorithmic time n, using the update formula

\[
\theta^{(j)}_{n+1} = \theta^{(j)}_{n} + C^{\theta G}_{n}\left(C^{GG}_{n} + \Gamma\right)^{-1}\left(y_n - G(\theta^{(j)}_{n})\right). \tag{3.2}
\]

The matrix C_n^{GG} is the empirical covariance of {G(θ_n^{(j)})}_{j=1}^J, while the matrix C_n^{θG} is the empirical cross-covariance of {θ_n^{(j)}}_{j=1}^J with {G(θ_n^{(j)})}_{j=1}^J. The EKI algorithm as stated preserves the linear span of the initial ensemble {θ_0^{(j)}}_{j=1}^J for each n, and thus operates in a finite-dimensional vector space, even if Θ is an infinite-dimensional vector space.

The natural objective function associated with the inverse problem (3.1) is

\[
\frac{1}{2}\left\| \Gamma^{-\frac{1}{2}}\big(y - G(\theta)\big) \right\|^2. \tag{3.3}
\]

The two primary reasons for using EKI to solve the SDE-based inverse problem that is our focus here are: (a) it does not require derivatives, which are difficult to compute in this SDE setting, and nonetheless it behaves like a projected gradient descent with respect to the objective (3.3), provably in the linear case (see, e.g., Schillings & Stuart, 2017) and approximately in the nonlinear case (see, e.g., Garbuno-Inigo et al., 2020); (b) it is robust to noisy evaluations of the forward map, as shown in Duncan et al. (2020). More generally, the EKI approach also has the advantage of being inherently parallelizable and of scaling well to high-dimensional problems.
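For concreteness, a single EKI step (3.2) reduces to a few lines of linear algebra once the forward map has been evaluated on the ensemble. The NumPy sketch below is our illustration, not the authors' implementation; whether the data y_n is perturbed per ensemble member (as in stochastic variants of the method) or held fixed differs between EKI variants, so both options are shown.

```python
import numpy as np

def eki_update(thetas, g_vals, y, gamma, rng=None):
    """One ensemble Kalman inversion step, cf. (3.2).

    thetas : (J_ens, p) parameter ensemble {theta_n^(j)}
    g_vals : (J_ens, J) noisy forward evaluations {G(theta_n^(j))}
    y      : (J,) data vector
    gamma  : (J, J) noise covariance
    rng    : if given, each member sees data perturbed by N(0, gamma)
    """
    J_ens = thetas.shape[0]
    dtheta = thetas - thetas.mean(axis=0)  # parameter deviations
    dg = g_vals - g_vals.mean(axis=0)      # forward-map deviations
    c_tg = dtheta.T @ dg / J_ens           # C_n^{theta G}, shape (p, J)
    c_gg = dg.T @ dg / J_ens               # C_n^{G G},     shape (J, J)
    if rng is None:
        y_n = np.tile(y, (J_ens, 1))
    else:
        y_n = y + rng.multivariate_normal(np.zeros(len(y)), gamma, size=J_ens)
    incr = np.linalg.solve(c_gg + gamma, (y_n - g_vals).T)  # (J, J_ens)
    return thetas + (c_tg @ incr).T
```

Iterating this map, with fresh noisy evaluations of G at each step, drives the ensemble towards consensus near a minimizer of (3.3).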

3.2 Gaussian Process Regression (GPR)

Throughout this paper we apply the EKI algorithm in a finite-dimensional vector space Θ ⊆ R^p. To estimate an unknown function, we use Gaussian process regression. More precisely, we will use the mean m(x) of a Gaussian process and optimize over the mean values of the process at a fixed set of design points, over the covariance of the noise in the observations used to define the Gaussian process, and over the parameters describing the covariance function of the Gaussian process. This leads to a parameter-dependent function m(x;θ), the construction of which we now detail. Note that the construction involves probabilistic considerations but, once completed, leads to a class of deterministic functions m(x;θ) over θ ∈ Θ. The key advantage of the GPR construction is that it leads to a hierarchical parameterization that has proved very useful in many machine learning tasks. Adapting it to statistical estimation more generally is potentially very fruitful, and is one of the ideas we use here. We are essentially proposing to use Gaussian process parameterizations beyond the simple regression setting, into more general function learning problems.

The set of parameters θ contains the following elements:

(i) noisily observed values of m′ = {m(x^{(i)})} at some fixed nodes x^{(i)}, denoted θ′ = {θ_i};

(ii) observation errors Σ_obs = diag(σ_{(i)}) at the fixed nodes x^{(i)}, based on the assumption that θ′ = m′ + √Σ_obs ξ for some ξ ∼ N(0, I);

(iii) hyper-parameters (σ, ℓ), respectively an amplitude and a length-scale, of the kernel k(x, x′) used in the GP regression.

The mean function m(x) minimizes

\[
I(m) := \frac{1}{2}\|m\|_K^2 + \sum_{i} \frac{1}{2\sigma_{(i)}^2}\,\big|m(x^{(i)}) - \theta_i\big|^2
\]

over K, the reproducing kernel Hilbert space associated with the covariance function k(x, x′; σ, ℓ). By the representer theorem (see, e.g., Rasmussen & Williams, 2006), it follows that the minimizer lies in the linear span of the functions {k(x, x^{(i)}; σ, ℓ)} over the design points, and may be found by solving a linear system of dimension equal to the number of design points (the dimension of θ′). The result is, however, a complex nonlinear function of θ = (θ′, Σ_obs, σ, ℓ). It is these parameters that are then learnt from data. Throughout this paper, we take Σ_obs to be a constant (which enters the vector θ) multiple of the identity.

4. Numerical Results

To demonstrate the capability of the proposed methodology, we apply it to four different examples, including classical chaotic systems (the Lorenz 63 system in subsection 4.1, the Lorenz 96 system in subsection 4.2), climate dynamics (ENSO in subsection 4.3), and molecular dynamics (the butane molecule dihedral angle in subsection 4.4). All these numerical studies confirm that the proposed methodology serves well as a systematic approach to the fitting of SDE models to data. In all cases, the data are initially presented in two figures, one showing the ability of the EKI method to fit the data, and a second showing how well the fitted SDE performs in terms of reproducing the invariant measure. We then show other figures which differ between the different cases and are designed to illustrate the nature of the fitted SDE model.


4.1 Lorenz 63 System

4.1.1 Noisy Lorenz 63: Simulation Study. The first set of experiments comprises simulation studies in which data from (2.4) are used to fit parameters within the following model:

\[
\begin{aligned}
\frac{dx_1}{dt} &= \alpha(x_2 - x_1) + \sqrt{\sigma}\,\frac{dW_1}{dt},\\
\frac{dx_2}{dt} &= x_1(\rho - x_3) - g_L(x_2) + \sqrt{\sigma}\,\frac{dW_2}{dt},\\
\frac{dx_3}{dt} &= x_1 x_2 - \beta x_3 + \sqrt{\sigma}\,\frac{dW_3}{dt}.
\end{aligned} \tag{4.1}
\]

Unlike the other examples in this paper, this initial simulation study involves data that come directly from an SDE within the model class being fitted. Specifically, the data are obtained by simulating the model with a given set of parameters: α = 10, ρ = 28, β = 8/3, and σ = 10, as well as the choice g_L(x₂) = x₂. The purpose of this example is to illustrate our computational methodology in a perfect test setting. Using EKI we will fit θ defined in two different ways:

(i) Fix g_L(x₂) = x₂ and learn θ = (α, σ). Results are presented in Figs. 4 to 5.

(ii) Parameterize g_L(x) by θ₁ and learn θ = (θ₁, σ). Results are presented in Figs. 6 to 8.

FIG. 4: First two moments of the state x for the Lorenz 63 system found by using EKI to estimate α (ODE case) and σ, α (SDE case). (a) ODE; (b) SDE.

The data in this case are a finite-time average approximation of {𝒢₁(x), 𝒢₂(x)}, i.e., the first and second moments of the state vector x are used as observational data. Therefore, the data vector y has 9 elements in total. In case (i) we fit both an ODE (constrain σ = 0) and an SDE (learn σ) to the given data. Thus, we are fitting 1 parameter in the ODE case and 2 parameters in the SDE setting, in both cases using a data vector y of dimension 9. The comparison of the output of the EKI algorithm with the true data, in the case of both ODE and SDE fits, is presented in Fig. 4. In the ODE case the algorithm fails to match the second moment (left panel), while in the SDE case all moments are well-matched. This has a significant effect on the ability to reproduce the invariant measure (Fig. 5). In the left panel (ODE), the fit is very poor, whereas in the right panel (SDE) it is excellent. The fit of the SDE is not surprising, since the data is generated directly from within the model class to be fit; the behaviour of the fit to an ODE gives insight into the method when the model is misspecified.

FIG. 5: Invariant measures of fitted models compared with those of the true noisy Lorenz 63 system: ODE case, fit α; SDE case, fit σ, α. Panels: (a) x₁ (ODE); (b) x₂ (ODE); (c) x₃ (ODE); (d) x₁ (SDE); (e) x₂ (SDE); (f) x₃ (SDE).

We now perform exactly the same set of experiments, fitting both an ODE and an SDE, but allowing the function g_L to be learnt as well. The function g_L is parameterized by the mean of a GP (with unknown values specified at 5 fixed nodes, and constant observation error and hyper-parameters). Thus we are fitting 8 parameters for the ODE and 9 parameters for the SDE, using a data vector y of dimension 9. When we do this, we are able to substantially improve the fit of the ODE both in data space, as seen in Fig. 6a, and as measured by the fit to the invariant measure, as seen in Fig. 7a-c. Nonetheless the fit achieved by using an SDE remains substantially better, as seen in Fig. 6b and Fig. 7d-f. The ensemble means of the Gaussian processes learnt in the ODE and SDE models are presented in Fig. 8. Both GPs successfully capture the linear trend in the middle range of x₂, while the GP learnt in the ODE model exhibits more oscillations around the middle part and more rapid deviation from the linear trend at both ends.


FIG. 6: First two moments of the state x for the Lorenz 63 system by using EKI to estimate σ and the linear function g_L(x₂). (a) ODE; (b) SDE.

4.1.2 Deterministic Lorenz 63: Dimension Reduction. We fit a two-dimensional SDE model of the form (2.7) to data generated from the first two components of the three-dimensional ODE model (2.5), (2.6). The four unknown functions (ψ₁(·), ψ₂(·), σ₁(·), σ₂(·)) are parameterized by the mean of Gaussian process regression. Each GPR involves the mean values at five fixed nodes, a stationary observation error, and the stationary hyper-parameters (σ_GP, ℓ). The precise form of the data in this case is a finite-time average approximation of {𝒢_m(a), 𝒢_ac(a)}. Specifically, the first four moments of the state vector a are used in this case. In addition, nine equally spaced data points are used from a finite-time average approximation of the autocorrelation function 𝒢_ac(a) for both states a₁ and a₂, from an interval of 10 time units, and we obtain 18 data points in total from the autocorrelation function. Thus we are fitting 32 parameters for the SDE by using a data vector y of dimension 27. The true moment data and autocorrelation data are presented in Fig. 9, alongside the results obtained by using EKI to fit the SDE model to this data, showing a relatively good match to the data. When measured by the ability to reproduce the invariant measure, the results demonstrate a fairly strong match between the marginal invariant densities on a₁ and a₂ from the original three-dimensional ODE and from the fitted two-dimensional SDE (Fig. 10). The fit to the autocorrelation functions of a₁ and a₂ is also quite good (Fig. 11). Fig. 12 compares the trajectories of the three-dimensional ODE (2.5), (2.6), projected on a₁ and a₂, with those of the fitted SDE (2.7); the difference in smoothness of the solutions of ODEs and SDEs is apparent at that level, although it can be seen that the fitted SDE model shows a similar pattern of irregular switching between the two distinct components of the attractor in the true system.

Despite the seemingly simple formulation of this example as outlined in section 2, it is a rather difficult problem.


FIG. 7: Invariant measures of fitted models (with linear function g_L(x₂) and σ as unknowns) and the true noisy Lorenz 63 system. Panels: (a) x₁ (ODE); (b) x₂ (ODE); (c) x₃ (ODE); (d) x₁ (SDE); (e) x₂ (SDE); (f) x₃ (SDE).

FIG. 8: The linear function g_L(x₂) learnt in (a) the ODE model and (b) the SDE model.


FIG. 9: Comparison of observation data between the true system and the EKI-fitted SDE model for the deterministic Lorenz 63 system. (a) Moments; (b) autocorrelation of a₁; (c) autocorrelation of a₂.


On the one hand, the standard Lorenz 63 system (2.5), (2.6) is well known as a classical chaotic system. On the other hand, chaos cannot arise in the reduced-order 2D model of the form (2.7) in a deterministic setting where the noise levels σᵢ are set to zero, by the Poincaré–Bendixson theorem. Palmer (2001) showed that a reduced-order 2D model of the Lorenz 63 system in discrete time could exhibit qualitative features of the original 3D system, if subjected to additive noise, but it remained an unsolved problem to fit a stochastic 2D model that quantitatively captures detailed chaotic behavior of the original 3D chaotic system. This motivated us to introduce both more unknown parameters and more observed statistics as data, compared to the previous simulation study (subsection 4.1.1). In so doing, we have demonstrated some success in fitting an SDE model to data from the ODE. We face similar challenges in the remaining examples in this section, where the data are generated from a more complex model than that to which they are fitted, or are indeed simply observed data.

FIG. 10: Invariant measures of the deterministic Lorenz 63 system and the fitted reduced-order SDE model. (a) a₁; (b) a₂.

4.2 Lorenz 96 System

We generate data from the slow variables {x_k}_{k=1}^K in (2.8), and use it to fit a parameterized function ψ(·) and parameter σ appearing in (2.11). In the experiments presented below, we take K = 36. The form of the data in this case is finite-time averaged approximations of {𝒢₁(x), 𝒢₂(x)}, i.e., observations of the first and second moments of the state vector x = {x_k}_{k=1}^n, for some n ⩽ K. The model discrepancy term ψ(X_k) in (2.11) is parameterized by the mean of a GP with mean values at 7 fixed nodes, and constant observation error and hyper-parameters. Thus, we are fitting 10 parameters for the ODE and 11 for the SDE, using a data vector y of dimension 44 (when only observing the first 8 slow variables) or of dimension 702 (when observing all 36 slow variables). Our numerical results illustrate several interesting properties of stochastic closures for the Lorenz 96 multiscale system:

a) we show that for the relatively large time scale separation of c = 10 it is possible to fit both an accurate ODE (σ = 0) and SDE (σ > 0), in the sense that both the fitted ODE and SDE models (2.11) accurately reproduce the invariant measure of the full system (2.8), using only n = 8;

b) we show that with the weaker scale separation of c = 3, using n = 8 leads to an ODE fit which is very poor, as measured by reproduction of the invariant measure, whereas the SDE fit, whilst not as good as for c = 10, is passable;

c) we show that by increasing n to equal K = 36, the ODE fit does not improve, but for the SDE it becomes excellent.


FIG. 11: Autocorrelation of the deterministic Lorenz 63 system and the fitted reduced-order SDE model. (a) a₁ (truth); (b) a₂ (truth); (c) a₁ (EKI); (d) a₂ (EKI).

FIG. 12: Time series of the deterministic Lorenz 63 system and the fitted reduced-order SDE model. (a) a₁ (truth); (b) a₂ (truth); (c) a₁ (EKI); (d) a₂ (EKI).



Results for case a) are presented in Figs. 13 to 14. It can be seen in Fig. 13 that both the fitted ODE model and the fitted SDE model show almost perfect agreement with the true system in data space. Furthermore, both fitted models agree well with the true system when we compare the invariant measure with that of the underlying data-generating model, in Fig. 14. Although the agreement is slightly better for the fitted SDE model in Fig. 14b, the performances of the fitted ODE and SDE models are quantitatively close to each other. The good performance of the fitted ODE model can be explained by invoking the averaging hypothesis to propose a closed ODE model of the form (2.11). The data implying the form of the closure ψ are presented in Fig. 15a. Specifically, the scatter of the true closure term is moderate for c = 10, indicating that a deterministic closure of the slow variables may well achieve good agreement with the true system.

FIG. 13: Comparison of observation data between the true Lorenz 96 system (c = 10) and the fitted ODE and SDE models. (a) ODE; (b) SDE.

Results for case b) are presented in Figs. 15 to 17. The averaging hypothesis does not apply so cleanly in this case where c = 3, as can be seen by comparing the moderate scatter in the data when c = 10 (Fig. 15a) with that obtained when c = 3 (Fig. 15b). In the second setting, it turns out that the fitted ODE model is far less satisfactory than the fitted SDE model. As seen in Fig. 16, the agreement with the true system in data space is still comparable between the fitted ODE and SDE models. However, studying the ability of the two models to fit invariant measures in Fig. 17, we see that the fitted ODE model concentrates around several discrete values, indicating a lack of chaotic behavior at long times.


FIG. 14: Comparison of invariant measures between the true Lorenz 96 system (c = 10) and the fitted ODE and SDE models. (a) ODE; (b) SDE.

By contrast, the fitted SDE model better captures the pattern of the invariant measure of the true system, although there is still room for improvement in this case.

FIG. 15: True closure term of the slow variables with different temporal scale separations between slow and fast variables. (a) c = 10, h = 1; (b) c = 3, h = 10/3.

Results for case c) are presented in Figs. 18 to 20. With the goal of improving the model fit, we augment the data to include all first and second moments of the slow variables. Figure 18 shows that the SDE fit achieves better agreement than the ODE fit in data space. As presented in Fig. 19, the fitted ODE model still tends to concentrate toward a small number of discrete values in its long-time behavior.


FIG. 16: Comparison of data between the true Lorenz 96 system (c = 3) and the fitted ODE and SDE models. Only the first 8 slow variables are used to compute the statistics. (a) ODE; (b) SDE.

FIG. 17: Comparison of invariant measures between the true Lorenz 96 system (c = 3) and the fitted ODE and SDE models. Only the first 8 slow variables are utilized to compute the statistics as data. (a) ODE; (b) SDE.


The invariant measure of the fitted SDE model, in contrast, demonstrates excellent agreement with the true system. The failure of the fitted ODE model to maintain, and the success of the fitted SDE model in maintaining, correct chaotic behavior for a long time can be seen more clearly in the time series presented in Fig. 20.

FIG. 18: Comparison of data between the true Lorenz 96 system (c = 3) and the fitted ODE and SDE models. All 36 slow variables are used to compute the first and second moments as data (only the first-moment data is presented here). (a) ODE; (b) SDE.

FIG. 19: Comparison of invariant measures between the true Lorenz 96 system (c = 3) and the fitted ODE and SDE models. All 36 slow variables are utilized to compute the statistics as data. (a) ODE; (b) SDE.


FIG. 20: Comparisons of time series of the slow variable X₁ by using (a) 8 observed slow variables with the ODE, (b) 8 observed slow variables with the SDE, and (c) 36 observed slow variables with the SDE. The time range over which we collect data for EKI is T = 100.

4.3 El Niño–Southern Oscillation (ENSO)

In this subsection we use the time-series data for the sea surface temperature (SST) T shown in Fig. 1a to fit an SDDE of the form (2.12), learning the four parameters a, b, c, σ. Recall that the delays τ₁ and τ₂ may be viewed as known properties of Rossby and Kelvin waves. The precise form of the data in this case is a finite-time average approximation of {{𝒢_m(T)}_{m=1}^4, 𝒢_ac(T), 𝒢_psd(T)}. Thus we are fitting 4 parameters in the SDDE by using a data vector y of dimension 16. The true moment data are presented in Fig. 21a. To use the autocorrelation function 𝒢_ac(T) as data, we sample 9 points from it with an interval of six months, as presented in Fig. 21b. We also use the coefficients of a second-order polynomial fit to the logarithm of the power spectral density; the three coefficients are presented in Fig. 21c. Results demonstrating the fit are presented in Figs. 21 to 23.

FIG. 21: The comparison of observed quantities, including (a) the first four moments, (b) the autocorrelation function, and (c) the coefficients of the fitted second-order polynomial of the power spectral density.

It can be seen in Fig. 21 that the fitted SDE model shows good agreement with the true ENSO statistics, with the exception of the two highest moments. Looking at the invariant measures presented in Fig. 22, we see evidence that the fitted SDE model can potentially capture the long-time behavior of ENSO and thus help us better understand the ENSO mechanism as well. The satisfactory performance of the fitted SDE model is also confirmed by further studying the time series in Fig. 23, where we can see that the simulated time series from the fitted SDE model captures well the statistical behavior of the ENSO data.

FIG. 22: The comparison of the invariant measures of SST between the true ENSO data and the fitted SDE model.

4.4 Butane Molecule Dihedral Angle

We fit the second-order Langevin equation (2.14) model to data derived from (2.13). The precise form of the data in this case is a finite-time average approximation of {{𝒢_m(φ)}_{m=1}^4, 𝒢_ac(φ), 𝒢_psd(φ)}, where φ denotes the dihedral angle. Thus, similarly to the previous subsection, we use the first four moments of φ, and the true moment data are presented in Fig. 24a. Furthermore, 𝒢_ac(φ) is approximated using 9 points sampled from the autocorrelation function with an interval of 50 fs, and these sampled data points are presented in Fig. 24b. Finally, we also use the coefficients of a second-order polynomial fit to the logarithm of the power spectral density, and the three coefficients are presented in Fig. 24c. As for the unknowns, we are learning the scalars γ and σ in (2.14), together with the potential Ψ constructed from Gaussian basis functions (with length scale fixed as 0.5) centered at 9 points evenly distributed in [−π, π], as sketched below. Thus, we are fitting 11 parameters for the SDE by using a data vector y of dimension 16. Results showing the fitted SDE are presented in Figs. 24 to 27.
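A sketch of this parameterization of Ψ (ours; the 9 evenly spaced centers and the length scale 0.5 follow the description above, while the unnormalized Gaussian form and the weight notation w are our assumptions):

```python
import numpy as np

centers = np.linspace(-np.pi, np.pi, 9)  # 9 evenly spaced basis centers
ell = 0.5                                # fixed length scale

def psi(phi, w):
    """Potential Psi(phi) = sum_i w_i exp(-(phi - c_i)^2 / (2 ell^2))."""
    return np.sum(w * np.exp(-0.5 * ((phi - centers) / ell) ** 2))

def dpsi(phi, w):
    """Derivative Psi'(phi), as needed for the drift in (2.14)."""
    basis = np.exp(-0.5 * ((phi - centers) / ell) ** 2)
    return np.sum(w * basis * (-(phi - centers) / ell ** 2))
```

The 9 weights w, together with γ and σ, account for the 11 parameters fitted by EKI.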

Figure 24 shows that the fitted SDE model achieves very good agreement with all the statistics of the true dihedral angle data. More importantly, we can see in Fig. 25 that the fitted SDE model also leads to an invariant measure that agrees well with that of the true dihedral angle data.

In order to further validate the fitted SDE model, and in particular to show that it successfully captures the transition behavior and frequency information of the true data, we further study the time series (Fig. 26) and autocorrelation function (Fig. 27) by simulating the fitted SDE model for a long time.


FIG. 23: The comparison of time series of SST between the true ENSO data and one realization of the fitted SDE model. (a) Truth; (b) SDE.

FIG. 24: The comparison of observed quantities of the dihedral angle φ, including (a) the first four moments, (b) the autocorrelation function and (c) the coefficients of the fitted second-order polynomial of the power spectral density.


FIG. 25: Invariant measures of the butane molecule dihedral angle.

Figs. 26 and 27 clearly show that the fitted SDE model successfully reproduces the statistical behavior of the true dihedral angle computed from the much more expensive full molecular dynamics simulation.

FIG. 26: Time series of the butane molecule dihedral angle. (a) Truth; (b) EKI.

FIG. 27: Autocorrelation of the butane molecule dihedral angle. (a) Truth; (b) EKI.


5. Conclusions

Although computing power has increased dramatically in the past several decades, so too has the complexity of the models that we would like to simulate. It is still infeasible to resolve all the interactions within the true system in many applications. SDEs arise naturally as models in many disciplines, even if the first principles governing equations are deterministic, because stochasticity can effectively model unresolved variables. However, standard statistical methodology for parameter estimation is not always suitable for fitting SDE models to real data, due to the frequently observed inconsistency between SDEs and real data at small time scales. In this work, we exploit the idea of using sufficient statistics found by finite-time averaging of time-series data. Using these statistics, we demonstrated that an SDE model, with the unknown functions parameterized by Gaussian process regression (GPR), can be fitted to the data using ensemble Kalman inversion (EKI).

Ensemble methods are particularly well-suited to this problem for several reasons: they are derivative-free, thereby sidestepping computational and technical issues that arise from differentiating SDEs with respect to parameters; they are inherently parallelizable; and they scale well to high-dimensional problems. The novel hierarchical GPR-based function approximation that we use meshes well with the EKI methodology.

Future directions of interest in this area include the derivation of a systematic approach to the determination of sufficient statistics, analysis of the EKI algorithm for learning in the context of these problems, analysis of the use of GPR-based function representation in nonlinear inverse problems, and the further application of the methodology to new problems arising in applications.

Acknowledgements The authors thank Dr. Yvo Pokern at University College London for providing the butane dihedral angle data. All authors are supported by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program, by Earthrise Alliance, Mountain Philanthropies, the Paul G. Allen Family Foundation, and the National Science Foundation (NSF, award AGS-1835860). A.M.S. is also supported by NSF (award DMS-1818977) and by the Office of Naval Research (award N00014-17-1-2079).

REFERENCES

ALBERS, D. J., LEVINE, M., GLUCKMAN, B., GINSBERG, H., HRIPCSAK, G. & MAMYKINA, L. (2017) Personalized glucose forecasting for type 2 diabetes using data assimilation. PLoS Computational Biology, 13.

ALBERS, D. J., BLANCQUART, P.-A., LEVINE, M. E., SEYLABI, E. E. & STUART, A. (2019) Ensemble Kalman methods with constraints. Inverse Problems, 35, 095007.

ARNOLD, H., MOROZ, I. & PALMER, T. (2013) Stochastic parametrizations and model uncertainty in the Lorenz '96 system. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371, 20110479.

BONINSEGNA, L., NUSKE, F. & CLEMENTI, C. (2018) Sparse learning of stochastic dynamical equations. The Journal of Chemical Physics, 148, 241723.

BROCKWELL, P. J., DAVIS, R. A. & FIENBERG, S. E. (1991) Time Series: Theory and Methods. Springer Science & Business Media.

BRUNTON, S. L. & KUTZ, J. N. (2019) Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control. Cambridge University Press.

BUCKWAR, E. (2000) Introduction to the numerical analysis of stochastic delay differential equations. Journal of Computational and Applied Mathematics, 125, 297–307.

CHEN, Y. & OLIVER, D. (2002) Ensemble randomized maximum likelihood method as an iterative ensemble smoother. Mathematical Geosciences, 44, 1–26.

CLEARY, E., GARBUNO-INIGO, A., LAN, S., SCHNEIDER, T. & STUART, A. M. (2020) Calibrate, emulate, sample. arXiv preprint arXiv:2001.03689.

COIFMAN, R. R., KEVREKIDIS, I. G., LAFON, S., MAGGIONI, M. & NADLER, B. (2008) Diffusion maps, reduction coordinates, and low dimensional representation of stochastic systems. Multiscale Modeling & Simulation, 7, 842–864.

COTTER, C., PAVLIOTIS, G. et al. (2009) Estimating eddy diffusivities from noisy Lagrangian observations. Communications in Mathematical Sciences, 7, 805–838.

DIEKMANN, A. & MITTER, P. (2014) Stochastic Modelling of Social Processes. Academic Press.

DJURDJEVAC, N., SARICH, M. & SCHUTTE, C. (2010) On Markov state models for metastable processes. Proceedings of the International Congress of Mathematicians 2010 (ICM 2010). World Scientific, pp. 3105–3131.

DOHERTY, J. E., FIENEN, M. N. & HUNT, R. J. (2010) Approaches to highly parameterized inversion: Pilot-point theory, guidelines, and research directions. US Geological Survey Scientific Investigations Report, 5168, 36.

DOUCET, A., DE FREITAS, N. & GORDON, N. (2001) An introduction to sequential Monte Carlo methods. Sequential Monte Carlo Methods in Practice. Springer, pp. 3–14.

DUNCAN, A., STUART, A. M. & WOLFRAM, M.-T. (2020) In preparation.

EMERICK, A. & REYNOLDS, A. (2013) Investigation of the sampling performance of ensemble-based methods with a simple reservoir model. Computational Geosciences, 17, 325–350.

ERNEUX, T. (2009) Applied Delay Differential Equations, vol. 3. Springer Science & Business Media.

FATKULLIN, I. & VANDEN-EIJNDEN, E. (2004) A computational strategy for multiscale systems with applications to Lorenz 96 model. Journal of Computational Physics, 200, 605–638.

FERGUSON, A. L., PANAGIOTOPOULOS, A. Z., KEVREKIDIS, I. G. & DEBENEDETTI, P. G. (2011) Nonlinear dimensionality reduction in molecular simulation: The diffusion map approach. Chemical Physics Letters, 509, 1–11.

FROYLAND, G., GOTTWALD, G. A. & HAMMERLINDL, A. (2014) A computational method to extract macroscopic variables and their dynamics in multiscale systems. SIAM Journal on Applied Dynamical Systems, 13, 1816–1846.

GARBUNO-INIGO, A., HOFFMANN, F., LI, W. & STUART, A. M. (2020) Interacting Langevin diffusions: Gradient structure and ensemble Kalman sampler. SIAM Journal on Applied Dynamical Systems, 19, 412–441.

GARDINER, C. (2009) Stochastic Methods, vol. 4. Springer, Berlin.

GIANNAKIS, D. (2019) Data-driven spectral decomposition and forecasting of ergodic dynamical systems. Applied and Computational Harmonic Analysis, 47, 338–396.

GOEL, N. S. & RICHTER-DYN, N. (2016) Stochastic Models in Biology. Elsevier.

IGLESIAS, M. A., LAW, K. J. & STUART, A. M. (2013) Ensemble Kalman methods for inverse problems. Inverse Problems, 29, 045001.

JULIER, S., UHLMANN, J. & DURRANT-WHYTE, H. F. (2000) A new method for the nonlinear transformation of means and covariances in filters and estimators. IEEE Transactions on Automatic Control, 45, 477–482.

KALLIADASIS, S., KRUMSCHEID, S. & PAVLIOTIS, G. A. (2015) A new framework for extracting coarse-grained models from time series with multiscale structure. Journal of Computational Physics, 296, 314–328.

KLUS, S., NUSKE, F., KOLTAI, P., WU, H., KEVREKIDIS, I., SCHUTTE, C. & NOE, F. (2018) Data-driven model reduction and transfer operator approximation. Journal of Nonlinear Science, 28, 985–1010.

KRUMSCHEID, S., PAVLIOTIS, G. A. & KALLIADASIS, S. (2013) Semiparametric drift and diffusion estimation for multiscale diffusions. Multiscale Modeling & Simulation, 11, 442–473.

KRUMSCHEID, S., PRADAS, M., PAVLIOTIS, G. & KALLIADASIS, S. (2015) Data-driven coarse graining in action: Modeling and prediction of complex systems. Physical Review E, 92, 042139.

KUTOYANTS, Y. A. (2013) Statistical Inference for Ergodic Diffusion Processes. Springer Science & Business Media.

KWASNIOK, F. & LOHMANN, G. (2009) Deriving dynamical models from paleoclimatic records: Application to glacial millennial-scale climate variability. Physical Review E, 80, 066104.

LEIMKUHLER, B. & REICH, S. (2004) Simulating Hamiltonian Dynamics, vol. 14. Cambridge University Press.

LORENZ, E. N. (1963) Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20, 130–141.

LORENZ, E. N. (1996) Predictability: A problem partly solved. Proc. Seminar on Predictability, vol. 1.

LUTKEPOHL, H. (2013) Introduction to Multiple Time Series Analysis. Springer Science & Business Media.

MAJDA, A. J. & HARLIM, J. (2012) Filtering Complex Turbulent Systems. Cambridge University Press.

MAJDA, A. J. & KRAMER, P. R. (1999) Simplified models for turbulent diffusion: theory, numerical modelling, and physical phenomena. Physics Reports, 314, 237–574.

MAYBECK, P. S. (1982) Stochastic Models, Estimation, and Control. Academic Press.

PALMER, T. N. (2001) A nonlinear dynamical perspective on model error: A proposal for non-local stochastic-dynamic parametrization in weather and climate prediction models. Quarterly Journal of the Royal Meteorological Society, 127, 279–304.

PAPASPILIOPOULOS, O., POKERN, Y., ROBERTS, G. O. & STUART, A. M. (2012) Nonparametric estimation of diffusions: a differential equations approach. Biometrika, 99, 511–531.

PAPAVASILIOU, A., PAVLIOTIS, G. & STUART, A. (2009) Maximum likelihood drift estimation for multiscale diffusions. Stochastic Processes and their Applications, 119, 3173–3210.

PAVLIOTIS, G. A., POKERN, Y. & STUART, A. M. (2012) Parameter estimation for multiscale diffusions: An overview. Statistical Methods for Stochastic Differential Equations, 124, 429.

PAVLIOTIS, G. & STUART, A. (2007) Parameter estimation for multiscale diffusions. Journal of Statistical Physics, 127, 741–781.

POKERN, Y., STUART, A. M. & WIBERG, P. (2009) Parameter estimation for partially observed hypoelliptic diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71, 49–73.

RASMUSSEN, C. E. & WILLIAMS, C. (2006) Gaussian Processes for Machine Learning. MIT Press.

RAYNER, N., PARKER, D. E., HORTON, E., FOLLAND, C. K., ALEXANDER, L. V., ROWELL, D., KENT, E. & KAPLAN, A. (2003) Global analyses of sea surface temperature, sea ice, and night marine air temperature since the late nineteenth century. Journal of Geophysical Research: Atmospheres, 108.

SCHILLINGS, C. & STUART, A. M. (2017) Analysis of the ensemble Kalman filter for inverse problems. SIAM Journal on Numerical Analysis, 55, 1264–1290.

SCHLICK, T. (2010) Molecular Modeling and Simulation: An Interdisciplinary Guide, vol. 21. Springer Science & Business Media.

SCHNEIDER, T., LAN, S., STUART, A. M. & TEIXEIRA, J. (2017) Earth system modeling 2.0: A blueprint for models that learn from observations and targeted high-resolution simulations. Geophysical Research Letters, 44, 12,396.

SCHUTTE, C. & SARICH, M. (2013) Metastability and Markov State Models in Molecular Dynamics, vol. 24. American Mathematical Soc.

SELTEN, F. (1995) An efficient empirical description of large-scale atmospheric dynamics. Ph.D. thesis, Turbulentie en Werveldynamica.

SMITH, A. (2013) Sequential Monte Carlo Methods in Practice. Springer Science & Business Media.

TUCKERMAN, M. (2010) Statistical Mechanics: Theory and Molecular Simulation. Oxford University Press.

TZIPERMAN, E., STONE, L., CANE, M. A. & JAROSH, H. (1994) El Niño chaos: Overlapping of resonances between the seasonal cycle and the Pacific ocean-atmosphere oscillator. Science, 264, 72–74.

WILKINSON, D. J. (2018) Stochastic Modelling for Systems Biology. CRC Press.

XIAO, H., WU, J.-L., WANG, J.-X., SUN, R. & ROY, C. (2016) Quantifying and reducing model-form uncertainties in Reynolds-averaged Navier–Stokes simulations: a data-driven, physics-informed Bayesian approach. Journal of Computational Physics, 324, 115–136.

YING, Y., MADDISON, J. & VANNESTE, J. (2019) Bayesian inference of ocean diffusivity from Lagrangian trajectory data. Ocean Modelling, 140, 101401.

ZHANG, L., MYKLAND, P. A. & AIT-SAHALIA, Y. (2005) A tale of two time scales: Determining integrated volatility with noisy high-frequency data. Journal of the American Statistical Association, 100, 1394–1411.

ZHANG, L. et al. (2006) Efficient estimation of stochastic volatility using noisy observations: A multi-scale approach. Bernoulli, 12, 1019–1043.

ZHANG, W., HARTMANN, C. & SCHUTTE, C. (2017) Effective dynamics along given reaction coordinates, and reaction rate theory. Faraday Discussions, 195, 365–394.