Parameter Estimation by Ensemble Kalman Filters with Transformed Data
TRANSCRIPT
Universität Stuttgart - Institut für Wasserbau, Lehrstuhl für Hydromechanik und Hydrosystemmodellierung
Prof. Dr.-Ing. Rainer Helmig
Diplomarbeit
Parameter Estimation by Ensemble Kalman Filters with Transformed Data
Submitted by
Anneli Schöniger, Matrikelnummer 2221942
Stuttgart, March 31st, 2010
Examiners: Jun.-Prof. Dr.-Ing. W. Nowak, Prof. Dr. rer. nat. Dr.-Ing. A. Bárdossy
External Advisor: Prof. Dr. Harrie-Jan Hendricks Franssen
I hereby certify that I have prepared this thesis independently, and that only those sources, aids and advisors that are duly noted herein have been used and/or consulted.
Stuttgart, March 31st, 2010
(Anneli Schöniger)
Universität Stuttgart - Institut für Wasserbau
Lehrstuhl für Hydromechanik und Hydrosystemmodellierung Jungwissenschaftlergruppe Stochastic Modelling of Hydrosystems
Prof. (jun.) Dr.-Ing. Wolfgang Nowak, M.Sc.
Thesis Topic: “Parameter Estimation by Ensemble Kalman Filters with Transformed Data”
Spatial variability in conjunction with scarce data introduces parameter and prediction uncertainty in simulations of flow and transport in heterogeneous porous media. A very recent development is the use of Ensemble Kalman Filters (EnKFs) to condition random heterogeneous parameter fields on measurement data. This way, one obtains an ensemble of conditional parameter fields along with their respective model predictions, allowing for a relatively rigorous uncertainty quantification at very low computational costs. The largest remaining drawback of EnKFs is that they are optimal (i.e., accurate in the sense of Bayesian updating) only for multi-Gaussian dependence between data and parameters. This thesis will investigate non-linear data transformations to move data closer to Gaussianity. For example, water saturation is physically bounded between zero and unity, such that a beta-to-normal transformation can improve the situation, leading to a higher degree of EnKF accuracy. Similar techniques are promising for hydraulic heads between two Dirichlet boundaries or concentrations between zero and the solubility limit. Other data types may be non-negative and treatable with log transforms. The expected gain of such transformations is a more accurate processing of information, and hence a more accurate quantification of parameter and prediction uncertainty. The expected improvement is substantial, but not complete, because univariate normality is not sufficient to ensure multivariate normality.
Individual work steps:
• Developing an automatic tool to determine adequate Gaussian anamorphosis transforms (parametric or empirical) for arbitrary data types
• Implementing the EnKF and numerical test cases (MATLAB-based FEM code)
• Comparing EnKF performance with and without transformed data
• Testing the robustness of the transform with respect to:
◦ Sampling error (too small ensemble sizes)
◦ Conceptual error (e.g., inaccurate boundary conditions)
◦ Extreme data values (at the uncertain margins of the transform)
• Investigating the remaining degree of non-multi-normality in the multivariate dependence after transformation, e.g., by copula analysis
Collaborations and connections:
• This thesis is a cooperation with Prof. Harrie-Jan Hendricks Franssen (FZ Jülich), mirrored by similar work on soil moisture with remote sensing data in his group.
• For the copula analysis, cooperation with Prof. Bárdossy is intended.
• The EnKF is black-box compatible with arbitrary simulation codes and data types. At the same time, the proposed improvement will move EnKFs closer to parameter estimation in highly non-linear multiphase flow and transport problems. This offers an excellent opportunity for exchange or collaboration with, e.g., the IRTG NUPUS.
Contact:
Jun.-Prof. Dr.-Ing. Wolfgang Nowak, M.Sc.
Institut für Wasserbau/SimTech
Pfaffenwaldring 7a
70569 Stuttgart
Email: [email protected]
Phone: 0711/685-60113
Parameter Estimation by Ensemble Kalman Filters with Transformed Data
Uncertain hydrogeological parameters compromise the reliability of predictions of contaminant spreading in the subsurface. In this work, an inverse stochastic modeling framework is used for parameter estimation. It allows available measurement data to be included, the uncertainty of model prognoses to be quantified, and exceedance probabilities to be determined as a basis for decision-making. Assimilation of available data by Ensemble Kalman Filters (EnKFs) has been successfully applied to state variable estimation in atmospheric and oceanic sciences [Evensen, 2007]. Recent research has prepared the use of EnKFs for parameter estimation in groundwater applications [Nowak, 2009]. The largest remaining drawback of EnKFs is their optimality only for multivariate Gaussian distributed variables. This is a major limitation for subsurface parameter estimation, since flow and transport variables generally do not show Gaussian dependence on hydraulic conductivity. This study investigates the use of non-linear, monotonic transformations that render arbitrary marginal distributions of state variables Gaussian. The transformation step is included in the EnKF without interfering with its traditional analysis scheme. Transformation approaches have recently been presented by Béal et al. [2010] and Simon and Bertino [2009] in the context of state estimation; this study transfers the methodology to parameter estimation. Moreover, critical issues such as non-stationarity of state variables, implementation of physical bounds on state variable values, and clustering of distributions at these bounds are addressed. Results show that (1) an implicit pseudo-linearization is achieved by Gaussian anamorphosis, and (2) the linearized dependence of the transformed state variables on the parameters increases the efficiency of the updating step. This yields a more accurate prognosis of flow and transport in heterogeneous aquifers. The proposed approach (combining numerically efficient EnKFs for parameter estimation with Gaussian anamorphosis of data) is an attractive alternative for handling strongly non-linear model behavior, given that existing linearization-free methods are computationally demanding.
Parameterschätzung mit Ensemble Kalman Filtern angewandt auf transformierte Daten
Uncertain hydrogeological parameters impair the predictive quality of simulations of contaminant spreading in the subsurface. In this work, parameters are estimated by means of inverse stochastic modeling. This methodological framework makes it possible to include available measurement data, to quantify the uncertainty of model prognoses, and to determine exceedance probabilities that serve as a basis for decision-making. The assimilation of available data with the Ensemble Kalman Filter (EnKF) has already been applied successfully to state estimation in the oceanic and atmospheric sciences [Evensen, 2007]. Recently published research has paved the way for the use of the EnKF for parameter estimation in groundwater models [Nowak, 2009]. Its largest remaining weakness is that optimal results can be expected only for multivariate Gaussian distributed variables. This is a major restriction of its applicability to the estimation of subsurface parameters, since flow and transport variables generally show no Gaussian dependence on hydraulic conductivity. This thesis investigates the application of non-linear, monotonic transformations that convert arbitrary marginal distributions of state variables into the normal distribution. The transformation step is embedded in the filter without modifying its basic procedure. Transformation approaches were recently presented by Béal et al. [2010] and Simon and Bertino [2009] in the context of state estimation; the present work transfers the methodology to parameter estimation. Furthermore, critical issues such as non-stationarity of state variables, implementation of physical value bounds, and clustering of distribution functions at these bounds are examined. The results show that (1) an implicit pseudo-linearization is achieved by Gaussian anamorphosis, and (2) the linearized dependence of the transformed state variables increases the efficiency of the updating step. This leads to a more accurate prediction of flow and transport in heterogeneous aquifers. The proposed method (combining numerically efficient EnKFs for parameter estimation with the Gaussian anamorphosis of data) is an attractive alternative for dealing with strongly non-linear model behavior, since existing simulation techniques without linearization entail a large computational effort.
Acknowledgements
I hereby thank the German Research Foundation (DFG) for the funding within the International Research Training Group “Non-Linearities and Upscaling in Porous Media” (NUPUS).
Contents

1 Motivation
2 Approach
3 Flow and Transport in the Subsurface
   3.1 Conceptual Model
   3.2 Mathematical Model
      3.2.1 Assumptions
      3.2.2 Balance Equations
4 Geostatistics
   4.1 Probability Functions
      4.1.1 Univariate Probability Functions
      4.1.2 Multivariate Probability Functions
   4.2 Statistical Moments
   4.3 Statistics of Random Fields
   4.4 Spatial Dependence
   4.5 Spatial Interpolation and Simulation
5 Data Assimilation
   5.1 History of Kalman Filters
   5.2 Ensemble Kalman Filter
      5.2.1 Analysis Scheme
      5.2.2 Ensemble Kalman Filter for Parameter Estimation
   5.3 Particle Filter
6 Gaussian Anamorphosis in the Ensemble Kalman Filter
   6.1 Classification of Transformation Techniques
      6.1.1 Direct Transformation
      6.1.2 Indirect Transformation Techniques
   6.2 Anamorphosis Function Fitting
      6.2.1 Interpolation Techniques
      6.2.2 Regression Techniques
   6.3 Definition of Tails
      6.3.1 Handling of Clustered Data
      6.3.2 Extrapolation towards Population Bounds
      6.3.3 Fitting the Tails
   6.4 Exactness of Transformations
      6.4.1 Performance Test Procedure
      6.4.2 Performance of Interpolation Techniques
      6.4.3 Performance of Expansion in Hermite Polynomials
      6.4.4 Performance of Other Regression Techniques
      6.4.5 Methods of Choice Based on Performance Tests
   6.5 Implementation in Ensemble Kalman Filter
      6.5.1 Accounting for Properties of State Variables
      6.5.2 Comparability of Observations and Simulated Measurements
      6.5.3 Transformation of Measurement Error
      6.5.4 Parameter Updating Step
      6.5.5 Model Bias
7 Application to Synthetic Test Case
   7.1 Numerical Implementation
   7.2 Description of Test Case
   7.3 Test Procedure
8 Results and Discussion
   8.1 Filtering Procedure with Transformed Data
   8.2 Effects of Transformation
      8.2.1 Pseudo-Linearized Dependence of States on Parameters
      8.2.2 Bivariate Dependence Structures of State Variables
      8.2.3 Qualitative Differences in the Updating Step
   8.3 Transformation of Different Data Types
      8.3.1 Drawdown
      8.3.2 Hydraulic Head
      8.3.3 Solute Concentration
      8.3.4 Suitability of State Variable Types for Gaussian Anamorphosis
   8.4 Comparison with Particle Filter as Reference Solution
9 Summary, Conclusion and Outlook
Bibliography
Nomenclature
αl / αt Longitudinal / transverse dispersivity
C Copula
c Solute concentration
D Diffusion tensor / dispersion tensor
ε Vector of measurement errors
F Cumulative distribution function (CDF)
F (x) Cumulative distribution function of the original variable
G (z) Standard Gaussian cumulative distribution function
h Hydraulic head
Kf Hydraulic conductivity
λ Correlation length
µ Mean
N Sample size / ensemble size
nmeas Number of observation points
p Fluid pressure
f Probability density function (PDF)
φ Porosity
ψ Anamorphosis function
q Specific discharge
Qϑϕ Covariance / covariance matrix
R Measurement error covariance matrix
ρ Fluid density
r Rank correlation
s Vector of parameters
σ2 Variance
Θ, Φ Random variables
ϑ, ϕ Random variable values
t Time
v Seepage velocity
x Vector of coordinates
x Original variable
yo Vector of observations
yu Vector of simulated state variables
z Transformed variable
List of Figures

1.1 Histograms of relevant variables for groundwater models: Bars represent relative frequency, solid line shows the normal distribution that corresponds to mean and variance calculated from the sample. Data sets taken from two different measurement locations
6.1 Building the empirical CDF. Intervals of non-exceedance probability assigned to each sample data point, designated by double arrows for an exemplary sample size N = 10
6.2 Graphical Gaussian transformation: For any value x, the corresponding Gaussian value z can be found by F(x) = G(z)
6.3 Empirical anamorphosis function
6.4 Histograms of untransformed and transformed variable and normal probability plot for transformed variable
6.5 Dealing with clustered data at the lower bound of the fictitious sample. Ellipse highlights the discontinuity at the clustered data point
6.6 Defining minimum and maximum values for the Gaussian transform
6.7 Beta distributions that represent different data types. Parameters a, b are given in parentheses
6.8 Perfect anamorphosis functions, depending on the parameters of the beta distribution function that the sample is taken from
6.9 Deviations from perfect transformation, depending on sample size. Note that the lowest and highest value of the original variable depend on the randomly drawn sample
6.10 Performance of linear interpolation techniques, depending on sample size
6.11 Performance of interpolation and regression techniques, depending on sample size
6.12 Oscillations of Hermite polynomials
6.13 Properties of expansion in Hermite polynomials, depending on sample size
6.14 Performance of CDF smoothing techniques, depending on sample size
6.15 Regression techniques to smooth the empirical CDF
6.16 Anamorphosis function. Illustration of clustering, ensemble and physical bounds, and extension towards ± infinity (for any x < xmin: z = zmin; for any x > xmax: z = zmax)
8.1 Synthetic truth: Log-conductivity field and drawdown field
8.2 A priori ensemble statistics of log-conductivity and drawdown
8.3 Gaussian anamorphosis of drawdown data. Upper row shows the transformation of the ensemble at the measurement location closest to the well, lower row summarizes the transformation at the other measurement locations
8.4 Drawdown ensembles before (top) and after (bottom) updating at the measurement location closest to the pumping well. The observed value is marked by the thick red line
8.5 A posteriori ensemble statistics of log-conductivity and drawdown
8.6 Synthetic fields with marked measurement locations: Numbers indicate the pairs of strongly correlated state variable and log-conductivity
8.7 Dependence of drawdown on log-conductivity (Locations 1 and 2)
8.8 Dependence of head on log-conductivity (Locations 3 and 4)
8.9 Dependence of concentration on log-conductivity (Locations 5 and 6)
8.10 Empirical copula density for drawdown at locations 1 and 2 (left) and theoretical Gaussian copula density (right) with same rank correlation
8.11 Empirical copula density for heads at locations 3 and 4 (left) and theoretical Gaussian copula density (right) with same rank correlation
8.12 Empirical copula density for concentration at locations 5 and 6 (left) and theoretical Gaussian copula density (right) with same rank correlation
8.13 Influence function of measurement 1 (drawdown) on the parameter field
8.14 Influence function of measurement 3 (head) on the parameter field
8.15 Influence function of measurement 5 (concentration) on the parameter field
8.16 Ratio of diagonal of measurement covariance matrix and measurement error variance
8.17 Synthetic log-conductivity and drawdown field and best estimates resulting from different transformation methods in the EnKF
8.18 A priori ensemble variance of log-conductivity and drawdown field and conditional variances resulting from different transformation methods in the EnKF
8.19 Statistics of drawdown residuals from different assimilation methods
8.20 Synthetic log-conductivity and head field and best estimates resulting from different transformation methods in the EnKF
8.21 A priori ensemble variance of log-conductivity and head field and conditional variances resulting from different transformation methods in the EnKF
8.22 Statistics of head residuals from different assimilation methods
8.23 Prior and conditioned ensemble in Gaussian space with data clustering at the lower bound
8.24 Synthetic log-conductivity and concentration field and best estimates resulting from different transformation methods in the EnKF
8.25 A priori ensemble variance of log-conductivity and concentration field and conditional variances resulting from different transformation methods in the EnKF
8.26 Statistics of concentration residuals from different assimilation methods
8.27 Synthetic log-conductivity and head field (upper row) and best estimates resulting from different transformation methods in the EnKF and the particle filter
8.28 A priori ensemble variance of log-conductivity and drawdown field (upper row) and conditional variances resulting from different transformation methods in the EnKF and the particle filter
8.29 Statistics of drawdown residuals resulting from different transformation methods in the EnKF and the particle filter (PF)
List of Tables

6.1 Overview of suggested transformation methods. CDF stands for the cumulative distribution function of the original data; ANA represents the anamorphosis function that links the original data with the Gaussian transformed data
6.2 Statistics of untransformed and transformed variable in comparison with theoretical values for a Gaussian variable
7.1 Model parameters used for the synthetic test case. K, log K stand for conductivity and log-conductivity, respectively. h, d, c represent the state variables head, drawdown and concentration. ε symbolizes measurement error. For concentration data, the measurement error standard deviation is composed of an absolute and a relative part and results in a measurement-specific standard deviation
8.1 RMSE of updated fields with regard to the synthetic truth. Comparison between assimilation of untransformed data and updating with transformed data and scaled measurement error variance. Note that a negative percentage of reduction means an increase in RMSE in the transformed run compared with the untransformed one
8.2 RMSE of fields obtained from the three different EnKF assimilation methods with regard to synthetic truth and particle filter (PF)
1 Motivation
The prediction of contaminant transport in the subsurface has gained much importance, as groundwater represents an essential part of urban water supply and contamination sources multiply with the current development of industry and urbanization. Possible health risks resulting from contaminated wells have to be prevented.
To make responsible prognoses, the underlying processes of flow and transport in an aquifer have to be identified and implemented in numerical models. Soil structures can be very heterogeneous due to their geological formation over time (e.g., sedimentation, erosion, fracturing). Although the spatial variability of parameters like conductivity is commonly accepted, modelers tend to assume homogeneous (upscaled) conditions to reduce computational effort as well as measurement costs.
As opposed to the deterministic approach, where parameters as well as boundary conditions are assumed to be well known, stochastic models are able to quantify the uncertainty that comes along with the reduction of complexity from reality to the upscaled model. In order to reduce the uncertainty inherent to a prognosis, all available data have to be included in the model, while unknown parameters should be estimated. Mainly two different approaches to inverse modeling have emerged: inverse parameter estimation and stochastic simulation. Hendricks Franssen et al. [2009] provide a detailed comparison of representative methods of both approaches.
Classical inverse estimation has been discussed extensively in the literature [e.g., Poeter and Hill, 1997, Carrera et al., 2005]: An objective function built of partial derivatives of state variable measurements with respect to the parameters is minimized. As a result, the configuration of parameters that best fits the given observation data is obtained.
The second approach to parameter estimation has been developed within the framework of Monte Carlo simulations [Robert and Casella, 2004]. An ensemble of parameter fields is used as equally likely input for a numerical model. The fields are conditioned on the available observations, and simulations on those conditioned parameter fields then produce an ensemble of state variable predictions. The mean of the ensemble prediction is a good estimator for the expected contaminant spreading and, as an additional benefit of this method, the uncertainty associated with this prognosis is obtained as well. With these results, health risks can be quantified by exceedance probabilities of legal limits, which forms a basis for monetary decision-making.
For data assimilation within the context of inverse modeling, the Ensemble Kalman Filter (EnKF) [Evensen, 2007] has attracted attention. It has been successfully applied to state variable estimation in atmospheric and oceanic sciences [Bertino et al., 2003, Béal et al., 2010] and has become popular because of its ease of implementation and comparatively low computational costs.
Recent research has modified the procedure of EnKFs to make them suitable for parameter estimation and thus for application to stochastic modeling of subsurface flow [Nowak, 2009]. The largest remaining drawback of EnKFs is their theoretical derivation for multivariate Gaussian distributed variables: EnKFs do not perform optimal Bayesian updates for data of arbitrary, non-Gaussian distribution [Evensen, 2007].
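For orientation, the EnKF analysis step for parameter estimation can be sketched as follows. This is a minimal illustrative implementation in Python/NumPy (the thesis code itself is MATLAB-based); the function name, the perturbed-observations variant, and the variable shapes are assumptions of this sketch, not the thesis's actual code:

```python
import numpy as np

def enkf_update(S, Y, y_obs, R):
    """One EnKF analysis step for parameter estimation (sketch).

    S     : (n_par, N) ensemble of parameter vectors
    Y     : (n_meas, N) corresponding simulated measurements
    y_obs : (n_meas,) observed values
    R     : (n_meas, n_meas) measurement error covariance
    """
    N = S.shape[1]
    S_mean = S.mean(axis=1, keepdims=True)
    Y_mean = Y.mean(axis=1, keepdims=True)
    # Ensemble estimates of cross-covariance Q_sy and measurement covariance Q_yy
    Q_sy = (S - S_mean) @ (Y - Y_mean).T / (N - 1)
    Q_yy = (Y - Y_mean) @ (Y - Y_mean).T / (N - 1)
    # Kalman gain
    K = Q_sy @ np.linalg.inv(Q_yy + R)
    # Update each realization with perturbed observations
    eps = np.random.multivariate_normal(np.zeros(len(y_obs)), R, size=N).T
    return S + K @ (y_obs[:, None] + eps - Y)
```

The covariances are estimated purely from the ensemble, which is what makes the filter black-box compatible with arbitrary simulation codes, and also what ties its optimality to (multi-)Gaussian dependence between parameters and data.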
State variables in subsurface flow and transport do not generally show Gaussian dependence on the parameter conductivity. The type of distribution and dependence is governed by physical processes and imposed boundary conditions [e.g., Nowak et al., 2008, Bellin and Tonina, 2007]. Figure 1.1 shows histograms of flow and transport variables at two different, arbitrarily chosen measurement locations. It can be clearly seen that different histogram shapes result from the different data types and locations. Only a fraction of the data approximately follows a normal distribution, which is indicated by the solid lines.
[Figure 1.1 — six histogram panels: heads, drawdowns and concentrations at measurement locations #1 and #2; y-axis: relative frequency]

Figure 1.1: Histograms of relevant variables for groundwater models: Bars represent relative frequency, the solid line shows the normal distribution that corresponds to the mean and variance calculated from the sample. Data sets taken from two different measurement locations.
Therefore, the non-optimality of updating is a major limitation to the use of EnKFs in subsurface parameter estimation.
The aim of this thesis is to analyze and mitigate the effects of non-Gaussian data dependence on the performance of parameter estimation EnKFs. Non-linear, monotonic transformations of marginal distributions will be investigated to move arbitrarily distributed data closer to Gaussianity in the univariate sense. This transformation step will be included in the EnKF without interfering with its usual procedure.
Data transformations have been examined in previous works [Béal et al., 2010, Simon and Bertino, 2009], but have so far only been applied to state variable estimation, and critical issues like non-stationarity of state variables and physical boundaries have not yet been satisfyingly discussed. The reader is referred to Chapter 2 for an overview of the issues that will be tackled within this study.
It is expected that the linearized univariate dependence of the transformed state variables on the subsurface parameters will be exploited more efficiently when performing the conditioning step as suggested in this study. This would result in a more accurate prognosis of flow and transport in heterogeneous soils. It shall be demonstrated that parameter estimation by ensemble Kalman filters with transformed data can rightly be considered an attractive, computationally efficient alternative to existing conditional simulation techniques that are able to handle strongly non-linear model behavior.
2 Approach
The focus of this study is on finding appropriate transformations that render arbitrary data almost univariate Gaussian. The development of a generally applicable transformation function will be presented in the main part of the study. Subsequently, a numerical test case will illustrate the improvement that can be achieved by running an ensemble Kalman filter with transformed data. The approach for both parts is described in the following.
Data of arbitrary distributions are transferred into a non-parametric space by rank transformation [Conover and Iman, 1981]. Using the inverse Gaussian cumulative distribution function to transform a ranked data set into an almost-Gaussian one is known in the literature as Gaussian anamorphosis [e.g., Chilès and Delfiner, 1999]. Different methods to obtain the empirical anamorphosis function are presented in Section 6.1. In the next step, a continuous anamorphosis function is fitted to cover any possible measurement value (Section 6.2), even outside the empirical anamorphosis function (Section 6.3). The different transformation techniques are examined and evaluated with regard to their benefits and drawbacks. Their exactness is assessed via performance tests with synthetic data that follow known distributions (Section 6.4). Depending on the distribution of the data and the ensemble size, the most suitable technique (i.e., the most accurate and most stable one) is identified.
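The rank-based construction of the empirical anamorphosis can be sketched as follows. This is an illustrative Python snippet (the thesis implementation is MATLAB-based); the plotting position i/(N + 1) used here is one of several possible choices for assigning non-exceedance probabilities, and the function name is made up for this sketch:

```python
from statistics import NormalDist

def gaussian_anamorphosis(sample):
    """Empirical Gaussian anamorphosis of a sample (sketch).

    Each value gets its rank-based non-exceedance probability
    i / (N + 1) and is mapped through the inverse standard Gaussian
    CDF G^{-1}, yielding transformed values z with G(z) = F(x).
    """
    N = len(sample)
    std_normal = NormalDist()  # standard Gaussian, mean 0, stddev 1
    # rank each value within the sample (1 = smallest)
    order = sorted(range(N), key=lambda i: sample[i])
    z = [0.0] * N
    for rank, i in enumerate(order, start=1):
        z[i] = std_normal.inv_cdf(rank / (N + 1))
    return z
```

Note that ties (data clustering at physical bounds) would make this ranking non-unique; handling such clustered data requires the special treatment discussed in Section 6.3, while this sketch assumes distinct values.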
The additional step of data transformation in the procedure of the EnKF allows for the inclusion of additional a priori knowledge about state variables, e.g., physical bounds. This information can be included in the transformation and thus guarantees that the simulated measurements will take on physical values. Bounded state variables are prone to data clustering at the bounds and to non-unique transformation, which has not yet been addressed in detail in the literature, but is discussed and taken care of in Section 6.3.
Section 6.5 focuses on the implementation of data transformation in the EnKF. It has to be decided whether the transformation functions shall be valid all over the domain (global approach, Simon and Bertino [2009]) or whether they are constructed to be valid at a specific measurement location (local approach, Béal et al. [2010]). The latter approach is used here because it accounts for non-stationarity of flow and transport state variables: Different, location-specific anamorphosis functions are built from the ensemble of realizations at each measurement location.
The transformation step is included in an existing parameter estimation EnKF code, written in MATLAB. It is designed as an automatic tool that chooses an appropriate
transformation for each type of variable according to the user’s specifications or default values depending on ensemble size and recommended type of transformation.
Chapter 7 presents an application that allows for an assessment of the improvement achieved by Gaussian anamorphosis. A groundwater flow and transport model is chosen as test case. State variables like hydraulic heads, drawdowns and concentrations are simulated by a finite element code with a random, heterogeneous conductivity field as input parameter and certain imposed boundary conditions. “Real” measurements are taken from this synthetic truth to cancel out the influence of model bias. By running an ensemble Kalman filter code, the measurements of a specific type are assimilated and the stochastic parameter field is calibrated on these available data.
The filter’s performance with untransformed and transformed data is compared with regard to the quality of the prognosis in Chapter 8. Criteria for the latter are the reduction of prediction error as well as the increase in prediction confidence. Statistics of the residuals at the measurement locations resulting from the different EnKF assimilation methods will be evaluated. The effects of transformation on the dependence structure of state variables and consequently on the updating procedure are explained in Section 8.2. Gaussian anamorphosis applied to the variables drawdown, head and concentration will be presented and discussed in Section 8.3. Finally, the prediction accuracy of the EnKF applied to transformed data will be verified in a comparison with results obtained from the particle filter, which is considered to be the reference solution for stochastic parameter estimation (Section 8.4).
A summary, conclusions and an outlook are given in Chapter 9, with special attention to the question how the performance of the EnKF could be further improved significantly, e.g., by multivariate transformations using copulas to ensure multi-Gaussian dependence between flow and transport state variables and subsurface parameters. Transforming marginal distributions is expected to be a substantial, but not yet complete, improvement towards optimality of the filter, because multi-Gaussianity does not necessarily follow from Gaussian marginal distributions [Gómez-Hernández and Wen, 1998].
The following Chapters 3, 4 and 5 present the basic physical, (geo-)statistical and mathematical concepts assumed for this investigation of non-linear data transformations towards Gaussianity with application to subsurface flow and transport.
3 Flow and Transport in the Subsurface
Gaussian anamorphosis in the ensemble Kalman filter shall be demonstrated and tested with transformed groundwater flow and transport data. To clarify the nomenclature and model concepts applied in Chapter 7, some basic definitions are given here.
3.1 Conceptual Model
The domain chosen for our test case is a horizontal 2D segment of an aquifer with heterogeneous conductivity. For simplicity, recharge through rainfall and percolation is not taken into account. A fully saturated domain is assumed; thus we are modeling a one-phase (groundwater), two-component (water, solute) problem. The following transport processes are considered:
- Diffusion: elliptical spreading of a solute driven by Brownian motion
- Advection: directional transport induced by the flow field
- Dispersion: spreading of a plume in a flow field because of heterogeneities and upscaling effects (from pore scale to representative elementary volume scale, Bear [1972])
Reactions as well as adsorption/absorption are neglected.
3.2 Mathematical Model
In the following, a mathematical model is presented that translates the conceptual model into a set of differential equations which can be discretized and solved numerically (this will be tackled in Section 7.1).
For simplicity of illustration, the complexity of the flow and transport model is reduced: a few assumptions on the properties of both the fluid and the soil matrix are made to obtain a stationary, depth-averaged model. The general methodology is not affected by these simplifications.
3.2.1 Assumptions
Assumptions on the behavior of the fluid (groundwater):
- Incompressible, spatially constant density
- Creeping flow, thus inertial forces can be neglected
Assumptions with regard to the soil matrix:
- Incompressible, with constant porosity
- Locally isotropic conductivity
- Isothermal system, no heat balance required
- No external forces, depth-averaged approach
Assumptions concerning the solute:
- Isotropic diffusion
- Conservative tracer: no reaction / adsorption / absorption
3.2.2 Balance Equations
Mass Balance
The continuity equation can be derived by balancing the mass fluxes that enter or leave a representative elementary volume and their changes over time. In differential form, it reads:
∂(φρ)/∂t + ∇ · (ρq) = φ (q_in − q_out)    (3.1)
with porosity φ, fluid density ρ, time t, specific discharge q and source/sink terms q_in, q_out. ∇·(...) denotes the divergence. Respecting the assumptions made previously, Equation 3.1 can be reduced to a steady-state, two-dimensional flow equation:
∇ · q = q_in − q_out.    (3.2)
Momentum Balance - Darcy’s Law
In 1856, Henry Darcy found from experiments an empirical relation between the flow rate and the pressure gradient present in an aquifer [Darcy, 1856]. Neuman [1977] showed that this equation can also be derived analytically from the Navier-Stokes equations by picking the relevant assumptions from the list in Section 3.2.1 and assuming a Newtonian fluid. Darcy's law states that the flow rate in porous media is proportional to the prevailing hydraulic head gradient ∇h = ∇ (p + ρgz), with p being the fluid pressure; the conductivity K_f acts as the constant of proportionality:
q = −K_f ∇h    (3.3)
With the assumption of zero gravity, the head gradient is equivalent to the pressure gradient. Combined with the continuity equation (Equation 3.2), we obtain the simplified groundwater flow equation used to mathematically describe hydrostatic (∇h = 0) and hydrodynamic conditions in the subsurface:
∇ · (K_f ∇h) = q_in − q_out    (3.4)
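As an illustrative aside (not part of the thesis code), the simplified flow equation can be discretized and solved numerically. The following Python sketch assumes a 1D version of Equation 3.4 without sources, with Dirichlet boundary heads and a hypothetical per-cell conductivity vector K:

```python
# Illustrative sketch: steady 1D groundwater flow, d/dx(K dh/dx) = 0,
# solved by finite differences with Dirichlet boundary heads.
import numpy as np

def solve_steady_flow_1d(K, h_left, h_right):
    """K: conductivity per cell (length n); heads are solved at the n+1 nodes."""
    n = len(K)
    A = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    A[0, 0] = 1.0; b[0] = h_left          # Dirichlet boundary, left
    A[n, n] = 1.0; b[n] = h_right         # Dirichlet boundary, right
    for i in range(1, n):                 # interior flux balance per node
        A[i, i - 1] = K[i - 1]
        A[i, i] = -(K[i - 1] + K[i])
        A[i, i + 1] = K[i]
    return np.linalg.solve(A, b)

K = np.array([1.0, 1.0, 1.0, 1.0])        # homogeneous test case
h = solve_steady_flow_1d(K, 1.0, 0.0)
```

For homogeneous conductivity, the solved head profile is linear between the boundary values, which serves as a quick sanity check; a heterogeneous K bends the profile accordingly.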
Transport Equation
In an analogous way to the continuity equation, a mass balance can be expressed for the solute. It is first shown in its general form, known as the advection-diffusion equation (ADE):
φ ∂c/∂t − (φ/ρ)(∂ρ/∂t) c + ∇ · (φvc) − ∇ · (φD∇c) = φr + q_in (c_in − c)    (3.5)
with c being the concentration of the solute to be balanced, ∇(...) representing the gradient, D being the diffusion tensor, r being a reaction term and q_in, c_in describing a fluid flux entering the system with a certain solute concentration. For porous media, the seepage velocity v = q/φ is used to describe the effective velocity in the pore space.
Macroscopic dispersion is an implicit part of the model since conductivity is chosen to be element-wise (locally) constant but varying from element to element, which represents heterogeneities in the soil. To account for hydrodynamic dispersion, the diffusion tensor is replaced by a dispersion tensor [Scheidegger, 1961] that parameterizes the influence of both diffusion and dispersion in any spatial direction (with I symbolizing the identity matrix, D_e the effective diffusion coefficient and α_l, α_t the longitudinal and transverse dispersivity, respectively):
D = (v vᵀ / ‖v‖) (α_l − α_t) + I (D_e + α_t ‖v‖)    (3.6)
Again including our assumptions, Equation 3.5 can be reduced to

∂c/∂t + v · ∇c − ∇ · (D∇c) = 0    (3.7)
if only a sink (e.g., a well) is present in the domain, but no source. By solving the groundwater flow equation for h, calculating q according to Equation 3.3 and dividing by φ, v can be determined; subsequently, Equation 3.7 can be solved. A continuous solute injection is chosen as boundary condition to create a stationary model: the time derivative vanishes and we can predict the shape of the stationary plume in our domain.
4 Geostatistics
Uncertainty inherent to a model prediction can be quantified if the probability of parameters, state variables or events is known. To describe the probabilistic behavior of these variables, probability functions have to be found. General definitions and the most popular examples are given in the following section.
4.1 Probability Functions
4.1.1 Univariate Probability Functions
A distribution function or cumulative distribution function (CDF) is defined by

F(ϑ) = ∫_{−∞}^{ϑ} f(ϑ′) dϑ′    (4.1)

where f(ϑ) is the probability density function (PDF), with the properties f(ϑ) ≥ 0 for all ϑ and ∫_{−∞}^{∞} f(ϑ) dϑ = 1. While F yields the non-exceedance probability of the random variable Θ for a value ϑ, f can be interpreted as the probability density with which the value ϑ is drawn from the total population.
Gaussian Distribution
The univariate Gaussian distribution is most often used to describe a random variable's probability because of its "mathematical attractiveness" [Johnson and Wichern, 1988]: it satisfactorily describes many natural and technical processes, and it has convenient statistical properties, e.g., symmetry and maximum entropy if only the first and second moments are known [Cover and Thomas, 2006]. The Gaussian probability function is defined as
f(ϑ) = 1/(σ√(2π)) · exp(−(1/2) ((ϑ − µ)/σ)²)    (4.2)

with mean µ and variance σ². The Gaussian probability function with µ = 0 and σ² = 1 is referred to as standard-Gaussian.
Log-normal Distribution
If the logarithm of a variable follows a Gaussian distribution, the variable is said to be log-normally distributed. The probability function
f(ϑ) = 1/(√(2π) σ ϑ) · exp(−(ln(ϑ) − µ)² / (2σ²))    (4.3)
is defined only for values ϑ > 0.
Beta Distribution
Another popular univariate distribution is the beta distribution, which proves to be very flexible in its shape, depending on the parameters α and β. As opposed to the distributions mentioned above, the beta distribution is bounded on both sides, with support on the interval [0, 1]:
f(ϑ) = 1/B(α, β) · ϑ^(α−1) (1 − ϑ)^(β−1),    (4.4)
with B being the beta function [Abramowitz and Stegun, 1964]. Note that the beta distribution can be scaled to cover any finite interval [a, b].
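As a hedged illustration of the three distributions above, the following Python sketch instantiates Equations 4.2 to 4.4 with scipy.stats (the parameter values are arbitrary choices), including the loc/scale rescaling of the beta distribution to an interval [a, b]:

```python
# Illustrative sketch: the Gaussian, log-normal and beta distributions
# of Eqs. (4.2)-(4.4) via scipy.stats; parameter values are assumptions.
import numpy as np
from scipy import stats

mu, sigma = 0.0, 1.0
gauss = stats.norm(loc=mu, scale=sigma)              # Eq. (4.2)
lognorm = stats.lognorm(s=sigma, scale=np.exp(mu))   # Eq. (4.3): ln(x) ~ N(mu, sigma^2)

a, b = 0.0, 1.0                                      # support interval [a, b]
beta = stats.beta(a=2.0, b=5.0, loc=a, scale=b - a)  # Eq. (4.4), rescaled via loc/scale

area_check = beta.cdf(b) - beta.cdf(a)               # = 1: all mass inside [a, b]
```

The loc/scale mechanism is how scipy realizes the remark that the beta distribution can be shifted and stretched to any finite interval.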
4.1.2 Multivariate Probability Functions
In most natural processes, state variables do not depend on one single parameter but on several different ones. The probability of a certain combination of parameters leading to a certain variable value can be described by multivariate distributions.
The probability of two events happening together is given by the joint PDF f(ϑ, ϕ). The conditional PDF or posterior PDF f(ϑ|ϕ) is used to determine the probability of event ϑ, given that event ϕ has happened. If any information about event ϕ is ignored, we obtain the marginal PDF or prior PDF of ϑ: f(ϑ) = ∫_{−∞}^{∞} f(ϑ, ϕ) dϕ. This leads to Bayes' theorem:

f(ϑ|ϕ) = f(ϑ) f(ϕ|ϑ) / f(ϕ).    (4.5)
Please refer to Berger [1985] for details on Bayesian analysis. In the context of parameter estimation, the conditional probability of a particular unobserved state vector y_u, given known observations y_o, can be written as

f(y_u|y_o) = f(y_u) f(y_o|y_u) / f(y_o).    (4.6)
4.2 Statistical Moments
For high-dimensional systems, complete distribution functions are too much data to handle. Information can instead be reduced to characteristics of the distributions, namely statistical moments. Commonly used univariate moments are:
- Expected value (first moment): the average value if an infinite number of samples is drawn from the random variable Θ:

  µ = E[Θ] = ∫_{−∞}^{∞} ϑ f(ϑ) dϑ    (4.7)
- Variance (second central moment): the mean squared deviation from the expected value:

  σ² = E[(Θ − E[Θ])²] = ∫_{−∞}^{∞} (ϑ − E[Θ])² f(ϑ) dϑ,    (4.8)

  with standard deviation σ = √(σ²)
To characterize bivariate distributions, the covariance of two random variables Θ and Φ can be used:

Q_ϑϕ = E[(Θ − E[Θ])(Φ − E[Φ])] = ∬_{−∞}^{∞} (ϑ − E[Θ])(ϕ − E[Φ]) f(ϑ, ϕ) dϑ dϕ    (4.9)
If the random variables Θ and Φ are independent, the joint probability becomes f(ϑ, ϕ) = f(ϑ) f(ϕ) and the covariance equals zero. The converse does not hold: a covariance of zero does not imply independence; this is only the case if we are dealing with a bivariate Gaussian distribution.
4.3 Statistics of Random Fields
Random fields are functions of random variables that depend on the location x = (x, y, z). A conductivity field can be seen as one realization of a random field f(ϑ(x)). To be able to derive moments at a location x, several realizations of the random field need to be available (e.g., from stochastic simulations):
- Sample mean: average value of N independent realizations of the given random field

  µ(x) ≈ ϑ̄(x) = (1/N) Σ_{i=1}^{N} ϑ_i(x)    (4.10)
- Sample variance: variance of the sample of realizations

  σ²(x) ≈ (1/(N−1)) Σ_{i=1}^{N} (ϑ_i(x) − ϑ̄(x))²    (4.11)
- Sample covariance: covariance between the random variables at two different locations x₁, x₂ of the random field

  Q_ϑϑ(x₁, x₂) ≈ (1/(N−1)) Σ_{i=1}^{N} (ϑ_i(x₁) − ϑ̄(x₁)) (ϑ_i(x₂) − ϑ̄(x₂))    (4.12)
- Sample correlation: normalized covariance

  Cor(ϑ(x₁), ϑ(x₂)) = Q_ϑϑ(x₁, x₂) / (σ(x₁) σ(x₂))    (4.13)
For multivariate analysis, the variance-covariance matrix [Wackernagel, 2003] can be constructed. This matrix is filled with bivariate covariances (auto-covariances along the main diagonal), estimated in analogy to Equation 4.12:

Q_ϑϑ,ij ≈ (1/(N−1)) Σ_{k=1}^{N} (ϑ_k(x_i) − ϑ̄(x_i)) (ϑ_k(x_j) − ϑ̄(x_j))    (4.14)
4.4 Spatial Dependence
Subsurface parameter fields cannot be sufficiently characterized by marginal distributions because of strong spatial dependence: within the correlation length λ of a parameter, the random variable at a specific location depends to a certain degree on the random variables at surrounding locations. A tool to capture spatial dependence is the variogram [e.g., Chilès and Delfiner, 1999]. For varying separation distances h, the variance of the difference between values at two locations separated by h is determined and plotted as the experimental semi-variogram:
γ(h) = (1/2) Var[ϑ(x + h) − ϑ(x)] = (1/2) E[(ϑ(x + h) − ϑ(x))²]    (4.15)
The variogram is founded on the intrinsic hypothesis, which assumes that the increment ϑ_h = ϑ(x + h) − ϑ(x) is a second-order stationary random function (the mean of the increment is constant all over the domain or shows a linear drift, and the variance of the increment depends only on the separation vector h).
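A minimal sketch of the experimental semi-variogram of Equation 4.15, here for a 1D transect with integer lags (an uncorrelated field is used as an assumed test case, for which γ(h) should level at the field variance):

```python
# Illustrative sketch: experimental semi-variogram (Eq. 4.15) on a 1D transect,
# averaging squared increments over all point pairs at the same lag.
import numpy as np

def semivariogram_1d(values, max_lag):
    """gamma(h) = 0.5 * mean[(v(x+h) - v(x))^2] for integer lags h = 1..max_lag."""
    gamma = np.empty(max_lag)
    for h in range(1, max_lag + 1):
        inc = values[h:] - values[:-h]        # increments at separation h
        gamma[h - 1] = 0.5 * np.mean(inc ** 2)
    return gamma

rng = np.random.default_rng(1)
white = rng.normal(size=20000)                # uncorrelated field, variance 1
gamma = semivariogram_1d(white, 5)
```

For the uncorrelated test field, all γ(h) estimates cluster around the variance of one; a correlated field would instead show γ rising from small values towards a sill near the correlation length.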
Theoretical variogram models can be fitted to data points experimentally determined from Equation 4.15. They are a useful tool to describe an averaged correlation depending on the separation distance. This expected value is sufficient to fully describe Gaussian spatial dependence, but if a random field exhibits a different type of dependence, other tools need to be found to exploit information on the spatial behavior of different quantiles of the field's distribution. An example of such an alternative tool are copulas [Bárdossy and Li, 2008]: independent of the respective marginal distributions, they reveal multivariate spatial structures in all quantiles. Their statistical meaning is expressed by Sklar's theorem [Sklar, 1959]:
f(ϑ₁, ϑ₂, ..., ϑ_n) = C(F(ϑ₁), F(ϑ₂), ..., F(ϑ_n))    (4.16)

The copula C joins the marginal distributions F(ϑ_i) together into a multivariate distribution function f(ϑ₁, ..., ϑ_n); or, put the other way around, the copula itself is "cleaned" of the influence of the marginal distributions on the actual spatial structure and is thus able to reveal structures that might be hidden in a variogram.
4.5 Spatial Interpolation and Simulation
If parameters or state variables are of interest at locations other than the measurement points, the available data have to be interpolated. Different approaches can be chosen from: either the data are directly interpolated, e.g., by kriging, or parameters are inversely estimated and state variables are subsequently determined from simulations based on the conditioned parameter field.
Kriging originates from mining statistics and is known to be a Best Linear Unbiased Estimator (BLUE) for the conditional mean value of a random field at a location x* and its covariance [e.g., Kitanidis, 1997]. It considers the spatial configuration of observations in the vicinity of x* and the observed values ϑ_α at the n surrounding locations:

ϑ* = Σ_{α=1}^{n} λ_α ϑ(x_α)    (4.17)
The weights λ_α depend on the location x* and are evaluated with the help of covariance functions (constructed from the variogram model) between the different observations and the point to be estimated.
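The weight computation can be sketched as a simple-kriging solve (a hypothetical exponential covariance model and a zero-mean field are assumptions made here for illustration):

```python
# Illustrative sketch: simple kriging weights. The lambda_alpha of Eq. (4.17)
# solve Q_obs @ lam = q_star, with Q_obs the covariances among observations
# and q_star the covariances between observations and the target x*.
import numpy as np

def exp_cov(h, variance=1.0, length=10.0):
    """Exponential covariance model (an assumed choice, not from the thesis)."""
    return variance * np.exp(-np.abs(h) / length)

x_obs = np.array([0.0, 5.0, 20.0])       # observation locations (1D)
v_obs = np.array([1.2, 0.8, 0.5])        # observed values, zero-mean field assumed
x_star = 4.0                             # target location

Q_obs = exp_cov(x_obs[:, None] - x_obs[None, :])
q_star = exp_cov(x_obs - x_star)
lam = np.linalg.solve(Q_obs, q_star)     # kriging weights lambda_alpha
v_star = lam @ v_obs                     # estimate, Eq. (4.17)
```

A useful property visible in this sketch: if x* coincides with an observation location, the weight vector collapses onto that observation and the estimate reproduces the measured value exactly.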
Not only single points can be interpolated with the help of covariance functions; an entire domain can also be simulated based on the chosen geostatistical model. There is a variety of simulation methods applied in the geostatistics community, among them spectral methods [Dietrich and Newsam, 1993], which will be implemented to efficiently generate random but spatially correlated conductivity fields in this study.
If variables depend on a primary parameter and the physical relationship can be well approximated by simulation, inverse methods might be preferred to pure geostatistical interpolation. Simulations implicitly account for physical processes, non-linearities and physical boundaries, while kriging only reflects the spatial configuration [Schwede and Cirpka, 2010]. Chapter 5 will present data assimilation techniques for inverse modeling that take advantage of both simulation and interpolation.
5 Data Assimilation
The term data assimilation refers to the process of combining an uncertain model prediction of a spatially dependent state variable with a set of discrete measurements [Evensen, 2007]. The challenge of giving the best possible estimate of the variable field consists in extracting (or filtering) as much information as possible from typically noisy observations. Measurements naturally include an unknown measurement error with an assumed variance, which shall be taken into consideration during data assimilation. Therefore, the optimal assignment of weights to the individual measurements, depending on their spatial configuration and measurement error, has to be found, which basically resembles the process of kriging (Section 4.5).
The procedure of performing inverse modeling in order to estimate a random field of state variables can be summarized as follows:
1. A discrete, stochastic state-space model is formulated to predict a model state.
2. The predicted model state is improved by assimilating observed measurements.
3. The prediction confidence is determined as a basis for risk management.
In data assimilation, a weight of zero for an observation is equivalent to the assumption of a perfect simulation, while a maximum weight fully accepts the noisy observation. Filtering methods have to be found that choose an appropriate value in between, to reasonably combine both the imperfect simulation and the imperfect observation.
The steps named above are derived for time-independent models. For dynamical models, the assimilation method needs to include observations whenever they become available (e.g., in weather forecasting). Sequential data assimilation estimates the unobserved variables in sequential time steps; therefore, information is propagated forward in time while backward integration is not necessary.
The filtering process can be divided into a predictive step and a conditioning step: the state at time t_k is integrated according to the dynamic model to obtain a model state prediction for the next time step t_{k+1}, at which observations are available; then the predicted state is conditioned on these observations.
The Kalman filter, invented by Kalman [1960], allows one to evaluate the evolution of the prediction error over time and has been further developed until today for different fields of application, e.g., weather forecasting. A short overview of the history of Kalman filters is provided in the following section.
5.1 History of Kalman Filters
The Kalman filter is a set of equations to compute the best conditional estimate of the state variable given an a priori estimate and observed measurements (compare Bayes' theorem, Equation 4.6). Those equations represent a least-squares problem [Sorenson, 1970] and are solved by evaluating cross- and auto-covariances between estimated state variables and observations. As the covariance is an optimal measure only for linear dependence, the Kalman filter is an optimal, unbiased estimator for linear models of multivariate Gaussian variables [Evensen, 2007]. Non-Gaussian marginal distributions or multivariate dependence structures can originate from strongly non-linear model behavior and compromise the results of the filter.
The extended Kalman filter was developed to address this problem by linearly approximating the error propagation and applying a closure assumption. Both the Kalman filter and the extended Kalman filter are very costly in computation and storage for high-dimensional dynamical models. An additional drawback remains the questionable applicability to non-linear models due to divergence issues [Pearson et al., 1997].
As a Monte Carlo alternative to the deterministic filters listed above, the ensemble Kalman filter (EnKF) [Evensen, 1994] was intended to overcome those two drawbacks: it provides a more appropriate closure scheme for non-linear models and is computationally more efficient.
5.2 Ensemble Kalman Filter
Within the framework of Monte Carlo simulations, the forecasted probability density function of the state variable is approximated by a large ensemble of simulated model states. The analysis scheme for a dynamical model is discussed below; in Section 5.2.2, differences between state estimation and parameter estimation are pointed out to prepare the ground for the application of the EnKF to conductivity estimation as performed in Chapter 7.
5.2.1 Analysis Scheme
The set of equations needed to update an ensemble of model states is presented here. An error-free model is assumed, whose prediction quality is affected by uncertain initial conditions and parameters as well as measurement errors. The deviation of the simulated state at a certain location from the given observation shall be corrected by the filter not only exactly at this location; the information about the observed value shall also be spread to the surrounding area.
Note that the observed state variable does not necessarily have to be of the same type as the variable to be estimated. It could as well be a different state variable that shows a strong (preferably linear) relationship to the estimated state variable (e.g., surface phytoplankton and nitrate concentration in the ocean [Béal et al., 2010]). The operator H maps the estimated state ϑ^k at time t_k onto the observed state d^k, which can be expressed in matrix notation:

d^k_sim = H ϑ^k    (5.1)
Perturbed measurements are generated by adding a Gaussian-distributed measurement error ε with zero mean and a prescribed variance to the unconditioned simulated state vector ϑ_u, in order to ensure comparability with real, noisy observations d. The deviation of simulated states from observed states for every realization i at time t_k can then be written as

Δ^k_i = d^k − (H ϑ^k_{u,i} + ε_i).    (5.2)
In the next step, this deviation (or innovation) is weighted with regard to the amount of trust that shall be put in the observation. This clearly depends on the size of the measurement error: measurements with a high measurement error shall not be taken too seriously. The covariance of the measurements also has an impact on the worth of the data: measurements that strongly depend on each other shall not be overestimated in their individual importance. Therefore, Equation 5.2 is divided by the measurement error covariance matrix R and the estimated measurement covariance matrix H Q_ϑϑ Hᵀ, with Q_ϑϑ being the estimated ensemble covariance matrix:

Q^k_ϑϑ = E[(ϑ^k_u − E[ϑ^k_u]) (ϑ^k_u − E[ϑ^k_u])ᵀ]    (5.3)
R consists of the measurement error variances on the main diagonal (ensemble variance determined according to Equation 4.11) and off-diagonal entries of zero, because measurement errors are assumed to be uncorrelated.
Finally, this normalized innovation has to be translated into the state space of the random variable that shall be estimated. The relationship between the observed variable at each measurement location and the estimated state variable at each point of the domain is given by the cross-covariance matrix:

Q^k_ϑd = Q^k_ϑϑ Hᵀ    (5.4)
The ensemble Kalman filtering step can then be formulated and performed for each ensemble member to obtain a conditioned state variable vector ϑ_{c,i}:

ϑ^k_{c,i} = ϑ^k_{u,i} + Q^k_ϑd (Q^k_ϑϑ + R)^{−1} [d^k − (H ϑ^k_{u,i} + ε_i)]    (5.5)
The influence function K = Q_ϑd (Q_ϑϑ + R)^{−1} is called the Kalman gain and is responsible for the importance that is assigned to the measurements and for the spatial range of influence of the innovation term. Note that the Kalman gain is formally equivalent to simple co-kriging. The EnKF converges to the result of the classical Kalman filter with increasing ensemble size and is derived to be an error-covariance-minimizing scheme for multi-Gaussian state variables and a linear observation model H [Burgers et al., 1998].
The error covariance matrix is defined with regard to the ensemble mean, as the true state is not known:

Q^k_{ϑϑ,c} = E[(ϑ^k_c − E[ϑ^k_c]) (ϑ^k_c − E[ϑ^k_c])ᵀ]    (5.6)
Thus, the ensemble mean is interpreted as the best estimate, and its variance is taken as a measure of the a posteriori prediction uncertainty.
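The analysis scheme above can be condensed into a short Python sketch (illustrative only; the thesis implementation is in MATLAB). It applies Equation 5.5 with the measurement-space covariance written out explicitly as H Q Hᵀ, and the toy prior, observation operator and error variance are assumptions:

```python
# Illustrative sketch of the EnKF analysis step (Eq. 5.5) with perturbed
# observations; all numbers below are toy assumptions, not thesis results.
import numpy as np

def enkf_update(ens, H, d, R, rng):
    """ens: (n_state, N) unconditioned ensemble; H: (n_obs, n_state) observation
    operator; d: (n_obs,) observations; R: (n_obs, n_obs) error covariance."""
    n_obs, N = H.shape[0], ens.shape[1]
    anomalies = ens - ens.mean(axis=1, keepdims=True)
    Q = anomalies @ anomalies.T / (N - 1)            # ensemble covariance, Eq. (5.3)
    K = Q @ H.T @ np.linalg.inv(H @ Q @ H.T + R)     # Kalman gain
    eps = rng.multivariate_normal(np.zeros(n_obs), R, size=N).T  # perturbations
    return ens + K @ (d[:, None] - (H @ ens + eps))  # Eq. (5.5), all members at once

rng = np.random.default_rng(2)
prior = rng.normal(0.0, 1.0, size=(2, 2000))         # two states, 2000 members
H = np.array([[1.0, 0.0]])                           # observe the first state only
d = np.array([1.5])
R = np.array([[0.25]])
posterior = enkf_update(prior, H, d, R, rng)
```

In this linear-Gaussian toy case, the updated ensemble approximates the exact Bayesian posterior of the observed state (mean 1.2, variance 0.2), which is the sense in which the EnKF is optimal under multi-Gaussianity.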
5.2.2 Ensemble Kalman Filter for Parameter Estimation
Because of its ease of implementation and low computational costs [Evensen, 2003], the EnKF has gained much popularity in state variable estimation and has recently been adapted for use in parameter estimation (quasi-linear Kalman ensemble generator [Nowak, 2009]).
Instead of conditioning a time-variant state variable on state variable observations, time-invariant uncertain parameters are estimated with the help of state observations. Equation 5.5 then becomes

s_{c,i} = s_{u,i} + Q_sy (Q_yy + R)^{−1} (y_o − (y_{u,i} + ε_i)),    (5.7)
with s_u representing the parameter vector to be conditioned and y_o, y_u being the observed and simulated state variable vectors, respectively.
Note that in the case of parameter estimation, the operator H does not establish a relationship between different types of state variables that suffer from uncertain initial conditions, but maps uncertain parameters onto dependent states, e.g., in the form of a flow and transport model.
5.3 Particle Filter
The particle filter [Gordon et al., 1993] is an ensemble-based alternative to the EnKF that is computationally very costly but comes with advantageous characteristics: no assumptions on the shape of the prior PDFs are made, and the filter is able to handle arbitrarily non-linear models. The particle filter is a direct numerical implementation of Bayes' theorem (Equation 4.6) and is therefore optimal and accurate for an infinite ensemble size.
In contrast to the EnKF, the particle filtering step does not introduce an innovation term and condition the parameter field according to it; instead, it assigns a weight to each of the realizations and gives a weighted mean of the ensemble as the best estimate. Hence, the particle filter is a resampling method that does not perform local changes in the parameter fields.
The normalized weights are determined as the Bayesian probability that the values d are observed given the simulated state y_u:

w_i = f(d | y_{u,i}) / Σ_{j=1}^{N} f(d | y_{u,j})    (5.8)
The weighted mean as the best estimate at a specific location in the domain is calculated according to

µ_weighted ≈ ϑ̄_weighted = Σ_{i=1}^{N} w_i ϑ_i    (5.9)
and the weighted variance results in

σ²_weighted ≈ Σ_{i=1}^{N} w_i (ϑ_i − ϑ̄_weighted)²    (5.10)
where ϑ stands for either the parameter s to be estimated or the state y simulated to match the given observation d.
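Equations 5.8 to 5.10 can be sketched as follows (Python; a Gaussian likelihood with a single observation is an assumption made for illustration, and the log-weights are shifted by their maximum for numerical stability):

```python
# Illustrative sketch of the particle filter moments (Eqs. 5.8-5.10) with an
# assumed Gaussian likelihood for one observation d with error variance r.
import numpy as np

def particle_filter_moments(y_sim, theta, d, r):
    """Weights from f(d | y_u,i), then weighted mean and weighted variance."""
    log_lik = -0.5 * (d - y_sim) ** 2 / r
    w = np.exp(log_lik - log_lik.max())    # shift for numerical stability
    w /= w.sum()                           # normalized weights, Eq. (5.8)
    mean = np.sum(w * theta)               # weighted mean, Eq. (5.9)
    var = np.sum(w * (theta - mean) ** 2)  # weighted variance, Eq. (5.10)
    return w, mean, var

rng = np.random.default_rng(3)
theta = rng.normal(size=100000)            # prior parameter ensemble
y_sim = theta                              # identity "model" assumed: state = parameter
w, mean, var = particle_filter_moments(y_sim, theta, d=1.0, r=1.0)
```

For this linear toy case the weighted moments approach the analytical Bayesian posterior (mean 0.5, variance 0.5), illustrating why the particle filter serves as the reference solution.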
In this study, the particle filter will be used as the reference solution with regard to data assimilation for non-linear models, because it does not rely on the assumptions of univariate or multivariate Gaussianity or linearity.
6 Gaussian Anamorphosis in the Ensemble Kalman Filter
As explained above, the Ensemble Kalman Filter is only optimal for multivariate Gaussian state and parameter variables. Yet, it has been widely used as a reasonably accurate method in cases with non-multi-Gaussian variables [Evensen, 2007]. The objective of this study is to show that the Ensemble Kalman Filter can be moved closer to optimality by applying it to transformed variables that follow at least a Gaussian marginal distribution. Therefore, appropriate transformation methods have to be found that turn arbitrary variables into Gaussian variables in the univariate sense. The following sections will present different transformation types and point out their major benefits and drawbacks.
6.1 Classification of Transformation Techniques
6.1.1 Direct Transformation
The most direct and quite empirical approach to turn a skewed distribution into an approximately Gaussian distribution is to treat the variable with an appropriate mathematical expression, e.g., apply the natural logarithm to a positively skewed variable or square a negatively skewed variable.
With regard to groundwater flow, a direct transformation might prove useful to handle drawdown data. These data are positively skewed, as many values equal or close to zero can be found and only a few high values build the right tail towards infinity. To transform drawdown data d into a Gaussian sample, the natural logarithm could be applied:
z = log (d) (6.1)
This log-transformation is a special case of a family of transformations introduced by Box and Cox [1964], which can be parameterized as

z = ((d + λ₂)^λ₁ − 1) / λ₁,    λ₁ ≠ 0
z = log(d + λ₂),    λ₁ = 0.
Such a transformation is able to produce acceptable results if the input data are bounded on one side (a parameter λ₂ > −d_min ensures positivity for any lower bound d_min). For variables with two physical boundaries, this class of transformations is insufficient.
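A minimal sketch of the Box-Cox family as parameterized above (Python; the parameter choices in the example are arbitrary):

```python
# Illustrative sketch of the Box-Cox transformation family.
import numpy as np

def box_cox(d, lam1, lam2=0.0):
    """z = ((d + lam2)**lam1 - 1) / lam1 for lam1 != 0, else log(d + lam2)."""
    shifted = np.asarray(d, dtype=float) + lam2
    if lam1 == 0.0:
        return np.log(shifted)
    return (shifted ** lam1 - 1.0) / lam1

d = np.array([0.5, 1.0, 2.0])
z_log = box_cox(d, lam1=0.0)     # the log-transform special case, Eq. (6.1)
z_sqrt = box_cox(d, lam1=0.5)    # a square-root-type member of the family
```

Note that the family is continuous in λ₁: for λ₁ → 0 the power branch converges to the logarithmic branch, which is why the log-transform is its limiting special case.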
Bertino et al. [2003] tried to transform concentration data into Gaussian values by applying the natural logarithm, but the properties of the transformed data are not close enough to Gaussianity for our needs. Therefore, it will be shown that indirect transformations are more flexible with regard to the input data and produce output data that can even be considered standard-normally distributed.
6.1.2 Indirect Transformation Techniques
An arbitrarily distributed variable and its Gaussian transform are linked by their cumulative distribution functions (CDFs). Van der Waerden [1965] used this fact to examine test statistics, and Krzysztofowicz [1997] gave a detailed description of the intuitive analytic relationship

z = G^{−1}[F(x)]    (6.2)

for a random variable x, its cumulative distribution function F(x), the Gaussian variable z and the standard normal distribution function G(z). As G(z) is by definition monotonically increasing, the inverse G^{−1} exists. The operation

ψ(x) = G^{−1}[F(x)]    (6.3)

is called Gaussian transformation or Gaussian anamorphosis function [e.g., Chilès and Delfiner, 1999].
In order to find the Gaussian anamorphosis function, the cumulative distribution function F has to be determined in a first step. Subsequently, the inverse Gaussian distribution function has to be evaluated. This stepwise procedure will be referred to as indirect transformation in this study. There are different approaches to indirectly build the Gaussian anamorphosis function; they can be divided into parametric and non-parametric methods.
Parametric Methods
For state variables like concentrations, pressure heads or drawdowns, theoretical distribution functions can often be inferred from groundwater flow and transport processes and the imposed boundary conditions. This has been reviewed in Chapter 1. With the knowledge that our variable theoretically follows a certain distribution, we can estimate the parameters of that distribution function with the help of the maximum likelihood method. By applying the inverse Gaussian CDF to the theoretical cumulative frequencies of our sample according to Equation 6.2, we can obtain the anamorphosis function ψ(x) quite easily and with negligible computational effort.
Non-Parametric Methods
If fitting a parametric distribution function is not appropriate because the data do not seem to follow any specific theoretical distribution, applying non-parametric methods is an alternative. Those distribution-free methods produce more robust results in the sense that fewer assumptions are made: information about the shape of the underlying distribution is not needed. Instead, characteristics of the data are drawn from the sample itself, which requires a large sample size. When applied to a Monte Carlo process, this should not be a hurdle. With the rising popularity of Monte Carlo methods, non-parametric statistics have become promising methods because of their wide applicability. Examples of distribution-free methods include histograms as well as indicator kriging [Journel, 1983].
With regard to Gaussian transformation, non-parametric methods can be used to build the anamorphosis function. This procedure will be divided into three main steps, according to Simon and Bertino [2009]:
1. Find the Gaussian values z corresponding to each of the data points x: construction of the empirical anamorphosis function (discussed in this section)
2. Fit a continuous function to the empirical anamorphosis function: interpolation or regression of the empirical anamorphosis function (see Section 6.2)
3. Define the tails of the continuous anamorphosis function: extrapolation and dealing with clustered data (see Section 6.3)
Let us now focus on the first step. According to Equation 6.2, z can be determined with the help of the cumulative frequency of x. For a sample of the original variable x, the empirical CDF has to be built. The following estimator after Johnson and Wichern [1988] is used:

F_j = (j − 1/2) / N    (6.4)
with F_j being the estimated cumulative frequency for sorted data points with ranks j = 1...N and N being the sample size. With this estimator, each data point of the sample influences the same non-exceedance probability interval length below and above, see Figure 6.1.
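The estimator of Equation 6.4, combined with the inverse Gaussian CDF of Equation 6.2, yields a rank-based empirical anamorphosis. A minimal Python sketch (illustrative, not the thesis's MATLAB tool; the exponential test sample is an assumption):

```python
# Illustrative sketch: empirical Gaussian anamorphosis by rank transformation,
# combining the CDF estimator (6.4) with z = G^{-1}(F), Eq. (6.2).
import numpy as np
from scipy import stats

def empirical_anamorphosis(x):
    """Return z with z[k] being the Gaussian transform of x[k] (same ordering)."""
    N = len(x)
    ranks = stats.rankdata(x, method="ordinal")   # ranks j = 1..N
    F = (ranks - 0.5) / N                         # Eq. (6.4)
    return stats.norm.ppf(F)                      # z = G^{-1}(F), Eq. (6.2)

rng = np.random.default_rng(4)
x = rng.exponential(size=1000)    # positively skewed test sample
z = empirical_anamorphosis(x)
```

Because the mapping is strictly monotone in the ranks, the ordering of the sample is preserved, while the transformed values closely follow a standard-Gaussian distribution.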
With the knowledge of the empirical CDF of our variable x, we can determine the values of the Gaussian transform z by rank transformation [Conover and Iman, 1981]. Figure 6.2 visualizes the procedure for an exemplary case. After having found the Gaussian equivalent of each data point, we can plot the empirical anamorphosis function, see Figure 6.3. Some characteristics of both the untransformed and the transformed variable can be drawn from the histograms in Figure 6.4. While, in the exemplary case, the histogram of the original variable is highly positively skewed, the histogram
Figure 6.1: Building the empirical CDF. Intervals of non-exceedance probability assigned to each sample data point, designated by double arrows, for an exemplary sample size N = 10. (Axes: original variable x vs. non-exceedance probability.)
Figure 6.2: Graphical Gaussian transformation: for any value x, the corresponding Gaussian value z can be found via F(x) = G(z). (Left panel: empirical CDF of the original variable x; right panel: Gaussian CDF of the variable z; example pairs x', z' and x'', z'' are marked.)
Figure 6.3: Empirical anamorphosis function (original variable x vs. Gaussian variable z).
Figure 6.4: Histograms of the untransformed and transformed variable, and normal probability plot for the transformed variable.
of the transformed variable strongly resembles a standard Gaussian distribution, being symmetric about its mean of zero and characteristically bell-shaped. The normal probability plot in Figure 6.4 forms a straight line, which is another indicator that the transformed variable very likely follows a Gaussian distribution.
6.2 Anamorphosis Function Fitting
There are several possible interpolation and regression methods to fit a continuous function to the discrete values of the empirical anamorphosis function. They differ in several aspects: their theoretical background and underlying assumptions, which will be examined in detail in the following paragraphs. Furthermore, they can be characterized by their degree of exactness, robustness and computational effort, which will be investigated in Section 6.4, where representative performance tests assess the quality of the transformation techniques.
Before discussing these techniques, note that they can be applied at two successive stages in the process of defining an anamorphosis function: either to the data points of the empirical cumulative distribution function, or to those of the empirical anamorphosis function. Inverting the cumulative frequency to obtain the corresponding Gaussian value is a non-linear transformation, so the two procedures differ in the resulting anamorphosis curve. It will be shown later that both procedures have their justification, depending on the chosen technique. Let us now consider these interpolation and regression techniques together with their advantages and disadvantages.
6.2.1 Interpolation Techniques
This subsection presents techniques that fall under the concept of interpolation, a special type of curve fitting. Interpolation honors all empirical data points, so the fitted curve passes exactly through all data points of our empirical anamorphosis function. Due to this property, interpolation is clearly recommended if we trust that the sampled data represent reality well, i.e., the data have a very low degree of uncertainty.
Interpolation techniques can be classified by the degree of the polynomial used to interpolate between the data points. Linear interpolation connects two adjacent data points with a linear function and results in a non-differentiable, piecewise linear fit to the empirical data. To obtain a differentiable function, a higher-order polynomial can be fitted to the sample. Usually, only a polynomial of order n - 1 is able to satisfy the constraint of honoring all n data points. This can yield oscillations at the outer data points. Whether such a high-order fit is actually a good fit has to be decided based on
the prior knowledge about the sample's population. A different type of differentiable function can be built by spline interpolation: higher-order polynomials connect pairs of adjacent data points, and the transitions between the piecewise functions are smooth. Spline interpolation can be considered a local method in the sense that changing one data point only affects its close neighbors, not the whole fit. In contrast, a polynomial fit is a global method, as one exchanged data point alters the entire fitted function.
Direct Gaussian Anamorphosis
Linear interpolation can be done between all available data points, either of the empirical CDF or of the empirical anamorphosis function. The latter procedure is known as direct Gaussian anamorphosis [e.g., Chilès and Delfiner, 1999]. Any value x within the range of our untransformed variable can be transformed according to

z = z_i + ((z_{i+1} - z_i)/(x_{i+1} - x_i)) (x - x_i)    (6.5)

for any x_i < x < x_{i+1}, with x_i, x_{i+1} the two points of the empirical anamorphosis function that enclose x. This method guarantees that the information of all data points enters the transformation, which is advantageous if the sample size is representative of the whole underlying unknown distribution. For small sample sizes, outliers could strongly reduce the quality of the transformation, as their importance is overestimated.
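Equation 6.5 can be sketched in Python as follows (the function name and the clamping of out-of-range values to the outermost segment are my own illustrative choices; the tails are properly treated in Section 6.3):

```python
from bisect import bisect_right

def direct_anamorphosis(x, xs, zs):
    """Linear interpolation between the points (x_i, z_i) of the
    empirical anamorphosis function (Eq. 6.5). xs must be sorted in
    increasing order."""
    # locate the interval x_i <= x < x_{i+1}; clamp to the outer segments
    i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
    return zs[i] + (zs[i + 1] - zs[i]) / (xs[i + 1] - xs[i]) * (x - xs[i])
```

The same routine can equally be applied to the empirical CDF points (x_i, F_i) when the CDF rather than the anamorphosis function is interpolated.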
To overcome this problem, Béal et al. [2010] suggest defining a number of equidistant intervals and then linearly interpolating between the edges of the intervals. This procedure smooths the continuous anamorphosis function and attenuates the impact of outliers, while using only a fraction of the available information. A more detailed consideration of the impact of sample size on the performance of the different fitting techniques is presented in Section 6.4.
Coming back to the question of which of the two empirical functions interpolation should be applied to, the support for the assumption of linear behavior between data points should be considered. On the one hand, linearly interpolating the empirical CDF is a commonly used method and seems justified, as there is no information available other than the sample data points to suggest a shape of the empirical cumulative distribution function different from the linear connection of those points. On the other hand, linear interpolation of the empirical anamorphosis function automatically puts an assumption of linearity into the process of anamorphosis that is not supported by any knowledge. Within the scope of this study, interpolation without any justifying support will be applied at the earliest possible stage of the whole transformation procedure, which can be summarized as "rather interpolate the input than the output of an operation". In this way, it is possible to distinguish the effects of interpolation (e.g., by varying the sample size) from
those of the non-linear transformation. Hence, I suggest interpolating the empirical CDF rather than the empirical anamorphosis function.
Cubic Spline Interpolation
It can be highly advantageous for further applications if a continuous and differentiable anamorphosis function is built. As opposed to linear interpolation, spline interpolation can fulfill this demand. Additionally, we can introduce a requirement of monotonicity, because both our empirical functions are monotonically increasing. The method of choice is then monotone piecewise cubic interpolation [Fritsch and Carlson, 1980]. This method preserves monotonicity while respecting all data points; horizontal line segments would also be preserved. The interpolation uses the cubic Hermite basis functions
H_1(x) = 3((x_{i+1} - x)/(x_{i+1} - x_i))² - 2((x_{i+1} - x)/(x_{i+1} - x_i))³,

H_2(x) = 3((x - x_i)/(x_{i+1} - x_i))² - 2((x - x_i)/(x_{i+1} - x_i))³,

H_3(x) = -(x_{i+1} - x_i) [((x_{i+1} - x)/(x_{i+1} - x_i))³ - ((x_{i+1} - x)/(x_{i+1} - x_i))²],

H_4(x) = (x_{i+1} - x_i) [((x - x_i)/(x_{i+1} - x_i))³ - ((x - x_i)/(x_{i+1} - x_i))²]

and constructs a piecewise function F(x) to interpolate each interval x_i < x < x_{i+1}:

F(x) = F_i H_1(x) + F_{i+1} H_2(x) + F'_i H_3(x) + F'_{i+1} H_4(x).    (6.6)
When applied to the empirical CDF, F_i, F_{i+1} denote the Gaussian cumulative frequencies corresponding to x_i, x_{i+1}; F'_i, F'_{i+1} are the derivatives of the function F(x) at the edges of the interval. In order to find an interpolated value F(x), the derivatives in Equation 6.6 have to be determined. An algorithm for this purpose has been developed by Fritsch and Carlson [1980]; the reader is referred to their work for a detailed derivation.
Cubic interpolation of the CDF is preferred to linear interpolation if differentiability is required for further use of the transformation function (a differentiable CDF automatically results in a differentiable anamorphosis function).
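The evaluation of Equation 6.6 on a single interval can be sketched as follows (Python; the derivative selection algorithm of Fritsch and Carlson [1980] is not reproduced here, the derivatives are simply passed in, and the function name is illustrative):

```python
def hermite_piece(x, xi, xi1, Fi, Fi1, dFi, dFi1):
    """Evaluate the piecewise cubic of Eq. 6.6 on one interval
    x_i <= x <= x_{i+1}, given the nodal values F_i, F_{i+1} and the
    derivatives F'_i, F'_{i+1} (chosen by Fritsch and Carlson so that
    monotonicity is preserved)."""
    h = xi1 - xi
    t = (xi1 - x) / h               # relative distance to the right node
    s = (x - xi) / h                # relative distance to the left node
    H1 = 3 * t**2 - 2 * t**3
    H2 = 3 * s**2 - 2 * s**3
    H3 = -h * (t**3 - t**2)
    H4 = h * (s**3 - s**2)
    return Fi * H1 + Fi1 * H2 + dFi * H3 + dFi1 * H4
```

At the interval edges, the basis functions guarantee that the interpolant reproduces F_i and F_{i+1} exactly, with slopes F'_i and F'_{i+1}.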
6.2.2 Regression Techniques
If we do not trust that our sample represents reality in a sufficient manner, which can be the case for small sample sizes, I recommend using a regression technique instead
of interpolation. The concept of regression takes into account that data can have errors or can be misinterpreted due to a lack of neighboring data points. A best-fit function is obtained by minimizing the deviations from the sample points (for example with the least squares method, Gauss [1963]). This way, all data points still have an influence on the shape of the function, but outliers have somewhat less impact than in interpolation. Compared to a polynomial interpolant, regression can be done with a polynomial of much lower degree, which may be preferred, for example, to reduce the risk of oscillations. Finally, it shall be mentioned that the functions fitted by the regression methods presented in the following paragraphs are differentiable.
Cubic Spline Regression
Piecewise regression is a combination of spline interpolation and regression. As opposed to interpolation, only a certain number of data points are honored by this technique; but in contrast to ordinary regression, these few data points have to be met exactly. The selected data points of the sample that shall be respected are called knots. Between the knots, least squares regression is used to fit a polynomial of the desired order; transitions at the knots are smooth, as expected from spline techniques. In this study, cubic spline regression will be used as one of several regression techniques. The reader is referred to Poirier [1973] for a derivation.
Expansion in Hermite Polynomials
When fitting a function to sparse data, one is usually groping in the dark. To shed some light on the fitting process, all available information on the shape of the function should be considered. Within the scope of this study, the function to be estimated, the anamorphosis function, relates a Gaussian equivalent to any type of input variable. This fact should be reflected in the fitting process in order to use the information we have. This can be done by an expansion in Hermite polynomials, which are related to the Gaussian probability density. Their definition is given by Rodrigues's formula

H_n(z) = (1 / (√(n!) g(z))) d^n g(z)/dz^n    (6.7)

with H_n the polynomial of order n, the standard normal variable z and its Gaussian pdf g(z). The polynomial of order n + 1 is related to the one of order n, so the polynomials can be built according to a recurrence relation:

H_{n+1}(z) = -(1/√(n+1)) z H_n(z) - √(n/(n+1)) H_{n-1}(z),   n > 0.    (6.8)
The reader is referred to Rivoirard [1994] for derivations and details on various applications of the expansion in Hermite polynomials.
Because of their relation to the normal distribution, Hermite polynomials have the property of orthogonality: the inner product (which corresponds here to the covariance) of two polynomials of different order equals zero. Because of this convenience and the possibility of establishing a direct analytical relationship between any random variable x and its Gaussian equivalent z, the expansion in Hermite polynomials has been widely used in geostatistics, for example in disjunctive kriging [Rivoirard, 1994].
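The recurrence of Equation 6.8 can be sketched in Python (the starting values H_0 = 1 and H_1(z) = -z follow from Rodrigues's formula, Equation 6.7; the function name is illustrative):

```python
import math

def normalized_hermite(z, p):
    """Values H_0(z), ..., H_p(z) of the normalized Hermite
    polynomials, built with the recurrence of Eq. 6.8 starting from
    H_0 = 1 and H_1(z) = -z."""
    H = [1.0, -z]
    for n in range(1, p):
        H.append(-z * H[n] / math.sqrt(n + 1)
                 - math.sqrt(n / (n + 1)) * H[n - 1])
    return H[:p + 1]
```

For example, one recurrence step yields H_2(z) = (z² - 1)/√2, which agrees with the second derivative of the Gaussian pdf inserted into Rodrigues's formula.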
To fit the empirical anamorphosis function with Hermite polynomials, we take advantage of the fact that any function f(z) of a standard normal random variable can be expanded in Hermite polynomials [e.g., Rivoirard, 1994] following the formula

f(z) = Σ_{n=0}^∞ f_n H_n(z).    (6.9)

In our case, the original variable x of unknown distribution can be written as x = f(z). Once we have expanded our variable in Hermite polynomials, we can invert the equation and solve for z = f^-1(x) to find the Gaussian transform of any value of the sample. Because of the direct relation that can be established between x and z, the expansion in Hermite polynomials as a curve-fitting technique only makes sense for the empirical anamorphosis function, not for the empirical CDF. The coefficients of the expansion, f_n, can be calculated with the following expression:

f_n = E[f(z) H_n(z)] = ∫ f(z) H_n(z) g(z) dz.    (6.10)

As the integral can only be solved numerically, and we do not know the continuous function x = f(z) but only discrete sample points i = 1...N (with N the sample size or the number of interval edges), the integral is approximated by the sum

f_n = Σ_{i=1}^N ∫_{z_i}^{z_{i+1}} x_i H_n(z) g(z) dz    (6.11)
and can be simplified to
f_n = Σ_{i=2}^N (x_{i-1} - x_i) (1/√n) H_{n-1}(z_i) g(z_i),    (6.12)

with g(z_0) = 0 and g(z_{N+1}) = 0 at the lower and upper bound of the Gaussian pdf [Rivoirard, 1994]. The coefficients are evaluated using the data points of the empirical anamorphosis function; thus, the quality of the fit depends directly on the sample size. Another constraint on the goodness of fit is the considered order of the polynomials: it has to be decided when to truncate the expansion. As a measure of quality, the properties of the Hermite polynomials can be used [Wackernagel, 2003, Ortiz et al., 2005]:
Var[f(z)] = Σ_{n=1}^∞ (f_n)² ≈ Σ_{n=1}^p (f_n)²,    (6.13)
thus the statistics of the coefficients determine the order p up to which the Hermite polynomials should be expanded. Equation 6.13 should be satisfied to a certain degree of exactness, that is, to a certain number of significant digits.
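A Python sketch of the coefficient computation by Equation 6.12 follows (function names are illustrative; prepending the mean of x as f_0 is an assumption based on the standard Hermite expansion, where f_0 = E[f(z)]):

```python
import math
from statistics import NormalDist

def hermite_values(z, p):
    # normalized Hermite polynomials H_0(z)..H_p(z) via the
    # recurrence of Eq. 6.8
    H = [1.0, -z]
    for n in range(1, p):
        H.append(-z * H[n] / math.sqrt(n + 1)
                 - math.sqrt(n / (n + 1)) * H[n - 1])
    return H[:p + 1]

def expansion_coefficients(xs, zs, p):
    """Coefficients f_1..f_p of the Hermite expansion from the points
    (x_i, z_i) of the empirical anamorphosis function, using the
    discrete sum of Eq. 6.12; f_0 (the mean of x) is prepended."""
    g = NormalDist().pdf
    f = [sum(xs) / len(xs)]
    for n in range(1, p + 1):
        f.append(sum((xs[i - 1] - xs[i])
                     * hermite_values(zs[i], n)[n - 1] * g(zs[i])
                     for i in range(1, len(xs))) / math.sqrt(n))
    return f
```

The truncation order p can then be checked against the variance criterion of Equation 6.13: the cumulative sum of the squared coefficients f_n², n >= 1, should approach Var[f(z)].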
Despite their advantageous theoretical properties, Hermite polynomials also bear problems that can occur when fitting data that cluster close to or at the sample bounds. Oscillations cannot be prevented, as polynomials of high order are not able to produce tangential behavior. It has to be decided for the individual case whether the expansion in Hermite polynomials still proves useful.
Methods to Smooth the Empirical CDF
In cases where the expansion in Hermite polynomials is not convenient or insufficiently accurate, alternative regression methods can be drawn on. Here, the idea of fitting the empirical CDF rather than the empirical anamorphosis function (as discussed previously in Section 6.2.1) shall be pursued. A supposedly simple approach consists in fitting a polynomial to the empirical CDF. Depending on the sample, finding a satisfying fit can be more or less tedious. For a well-founded fit, care should also be taken of the slopes at the end points, such that the asymptotes of the function equal zero and one, respectively. Another constraint has to be satisfied, namely monotonicity. These requirements make it hard to find a polynomial fit of low order and acceptable quality, i.e., with reasonably low deviations from the sample data points.
If we assume that our empirical CDF deviates significantly from the distribution of a "perfect" sample due to small sample size, we might want to redistribute the assigned cumulative frequencies a little. The steps of the CDF curve can be smoothed at each data point by applying a kernel smoother [Bowman and Azzalini, 1997]. Kernel smoothing can be interpreted as a smooth version of a histogram: instead of assigning cumulative frequencies to bars, a kernel smoothing density estimate assigns a polynomial to an interval of a specified bandwidth. The resulting CDF curve is a superposition of the smoothed interval densities and is therefore differentiable.
There are many other possibilities for building a continuous CDF that might reflect reality better than the empirical CDF of a small sample. These shall not be deepened within this study; instead, the methods exposed so far are summarized in Table 6.1 and further examined and assessed with regard to their exactness and computational effort in Section 6.4.
6.3 Definition of Tails
The third and last step towards a continuous anamorphosis function consists in the definition of the tails, to the left of the lower sample bound and to the right of the upper
Identifier                    | Classification           | Fitting Technique                 | Fitting Applied to
------------------------------|--------------------------|-----------------------------------|-------------------
Log transformation            | Direct, parametric       | Parameter estimation by MLM       | Original data
Distribution function fitting | Indirect, parametric     | Parameter estimation by MLM       | Original data
ANA - pointwise               | Indirect, non-parametric | Linear pointwise interpolation    | Empirical ANA
CDF - pointwise               | Indirect, non-parametric | Linear pointwise interpolation    | Empirical CDF
ANA - intervals               | Indirect, non-parametric | Linear interpolation of intervals | Empirical ANA
CDF - intervals               | Indirect, non-parametric | Linear interpolation of intervals | Empirical CDF
ANA - Hermite polynomials     | Indirect, non-parametric | Expansion in Hermite polynomials  | Empirical ANA
CDF - spline regression       | Indirect, non-parametric | Cubic spline regression           | Empirical CDF
CDF - kernel smoothing        | Indirect, non-parametric | Kernel smoothing                  | Empirical CDF
CDF - polynomial fit          | Indirect, non-parametric | Cubic spline interpolation        | Empirical CDF

Table 6.1: Overview of suggested transformation methods. CDF stands for the cumulative distribution function of the original data; ANA represents the anamorphosis function that links the original data with the Gaussian transformed data.
sample bound. Several assumptions have to be made at this point. First of all, we need to decide whether the smallest value of the sample should be treated as the smallest possible value, or if there is reason to believe that even smaller values could occur. If the latter is true, then we have to extend our anamorphosis function to the smallest possibly occurring value. Now the next question arises: which Gaussian value should correspond to the minimum value of the original variable? In Gaussian theory, minus infinity is the smallest "value", so turning our variable x into a perfect Gaussian random variable z would require taking z_min = -∞ into account. For numerical evaluation, this is not an option; therefore, we have to find an appropriate value for z_min that can be numerically processed. Analogous considerations apply to the highest value of the sample, with a theoretical z_max equal to plus infinity.
If it is known that the lowest and highest possible values of x are part of the sample, there will be no need to extrapolate the empirical anamorphosis function. In this case, the challenge consists in the transformation of clustered data at the bounds, because if the actual bounds of the population are represented in the sample, those values will most likely occur multiple times. (If the lowest and highest possible values of x each occur only once, the sample is treated exactly as before, and no special consideration is necessary.)
6.3.1 Handling of Clustered Data
In this subsection, I will introduce a way of dealing with clustered data at the bounds and demonstrate my suggestion with the help of an example. Imagine a sample that consists of 100 values between zero and one; 97% of the data are distinct, and 3% are assumed to be equal to the minimum of the population, zero. When building the empirical CDF, it has to be decided which cumulative frequency should be assigned to the lower bound. As 3% of the sample take the value zero, we know that the cumulative frequency of zero (determined according to Equation 6.4) lies somewhere between F_1(0) = (1 - 1/2)/100 = 0.005 (which accounts for the fact that zero is the first value of the sorted sample) and F_3(0) = (3 - 1/2)/100 = 0.025 (which corresponds to the value with rank j = 3); this leads to a jump in the empirical CDF. One possibility to avoid this difficulty would be to extend the infinitesimally small interval around the bound to an interval of numerically processable small length and to interpolate linearly between the two cumulative frequencies. This method is not advisable, because it produces different transformed values for the same original value. It will be explained later (see Section 6.5.2) why this property is not acceptable for use in Ensemble Kalman Filter applications. Searching for an alternative way to deal with clustered data, an easy and stable method is to define cases: if x = 0, a discrete cumulative frequency is used, and for any value x > 0 there is a continuous CDF (it is assumed that, except for clustering due to boundary effects, sample values drawn randomly from a continuous distribution are distinct). For the cumulative frequency at x = 0, I suggest taking the mean CDF value of the clustered values: F(0) = (F_3 + F_1)/2 = 0.015. The lower bound of the transformed variable can then be derived according to z_min = G^-1(F(0)) ≈ -2.17. If we took F(0) = F_3, the tail of the Gaussian variable would be cut off at an unnecessarily close value. Figure 6.5 illustrates this method of choice for handling clustered data at a bound.
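The suggested treatment can be sketched in Python (illustrative function name; averaging the empirical frequencies of all clustered ranks, which for consecutive ranks reproduces F(0) = (F_1 + F_3)/2 of the example above):

```python
from statistics import NormalDist

def bound_frequency(sample, bound=0.0):
    """Discrete cumulative frequency assigned to a clustered bound
    value: the mean of the empirical frequencies F_j = (j - 1/2) / N
    (Eq. 6.4) over all sample members equal to the bound."""
    xs = sorted(sample)
    n = len(xs)
    F_cluster = [(j - 0.5) / n
                 for j, x in enumerate(xs, start=1) if x == bound]
    F = sum(F_cluster) / len(F_cluster)
    return F, NormalDist().inv_cdf(F)
```

For the example of the text (100 values, three of them equal to zero), this yields F(0) = 0.015 and z_min ≈ -2.17.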
6.3.2 Extrapolation towards Population Bounds
Extrapolation of the empirical anamorphosis function is required if the minimum and maximum values of the original variable are not part of the sample, which is generally assumed if no information to the contrary is available. To be able to determine the Gaussian transform of any value x, the anamorphosis function needs to be extrapolated past the sample bounds. I suggest defining the empirical anamorphosis function in the same way as described before, because the information about the sample itself has not changed. Therefore, the minimum and maximum values of the population will be treated separately from the sample. Let us denote the lower and upper sample bounds by x_1 and x_n, respectively. The theoretical minimum and maximum values of the random variable x are named x_min and x_max, to be consistent with the notation used above. Now we can determine z_1, z_n by finding the Gaussian values corresponding to the cumulative frequencies of the lowest and highest sample values, as explained in the paragraph above. In addition, we have to define z_min and z_max within the interval (-∞; +∞). A literature review has not revealed a commonly adopted method, but rather showed arbitrary choices of values for the Gaussian bounds [Simon and Bertino, 2009, Chilès and Delfiner, 1999]. To proceed in a more systematic and satisfying manner, I will define a meaningful condition that shall be satisfied by the Gaussian bounds.
This condition, called the sample range condition from now on, is set up to determine the proportion to which the sample covers the actual data range and to reflect that proportion in the Gaussian transformed variable. In other words, if the bounds of the sample lie close to the real bounds, the transforms of both the sample bounds and the real bounds should lie close together as well. If the opposite is the case and the range of the sample is very small in comparison to the actual range the data could have, then the transforms should show a large difference between z_min and z_1, or z_n and z_max. To put this idea into a mathematical equation, I have formulated the sample range condition:
(x_1 - x_min)/(x_n - x_1) = (z_1 - z_min)/(z_n - z_1);    (6.14)

(x_max - x_n)/(x_n - x_1) = (z_max - z_n)/(z_n - z_1).    (6.15)
Figure 6.5: Dealing with clustered data at the lower bound of the fictitious sample: (a) definition of the CDF at the lower bound, with F(0) between F_1 and F_3; (b) interpolation of the empirical anamorphosis function at the lower bound, z = G^-1(F(0)). The ellipse highlights the discontinuity at the clustered data point.
Solving the sample range condition, unique values for z_min and z_max can be found:

z_min = z_1 - ((x_1 - x_min)/(x_n - x_1)) (z_n - z_1);    (6.16)

z_max = z_n + ((x_max - x_n)/(x_n - x_1)) (z_n - z_1).    (6.17)
This is illustrated in Figure 6.6. Please note that the slope of the extrapolation equals the average slope of the anamorphosis segment (in the sense that the straight line is characterized by the starting and ending points (x_1, z_1) and (x_n, z_n), respectively), which can be interpreted as simple linear scaling in the absence of any information about the correct transformation of the tails and is consistent with the principle of parsimony.
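Equations 6.16 and 6.17 translate directly into code (Python sketch, illustrative name):

```python
def gaussian_population_bounds(x1, xn, z1, zn, x_min, x_max):
    """Transforms of the population bounds from the sample range
    condition: Eq. 6.16 for z_min and Eq. 6.17 for z_max. The tails
    extend the anamorphosis function with the average slope
    (z_n - z_1) / (x_n - x_1) of the sampled segment."""
    slope = (zn - z1) / (xn - x1)
    z_min = z1 - (x1 - x_min) * slope
    z_max = zn + (x_max - xn) * slope
    return z_min, z_max
```

For example, a sample covering [0.25, 0.75] with z_1 = -2 and z_n = 2 on a population bounded by [0, 1] yields z_min = -4 and z_max = 4.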
Figure 6.6: Defining minimum and maximum values for the Gaussian transform. The empirical anamorphosis function is extended linearly from the sample bounds (x_1, z_1) and (x_n, z_n) to the population bounds.
Once minimum and maximum Gaussian values have been defined, the transformed variable will show properties similar to those of a standard-normally distributed variable, but not exactly the same, because it is cut off at z_min and z_max (or z_1 and z_n) instead of covering the interval (-∞; +∞). The effects of the cut-off can be seen in the properties of the transformed variable presented in Section 6.1.2. The statistics of the untransformed variable x, the transformed variable z and the theoretical Gaussian moments are listed in Table 6.2. While the odd moments, mean and skewness, of the transformed variable match the Gaussian statistics, the even moments, variance and kurtosis, deviate slightly from the properties of a standard Gaussian distribution. This can be attributed to
Variable                 | Mean  | Variance | Skewness | Kurtosis
-------------------------|-------|----------|----------|---------
Untransformed variable x | 0.452 | 0.119    | 0.139    | 1.543
Transformed variable z   | 0.000 | 0.997    | 0.000    | 2.834
Gaussian random variable | 0     | 1        | 0        | 3

Table 6.2: Statistics of the untransformed and transformed variable in comparison with theoretical values for a Gaussian variable
the missing tails towards infinity: for the odd moments, symmetry is the crucial factor, so truncation of the tails does not have a major influence, and the moments will be close to the Gaussian statistics as long as the remaining distribution is symmetric. For the even moments, however, the sum of the deviations from the mean grows asymptotically as z goes towards plus and minus infinity. Truncation long before infinity therefore yields smaller values for the even moments.
The definition of the tails directly determines the degree of asymmetry of the transformed variable. This is especially dramatic if one bound of the sample represents the actual population bound and the other one does not even come close to the actual limit; in this case, extrapolation on only one side of the anamorphosis function will make the transformed variable fairly asymmetric and thus violate one of the main assumptions for Gaussian variables. Values of both the odd and the even moments will then deviate from the theoretical ones. At this point, it could be worthwhile to think about defining the tails in such a way that symmetry is still ensured for the transformed variable. Due to time constraints, I will focus on the sample range condition and implement only this method in the ensemble Kalman filter.
6.3.3 Fitting the Tails
We have now determined the points of the empirical anamorphosis function within the range of the sample, at the bounds of the sample and at the theoretical bounds of the original variable's population. These data points have to be fitted to create a continuous anamorphosis function that covers the whole range of possible values to be transformed. The considerations on appropriate interpolation or regression techniques presented previously are valid at this point, too, although they now lead to a different strategy. The difference is that we do not know any details on how to transform data located outside the sample bounds. The rank transformation is based on assigning a rank to each of the sample points and cannot be extrapolated to values that are not part of the sample. We can choose any kind of fitting, as we are not able to judge which one represents the truth better; we simply do not know what the truth looks like. If no elaborate fitting technique improves on linear interpolation, we will pick the latter for simplicity.
Once again, it has to be decided whether to interpolate linearly between the points of the empirical CDF or those of the empirical anamorphosis function. To include the information about the minimum and maximum Gaussian transforms obtained from the sample range condition in the empirical CDF, we have to evaluate their cumulative frequencies and assign those to the minimum and maximum values of the population of the original variable. These two data points are added to the empirical CDF and linearly connected to the points of the sample minimum and maximum, respectively. By adding data to an already built empirical CDF, we bias the transformation of the added data points; we still do not have any clue how to transform these values correctly. Both stages, the empirical CDF and the empirical anamorphosis function, are equally uncertain. Thus, the reasons given above to prefer interpolating the empirical CDF do not apply here. Performance tests have been carried out and revealed difficulties when linearly interpolating the empirical CDF: the cumulative frequency can take values very close to zero, which numerically leads to bounds of the transformed variable equal to minus or plus infinity. For these practical reasons, the linear interpolation of the tails will be done within the environment of the anamorphosis function, as opposed to the approach chosen for the interpolation of the anamorphosis function itself (Section 6.2.1).
6.4 Exactness of Transformations
The methods described in Section 6.2 differ in their theoretical foundation as well as in their ease of implementation. In order to choose the methods that are appropriate for a specific problem and result in the most accurate and robust transformation, it is necessary to examine the behavior of the different suggested methods depending on the sample size and the type of data. This investigation is carried out in this section.
6.4.1 Performance Test Procedure
To be able to assess the quality of the transformation, a known transformation shall be performed: we draw a sample from a beta-distributed random variable and transform this sample into an almost Gaussian sample. The beta distribution family is chosen because it accounts for physical bounds of data values and because of its relevance for groundwater flow and transport variables (see Section 6.5.1). The correct transformation would be

z = G^-1(F_Beta(x)),    (6.18)

with F_Beta being the beta distribution function as defined in Section 4.1.1. The transformation would be perfect if we took an infinite number of sample values into account; as we have a limited sample size, our transformation will deviate from the perfect one. The magnitude of deviation will be taken as an indicator of the accuracy of the different methods.
As most theoretical distribution functions naturally do not consider clustering, I willfocus on the basic data transformation within the range of the sample in this section.The justification of linear extrapolation towards population bounds will not be verifiedeither because this would require a different methodology of testing and exceed the timeframe given to this section.
The performance tests will be carried out as follows: four different distributions will be considered, namely a quite well-behaved random variable that could correspond to a head distribution, a uniformly distributed random variable, a positively skewed random variable, and finally a random variable with a bimodal distribution that could approximate the distribution of concentration data as obtained when implementing simple source geometries like the one described in Chapter 7. All four distribution types belong to the family of beta distributions and are plotted in Figure 6.7. The corresponding perfect anamorphosis functions (according to Equation 6.18) are displayed in Figure 6.8; obviously, the shape of the anamorphosis function depends on the parameters of the beta distribution the random sample is taken from.
Note that the beta distribution is chosen because it allows boundaries of state variable values to be included; for more complex concentration source geometries or other state variables, other distribution functions might be more appropriate and could be assessed by tests similar to those presented in this section.
Besides investigating the performance of the different direct and indirect transformation techniques depending on the type of data, the sample size will be varied to quantify the impact of small ensembles on the quality of transformation. To make results comparable, a standardized error will be calculated, namely the root mean square error (RMSE):
RMSE = √[ 1/(nruns · nplot) · Σ_{i=1}^{nruns} Σ_{j=1}^{nplot} (z_ij − ẑ_j)² ]   (6.19)
with transformed values z_ij, the perfect transform ẑ_j, nplot being the number of test values transformed during one run (equal to 1000 for all performance tests), and nruns being the number of runs (each with a newly drawn random sample). The number of runs was set to 200, which seemed to produce reasonably representative results.
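The test loop can be sketched as follows (a minimal re-creation, assuming SciPy; the plotting position p_k = k/(N+1) is an illustrative choice, not necessarily the thesis's exact convention from Section 6.2.1, and nruns is reduced to keep the sketch fast):

```python
# RMSE test of Equation 6.19: compare the empirical transform (linear
# interpolation of the empirical CDF) against the perfect one (Equation 6.18).
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(42)
a, b = 2.0, 2.0
N, n_runs, n_plot = 200, 50, 1000

sq_err = []
for _ in range(n_runs):
    sample = np.sort(rng.beta(a, b, size=N))
    p = np.arange(1, N + 1) / (N + 1)                # cumulative frequencies
    x_test = rng.beta(a, b, size=n_plot)
    x_test = np.clip(x_test, sample[0], sample[-1])  # stay inside sample range
    z_emp = norm.ppf(np.interp(x_test, sample, p))   # empirical transform
    z_perf = norm.ppf(beta.cdf(x_test, a, b))        # perfect transform
    sq_err.append((z_emp - z_perf) ** 2)

rmse = np.sqrt(np.mean(sq_err))                      # Equation 6.19
```

Clipping the test values to the sample range mirrors the restriction, stated above, to the basic transformation within the sample range.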
Figure 6.9 shows the evolution of the deviations from the perfect transformation depending on the sample size; the plots are obtained from linear interpolation of the empirical CDF. It can be observed that with an increasing sample size, the transformation becomes more exact and the plots of all 200 transformations gather more closely around the perfect anamorphosis line.
[Figure 6.7: Beta probability density functions with varying parameters, representing different data types: beta(2,2), beta(0.5,0.7), beta(1,1), beta(0.5,1). Parameters a, b are given in parentheses]
[Figure 6.8: Perfect anamorphosis functions (transformed variable values z over original variable values x), depending on the parameters of the beta distribution function that the sample is taken from: beta(1,1), beta(2,2), beta(0.5,1), beta(0.5,0.7)]
[Figure 6.9: Anamorphosis function plots for varying ensemble size (N = 50, 200, 1000): deviations from the perfect transformation, depending on sample size. Note that the lowest and highest value of the original variable depend on the randomly drawn sample]
6.4.2 Performance of Interpolation Techniques
The performance of the different transformation methods shall now be assessed. First of all, I will compare the linear interpolation techniques among one another. Linear interpolation was applied to all data points of the empirical CDF (labeled 'CDF - pointwise') or to intervals of it (labeled 'CDF - intervals'), and the obtained piecewise linear functions were transformed according to Equation 6.2. These two resulting transformations shall now be contrasted with linear interpolation of all data points of the empirical anamorphosis function (labeled 'ANA - pointwise') or intervals of it (labeled 'ANA - intervals'). Figure 6.10 shows the results for those four techniques, applied to samples from two of the four beta distributions mentioned above (with parameters (1,1) and (0.5,0.7)). Obviously, all methods improve in exactness with increasing sample size, and the rate of improvement is also similar. Thus, our first finding is that a lot of improvement towards the perfect transformation can be achieved by using a larger sample (for example, increasing the sample size by a factor of 2 - 2.5), especially in the region between 50 and 500 ensemble members. The small additional improvement from 500 to 1000 ensemble members is disproportionate to the large additional amount of data to be processed and is therefore not recommended.
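The two pointwise variants compared above can be sketched side by side (an illustrative sketch; the plotting position is again the hypothetical k/(N+1) choice): either interpolate the empirical CDF and then apply G⁻¹, or interpolate the empirical anamorphosis points (x_k, z_k) directly.

```python
# 'CDF - pointwise' vs. 'ANA - pointwise': both pass exactly through the
# empirical anamorphosis points and differ only between them.
import numpy as np
from scipy.stats import norm

def transform_cdf_pointwise(x, sample):
    s = np.sort(sample)
    p = np.arange(1, len(s) + 1) / (len(s) + 1)
    return norm.ppf(np.interp(x, s, p))   # interpolate in CDF space, then G^-1

def transform_ana_pointwise(x, sample):
    s = np.sort(sample)
    p = np.arange(1, len(s) + 1) / (len(s) + 1)
    z = norm.ppf(p)                       # empirical anamorphosis points
    return np.interp(x, s, z)             # interpolate directly in (x, z) space

sample = np.random.default_rng(0).beta(2, 2, size=100)
x0 = np.sort(sample)[10]                  # at a sample point both must agree
```

At the sample points both variants return the same Gaussian score; the near-identical RMSE of the two methods reported above reflects the small difference between interpolating before or after applying G⁻¹.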
Looking at the four different linear interpolation methods, it can be stated that pointwise interpolation of both the empirical CDF and the empirical anamorphosis function, as well as linear interpolation of CDF intervals, perform almost equally well, independently of the chosen type of beta distribution. Linear interpolation of empirical anamorphosis intervals, however, scores worst and shows a significantly higher RMSE for the more
[Figure 6.10: RMSE of linear interpolation over ensemble size (50 to 1000 members) for the four techniques (CDF - pointwise, CDF - intervals, ANA - pointwise, ANA - intervals); (a) sample drawn from beta(1,1) distribution, (b) sample drawn from beta(0.5,0.7) distribution]
complicated beta distribution with parameters 0.5 and 0.7, supporting the decision in favor of interpolating the empirical CDF rather than the empirical anamorphosis function (see Section 6.2.1).
The resulting transformations for the beta(2,2) distribution behave very similarly to the ones for beta(1,1), and likewise the transformations for beta(0.5,1) and beta(0.5,0.7) behave very much alike; that is why their performance plots are not shown here.
It can be concluded that linear interpolation of empirical anamorphosis function intervals does not perform well for any type of data dealt with in this study and therefore should not be implemented in the ensemble Kalman filter. Béal et al. [2010] have used this method, but they did not assess the accuracy of their transformation. The authors of that paper suggest using a number of intervals much smaller than the number of data points, and equidistant intervals, if possible. Based on my investigations, I cannot support these ideas. Firstly, it has been shown that the linear interpolation of all data points of the empirical anamorphosis function scores dramatically better than the interpolation of intervals. Secondly, I found that adjusting interval lengths depending on the type of distribution of the original variable can yield better results than equidistant intervals (results are not shown here).
This is true for the interpolation of CDF intervals as well, but has not been examined in depth, because I believe that the definition of the intervals should be made individually for a specific data set. It seems logical that for data of a more complex shape, like samples from the beta(0.5,0.7) distribution, one should choose flexible interval lengths, for example with a constant number of data points in each interval. In this way, one can capture any peculiar shape of the CDF or of the anamorphosis function. Please note that, for simplicity, I only used equidistant intervals for the calculations that underlie the performance plots shown in this section. Nevertheless, the number of intervals was increased for larger ensemble sizes.
It was also tested whether cubic spline interpolation of the empirical CDF would lead to an improvement compared to linear interpolation of all points, which was not the case: the deviations from the perfect anamorphosis function were almost the same. Apart from the fact that spline interpolation produces a differentiable anamorphosis function, there is no benefit in applying this interpolation technique.
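A differentiable, monotonicity-preserving interpolation of the empirical CDF can be sketched with SciPy's PCHIP interpolator (an assumption for illustration; the thesis's own spline implementation may differ):

```python
# Monotone cubic interpolation of the empirical CDF, followed by G^-1.
# The interpolant is differentiable but, as noted above, brings little
# accuracy gain over linear interpolation.
import numpy as np
from scipy.interpolate import PchipInterpolator
from scipy.stats import norm

sample = np.sort(np.random.default_rng(1).beta(2, 2, size=100))
p = np.arange(1, 101) / 101.0
cdf_smooth = PchipInterpolator(sample, p)   # monotone, differentiable CDF

x = np.linspace(sample[0], sample[-1], 11)
z = norm.ppf(cdf_smooth(x))                 # transformed values
```

Monotonicity of the interpolant matters here: a non-monotone cubic spline could produce cumulative frequencies outside (0, 1) or a non-invertible transform.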
To summarize the first part of my performance assessment, I would like to point out that pointwise interpolation of the empirical CDF and of the empirical anamorphosis function perform practically equally well, and - given the theoretical considerations presented in Section 6.2.1 - linear interpolation of all data points of the empirical CDF will be preferred. Linear interpolation of CDF intervals also seems to be a good choice for any type of data. Some experimentation showed that flexible handling of interval lengths can improve performance and can outperform pointwise interpolation for almost all data types and sample sizes. In stark contrast, linear interpolation of anamorphosis function intervals is strongly discouraged, based on both the theoretical concerns and the disappointing performance.
6.4.3 Performance of Expansion in Hermite Polynomials
For further examination, I will compare different regression methods with pointwise linear interpolation of the empirical CDF as a reference. The performance plot in Figure 6.11 displays the root mean square error of the reference method as well as of a method that ensures a differentiable anamorphosis function: expansion in Hermite polynomials of order 13 (labeled 'ANA - Hermite polynomials'). While scoring badly for a sample drawn from a beta(0.5,0.7) distribution, expansion in Hermite polynomials gives a slightly better estimate of the exact transformation for small samples taken from a beta(2,2) distribution.
Yet, we have to consider that higher order polynomials tend to oscillate. This is a major problem, because we need to invert the expansion x = f(z) in order to obtain the transformed variable z corresponding to x. The inverse only exists where the expansion is unambiguous; thus, we can only use that part of the expansion that does not oscillate. Figure 6.12 gives an example of the expansion for a sample of size N = 100, drawn from a beta(2,2) distribution. Please note that now the transformed variable z is plotted as abscissa. The plots of the expansion for all of the 200 runs show that, depending on the random sample, the range of values of the original variable that can be transformed is significantly reduced: only the range in which the expansion is monotonic can be used. The attentive reader may have noticed that the RMSE of the reference case in Figure 6.11b is slightly lower than the one plotted in Figure 6.10b, although it is based on the same sample distribution. This can be explained by missing values that could not be transformed due to oscillations of the Hermite polynomials - those values were not transformed linearly either, in order to keep the results comparable.
The proportion of untransformed values is quantified in Figure 6.13a. This graph compares the percentage of sample values that could not be transformed because the expansion in Hermite polynomials could not be inverted for those values. It becomes clear that the fraction of untransformed values increases with decreasing sample size. This fact undoes the positive performance result mentioned above, which only applied to small sample sizes. Also, the percentage of untransformed values is by far higher for distributions with a more complex shape, like the beta(0.5,1) or beta(0.5,0.7) distributions, represented by the pink and green bars.
To prevent oscillations, the expansion could be truncated at a lower polynomial order, but this would reduce the accuracy of the regression. Equation 6.13 determines which value the sum of the squared coefficients f_n would take on for an infinite expansion; the deviation from that value is a measure for the loss of accuracy due to truncation. The maxima of the absolute deviations are plotted in Figure 6.13b. For small sample sizes and expansion in polynomials up to order 13, the variance of the original variable can be reproduced up to the third digit. It has to be decided individually for specific cases whether polynomials of lower order can still be considered a reasonably good fit.
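The truncation check can be sketched numerically (an illustrative sketch of the idea behind Equation 6.13, not the thesis's exact implementation): estimate the Hermite coefficients f_n of the anamorphosis x = f(z) from the sorted sample paired with Gaussian quantiles, then compare the partial sum of n!·f_n² for n ≥ 1 against the sample variance.

```python
# Empirical Hermite coefficients of the anamorphosis and the variance-
# reproduction check; He_n are the probabilists' Hermite polynomials, with
# f_n = E[x He_n(Z)] / n!  and  Var(x) = sum_{n>=1} n! f_n^2.
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from scipy.stats import norm

rng = np.random.default_rng(3)
x = np.sort(rng.beta(2, 2, size=1000))
z = norm.ppf(np.arange(1, 1001) / 1001.0)    # paired Gaussian scores

p_order = 13
fact = 1.0
coeffs, var_sum = [], 0.0
for n in range(p_order + 1):
    if n > 0:
        fact *= n                             # running n!
    basis = hermeval(z, [0.0] * n + [1.0])    # He_n(z)
    f_n = np.mean(x * basis) / fact
    coeffs.append(f_n)
    if n > 0:
        var_sum += fact * f_n ** 2

deviation = abs(np.var(x) - var_sum)          # truncation error measure
```

For the smooth beta(2,2) anamorphosis the coefficients decay quickly, so the order-13 partial sum reproduces the variance closely, consistent with the "third digit" observation above.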
[Figure 6.11: RMSE of expansion in Hermite polynomials over ensemble size (50 to 1000 members), compared to the pointwise CDF reference ('CDF - pointwise' vs. 'ANA - Hermite polynomials'); (a) sample drawn from beta(2,2) distribution, (b) sample drawn from beta(0.5,0.7) distribution]
[Figure 6.12: Oscillations of Hermite polynomials (p = 13, N = 100); original variable x plotted over transformed variable z]
It has been shown that expansion in Hermite polynomials only improves the transformation towards the perfect one for less complex distribution shapes and small samples; at the same time, for small samples, the expansion causes difficulties, as a high fraction of the sample cannot be transformed due to oscillations and the variance of the sample cannot be satisfyingly reproduced. Additionally, it has to be remarked that the inversion of the expansion in Hermite polynomials is computationally very costly in comparison to all of the other fitting techniques. Consequently, I dismiss expansion in Hermite polynomials as a regression technique applicable to the types of data that are of interest within this study.
6.4.4 Performance of Other Regression Techniques
The performance of the CDF smoothing techniques presented in Section 6.2.2 shall now be examined. Figure 6.14 visualizes the RMSE of the reference case, a cubic spline regression fit to the CDF, a kernel smoothing CDF estimate, and a third-order polynomial fit to the CDF. Monotone cubic spline regression was implemented with the help of a code for "Shape Prescriptive Modeling" provided by John D'Errico, 2008, on the MATLAB CENTRAL File Exchange platform. As constraints, a monotonic increase was prescribed, and the fit was required to respect the lowest and highest values of the sample.
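A kernel smoothing CDF estimate of the kind tested here can be sketched as the average of Gaussian kernel CDFs centered at the sample points (the bandwidth below is a simple rule-of-thumb assumption, not necessarily the thesis's choice):

```python
# Kernel-smoothed CDF estimate: F(x) = mean_i Phi((x - x_i) / h).
# Note that this estimate leaks probability mass beyond the physical bounds
# [0, 1], which mirrors its poor behavior at the distribution bounds noted
# in the text.
import numpy as np
from scipy.stats import norm

def kernel_cdf(x, sample, h=None):
    sample = np.asarray(sample)
    if h is None:
        h = 1.06 * sample.std() * len(sample) ** (-0.2)  # Silverman-type rule
    return norm.cdf((np.asarray(x)[:, None] - sample) / h).mean(axis=1)

sample = np.random.default_rng(4).beta(2, 2, size=200)
F = kernel_cdf(np.linspace(0.05, 0.95, 10), sample)
```

The estimate is smooth and strictly increasing, but the mass it places outside the support explains the large deviations near the bounds reported below.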
For a sample from a well-behaved distribution like the beta(2,2) distribution, it can be observed that spline regression performs slightly better than the reference case if the
[Figure 6.13: Properties of expansion in Hermite polynomials depending on sample size, for beta(2,2), beta(1,1), beta(0.5,1) and beta(0.5,0.7) samples; (a) percentage of untransformed values due to oscillations of Hermite polynomials, over ensemble size; (b) maximum absolute deviations from the variance of the original variable, over ensemble size]
[Figure 6.14: RMSE of interpolation/regression over ensemble size (50 to 1000 members): CDF - pointwise, CDF - spline regression, CDF - kernel smoothing, CDF - polynomial fit; (a) sample drawn from beta(2,2) distribution, (b) sample drawn from beta(0.5,1) distribution]
sample size is small, i.e. for N ≤ 200. This positive effect does not occur for more complex distribution shapes, as shown in Figure 6.14b: here, spline regression performs worse than the reference case even for small sample sizes. Both of the other suggested smoothing techniques produced disappointing results, as their deviations exceeded those of the reference case by far. This can be attributed to a poor fit of the CDF, which is illustrated by Figure 6.15.

[Figure 6.15: Regression techniques to smooth the empirical CDF (empirical CDF, polynomial fit, kernel smoothing) for samples from beta(2,2) and beta(0.5,1) distributions; cumulative frequency over original variable]

While the kernel smoothing estimate of the CDF does not seem that bad, it fails to correctly represent the shape at the bounds of the distribution and thus leads to large deviations from the perfect transformation. The polynomial fit is obviously not a good choice: it cannot provide a good estimate of the CDF, because it is highly constrained (horizontal slope at the bounds) while being of low order. Implementing a higher-order fit might help, although the results of the performance tests give no reason to believe that this would significantly improve the estimation of the transformation function (consider, for example, the performance of spline regression applied to the CDF).
6.4.5 Methods of Choice Based on Performance Tests
Summarizing my findings on the performance of non-parametric methods to build a continuous anamorphosis function, it can be stated that linear interpolation of all points of the empirical CDF, or of intervals of it, leads to stable and reasonably accurate estimates of the perfect transformation. Accuracy can be significantly improved with increasing sample sizes. For small samples, however, cubic spline regression could be an alternative to create a continuous input for the anamorphosis process; its applicability depends on the complexity of the shape of the variable's distribution.
In general, it is recommended to fit a parametric distribution function to the empirical CDF whenever there is reason to believe that the original variable follows a specific type of theoretical distribution. This can be determined from the CDF itself or from a theoretical approach, if the physical properties of the variable suggest a certain distribution, as will be discussed in Section 6.5.1. To complete the performance tests, I fitted a beta distribution to the randomly generated sample (the fitted parameters will then not be exactly equal to the parameters specified to draw the sample) and calculated the deviations from the transformation with the correct beta distribution parameters. It was observed that those deviations were about 20 % smaller than the error made by linear interpolation of the CDF, independently of the sample size. Of course, such a test is bound to show good results, because the same type of distribution function is fitted that the sample has been drawn from. In practice, we can expect such good performance if there is reliable knowledge about the underlying distribution of a variable.
In conclusion, I suggest proceeding according to the following scheme:
1. Choice of technique to build a continuous anamorphosis function:

a) If a theoretical distribution function can be inferred from the physical properties of the variable, estimate the corresponding parameters and use the distribution function as continuous CDF.

b) If non-parametric methods shall be applied instead, check for clustering:

i. If there is clustering, calculate the cumulative frequencies according to Section 6.3.1.
ii. If there is no clustering, go on to c).

c) Choose an appropriate fitting technique based on sample size and empirical distribution shape:

i. If the sample size is smaller than or equal to 200, apply cubic spline regression to the empirical CDF or linearly interpolate the intervals of the CDF; consider using flexible interval lengths depending on distribution shape.
ii. If the sample size is larger than 200, linearly interpolate the empirical CDF or its intervals; again, consider using flexible interval lengths.

2. Definition of the tails of the anamorphosis function:

a) If there is clustering, define the jumps at the clustering bounds according to Section 6.3.1.

b) If the population bounds are not covered by the sample, linearly extrapolate the anamorphosis function according to the sample range condition (Equations 6.16 and 6.17).

c) If the population bounds are equal to the sample bounds, but there is no clustering: no special definition of the tails is necessary.
6.5 Implementation in Ensemble Kalman Filter
Now that we have found transformations that render arbitrary data almost univariate Gaussian, we have to rewrite the Ensemble Kalman Filter to handle the transformed variables adequately. A few pitfalls have to be considered, namely typical properties of the variables to be transformed and the scaling of the measurement error.
6.5.1 Accounting for Properties of State Variables
The transformation of variables offers the opportunity to introduce knowledge about non-stationarity of the variable's properties or about physical constraints. This will be shown in this subsection.
Non-Stationarity of State Variable Statistics
For state variables relevant in groundwater modeling, stationarity can generally not be assumed, because they do not behave in the same way at different locations. This is a direct consequence of boundary conditions imposed on a state variable and has been touched on before in Chapter 1. An illustrative example is concentration: it has a much smoother distribution within the plume than at its edge, where jumps between zero outside the plume and a relatively high value within the plume can occur. Another factor that influences the degree of non-stationarity is the time frame considered by the model: due to dispersion, the concentration distribution becomes smoother the further the solute has traveled. Please refer to Figure 1.1 once again.
In the following, applying a Gaussian transformation to a sample of a variable x that consists of the simulated values of one realization at all grid points will be called the global approach. This procedure requires the assumption of "full stationarity": not even the commonly assumed second-order stationarity of spatial data [Gelhar, 1993], which refers to constant mean and variance, would be sufficient to justify the definition of a complete distribution function that describes the behavior of the variable all over the domain. As an alternative to the global approach, which is supported by Simon and Bertino [2009], the local approach has been introduced by Béal et al. [2010]: to draw a sample, all simulated values of the N realizations at one grid point are collected.
There have been concerns about the continuity of state variables when using the local approach [Béal et al., 2010]. Those concerns do not apply to data assimilation techniques that aim at parameter estimation, because the state variables will be adjusted by a new simulation with updated parameters rather than being updated themselves, as would be the case for state estimation. This procedure ensures the physical consistency of
the states, because they are forced by the partial differential equation solved during the simulation. Also, the spatial correlation of state variables will be preserved through the new simulation; it is not affected by the local, point-wise transformation.
As opposed to other data assimilation techniques, the ensemble Kalman filter offers the chance to build an empirical distribution function based on the ensemble of realizations at one grid point. Any method that does not work with a Monte Carlo ensemble does not have this opportunity and can only draw a sample from spatially distributed data, and therefore implicitly has to assume stationarity. In this study, we will put the local approach into practice to avoid (obviously unjustified) assumptions of full stationarity. As a consequence, we will need to build an anamorphosis function for every observation point where measurement data is available. The number of observations shall be denoted by nmeas. Each of the nmeas empirical anamorphosis functions will consist of an ensemble of N data points which are taken from the realizations.
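The local approach can be sketched as follows (variable names and the beta-distributed synthetic ensemble are illustrative): one separate empirical anamorphosis per measurement location, each built from the N ensemble values at that point.

```python
# Local approach: one anamorphosis per observation point, built from the
# column of ensemble values at that point.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
N, n_meas = 100, 4
Y = rng.beta(2, 2, size=(N, n_meas))   # simulated values, one column per point

def local_transform(values, column):
    """Transform 'values' with the anamorphosis of one measurement location."""
    s = np.sort(column)
    p = np.arange(1, len(s) + 1) / (len(s) + 1)
    return norm.ppf(np.interp(values, s, p))

# Each column gets its own transform; the ensemble at one point maps to
# (approximately) standard-normal scores.
Z = np.column_stack([local_transform(Y[:, k], Y[:, k]) for k in range(n_meas)])
```

Because every column is mapped through its own empirical CDF, no stationarity assumption across locations is needed, which is exactly the point of the local approach.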
Physical Bounds
In this study, the Ensemble Kalman Filter with transformed data shall be applied to a groundwater model. This implies that we have to deal with an ensemble of simulations and a number of "real" observations for assimilation. The parameter to be updated will be log-conductivity; the state variables that will contribute to the updating process are heads, concentration and drawdown.
The local empirical anamorphosis functions will be constructed with the data of the whole ensemble at each measurement location. As it cannot be guaranteed that the range of the ensemble values of one of the variables contains the actual observation at this grid point, it is necessary to extend the transformation function up to the physical bounds as explained above. Furthermore, it might occur that the "real" measured value lies outside the physically possible range of measurement values, for example due to measurement errors. Thus, we have to extend the transformation function even further to take these values into account, while at the same time correcting them based on our a priori knowledge about the physical range of values. Figure 6.16 shows a mixture of clustering at the lower bound, zero, and extrapolation to the physical upper bound, one, as well as the extension towards minus or plus infinity to cover any input value x. The anamorphosis function is a convenient tool to introduce such corrections. So far, unphysical observations have been corrected manually [e.g., Clark et al., 2008], which compromises the random character of the measurement error. It will be shown in Section 6.5.3 how the measurement error is transformed without the need for a pre-processing step to ensure physical measurement values.
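The extension just described can be sketched numerically (all node values below are illustrative): linear tails from the ensemble bounds to the physical bounds [0, 1], and constant continuation beyond them, so that even an unphysical observation receives a finite transform.

```python
# Extended anamorphosis: ensemble support, linear tails to the physical
# bounds, and clamping beyond them (z = z_min for any x < x_min, z = z_max
# for any x > x_max).
import numpy as np

x_nodes = np.array([0.0, 0.12, 0.55, 0.87, 1.0])   # physical + ensemble support
z_nodes = np.array([-3.0, -1.8, 0.1, 1.9, 3.0])    # illustrative transforms

def extended_transform(x):
    # np.interp clamps outside the node range, implementing the constant tails
    return np.interp(x, x_nodes, z_nodes)

z_unphysical = extended_transform(-0.05)   # observation below the physical bound
```

An observation of, say, -0.05 (unphysical because of measurement error) is thus mapped to the transform of the lower physical bound rather than being corrected manually beforehand.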
Let us now have a look at the physical bounds inherent to the variables that will be examined in the groundwater model. The types of bounds can be categorized into positiveness, one-sided limitation and double-sided limitation, and will be discussed in the following.
[Figure 6.16: Anamorphosis function (transformed variable z over original variable x), illustrating clustering, ensemble and physical bounds, and the extension towards ± infinity (for any x < xmin: z = zmin; for any x > xmax: z = zmax)]
Positiveness Physical state variables like drawdown, concentration or discharge take on positive values by definition. The required non-negativity is ensured by setting the lower bound to zero. Non-negative variables can be represented, e.g., by a log-normal distribution with a tail towards plus infinity.
Bound on one side If one head Dirichlet boundary condition is combined with a Neumann boundary condition, the head distribution will be bounded on one side by the Dirichlet value. A similar scenario is created by two head Dirichlet boundary conditions if recharge is taken into account; here, the head will never fall below the lower of the two boundary values. This holds analogously for domains with sources and sinks.
Bounds on both sides Two Dirichlet boundary conditions defining the main flow direction lead to hydraulic head distributions with two bounds. If no sources or sinks are present, all head values will lie between the two boundary values. It has been shown by Nowak et al. [2008] that head distributions in boundary-influenced domains can be well represented by fitted beta distributions. Similarly, solute concentrations cover values between zero and the initial condition for concentration and follow a beta distribution function, as derived by Bellin and Tonina [2007].
6.5.2 Comparability of Observations and Simulated Measurements
Updating a parameter field by applying a correction term requires that this term be very sensitive to "meaningful" differences between the values that are subtracted, but at the same time does not introduce any differences that are not caused by physical processes. This implies that the data transformation must not introduce differences that were not present before.
A special case of differences between two values is a difference equal to zero, i.e. thesimulated observation is equal to the observed measurement. This is the optimal outcomeof a simulation as the observation has been perfectly reproduced. Consequently, nocorrection should be performed in the updating step.
Both the observed measurements and the simulated measurements from all realizations are transformed with the same anamorphosis function, because we cannot determine an independent anamorphosis function for the observations: firstly, we only have one data point per variable and measurement location, and secondly, we assume that our model is error-free and thus, implicitly, that the real observation originates from the same population as the simulated values of our ensemble, only corrupted by measurement error. Transforming observations and simulations with the same function ensures comparability between both transformed variables, but is also a source of clustering. Special attention has to be paid here, as clustering is an accumulation of equal values that have to be transformed equally. The method to handle clustering suggested in Section 6.3.1 has been developed to fulfill the requirement of an error equal to zero for measurements that are exactly the same.
6.5.3 Transformation of Measurement Error
To determine the discrepancy between reality and the simulation based on the parameters of realization i, a random error has to be added to the simulated measurements in order to create conditions comparable to the observed measurements y_o, which consist of the actual value of the state variable and a measurement error that is unknown, but assumed to be standard-Gaussian distributed. To be consistent with realistic data collection, it cannot be distinguished which fraction of the measured value is due to measurement error. Yet, to maintain the traditional analysis scheme of the ensemble Kalman filter, the simulated measurements and their error have to be transformed separately. In order to still create comparable transformed values ŷ_u,i + ε̂_i and ŷ_o (which is crucial for the effectiveness of the update, as discussed in the previous section), there are different approaches to transforming the measurement error ε_i.
Scaling of Measurement Error Variance
A first possibility would be to scale ε with the same factor that the simulated measurement has been transformed with:

ε̂_i = ε_i · ŷ_u,i / y_u,i   (6.20)

This would maintain the fraction of the measurement error with respect to the total measured value and turn ε̂ into a normally distributed variable with zero mean and standard deviation equal to the scaling factor ŷ_u,i / y_u,i. The drawback of this intuitive method is the fact that, for nonlinear anamorphosis functions, this scaling will not yield the same result as the transformation of the real measurements, which already include a measurement error. Consider a short, arbitrary example to point this out: our actual measurement reads 3, and the measurement error is assumed to be equal to 1. Separate nonlinear transformation according to the function v̂ = v² and scaling would yield a transformed total measurement of 3² + 1 · (3²/3) = 12; transformation of the original total measurement results in (3 + 1)² = 16.
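The arithmetic of this example can be checked directly (using the quadratic toy transformation from the text):

```python
# Scaling vs. direct transformation for a nonlinear anamorphosis v_hat = v^2:
# the two routes give different totals (12 vs. 16), illustrating the drawback.
def phi(v):
    return v ** 2

y, eps = 3.0, 1.0
scaled = phi(y) + eps * phi(y) / y   # separate transformation + scaled error
direct = phi(y + eps)                # transforming the perturbed measurement
```

The gap between the two results grows with the curvature of the transformation, which motivates the secant-based alternative below.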
If strict comparability between simulated perturbed measurements and real observations is demanded, the following condition can be formulated:
ψ(y* + ε*) = ψ(y*) + ε̂*   (6.21)
This ensures, for a specific value y* and the ε* assigned to it, that the transformation of the sum of both yields the same result as the separate transformation.
Transforming Measurement Error According to Anamorphosis Function
Equation 6.21 can be used directly as a scaling method by rewriting it:
ε̂* = ψ(y* + ε*) − ψ(y*).   (6.22)
In words, choose ε such that the transformation of the sum corresponds to the sum ofthe transformed value plus the scaled error. This can be interpreted as determining thelocal secant of the anamorphosis function. The procedure guarantees that the observedmeasurements and the simulated measurements plus their random error are treatedequally, but then the error does not follow a Gaussian distribution anymore.
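As a quick numerical illustration, the two error-transformation techniques can be sketched as follows (a hedged Python stand-in, not the thesis' MATLAB code; the quadratic anamorphosis ψ(y) = y² is simply the arbitrary example function used above):

```python
def psi(y):
    """Arbitrary example anamorphosis function from the text: psi(y) = y^2."""
    return y ** 2

def scaled_error(y, eps):
    """Technique 'scaling' (Eq. 6.20): scale eps by the same factor as y."""
    return eps * psi(y) / y

def secant_error(y, eps):
    """Technique 'secant' (Eq. 6.22): local secant of the anamorphosis."""
    return psi(y + eps) - psi(y)

y_meas, eps = 3.0, 1.0
total_scaled = psi(y_meas) + scaled_error(y_meas, eps)   # 9 + 3 = 12
total_secant = psi(y_meas) + secant_error(y_meas, eps)   # psi(4) = 16
```

Only the secant variant reproduces the transformation of the full perturbed measurement, at the price of a transformed error that is no longer Gaussian.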
Both methods will be implemented within this study and assessed with regard to their resulting quality of prognosis in Chapter 8.
6.5.4 Parameter Updating Step
Equation 5.7, presented within the context of the ensemble Kalman filter analysis scheme (Section 5.2.2), is modified in order to incorporate Gaussian anamorphosed variables:

sc,i = su,i + Q̂sy (Q̂yy + R̂)⁻¹ (ŷo − (ŷu,i + ε̂i)),  (6.23)

with the hat symbolizing anamorphosed variables. The Gaussian anamorphosis function links the transformed variables with the original ones: for each realization i, the vector of transformed values at the k = 1 ... nmeas measurement locations is obtained by evaluating the individual anamorphosis functions ψk. Methods to obtain and evaluate the anamorphosis function have been discussed extensively in Sections 6.1 to 6.3. The covariance matrix Q̂yy can be calculated from the transformed measurements ŷu,i, as can the cross-covariance matrix Q̂sy.
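A compact numerical sketch of this update (a hypothetical Python stand-in with a toy linear relation between parameters and transformed measurements; ensemble sizes, noise levels and variable names are illustrative assumptions, with the error variance estimated numerically as in Eq. 6.24):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ensembles: n_par parameters, n_meas transformed measurements, N realizations
N, n_par, n_meas = 500, 4, 2
S_u = rng.normal(size=(n_par, N))                       # prior parameter ensemble
Y_t = 0.5 * S_u[:n_meas] + 0.1 * rng.normal(size=(n_meas, N))  # transformed simulated measurements
E_t = 0.05 * rng.normal(size=(n_meas, N))               # transformed measurement errors
y_obs_t = np.array([0.3, -0.2])                         # transformed observation

# Ensemble (cross-)covariances in the transformed space
Qsy = np.cov(np.vstack([S_u, Y_t]))[:n_par, n_par:]     # Q_sy (n_par x n_meas)
Qyy = np.cov(Y_t)                                       # Q_yy (n_meas x n_meas)
R_t = np.diag(np.var(E_t, axis=1))                      # numerical error variance

# Update of every realization (analogous to Eq. 6.23)
K = Qsy @ np.linalg.inv(Qyy + R_t)
S_c = S_u + K @ (y_obs_t[:, None] - (Y_t + E_t))
```

Conditioning shrinks the spread of the parameter components that the measurements inform, while leaving uncorrelated components essentially untouched.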
The determination of the transformed measurement error variance R̂ depends on the chosen approach to transforming the measurement error ε: if transformation technique 1 (transformation of the measurement error according to the anamorphosis function) is chosen, the error variance is calculated numerically rather than determined analytically, in order to be consistent with ε̂:

Q̂yy + R̂ = Cov(ŷu,i + ε̂i) = Cov(ŷu,i) + Cov(ε̂i)  →  R̂ = Cov(ε̂i).  (6.24)
The off-diagonal elements are set to zero, since independence of the measurement errors is assumed. Note that, within this approach, the stability and accuracy of the transformed measurement error variance R̂ are strongly influenced by the ensemble size N. Therefore, it is suggested to generate an additional ensemble of perturbed measurements: new randomly drawn measurement errors are added to the existing ensemble of simulated measurements. Transforming both the original perturbed ensemble yu + ε and the additional perturbed ensemble yu + εadd doubles the number of transformed measurement errors available for the calculation of the variance. Of course, any multiplying factor nadd can be chosen to obtain an ensemble of transformed errors of size nε:

nε = (nadd + 1) N.  (6.25)
For transformation technique 2 (scaling of the measurement error variance), the transformed variance is defined analytically by

R̂ = R · diag(Q̂yy) / diag(Qyy),  (6.26)
which also results in non-zero elements only on the main diagonal. Here, no additional perturbed measurements are required, which is computationally more efficient.
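A minimal sketch of this analytic scaling (hypothetical Python stand-in; the lognormal ensemble and the log transform merely mimic a skewed data type with a known anamorphosis):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical skewed ensemble of simulated measurements at 3 locations
Y = rng.lognormal(mean=0.0, sigma=1.0, size=(3, 1000))   # untransformed ensemble
Y_t = np.log(Y)                                          # stand-in anamorphosis
R = np.diag(np.full(3, 0.02 ** 2))                       # original error covariance

# Analogous to Eq. 6.26: scale each error variance by the ratio of
# transformed to untransformed ensemble variance; off-diagonal elements
# stay zero because measurement errors are assumed independent.
ratio = np.var(Y_t, axis=1) / np.var(Y, axis=1)
R_t = np.diag(np.diag(R) * ratio)
```

Because the log transform compresses this skewed ensemble, the scaled error variances come out smaller than the original ones in this example.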
6.5.5 Model Bias
The decision to transform the simulated measurements yu and the synthetic observations yo with the same anamorphosis function is justified by the assumption that we are working with an error-free model. Otherwise, it could not be claimed that the real observation stems from the same population as the simulated measurements. Despite this assumption, the anamorphosis function has to be extrapolated beyond the ensemble bounds, because an ensemble is just a sample, which cannot cover the total range of the underlying population. Thus, statistically, it is possible that a real observation falls outside the range of the simulated ensemble. This effect is even strengthened by including a random measurement error, which can widen or narrow the range covered by the simulated ensemble.
In general, model bias could be introduced by adopting an inappropriate geostatistical model with too small a variance of log-conductivity or an inadequate spatial structure (e.g., assuming multi-Gaussianity where it is not justified). The flow and transport model could be inaccurate as well, if influential governing processes are neglected or unknown. A third source of model bias is the assumption of incorrect flow boundary conditions. Wrong model boundary conditions lead to two potential outcomes: firstly, a significant number of realizations could produce measurement values that deviate strongly from the true measurement, i.e., the true measurement lies close to the ensemble bounds. This scenario forces the filter to apply heavy corrections to the parameters that are statistically inconsistent and will most likely not result in a reasonable prognosis. For future research, it is suggested to test whether the ensemble Kalman filter with transformed data performs significantly differently in this case than the conventional filter with untransformed data. Please note that correcting wrong boundary conditions is not the main motivation for applying Gaussian anamorphosis; hence, this should not be a criterion to evaluate the worth of the transformation, although it is an indicator of its robustness.
Secondly, wrong boundary conditions could result in a prior ensemble that clearly does not represent a parent distribution of which the true physical system might be a legitimate member, because most of the true observations lie well outside the ensemble range or even the predefined physical range. In that case, the underlying physical or conceptual a priori model should be questioned and the preprocessing step of defining physical limits should be repeated. This is clearly not a task to be solved by the anamorphosis procedure, but lies within the responsibility of the modeler.
If boundary conditions are known to be uncertain, it could be an option to parameterize them: they could be added to the list of parameters to be estimated and corrected with the help of the filter. Instead of harming the prediction, the issue of specifying boundary conditions could then contribute to an accurate estimate of the uncertainty of the prognosis.
7 Application to Synthetic Test Case
The performance of the ensemble Kalman filter with transformed data shall be evaluated by identifying and quantifying the impact of the anamorphosis on the resulting prognosis. This will be analyzed for different data types and two different methods of transforming measurement errors. The numerical implementation and the testing strategy are discussed in the following; results are presented in Chapter 8.
7.1 Numerical Implementation
The ensemble Kalman filter applied to transformed data is implemented in MATLAB and coupled with a MATLAB-based FEM code to solve the flow and transport model. The standard Galerkin FEM is used to numerically approximate the groundwater flow equation 3.4; for the transport equation 3.7, the streamline-upwind Petrov-Galerkin FEM is applied [Hughes, 1987]. The resulting systems of equations are solved with the UMFPACK solver [Davis, 2004]. Dirichlet and Neumann boundary conditions are prescribed to pose a well-defined problem.
Conductivity values are assigned elementwise, while the flow and transport model returns state variable values at the nodes of the grid. For simplicity, this distinction will not be made in the further course of this study.
The random parameter fields for the different realizations are generated with the spectral method of Dietrich and Newsam [Dietrich and Newsam, 1993]. If not indicated otherwise, an ensemble of 1000 realizations is used to obtain satisfying statistics [Zhang et al., 2005] while keeping the computational effort moderate enough to perform the various scenarios presented in the following section.
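As a rough illustration of spectral field generation, the following hedged Python sketch uses a direct periodic sampler, not the actual circulant-embedding method of Dietrich and Newsam; grid size and covariance parameters loosely follow Table 7.1, and the microscale smoothing is omitted:

```python
import numpy as np

def exp_field_2d(n, dx, var, lam, rng):
    """Sample a periodic 2-D Gaussian random field with isotropic exponential
    covariance C(r) = var * exp(-r / lam) by direct spectral synthesis.
    Without embedding, mild periodicity artefacts remain."""
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=dx)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx ** 2 + ky ** 2
    # Spectral density of the 2-D exponential covariance model
    S = 2.0 * np.pi * var * lam ** 2 / (1.0 + lam ** 2 * k2) ** 1.5
    # Complex white noise in Fourier space; the real part of the inverse
    # transform is a real-valued field with approximately the target variance
    W = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (n / dx) * np.fft.ifft2(np.sqrt(S) * W).real

rng = np.random.default_rng(42)
# Roughly the thesis configuration: 100 m x 100 m grid, dx = 1 m,
# variance 1 and correlation length 20 m, geometric mean K = 1e-5 m/s
log_k = -11.5 + exp_field_2d(100, 1.0, 1.0, 20.0, rng)
```

A single realization over a domain of only five correlation lengths shows sizable sampling fluctuations in mean and variance, which is exactly why ensemble statistics (and several synthetic truths) are used.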
7.2 Description of Test Case
A steady-state, depth-averaged groundwater flow and transport model is chosen as the test case. The rectangular section of a confined aquifer extends over an area of 100 m × 100 m. Flow is induced by a head gradient from west to east (Dirichlet boundary conditions
h = 1 m at the western boundary and h = 0 m at the eastern boundary). Impermeable Neumann boundaries in the north and south maintain the main flow direction from west to east. A tracer plume enters the domain in the middle of the western boundary with a width of 30 m. The state variables head and concentration are simulated based on the underlying parameter field. The test case setting is completed by a well for pumping test analysis in the center of the domain; drawdown is simulated, while concentration is not considered within the pumping test scenario.
Log-conductivity is assumed to follow a multi-Gaussian distribution; therefore, variogram models to describe the covariance of log-conductivity are justified. The isotropic exponential model with a microscale smoothing parameter [Kitanidis, 1997] is used to generate random conductivity fields. Each realization consists of such a random field and the corresponding simulated heads, drawdowns and concentrations.
17 measurement locations are spread regularly over the domain. Measurements of head, drawdown or concentration are taken at these locations to update the unconditional ensemble. Measurement errors are assumed to be Gaussian distributed and uncorrelated; the standard deviation depends on the data type. A summary of the parameter values adopted in the test case is provided in Table 7.1.
Parameter                       Symbol      Value         Unit

Domain Discretization
  Domain size                   [Lx, Ly]    [100, 100]    m
  Grid spacing                  [Δx, Δy]    [1, 1]        m

Geostatistical Model
  Geometric mean of K           Kg          10⁻⁵          m/s
  Variance of log K             σ²K         1             -
  Correlation length of log K   [λx, λy]    [20, 20]      m
  Microscale smoothing          d           2.5           m

Transport Model
  Porosity                      φ           0.35          -
  Dispersivities                [αl, αt]    [2.5, 0.25]   m
  Diffusion coefficient         De          10⁻⁹          m²/s

Measurement Error
  Standard deviation of εh      σε,h        0.02          m
  Standard deviation of εd      σε,d        0.01          m
  Standard deviation of εc      σε,c        0.02 + 20%    -

Table 7.1: Model parameters used for the synthetic test case. K and log K stand for conductivity and log-conductivity, respectively; h, d and c represent the state variables head, drawdown and concentration; ε symbolizes the measurement error. For concentration data, the measurement error standard deviation is composed of an absolute and a relative part and results in a measurement-specific standard deviation.
7.3 Test Procedure
A synthetic truth is generated as a reference field: this random log-conductivity field, together with its simulated heads, concentrations and drawdowns, will be referred to as the "true" field. Any parameter or state variable value is known everywhere in the domain, but only the measurements at the designated locations are used in the updating step. An unconditioned ensemble is produced by generating random log-conductivity fields as previously described and running flow and transport simulations on these fields. The updating step is then performed based on the observations of one of the three data types; this will be clarified in the respective paragraphs of Chapter 8.
Not only one synthetic truth is used to assess the accuracy of the filter with transformed data, but several randomly generated "truths", because results obtained for a single synthetic field could be compromised by numerical or statistical artefacts. This shall be avoided by performing data assimilation for true fields with different characteristic features.
Statistical analysis of the ensembles includes the mean and the variance of the conductivity, head, concentration and drawdown fields, as well as different measures that account for the deviation from the true fields:

• Root mean square error:

RMSϑ = √( (1/n) Σi=1..n (ϑtrue,i − ϑ̄c,i)² ),  (7.1)

with ϑ being the parameter log K or any of the state variables h, d, c; n stands for the number of elements or nodes, respectively. ϑ̄ is obtained by averaging over all realizations.
• Prediction bias:

ϑbias,j = (ϑ̄c,j − dj) / √(Rj,j),  (7.2)

defined for a measurement location j as the deviation of the mean simulated measurement from the observed one, normalized by the measurement error standard deviation. The prediction bias can be evaluated both in the transformed and in the untransformed space.
• Statistics of residuals: for the ensemble of simulated measurements at location j, the standard deviation is calculated and normalized by the measurement error standard deviation; the higher-order moments skewness and excess kurtosis are also determined.
It might be noticed that, for the a posteriori statistics, the pure simulated measurements ϑc are used instead of a perturbed ensemble ϑc + ε; the reason is that the mean of the measurement errors ε is zero and therefore has no influence on the quality measures.
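These quality measures can be sketched as follows (hypothetical Python stand-in with synthetic numbers; `theta_true`, `ens` and the chosen sizes are illustrative assumptions, not thesis data):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical "true" field values and a conditional ensemble around them
n_cells, n_real = 200, 50
theta_true = rng.normal(size=n_cells)
ens = theta_true[:, None] + 0.3 * rng.normal(size=(n_cells, n_real))
theta_bar = ens.mean(axis=1)          # ensemble mean (best estimate)

# Root mean square error (as in Eq. 7.1)
rmse = np.sqrt(np.mean((theta_true - theta_bar) ** 2))

# Normalized prediction bias (as in Eq. 7.2) at assumed measurement locations
obs = theta_true[:5]                  # pretend the first 5 cells are observed
R_diag = np.full(5, 0.02 ** 2)        # measurement error variances
bias = (theta_bar[:5] - obs) / np.sqrt(R_diag)
```

Averaging 50 realizations reduces the per-cell error by roughly a factor of √50, which the RMSE picks up directly.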
8 Results and Discussion
8.1 Filtering Procedure with Transformed Data
The procedure of ensemble Kalman filtering applied to transformed data shall be discussed exemplarily, step by step. Parameter estimation by conditioning on drawdown data is considered. Observations are assumed to be gathered at 17 locations, indicated by black rings around white crosses in the plots of the arbitrarily chosen synthetic truth in Figure 8.1.
Figure 8.1: Synthetic truth: log-conductivity field and drawdown field
Ensemble Generation  First of all, an ensemble of 1000 parameter fields is generated according to the geostatistical model for log-conductivity described in Chapter 7. Flow simulations on these log-conductivity fields yield an ensemble of 1000 drawdown fields. The ensemble of log-conductivity fields shall be corrected with regard to the observed data.
Prior Statistics  The prior statistics of the log-conductivity and drawdown ensembles are plotted in Figure 8.2. The mean of log K lies around −11.5 and the variance around 1, corresponding to the prescribed statistics for the generation of the fields. Mean and variance of the drawdown fields result from solving the groundwater flow equation with the boundary conditions specified in Section 7.2 and represent unconditioned statistics. As expected, the unconditioned ensemble mean is not able to satisfyingly reconstruct the synthetic truth (the scales of the colorbars are maintained throughout this section).

Figure 8.2: A priori ensemble statistics of log-conductivity and drawdown: (a) a priori ensemble mean, (b) a priori ensemble variance
Gaussian Anamorphosis  Before the EnKF conditioning step, the drawdown data at the measurement locations are transformed by linearly interpolating the empirical CDF. This is the method of choice, since no theoretical distribution function is available for drawdown data and we are working with a relatively large ensemble (see the recommendations in Section 6.4.5).
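The interpolated-CDF transformation can be sketched as follows (a hedged Python stand-in using a stdlib normal quantile function; the skewed synthetic "drawdown" sample is an illustrative assumption, and tail extrapolation is omitted):

```python
import numpy as np
from statistics import NormalDist

def make_anamorphosis(ensemble):
    """Empirical Gaussian anamorphosis: linearly interpolated empirical CDF
    composed with the standard normal quantile function (a simplified
    stand-in for the scheme of Section 6.4, without tail extrapolation)."""
    srt = np.sort(np.asarray(ensemble, dtype=float))
    # Plotting positions strictly inside (0, 1) avoid infinite quantiles
    p = (np.arange(1, srt.size + 1) - 0.5) / srt.size
    nd = NormalDist()

    def psi(y):
        prob = np.interp(y, srt, p)          # interpolated empirical CDF
        return np.array([nd.inv_cdf(q) for q in np.atleast_1d(prob)])

    return psi

rng = np.random.default_rng(3)
drawdown = -rng.lognormal(mean=-2.0, sigma=0.8, size=1000)  # skewed sample
psi = make_anamorphosis(drawdown)
z = psi(drawdown)
```

Applying ψ to its own training ensemble yields near-perfect normal scores; between the knots, the mapping is piecewise linear.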
The histogram of the highly skewed original data, the empirical anamorphosis function and the resulting histogram of the transformed data are shown in Figure 8.3. The upper row shows exemplary plots for the measurement location at the pumping well; the lower row summarizes all other measurement locations (different color shades represent different locations). The synthetic observations are marked with black circles in the anamorphosis function plots. Extrapolation towards the physical bound of zero or beyond is not required in this specific case, because the unconditioned ensemble of drawdown values encloses the observed data values. Nevertheless, the a priori ensemble has a large variability, and we can conclude that prediction confidence will profit from an updating step that narrows the prediction ensemble. This transformation step is the only modification of the traditional EnKF analysis scheme and takes up only a fraction of the computational time needed to process the flow model runs; from here on, the usual procedure is resumed.

Figure 8.3: Gaussian anamorphosis of drawdown data. The upper row shows the transformation of the ensemble at the measurement location closest to the well; the lower row summarizes the transformation at the other measurement locations.
Updating Step  With these transformed data, the Kalman gain is computed and each realization is updated according to Equation 6.23; measurement errors are transformed based on the individual anamorphosis functions, as expressed by Equation 6.22.
After this conditioning step, the log-conductivity fields have been corrected based on anamorphosed drawdown data. The new conductivity values are used as input for another run of the flow model. The back-transformed, updated drawdown values are determined by this subsequent model run, which replaces the inverse anamorphosis needed in the case of state estimation [Simon and Bertino, 2009].
To illustrate the success of the updating step, the simulated ensemble of drawdown values close to the pumping well before and after updating is shown in Figure 8.4. The horizontal line indicates the observed value that shall be reproduced. Obviously, conditioning has narrowed the ensemble considerably toward the true measurement value.

Figure 8.4: Drawdown ensembles before (top) and after (bottom) updating at the measurement location closest to the pumping well. The observed value is marked by the thick red line.
A Posteriori Statistics  Now that the log-K fields have been conditioned on drawdown data, a posteriori statistics can be calculated. Figure 8.5 shows the estimated mean and the prediction variance of the updated fields. The best estimate of drawdown has been significantly improved and reasonably reproduces the shape of the drawdown cone in the synthetic truth. The parameter field is not matched as well, but it has to be recalled that only limited data are available and the realizations are not conditioned on direct log-conductivity values. Thus, the best estimate of the parameter field can only reproduce large-scale structures and lacks variability compared to any individual realization, such as the synthetic truth: local differences between the realizations are evened out, and this averaged parameter field cannot be used as input to simulate the expected drawdown field because of the non-linearity of the flow model. Instead, the best estimate of the drawdown field is obtained by model runs on the whole ensemble of parameter fields and offers a better foundation for interpretation.
In this context, it shall be mentioned that the updated parameter field realizations still follow the prescribed geostatistical model, which is preserved by the consistent use of covariance matrices in the updating step. The mean parameter field, however, does not share these geostatistical properties, since it shows a much smoother spatial structure.

Figure 8.5: A posteriori ensemble statistics of log-conductivity and drawdown: (a) a posteriori ensemble mean, (b) a posteriori ensemble variance
In the following section, the effects of the transformation are examined in detail, in order to understand and efficiently use its beneficial properties compared to the assimilation of untransformed data.
8.2 Effects of Transformation
8.2.1 Pseudo-Linearized Dependence of States on Parameters
Univariate transformation techniques have been developed in this study to obtain univariate, approximately Gaussian-distributed state variables. The implicit working hypothesis is that this will install a more linear dependence on their Gaussian-distributed primary parameter, log-conductivity. The state-parameter dependence will now be investigated for untransformed and transformed state variables of different types (drawdown, hydraulic head and solute concentration) at different measurement locations, as marked in Figure 8.6. Log-conductivity locations and corresponding state variable locations are assigned the same identifying number. Pairs with high (positive) correlations between state variable and log-conductivity were chosen to demonstrate the change in dependence.
Figure 8.6: Synthetic fields with marked measurement locations: numbers indicate the pairs of strongly correlated state variable and log-conductivity
Figure 8.7: Dependence of drawdown on log-conductivity (Locations 1 and 2)
Drawdown Data  Figure 8.7 displays scatter plots of the simulated drawdown ensemble at locations 1 and 2 versus the log-conductivity values at locations 1′ and 2′. The untransformed ensemble at the pumping well (location 1) shows a strong, but highly non-linear dependence on the values of log-conductivity at location 1′. This dependence structure can be well linearized by the transformation plotted in the middle panel. The resulting dependence of the transformed drawdown on log-conductivity is almost perfectly linear and can be exploited more efficiently in the EnKF analysis scheme. The farther away from the pumping well, the more linear the dependence, but at the same time the more scattered, as shown in the lower row of plots.
Head Data  Compared to drawdown data, heads depend on log-conductivity in a relatively linear manner. Therefore, such a strong improvement in the linearity of the dependence cannot be expected. Figure 8.8 displays the scatter plots for locations 3/3′ and 4/4′, respectively. The lower row, for the measurement close to the western boundary, shows the most non-linear behavior that might occur for head data; it is attributed to the influence of the boundary condition (see Section 6.5.1). For measurement locations that are considerably influenced by boundary conditions, transformation might improve the efficiency of data assimilation.
Figure 8.8: Dependence of head on log-conductivity (Locations 3 and 4)
Concentration Data  The dependence of concentration values on log-conductivity is visualized in Figure 8.9. Obviously, untransformed concentration data plotted over log-conductivity scatter much more than the other state variables. The higher fraction of independent log-K values is caused by a strong non-local dependence structure. It is difficult to identify a distinct pattern within the scatter plots. Nevertheless, the transformation has a linearizing effect on the dependence, although not as impressive as for drawdown data. Notice that the concentration value at location 5 does not depend mostly on the log-conductivity in its vicinity, but on the values at the source of the plume. It can be concluded that the hydraulic properties at the source of the contaminant considerably determine the propagation of the plume [de Barros and Nowak, 2010].

Figure 8.9: Dependence of concentration on log-conductivity (Locations 5 and 6)
Summary  It has been demonstrated that the Gaussian anamorphosis has a linearizing impact on the model which links parameters and states; an especially strong effect is visible for drawdown data. Thereby, the implicit assumption and the suggested benefit of filtering anamorphosed data have been affirmed.
8.2.2 Bivariate Dependence Structures of State Variables
As explained in the motivation for this study, Gaussian marginals are only a first step toward meeting the assumption of multi-Gaussianity as a prerequisite for the optimal performance of the EnKF. Johnson and Wichern [1988] offer a selection of tests of whether two variables with Gaussian marginals can be considered at least bi-Gaussian. In order to assess the remaining non-multi-Gaussianity after transformation, bivariate scatter plots are produced to visualize the dependence structures between the state variables at the locations defined above. Bivariate test plots can only provide a first hint of whether a variable is multi-Gaussian distributed; even if all involved bivariate data sets are found to be bi-Gaussian, the multi-dimensional distribution does not necessarily have to be multi-Gaussian. Yet, if the bivariate plots already show non-Gaussian behavior, the assumption of multi-Gaussianity can be instantly dismissed. Note that the spatial dependence structure captured by copulas is not changed by the transformation of the marginals, since monotone, rank-preserving transformation techniques are applied [Bárdossy and Li, 2008]. Thus, the degree of non-multi-Gaussianity after transformation depends only on the multivariate behavior of the different variable types.
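The rank-preservation argument can be checked directly (Python stand-in; the lognormal sample and the log transform are illustrative assumptions):

```python
import numpy as np

def copula_coords(x):
    """Empirical copula coordinates: ranks rescaled to the open interval (0, 1)."""
    r = np.empty(x.size)
    r[np.argsort(x)] = np.arange(1, x.size + 1)
    return r / (x.size + 1)

rng = np.random.default_rng(4)
y = rng.lognormal(size=2000)

u_before = copula_coords(y)          # copula coordinates of the raw data
u_after = copula_coords(np.log(y))   # ... after a monotone transformation

# A strictly monotone (rank-preserving) transform leaves the empirical
# copula, and hence the dependence structure, unchanged.
```

This is exactly why univariate anamorphosis fixes the marginals but cannot repair a non-Gaussian dependence structure.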
Drawdown Data  Figure 8.10 contrasts the empirical copula density obtained from the drawdown ensemble at locations 1 and 2 with the theoretical Gaussian copula density that corresponds to the rank correlation r = 0.49 calculated for the empirical data sets. The empirical copula has been determined from an ensemble of N = 100,000 realizations and shows features similar to the Gaussian one, although a slight non-symmetrical curvature is noticeable. This indicates that the bivariate behavior of drawdown data is close to bi-Gaussian, but there is still a non-Gaussian influence on the dependence structure that should be further investigated and, if possible, transformed to multi-Gaussian dependence in future work.
Figure 8.10: Empirical copula density for drawdown at locations 1 and 2 (left) and theoretical Gaussian copula density (right) with the same rank correlation
Head Data  For heads, a dependence structure close to bi-Gaussian is expected from the known quasi-linearity of the head-conductivity relation, and this is indeed found, as plotted in Figure 8.11. Here, the empirical copula density of the data sets at locations 3 and 4, with a rank correlation r = 0.34, shows the typical symmetric features of Gaussian dependence; thus, head data can be assumed to be at least bi-Gaussian distributed after transformation of the marginals.
Figure 8.11: Empirical copula density for heads at locations 3 and 4 (left) and theoretical Gaussian copula density (right) with the same rank correlation
Concentration Data  Different results are expected for concentration data, since the state-parameter dependence already showed characteristics that could only be explained by various sources of influence, not by a single local influence that could be addressed by univariate transformation. The bivariate behavior of the concentration data sets, with a rank correlation r = 0.30, is plotted in Figure 8.12. The empirical copula density is characterized by a stronger non-symmetry than the dependence structures of the other variable types. Its shape is reminiscent of a "coffee bean", resulting from measurements that either are part of the plume or lie outside of it. The complex spatial dependence of concentration data limits the effectiveness of univariate transformation; multivariate transformations will be inevitable in future work in order to bring this data type closer to multi-Gaussianity.
Figure 8.12: Empirical copula density for concentration at locations 5 and 6 (left) and theoretical Gaussian copula density (right) with the same rank correlation
8.2.3 Qualitative Differences in the Updating Step
As illustrated in the previous sections, the transformation has no influence on the multivariate dependence among the state variables, but has a direct impact on the state-parameter dependence. Therefore, qualitative differences are expected in the updating step, which translates the information obtained from the observations into information on the parameter field. Differences in the assignment of weights to the observations and in the spatial radius of influence (as discussed in Chapter 5) are investigated for state variable and parameter ensembles with N = 1,000 realizations.
Drawdown Data  Figure 8.13 displays the influence of the drawdown observation at location 1 on the parameter field to be updated. The scale is normalized by the largest absolute correlation; positive values signify positive correlation, while negative values represent negative correlation. As claimed before, drawdown data depend rather locally on the hydraulic conditions, and therefore measurement 1 has a major influence on the log-conductivity values in the close vicinity of the well. The influence decreases smoothly with increasing distance from the well; negative correlations hardly occur. Remember that drawdown is defined as a negative value; thus, a positive correlation corresponds to an inversely proportional relationship between drawdown and conductivity: high absolute drawdown results from low conductivity. This spatial behavior is not significantly altered by the transformation; only a slight tendency towards even smoother transitions is visible in the plot on the right. The area of influence seems to be slightly larger and more symmetric, which would be expected from the relationship between drawdown and log-conductivity, given this specific spatial configuration.
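The influence functions shown in Figures 8.13 to 8.15 are essentially rows of the cross-covariance between a single (transformed) measurement and the parameter field, normalized by their largest absolute entry. A toy sketch (hypothetical Python stand-in with an artificial, exponentially decaying local dependence):

```python
import numpy as np

rng = np.random.default_rng(5)
N, n_cells = 1000, 400

# Hypothetical parameter ensemble (log-K anomalies) and one simulated
# measurement that depends on the cells with exponentially decaying weights
S = rng.normal(size=(n_cells, N))
weights = np.exp(-np.arange(n_cells) / 30.0)
y = weights @ S + 0.1 * rng.normal(size=N)

# Ensemble cross-covariance between the measurement and every cell,
# normalized by its largest absolute entry (as in the influence plots)
S_anom = S - S.mean(axis=1, keepdims=True)
y_anom = y - y.mean()
q_sy = S_anom @ y_anom / (N - 1)
influence = q_sy / np.abs(q_sy).max()
```

Cells that genuinely drive the measurement come out near 1, while remote cells show only sampling noise around zero; this noise floor is what localization techniques would suppress.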
Figure 8.13: Influence function of measurement 1 (drawdown) on the parameter field
Head Data  Head data observations, however, show not only positive but also negative correlations with the log-conductivity values. This results in a less local dependency on the hydraulic conditions, as visualized in Figure 8.14. It can be seen that the head measurement at location 3 has a major positive correlation with the log-conductivity values in its close vicinity, but is also negatively correlated with the parameters close to the boundary of the domain. With regard to the transformed influence function, again only small smoothing effects are visible.
Figure 8.14: Influence function of measurement 3 (head) on the parameter field
Concentration Data  The complex dependence structure of concentration data on the log-conductivity values yields a more complex influence function: as visualized in Figure 8.15, the observation at location 5 still has a large influence on the conductivity in its vicinity, but the correlations quickly change to large negative values, and even the eastern part of the domain is significantly influenced by this observation.
In line with the findings from the previous sections, it is concluded that concentration data depend in a complex and rather global than local manner on log-conductivity; this type of non-linear dependence is difficult to exploit with a linear estimator. The problem can only be mitigated to a certain degree by univariate transformation: the global dependence is a combination and interplay of various local dependencies that cannot be pseudo-linearized by a univariate anamorphosis acting on the sum of all dependencies. The shape of the influence function depends strongly on the measurement location; even more spatially variant correlations were found that are not plotted here.
Figure 8.15: Influence function of measurement 5 (concentration) on the parameter field
8.3 Transformation of Different Data Types
Now that the procedure of updating and the effects of transformation have been clarified, the benefits of applying the EnKF to transformed data shall be quantified. The performance will be investigated for the three different state variables and varying synthetic data sets for log-conductivity.
8.3.1 Drawdown
The exemplary discussion of the synthetic data set in Section 8.1 will be resumed here. In addition to the results obtained with transformation technique 1 (transforming the measurement error according to the anamorphosis function, labeled “transf. data”), the performance of transformation technique 2 (determining the transformed measurement error by scaling the measurement error variance according to Equation 6.26, labeled “transf. 2 data”) will be analyzed. Figure 8.16 contrasts, for both techniques, the ratio of the main diagonal of the measurement covariance matrix to the measurement error variance at each measurement location. Transforming the measurement error according to the anamorphosis function produces transformed perturbed measurements that contain a higher portion of measurement error, relative to the total perturbed value, than untransformed measurements do.
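The quantity plotted in Figure 8.16 can be obtained directly from the ensemble of (transformed or untransformed) simulated measurements. A sketch under the assumption that Q_yy is estimated as the sample covariance of the simulated measurement ensemble (names are illustrative):

```python
import numpy as np

def cov_error_ratio(sim_meas_ens, error_var):
    """diag(Q_yy)/diag(R): ensemble variance of the simulated measurements
    divided by the measurement error variance, per measurement location.
    sim_meas_ens: (n_meas, n_real), error_var: (n_meas,)"""
    q_diag = sim_meas_ens.var(axis=1, ddof=1)  # diagonal of sample covariance
    return q_diag / np.asarray(error_var, dtype=float)
```

Ratios well above 1 indicate that the ensemble spread at a location dominates the assumed measurement error.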
[Figure 8.16 plot: diag(Qyy)/diag(R) over measurement number; curves for untr. data, transf. data and transf. 2 data]
Figure 8.16: Ratio of diagonal of measurement covariance matrix and measurement error variance
Best Estimate and Prediction Variance The two differing methods of transforming measurement errors lead to slightly different estimates of the log-conductivity field (Figure 8.17), but a very similar mean drawdown field. Both transformation techniques reproduce the synthetic truth better than the traditional EnKF applied to untransformed data, which underestimates the conductivity in the center of the domain and therefore overestimates the drawdown at the pumping well.
The prediction variance (Figure 8.18) of the log-conductivity ensemble is partly reduced when scaling the measurement error variance; again, both transformation methods achieve a higher reduction of the a priori ensemble variance.
Be aware that the performance of transformation technique 1 depends on the ensemble size and the number of additionally generated perturbed realizations (Equation 6.25), because the measurement error variance R is computed directly from the transformed measurement errors (Equation 6.24); thus the number of transformed measurement errors influences the stability and accuracy of the transformed measurement error statistics. For the test applications presented here, nadd = 10 additional ensembles were used to derive the error statistics.
Figure 8.17: Synthetic log-conductivity and drawdown field and best estimates resulting from different transformation methods in the EnKF
Figure 8.18: A priori ensemble variance of log-conductivity and drawdown field and conditional variances resulting from different transformation methods in the EnKF
Prediction Error To summarize what is indicated by the plots shown above, the RMSE of the estimates with regard to the true fields is computed. When assimilating transformed data and scaling the measurement error variance, the deviation from this specific true field in terms of RMSE is reduced by 8% for the estimated conductivity field and by 24% for the estimated drawdown field.
An overview of the reductions in RMSE obtained in tests with varying synthetic data sets is given in Table 8.1. For drawdown data, transformation method 2 (scaling the measurement error variance) scored slightly better than transformation method 1 in that it produced smaller RMSEs with regard to the true drawdown field. Due to time constraints, a statistically representative number of test cases could not be set up, but the results for these 10 different fields suggest that Gaussian anamorphosis of drawdown data is a promising and successful method to increase the accuracy of updating.
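The RMSE values and reductions reported in Table 8.1 follow the usual definitions; as a sketch:

```python
import numpy as np

def rmse(estimate, truth):
    """Root mean square error between an estimated and a true field."""
    e = np.asarray(estimate, float) - np.asarray(truth, float)
    return float(np.sqrt(np.mean(e ** 2)))

def rmse_reduction(rmse_untransf, rmse_transf):
    """Percent reduction relative to the untransformed run
    (negative values indicate a degradation)."""
    return 100.0 * (rmse_untransf - rmse_transf) / rmse_untransf
```

For field 1 of Table 8.1, for example, rmse_reduction(0.8123, 0.7478) gives about 7.9 %.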
Field     RMSE Log-Conductivity                    RMSE Drawdown
          Untransf.   Transf. 2   Reduction        Untransf.   Transf. 2   Reduction
1         0.8123      0.7478        7.9 %          0.0103      0.0078       24.4 %
2         0.8747      0.8772       -0.3 %          0.0162      0.0098       39.3 %
3         0.7815      0.8010       -2.5 %          0.0093      0.0078       15.8 %
4         0.8437      0.8145        3.5 %          0.0132      0.0091       31.0 %
5         0.7130      0.7085        0.6 %          0.0093      0.0070       24.7 %
6         0.7212      0.6887        4.5 %          0.0333      0.0388      -16.6 %
7         0.6907      0.6799        1.6 %          0.0150      0.0100       33.3 %
8         0.6249      0.6320       -1.1 %          0.1539      0.0375       75.6 %
9         0.7715      0.7481        3.0 %          0.0119      0.0113        5.6 %
10        0.9282      0.9307       -0.3 %          0.0202      0.0235      -16.4 %
Average   0.7762      0.7628        1.7 %          0.0293      0.0163       21.7 %
Table 8.1: RMSE of updated fields with regard to the synthetic truth. Comparison between assimilation of untransformed data and updating with transformed data and scaled measurement error variance. Note that a negative percentage of reduction means an increase in RMSE in the transformed run compared with the untransformed one.
Evaluation of Residuals Another criterion to judge the performance of the filter is the residuals at the measurement locations, although it has to be kept in mind that deviations from the observed value are justified by measurement error and by adherence to the prior (geo-)statistics. The mean of the residuals at one location over all realizations does not necessarily have to be zero, as the filter has to decide on the weight assigned to each observation based on a compromise between prior statistics and data error. Nevertheless, the average of the mean residuals over a sufficiently large number of filter applications to different true fields should approach zero. Because of the limited time frame of this study, these effects could not be verified here, but are assumed to hold.
The first and second moments of the residuals are normalized by the corresponding measurement error standard deviation. Consequently, the standard deviation of the residuals should lie around 1; values above 1 signify excess uncertainty in the reproduction of the data set, values below 1 indicate that the ensemble of simulated values at this location is relatively narrow, i.e., narrower than the measurement error variance would suggest. In addition, skewness and excess kurtosis of the residuals have been determined and are shown in Figure 8.19.
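The four residual statistics of Figure 8.19 can be computed per measurement location as follows (a sketch with standard moment definitions; names are illustrative):

```python
import numpy as np

def residual_stats(simulated, observed, error_std):
    """Mean and standard deviation (both normalized by the measurement
    error standard deviation), skewness and excess kurtosis of the
    residuals of one measurement over all realizations."""
    r = (np.asarray(simulated, float) - observed) / error_std
    mean, std = r.mean(), r.std(ddof=1)
    z = (r - r.mean()) / r.std()  # standardized residuals
    skewness = float(np.mean(z ** 3))
    excess_kurtosis = float(np.mean(z ** 4) - 3.0)
    return mean, std, skewness, excess_kurtosis
```

For Gaussian residuals consistent with the measurement error, all four statistics should lie near (0, 1, 0, 0).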
[Figure 8.19 plot panels: Mean, Standard Deviation, Skewness and Kurtosis of the residuals over measurement number; curves for untr. data, transf. data and transf. 2 data]
Figure 8.19: Statistics of drawdown residuals from different assimilation methods
Overall, residuals are reduced by applying transformation techniques compared to the results for untransformed data. Especially the deviation from the observation at the pumping well and the standard deviation at this location are considerably decreased. This is a valuable finding since the traditional EnKF does not provide reasonable results for drawdown measurements directly at the well and requires the specification of an unrealistically high measurement error at this location in order to achieve acceptable accuracy. With the application of transformation, all available drawdown data can be treated equally.
The standard deviation of residuals for transformed data lies around 1, which corresponds to the measurement error standard deviation; assimilating untransformed data leads to a higher standard deviation of about 1.3. Absolute skewness is reduced by applying transformation; both transformation methods result in a slightly negatively skewed updated data set, while the traditional filter leads to a higher positive skewness. With regard to excess kurtosis, huge differences are noticeable at the well location: the transformed data sets show a much higher excess kurtosis than the untransformed one. If the residuals were Gaussian distributed, as implicitly assumed when multi-Gaussian variables are assimilated by the EnKF, a skewness and excess kurtosis of 0 would be expected. With regard to skewness, this is at least approximated by the transformed data assimilation methods.
Summary It has been demonstrated that the assimilation of transformed drawdown data improves the performance of the filter in terms of smaller deviations from the true field in comparison to the traditional filter with untransformed state variables.
8.3.2 Hydraulic Head
The same synthetic truth as presented above shall now be estimated with the help of head observations.
Best Estimate and Prediction Variance Figure 8.20 shows the best estimates of log-conductivity and hydraulic head after updating with the different assimilation methods. Since transformation methods 1 and 2 show quite similar results, only method 1 will be discussed exemplarily. Although the curvature of the head isolines seems to be reconstructed better without transformation, the overall RMSE leads to a different conclusion: with untransformed data, the RMSE of the estimated head field amounts to 0.0475, which is reduced by 24% when transformation is applied. The prediction variance of the ensembles is presented in Figure 8.21. While the variance of the log-conductivity ensembles is very similar, a slight decrease in variance of the head ensemble is visible for transformed data.
Evaluation of Residuals The statistics of the residuals (Figure 8.22) convey a clearer message: the mean residuals resulting from the assimilation of transformed data are much closer to 0, while the filter applied to untransformed data seems to consistently underestimate the true head field in this specific case.
Figure 8.20: Synthetic log-conductivity and head field and best estimates resulting from different transformation methods in the EnKF
Figure 8.21: A priori ensemble variance of log-conductivity and head field and conditional variances resulting from different transformation methods in the EnKF
[Figure 8.22 plot panels: Mean, Standard Deviation, Skewness and Kurtosis of the residuals over measurement number; curves for untr. data and transf. data]
Figure 8.22: Statistics of head residuals from different assimilation methods
Prediction Error Analysis of the filter performance for the other 9 synthetic data sets indicates that only minor improvement of updating based on head data can be achieved by Gaussian anamorphosis, for both the estimated parameter field and the estimated state field, while a degradation of performance with regard to the head field is observed for a significant fraction of the test cases. This compromising effect has also been observed by Béal et al. [2010]. A possible explanation for the increase in RMSE is that the efficiency of the filter might suffer from the empirical transformation of an already quite Gaussian distributed variable, which introduces unnecessary numerical inaccuracy, especially for small ensemble sizes.
Summary Gaussian anamorphosis of head data is not generally recommended, since it requires an additional (computationally undemanding, but still extra) step in the filtering procedure which in turn does not promise stable and improved results. Yet, transformation might improve the assimilation of strongly boundary-influenced measurement data, under the assumption that Gaussian anamorphosis does not compromise the (at least partially) existing multi-Gaussianity within the head field; this could be investigated further in future research.
8.3.3 Solute Concentration
Tracer tests are an alternative to hydraulic pumping tests for obtaining measurements that allow inference of the underlying hydraulic conductivity field. As discussed above, the dependence structure of concentration data on log-conductivity as well as the bivariate dependence among the data set are of complex shape and cannot be directly exploited by a linear updating process. Nevertheless, a mitigating effect of univariate transformation of concentration data can be demonstrated for the synthetic truth described above.
Handling of Clustered Data Physical bounds on concentration values and data clustering effects play an important role for this data type. The methods used to handle these issues have been introduced in Section 6.3.1. A large fraction of the ensemble taking on a value of zero can, e.g., be found at the most south-western measurement location in the domain. This observation point can hardly be reached by the concentration plume; therefore only a few realizations of parameter fields yield a concentration greater than zero. A histogram of the transformed, clustered data at this location is plotted in Figure 8.23.
[Figure 8.23 histogram: frequency over transformed concentration for the a priori and a posteriori ensemble, with the observation marked]
Figure 8.23: Prior and conditioned ensemble in Gaussian space with data clustering at the lower bound
All values smaller than 10^-6 are assigned a value of 0, which corresponds to a certain minimum transformed value (depending on ensemble size and number of clustered values, see the definition of tails in Section 6.3). The conditioned ensemble (orange) exhibits a narrowed range and its mean has moved closer to the observed value. Note that without clustering, the transformed ensemble would cover the value range from -3.2905 to +3.2905 and show typical Gaussian symmetry, as plotted in Figure 8.3. Handling clustered data is a trade-off between ensuring univariate Gaussianity at this measurement location and ensuring comparability of simulated measurements and synthetic observations, as discussed in Section 6.5.2.
Best Estimate and Prediction Variance The best estimates of the log-conductivity field and the concentration field are shown in Figure 8.24. Despite the difficulties exposed above, the assimilation of transformed data obviously performs better in reconstructing the absence of the concentration plume at the affected observation locations in the eastern part of the domain. Also, the prediction variance of the concentration ensemble, plotted in Figure 8.25, is significantly reduced compared to the traditional filter with untransformed data. These are unexpectedly positive results; for different synthetic data sets, however, degradation of the prediction accuracy can also occur.
Evaluation of Residuals The statistics of the residuals support the positive effect of Gaussian anamorphosis (Figure 8.26): residuals resulting from the assimilation of transformed concentration values show a mean around 0 with smaller amplitudes and a slightly smaller standard deviation.
Prediction Error Based on the analysis of reconstructing different synthetic data sets, it can be stated that concentration updating with transformed data can yield significantly improved results (RMSE of the estimated concentration field reduced by 20 - 40%), but, depending on the field to be reconstructed, results can also deteriorate by up to 60%. Applying transformation method 2 yields slightly better results overall: here the RMSE can be reduced by up to 25%, while degradation is limited to about 35%. With regard to the estimated parameter field, neither of the methods is able to significantly reduce the RMSE; on the contrary, for most synthetic data sets the estimated field scores slightly worse than the field conditioned on untransformed data.
Summary There might be various causes for the unstable success of transformation; e.g., the handling of clustering effects or the definition of the tails of the anamorphosis function could have a major influence on the updating step, in either a desired or an objectionable manner. Additionally, the degree of remaining non-multi-Gaussianity might vary from field to field and might have a non-negligible impact on the linear filtering procedure. These issues should be addressed and investigated in further research in order to exploit the dependence of concentration on log-conductivity more efficiently and thus turn tracer tests into a reliable source of data for estimating hydraulic conditions with the EnKF.
Figure 8.24: Synthetic log-conductivity and concentration field and best estimates resulting from different transformation methods in the EnKF
Figure 8.25: A priori ensemble variance of log-conductivity and concentration field and conditional variances resulting from different transformation methods in the EnKF
[Figure 8.26 plot panels: Mean, Standard Deviation, Skewness and Kurtosis of the residuals over measurement number; curves for untr. data and transf. data]
Figure 8.26: Statistics of concentration residuals from different assimilation methods
8.3.4 Suitability of State Variable Types for Gaussian Anamorphosis
All three data types were used to reconstruct the same synthetic log-conductivity field and its flow and transport variables. In summary, the best estimate of log-conductivity based on head observations matched the true parameter field best, which is attributed to the almost linear dependence between state and parameter. This naturally linear relationship should not be altered by transformation, because the inherent Gaussian structures would probably be disturbed rather than improved towards multi-Gaussianity.
Concentration observations provided the smallest information gain with regard to the parameter field. This is traced back to their complex, non-local dependence on log-conductivity, which cannot be satisfyingly exploited by a linear filter. The degree of mitigation achieved by Gaussian anamorphosis of concentration data depends on the individual true field; a general recommendation in favor of or against transformation cannot be given at this point. Further investigation, especially considering multi-variate transformations, should follow to allow well-founded statements.
The assimilation of drawdown observations yields relatively confident estimates that are close to reality. This state variable shows a strongly non-linear, but local dependence on its primary parameter that can be exploited even more efficiently after transformation. To verify the positive effect of Gaussian anamorphosis, the performance of the EnKF with untransformed as well as transformed drawdown data shall be compared with the performance of the particle filter, which is considered a reference solution for Monte Carlo data assimilation.
8.4 Comparison with Particle Filter as Reference Solution
The accuracy of conductivity estimation by assimilating anamorphosed drawdown data with the EnKF shall be verified by a comparison with the solution of the particle filter, applied to the same synthetic data set as presented in Section 8.1.
The particle filter requires a large ensemble in order to find reasonably reliable weighted statistics. The number of parameter fields is thus chosen to be N = 100,000. To improve convergence of the filter, the number of included drawdown measurements is reduced, and the measurement error standard deviation is now composed of the previously defined absolute part of 0.01 m and an additional relative part of 10%, which results in a measurement-specific standard deviation. The EnKF runs to be compared with the particle filter perform the updating step on a sub-ensemble of N_EnKF = 1,000 realizations to reduce computational time and imitate realistic applications. 9 drawdown measurement locations are installed around the well within a radius of approximately 20 m, which is equivalent to the correlation length of the log-conductivity field.
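The weighted statistics of the particle filter follow from likelihood weights assigned to each particle; a minimal sketch assuming a Gaussian likelihood with independent measurement errors (names and shapes are illustrative):

```python
import numpy as np

def pf_weighted_stats(param_ens, sim_meas, obs, error_std):
    """Likelihood weights for each particle and the resulting weighted
    mean and variance of the parameter ensemble.
    param_ens: (n_real, n_cells), sim_meas: (n_real, n_meas)"""
    misfit = (np.asarray(sim_meas, float) - obs) / error_std
    log_w = -0.5 * np.sum(misfit ** 2, axis=1)  # Gaussian log-likelihood
    w = np.exp(log_w - log_w.max())  # subtract max for numerical stability
    w /= w.sum()
    mean = w @ param_ens
    var = w @ (param_ens - mean) ** 2  # weighted ensemble variance
    return mean, var, w
```

The need for N = 100,000 particles stems from the rapid degeneracy of such weights: with many measurements, only a few particles carry non-negligible weight.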
Best Estimate and Prediction Variance Figure 8.27 displays the synthetic truth (the same field as presented in Section 8.1) and the best estimates of log-conductivity and drawdown. It is clearly visible that both transformation methods result in estimated fields very similar to the result obtained by particle filtering. The synthetic drawdown measurement at the pumping well has been reconstructed successfully by applying the EnKF to transformed data, while conditioning on untransformed data yields an overestimated drawdown and, correspondingly, underestimated log-conductivity values in the vicinity of the well. The prediction variance (Figure 8.28) of the estimated ensembles underlines the positive impact of transformation on the conditioning step: filtering with anamorphosed variables reduces the prediction variance significantly in comparison to traditional updating. The variances of the estimated log-conductivity fields resulting from both transformation methods share similar features with the weighted variance determined by the particle filter. With regard to the prediction variance of the drawdown field, differences between the two transformation techniques are visible: both yield much lower variances than the assimilation of untransformed data, but transformation method 2 achieves an even higher reduction in variance and resembles the reference solution best.
Figure 8.27: Synthetic log-conductivity and drawdown field (upper row) and best estimates resulting from different transformation methods in the EnKF and the particle filter
Figure 8.28: A priori ensemble variance of log-conductivity and drawdown field (upper row) and conditional variances resulting from different transformation methods in the EnKF and the particle filter
Evaluation of Residuals Analyzing the first and second moments of the residuals plotted in Figure 8.29, it can be stated that EnKF assimilation of transformed data yields residuals that follow the trends of the particle filter results. As emphasized before, the assimilation of measurements directly at the well is tremendously improved by Gaussian anamorphosis; the prediction confidence has also been increased.
[Figure 8.29 plot panels: Mean and Standard Deviation of the residuals over measurement number; curves for untr. data, transf. data and PF]
Figure 8.29: Statistics of drawdown residuals resulting from different transformation methods in the EnKF and the particle filter (PF)
Prediction Error and Deviation From Reference Solution Table 8.2 lists the RMSE of the best estimates resulting from the different assimilation methods with regard to both the synthetic truth and the reference solution. The results clearly demonstrate that Gaussian anamorphosis of drawdown data significantly improves the prediction accuracy of both the estimated parameter field and the estimated drawdown field, while at the same time being consistent with the reference solution for stochastic parameter estimation.
EnKF Assimilated Data   RMSE Log-Conductivity          RMSE Drawdown
                        Synth. Truth    PF             Synth. Truth    PF
Untransformed           0.9326          0.1785         0.0121          0.0051
Transformed             0.9014          0.0995         0.0095          0.0022
Transformed 2           0.9149          0.1082         0.0093          0.0019
Table 8.2: RMSE of fields obtained from the three different EnKF assimilation methods with regard to the synthetic truth and the particle filter (PF)
Summary The findings from the comparison with the reference solution confirm that EnKF assimilation of transformed data is a computationally efficient and reasonably accurate alternative to particle filtering, since it requires only a fraction of the ensemble size (in this test case, the EnKF ensemble was 100 times smaller than the PF ensemble). Gaussian anamorphosis proved useful for inverse conductivity estimation based on drawdown measurements, which could be applied, e.g., in hydraulic tomography.
In future research, a similar comparison of particle filtering with EnKF assimilation of transformed concentration data might help to clarify the ambiguous results from Section 8.3.3. Moreover, factors influencing the degree of multi-Gaussianity of transformed concentration data should be analyzed in detail, e.g., the handling of data clustering and the definition of the tails towards physical bounds. The proposed methods are meant to be a first approach that could be developed further toward a successful anamorphosis, given that, in the individual case, a pseudo-linearization of the complex dependence structure can be expected at all.
With regard to updating based on head measurements, it might be worthwhile to investigate transformations that only affect boundary-influenced measurements, but preserve the existing (almost) linear dependence structure in the middle of the domain. This approach should also be verified with the help of the particle filter.
9 Summary, Conclusion and Outlook
Summary The procedure of subsurface parameter estimation by EnKFs applied to transformed data has been discussed in this study. It has been pointed out that groundwater flow and transport variables most often violate the assumption of multi-Gaussianity and that, therefore, optimal behavior of the linear update by an EnKF cannot be expected. To mitigate the effects of non-Gaussian distributions on the performance of filtering, univariate transformation has been suggested to render arbitrarily distributed state variables Gaussian. It is implicitly assumed that Gaussian anamorphosis results in a pseudo-linearization of dependence that can be exploited more efficiently by the EnKF updating step.
Different parametric and non-parametric methods have been presented to construct an appropriate anamorphosis function. Moreover, a possibility to implement physical bounds on the state variable values has been introduced, and data clustering at these bounds has been addressed. The practical implementation of Gaussian anamorphosis in the EnKF analysis scheme has been demonstrated, including an extensive discussion of the transformation of measurement error, which is crucial for both theoretical coherence and practical success of the update.
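The analysis scheme referred to here is the perturbed-observation EnKF update of Burgers et al. [1998]; with anamorphosis, the simulated measurements, observations and measurement errors entering it are simply the transformed quantities. A compact sketch (shapes and names are illustrative, and R is assumed diagonal):

```python
import numpy as np

def enkf_update(params, sim_meas, obs, error_std, rng):
    """One perturbed-observation EnKF analysis step.
    params: (n_param, n_real), sim_meas: (n_meas, n_real),
    obs, error_std: (n_meas,)"""
    n_real = params.shape[1]
    # perturbed observations, one set per realization
    d = obs[:, None] + error_std[:, None] * rng.standard_normal((obs.size, n_real))
    A = params - params.mean(axis=1, keepdims=True)      # parameter anomalies
    Y = sim_meas - sim_meas.mean(axis=1, keepdims=True)  # measurement anomalies
    Q_yy = Y @ Y.T / (n_real - 1)
    Q_sy = A @ Y.T / (n_real - 1)                        # cross-covariance
    K = Q_sy @ np.linalg.inv(Q_yy + np.diag(error_std ** 2))  # Kalman gain
    return params + K @ (d - sim_meas)
```

In the transformed variants, the parameters stay in real space while the measurement-related quantities live in Gaussian space, which is exactly why the transformation of the measurement error matters.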
The impact of anamorphosis on different variable types (drawdown, head and concentration) has been analyzed in detail with regard to changes in the dependence on the parameter field and changes in the influence function that controls updating. The performance of EnKFs applied to these transformed flow and transport state variables has been assessed with numerical test cases. Finally, the substantial improvement in prediction quality achieved by Gaussian anamorphosis of drawdown data has been verified by a comparison with the particle filter solution, which is considered the reference solution for stochastic parameter estimation.
In conclusion, the following effects have been detected:
- The implicit assumption that non-linear dependence can be pseudo-linearized by Gaussian anamorphosis is valid.
- Gaussian anamorphosis of state variables is able to improve the performance of EnKFs for parameter estimation; the degree of improvement depends on the type of variable, the spatial configuration of observations and the true field to be reconstructed.
- Head data naturally show a relatively linear dependence on conductivity; not much improvement can be achieved by Gaussian anamorphosis.
- The dependence structure of concentration is more complex and of non-local nature, impeding a successful interpretation by the updating procedure even after transformation.
- The transformation is most effective for variables that show a strongly non-linear, but mostly local dependence on the parameters.
- Drawdown data show this type of dependence and are assimilated more accurately by EnKFs with transformation than without: the prediction error can be reduced by more than 20%.
- This success suggests estimating conductivity with EnKFs using transformed drawdown data in hydraulic tomography studies.
- Conductivity estimation by EnKFs with transformed drawdown data is an attractive alternative to particle filtering because it is computationally less demanding and similarly accurate.
Further Work The following steps could not yet be carried out, but are of interest for the assessment of the proposed approach:
- Studying the effects of applying transformation only to head measurements that are strongly influenced by boundary conditions and therefore show strongly non-Gaussian behavior, while preserving the existing (almost) linear dependence structure farther away from the boundaries.
- Evaluating the reaction of the proposed method to model bias: a promising approach would be joint inference of uncertain boundary conditions together with conductivity.
- Investigating the possible further improvement of the EnKF for parameter estimation by transformation and assimilation of combined data types.
Outlook Possible directions for future research:
- Transfer of the proposed methodology back to state estimation; this includes inverse Gaussian anamorphosis and ensuring spatially continuous back-transformed state variables.
- Investigation of multi-variate transformations that guarantee multi-Gaussian dependence structures to fully exploit the filter's potential.
Bibliography
M. Abramowitz and I. A. Stegun. Handbook of mathematical functions with formulas, graphs, and mathematical tables. Dover, 1964.
A. Bárdossy and J. Li. Geostatistical interpolation using copulas. Water Resources Research, 44(7), 2008.
D. Béal, P. Brasseur, J.-M. Brankart, Y. Ourmières, and J. Verron. Characterization of mixing errors in a coupled physical biogeochemical model of the North Atlantic: implications for nonlinear estimation using Gaussian anamorphosis. Ocean Science, 6, 2010.
J. Bear. Dynamics of fluids in porous media. American Elsevier, New York, 1972.
A. Bellin and D. Tonina. Probability density function of non-reactive solute concentration in heterogeneous porous formations. Journal of Contaminant Hydrology, 94(1-2), 2007.
J. O. Berger. Statistical decision theory and Bayesian analysis. Springer, 1985.
L. Bertino, G. Evensen, and H. Wackernagel. Sequential data assimilation techniques in oceanography. International Statistical Review, 71(2), 2003.
A. W. Bowman and A. Azzalini. Applied smoothing techniques for data analysis: the kernel approach with S-Plus illustrations. Oxford University Press, USA, 1997.
G. E. P. Box and D. R. Cox. An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological), 26(2), 1964.
G. Burgers, P. J. van Leeuwen, and G. Evensen. Analysis scheme in the ensemble Kalman filter. Monthly Weather Review, 126(6), 1998.
J. Carrera, A. Alcolea, A. Medina, J. Hidalgo, and L. J. Slooten. Inverse problem in hydrogeology. Hydrogeology Journal, 13(1), 2005.
J. P. Chilès and P. Delfiner. Geostatistics: modeling spatial uncertainty. Wiley-Interscience, 1999.
M. P. Clark, D. E. Rupp, R. A. Woods, X. Zheng, R. P. Ibbitt, A. G. Slater, J. Schmidt, and M. J. Uddstrom. Hydrological data assimilation with the ensemble Kalman filter: Use of streamflow observations to update states in a distributed hydrological model. Advances in Water Resources, 31(10), 2008.
W. J. Conover and R. L. Iman. Rank transformations as a bridge between parametric and nonparametric statistics. The American Statistician, 35(3), 1981.
T. M. Cover and J. A. Thomas. Elements of information theory. Wiley, 2006.H. Darcy. Les fontaines publiques de la ville de Dijon. Victor Dalmont, 1856.T. A. Davis. Algorithm 832: UMFPACK V4. 3—an unsymmetric-pattern multifrontalmethod. ACM Transactions on Mathematical Software (TOMS), 30(2), 2004.
F. P. J. de Barros and W. Nowak. On the link between contaminant source releaseconditions and plume prediction uncertainty. J. Cont. Hydrology, 2010. (submitted).
C. R. Dietrich and G. N. Newsam. A fast and exact method for multidimensional Gaussian stochastic simulations. Water Resources Research, 29(8), 1993.
G. Evensen. Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. Journal of Geophysical Research, 99, 1994.
G. Evensen. The ensemble Kalman filter: Theoretical formulation and practical implementation. Ocean Dynamics, 53(4), 2003.
G. Evensen. Data Assimilation: The Ensemble Kalman Filter. Springer Verlag, 2007.
F. N. Fritsch and R. E. Carlson. Monotone piecewise cubic interpolation. SIAM Journal on Numerical Analysis, 17(2), 1980.
K. F. Gauss. Theoria motus corporum coelestium. English translation: Theory of the Motion of the Heavenly Bodies, 1963.
L. W. Gelhar. Stochastic subsurface hydrology. Prentice-Hall, Englewood Cliffs, NJ, 1993.
J. J. Gómez-Hernández and X. H. Wen. To be or not to be multi-Gaussian? A reflection on stochastic hydrogeology. Advances in Water Resources, 21(1), 1998.
N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings. Part F. Radar and Signal Processing, 140(2), 1993.
H. J. Hendricks Franssen, A. Alcolea, M. Riva, M. Bakr, N. van der Wiel, F. Stauffer, and A. Guadagnini. A comparison of seven methods for the inverse modelling of groundwater flow. Application to the characterisation of well catchments. Advances in Water Resources, 32(6), 2009.
T. J. R. Hughes. The finite element method. Prentice-Hall, Englewood Cliffs, NJ, 1987.
R. A. Johnson and D. W. Wichern. Applied multivariate statistical analysis. Prentice-Hall, Englewood Cliffs, NJ, 1988.
A. G. Journel. Nonparametric estimation of spatial distributions. Mathematical Geology, 15(3), 1983.
R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1), 1960.
P. K. Kitanidis. Introduction to geostatistics: applications to hydrogeology. Cambridge University Press, 1997.
R. Krzysztofowicz. Transformation and normalization of variates with specified distributions. Journal of Hydrology, 197(1-4), 1997.
S. P. Neuman. Theoretical derivation of Darcy’s law. Acta Mechanica, 25(3), 1977.
W. Nowak. Best unbiased ensemble linearization and the quasi-linear Kalman ensemble generator. Water Resources Research, 45, 2009.
W. Nowak, R. L. Schwede, O. A. Cirpka, and I. Neuweiler. Probability density functions of hydraulic head and velocity in three-dimensional heterogeneous porous media. Water Resources Research, 44(8), 2008.
J. M. Ortiz, B. Oz, and C. V. Deutsch. A step by step guide to bi-Gaussian disjunctive kriging, 2005.
J. Pearson, R. Goodall, M. Eastham, and C. MacLeod. Investigation of Kalman filter divergence using robust stability techniques. In IEEE Conference on Decision and Control, volume 5, 1997.
E. P. Poeter and M. C. Hill. Inverse models: A necessary next step in ground-water modeling. Ground Water, 35(2), 1997.
D. J. Poirier. Piecewise regression using cubic splines. Journal of the American Statistical Association, 68(343), 1973.
J. Rivoirard. Introduction to disjunctive kriging and non-linear geostatistics. Oxford University Press, USA, 1994.
C. P. Robert and G. Casella. Monte Carlo statistical methods. Springer Verlag, 2004.
A. E. Scheidegger. General theory of dispersion in porous media. Journal of Geophysical Research, 66, 1961.
R. L. Schwede and O. A. Cirpka. Interpolation of steady-state concentration data by inverse modeling. Ground Water, 2010.
E. Simon and L. Bertino. Application of the Gaussian anamorphosis to assimilation in a 3-D coupled physical-ecosystem model of the North Atlantic with the EnKF: a twin experiment. Ocean Science Discussions, 6(1), 2009.
A. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut de Statistique de l’Université de Paris, 8, 1959.
H. W. Sorenson. Least-squares estimation: from Gauss to Kalman. IEEE Spectrum, 7, 1970.
B. L. van der Waerden. Mathematische Statistik. Springer, Heidelberg, 1965.
H. Wackernagel. Multivariate geostatistics: an introduction with applications. Springer Verlag, 2003.
Y. Zhang, G. F. Pinder, and G. S. Herrera. Least cost design of groundwater quality monitoring networks. Water Resources Research, 41(8), 2005.