
Page 1: Model Calibration and Uncertainty Analysis

Calibration, Validation, and Uncertainty In Environmental and

Hydrological Modeling

A Somewhat Bayesian Perspective

Page 2: Model Calibration and Uncertainty Analysis

Outline

1. Definitions

2. Calibration

3. Validation

4. Uncertainty Analysis

Page 3: Model Calibration and Uncertainty Analysis

Definitions

Let’s say we want to model the real system S (for example, a watershed), so that we can estimate Q (for example, streamflow or water quality).

We have a model M meant to simulate the behavior of system S, based on our knowledge of the processes involved (their causal structure, the mathematical relationships between variables, etc.).

The model M has a number of parameters:

ρ1, ρ2, ρ3, ..., ρn

Let Φ be the set of parameters {ρ1, ρ2, ρ3, ..., ρn}.

Page 4: Model Calibration and Uncertainty Analysis

Definitions (continued)

When we run the model we set the parameters to particular values:

ρ1 = p1, ρ2 = p2, ..., ρn = pn

Let P be the set of particular parameter values we use, {p1, p2, ..., pn}, such that when we run the model we set

Φ = P

The model also depends on initial and boundary conditions Θ, which we set to I.

Page 5: Model Calibration and Uncertainty Analysis

Definitions (continued)

Let QM be our estimate for Q based on model M. We can think of M as a function operating on parameters P and initial conditions I that returns an estimate for Q.

QM = M(P, I)

And let Qobs be our measurements of Q in the real system S.
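To make the notation concrete, here is a minimal sketch of M as a function, in Python, using a made-up two-parameter ‘bucket’ model; the parameter names k and s0 and all the numbers are purely illustrative, not taken from any real model:

```python
import numpy as np

def model_M(P, I):
    """Toy M(P, I): returns an estimated streamflow series Q_M.

    P: parameter values, e.g. {"k": 0.3, "s0": 10.0}
       (k = fraction of storage released per step, s0 = initial storage).
    I: rainfall series, standing in for the initial/boundary conditions.
    """
    k, s = P["k"], P["s0"]
    q = np.empty(len(I))
    for t, rain in enumerate(I):
        s += rain        # storage receives the rainfall
        q[t] = k * s     # a fixed fraction of storage becomes streamflow
        s -= q[t]        # and leaves the store
    return q

rainfall = np.array([0.0, 5.0, 12.0, 3.0, 0.0, 0.0])   # I
Q_M = model_M({"k": 0.3, "s0": 10.0}, rainfall)        # Q_M = M(P, I)
print(Q_M)
```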

Page 6: Model Calibration and Uncertainty Analysis

Calibration

Calibration is not a rigorously defined term in the context of modelling, but usually what we mean by it is something like: finding a set of parameters P that will make the model M behave similarly to the real system S under a certain range of initial and boundary conditions.

(The specific range of conditions depends on the purpose of the model; it is important for a flood model to match the real system closely under extreme flow conditions, but not so important for, say, SWAT models.)

Page 7: Model Calibration and Uncertainty Analysis

Calibration (continued)

Since our model M is not a perfect representation of the real system S, there will be some error εM in our estimate QM:

Qobs = QM + εM = M(P, I) + εM

Naively, we may think the goal of ‘calibration’ should be to choose the parameter set P that minimizes εM, that is, the difference between our estimate of Q and the real Q. But there are complications.

Page 8: Model Calibration and Uncertainty Analysis

Calibration (continued)

The main problem with the approach of simply minimizing εM is that εM is dependent on I, the initial and boundary conditions, so there is no guarantee that the parameter set P that minimizes εM for a given set of initial and boundary conditions (for which we have observations Qobs) will minimize εM under different conditions.

Page 9: Model Calibration and Uncertainty Analysis

Calibration (continued)

Practically speaking, taking SWAT as an example: rainfall data is part of the boundary conditions of a SWAT model. Let’s say we have rainfall data and streamflow observations for 1990-2000, and we select the parameter set that minimizes the difference between our observed and estimated streamflow for that period. There is no guarantee that that same parameter set will minimize the difference under different rainfall conditions, for example for the years 2000-2010.

Page 10: Model Calibration and Uncertainty Analysis

Calibration (continued)

This approach (minimizing εM, or a function of εM, such as RMSE) often leads to overfitting.

We can (somewhat) deal with overfitting by limiting the range of values that the parameters ρi can take to ‘realistic’ values.

We can check for overfitting, and more generally assess how good our model is at predicting the variables of interest, by testing/validation (we’ll get to that later).

Page 11: Model Calibration and Uncertainty Analysis

Calibration (continued)

Depending on our goals, there are ‘performance indices’ other than RMSE that can be used to measure how good we think our model is, for example the Nash-Sutcliffe Efficiency (NSE) index, which hydrologists like to use.
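For reference, here is a minimal sketch of the two indices mentioned above, using their standard formulas (the observed and simulated values are invented for illustration):

```python
import numpy as np

def rmse(q_obs, q_sim):
    """Root mean square error between observed and simulated series."""
    q_obs, q_sim = np.asarray(q_obs), np.asarray(q_sim)
    return np.sqrt(np.mean((q_obs - q_sim) ** 2))

def nse(q_obs, q_sim):
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit, 0 means the model is
    no better than predicting the observed mean, < 0 is worse than that."""
    q_obs, q_sim = np.asarray(q_obs), np.asarray(q_sim)
    return 1.0 - np.sum((q_obs - q_sim) ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)

q_obs = np.array([2.1, 3.4, 8.0, 5.2, 3.0])
q_sim = np.array([2.5, 3.0, 7.1, 5.8, 2.6])
print(rmse(q_obs, q_sim), nse(q_obs, q_sim))
```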

Often when people do ‘manual calibration’ they will also use visual inspection of graphs to determine if the model behaves similarly to the real system. In some sense, ‘how similar do the graphs look’ is a (very fuzzy) performance index.

Page 12: Model Calibration and Uncertainty Analysis

Calibration (continued)

In practice, environmental modellers seem to do one of a few things for ‘calibration’:

1. Identify the most sensitive parameters (by sensitivity analysis or from the literature), define ‘reasonable’ ranges for those parameters, select an objective function (something like NSE or RMSE, a combination of indices, or something else; this is fairly model- and application-specific), and auto-calibrate with software that implements optimization algorithms to minimize the objective function.

2. Do manual calibration of most sensitive parameters based on a mix of formal performance indices (NSE, etc.) and visual inspection.

3. A mix of 1 and 2.

Page 13: Model Calibration and Uncertainty Analysis

Calibration (continued)

It looks to me like autocalibration ought to be strictly better than manual calibration.

For one thing, with manual calibration we only change one parameter at a time, so it’s easy to miss some areas of improvement. Let’s say we start with parameters ρ1 = α, ρ2 = β; it’s possible that both (α*, β) and (α, β*) are worse than (α, β), but that (α*, β*) is better than (α, β); we would never discover this by manual calibration since we only change one parameter at a time.
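A tiny numerical illustration of this point (the objective function below is completely made up, chosen only so that the two parameters interact):

```python
def objective(a, b):
    """Toy objective with interacting parameters (lower is better)."""
    return (a - b) ** 2 + 0.1 * (a + b - 4) ** 2

a, b, step = 1.0, 1.0, 1.0
print(objective(a, b))                # 0.4 -> current fit at (alpha, beta)
print(objective(a + step, b))         # 1.1 -> changing only alpha is worse
print(objective(a, b + step))         # 1.1 -> changing only beta is worse
print(objective(a + step, b + step))  # 0.0 -> changing both together is best
```

An optimizer that searches the parameter space jointly at least has a chance of finding the (α*, β*) corner; one-at-a-time tweaking never will.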

But in practice it’s not rare for people to get better results with manual calibration, or a mix of both (starting with auto-calibration then tweaking by hand).

Page 14: Model Calibration and Uncertainty Analysis

Calibration (continued)

So in short, ‘calibration’ usually means optimizing the set of parameters P, or a subset P*, on one or more objective functions that we hope capture the system behaviour we care about. (A minimal sketch follows the list below.)

So when we ‘calibrate’ we need to choose:

• One or more objective functions

• The parameters we want to optimize (optimizing all parameters is not feasible for models with a large number of parameters)

• The optimization procedure
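As a concrete, heavily simplified sketch of that recipe: a toy two-parameter model stands in for the real model, synthetic data stand in for Qobs, RMSE is the objective function, the parameter ranges are the ‘realistic’ bounds, and scipy’s differential evolution is the optimization procedure. Everything specific (the model, the parameter names k and s0, the numbers) is an illustrative assumption, not a recommendation.

```python
import numpy as np
from scipy.optimize import differential_evolution

def model_M(params, rainfall):
    """Toy two-parameter 'bucket' model standing in for a real hydrological model."""
    k, s = params
    q = np.empty(len(rainfall))
    for t, rain in enumerate(rainfall):
        s += rain
        q[t] = k * s
        s -= q[t]
    return q

rng = np.random.default_rng(42)
rainfall = rng.gamma(shape=0.8, scale=5.0, size=200)                # boundary conditions I
q_obs = model_M((0.35, 8.0), rainfall) + rng.normal(0.0, 0.3, 200)  # synthetic 'observations'

def objective(params):
    """Objective function: RMSE between observed and simulated flow."""
    return np.sqrt(np.mean((q_obs - model_M(params, rainfall)) ** 2))

bounds = [(0.01, 0.99), (0.0, 50.0)]         # 'realistic' ranges for k and s0
result = differential_evolution(objective, bounds, seed=1)
print(result.x, result.fun)                  # calibrated parameters and their RMSE
```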

Page 15: Model Calibration and Uncertainty Analysis

Validation

The goal of validation is to assess whether the model behaves reasonably closely to the real system (where ‘reasonably closely’ depends on the model purpose).

If we have ‘calibrated’ the model, we already know how well it reproduces system behaviour under the initial and boundary conditions of calibration, so now we are interested in whether the model can successfully reproduce behaviour under other conditions.

To do that, we usually use only a subset of the data for calibration and test on the rest.
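In code, the split itself is trivial; the discipline of touching the held-out slice only once is the hard part (the record below is invented for illustration):

```python
import numpy as np

# Hypothetical record: one observed flow value per year, 1990-2009.
years = np.arange(1990, 2010)
q_obs = np.random.default_rng(0).gamma(2.0, 3.0, size=len(years))

calib_mask = years < 2000          # calibrate on 1990-1999 ...
test_mask = ~calib_mask            # ... and hold out 2000-2009 for testing

q_obs_calib, q_obs_test = q_obs[calib_mask], q_obs[test_mask]
# Calibrate the model using only q_obs_calib, then run it over the test
# period and compare the simulated flows against q_obs_test (never the
# other way around).
print(len(q_obs_calib), "calibration values,", len(q_obs_test), "test values")
```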

Page 16: Model Calibration and Uncertainty Analysis

Cross-validation

Divide the data into n sets (say, 12), then for each set: use the other n-1 sets for calibration, then use the selected set for testing.

This is like standard ‘validation’ but you get to repeat it n times, so it’s a bit more robust.

Not really a ‘Bayesian’ method, but it’s fairly standard to do this and somewhat better than the alternative of simple validation.
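A minimal sketch of splitting a multi-year record into contiguous three-year folds, along the lines of the example on the next slide (the calibration and evaluation steps are left as comments, since they depend on the model):

```python
import numpy as np

years = np.arange(1978, 2014)              # years available after the warm-up period
n_folds = 12
folds = np.array_split(years, n_folds)     # contiguous blocks of three years each

for i, test_years in enumerate(folds, start=1):
    calib_years = np.concatenate([f for j, f in enumerate(folds, start=1) if j != i])
    # Calibrate on calib_years, then evaluate performance on test_years only.
    print(f"Fold {i}: test on {test_years[0]}-{test_years[-1]}, "
          f"calibrate on the other {len(calib_years)} years")
```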

Page 17: Model Calibration and Uncertainty Analysis

Example: 12-Fold Cross-Validation (hydrology)

The years 1974-1977 are used as a warm-up period in every fold. The remaining years, 1978-2013, are split into twelve three-year blocks; in each fold one block is held out for testing and all the other blocks are used for calibration:

Fold 1: test on 2011-2013
Fold 2: test on 2008-2010
Fold 3: test on 2005-2007
Fold 4: test on 2002-2004
Fold 5: test on 1999-2001
Fold 6: test on 1996-1998
Fold 7: test on 1993-1995
Fold 8: test on 1990-1992
Fold 9: test on 1987-1989
Fold 10: test on 1984-1986
Fold 11: test on 1981-1983
Fold 12: test on 1978-1980

Page 18: Model Calibration and Uncertainty Analysis

Some Pitfalls

Contamination: If you decide to use the ‘training and testing’ approach for validation, you are not supposed to look at your testing data when calibrating your model; you use it for testing, once, and then never use the data from that testing set again.

If you get bad results and decide that you need to rework your model, you cannot re-use the same test data.

I’ve worked on projects with poor methodology where we would train, test, then upon seeing that we were bad at predicting the test set, try a different model/different calibration method, re-train, re-test, etc. The model ended up doing moderately well at predicting the test set, but, mysteriously, was terrible at predicting observations outside the test set.

Page 19: Model Calibration and Uncertainty Analysis

Some Pitfalls (continued)

In practice, this means you have to make sure you have a solid calibration methodology before you start validation, especially if you don’t have a lot of data; otherwise you will be tempted to re-use the same test data, and you will contaminate your results.

And then your model may do much worse at prediction than expected, because you will have indirectly calibrated it on your testing set.

Page 20: Model Calibration and Uncertainty Analysis

Uncertainty Analysis

In uncertainty analysis we try to answer the question of how certain we can be about our model predictions.

Let’s say our model predicts that streamflow will be 5.0 m3/s. How surprised should we be if the measured streamflow in reality ends up being 3.0 m3/s? How about 1.5 m3/s? 17 m3/s?

The ‘best’ way to do this is to have a model that gives us a fully specified posterior distribution instead of point estimates.

Unfortunately, hydrological models usually don’t.
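Purely as an illustration of what a predictive distribution buys us, suppose (this is an assumption for the example, not something the usual models provide) that instead of the point estimate 5.0 m3/s we had a normal predictive distribution with mean 5.0 and standard deviation 1.5 m3/s. Then ‘how surprised should we be?’ becomes a calculation:

```python
from scipy.stats import norm

# Assumed predictive distribution for streamflow: N(mean=5.0, sd=1.5) m3/s.
predictive = norm(loc=5.0, scale=1.5)

for q_measured in (3.0, 1.5, 17.0):
    # Two-sided tail probability: how unusual is this observation
    # if the predictive distribution is to be believed?
    tail = 2 * min(predictive.cdf(q_measured), predictive.sf(q_measured))
    print(f"{q_measured:5.1f} m3/s -> tail probability {tail:.2g}")
```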

Page 21: Model Calibration and Uncertainty Analysis

Sources of Errors and Uncertainty

• Parameter uncertainty

• Commensurability errors

• Measurement errors

• Structural errors

• Random errors

Page 22: Model Calibration and Uncertainty Analysis

Sources of Errors and Uncertainty (continued)

See Environmental Modelling: An Uncertain Future? (Beven, 2009), pp. 40-43, for a good discussion of sources of error.

Page 23: Model Calibration and Uncertainty Analysis

A Side Note on Structure & Parameters

From a math perspective, there isn’t really a sharp distinction between structure and parameters. We use that language for convenience. For instance, let’s say in my model I use the function F(x) to compute the infiltration rate (IR). I could change the ‘structure’ of the model to use the function G(x) instead.

But I could also let the infiltration rate be:

IR = αF(x) + (1-α)G(x)

And now the ‘structure’ of the model depends on parameter α.

When α = 1, IR = F(x).

When α = 0, IR = G(x).
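A minimal sketch of the same blending idea (F, G, and the numbers are arbitrary placeholders, not real infiltration functions):

```python
def F(x):
    """One candidate infiltration function (illustrative only)."""
    return 2.0 * x

def G(x):
    """An alternative infiltration function (illustrative only)."""
    return x ** 0.5

def infiltration_rate(x, alpha):
    """Blend of the two 'structures'; alpha is now just another parameter."""
    return alpha * F(x) + (1.0 - alpha) * G(x)

print(infiltration_rate(4.0, alpha=1.0))   # 8.0 -> pure F(x)
print(infiltration_rate(4.0, alpha=0.0))   # 2.0 -> pure G(x)
print(infiltration_rate(4.0, alpha=0.5))   # 5.0 -> somewhere in between
```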

Page 24: Model Calibration and Uncertainty Analysis

Bayesian Parameter Estimation

See Data Analysis: A Bayesian Tutorial, chapters 2-3, for a good explanation of Bayesian parameter estimation.

Page 25: Model Calibration and Uncertainty Analysis

Bayesian Parameter Estimation (continued)

Crucially, the likelihood function depends on what distribution we assume εM to have. Deriving the likelihood function is non-trivial unless we make some strong assumptions. For example, the task is relatively straightforward if we assume normally distributed, independent errors:

Qobs(x, t) = M(x, t, P, I) + εM(x, t)

εM(x, t) ~ N(μ, σ²)
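Under those assumptions (zero-mean NID errors with a known standard deviation σ), the log-likelihood of a parameter set is just the sum of normal log-densities of the residuals, and Bayes’ rule can be applied directly. A minimal sketch with a made-up one-parameter toy model, a known σ, and a flat prior evaluated on a grid; all of these are simplifying assumptions for illustration:

```python
import numpy as np
from scipy.stats import norm

def model_M(k, rainfall, s0=10.0):
    """Toy one-parameter model standing in for the real M(P, I)."""
    s, q = s0, np.empty(len(rainfall))
    for t, rain in enumerate(rainfall):
        s += rain
        q[t] = k * s
        s -= q[t]
    return q

rng = np.random.default_rng(3)
rainfall = rng.gamma(0.8, 5.0, size=150)
sigma = 0.4                                    # assumed error standard deviation
q_obs = model_M(0.3, rainfall) + rng.normal(0.0, sigma, 150)

def log_likelihood(k):
    """log p(Qobs | k), assuming zero-mean NID errors with sd sigma."""
    residuals = q_obs - model_M(k, rainfall)
    return norm.logpdf(residuals, loc=0.0, scale=sigma).sum()

# Posterior over k on a grid, with a flat prior on [0.05, 0.95]:
k_grid = np.linspace(0.05, 0.95, 181)
log_post = np.array([log_likelihood(k) for k in k_grid])   # + a constant log prior
post = np.exp(log_post - log_post.max())                   # unnormalized density
dk = k_grid[1] - k_grid[0]
post /= post.sum() * dk                                    # normalize on the grid
print("posterior mean of k:", (k_grid * post).sum() * dk)
```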

Page 26: Model Calibration and Uncertainty Analysis

Are my model errors reasonably represented by an NID (normal, independent) error model?

Probably not.

But the problem is that if we do not assume NID errors, most of standard statistics flies out the window.

There are slightly weirder error models that have well-known solutions (e.g. autocorrelated Gaussian errors).

But for almost any moderately weird error model, there is no known solution.

The difficulty is in finding a form for the error model that is both a fair representation of reality and simple enough to be analytically tractable; unfortunately, reality is complicated.
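For reference, the autocorrelated Gaussian case mentioned above does have an exact likelihood. A sketch for an AR(1) error model, where each residual depends on the previous one (the simulated residuals and the values of rho and sigma are made up for the demonstration):

```python
import numpy as np
from scipy.stats import norm

def ar1_log_likelihood(residuals, rho, sigma):
    """Log-likelihood of residuals under an AR(1) Gaussian error model:
    e_t = rho * e_{t-1} + w_t,  w_t ~ N(0, sigma^2),  |rho| < 1.
    """
    r = np.asarray(residuals)
    # First residual: stationary distribution of the AR(1) process.
    ll = norm.logpdf(r[0], loc=0.0, scale=sigma / np.sqrt(1.0 - rho ** 2))
    # Each later residual, conditional on the previous one.
    ll += norm.logpdf(r[1:], loc=rho * r[:-1], scale=sigma).sum()
    return ll

# Strongly autocorrelated residuals score much higher under rho = 0.8
# than under the independent-errors special case rho = 0.
rng = np.random.default_rng(7)
w = rng.normal(0.0, 0.5, 300)
r = np.empty(300)
r[0] = w[0]
for t in range(1, 300):
    r[t] = 0.8 * r[t - 1] + w[t]
print(ar1_log_likelihood(r, rho=0.8, sigma=0.5))
print(ar1_log_likelihood(r, rho=0.0, sigma=0.5))
```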

Page 27: Model Calibration and Uncertainty Analysis

OK, this looks hard, but I still want to address uncertainty

The short answer is: your errors are very unlikely to be NID, but it looks like Bayesian methods can still work reasonably well by assuming NID even when reality is far from NID (see: Naive Bayes), because of magic¹, and if you do assume NID you can probably get OK results.

Using an autocorrelated error model or some other relatively standard error model may or may not be better; it depends on your model, and I don’t really know at this point.

¹ By ‘magic’ I mean: complicated mathematical reasons I do not understand.

Page 28: Model Calibration and Uncertainty Analysis

That still sounds too complicated

You’ll be happy to hear that many environmental modellers don’t bother with the whole business of formal Bayesian analysis (i.e. picking a reasonable error model and deriving a likelihood function), because it’s hard, and instead pick an arbitrary likelihood function.

This is the method known as GLUE (Generalized Likelihood Uncertainty Estimation).
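The usual GLUE recipe is: Monte Carlo sample parameter sets from their prior ranges, score each one with an informal likelihood measure (often NSE), discard ‘non-behavioural’ sets below a threshold, and weight the rest to get parameter and prediction bounds. A minimal sketch under those conventions, with a toy model, synthetic observations, NSE as the informal likelihood, and an arbitrary behavioural threshold of zero; all of these specific choices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(11)

def model_M(params, rainfall):
    """Toy two-parameter model standing in for the real simulator."""
    k, s = params
    q = np.empty(len(rainfall))
    for t, rain in enumerate(rainfall):
        s += rain
        q[t] = k * s
        s -= q[t]
    return q

def nse(q_obs, q_sim):
    """Nash-Sutcliffe Efficiency, used here as the informal likelihood measure."""
    return 1.0 - np.sum((q_obs - q_sim) ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)

rainfall = rng.gamma(0.8, 5.0, size=200)
q_obs = model_M((0.35, 8.0), rainfall) + rng.normal(0.0, 0.3, 200)

# 1. Monte Carlo sample parameter sets from their 'reasonable' prior ranges.
n = 5000
samples = np.column_stack([rng.uniform(0.01, 0.99, n), rng.uniform(0.0, 50.0, n)])
sims = np.array([model_M(p, rainfall) for p in samples])

# 2. Score each set with the informal likelihood; keep the 'behavioural' ones.
scores = np.array([nse(q_obs, q) for q in sims])
behavioural = scores > 0.0                 # arbitrary behavioural threshold
weights = scores[behavioural]
weights /= weights.sum()

# 3. Likelihood-weighted prediction bounds at a chosen time step (t = 0 here).
flows = sims[behavioural, 0]
order = np.argsort(flows)
cum = np.cumsum(weights[order])
lower, upper = flows[order][np.searchsorted(cum, [0.025, 0.975])]
print(behavioural.sum(), "behavioural parameter sets")
print("GLUE '95%' bounds for the first time step:", lower, upper)
```

Note that the likelihood measure, the threshold, and the weighting scheme are all subjective choices; that subjectivity is exactly the criticism raised below.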

Page 29: Model Calibration and Uncertainty Analysis

GLUE (Generalized Likelihood Uncertainty Estimation)

Or, as I like to call it: I Can’t Believe It’s Not Bayes!

Pros:

• Easy to use, if you pick an arbitrary likelihood function, which is what most people do.

• Relatively well accepted in the field. There are something like 500 studies using GLUE out there; your reviewers will probably be OK with it.

Cons:

• It is statistically meaningless unless you pick a formally derived likelihood function, at which point you are back to doing Bayesian analysis, so why are you even using GLUE?

Page 30: Model Calibration and Uncertainty Analysis

PEST (Model Independent Parameter Estimation)

As far as I can tell, PEST is a collection of algorithms that can be used for a variety of things including sensitivity analysis, parameter estimation and uncertainty analysis.

Their documentation is focused a lot more on optimization algorithms than on statistics, so I’m not quite sure yet what it actually does.

Page 31: Model Calibration and Uncertainty Analysis

What PEST claims to do

Instead of minimizing the model error when calibrating, PEST tries to minimize error variance.

Additionally, PEST can be set up with a sort of ‘target error’ to avoid overfitting. For example, if, from experience, we know that the sort of model we are working with has 10% error, we can set up PEST to aim for ~10% error when calibrating – doing any better than that would probably be overfitting.

This sort of thing is not strictly Bayesian but it does address some of the pitfalls of standard calibration techniques.

I think PEST can also be set up to do Bayesian parameter estimation but I’m not 100% sure.

Page 32: Model Calibration and Uncertainty Analysis

A Side Note on Numerical Methods

Besides the problem of ‘conducting statistical inference correctly’, there’s the problem that conducting the inference usually requires solving analytically intractable mathematics, so we need to use numerical methods.

Numerical methods and algorithms are a whole separate subject from statistical inference, driven by the practical consideration of finding an (approximate) solution to a numerical problem in a reasonable time given limited computational resources.

So for example, we can use Monte Carlo methods to do Bayesian computations, but Monte Carlo methods are not inherently Bayesian, they are just a class of algorithms that are useful for solving certain numerical problems, including the sort of problems that come up when doing Bayesian analysis.
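For instance, here is Monte Carlo doing something with no Bayesian content at all: estimating an ordinary expectation by averaging over random draws (the integrand is arbitrary, chosen because the exact answer, 1/sqrt(3), is easy to check against):

```python
import numpy as np

# Plain Monte Carlo: estimate E[exp(-X^2)] for X ~ N(0, 1)
# by averaging the integrand over a large number of random draws.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1_000_000)
print(np.exp(-x ** 2).mean())      # converges to 1/sqrt(3) ≈ 0.5774
```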

Page 33: Model Calibration and Uncertainty Analysis

GLUE Again

So, what is GLUE?

It’s a bit confusing because GLUE is both:

• A not quite Bayesian statistical method

• An implementation of said method using a Monte Carlo algorithm

Most of the literature in environmental modeling does not make a sharp distinction between statistical methods and the algorithms used to solve them, which is very confusing.

Page 34: Model Calibration and Uncertainty Analysis

Thoughts on GLUE

In practice I still suspect GLUE is better than simple calibration.

The likelihood function used in GLUE is usually somewhat arbitrary, but then so is the objective function used in calibration.

I suspect GLUE is more robust, since we end up selecting multiple parameter sets (and weighting them based on the likelihood measure) instead of just one.

One must simply be careful not to mistake the likelihood measure given by GLUE, which looks like a probability, for an actual probability (so, for instance, there is no reason to think that 95% of observations should fall within the ‘95% interval’ produced by GLUE).

On the other hand, one shouldn’t trust 95% intervals produced from Bayesian analysis too much either, because they usually only account for parameter uncertainty.

Page 35: Model Calibration and Uncertainty Analysis

For more information

The problem with GLUE, and an example of a correctly derived likelihood function:
Stedinger, J. R., Vogel, R. M., Lee, S. U. & Batchelder, R. (2008). Appraisal of the generalized likelihood uncertainty estimation (GLUE) method. Water Resources Research, 44.

For more info on parameter estimation and other Bayesian methods:
Sivia, D. S. & Skilling, J. (2006). Data Analysis: A Bayesian Tutorial, 2nd edition. Oxford Science Publications.