Quantile regression as a means of calibrating and verifying a mesoscale NWP ensemble
Tom Hopson1
Josh Hacker1, Yubao Liu1, Gregory Roux1, Wanli Wu1, Jason Knievel1, Tom Warner1, Scott Swerdlin1,
John Pace2, Scott Halvorson2
1National Center for Atmospheric Research
2U.S. Army Test and Evaluation Command
Outline
I. Motivation: ensemble forecasting and post-processing
II. E-RTFDDA for Dugway Proving Ground
III. Introduction to Quantile Regression (QR; Koenker and Bassett, 1978)
IV. Post-processing procedure
V. Verification results
VI. Warning: dynamically tuning ensemble dispersion can put ensemble-mean utility at risk
VII. Conclusions
Goals of an EPS
• Predict the observed distribution of events and atmospheric states
• Predict the uncertainty in the day's prediction
• Predict the extreme events that are possible on a particular day
• Provide a range of possible scenarios for a particular forecast
1. Greater accuracy of the ensemble-mean forecast (half the error variance of a single forecast)
2. Likelihood of extremes
3. Non-Gaussian forecast PDFs
4. Ensemble spread as a representation of forecast uncertainty
=> All rely on the forecasts being calibrated
Further, calibration is essential for tailoring forecasts to a local application:
-- NWP provides spatially and temporally averaged gridded forecast output
-- Applying gridded forecasts to point locations requires location-specific calibration to account for the local spatial and temporal scales of variability (=> increasing ensemble dispersion)
More technically …
Dugway Proving Ground, Utah: e.g., T thresholds
• Includes random and systematic differences between members.
• Not an actual chance of exceedance unless calibrated.
Challenges in probabilistic mesoscale prediction
• Model formulation
  • Bias (marginal and conditional)
  • Lack of variability caused by truncation and approximation
  • Non-universality of closures and forcing
• Initial conditions
  • Small scales are damped in analysis systems, and the model must develop them
  • Perturbation methods designed for medium-range systems may not be appropriate
• Lateral boundary conditions
  • After short time periods the lateral boundary conditions can dominate
  • Representing uncertainty in the lateral boundary conditions is critical
• Lower boundary conditions
  • Dominate the boundary-layer response
  • Difficult to estimate uncertainty in lower boundary conditions
RTFDDA and Ensemble-RTFDDA
Liu et al. 2010: AMS Annual Meeting, 14th IOAS-AOLS, Atlanta, GA, January 18-23, 2010. Contact: yliu@ucar.edu
The Ensemble Execution Module
[Diagram: N parallel RTFDDA members (Member 1 … Member N), each driven by its own perturbations and observations, produce 36-48 h forecasts that feed post-processing, archiving and verification, and input to decision-support tools]
Operated at US Army DPG since Sep. 2007
[Map: three nested forecast domains D1, D2, D3]
Surface and X-sections – Mean, Spread, Exceedance Probability, Spaghetti, …
[Panels: likelihood of wind speed > 10 m/s; mean T and wind; T mean and SD; wind speed; 2-m T; wind rose]
Pin-point Surface and Profiles – Mean, Spread, Exceedance probability, spaghetti, Wind roses, Histograms …
Real-time Operational Products for DPG
Forecast "calibration" or "post-processing"
[Figure: forecast PDFs of probability versus flow rate [m3/s], before and after calibration, with the observation marked; both the "bias" and the "spread" or "dispersion" of the forecast PDF change]
Post-processing has corrected:
• the "on average" bias
• the under-representation of the 2nd moment of the empirical forecast PDF (i.e., corrected its "dispersion" or "spread")
Our approach:
• the under-utilized "quantile regression" approach
• a probability distribution function that "means what it says"
• daily variations in the ensemble dispersion relate directly to changes in forecast skill => an informative ensemble skill-spread relationship
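Quantile regression rests on the asymmetric "pinball" (check) loss: the constant that minimizes it is exactly the empirical tau-quantile, which is the property QR generalizes to regressors. A minimal numpy sketch on synthetic temperatures (illustrative only, not the study's data or code):

```python
import numpy as np

def pinball_loss(q_pred, y, tau):
    """Mean check (pinball) loss for a constant quantile estimate q_pred."""
    e = y - q_pred
    return np.mean(np.where(e >= 0, tau * e, (tau - 1) * e))

rng = np.random.default_rng(0)
y = rng.normal(loc=280.0, scale=2.0, size=5000)  # synthetic 2-m temperatures [K]
tau = 0.9

# Minimize the check loss over candidate constants: the minimizer recovers
# the empirical tau-quantile of the sample.
grid = np.linspace(y.min(), y.max(), 2001)
losses = np.array([pinball_loss(g, y, tau) for g in grid])
best = grid[losses.argmin()]

print(best, np.quantile(y, tau))  # the two agree closely
```

Conditioning the quantile on regressors (ensemble mean, spread, persistence) replaces the constant with a linear predictor fit by minimizing the same loss.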
Example of Quantile Regression (QR)
Our application: fitting T quantiles using QR conditioned on:
1) ranked forecast ensemble
2) ensemble mean
3) ensemble median
4) ensemble stdev
5) persistence
[Figure: time series of ensemble temperature forecasts T [K] and observations]
Regressor set:
1. reforecast ensemble
2. ensemble mean
3. ensemble stdev
4. persistence
5. LR quantile (not shown)
[Figure: climatological PDF, probability/K versus temperature [K]]
Step 1: Determine climatological quantiles.
Step 2: For each quantile, use "forward step-wise cross-validation" to iteratively select the best regressor subset. Selection requirements: a) the QR cost function is minimized; b) the binomial distribution is satisfied at 95% confidence. If the requirements are not met, retain the climatological "prior".
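The per-quantile selection loop might be sketched as below. This is a hedged reconstruction, not the authors' code: `fit_qr`, `forward_select`, and the synthetic regressors are my own names and assumptions, each candidate QR model is fit by directly minimizing the check loss (least-squares start, Nelder-Mead refinement), and a simple train/hold-out split stands in for the full cross-validation and binomial test.

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(e, tau):
    """Mean check (pinball) loss of a residual vector e at quantile tau."""
    return np.mean(np.where(e >= 0, tau * e, (tau - 1) * e))

def fit_qr(X, y, tau):
    """Linear quantile regression: minimize the check loss directly,
    starting from the least-squares solution."""
    b0, *_ = np.linalg.lstsq(X, y, rcond=None)
    return minimize(lambda b: check_loss(y - X @ b, tau), b0,
                    method="Nelder-Mead").x

def forward_select(cands, y, tau, n_train):
    """Greedy forward selection: add the candidate regressor that most
    reduces the held-out check loss; stop when none improves it."""
    chosen, best_loss = [], np.inf
    while True:
        pick = None
        for name, col in cands.items():
            if name in chosen:
                continue
            cols = [cands[c] for c in chosen] + [col]
            X = np.column_stack([np.ones_like(y)] + cols)
            beta = fit_qr(X[:n_train], y[:n_train], tau)
            loss = check_loss(y[n_train:] - X[n_train:] @ beta, tau)
            if loss < best_loss:
                best_loss, pick = loss, name
        if pick is None:
            return chosen
        chosen.append(pick)

rng = np.random.default_rng(2)
n = 1200
ens_mean = rng.normal(285.0, 3.0, n)
ens_std = rng.uniform(0.5, 2.0, n)
persistence = rng.normal(285.0, 3.0, n)           # unrelated to truth here
y = ens_mean + ens_std * rng.normal(0.0, 1.0, n)  # truth tracks the ens mean

cands = {"ens_mean": ens_mean, "ens_std": ens_std, "persistence": persistence}
picked = forward_select(cands, y, tau=0.9, n_train=800)
print(picked)  # ens_mean is selected first
```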
Step 3: segregate forecasts into differing ranges of ensemble dispersion and refit models (Step 2) uniquely for each range
[Figure: forecast time series T [K] segregated into dispersion ranges I, II, III; prior (climatological) and posterior forecast PDFs, probability/K versus temperature [K]]
Final result: a "sharper" posterior PDF represented by the interpolated quantiles
Measures used:
1) Rank histogram (converted to a scalar measure)
2) Root mean square error (RMSE)
3) Brier score
4) Ranked Probability Score (RPS)
5) Relative Operating Characteristic (ROC) curve
6) New measure of ensemble skill-spread utility
=> These are used for automated calibration model selection via a weighted sum of the skill scores of each
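The first of these measures can be computed as follows (numpy sketch on synthetic data; the chi-square-style flatness summary is an illustrative choice for "converted to a scalar measure", not necessarily the one used in the talk):

```python
import numpy as np

def rank_histogram(ens, obs):
    """Counts of each observation's rank among its N sorted members (N+1 bins)."""
    n_members = ens.shape[1]
    ranks = (ens < obs[:, None]).sum(axis=1)  # rank 0 .. N
    return np.bincount(ranks, minlength=n_members + 1)

def flatness(counts):
    """Chi-square-style deviation from a uniform histogram; 0 = perfectly flat."""
    expected = counts.sum() / counts.size
    return float(((counts - expected) ** 2 / expected).sum())

rng = np.random.default_rng(3)
n_fc, n_ens = 4000, 10
obs = rng.normal(0.0, 1.0, n_fc)
calibrated = rng.normal(0.0, 1.0, (n_fc, n_ens))      # same distribution as obs
underdispersed = rng.normal(0.0, 0.3, (n_fc, n_ens))  # too little spread

f_cal = flatness(rank_histogram(calibrated, obs))
f_und = flatness(rank_histogram(underdispersed, obs))
print(f_cal, f_und)  # near-flat vs. strongly U-shaped
```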
Utilizing Verification measures near-real-time …
Problems with spread-skill correlation:
ECMWF spread-skill correlation (black) << 1
Even the "perfect model" (blue) correlation is << 1, and it varies with forecast lead time
[Panels by lead time: 1 day: ECMWF r = 0.33, "perfect" r = 0.68; 4 day: ECMWF r = (value missing), "perfect" r = 0.56; 7 day: ECMWF r = 0.39, "perfect" r = 0.53; 10 day: ECMWF r = 0.36, "perfect" r = 0.49]
National Security Applications Program Research Applications Laboratory
3-hr and 42-hr dewpoint time series, Station DPG S01: before calibration vs. after calibration
Blue is the "raw" ensemble, black is the calibrated ensemble, red is the observed value
Notice: a significant change in both the "bias" and the dispersion of the final PDF (also notice the PDF asymmetries)
PDFs: raw vs. calibrated
3-hr dewpoint rank histograms, Station DPG S01
42-hr dewpoint rank histograms, Station DPG S01
Skill Scores
• A single value to summarize performance.
• Reference forecast: the best naive guess, e.g. persistence or climatology.
• A perfect forecast implies that the object can be perfectly observed.
• Positively oriented: positive is good.

SS = (A_forc − A_ref) / (A_perf − A_ref)
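The formula translates directly to code; the RMSE values below are hypothetical (for an error-type measure a perfect forecast has A_perf = 0):

```python
def skill_score(a_forecast, a_reference, a_perfect=0.0):
    """SS = (A_forc - A_ref) / (A_perf - A_ref).
    1 = perfect, 0 = no better than the reference, negative = worse."""
    return (a_forecast - a_reference) / (a_perfect - a_reference)

# Hypothetical example: forecast RMSE 1.2 vs. persistence RMSE 2.0.
print(skill_score(1.2, 2.0))  # ~0.4: closes 40% of the gap to a perfect forecast
```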
Skill score verification: RMSE skill score (left), CRPS skill score (right)
Reference forecasts: black -- raw ensemble; blue -- persistence
Computational Resource Questions:
How best to utilize a multi-model ensemble forecast, especially if it is under-dispersive?
a) Should more dynamical variability be sought? Or
b) Is it better to balance post-processing with multi-model utilization to create a properly dispersive, informative ensemble?
3-hr dewpoint rank histograms, Station DPG S01
RMSE of ensemble members, Station DPG S01: 3-hr lead time (left), 42-hr lead time (right)
Significant calibration regressors, Station DPG S01: 3-hr lead time (left), 42-hr lead time (right)
Questions revisited: How best to utilize a multi-model ensemble forecast, especially if it is under-dispersive?
a) Should more dynamical variability be sought? Or
b) Is it better to balance post-processing with multi-model utilization to create a properly dispersive, informative ensemble?
Warning: adding more models can lead to decreasing utility of the ensemble mean (even if the ensemble is under-dispersive)
Summary
Quantile regression provides a powerful framework for improving the whole (potentially non-Gaussian) PDF of an ensemble forecast, with different regressors for different quantiles and lead times.
The framework provides an umbrella for blending multiple statistical correction approaches (logistic regression, etc., not shown) as well as multiple regressors.
As well, "step-wise cross-validation"-based calibration provides a method to ensure forecast skill no worse than climatology and persistence for a variety of cost functions.
As shown here, significant improvements were made to the forecast's ability to represent its own potential forecast error (while improving sharpness):
– uniform rank histogram
– significant spread-skill relationship (new skill-spread measure)
Care should be used before “throwing more models” at an “under-dispersive” forecast problem
Further questions: hopson@ucar.edu or yliu@ucar.edu
Dugway Proving Ground
Other options: assign dispersion bins, then:
2) average the error values in each bin, then correlate
3) calculate individual rank histograms for each bin and convert each to a scalar measure
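The bin-then-correlate option (option 2) might be implemented like this (numpy sketch; synthetic data in which the error magnitude genuinely scales with the spread, so the binned correlation should be high):

```python
import numpy as np

def binned_spread_error(spread, abs_err, n_bins=5):
    """Sort cases by ensemble spread, split into equal-count bins, average
    the absolute error in each bin, then correlate the bin means."""
    order = np.argsort(spread)
    bins = np.array_split(order, n_bins)
    mean_spread = np.array([spread[b].mean() for b in bins])
    mean_err = np.array([abs_err[b].mean() for b in bins])
    return mean_spread, mean_err, np.corrcoef(mean_spread, mean_err)[0, 1]

rng = np.random.default_rng(4)
spread = rng.uniform(0.5, 2.0, 4000)
abs_err = np.abs(rng.normal(0.0, spread))  # error sd proportional to spread

_, _, r = binned_spread_error(spread, abs_err)
print(r)  # close to 1 when the spread is truly informative
```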
Example: French Broad River. Before calibration => under-dispersive
The black curve shows observations; colors are the ensemble members
Rank Histogram Comparisons
After quantile regression the rank histogram is more uniform (although now slightly over-dispersive)
Raw full ensemble (left); after calibration (right)
Frequency used for quantile fitting of Method I:
Best model = 76%; ensemble stdev = 13%; ensemble mean = 0%; ranked ensemble = 6%
What Nash-Sutcliffe (RMSE) implies about Utility
Note:
Take home message:
For a “calibrated ensemble”, error variance of the ensemble mean is 1/2 the error variance of any ensemble member (on average), independent of the distribution being sampled
[Figure: forecast PDF and the observation, probability versus discharge]
Compare the error variance of a single member, (f_i − o)², with that of the ensemble mean, (⟨f⟩ − o)², where ⟨·⟩ denotes the ensemble average. Expanding both:
eq1: ⟨(f_i − o)²⟩ = ⟨f²⟩ − 2o⟨f⟩ + o²
eq2: (⟨f⟩ − o)² = ⟨f⟩² − 2o⟨f⟩ + o²
For a calibrated ensemble the observation is statistically exchangeable with a member, so substitute o → f_j and average over j:
eq1: 2(⟨f²⟩ − ⟨f⟩²)
eq2: ⟨f²⟩ − ⟨f⟩²
⇒ eq1 = 2 · eq2
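The factor of two is easy to verify numerically. A quick synthetic check under the exchangeability assumption (for a finite N-member ensemble the exact expected ratio is 2N/(N+1), which tends to 2 for large N):

```python
import numpy as np

rng = np.random.default_rng(1)
n_fc, n_ens = 20000, 50
# "Calibrated" setup: the observation and the members are exchangeable
# draws from the same per-forecast distribution.
centers = rng.normal(0.0, 3.0, n_fc)
members = centers[:, None] + rng.normal(0.0, 1.0, (n_fc, n_ens))
obs = centers + rng.normal(0.0, 1.0, n_fc)

member_ev = np.mean((members - obs[:, None]) ** 2)    # avg member error variance
mean_ev = np.mean((members.mean(axis=1) - obs) ** 2)  # ensemble-mean error variance

print(member_ev / mean_ev)  # ~1.96 for N = 50, i.e. 2N/(N+1)
```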
Sequentially-averaged models (ranked based on NS Score) and their resultant NS Score
=> Notice the degradation of NS with increasing ensemble size (with a peak at 2 models)
=> For an equitable multi-model, NS should rise monotonically
=> Maybe a smaller subset of models would have more utility? (A contradiction for an under-dispersive ensemble?)
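The sequential-averaging diagnostic can be sketched as follows (numpy; synthetic exchangeable members, for which the RMSE curve should decline toward the full-ensemble mean rather than peak at a small subset, as the slide argues an equitable multi-model should):

```python
import numpy as np

def sequential_mean_rmse(members, obs):
    """RMSE of the running mean over members ranked by individual RMSE.
    members is (n_members, n_times); returns RMSE for k = 1..n_members."""
    rmse_each = np.sqrt(((members - obs[None, :]) ** 2).mean(axis=1))
    order = np.argsort(rmse_each)  # best single member first
    rmses = []
    for k in range(1, members.shape[0] + 1):
        running_mean = members[order[:k]].mean(axis=0)
        rmses.append(np.sqrt(((running_mean - obs) ** 2).mean()))
    return np.array(rmses)

rng = np.random.default_rng(5)
obs = rng.normal(0.0, 1.0, 5000)
members = obs[None, :] + rng.normal(0.0, 1.0, (12, 5000))  # exchangeable members

curve = sequential_mean_rmse(members, obs)
print(curve[0], curve[-1])  # full-ensemble mean beats the single best member
```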
What Nash-Sutcliffe (RMSE) implies about Utility (cont)
-- degradation with increased ensemble size
Initial frequency used for quantile fitting:
Best model = 76%; ensemble stdev = 13%; ensemble mean = 0%; ranked ensemble = 6%
What Nash-Sutcliffe implies about Utility (cont)
Reduced-set frequency used for quantile fitting:
Best model = 73%; ensemble stdev = 3%; ensemble mean = 32%; ranked ensemble = 29%
… using only the top 1/3 of models to rank and form the ensemble mean … (compare with the earlier results)
=> There appear to be significant gains in the utility of the ensemble after "filtering" (except for the drop in StDev); however, the "proof is in the pudding" => examine the verification skill measures …
Skill score comparisons between the full and "filtered" ensemble sets
Points:
-- quite similar results for a variety of skill scores
-- both approaches give appreciable benefit over the original raw multi-model output
-- however, only in the CRPSS does the "filtered" ensemble set improve on the full set
=> post-processing method fairly robust=> More work (more filtering?)!
GREEN -- full calibrated multi-model
BLUE -- "filtered" calibrated multi-model
Reference -- uncalibrated set