Quantile regression as a means of calibrating and verifying a mesoscale NWP ensemble
Tom Hopson1
Josh Hacker1, Yubao Liu1, Gregory Roux1, Wanli Wu1, Jason Knievel1, Tom Warner1, Scott Swerdlin1,
John Pace2, Scott Halvorson2
1National Center for Atmospheric Research
2U.S. Army Test and Evaluation Command
Outline
I. Motivation: ensemble forecasting and post-processing
II. E-RTFDDA for Dugway Proving Ground
III. Introduction to Quantile Regression (QR; Koenker and Bassett, 1978)
IV. Post-processing procedure
V. Verification results
VI. Warning: dynamically tuning ensemble dispersion can put ensemble-mean utility at risk
VII. Conclusions
Goals of an EPS
• Predict the observed distribution of events and atmospheric states
• Predict the uncertainty in the day's prediction
• Predict the extreme events that are possible on a particular day
• Provide a range of possible scenarios for a particular forecast
1. Greater accuracy of the ensemble-mean forecast (half the error variance of a single forecast)
2. Likelihood of extremes
3. Non-Gaussian forecast PDFs
4. Ensemble spread as a representation of forecast uncertainty
=> All rely on the forecasts being calibrated
Further, calibration is essential for tailoring forecasts to a local application:
-- NWP provides spatially and temporally averaged gridded forecast output
-- Applying gridded forecasts to point locations requires location-specific calibration to account for the local spatial and temporal scales of variability (=> increasing ensemble dispersion)
More technically …
Dugway Proving Ground, Utah: e.g., T thresholds
• Includes random and systematic differences between members.
• Not an actual chance of exceedance unless calibrated.
Challenges in probabilistic mesoscale prediction
• Model formulation
  • Bias (marginal and conditional)
  • Lack of variability caused by truncation and approximation
  • Non-universality of closures and forcing
• Initial conditions
  • Small scales are damped in analysis systems, and the model must develop them
  • Perturbation methods designed for medium-range systems may not be appropriate
• Lateral boundary conditions
  • After short time periods the lateral boundary conditions can dominate
  • Representing uncertainty in the lateral boundary conditions is critical
• Lower boundary conditions
  • Dominate the boundary-layer response
  • Difficult to estimate uncertainty in lower boundary conditions
RTFDDA and Ensemble-RTFDDA
Liu et al. 2010: AMS Annual Meeting, 14th IOAS-AOLS, Atlanta, GA, January 18-23, 2010. Contact: yliu@ucar.edu
The Ensemble Execution Module
[Diagram: N parallel RTFDDA members (Member 1 … Member N), each driven by its own perturbations and observations, produce 36-48 h forecasts that feed post-processing, archiving and verification, and input to decision-support tools]
Operated at US Army DPG since Sep. 2007
[Map: three nested forecast domains D1, D2, D3]
Surface and X-sections – Mean, Spread, Exceedance Probability, Spaghetti, …
[Panels: likelihood of wind speed > 10 m/s; mean T and wind; T mean and SD; wind speed; 2-m T; wind rose]
Pin-point Surface and Profiles – Mean, Spread, Exceedance probability, spaghetti, Wind roses, Histograms …
Real-time Operational Products for DPG
Forecast "calibration" or "post-processing"
[Figure: forecast PDFs of probability versus flow rate [m3/s], before and after calibration, with the observation marked; both the "bias" and the "spread" or "dispersion" of the forecast PDF change]
Post-processing has corrected:
• the "on average" bias
• the under-representation of the 2nd moment of the empirical forecast PDF (i.e., corrected its "dispersion" or "spread")
Our approach:
• the under-utilized "quantile regression" approach
• a probability distribution function that "means what it says"
• daily variations in the ensemble dispersion relate directly to changes in forecast skill => an informative ensemble skill-spread relationship
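Quantile regression rests on the asymmetric "pinball" (check) loss: the constant that minimizes it is exactly the empirical tau-quantile, which is the property QR generalizes to regressors. A minimal numpy sketch on synthetic temperatures (illustrative only, not the study's data or code):

```python
import numpy as np

def pinball_loss(q_pred, y, tau):
    """Mean check (pinball) loss for a constant quantile estimate q_pred."""
    e = y - q_pred
    return np.mean(np.where(e >= 0, tau * e, (tau - 1) * e))

rng = np.random.default_rng(0)
y = rng.normal(loc=280.0, scale=2.0, size=5000)  # synthetic 2-m temperatures [K]
tau = 0.9

# Minimize the check loss over candidate constants: the minimizer recovers
# the empirical tau-quantile of the sample.
grid = np.linspace(y.min(), y.max(), 2001)
losses = np.array([pinball_loss(g, y, tau) for g in grid])
best = grid[losses.argmin()]

print(best, np.quantile(y, tau))  # the two agree closely
```

Conditioning the quantile on regressors (ensemble mean, spread, persistence) replaces the constant with a linear predictor fit by minimizing the same loss.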
Example of Quantile Regression (QR)
Our application: fitting T quantiles using QR conditioned on:
1) ranked forecast ensemble
2) ensemble mean
3) ensemble median
4) ensemble stdev
5) persistence
[Figure: time series of ensemble temperature forecasts T [K] and observations]
Regressor set:
1. reforecast ensemble
2. ensemble mean
3. ensemble stdev
4. persistence
5. LR quantile (not shown)
[Figure: climatological PDF, probability/K versus temperature [K]]
Step 1: Determine climatological quantiles.
Step 2: For each quantile, use "forward step-wise cross-validation" to iteratively select the best regressor subset. Selection requirements: a) the QR cost function is minimized; b) the binomial distribution is satisfied at 95% confidence. If the requirements are not met, retain the climatological "prior".
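The per-quantile selection loop might be sketched as below. This is a hedged reconstruction, not the authors' code: `fit_qr`, `forward_select`, and the synthetic regressors are my own names and assumptions, each candidate QR model is fit by directly minimizing the check loss (least-squares start, Nelder-Mead refinement), and a simple train/hold-out split stands in for the full cross-validation and binomial test.

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(e, tau):
    """Mean check (pinball) loss of a residual vector e at quantile tau."""
    return np.mean(np.where(e >= 0, tau * e, (tau - 1) * e))

def fit_qr(X, y, tau):
    """Linear quantile regression: minimize the check loss directly,
    starting from the least-squares solution."""
    b0, *_ = np.linalg.lstsq(X, y, rcond=None)
    return minimize(lambda b: check_loss(y - X @ b, tau), b0,
                    method="Nelder-Mead").x

def forward_select(cands, y, tau, n_train):
    """Greedy forward selection: add the candidate regressor that most
    reduces the held-out check loss; stop when none improves it."""
    chosen, best_loss = [], np.inf
    while True:
        pick = None
        for name, col in cands.items():
            if name in chosen:
                continue
            cols = [cands[c] for c in chosen] + [col]
            X = np.column_stack([np.ones_like(y)] + cols)
            beta = fit_qr(X[:n_train], y[:n_train], tau)
            loss = check_loss(y[n_train:] - X[n_train:] @ beta, tau)
            if loss < best_loss:
                best_loss, pick = loss, name
        if pick is None:
            return chosen
        chosen.append(pick)

rng = np.random.default_rng(2)
n = 1200
ens_mean = rng.normal(285.0, 3.0, n)
ens_std = rng.uniform(0.5, 2.0, n)
persistence = rng.normal(285.0, 3.0, n)           # unrelated to truth here
y = ens_mean + ens_std * rng.normal(0.0, 1.0, n)  # truth tracks the ens mean

cands = {"ens_mean": ens_mean, "ens_std": ens_std, "persistence": persistence}
picked = forward_select(cands, y, tau=0.9, n_train=800)
print(picked)  # ens_mean is selected first
```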
Step 3: segregate forecasts into differing ranges of ensemble dispersion and refit models (Step 2) uniquely for each range
[Figure: forecast time series T [K] segregated into dispersion ranges I, II, III; prior (climatological) and posterior forecast PDFs, probability/K versus temperature [K]]
Final result: a "sharper" posterior PDF represented by the interpolated quantiles
Measures used:
1) Rank histogram (converted to a scalar measure)
2) Root mean square error (RMSE)
3) Brier score
4) Ranked Probability Score (RPS)
5) Relative Operating Characteristic (ROC) curve
6) New measure of ensemble skill-spread utility
=> These are used for automated calibration model selection via a weighted sum of the skill scores of each
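The first of these measures can be computed as follows (numpy sketch on synthetic data; the chi-square-style flatness summary is an illustrative choice for "converted to a scalar measure", not necessarily the one used in the talk):

```python
import numpy as np

def rank_histogram(ens, obs):
    """Counts of each observation's rank among its N sorted members (N+1 bins)."""
    n_members = ens.shape[1]
    ranks = (ens < obs[:, None]).sum(axis=1)  # rank 0 .. N
    return np.bincount(ranks, minlength=n_members + 1)

def flatness(counts):
    """Chi-square-style deviation from a uniform histogram; 0 = perfectly flat."""
    expected = counts.sum() / counts.size
    return float(((counts - expected) ** 2 / expected).sum())

rng = np.random.default_rng(3)
n_fc, n_ens = 4000, 10
obs = rng.normal(0.0, 1.0, n_fc)
calibrated = rng.normal(0.0, 1.0, (n_fc, n_ens))      # same distribution as obs
underdispersed = rng.normal(0.0, 0.3, (n_fc, n_ens))  # too little spread

f_cal = flatness(rank_histogram(calibrated, obs))
f_und = flatness(rank_histogram(underdispersed, obs))
print(f_cal, f_und)  # near-flat vs. strongly U-shaped
```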
Utilizing Verification measures near-real-time …
Problems with spread-skill correlation:
ECMWF spread-skill correlation (black) << 1
Even the "perfect model" (blue) correlation is << 1, and it varies with forecast lead time
[Panels by lead time: 1 day: ECMWF r = 0.33, "perfect" r = 0.68; 4 day: ECMWF r = (value missing), "perfect" r = 0.56; 7 day: ECMWF r = 0.39, "perfect" r = 0.53; 10 day: ECMWF r = 0.36, "perfect" r = 0.49]
National Security Applications Program Research Applications Laboratory
3-hr and 42-hr dewpoint time series, Station DPG S01: before calibration vs. after calibration
Blue is the "raw" ensemble, black is the calibrated ensemble, red is the observed value
Notice: a significant change in both the "bias" and the dispersion of the final PDF (also notice the PDF asymmetries)
PDFs: raw vs. calibrated
3-hr dewpoint rank histograms, Station DPG S01
42-hr dewpoint rank histograms, Station DPG S01
Skill Scores
• A single value to summarize performance.
• Reference forecast: the best naive guess, e.g. persistence or climatology.
• A perfect forecast implies that the object can be perfectly observed.
• Positively oriented: positive is good.

SS = (A_forc − A_ref) / (A_perf − A_ref)
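The formula translates directly to code; the RMSE values below are hypothetical (for an error-type measure a perfect forecast has A_perf = 0):

```python
def skill_score(a_forecast, a_reference, a_perfect=0.0):
    """SS = (A_forc - A_ref) / (A_perf - A_ref).
    1 = perfect, 0 = no better than the reference, negative = worse."""
    return (a_forecast - a_reference) / (a_perfect - a_reference)

# Hypothetical example: forecast RMSE 1.2 vs. persistence RMSE 2.0.
print(skill_score(1.2, 2.0))  # ~0.4: closes 40% of the gap to a perfect forecast
```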
Skill score verification: RMSE skill score (left), CRPS skill score (right)
Reference forecasts: black -- raw ensemble; blue -- persistence
Computational Resource Questions:
How best to utilize a multi-model ensemble forecast, especially if it is under-dispersive?
a) Should more dynamical variability be sought? Or
b) Is it better to balance post-processing with multi-model utilization to create a properly dispersive, informative ensemble?
3-hr dewpoint rank histograms, Station DPG S01
RMSE of ensemble members, Station DPG S01: 3-hr lead time (left), 42-hr lead time (right)
Significant calibration regressors, Station DPG S01: 3-hr lead time (left), 42-hr lead time (right)
Questions revisited: How best to utilize a multi-model ensemble forecast, especially if it is under-dispersive?
a) Should more dynamical variability be sought? Or
b) Is it better to balance post-processing with multi-model utilization to create a properly dispersive, informative ensemble?
Warning: adding more models can lead to decreasing utility of the ensemble mean (even if the ensemble is under-dispersive)
Summary
Quantile regression provides a powerful framework for improving the whole (potentially non-Gaussian) PDF of an ensemble forecast, with different regressors for different quantiles and lead times.
The framework provides an umbrella for blending multiple statistical correction approaches (logistic regression, etc., not shown) as well as multiple regressors.
As well, "step-wise cross-validation"-based calibration provides a method to ensure forecast skill no worse than climatology and persistence for a variety of cost functions.
As shown here, significant improvements were made to the forecast's ability to represent its own potential forecast error (while improving sharpness):
– uniform rank histogram
– significant spread-skill relationship (new skill-spread measure)
Care should be used before “throwing more models” at an “under-dispersive” forecast problem
Further questions: hopson@ucar.edu or yliu@ucar.edu
Dugway Proving Ground
Other options: assign dispersion bins, then:
2) average the error values in each bin, then correlate
3) calculate individual rank histograms for each bin and convert each to a scalar measure
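The bin-then-correlate option (option 2) might be implemented like this (numpy sketch; synthetic data in which the error magnitude genuinely scales with the spread, so the binned correlation should be high):

```python
import numpy as np

def binned_spread_error(spread, abs_err, n_bins=5):
    """Sort cases by ensemble spread, split into equal-count bins, average
    the absolute error in each bin, then correlate the bin means."""
    order = np.argsort(spread)
    bins = np.array_split(order, n_bins)
    mean_spread = np.array([spread[b].mean() for b in bins])
    mean_err = np.array([abs_err[b].mean() for b in bins])
    return mean_spread, mean_err, np.corrcoef(mean_spread, mean_err)[0, 1]

rng = np.random.default_rng(4)
spread = rng.uniform(0.5, 2.0, 4000)
abs_err = np.abs(rng.normal(0.0, spread))  # error sd proportional to spread

_, _, r = binned_spread_error(spread, abs_err)
print(r)  # close to 1 when the spread is truly informative
```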
Example: French Broad River. Before calibration => under-dispersive
The black curve shows observations; colors are the ensemble members
Rank Histogram Comparisons
After quantile regression the rank histogram is more uniform (although now slightly over-dispersive)
Raw full ensemble (left); after calibration (right)
Frequency used for quantile fitting of Method I:
Best model = 76%; ensemble stdev = 13%; ensemble mean = 0%; ranked ensemble = 6%
What Nash-Sutcliffe (RMSE) implies about Utility
Note:
Take home message:
For a “calibrated ensemble”, error variance of the ensemble mean is 1/2 the error variance of any ensemble member (on average), independent of the distribution being sampled
[Figure: forecast PDF and the observation, probability versus discharge]
Compare the error variance of a single member, (f_i − o)², with that of the ensemble mean, (⟨f⟩ − o)², where ⟨·⟩ denotes the ensemble average. Expanding both:
eq1: ⟨(f_i − o)²⟩ = ⟨f²⟩ − 2o⟨f⟩ + o²
eq2: (⟨f⟩ − o)² = ⟨f⟩² − 2o⟨f⟩ + o²
For a calibrated ensemble the observation is statistically exchangeable with a member, so substitute o → f_j and average over j:
eq1: 2(⟨f²⟩ − ⟨f⟩²)
eq2: ⟨f²⟩ − ⟨f⟩²
⇒ eq1 = 2 · eq2
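The factor of two is easy to verify numerically. A quick synthetic check under the exchangeability assumption (for a finite N-member ensemble the exact expected ratio is 2N/(N+1), which tends to 2 for large N):

```python
import numpy as np

rng = np.random.default_rng(1)
n_fc, n_ens = 20000, 50
# "Calibrated" setup: the observation and the members are exchangeable
# draws from the same per-forecast distribution.
centers = rng.normal(0.0, 3.0, n_fc)
members = centers[:, None] + rng.normal(0.0, 1.0, (n_fc, n_ens))
obs = centers + rng.normal(0.0, 1.0, n_fc)

member_ev = np.mean((members - obs[:, None]) ** 2)    # avg member error variance
mean_ev = np.mean((members.mean(axis=1) - obs) ** 2)  # ensemble-mean error variance

print(member_ev / mean_ev)  # ~1.96 for N = 50, i.e. 2N/(N+1)
```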
Sequentially-averaged models (ranked based on NS Score) and their resultant NS Score
=> Notice the degradation of NS with increasing ensemble size (with a peak at 2 models)
=> For an equitable multi-model, NS should rise monotonically
=> Maybe a smaller subset of models would have more utility? (A contradiction for an under-dispersive ensemble?)
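The sequential-averaging diagnostic can be sketched as follows (numpy; synthetic exchangeable members, for which the RMSE curve should decline toward the full-ensemble mean rather than peak at a small subset, as the slide argues an equitable multi-model should):

```python
import numpy as np

def sequential_mean_rmse(members, obs):
    """RMSE of the running mean over members ranked by individual RMSE.
    members is (n_members, n_times); returns RMSE for k = 1..n_members."""
    rmse_each = np.sqrt(((members - obs[None, :]) ** 2).mean(axis=1))
    order = np.argsort(rmse_each)  # best single member first
    rmses = []
    for k in range(1, members.shape[0] + 1):
        running_mean = members[order[:k]].mean(axis=0)
        rmses.append(np.sqrt(((running_mean - obs) ** 2).mean()))
    return np.array(rmses)

rng = np.random.default_rng(5)
obs = rng.normal(0.0, 1.0, 5000)
members = obs[None, :] + rng.normal(0.0, 1.0, (12, 5000))  # exchangeable members

curve = sequential_mean_rmse(members, obs)
print(curve[0], curve[-1])  # full-ensemble mean beats the single best member
```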
What Nash-Sutcliffe (RMSE) implies about Utility (cont)
-- degradation with increased ensemble size
Initial frequency used for quantile fitting:
Best model = 76%; ensemble stdev = 13%; ensemble mean = 0%; ranked ensemble = 6%
What Nash-Sutcliffe implies about Utility (cont)
Reduced-set frequency used for quantile fitting:
Best model = 73%; ensemble stdev = 3%; ensemble mean = 32%; ranked ensemble = 29%
… using only the top 1/3 of models to rank and form the ensemble mean … (compare with the earlier results)
=> There appear to be significant gains in the utility of the ensemble after "filtering" (except for the drop in StDev); however, the "proof is in the pudding" => examine the verification skill measures …
Skill score comparisons between the full and "filtered" ensemble sets
Points:
-- quite similar results for a variety of skill scores
-- both approaches give appreciable benefit over the original raw multi-model output
-- however, only in the CRPSS does the "filtered" ensemble set improve on the full set
=> post-processing method fairly robust=> More work (more filtering?)!
GREEN -- full calibrated multi-model
BLUE -- "filtered" calibrated multi-model
Reference -- uncalibrated set