Postprocessing of Ensemble Weather Forecasts Using a Stochastic Weather Generator
JIE CHEN AND FRANÇOIS P. BRISSETTE
Department of Construction Engineering, École de technologie supérieure, Université du Québec, Montreal, Quebec, Canada
ZHI LI
College of Natural Resources and Environment, Northwest A & F University, Yangling, Shaanxi, China
(Manuscript received 31 May 2013, in final form 13 November 2013)
ABSTRACT
This study proposes a new statistical method for postprocessing ensemble weather forecasts using a stochastic weather generator. Key parameters of the weather generator were linked to the ensemble forecast means for both precipitation and temperature, allowing the generation of an infinite number of daily time series that are fully coherent with the ensemble weather forecast. This method was verified by postprocessing reforecast datasets derived from the Global Forecast System (GFS) for forecast leads ranging between 1 and 7 days over two Canadian watersheds in the Province of Quebec. The calibration of the ensemble weather forecasts was based on a cross-validation approach that leaves one year out for validation and uses the remaining years for training the model. The proposed method was compared with a simple bias correction method for ensemble precipitation and temperature forecasts using a set of deterministic and probabilistic metrics. The results show underdispersion and biases for the raw GFS ensemble weather forecasts, indicating that they were poorly calibrated. The proposed method significantly increased the predictive power of ensemble weather forecasts for forecast leads ranging between 1 and 7 days, and was consistently better than the bias correction method. The ability to generate discrete, autocorrelated daily time series makes ensemble weather forecasts straightforward to use in forecasting models commonly applied in hydrology and agriculture. This study further indicates that the calibration of ensemble forecasts is reasonable for leads up to one week for precipitation, and possibly for up to an additional week for temperature.
1. Introduction
Ensemble weather forecasts offer great potential
benefits for water resource management, as they provide
useful information for analyzing the uncertainty of pre-
dicted variables (Boucher et al. 2011). The advantages of
ensemble weather forecasts over deterministic forecasts
were observed in several studies, even at locations where
the spatial resolution of ensemble forecasts was much
lower (Bertotti et al. 2011; Boucher et al. 2011). However,
raw ensemble forecasts are generally biased and tend to
be underdispersed (Buizza 1997; Hamill and Colucci 1997;
Eckel and Walters 1998; Toth et al. 2001; Pellerin et al.
2003), thus limiting the predictive power of probability
density functions (PDFs). For example, Hamill and
Colucci (1997) verified the Eta and regional spectral
model (Eta-RSM) for predicting short-range (24 h)
850-mb temperature, 500-mb geopotential height, and
24-h total precipitation amounts using rank histograms.
The nonuniform rank histograms indicated that the assumption of identical errors for each member did not hold. Buizza et al. (2005) compared three ensemble
prediction systems [the European Centre for Medium-
Range Weather Forecasts (ECMWF), the Meteorologi-
cal Service of Canada (MSC), and the National Centers
for Environmental Prediction (NCEP)] in forecasting the
500-hPa geopotential height over the Northern Hemi-
sphere and found that none of them was able to capture
all sources of forecast uncertainty. In addition, both
spread-error correlations and underdispersion were de-
tected. Therefore, some form of postprocessing is re-
quired before ensemble forecasts can be incorporated into
the decision-making process so that the predictive distributions are reliable and properly reflect the real-world uncertainty (Hamill and Colucci 1998; Richardson 2001; Boucher et al. 2011; Cui et al. 2012).

Corresponding author address: Jie Chen, Department of Construction Engineering, École de technologie supérieure, Université du Québec, 1100 rue Notre-Dame Ouest, Montreal QC H3C 1K3, Canada. E-mail: [email protected]

MONTHLY WEATHER REVIEW, VOLUME 142, 1106. DOI: 10.1175/MWR-D-13-00180.1. © 2014 American Meteorological Society.
During the last two decades, a number of post-
processing methods have been proposed and imple-
mented to address the bias and underdispersion of
ensemble weather forecasts. These include rank histo-
gram techniques (Hamill and Colucci 1998; Eckel and
Walters 1998; Wilks 2006), ensemble dressing (Roulston
and Smith 2003; Wang and Bishop 2005; Wilks and
Hamill 2007; Brocker and Smith 2008), Bayesian model
averaging (BMA; Raftery et al. 2005; Vrugt et al. 2006;
Wilson et al. 2007; Sloughter et al. 2007; Soltanzadeh et al.
2011), logistic regression (Hamill et al. 2006; Wilks and
Hamill 2007; Hamill et al. 2008), analog techniques
(Hamill et al. 2006; Hamill and Whitaker 2007), and
nonhomogeneous Gaussian regression (NGR; Gneiting
et al. 2005; Wilks and Hamill 2007; Hagedorn et al. 2008).
Among these methods, the logistic regression method
was most often used to calibrate both precipitation and
temperature, and BMA and NGR were usually used to
calibrate the temperature (Raftery et al. 2005; Hagedorn
et al. 2008; Hamill et al. 2008). More recently, studies
have also extended the BMA for the postprocessing of
precipitation (Sloughter et al. 2007; Schmeits and Kok
2010).
Hamill et al. (2004) used a logistic regression method
to improve the medium-range precipitation and tem-
perature forecast skill using retrospective forecasts. The
ensemble mean and ensemble mean anomaly were used
as predictors for precipitation and temperature, respec-
tively. The results showed that the logistic regression-
based probability forecasts (using retrospective forecasts)
were much more skillful and reliable than the operational
NCEP forecast. Raftery et al. (2005) proposed using the
BMA method to calibrate the ensemble forecasts of
temperature and found that the calibrated predictive
PDFs were much better than those of the raw forecast.
Wilks (2006) compared eight ensemble model out-
put statistics (MOS) methods for the statistical post-
processing of ensemble forecast using the idealized
Lorenz’96 setting. The eight methods were classified
into four categories: 1) early, ad hoc approaches (di-
rect model output, rank-histogram recalibration, and
multiple implementations of single-integration MOS
equations), 2) the ensemble dressing approach, 3) re-
gression methods (logistic regression and NGR), and
4) Bayesian methods (forecast assimilation and BMA).
This is probably the most thorough study to date in terms
of including the greatest number of MOS methods for the
postprocessing of ensemble forecasts. The three best
performing methods were found to be logistic regression,
NGR, and ensemble dressing. Wilks and Hamill (2007)
further compared these three methods for postprocessing
daily temperature, and medium-range (6–10 and 8–
14 days) temperature and precipitation forecasts. The
results showed there was not a single best method for all
of the applications of daily and medium-range forecasts.
For example, the logistic regression method yielded the
best Brier score (BS) for central forecast quantiles,
while the NGR forecasts displayed slightly greater ac-
curacy for probability forecasts of the more extreme
daily precipitation quantiles. Hagedorn et al. (2008) and
Hamill et al. (2008) did a parallel study that used NGR
and logistic regression for postprocessing temperature
and precipitation, respectively, using the ECMWF and
Global Forecast System (GFS) ensemble reforecasts.
The skill and reliability of ECMWF and GFS ensemble
temperature and precipitation forecasts were largely
improved when using the NGR and logistic regression
methods, respectively. These studies also emphasized
the benefits of using ensemble retrospective forecasts
(reforecasts).
Other studies such as Wilson et al. (2007) and
Soltanzadeh et al. (2011) showed that BMA is also able
to improve the skill and reliability of ensemble forecasts.
However, most studies were only focused on the cali-
bration of temperature rather than precipitation using
BMA, because the original BMA developed by Raftery
et al. (2005) was only suitable for variables whose pre-
dictive PDFs are approximately normal. To use it for the
calibration of precipitation, Sloughter et al. (2007) ex-
tended BMA by modeling the predictive PDFs corre-
sponding to an ensemble member as a mixture of a discrete
event at zero and a gamma distribution. The extended
BMA yielded calibrated and sharp predictive distributions
for 48-h precipitation forecasts. It even outperformed the
logistic regression at estimating the probability of high
precipitation events, because it gives a full predictive PDF
rather than separate forecast probability equations for
different predictand thresholds. Similarly, Wilks (2009)
also extended the logistic regression to provide full PDF
forecasts. The main advantage of the extended logistic
regression is that the forecasted probabilities are mutually
consistent; thus, the cumulative probability for a small
predictand threshold cannot be larger than the probability
for a larger threshold (Wilks 2009). Based on the above-
mentioned studies, Schmeits and Kok (2010) compared
the raw ensemble output, modified BMA, and extended
logistic regression for postprocessing ECMWF ensemble
precipitation reforecasts. The results showed that, even
though the raw ensemble precipitation forecasts were
relatively well calibrated, their skill could be significantly
improved by the modified BMA and extended logistic
regression methods. However, the difference in skill between the modified BMA and extended logistic regression was not significant.
MARCH 2014 CHEN ET AL. 1107
Even though a number of methods have been pro-
posed for postprocessing the ensemble weather fore-
casts, most of them are aimed at finding the underlying
probabilistic distribution of forecasted variables. How-
ever, for some practical applications, such as ensemble
streamflow predictions, several sets of discrete, auto-
correlated time series over several days are needed for
driving the impact models (e.g., hydrological models).
Yet there is no simple way to go from the underlying distribution to the generation of a discrete, autocorrelated time series that is fully consistent with
the underlying distribution. This study presents a new
method for postprocessing ensemble weather forecasts
using a stochastic weather generator. The ensemble mean
precipitation and temperature anomalies are used as
predictors for the calibration of precipitation and tem-
perature, respectively. A great number of ensemble
members can be produced using the stochastic weather
generator with a gamma distribution for generating pre-
cipitation amounts and a normal distribution for gener-
ating temperature. A simple bias correction (BC) method
is used as a benchmark to demonstrate the performance
of the proposed method [i.e., the generator-based post-
processing (GPP) method]. The GPP ensemble forecasts
were compared with BC and raw GFS ensemble forecasts
over two Canadian watersheds in Quebec, Canada, using
a set of deterministic and probabilistic metrics. The ulti-
mate goal of this study is to provide reliable ensemble
weather forecasts for ensemble streamflow forecasts.
Therefore, watershed-averaged precipitation and tem-
perature are used instead of traditional station meteo-
rological data.
2. Study area and dataset
a. Study area
The ultimate goal of this project is to provide and
evaluate ensemble streamflow forecasting. It is with this
goal in mind that we chose to focus on watershed-averaged meteorological data instead of station data.
Accordingly, this study is conducted over two Canadian
catchments located in the Province of Quebec (Fig. 1).
Two different catchments (Peribonka and Yamaska) were
selected to evaluate the impact of basin characteristics
on ensemble weather forecasts. Both the Peribonka and
Yamaska catchments are composed of several tributaries
draining basins of approximately 27 000 and 4843 km²
in southeastern and southern Quebec, respectively. The
southern parts of the Peribonka and Yamaska catch-
ments, known as the Chute-du-Diable (CDD) and the
Yamaska (YAM) watersheds, respectively, are used in
this study. The two watersheds differ in size (9700 vs 3330 km²) and location (the CDD watershed is located in central Quebec and the smaller YAM watershed in southern Quebec). Additional details on both
watersheds are presented below.
1) CHUTE-DU-DIABLE (CDD)
The CDD watershed (48.5°–50.2°N, 70.5°–71.5°W) is located in central Quebec. With a mostly forested surface area of 9700 km², it is a subbasin of the Peribonka River watershed. The basin is part of the northern Quebec
subarctic region, characterized by wide daily and annual
temperature ranges, heavy wintertime snowfall, and pro-
nounced rainfall and/or snowmelt peaks in the spring
(April–June; Coulibaly 2003). The average annual rainfall in the area is 962 mm, of which about 36% is snowfall. The average annual maximum and minimum temperatures (Tmax and Tmin) between 1979 and 2003 were 5.49° and −5.85°C, respectively. The CDD watershed contains a
large hydropower reservoir managed by Rio Tinto
Alcan for hydroelectric power generation. River flows
are regulated by two upstream reservoirs. Snow plays
a crucial role in the watershed management, with 35%
of the total yearly discharge occurring during the spring
flood. The mean annual discharge of the CDD watershed is 211 m³ s⁻¹, with a daily maximum registered flood of 1666 m³ s⁻¹. Snowmelt peak discharge usually occurs in May and averages about 1200 m³ s⁻¹.

FIG. 1. Location map of the two catchments.
2) YAMASKA (YAM)
The YAM watershed (45.1°–46.1°N, 72.2°–73.1°W) is composed of a number of tributaries draining a basin of approximately 4843 km² in southern Quebec; the southern part of the YAM basin, with an area of 3330 km², is used in this study. The average annual rainfall in the area is 1175 mm, of which about 23% is snowfall. The average annual Tmin and Tmax are above the freezing mark at 0.56° and 10.83°C, respectively, between 1979 and 2003. The mean annual discharge of the YAM River is 61 m³ s⁻¹, with a daily maximum registered flood of 881 m³ s⁻¹. Snowmelt peak discharge usually occurs in April and averages about 495 m³ s⁻¹.
b. Dataset
The dataset consists of observed and ensemble-
forecasted daily total precipitation and mean tempera-
ture. The observed daily precipitation and temperature
over the two watersheds were taken from the National Land and Water Information Service (www.agr.gc.ca/nlwis-snite)
dataset covering the period of 1979–2003. This dataset
was created by interpolating station data to a 10-km grid
using a thin-plate smoothing spline surface-fitting method (Hutchinson et al. 2009). All grid points within a watershed were averaged to represent the observed time series.
Ensemble forecasts (daily total precipitation and mean
temperature) on a 2.5° global grid were taken from the GFS reforecast dataset (http://www.esrl.noaa.gov/
psd/forecasts/reforecast/; Hamill et al. 2006). Several
previous studies (e.g., Hamill et al. 2004, 2006; Hamill and
Whitaker 2006, 2007; Whitaker et al. 2006; Wilks and
Hamill 2007) have presented the benefit of calibrating
probabilistic forecasts using ensemble reforecast data-
sets. Forecasts were made with the GFS for each day since 1979, each consisting of a 15-member ensemble run out to 15 days. Since
little skill is retained for precipitation after 1 week, only
1–7 lead days are used in this study over the 1979–2003
time frame. Two grid boxes were selected and averaged
for the CDD watershed and only one grid box was se-
lected for the YAM watershed.
3. Methodology
a. Stochastic weather generator
A stochastic weather generator is a computer model
that can produce climate time series of arbitrary length
and with statistical properties similar to those of the ob-
served data (Richardson 1981; Nicks and Gander 1994;
Semenov and Barrow 2002; Chen et al. 2010, 2012). The
generation of precipitation and temperature usually constitutes the two main components of a weather generator. Precipitation is most often generated using a two-component
model: one for the precipitation occurrence and the other
for the wet-day precipitation amount.
The precipitation occurrence is usually generated using
a Markov chain of various orders based on transition
probabilities. Alternatively, the precipitation occurrence
can also be generated based on an unconditional pre-
cipitation probability if the precipitation model only
considers the wet- and dry-day probabilities rather than
the wet- and dry-spell structures. In this sense, if a random
number drawn from a uniform distribution for one day is
less than the unconditional precipitation probability, a wet
day is predicted. Since the weather generator is used in
this study to synthesize the wet and dry states of ensemble
members for a given day rather than to generate the
continuous time series of precipitation occurrence, only
the second method was used. For a predicted wet day,
stochastic weather generators usually produce the pre-
cipitation amount by using a parametric probability dis-
tribution (e.g., exponential and gamma distributions). The
two-parameter gamma distribution is the most widely
used method to simulate wet-day precipitation. Tem-
perature is usually generated using a two-parameter
(mean and standard deviation) normal distribution. In
this study, the gamma and normal distributions are used
to generate the ensemble members of precipitation and
temperature, respectively, for a given day. Similarly to
stochastic weather generators such as the Weather Generator (WGEN; Richardson 1981) and the Weather Generator of the École de technologie supérieure (WeaGETS; Chen et al. 2012), the auto- and cross correlation of and between Tmax and Tmin are preserved using a first-order linear autoregressive model. The detailed methodology is
presented below.
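The occurrence-plus-amount scheme described above can be sketched in a few lines (a hypothetical minimal example, not WGEN or WeaGETS code; the wet-day probability and gamma parameters are placeholder values):

```python
import numpy as np

rng = np.random.default_rng(42)

def generate_precip(p_wet, gamma_shape, gamma_scale, n_members=1000):
    """Generate one day's precipitation ensemble.

    A member is wet when a uniform draw falls below the unconditional
    wet-day probability; wet-day amounts follow a two-parameter gamma law.
    """
    wet = rng.uniform(size=n_members) < p_wet
    amounts = np.where(wet,
                       rng.gamma(gamma_shape, gamma_scale, size=n_members),
                       0.0)
    return amounts

# Placeholder parameters: 40% wet-day probability, gamma(0.8, 6.0) amounts
members = generate_precip(0.4, 0.8, 6.0)
```

Roughly 40% of the 1000 members come out wet, and the wet members carry gamma-distributed amounts; the GPP method below conditions both parameters on the ensemble forecast instead of using fixed values.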
b. Generator-based postprocessing (GPP) method
The GFS ensemble forecasts are postprocessed using
the GPP method. The observed daily precipitation and
temperature are used as predictands, and the forecasted
ensemble mean precipitation and temperature anoma-
lies are used as predictors, respectively. The evaluation
of the GPP method is based on a cross-validation ap-
proach (Wilks 2005) to ensure the independence of the
training and evaluation data. Given 25 years of available
forecasts, when making forecasts for a particular year,
the remaining 24 years were used as training data.
1) POSTPROCESSING FOR PRECIPITATION
The calibration of precipitation is based on four sea-
sons: winter [January–March (JFM)], spring [April–June
(AMJ)], summer [July–September (JAS)], and autumn
[October–December (OND)]. The methodology for the
precipitation calibration is based on the hypothesis that
a relationship must exist between the mean of the en-
semble forecast and both the probability of precipitation
occurrence and wet-day precipitation amounts. The larger
the mean of the ensemble forecast, the more likely that
rainfall will occur, and the more likely that a large pre-
cipitation amount will be registered. For each season and
lead day, the ensemble precipitation is calibrated with the
following three steps.
1) The ensemble mean precipitation is first calculated
using the 15-member ensemble precipitation fore-
casts. The calculated ensemble mean precipitation
for each lead day in the given season is then classified
into several classes based on wet-day precipitation
amounts. The number of classes depends on the size of the training sample. A maximum of 10 classes with wet-day precipitation amounts between 0–1, 1–2, 2–3, 3–4, 4–5, 5–7, 7–10, 10–15, 15–25, and ≥25 mm are used in this study. If the training sample in the largest class contains fewer than 30 precipitation events, the last two classes are combined, and so on. The
probabilities of the observed precipitation occurrence
and observed mean wet-day precipitation amount
corresponding to each class of forecasted precipitation
are then calculated. The observed wet-day precipita-
tion events in each class are fitted using a gamma
distribution. For example, for the first class, all of
the observed wet-day precipitation that correspond to
ensemble mean precipitation ranging between 0 and
1mm are pooled and fitted using a gamma distribution.
2) The second step involves establishing relationships
between the forecasted precipitation classes and the
probabilities of observed precipitation occurrence
and observed mean wet-day precipitation amounts.
Figure 2 presents the probabilities of the observed
precipitation occurrence and mean wet-day precipi-
tation amounts as functions of the forecasted precip-
itation classes for summer precipitation at 1 and
3 lead days over the two selected watersheds (solid
lines in Fig. 2). The results clearly show the relationship between the mean of the ensemble forecast
and the observed probability of precipitation occurrence
(left-hand side), and between that same mean and the observed mean precipitation amount (right-hand side).

FIG. 2. The relationships between forecasted summer (JAS) precipitation classes and the probability of observed summer precipitation occurrence and mean wet-day precipitation amounts for 1 and 3 lead days over the (a),(b) CDD and (c),(d) YAM watersheds.

For a large ensemble mean, the observed
precipitation occurrence is nearly 100% for the
larger basin. For a 7-day lead time (not shown), both
relationships are close to a horizontal line, indicating
that the ensemble precipitation forecast has little
relevance for that lead time. The variability observed
in the graphs is due to sample sizes that are too small. Accordingly, the lines were smoothed using a
second-order polynomial (dashed lines in Fig. 2).
3) In the third step, the relationships (smoothed func-
tions) between the probability of the observed precip-
itation occurrence and the forecasted precipitation
class are directly used to determine the probability
of precipitation occurrence for a given day. For any
given day in the evaluation period, a forecasted pre-
cipitation class is first determined according to the
ensemble mean precipitation for that day. For exam-
ple, if the ensemble mean precipitation is 0.5 mm for a
given day, it is classified into the first class (between
0 and 1mm). The corresponding probability of ob-
served precipitation occurrence (e.g., 40% for the
YAM basin) is then used as the precipitation proba-
bility for this day. Then 1000 random numbers drawn
from a uniform distribution are generated to rep-
resent 1000 members for this day. If the random
numbers are less than or equal to the corresponding
probability of observed precipitation occurrence (e.g.,
40%), the corresponding members are predicted to be wet; otherwise, they are predicted to be dry. Finally, if
amember is deemedwet, the fitted gamma function in
the corresponding class is used to generate the pre-
cipitation amounts with uniform random numbers.
Overall, 1000 members are generated for any given
day. A large number of members is used to obtain the most representative results from the weather generator. A
small number of samples could result in biases due to
the random nature of the stochastic process. The
proposed postprocessing approach does not directly
take into account the autocorrelation of precipitation
occurrence. During the period covered by the ensem-
bleweather forecast, the probability of precipitation is
directly given by the forecast for each lead day, and
thus preserves the coherence of the ensemble forecast.
As such the autocorrelation of precipitation occur-
rence is directly governed by the forecast. If the
forecast is wet for several days, all 1000 members will
carry this information stochastically and all sequences
will be dominated by wet days. As long as the forecasts
have skill, using the probability of precipitation oc-
currence given by the forecasts is highly preferable
to using the mean probabilities used to generate the
occurrence series in a pure stochastic mode. Similarly
to most stochastic weather generators, the proposed
method does not account for the possible autocorre-
lation of precipitation amounts.
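Putting the three steps together, a minimal sketch of the precipitation postprocessing might look as follows (illustrative only: the helper names are hypothetical, and the per-class occurrence probabilities and gamma parameters shown are placeholders, whereas in the actual method they are estimated from the 24-yr training sample and smoothed with a second-order polynomial):

```python
import numpy as np

rng = np.random.default_rng(0)

# Class edges from the paper: 0-1, 1-2, ..., 15-25, >=25 mm
CLASS_EDGES = np.array([0, 1, 2, 3, 4, 5, 7, 10, 15, 25, np.inf])

def classify(ens_mean_precip):
    """Map an ensemble-mean precipitation (mm) to its class index."""
    return int(np.searchsorted(CLASS_EDGES, ens_mean_precip, side="right") - 1)

def gpp_precip_members(ens_mean_precip, p_occ, gamma_params, n_members=1000):
    """Generate calibrated members for one day and lead time.

    p_occ[k]        : smoothed observed wet-day probability for class k
    gamma_params[k] : (shape, scale) of the gamma fit to observed
                      wet-day amounts pooled in class k
    """
    k = classify(ens_mean_precip)
    wet = rng.uniform(size=n_members) <= p_occ[k]
    shape, scale = gamma_params[k]
    amounts = np.where(wet, rng.gamma(shape, scale, size=n_members), 0.0)
    return amounts

# Placeholder per-class statistics for illustration (10 classes)
p_occ = np.linspace(0.2, 0.95, 10)
gamma_params = [(0.7 + 0.1 * k, 2.0 + k) for k in range(10)]

members = gpp_precip_members(0.5, p_occ, gamma_params)  # falls in class 0
```

Because the occurrence probability and gamma parameters both come from the forecast class, wetter ensemble means yield both more wet members and larger amounts, as the calibration hypothesis requires.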
2) POSTPROCESSING FOR TEMPERATURE
The postprocessing for temperature is performed on
a daily basis. The calibration of ensemble temperature
forecasts includes two stages. The first stage consists of
the BC of the ensemble mean temperature using a linear
regression method. The second stage adds the ensemble
spread using a weather generator–based method. For
each evaluation year and lead day, the ensemble tem-
perature forecast BC follows three specific steps:
1) Similarly to precipitation, the ensemble mean temperature (24 yr × 365 days) is first calculated using the 15-member ensemble temperature forecasts (24 yr × 365 days × 15 members). The mean observed daily temperature (1 yr × 365 days) is also calculated using the 24-yr daily time series (24 yr × 365 days). The temperature anomalies (24 yr × 365 days) of both observed and forecasted data are then determined by subtracting the mean observed daily temperature (1 yr × 365 days) from the observed temperature (24 yr × 365 days) and from the ensemble mean temperature (24 yr × 365 days), respectively.
2) Linear equations are fitted between observed and
forecasted temperature anomalies using a 31-day win-
dow centered on the day of interest. For example, when
fitting the linear equation for 16 January, the tempera-
ture anomalies from 1 January to 31 January over 24 yr
are pooled. The use of a 31-day window ensures there
will be enough data points to fit a reliable equation. This
process is conducted for each day to obtain 365 equa-
tions, which can be used to correct the bias of ensemble
mean temperature anomaly for an entire year.
3) The fitted linear equations in step 2 are used to correct
the daily ensemble mean temperature anomaly for
each validation year. Finally, the bias-corrected en-
semble mean temperature is obtained by adding the
mean observed temperature to the bias-corrected temperature anomalies.
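The three bias correction steps can be sketched as follows (an illustrative implementation assuming `obs` and `ens_mean` are (n_years × 365) training arrays; the function names are hypothetical, and the window is wrapped around the year boundary for simplicity):

```python
import numpy as np

def fit_anomaly_correction(obs, ens_mean, window=31):
    """Fit one linear equation y = a*x + b per calendar day.

    obs, ens_mean : (n_years, 365) observed and ensemble-mean temperature.
    Returns the daily climatology and a (365, 2) array of [a, b] coefficients.
    """
    clim = obs.mean(axis=0)            # mean observed daily temperature
    obs_anom = obs - clim              # observed anomalies
    fc_anom = ens_mean - clim          # forecast anomalies
    half = window // 2
    coeffs = np.empty((365, 2))
    for d in range(365):
        # pool a 31-day window centered on day d (wrapping at year ends)
        days = np.arange(d - half, d + half + 1) % 365
        x = fc_anom[:, days].ravel()
        y = obs_anom[:, days].ravel()
        coeffs[d] = np.polyfit(x, y, 1)  # slope a, intercept b
    return clim, coeffs

def correct(ens_mean_day, day, clim, coeffs):
    """Bias correct one day's ensemble-mean temperature."""
    a, b = coeffs[day]
    return clim[day] + a * (ens_mean_day - clim[day]) + b
```

The correction is applied to the anomaly and the climatology is added back at the end, mirroring step 3 above.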
A scatterplot of the ensemble mean temperature be-
fore and after BC is plotted against the corresponding
observed temperature for the 1 lead day over the two
selected watersheds (Fig. 3). In this case, all 25 years of
raw and corrected mean forecasts are pooled together
rather than separated by 31-day windows. Only a slight
bias is observed for the raw GFS ensemble mean temper-
ature for both watersheds, as displayed in Figs. 3a and 3c, where the linear regression line slightly deviates from
the 1:1 line. However, this bias is removed by using the
linear regression method, as shown in Figs. 3b and 3d
where the linear regression and 1:1 lines overlap each
other.
After the BC of the ensemble mean temperature, the
ensemble spread is added using a stochastic weather
generator–based method. The ensemble temperature of
any given day is assumed to follow a two-parameter
(mean and standard deviation) normal distribution. The
bias-corrected ensemble mean temperature is used as the
mean of the normal distribution. The standard deviation
for each season (the same standard deviation is used for
each day in a specific season) is obtained using an opti-
mization algorithm to minimize the root-mean-square
error (RMSE) of rank histogram bins. Specifically, the
optimization algorithm involves four steps. 1) A number
of standard deviation values (ensemble spreads) are
preset for each season. For example, they are set between
0.5° and 5°C with an interval of 0.05°C in this study. 2) The
ensemble temperature for every day in this season is
calculated by multiplying each standard deviation by
a normally distributed random number and adding the
bias-corrected ensemble mean temperature. This step is
repeated for all preset standard deviation values to obtain
a number of temperature ensembles. 3) Rank histograms
are constructed for all temperature ensembles. 4) The
RMSEs of rank histogram bins are calculated for all
histograms. The standard deviation corresponding to the
lowest RMSE is selected as the optimized one for this
season. These four steps are repeated for all four seasons
to obtain four optimized standard deviations for postprocessing the entire year. To ensure there are enough samples to construct reliable rank histograms, the standard deviation is optimized at the seasonal scale.
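The four optimization steps can be sketched as a grid search (illustrative; `bc_mean` and `obs` are assumed to be 1-D arrays over all days of one season, and the grid matches the 0.5°–5°C range with a 0.05°C interval mentioned above):

```python
import numpy as np

rng = np.random.default_rng(3)

def rank_histogram(obs, ensemble):
    """Count the rank of each observation within its ensemble."""
    n_days, n_members = ensemble.shape
    ranks = (ensemble < obs[:, None]).sum(axis=1)   # 0 .. n_members
    return np.bincount(ranks, minlength=n_members + 1)

def optimize_spread(bc_mean, obs, n_members=15):
    """Grid search the seasonal std that flattens the rank histogram."""
    best_sd, best_rmse = None, np.inf
    uniform = len(obs) / (n_members + 1)            # ideal flat bin count
    for sd in np.arange(0.5, 5.0 + 1e-9, 0.05):
        # step 2: build a synthetic ensemble around the bias-corrected mean
        ens = bc_mean[:, None] + sd * rng.standard_normal((len(obs), n_members))
        # steps 3-4: rank histogram and RMSE of its bins against flatness
        bins = rank_histogram(obs, ens)
        rmse = np.sqrt(np.mean((bins - uniform) ** 2))
        if rmse < best_rmse:
            best_sd, best_rmse = sd, rmse
    return best_sd
```

When the preset spread matches the true forecast-error spread, the rank histogram is flattest and the bin RMSE is minimized.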
For any given day, the postprocessed ensemble tem-
perature is found by multiplying the optimized standard
deviation in the specific season by normally distributed random numbers (1000 in this study to represent 1000 members) and adding the bias-corrected ensemble
mean temperature for that day. However, the ensemble
temperature generated this way lacks an autocorrelation
structure. For hydrological studies, autocorrelated time
series of Tmax and Tmin are usually needed to run hy-
drological models. Applying a similar technique used in
weather generators, the observed auto- and cross corre-
lation for and between Tmax and Tmin can be preserved
using a first-order linear autoregressive model. With this
model, the Tmax and Tmin ensembles over several lead
days are generated at the same time, rather than generated day after day and variable after variable.

FIG. 3. The relationships between observed temperature and (left) GFS and (right) GPP ensemble mean temperature for 1 lead day over the (a),(b) CDD and (c),(d) YAM watersheds [linear regression (LR)].
The ensemble mean Tmax and Tmin are first obtained
by adding the mean observed Tmax and Tmin to the
bias-corrected temperature anomalies (obtained from
step 2), respectively, for all lead days. The residual series
of Tmax and Tmin with desired auto- and cross correla-
tion are then generated using a first-order linear autore-
gressive model:
x_i(j) = A x_(i-1)(j) + B e_i(j) ,   (1)

where x_i(j) is a (2 × 1) vector for lead day i whose elements are the residuals of the generated Tmax (j = 1) and Tmin (j = 2); e_i(j) is a (2 × 1) vector of independent random components that are normally distributed with a mean of zero and a variance of unity. Here A and B are (2 × 2) matrices whose elements are defined such that the new sequences have the desired auto- and cross-correlation coefficients. The A and B matrices are determined by

A = M1 M0^(-1) ,   (2)

B B^T = M0 - M1 M0^(-1) M1^T ,   (3)

where the superscripts -1 and T denote the inverse and transpose of the matrix, respectively, and M0 and M1 are the lag-0 and lag-1 covariance matrices, calculated using the observed time series for each season. With
Eq. (1), a number of the residual series over all lead days
are generated to represent the ensemble members. Fi-
nally, the ensemble Tmax and Tmin over several days can
be found by multiplying the optimized standard de-
viation in the specific season by the generated residual
series and adding the bias-corrected ensemble mean
Tmax and Tmin.
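Equations (1)-(3) can be implemented directly (a sketch with placeholder covariance matrices; any matrix square root of B B^T, such as the Cholesky factor used here, is a valid choice for B):

```python
import numpy as np

def ar1_matrices(M0, M1):
    """Solve Eqs. (2)-(3): A = M1 M0^-1 and B B^T = M0 - M1 M0^-1 M1^T."""
    A = M1 @ np.linalg.inv(M0)
    BBt = M0 - A @ M1.T
    B = np.linalg.cholesky(BBt)        # any B with B B^T = BBt works
    return A, B

def generate_residuals(A, B, n_lead, rng):
    """Generate one member's (Tmax, Tmin) residual series over n_lead days."""
    x = np.zeros(2)
    out = np.empty((n_lead, 2))
    for i in range(n_lead):
        x = A @ x + B @ rng.standard_normal(2)   # Eq. (1)
        out[i] = x
    return out

# Example lag-0 and lag-1 covariance matrices (placeholder values)
M0 = np.array([[1.0, 0.7], [0.7, 1.0]])
M1 = np.array([[0.6, 0.5], [0.5, 0.6]])
A, B = ar1_matrices(M0, M1)
```

By construction the stationary covariance of the generated residuals equals M0 (since A M0 A^T + B B^T = M0), so the desired auto- and cross correlations are preserved.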
The first-order linear autoregressive model has been tested extensively in several studies (e.g., Richardson 1981; Chen et al. 2011, 2012) and has shown good performance in preserving the desired auto- and cross correlation of and between Tmax and Tmin. Since the main goal of this paper is to present the postprocessing method, only results involving mean temperature are presented.
c. Bias correction (BC) method
A simple BC method is used as a benchmark to dem-
onstrate the advantages of the proposed GPP method.
The BC step for temperature is similar to that of the GPP method. Linear equations of the form y = ax + b (where a and b are two estimated coefficients) are fitted between observed and forecasted temperature anomalies using a 31-day window centered on the day of interest. The fitted linear equations are then used to correct the daily ensemble temperature anomaly for all 15 members. This step assumes that all ensemble members have the same bias. The variance optimization stage of the GPP method was not applied to the BC method. As such, it can be expected to outline the advantages of the GPP method over the simpler direct bias correction method.
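A minimal sketch of this linear bias correction on synthetic anomalies; the systematic-error coefficients (true_a, true_b) are invented for illustration:

```python
import numpy as np

# synthetic forecast/observed temperature anomalies with a known
# systematic error (true_a, true_b are invented for illustration)
rng = np.random.default_rng(1)
true_a, true_b = 0.8, -1.5
fcst = rng.normal(0.0, 3.0, 31 * 25)  # 31-day window pooled over 25 years
obs = true_a * fcst + true_b + rng.normal(0.0, 0.5, fcst.size)

# fit y = a*x + b between observed and forecast anomalies (least squares)
a, b = np.polyfit(fcst, obs, deg=1)

# the same equation then corrects the anomaly of every ensemble member
corrected = a * fcst + b
```

The fitted (a, b) recover the imposed systematic error, and applying one equation to all members reflects the assumption of a shared bias.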
A bias correction procedure is also applied to the
ensemble precipitation forecast. Linear equations of the form y = ax (where a is the estimated coefficient) are fitted between the observed and forecasted mean precipitation using a 31-day window centered on the day of interest. This differs from the temperature correction in that the linear equation for precipitation is fitted using mean precipitation values rather than daily values, which results in a more reliable estimate of the linear dependence between observed and forecasted values. Moreover, since the distribution of daily precipitation is highly skewed, a fourth-root transformation was applied to precipitation values prior to fitting the linear equations. As for temperature, for a given day, the same linear equation is used for all ensemble members.
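A sketch of the precipitation correction under these assumptions (no-intercept fit in fourth-root space); the multiplicative wet bias and gamma parameters below are invented:

```python
import numpy as np

def fit_precip_correction(fcst_mean, obs_mean):
    """Fit y = a*x (no intercept) between fourth-root-transformed
    observed and forecast mean precipitation."""
    x = fcst_mean ** 0.25
    y = obs_mean ** 0.25
    return np.sum(x * y) / np.sum(x * x)  # least-squares slope through origin

def apply_precip_correction(a, precip):
    """Apply the fitted slope in fourth-root space, then back-transform."""
    return (a * precip ** 0.25) ** 4

# synthetic mean precipitation with an invented multiplicative wet bias
rng = np.random.default_rng(2)
obs_mean = rng.gamma(shape=0.8, scale=5.0, size=365)
fcst_mean = 1.3 * obs_mean
a = fit_precip_correction(fcst_mean, obs_mean)
corrected = apply_precip_correction(a, fcst_mean)
```

For a purely multiplicative bias the slope removes it exactly; real forecast errors would of course leave residual scatter.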
d. Verification of the postprocessing method
Rank histograms permit a quick examination of the
quality of ensemble weather forecasts. Consistent biases
in an ensemble weather forecast result in a sloped rank
histogram, and a lack of variability (underdispersion) is
revealed as a U-shaped, concave, population of the ranks
(Hamill 2001). Thus, the rank histogram is first used to
evaluate ensemble precipitation and temperature fore-
casts. However, a uniform rank histogram is a necessary but not a sufficient criterion for determining the reliability of an ensemble forecast system (Hamill 2001). In addition, some characteristics, such as resolution, are not evaluated by rank histograms. Other verification metrics are thus necessary for testing the predictive power of an ensemble weather forecast. In this study, the
GFS, BC, and GPP ensemble precipitation and temper-
ature forecasts are verified using the Ensemble Verifica-
tion System (EVS) developed by Brown et al. (2010). The
selected verification metrics include two deterministic
metrics for verifying the ensemble mean, and two prob-
abilistic metrics for verifying the distribution. The con-
tinuous ranked probability skill score (CRPSS) and the
Brier skill score (BSS) are also used to verify the skill of
the ensemble forecasts relative to the climatology.
The two deterministic metrics are the mean absolute
error (MAE) and the RMSE. The MAE measures the mean absolute difference between the ensemble mean forecast and the corresponding observation, and the RMSE measures the average square error of the ensemble mean
MARCH 2014 CHEN ET AL. 1113
forecast. The two probabilistic metrics include the BS
and the reliability diagram. The BS measures the average
square error of a probability forecast. It is analogous to
the mean square error of a deterministic forecast. It can
be decomposed into three components: reliability, reso-
lution, and uncertainty. A reliability diagram measures
the accuracy with which a discrete event is forecast by an
ensemble forecast system. The BS and reliability diagram
only verify discrete events in the continuous forecast
distributions. Thus, one or more thresholds have to be
defined to represent cutoff values from which discrete
events are computed. Six thresholds corresponding to the
probability of precipitation and temperature exceeding
10% (lower decile), 33% (lower tercile), 50% (median),
67% (upper tercile), 90% (upper decile), and 95% (95th
percentile) are used in this study. Details of these metrics
can be found in Brown et al. (2010), Demargne et al.
(2010), and in the user manual of the EVS (Brown 2012)
(http://amazon.nws.noaa.gov/ohd/evs/evs.html).
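The rank-histogram diagnostic described above is simple to compute: for each forecast case, count the ensemble members falling below the observation and tally the resulting ranks. A sketch on synthetic data (distributions and sizes are invented) illustrates both a calibrated and an underdispersed ensemble:

```python
import numpy as np

def rank_histogram(obs, ens):
    """Counts of observation ranks within the ensemble (Hamill 2001).
    obs: (n,) observations; ens: (n, m) ensemble members.
    Returns counts over the m + 1 possible ranks."""
    n, m = ens.shape
    ranks = np.sum(ens < obs[:, None], axis=1)  # members strictly below obs
    return np.bincount(ranks, minlength=m + 1)

rng = np.random.default_rng(3)
n, m = 5000, 15
obs = rng.normal(size=n)
# calibrated ensemble: members drawn from the observation distribution
counts_flat = rank_histogram(obs, rng.normal(size=(n, m)))
# underdispersed ensemble: too little spread -> U-shaped histogram
counts_under = rank_histogram(obs, rng.normal(scale=0.5, size=(n, m)))
```

The calibrated case populates all 16 ranks roughly uniformly, while the underdispersed case piles up in the extreme ranks, the U shape seen for the raw GFS forecasts.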
4. Results
Figure 4 presents the rank histograms of GFS and GPP ensemble precipitation forecasts for 1 lead day over the two watersheds. Only wet-day precipitation is used to produce the rank histograms. To allow for a proper comparison with the raw ensemble forecasts, only 15 members are generated using the GPP method in this case. The
results show that the distributions of the raw GFS en-
semble forecasts are highly nonuniform; there is a marked tendency for the distribution to be most populated at the lowest and highest ranks, forming U-shaped rank histograms (Figs. 4a,c). This indicates that the raw
GFS forecasts are considerably underdispersive for both
watersheds. Wet biases are observed for the CDD wa-
tershed and dry biases for theYAMwatershed.However,
after the calibration with the GPP method, the rank his-
tograms aremuch flatter for both watersheds (Figs. 4b,d),
even though only 15 members are generated in this case.
Using more members would result in more uniform rank
histograms.
Figure 5 shows the rank histograms of ensemble tem-
perature forecasts before and after calibration for 1 lead
day over the twowatersheds. Similarly to precipitation, to
allow for a fair comparison with the raw ensemble fore-
casts, only 15 members are generated for the GPP en-
semble forecasts in this case. The results show that the
distribution of raw GFS ensemble forecasts is highly
FIG. 4. Rank histograms of the (left) GFS and (right) GPP ensemble precipitation forecasts for 1 lead day over the (a),(b) CDD and (c),(d) YAM watersheds.
nonuniform (U shaped) for temperature. There is a
marked tendency for the distribution to be most popu-
lated at the extreme ranks, indicating the underdispersion
and cold bias of the raw forecasts over the two water-
sheds. However, rank histograms of calibrated ensemble
forecasts tend to be uniform for both watersheds.
Figures 6 and 7 show the quality of the ensemble mean forecast before and after postprocessing in terms of the MAE and RMSE, respectively, for both precipitation and temperature over both watersheds. Both statistics are computed using all forecast–observation pairs (25 yr × 365 days). Overall, the GFS ensemble mean forecasts
display large errors for both precipitation and temperature for lead times from 1 to 7 days. However, the GPP method consistently improves the quality of the ensemble mean forecasts for all leads. In terms of the MAE, the BC method displays more benefits than the GPP method for precipitation over both watersheds. This is expected, since the BC method specifically accounts for the bias of the GFS forecast. However, in terms of the RMSE, the GPP method consistently performs better than the BC method for precipitation. Since the BC and GPP methods share the same step for removing the bias of the ensemble mean temperature, the MAE and RMSE of the forecast temperature are the same for both.
As displayed in Figs. 6a and 6c, the quality of raw en-
semble mean forecasts decreases slightly with increasing
lead time for precipitation in terms of the MAE. How-
ever, the RMSE of raw ensemble mean forecasts tends to
decrease with an increase in lead time for precipitation
(Figs. 7a,c). After postprocessing, the forecast quality
slightly decreases with the increase in lead days. For the
ensemble mean temperature, there is a progressive de-
cline in forecast quality with increasing lead time in terms
of both MAE and RMSE.
Moreover, the quality of the ensemble mean forecast
at the CDD watershed is consistently better than that
at the YAM watershed for precipitation, suggesting that
watershed size plays an important role. This likely indi-
cates that the numerical weather forecast system is better
at representing precipitation events over a larger area,
since the representation of convective events is very dif-
ficult considering the horizontal resolution of the com-
putational grid. In this work, the observed precipitation
is watershed averaged, and as such, convective precipi-
tation extremes are smoothed over the larger basin. The
FIG. 5. Rank histograms of the (left) GFS and (right) GPP ensemble temperature forecasts for 1 lead day over the
(a),(b) CDD and (c),(d) YAM watersheds.
same extremes would play a more important role in a
smaller watershed.
The skill of ensemble forecasts relative to unskilled
climatology is assessed using the mean continuous ranked
probability skill score (MCRPSS; Fig. 8). The GFS en-
semble precipitation forecasts show negative skill relative
to the climatology over both watersheds. The forecast
skill consistently increases with the forecast leads. This is
caused primarily by the lack of spread (greater sharpness)
in shorter lead ensemble forecasts and the larger spread
in longer lead ensemble forecasts. Even though the BC
method is able to improve the ensemble precipitation
forecast to a certain extent, the skill is still negative for
all 7 lead days. The GPP method considerably increases the skill of the ensemble forecast for both watersheds and is consistently better than the BC method. The skill
of the GPP forecast decreases with increasing lead times,
and is close to zero at the 7-day lead, indicating that the
ensemble weather forecast has reached its predictability
limit. Thus, the calibration of ensemble precipitation
forecasts for a period of 7 lead days is appropriate in this
study.
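The sample CRPS underlying this skill score can be computed directly from the ensemble members. A sketch on synthetic forecasts (the noise levels and sample sizes are invented) compares an informative ensemble against climatology:

```python
import numpy as np

def crps_ensemble(obs, ens):
    """Sample CRPS averaged over forecast cases:
    CRPS = E|X - y| - 0.5 E|X - X'| (Gneiting et al. 2005)."""
    term1 = np.mean(np.abs(ens - obs[:, None]), axis=1)
    term2 = 0.5 * np.mean(np.abs(ens[:, :, None] - ens[:, None, :]),
                          axis=(1, 2))
    return float(np.mean(term1 - term2))

rng = np.random.default_rng(4)
n, m = 2000, 15
truth = rng.normal(size=n)
skillful = truth[:, None] + rng.normal(scale=0.5, size=(n, m))  # informative
climatology = rng.normal(size=(n, m))  # same marginal, no day-to-day info
crps_f = crps_ensemble(truth, skillful)
crps_c = crps_ensemble(truth, climatology)
crpss = 1.0 - crps_f / crps_c  # positive means better than climatology
```

A positive CRPSS means the forecast beats climatology, which is the sense in which the GPP forecasts remain skillful out to 7 days.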
The GFS ensemble temperature forecasts are much
more skillful than their precipitation forecasts for all leads
over both watersheds. Even though the GFS ensemble
temperature forecasts are skillful for the period up to
1 week, they can be further improved by both GPP and
BC methods. The GPP ensemble temperature forecasts
are consistently better than the BC ones for all 7 lead
days and both watersheds, indicating that benefits of
the GPP method not only come from the BC stage, but
also from the variance optimization stage. The BC
stage plays a slightly more important role at improving
the raw forecasts. Moreover, the skill of ensemble tem-
perature forecasts (before and after postprocessing)
consistently decreases with the increase in lead time for
both watersheds.
For probabilistic metrics computed for discrete events,
such as the BS, BSS, and reliability diagrams used in this
study, a number of thresholds have been defined. As
mentioned earlier, six thresholds were used in this study.
Since similar patterns are obtained for all thresholds, only the results for the median-exceedance threshold are presented for illustration for all four metrics.
Figures 9 and 10 show the BS of GFS, GPP, and BC
ensemble precipitation and temperature forecasts for
both watersheds, with leads ranging between 1 and 7 days. The reliability, resolution, and uncertainty components of the BS, and the BSS, which measures the performance of an ensemble weather forecast relative to the climatology,
FIG. 6. MAE of GFS, GPP, and BC ensemble mean (left) precipitation and (right) temperature forecasts for 1–7 lead
days over the (a),(b) CDD and (c),(d) YAM watersheds.
are also presented. The reliability term of the BS measures how close the forecast probabilities are to the true probabilities, with smaller values indicating a better forecast system. The resolution term measures how much the predicted probabilities differ from the climatological average and therefore contribute valuable information. Thus, a larger resolution value suggests a better forecast. By definition, the uncertainty term of the BS is always equal to 0.25 [0.5 × (1 − 0.5)] when using the median as the threshold.
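The decomposition just described can be checked numerically. This sketch builds forecast probabilities from a toy 16-member ensemble for a median-exceedance event (all values invented), and bins on the distinct probability values so that the identity BS = reliability − resolution + uncertainty holds exactly:

```python
import numpy as np

def brier_decomposition(p, o):
    """Murphy decomposition of the Brier score, binning on the distinct
    forecast probability values so BS = rel - res + unc holds exactly.
    p: forecast probabilities; o: binary (0/1) observed outcomes."""
    bs = np.mean((p - o) ** 2)
    obar = o.mean()
    unc = obar * (1.0 - obar)
    rel = res = 0.0
    for pk in np.unique(p):
        mask = p == pk
        nk, ok = mask.sum(), o[mask].mean()
        rel += nk * (pk - ok) ** 2
        res += nk * (ok - obar) ** 2
    return bs, rel / p.size, res / p.size, unc

rng = np.random.default_rng(5)
# toy 16-member ensemble forecasting a median-exceedance event
truth = rng.normal(size=4000)
ens = truth[:, None] + rng.normal(scale=0.7, size=(4000, 16))
p = np.mean(ens > 0.0, axis=1)       # forecast probability of the event
o = (truth > 0.0).astype(float)      # observed binary outcome
bs, rel, res, unc = brier_decomposition(p, o)
```

With the median as the threshold, the uncertainty term comes out near 0.25, as stated above, and an informative ensemble yields a BS below that value.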
In terms of the BS, the ensemble forecasts are less ac-
curate (in overall performance) for precipitation (Fig. 9),
and reasonably accurate for temperature (Fig. 10) for
both watersheds. In terms of the BS, the BC method
performs slightly better for the temperature forecasts, but
shows no improvement for the ensemble precipitation
forecasts. Nevertheless, the GPP method consistently
increases the accuracy for both precipitation and tem-
perature for all 7 lead days for both watersheds, with
a consistent increase in the resolution component of the
BS. In addition, the reliability component of the BS is also improved for the ensemble precipitation forecasts for all lead times. In contrast, the reliability component is slightly degraded for the ensemble temperature forecasts
at all lead times. This is because the raw ensemble
forecasts are very reliable for temperature to begin
with (mean reliability component of 0.005 for the CDD
watershed and 0.003 for the YAM watershed). The
moderate decrease in the BS is due to a relatively large
increase in the resolution component and a slight de-
terioration of the reliability component.
In terms of the BSS, the skill of the GFS ensemble
precipitation forecast is negative for all 7 lead days. The
BC method results in small improvements for preci-
pitation forecasts, but only for the first few lead days. It
then progressively becomes worse than the GFS fore-
casts for the other lead days. The GPP method consid-
erably improves the skill of the ensemble forecast for
both watersheds and is consistently better than the BC
method. The skill of the GPP ensemble forecast de-
creases with increasing lead times, with the BSS being
close to zero at 7 lead days, further indicating that the
ensemble weather forecast retains some skill for a pe-
riod of up to 1 week for precipitation. The BSS shows
high skill in GFS ensemble temperature forecasts, for all
lead times and for both watersheds. The BC method
slightly improves the skill of the ensemble temperature
forecast, but at the expense of the resolution. However,
the GPP ensemble forecast consistently exceeds the skill
of GFS and BC ensemble forecasts. In particular, the
FIG. 7. RMSE of GFS, GPP, and BC ensemble mean (left) precipitation and (right) temperature forecasts for 1–7
lead days over the (a),(b) CDD and (c),(d) YAM watersheds.
BSS of the GPP ensemble forecast at 7 lead days is greater than that of the GFS forecast at 1 lead day for the CDD watershed.
The reliability diagram (Hartmann et al. 2002) is a
graph of the observed frequency of an event plotted
against its forecast probability. It provides information
with respect to the reliability, resolution, skill, and
sharpness of a forecast system. Figure 11 presents the
reliability diagrams of ensemble precipitation and tem-
perature forecasts before and after postprocessing for
a probability threshold exceeding the median for 1 and 3
lead days over both watersheds. Underdispersion of the
raw ensemble precipitation forecast (Fig. 4) is reflected in the reliability diagrams (Figs. 11a,c) for both 1 and 3 lead days for both watersheds, as shown by the slopes of the reliability curves, which are smaller than 1. This indicates
that the GFS ensemble precipitation forecasts are poorly
calibrated with limited skill and resolution. The prob-
ability forecasts derived from the raw ensembles are
overconfident, which can be reflected by the sharpness
(relative frequency of usage). The BC method results in little improvement for the ensemble precipitation forecasts, essentially because it does not account for the
spread. The GPP method results in a dramatic improve-
ment in the reliability of the ensemble precipitation
forecasts for both 1 and 3 lead days, although the sharp-
ness is somewhat lessened. Postprocessing thus improves
the reliability at the expense of the sharpness.
The cold biases of the raw temperature forecast (Fig. 5) are also reflected in the reliability diagrams (Figs. 11b,d) for both 1 and 3 lead days for both watersheds, as displayed by the underforecasting. The ensemble temperature forecasts calibrated using the weather generator–based method are much more reliable than the GFS and BC forecasts for both 1 and 3 lead days over both watersheds, as indicated by the reliability curves, which are very close to the 1:1 reference line. Here again, the improvement in reliability results in a slight decline in the sharpness. The better performance of the GPP method over the BC method is a clear indication that a significant
part of the performance is derived from the variance
optimization stage.
5. Discussion and conclusions
Ensemble weather forecasts generally suffer from bias
and tend to be underdispersive, which limits their pre-
dictive power. Several methods, such as logistic regression,
BMA, and ensemble dressing have been proposed for
postprocessing these ensemble forecasts. These methods
FIG. 8. MCRPSS of GFS, GPP, and BC ensemble (left) precipitation and (right) temperature forecasts for 1–7 lead days over the (a),(b) CDD and (c),(d) YAM watersheds.
are relatively complex to set up and generally aim at es-
timating the underlying predictive PDFs. However, a
series of point values are often more convenient for
practical applications such as ensemble streamflow fore-
casts, which need discrete, autocorrelated time series
over several days in order to run hydrological models.
These discrete, autocorrelated time series of precipitation and temperature need to be physically constrained. For example, temperature changes from one day to the next and the probability of precipitation occurrence are not random. Even if a method is adequate at reconstructing the underlying PDF, there is no simple way to go from the underlying distributions to generating several time series fully consistent with those distributions. The
GPP method presented in this study is significantly sim-
pler to implement than most existing methods, and it can
readily generate an infinite number of discrete, auto-
correlated time series over the forecasting horizon. The
auto- and cross correlation of and between Tmax and
Tmin were specifically taken into account with this
method. Moreover, the GPP method specifically takes
into account the precipitation occurrence biases of en-
semble forecasts. Precipitation amounts are modeled
using a parametric distribution. This underlying assump-
tion allows extreme values outside the range of the ob-
served data to be simulated. A gamma distribution was
used in this study. Other distributions with a heavy tail
(e.g., mixed exponential distribution and hybrid expo-
nential and generalized Pareto distribution) can also be
used to better represent the ensemble spread (C. Li et al.
2012; Z. Li et al. 2013). This is one of the major advan-
tages of this method over analog techniques. Even though
the Hamill et al. (2006) approach also constructed a dis-
tribution from the observations associated with the ana-
logs, the inclusion of dry events makes it impossible to fit
the parametric distributions. Furthermore, the performance of an analog technique is strongly dependent on the reforecast period, especially for extreme precipitation
forecasts. The rarer the events, the more difficult it is to
find appropriate forecast analogs (Hamill et al. 2006).
FIG. 9. BS and its decomposed components [Brier score uncertainty (BSunc), Brier score reliability (BSrel), and Brier score resolution
(BSres)] of (left) GFS, (middle) GPP, and (right) BC ensemble precipitation forecasts for 1–7 lead days over the (a)–(c) CDD and (d)–(f)
YAM watersheds. Probability exceeding 50% (median) is used as the threshold. The BSS of the ensemble forecasts relative to the
climatology is also presented.
However, this problem is avoided by the GPP method, since the extremes of ensemble forecasts are grouped into the last class of the ensemble mean forecast. The
ensemble mean precipitation and mean temperature
anomalies are used as predictors for postprocessing pre-
cipitation and temperature, respectively. Since the spatial
resolution of ensemble weather forecasts (model scale) is
usually lower than the required resolution of real appli-
cations (e.g., watershed scale for hydrological studies),
postprocessing also acts as a downscaling method.
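The gamma-fitting step mentioned above can be sketched with a closed-form approximate maximum-likelihood estimator; the Thom (1958) estimator used here is a common choice in weather generators, and the gamma parameters below are invented for illustration:

```python
import numpy as np

def fit_gamma_thom(x):
    """Approximate maximum-likelihood gamma fit using the Thom (1958)
    estimator, common in weather generators for wet-day amounts."""
    a_stat = np.log(x.mean()) - np.mean(np.log(x))
    shape = (1.0 + np.sqrt(1.0 + 4.0 * a_stat / 3.0)) / (4.0 * a_stat)
    return shape, x.mean() / shape  # (shape, scale)

rng = np.random.default_rng(6)
# synthetic wet-day precipitation amounts (parameters are invented)
wet = rng.gamma(shape=0.8, scale=8.0, size=20000)
shape, scale = fit_gamma_thom(wet)
# sampling from the fitted distribution can yield amounts beyond
# the observed maximum, unlike resampling-based analog approaches
sim = rng.gamma(shape, scale, size=20000)
```

This illustrates the advantage cited in the text: a parametric fit extrapolates beyond the observed record, which pure analog resampling cannot do.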
A simple BC method was used as a benchmark to
demonstrate the performance of the GPP method over
two Canadian watersheds located in the Province of
Quebec. Ensemble weather forecasts were taken from
the GFS retrospective forecast dataset. Much longer time series are available for ensemble reforecasts than for operational forecasts, allowing the postprocessing method to be properly calibrated. Previous studies convincingly showed
that postprocessing using reforecasts achieved substantial
improvements in their skill and reliability (Hagedorn
et al. 2008; Hamill et al. 2008).
The GFS and GPP ensemble weather forecasts were
preliminarily tested using rank histograms. Similarly to
previous studies (Hamill and Colucci 1997; Hagedorn
et al. 2008), the GFS forecasts were found to be biased
and underdispersed, as illustrated by the excess pop-
ulations of the extreme ranks. This underdispersion was
more pronounced at the shorter forecast leads than for
longer forecast leads (results not shown). Uniform rank
histograms could be achieved for both precipitation and
temperature when postprocessed using the GPP method.
The performance of GFS, GPP, and BC ensemble
weather forecasts was further verified using both de-
terministic and probabilistic metrics. The deterministic
metrics (MAE and RMSE) showed large errors in GFS
ensemble mean forecasts for both precipitation and
temperature at all 7 lead days over both watersheds. The
GPP method was able to consistently improve the quality
of the ensemble mean forecasts. The skill of ensemble
weather forecasts relative to the climatology was mea-
sured using MCRPSS. The raw forecast had negative to
near-zero skill at all forecast leads for precipitation. The
GPP method substantially improved the skill of the en-
semble forecasts for precipitation, with MCRPSS being
positive for all 7 lead days. Even though relatively good
skill was observed for the raw ensemble temperature
FIG. 10. As in Fig. 9, but for ensemble temperature forecasts.
forecasts, they could be further improved by applying
the postprocessing method. The performance of the GPP
method was consistently better than that of the BC
method.
Probabilistic metrics computed for discrete events
including the BS, BSS, and reliability diagrams were
further used to verify the overall performance (accu-
racy, skill, reliability, resolution, and sharpness) of en-
semble weather forecasts. Overall, the GPPmethod was
able to consistently improve the accuracy of ensemble
forecasts for both precipitation and temperature over
both watersheds. It also consistently outperformed the
BC method. The GFS ensemble forecasts showed neg-
ative skill for precipitation for all 7 lead days. This in-
dicated that the underdispersed GFS forecasts were even
worse than the climatology for precipitation. However,
with the GPP method, a positive skill was achieved for
a period of up to 7 lead days. With the GPP method, the
skill of the ensemble temperature forecasts was also improved, even though they already showed reasonable skill before postprocessing. Underdispersion of the
GFS ensemble precipitation forecasts was reflected in the
reliability diagrams, indicating that the GFS precipitation
forecasts were poorly calibrated and showed little skill
and resolution. The GPP method markedly improved
their reliability and resolution for all leads over both
watersheds. However, the sharpness was somewhat
diminished. This is consistent with other studies (e.g.,
Hamill et al. 2008) in that the reliability was improved
at the expense of sharpness. The reliability diagrams
showed cold biases for GFS ensemble temperature.
However, the reliability curves were very close to the 1:1
perfect line after postprocessing.
Overall, even though GFS ensemble forecasts are bi-
ased and tend to be underdispersed, their overall per-
formance was considerably improved using the proposed
GPP method. Predictably, the performance of the GPP
method decreased with increasing lead time. For the GFS ensemble reforecasts and the selected basins, 7 days was the maximum useful lead time for precipitation. For temperature,
postprocessing over a longer period may be possible. The
use of the BC method for temperature provided an op-
portunity to separate the advantage of the GPP method
into that from the bias correction and variance optimiza-
tion stages. The better performance of the GPP method
clearly demonstrates the importance of the variance
FIG. 11. Reliability diagrams of GFS, GPP, and BC ensemble precipitation and temperature forecasts for 1 and 3 lead days over the (a),(b)
CDD and (c),(d) YAM watersheds. The probability of exceeding 50% (median) is used as the threshold.
optimization stage, even though the bias correction
carries the largest part of the performance gain.
Owing to restrictions in paper length, the only results shown for the probabilistic metrics (BS, BSS, and reliability diagrams) are those with a threshold exceeding the median value. The use of higher thresholds
is also interesting for many real-world applications. The
ensemble weather forecasts were also tested using higher
thresholds (67%, 90%, and 95%). While the results for the
higher thresholds slightly differ from those obtained using
the median values, the patterns were very similar. Spe-
cifically, the skill of the GPP forecasts decreased slightly
with the increasing threshold, because the GFS forecast
performance gets worse for the higher thresholds. How-
ever, the degree of improvement obtained from the GPP
method increased with the larger thresholds.
The excellent performance of the postprocessing
scheme may partly be due to the choice of basin-averaged meteorological time series. While still much smaller than the numerical model scale, the basin scale (9700 and 3330 km2) is nevertheless more comparable and may result in a better representation of precipitation than the station scale that is more commonly used. The method
performed very well with only one predictor (ensemble
mean for precipitation and ensemble mean anomaly for
temperature). No attempt was made at using additional
predictors. In particular, the use of the ensemble standard
deviation may yield additional improvements.
To obtain the true expectation of a weather generator, a large number of ensemble members need to be generated with the proposed method. Short time series could result in biases due to the random nature of the stochastic process. Thus, a 1000-member ensemble was generated in this study. For hydrological studies, however, running such a large ensemble through a fully distributed model may be time consuming. Nevertheless, as indicated by the rank histograms in Figs. 4 and 5, the ensemble with only 15 members was still better than the GFS forecast.
Therefore, depending on research purposes and hydrol-
ogy model complexity, an ensemble with fewer members
may also be acceptable.
For the real climate system, a correlation exists between precipitation and temperature: mean temperature is generally cooler on wet days. However, the proposed GPP method generated precipitation and temperature independently, which may affect this correlation to a certain extent. To investigate this, the precipitation–temperature correlations were calculated
for observed and forecasted datasets using all 25-yr daily
time series. The correlation coefficients for GFS, GPP,
and BC forecasts were obtained by averaging all co-
efficients over all 7 lead days and all ensemble members.
The correlation coefficients are 0.189, 0.363, 0.161, and
0.227 for observed data, GFS, GPP, and BC forecasts,
respectively, over the CDD watershed. They are equal
to 0.119, 0.239, 0.088, and 0.069, respectively, over the
YAM watershed. These results indicate that the GFS
forecasts overestimated the precipitation–temperature
correlation. However, the precipitation–temperature
correlation was slightly underestimated when using the
GPP method. This is as expected, since the ensemble
precipitation and temperature were generated indepen-
dently. However, it should be noted that any bias cor-
rection or postprocessing method may be expected to
alter the precipitation–temperature correlation, unless
specifically taken into account.
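The member-averaged correlation diagnostic described here amounts to a simple Pearson correlation per member; a sketch with synthetic, weakly coupled daily series (the coupling strength is invented, not the study's values):

```python
import numpy as np

def mean_member_corr(precip, temp):
    """Pearson precipitation-temperature correlation per ensemble
    member (members along axis 1), averaged across members."""
    return float(np.mean([np.corrcoef(precip[:, k], temp[:, k])[0, 1]
                          for k in range(precip.shape[1])]))

rng = np.random.default_rng(8)
n, m = 25 * 365, 15  # 25 yr of daily values, 15 members
temp = rng.normal(size=(n, m))
precip = 0.2 * temp + rng.normal(size=(n, m))  # invented weak coupling
r = mean_member_corr(precip, temp)
```

Comparing such a coefficient for observations against generated ensembles is how the correlation distortion reported above was quantified.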
The goal of this work was to provide a postprocessing
method to improve the ensemble weather forecasts for
ensemble streamflow forecasts in the Province of Quebec.
The proposed method was tested on two watersheds. For
a broader use, it should be tested using more datasets from
different climate zones. In addition, the performance of
a postprocessing method may partly depend on the skill of the raw forecasts. Thus, it may be necessary to test the proposed method using other ensemble weather forecast products such as the ECMWF reforecast. Notably, in the course of this work, a newer version of the GFS reforecast dataset was made available, showing improved skill over the older one used in this study (Hamill et al.
2013). It would be interesting to know how the GPP
method would perform on this newer dataset. Therefore, more comprehensive verifications, including evaluating the proposed method at different locations and using other ensemble weather forecast products, are recommended in future studies.
Acknowledgments. This work is part of a project fun-
ded by the Projet d’Adaptation aux Changements Cli-
matiques (PACC26) of the Province of Quebec, Canada.
The authors thank the Centre d’Expertise Hydrique du
Québec (CEHQ) and the Ouranos Consortium on Re-
gional Climatology and Adaptation for their contribu-
tions to this project. The authors wish to thankDr.Vincent
Fortin of Environment Canada for his insights on ensem-
ble weather forecasts, and Dr. James D. Brown of the
NOAA/National Weather Service, Office of Hydrologic
Development, for assisting us in the use of the Ensemble
Verification System (EVS) and for providing insightful
comments as we prepared this paper. We also thank the
Earth System Research Laboratory, Physical Sciences
Division for providing reforecast products.
REFERENCES
Bertotti, L., J. R. Bidlot, R. Buizza, L. Cavaleri, and M. Janousek,
2011: Deterministic and ensemble-based prediction of Adriatic
Sea sirocco storms leading to ‘acqua alta’ in Venice. Quart. J.
Roy. Meteor. Soc., 137 (659), 1446–1466.
Boucher, M. A., F. Anctil, L. Perreault, and D. Tremblay, 2011:
A comparison between ensemble and deterministic hydro-
logical forecasts in an operational context. Adv. Geosci., 29,
85–94, doi:10.5194/adgeo-29-85-2011.
Brocker, J., and L. A. Smith, 2008: From ensemble forecasts to
predictive distribution functions. Tellus, 60A, 663–678.
Brown, J. D., 2012: Ensemble Verification System (EVS), version
4.0. User’s manual, 107 pp.
——, J. Demargne, D. J. Seo, and Y. Liu, 2010: The Ensemble
Verification System (EVS): A software tool for verifying
ensemble forecasts of hydrometeorological and hydrologic
variables at discrete locations.Environ. Modell. Software, 25,
854–872.
Buizza, R., 1997: Potential forecast skill of ensemble prediction and
spread and skill distributions of the ECMWF ensemble pre-
diction system. Mon. Wea. Rev., 125, 99–119.
——, P. L. Houtekamer, Z. Toth, G. Pellerin, M. Wei, and Y. Zhu,
2005: A comparison of the ECMWF, MSC, and NCEP Global
Ensemble Prediction Systems. Mon. Wea. Rev., 133, 1076–
1097.
Chen, J., F. P. Brissette, and R. Leconte, 2010: A daily stochastic weather generator for preserving low-frequency of climate variability. J. Hydrol., 388, 480–490.
——, ——, and ——, 2011: Assessment and improvement of stochastic weather generators in simulating maximum and minimum temperatures. Trans. ASABE, 54, 1627–1637.
——, ——, ——, and A. Caron, 2012: A versatile weather generator for daily precipitation and temperature. Trans. ASABE, 55, 895–906.
Coulibaly, P., 2003: Impact of meteorological predictions on real-time spring flow forecasting. Hydrol. Processes, 17, 3791–3801.
Cui, B., Z. Toth, Y. Zhu, and D. Hou, 2012: Bias correction for
global ensemble forecast. Wea. Forecasting, 27, 396–410.
Demargne, J., J. Brown, Y. Liu, D. J. Seo, L. Wu, Z. Toth, and
Y. Zhu, 2010: Diagnostic verification of hydrometeorological
and hydrologic ensembles. Atmos. Sci. Lett., 11, 114–122.
Eckel, F. A., and M. K. Walters, 1998: Calibrated probabilistic
quantitative precipitation forecasts based on the MRF en-
semble. Wea. Forecasting, 13, 1132–1147.
Gneiting, T., A. E. Raftery, A. H. Westveld III, and T. Goldman,
2005: Calibrated probabilistic forecasting using ensemble
model output statistics and minimum CRPS estimation. Mon.
Wea. Rev., 133, 1098–1118.
Hagedorn, R., T. Hamill, and J. S. Whitaker, 2008: Probabilistic
forecast calibration using ECMWF and GFS ensemble refor-
ecasts. Part I: Two-meter temperatures. Mon. Wea. Rev., 136,
2608–2619.
Hamill, T. M., 2001: Interpretation of rank histograms for verifying
ensemble forecasts. Mon. Wea. Rev., 129, 550–560.
——, and S. J. Colucci, 1997: Verification of Eta–RSM short-range
ensemble forecasts. Mon. Wea. Rev., 125, 1312–1327.
——, and ——, 1998: Evaluation of Eta-RSM ensemble proba-
bilistic precipitation forecasts. Mon. Wea. Rev., 126, 711–
724.
——, and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. Mon. Wea. Rev., 134, 3209–3229.
——, and ——, 2007: Ensemble calibration of 500-hPa geopotential height and 850-hPa and 2-m temperatures using reforecasts. Mon. Wea. Rev., 135, 3273–3280.
——, ——, and X. Wei, 2004: Ensemble reforecasting: Improving
medium-range forecast skill using retrospective forecasts.
Mon. Wea. Rev., 132, 1434–1447.
——, ——, and S. L. Mullen, 2006: Reforecasts: An important
dataset for improving weather predictions. Bull. Amer. Me-
teor. Soc., 87, 33–46.
——, R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast
calibration using ECMWF and GFS ensemble reforecasts.
Part II: Precipitation. Mon. Wea. Rev., 136, 2620–2632.
——, G. T. Bates, J. S. Whitaker, D. R. Murray, M. Fiorino, T. J.
Galarneau Jr., Y. J. Zhu, and W. Lapenta, 2013: NOAA’s
second-generation global medium-range ensemble reforecast
dataset. Bull. Amer. Meteor. Soc., 94, 1553–1565.
Hartmann, H. C., T. C. Pagano, S. Sorooshian, and R. Bales, 2002:
Confidence builders: Evaluating seasonal climate forecasts
from user perspectives. Bull. Amer. Meteor. Soc., 83, 683–698.
Hutchinson, M. F., D. W. McKenney, K. Lawrence, J. H. Pedlar,
R. F. Hopkinson, E. Milewska, and P. Papadopol, 2009:
Development and testing of Canada-wide interpolated spatial
models of daily minimum–maximum temperature and pre-
cipitation for 1961–2003. J. Appl. Meteor. Climatol., 48, 725–
741.
Li, C., V. P. Singh, and A. K. Mishra, 2012: Simulation of the entire
range of daily precipitation using a hybrid probability dis-
tribution. Water Resour. Res., 48, W03521, doi:10.1029/
2011WR011446.
Li, Z., F. Brissette, and J. Chen, 2013: Finding the most appropriate precipitation probability distribution for stochastic weather generation and hydrological modelling in Nordic watersheds. Hydrol. Processes, 27, 3718–3729, doi:10.1002/hyp.9499.
Nicks, A. D., and G. A. Gander, 1994: CLIGEN: A weather generator for climate inputs to water resource and other models. Proc. Fifth Int. Conf. on Computers in Agriculture, St. Joseph, MI, American Society of Agricultural Engineers, 3–94.
Pellerin, G., L. Lefaivre, P. Houtekamer, and C. Girard, 2003: Increasing the horizontal resolution of ensemble forecasts at CMC. Nonlinear Processes Geophys., 10, 463–468.
Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. Mon. Wea. Rev., 133, 1155–1174.
Richardson, C. W., 1981: Stochastic simulation of daily precipitation, temperature, and solar radiation. Water Resour. Res., 17, 182–190.
Richardson, D. S., 2001: Measures of skill and value of ensemble
prediction systems, their interrelationship and the effect of
sample size. Quart. J. Roy. Meteor. Soc., 127, 2473–2489.
Roulston, M. S., and L. A. Smith, 2003: Combining dynamical and
statistical ensembles. Tellus, 55A, 16–30.
Schmeits, M. J., and K. J. Kok, 2010: A comparison between raw
ensemble output, (modified) Bayesian model averaging, and
extended logistic regression using ECMWF ensemble pre-
cipitation reforecasts. Mon. Wea. Rev., 138, 4199–4211.
Semenov, M. A., and E. M. Barrow, 2002: LARS-WG: A stochastic
weather generator for use in climate impact studies. User
manual, 28 pp. [Available online at www.rothamsted.ac.uk/
mas-models/download/LARS-WG-Manual.pdf.]
Sloughter, J. M., A. E. Raftery, T. Gneiting, and C. Fraley, 2007:
Probabilistic quantitative precipitation forecasting using
Bayesian model averaging. Mon. Wea. Rev., 135, 3209–
3220.
Soltanzadeh, I., M. Azadi, and G. A. Vakili, 2011: Using Bayesian
Model Averaging (BMA) to calibrate probabilistic surface
temperature forecasts over Iran. Ann. Geophys., 29, 1295–1303.
Toth, Z., Y. Zhu, and T. Marchok, 2001: The use of ensembles to
identify forecasts with small and large uncertainty. Wea.
Forecasting, 16, 463–477.
Vrugt, J. A., M. P. Clark, C. G. H. Diks, Q. Duan, and B. A.
Robinson, 2006: Multi-objective calibration of forecast en-
sembles using Bayesian model averaging.Geophys. Res. Lett.,
33, L19817, doi:10.1029/2006GL027126.
Wang, X., and C. Bishop, 2005: Improvement of ensemble re-
liability with a new dressing kernel. Quart. J. Roy. Meteor.
Soc., 131, 965–986.
Whitaker, J. S., X. Wei, and F. Vitart, 2006: Improving week-2
forecasts with multimodel reforecast ensembles. Mon. Wea.
Rev., 134, 2279–2284.
Wilks, D. S., 2005: Statistical Methods in the Atmospheric Sciences.
3rd ed. Academic Press, 467 pp.
——, 2006: Comparison of ensemble-MOS methods in the Lorenz '96 setting. Meteor. Appl., 13, 243–256.
——, 2009: Extending logistic regression to provide full-probability-distribution MOS forecasts. Meteor. Appl., 16, 361–368.
——, and T. M. Hamill, 2007: Comparison of ensemble-MOS methods using GFS reforecasts. Mon. Wea. Rev., 135, 2379–2390.
Wilson, L. J., S. Beauregard, A. E. Raftery, and R. Verret, 2007:
Calibrated surface temperature forecasts from the Canadian
ensemble prediction system using Bayesian Model Averaging.
Mon. Wea. Rev., 135, 1364–1385.