Postprocessing of Ensemble Weather Forecasts Using a Stochastic Weather Generator
JIE CHEN AND FRANÇOIS P. BRISSETTE
Department of Construction Engineering, École de technologie supérieure, Université du Québec, Montreal, Quebec, Canada
ZHI LI
College of Natural Resources and Environment, Northwest A & F University, Yangling, Shaanxi, China
(Manuscript received 31 May 2013, in final form 13 November 2013)
ABSTRACT
This study proposes a new statistical method for postprocessing ensemble weather forecasts using a stochastic weather generator. Key parameters of the weather generator were linked to the ensemble forecast means for both precipitation and temperature, allowing the generation of an infinite number of daily time series that are fully coherent with the ensemble weather forecast. This method was verified by postprocessing reforecast datasets derived from the Global Forecast System (GFS) for forecast leads ranging between 1 and 7 days over two Canadian watersheds in the Province of Quebec. The calibration of the ensemble weather forecasts was based on a cross-validation approach that leaves one year out for validation and uses the remaining years for training the model. The proposed method was compared with a simple bias correction method for ensemble precipitation and temperature forecasts using a set of deterministic and probabilistic metrics. The results show underdispersion and biases for the raw GFS ensemble weather forecasts, indicating that they were poorly calibrated. The proposed method significantly increased the predictive power of ensemble weather forecasts for forecast leads ranging between 1 and 7 days, and was consistently better than the bias correction method. The ability to generate discrete, autocorrelated daily time series makes ensemble weather forecasts straightforward to use in forecasting models commonly applied in hydrology and agriculture. This study further indicates that the calibration of ensemble forecasts is reasonable for leads up to one week for precipitation, and possibly for up to an additional week for temperature.
1. Introduction
Ensemble weather forecasts offer great potential
benefits for water resource management, as they provide
useful information for analyzing the uncertainty of pre-
dicted variables (Boucher et al. 2011). The advantages of
ensemble weather forecasts over deterministic forecasts
were observed in several studies, even at locations where
the spatial resolution of ensemble forecasts was much
lower (Bertotti et al. 2011; Boucher et al. 2011). However,
raw ensemble forecasts are generally biased and tend to
be underdispersed (Buizza 1997; Hamill and Colucci 1997;
Eckel and Walters 1998; Toth et al. 2001; Pellerin et al.
2003), thus limiting the predictive power of probability
density functions (PDFs). For example, Hamill and
Colucci (1997) verified the Eta and regional spectral
model (Eta-RSM) for predicting short-range (24 h)
850-mb temperature, 500-mb geopotential height, and
24-h total precipitation amounts using rank histograms.
The nonuniform rank histograms indicated that the assumption of identical errors for each member did not hold. Buizza et al. (2005) compared three ensemble
prediction systems [the European Centre for Medium-
Range Weather Forecasts (ECMWF), the Meteorologi-
cal Service of Canada (MSC), and the National Centers
for Environmental Prediction (NCEP)] in forecasting the
500-hPa geopotential height over the Northern Hemi-
sphere and found that none of them was able to capture
all sources of forecast uncertainty. In addition, both
spread-error correlations and underdispersion were de-
tected. Therefore, some form of postprocessing is re-
quired before ensemble forecasts can be incorporated into
the decision-making process so that the predictive distributions are reliable and properly reflect the real-world uncertainty (Hamill and Colucci 1998; Richardson 2001; Boucher et al. 2011; Cui et al. 2012).

Corresponding author address: Jie Chen, Department of Construction Engineering, École de technologie supérieure, Université du Québec, 1100 rue Notre-Dame Ouest, Montreal QC H3C 1K3, Canada. E-mail: [email protected]

MONTHLY WEATHER REVIEW, VOLUME 142, 1106. DOI: 10.1175/MWR-D-13-00180.1. © 2014 American Meteorological Society.
During the last two decades, a number of post-
processing methods have been proposed and imple-
mented to address the bias and underdispersion of
ensemble weather forecasts. These include rank histo-
gram techniques (Hamill and Colucci 1998; Eckel and
Walters 1998; Wilks 2006), ensemble dressing (Roulston
and Smith 2003; Wang and Bishop 2005; Wilks and
Hamill 2007; Brocker and Smith 2008), Bayesian model
averaging (BMA; Raftery et al. 2005; Vrugt et al. 2006;
Wilson et al. 2007; Sloughter et al. 2007; Soltanzadeh et al.
2011), logistic regression (Hamill et al. 2006; Wilks and
Hamill 2007; Hamill et al. 2008), analog techniques
(Hamill et al. 2006; Hamill and Whitaker 2007), and
nonhomogeneous Gaussian regression (NGR; Gneiting
et al. 2005; Wilks and Hamill 2007; Hagedorn et al. 2008).
Among these methods, the logistic regression method
was most often used to calibrate both precipitation and
temperature, and BMA and NGR were usually used to
calibrate the temperature (Raftery et al. 2005; Hagedorn
et al. 2008; Hamill et al. 2008). More recently, studies
have also extended the BMA for the postprocessing of
precipitation (Sloughter et al. 2007; Schmeits and Kok
2010).
Hamill et al. (2004) used a logistic regression method
to improve the medium-range precipitation and tem-
perature forecast skill using retrospective forecasts. The
ensemble mean and ensemble mean anomaly were used
as predictors for precipitation and temperature, respec-
tively. The results showed that the logistic regression-
based probability forecasts (using retrospective forecasts)
were much more skillful and reliable than the operational
NCEP forecast. Raftery et al. (2005) proposed using the
BMA method to calibrate the ensemble forecasts of
temperature and found that the calibrated predictive
PDFs were much better than those of the raw forecast.
Wilks (2006) compared eight ensemble model out-
put statistics (MOS) methods for the statistical post-
processing of ensemble forecast using the idealized
Lorenz’96 setting. The eight methods were classified
into four categories: 1) early, ad hoc approaches (di-
rect model output, rank-histogram recalibration, and
multiple implementations of single-integration MOS
equations), 2) the ensemble dressing approach, 3) re-
gression methods (logistic regression and NGR), and
4) Bayesian methods (forecast assimilation and BMA).
This is probably the most thorough study to date in terms
of including the greatest number of MOS methods for the
postprocessing of ensemble forecasts. The three best
performing methods were found to be logistic regression,
NGR, and ensemble dressing. Wilks and Hamill (2007)
further compared these three methods for postprocessing
daily temperature, and medium-range (6–10 and 8–
14 days) temperature and precipitation forecasts. The
results showed there was not a single best method for all
of the applications of daily and medium-range forecasts.
For example, the logistic regression method yielded the
best Brier score (BS) for central forecast quantiles,
while the NGR forecasts displayed slightly greater ac-
curacy for probability forecasts of the more extreme
daily precipitation quantiles. Hagedorn et al. (2008) and
Hamill et al. (2008) did a parallel study that used NGR
and logistic regression for postprocessing temperature
and precipitation, respectively, using the ECMWF and
Global Forecast System (GFS) ensemble reforecasts.
The skill and reliability of ECMWF and GFS ensemble
temperature and precipitation forecasts were largely
improved when using the NGR and logistic regression
methods, respectively. These studies also emphasized
the benefits of using ensemble retrospective forecasts
(reforecasts).
Other studies such as Wilson et al. (2007) and
Soltanzadeh et al. (2011) showed that BMA is also able
to improve the skill and reliability of ensemble forecasts.
However, most studies were only focused on the cali-
bration of temperature rather than precipitation using
BMA, because the original BMA developed by Raftery
et al. (2005) was only suitable for variables whose pre-
dictive PDFs are approximately normal. To use it for the
calibration of precipitation, Sloughter et al. (2007) ex-
tended BMA by modeling the predictive PDFs corre-
sponding to an ensemble member as a mixture of a discrete
event at zero and a gamma distribution. The extended
BMA yielded calibrated and sharp predictive distributions
for 48-h precipitation forecasts. It even outperformed the
logistic regression at estimating the probability of high
precipitation events, because it gives a full predictive PDF
rather than separate forecast probability equations for
different predictand thresholds. Similarly, Wilks (2009)
also extended the logistic regression to provide full PDF
forecasts. The main advantage of the extended logistic
regression is that the forecasted probabilities are mutually
consistent; thus, the cumulative probability for a small
predictand threshold cannot be larger than the probability
for a larger threshold (Wilks 2009). Based on the above-
mentioned studies, Schmeits and Kok (2010) compared
the raw ensemble output, modified BMA, and extended
logistic regression for postprocessing ECMWF ensemble
precipitation reforecasts. The results showed that, even
though the raw ensemble precipitation forecasts were
relatively well calibrated, their skill could be significantly
improved by the modified BMA and extended logistic
regression methods. However, the difference in skill between the modified BMA and extended logistic regression was not significant.
MARCH 2014 CHEN ET AL. 1107
Even though a number of methods have been pro-
posed for postprocessing the ensemble weather fore-
casts, most of them are aimed at finding the underlying
probabilistic distribution of forecasted variables. How-
ever, for some practical applications, such as ensemble
streamflow predictions, several sets of discrete, auto-
correlated time series over several days are needed for
driving the impact models (e.g., hydrological models).
Yet there is no simple way to go from the underlying distribution to the generation of a discrete, autocorrelated time series that is fully consistent with
the underlying distribution. This study presents a new
method for postprocessing ensemble weather forecasts
using a stochastic weather generator. The ensemble mean
precipitation and temperature anomalies are used as
predictors for the calibration of precipitation and tem-
perature, respectively. A great number of ensemble
members can be produced using the stochastic weather
generator with a gamma distribution for generating pre-
cipitation amounts and a normal distribution for gener-
ating temperature. A simple bias correction (BC) method
is used as a benchmark to demonstrate the performance
of the proposed method [i.e., the generator-based post-
processing (GPP) method]. The GPP ensemble forecasts
were compared with BC and raw GFS ensemble forecasts
over two Canadian watersheds in Quebec, Canada, using
a set of deterministic and probabilistic metrics. The ulti-
mate goal of this study is to provide reliable ensemble
weather forecasts for ensemble streamflow forecasts.
Therefore, watershed-averaged precipitation and tem-
perature are used instead of traditional station meteo-
rological data.
2. Study area and dataset
a. Study area
The ultimate goal of this project is to provide and
evaluate ensemble streamflow forecasting. It is with this
goal in mind that we chose to focus on watershed-averaged meteorological data instead of station data.
Accordingly, this study is conducted over two Canadian
catchments located in the Province of Quebec (Fig. 1).
Two different catchments (Peribonka and Yamaska) were
selected to evaluate the impact of basin characteristics
on ensemble weather forecasts. Both the Peribonka and
Yamaska catchments are composed of several tributaries
draining basins of approximately 27 000 and 4843 km²
in southeastern and southern Quebec, respectively. The
southern parts of the Peribonka and Yamaska catch-
ments, known as the Chute-du-Diable (CDD) and the
Yamaska (YAM) watersheds, respectively, are used in
this study. The two watersheds differ in size (9700 vs 3330 km²) and location (the CDD watershed is located in central Quebec and the smaller YAM watershed in southern Quebec). Additional details on both
watersheds are presented below.
1) CHUTE-DU-DIABLE (CDD)
The CDD watershed (48.5°–50.2°N, 70.5°–71.5°W) is located in central Quebec. With a mostly forested surface area of 9700 km², it is a subbasin of the Peribonka River watershed. The basin is part of the northern Quebec
subarctic region, characterized by wide daily and annual
temperature ranges, heavy wintertime snowfall, and pro-
nounced rainfall and/or snowmelt peaks in the spring
(April–June; Coulibaly 2003). The average annual rainfall in the area is 962 mm, of which about 36% is snowfall. The average annual maximum and minimum temperatures (Tmax and Tmin) between 1979 and 2003 were 5.49° and −5.85°C, respectively. The CDD watershed contains a
large hydropower reservoir managed by Rio Tinto
Alcan for hydroelectric power generation. River flows
are regulated by two upstream reservoirs. Snow plays
a crucial role in the watershed management, with 35%
of the total yearly discharge occurring during the spring
flood. The mean annual discharge of the CDD watershed is 211 m³ s⁻¹, with a daily maximum registered flood of 1666 m³ s⁻¹. Snowmelt peak discharge usually occurs in May and averages about 1200 m³ s⁻¹.

FIG. 1. Location map of the two catchments.
2) YAMASKA (YAM)
The YAM watershed (45.1°–46.1°N, 72.2°–73.1°W) is composed of a number of tributaries draining a basin of approximately 4843 km² in southern Quebec; the southern part of the YAM basin, with an area of 3330 km², is used in this study. The average annual rainfall in the area is 1175 mm, of which about 23% is snowfall. The average annual Tmin and Tmax are above the freezing mark at 0.56° and 10.83°C, respectively, between 1979 and 2003. The mean annual discharge of the YAM River is 61 m³ s⁻¹, with a daily maximum registered flood of 881 m³ s⁻¹. Snowmelt peak discharge usually occurs in April and averages about 495 m³ s⁻¹.
b. Dataset
The dataset consists of observed and ensemble-
forecasted daily total precipitation and mean tempera-
ture. The observed daily precipitation and temperature
over the two watersheds were taken from the National Land and Water Information Service (www.agr.gc.ca/nlwis-snite)
dataset covering the period of 1979–2003. This dataset
was created by interpolating station data to a 10-km grid
using a thin-plate smoothing spline surface-fitting method (Hutchinson et al. 2009). All grid points within a watershed were averaged to represent the observed time series.
Ensemble forecasts (daily total precipitation and mean
temperature) on a 2.5° global grid were taken from the GFS reforecast dataset (http://www.esrl.noaa.gov/
psd/forecasts/reforecast/; Hamill et al. 2006). Several
previous studies (e.g., Hamill et al. 2004, 2006; Hamill and
Whitaker 2006, 2007; Whitaker et al. 2006; Wilks and
Hamill 2007) have presented the benefit of calibrating
probabilistic forecasts using ensemble reforecast data-
sets. Forecasts were made with the GFS for each day since 1979, each consisting of a 15-member ensemble run out to 15 days. Since
little skill is retained for precipitation after 1 week, only
1–7 lead days are used in this study over the 1979–2003
time frame. Two grid boxes were selected and averaged
for the CDD watershed and only one grid box was se-
lected for the YAM watershed.
3. Methodology
a. Stochastic weather generator
A stochastic weather generator is a computer model
that can produce climate time series of arbitrary length
and with statistical properties similar to those of the ob-
served data (Richardson 1981; Nicks and Gander 1994;
Semenov and Barrow 2002; Chen et al. 2010, 2012). The
generation of precipitation and temperature usually constitutes the two main components of a weather generator. Precipitation is most often generated using a two-component
model: one for the precipitation occurrence and the other
for the wet-day precipitation amount.
The precipitation occurrence is usually generated using
a Markov chain of various orders based on transition
probabilities. Alternatively, the precipitation occurrence
can also be generated based on an unconditional pre-
cipitation probability if the precipitation model only
considers the wet- and dry-day probabilities rather than
the wet- and dry-spell structures. In this sense, if a random
number drawn from a uniform distribution for one day is
less than the unconditional precipitation probability, a wet
day is predicted. Since the weather generator is used in
this study to synthesize the wet and dry states of ensemble
members for a given day rather than to generate the
continuous time series of precipitation occurrence, only
the second method was used. For a predicted wet day,
stochastic weather generators usually produce the pre-
cipitation amount by using a parametric probability dis-
tribution (e.g., exponential and gamma distributions). The
two-parameter gamma distribution is the most widely
used method to simulate wet-day precipitation. Tem-
perature is usually generated using a two-parameter
(mean and standard deviation) normal distribution. In
this study, the gamma and normal distributions are used
to generate the ensemble members of precipitation and
temperature, respectively, for a given day. Similarly to
stochastic weather generators such as the Weather Generator (WGEN; Richardson 1981) and the Weather Generator of the École de technologie supérieure (WeaGETS; Chen et al. 2012), the auto- and cross correlation of and between Tmax and Tmin are preserved using a first-order linear autoregressive model. The detailed methodology is
presented below.
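The occurrence-plus-amount scheme described above can be sketched in a few lines (a hypothetical minimal example, not WGEN or WeaGETS code; the wet-day probability and gamma parameters are placeholder values):

```python
import numpy as np

rng = np.random.default_rng(42)

def generate_precip(p_wet, gamma_shape, gamma_scale, n_members=1000):
    """Generate one day's precipitation ensemble.

    A member is wet when a uniform draw falls below the unconditional
    wet-day probability; wet-day amounts follow a two-parameter gamma law.
    """
    wet = rng.uniform(size=n_members) < p_wet
    amounts = np.where(wet,
                       rng.gamma(gamma_shape, gamma_scale, size=n_members),
                       0.0)
    return amounts

# Placeholder parameters: 40% wet-day probability, gamma(0.8, 6.0) amounts
members = generate_precip(0.4, 0.8, 6.0)
```

Roughly 40% of the 1000 members come out wet, and the wet members carry gamma-distributed amounts; the GPP method below conditions both parameters on the ensemble forecast instead of using fixed values.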
b. Generator-based postprocessing (GPP) method
The GFS ensemble forecasts are postprocessed using
the GPP method. The observed daily precipitation and
temperature are used as predictands, and the forecasted
ensemble mean precipitation and temperature anoma-
lies are used as predictors, respectively. The evaluation
of the GPP method is based on a cross-validation ap-
proach (Wilks 2005) to ensure the independence of the
training and evaluation data. Given 25 years of available
forecasts, when making forecasts for a particular year,
the remaining 24 years were used as training data.
1) POSTPROCESSING FOR PRECIPITATION
The calibration of precipitation is based on four sea-
sons: winter [January–March (JFM)], spring [April–June
(AMJ)], summer [July–September (JAS)], and autumn
[October–December (OND)]. The methodology for the
precipitation calibration is based on the hypothesis that
a relationship must exist between the mean of the en-
semble forecast and both the probability of precipitation
occurrence and wet-day precipitation amounts. The larger
the mean of the ensemble forecast, the more likely that
rainfall will occur, and the more likely that a large pre-
cipitation amount will be registered. For each season and
lead day, the ensemble precipitation is calibrated with the
following three steps.
1) The ensemble mean precipitation is first calculated
using the 15-member ensemble precipitation fore-
casts. The calculated ensemble mean precipitation
for each lead day in the given season is then classified
into several classes based on wet-day precipitation
amounts. The number of classes depends on the size of the training sample. A maximum of 10 classes with wet-day precipitation amounts between 0–1, 1–2, 2–3, 3–4, 4–5, 5–7, 7–10, 10–15, 15–25, and ≥25 mm are used in this study. If the training sample in the largest class contains fewer than 30 precipitation events, the last two classes are combined, and so on. The
probabilities of the observed precipitation occurrence
and observed mean wet-day precipitation amount
corresponding to each class of forecasted precipitation
are then calculated. The observed wet-day precipita-
tion events in each class are fitted using a gamma
distribution. For example, for the first class, all of
the observed wet-day precipitation that correspond to
ensemble mean precipitation ranging between 0 and
1mm are pooled and fitted using a gamma distribution.
2) The second step involves establishing relationships
between the forecasted precipitation classes and the
probabilities of observed precipitation occurrence
and observed mean wet-day precipitation amounts.
Figure 2 presents the probabilities of the observed
precipitation occurrence and mean wet-day precipi-
tation amounts as functions of the forecasted precip-
itation classes for summer precipitation at 1 and
3 lead days over the two selected watersheds (solid
lines in Fig. 2). The results clearly show the relationship between the mean of the ensemble forecast
and the observed probability of precipitation occurrence
(left-hand side), and between that same mean and the observed mean precipitation amount (right-hand side).

FIG. 2. The relationships between forecasted summer (JAS) precipitation classes and the probability of observed summer precipitation occurrence and mean wet-day precipitation amounts for 1 and 3 lead days over the (a),(b) CDD and (c),(d) YAM watersheds.

For a large ensemble mean, the observed
precipitation occurrence is nearly 100% for the
larger basin. For a 7-day lead time (not shown), both
relationships are close to a horizontal line, indicating
that the ensemble precipitation forecast has little
relevance for that lead time. The variability observed
in the graphs is due to sample sizes that are too small. Accordingly, the lines were smoothed using a
second-order polynomial (dashed lines in Fig. 2).
3) In the third step, the relationships (smoothed func-
tions) between the probability of the observed precip-
itation occurrence and the forecasted precipitation
class are directly used to determine the probability
of precipitation occurrence for a given day. For any
given day in the evaluation period, a forecasted pre-
cipitation class is first determined according to the
ensemble mean precipitation for that day. For exam-
ple, if the ensemble mean precipitation is 0.5 mm for a
given day, it is classified into the first class (between
0 and 1mm). The corresponding probability of ob-
served precipitation occurrence (e.g., 40% for the
YAM basin) is then used as the precipitation proba-
bility for this day. Then 1000 random numbers drawn
from a uniform distribution are generated to rep-
resent 1000 members for this day. If the random
numbers are less than or equal to the corresponding
probability of observed precipitation occurrence (e.g.,
40%), the corresponding members are predicted to be wet; otherwise, they are predicted to be dry. Finally, if
amember is deemedwet, the fitted gamma function in
the corresponding class is used to generate the pre-
cipitation amounts with uniform random numbers.
Overall, 1000 members are generated for any given
day. A large number of members is used to obtain the most representative results from the weather generator. A
small number of samples could result in biases due to
the random nature of the stochastic process. The
proposed postprocessing approach does not directly
take into account the autocorrelation of precipitation
occurrence. During the period covered by the ensem-
bleweather forecast, the probability of precipitation is
directly given by the forecast for each lead day, and
thus preserves the coherence of the ensemble forecast.
As such the autocorrelation of precipitation occur-
rence is directly governed by the forecast. If the
forecast is wet for several days, all 1000 members will
carry this information stochastically and all sequences
will be dominated by wet days. As long as the forecasts
have skill, using the probability of precipitation oc-
currence given by the forecasts is highly preferable
to using the mean probabilities used to generate the
occurrence series in a pure stochastic mode. Similarly
to most stochastic weather generators, the proposed
method does not account for the possible autocorre-
lation of precipitation amounts.
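Putting the three steps together, a minimal sketch of the precipitation postprocessing might look as follows (illustrative only: the helper names are hypothetical, and the per-class occurrence probabilities and gamma parameters shown are placeholders, whereas in the actual method they are estimated from the 24-yr training sample and smoothed with a second-order polynomial):

```python
import numpy as np

rng = np.random.default_rng(0)

# Class edges from the paper: 0-1, 1-2, ..., 15-25, >=25 mm
CLASS_EDGES = np.array([0, 1, 2, 3, 4, 5, 7, 10, 15, 25, np.inf])

def classify(ens_mean_precip):
    """Map an ensemble-mean precipitation (mm) to its class index."""
    return int(np.searchsorted(CLASS_EDGES, ens_mean_precip, side="right") - 1)

def gpp_precip_members(ens_mean_precip, p_occ, gamma_params, n_members=1000):
    """Generate calibrated members for one day and lead time.

    p_occ[k]        : smoothed observed wet-day probability for class k
    gamma_params[k] : (shape, scale) of the gamma fit to observed
                      wet-day amounts pooled in class k
    """
    k = classify(ens_mean_precip)
    wet = rng.uniform(size=n_members) <= p_occ[k]
    shape, scale = gamma_params[k]
    amounts = np.where(wet, rng.gamma(shape, scale, size=n_members), 0.0)
    return amounts

# Placeholder per-class statistics for illustration (10 classes)
p_occ = np.linspace(0.2, 0.95, 10)
gamma_params = [(0.7 + 0.1 * k, 2.0 + k) for k in range(10)]

members = gpp_precip_members(0.5, p_occ, gamma_params)  # falls in class 0
```

Because the occurrence probability and gamma parameters both come from the forecast class, wetter ensemble means yield both more wet members and larger amounts, as the calibration hypothesis requires.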
2) POSTPROCESSING FOR TEMPERATURE
The postprocessing for temperature is performed on
a daily basis. The calibration of ensemble temperature
forecasts includes two stages. The first stage consists of
the BC of the ensemble mean temperature using a linear
regression method. The second stage adds the ensemble
spread using a weather generator–based method. For
each evaluation year and lead day, the ensemble tem-
perature forecast BC follows three specific steps:
1) Similarly to precipitation, the ensemble mean temperature (24 yr × 365 days) is first calculated using the 15-member ensemble temperature forecasts (24 yr × 365 days × 15 members). The mean observed daily temperature (1 yr × 365 days) is also calculated using the 24-yr daily time series (24 yr × 365 days). The temperature anomalies (24 yr × 365 days) of both observed and forecasted data are then determined by subtracting the mean observed daily temperature (1 yr × 365 days) from the observed temperature (24 yr × 365 days) and from the ensemble mean temperature (24 yr × 365 days), respectively.
2) Linear equations are fitted between observed and
forecasted temperature anomalies using a 31-day win-
dow centered on the day of interest. For example, when
fitting the linear equation for 16 January, the tempera-
ture anomalies from 1 January to 31 January over 24 yr
are pooled. The use of a 31-day window ensures there
will be enough data points to fit a reliable equation. This
process is conducted for each day to obtain 365 equa-
tions, which can be used to correct the bias of ensemble
mean temperature anomaly for an entire year.
3) The fitted linear equations in step 2 are used to correct
the daily ensemble mean temperature anomaly for
each validation year. Finally, the bias-corrected en-
semble mean temperature is obtained by adding the
mean observed temperature to the bias-corrected temperature anomalies.
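The three bias correction steps can be sketched as follows (an illustrative implementation assuming `obs` and `ens_mean` are (n_years × 365) training arrays; the function names are hypothetical, and the window is wrapped around the year boundary for simplicity):

```python
import numpy as np

def fit_anomaly_correction(obs, ens_mean, window=31):
    """Fit one linear equation y = a*x + b per calendar day.

    obs, ens_mean : (n_years, 365) observed and ensemble-mean temperature.
    Returns the daily climatology and a (365, 2) array of [a, b] coefficients.
    """
    clim = obs.mean(axis=0)            # mean observed daily temperature
    obs_anom = obs - clim              # observed anomalies
    fc_anom = ens_mean - clim          # forecast anomalies
    half = window // 2
    coeffs = np.empty((365, 2))
    for d in range(365):
        # pool a 31-day window centered on day d (wrapping at year ends)
        days = np.arange(d - half, d + half + 1) % 365
        x = fc_anom[:, days].ravel()
        y = obs_anom[:, days].ravel()
        coeffs[d] = np.polyfit(x, y, 1)  # slope a, intercept b
    return clim, coeffs

def correct(ens_mean_day, day, clim, coeffs):
    """Bias correct one day's ensemble-mean temperature."""
    a, b = coeffs[day]
    return clim[day] + a * (ens_mean_day - clim[day]) + b
```

The correction is applied to the anomaly and the climatology is added back at the end, mirroring step 3 above.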
A scatterplot of the ensemble mean temperature be-
fore and after BC is plotted against the corresponding
observed temperature for the 1 lead day over the two
selected watersheds (Fig. 3). In this case, all 25 years of
raw and corrected mean forecasts are pooled together
rather than separated by 31-day windows. Only a slight
bias is observed for the raw GFS ensemble mean temper-
ature for both watersheds, as displayed in Figs. 3a and 3c, where the linear regression line slightly deviates from
the 1:1 line. However, this bias is removed by using the
linear regression method, as shown in Figs. 3b and 3d
where the linear regression and 1:1 lines overlap each
other.
After the BC of the ensemble mean temperature, the
ensemble spread is added using a stochastic weather
generator–based method. The ensemble temperature of
any given day is assumed to follow a two-parameter
(mean and standard deviation) normal distribution. The
bias-corrected ensemble mean temperature is used as the
mean of the normal distribution. The standard deviation
for each season (the same standard deviation is used for
each day in a specific season) is obtained using an opti-
mization algorithm to minimize the root-mean-square
error (RMSE) of rank histogram bins. Specifically, the
optimization algorithm involves four steps. 1) A number
of standard deviation values (ensemble spreads) are
preset for each season. For example, they are set between
0.5° and 5°C with an interval of 0.05°C in this study. 2) The
ensemble temperature for every day in this season is
calculated by multiplying each standard deviation by
a normally distributed random number and adding the
bias-corrected ensemble mean temperature. This step is
repeated for all preset standard deviation values to obtain
a number of temperature ensembles. 3) Rank histograms
are constructed for all temperature ensembles. 4) The
RMSEs of rank histogram bins are calculated for all
histograms. The standard deviation corresponding to the
lowest RMSE is selected as the optimized one for this
season. These four steps are repeated for all four seasons
to obtain four optimized standard deviations for postprocessing the entire year. To ensure there are enough samples to construct reliable rank histograms, the standard deviation is optimized at the seasonal scale.
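The four optimization steps can be sketched as a grid search (illustrative; `bc_mean` and `obs` are assumed to be 1-D arrays over all days of one season, and the grid matches the 0.5°–5°C range with a 0.05°C interval mentioned above):

```python
import numpy as np

rng = np.random.default_rng(3)

def rank_histogram(obs, ensemble):
    """Count the rank of each observation within its ensemble."""
    n_days, n_members = ensemble.shape
    ranks = (ensemble < obs[:, None]).sum(axis=1)   # 0 .. n_members
    return np.bincount(ranks, minlength=n_members + 1)

def optimize_spread(bc_mean, obs, n_members=15):
    """Grid search the seasonal std that flattens the rank histogram."""
    best_sd, best_rmse = None, np.inf
    uniform = len(obs) / (n_members + 1)            # ideal flat bin count
    for sd in np.arange(0.5, 5.0 + 1e-9, 0.05):
        # step 2: build a synthetic ensemble around the bias-corrected mean
        ens = bc_mean[:, None] + sd * rng.standard_normal((len(obs), n_members))
        # steps 3-4: rank histogram and RMSE of its bins against flatness
        bins = rank_histogram(obs, ens)
        rmse = np.sqrt(np.mean((bins - uniform) ** 2))
        if rmse < best_rmse:
            best_sd, best_rmse = sd, rmse
    return best_sd
```

When the preset spread matches the true forecast-error spread, the rank histogram is flattest and the bin RMSE is minimized.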
For any given day, the postprocessed ensemble tem-
perature is found by multiplying the optimized standard
deviation in the specific season by normally distributed random numbers (1000 in this study to represent 1000 members) and adding the bias-corrected ensemble
mean temperature for that day. However, the ensemble
temperature generated this way lacks an autocorrelation
structure. For hydrological studies, autocorrelated time
series of Tmax and Tmin are usually needed to run hy-
drological models. Applying a similar technique used in
weather generators, the observed auto- and cross corre-
lation for and between Tmax and Tmin can be preserved
using a first-order linear autoregressive model. With this
model, the Tmax and Tmin ensembles over several lead
days are generated at the same time, rather than generated day after day and variable after variable.

FIG. 3. The relationships between observed temperature and (left) GFS and (right) GPP ensemble mean temperature for 1 lead day over the (a),(b) CDD and (c),(d) YAM watersheds [linear regression (LR)].
The ensemble mean Tmax and Tmin are first obtained
by adding the mean observed Tmax and Tmin to the
bias-corrected temperature anomalies (obtained from
step 2), respectively, for all lead days. The residual series
of Tmax and Tmin with desired auto- and cross correla-
tion are then generated using a first-order linear autore-
gressive model:
x_i(j) = A x_(i-1)(j) + B e_i(j) ,   (1)

where x_i(j) is a (2 × 1) vector for lead day i whose elements are the residuals of the generated Tmax (j = 1) and Tmin (j = 2); e_i(j) is a (2 × 1) vector of independent random components that are normally distributed with a mean of zero and a variance of unity. Here A and B are (2 × 2) matrices whose elements are defined such that the new sequences have the desired auto- and cross-correlation coefficients. The A and B matrices are determined by

A = M1 M0^(-1) ,   (2)

B B^T = M0 - M1 M0^(-1) M1^T ,   (3)

where the superscripts -1 and T denote the inverse and transpose of the matrix, respectively, and M0 and M1 are the lag-0 and lag-1 covariance matrices, calculated using the observed time series for each season. With
Eq. (1), a number of the residual series over all lead days
are generated to represent the ensemble members. Fi-
nally, the ensemble Tmax and Tmin over several days can
be found by multiplying the optimized standard de-
viation in the specific season by the generated residual
series and adding the bias-corrected ensemble mean
Tmax and Tmin.
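Equations (1)-(3) can be implemented directly (a sketch with placeholder covariance matrices; any matrix square root of B B^T, such as the Cholesky factor used here, is a valid choice for B):

```python
import numpy as np

def ar1_matrices(M0, M1):
    """Solve Eqs. (2)-(3): A = M1 M0^-1 and B B^T = M0 - M1 M0^-1 M1^T."""
    A = M1 @ np.linalg.inv(M0)
    BBt = M0 - A @ M1.T
    B = np.linalg.cholesky(BBt)        # any B with B B^T = BBt works
    return A, B

def generate_residuals(A, B, n_lead, rng):
    """Generate one member's (Tmax, Tmin) residual series over n_lead days."""
    x = np.zeros(2)
    out = np.empty((n_lead, 2))
    for i in range(n_lead):
        x = A @ x + B @ rng.standard_normal(2)   # Eq. (1)
        out[i] = x
    return out

# Example lag-0 and lag-1 covariance matrices (placeholder values)
M0 = np.array([[1.0, 0.7], [0.7, 1.0]])
M1 = np.array([[0.6, 0.5], [0.5, 0.6]])
A, B = ar1_matrices(M0, M1)
```

By construction the stationary covariance of the generated residuals equals M0 (since A M0 A^T + B B^T = M0), so the desired auto- and cross correlations are preserved.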
The first-order linear autoregressive model has been tested extensively in several studies (e.g., Richardson 1981; Chen et al. 2011, 2012) and has shown good performance in preserving the desired auto- and cross correlation of and between Tmax and Tmin. Since the main goal of this paper is to present the postprocessing method, only results involving mean temperature are presented.
c. Bias correction (BC) method
A simple BC method is used as a benchmark to dem-
onstrate the advantages of the proposed GPP method.
The BC step for temperature is similar to that of the GPP method. Linear equations of the form y = ax + b (where a and b are two estimated coefficients) are fitted between observed and forecasted temperature anomalies using a 31-day window centered on the day of interest. The fitted linear equations are then used to correct the daily ensemble temperature anomaly for all 15 members. This step assumes that all ensemble members have the same bias. The variance optimization stage of the GPP method was not applied to the BC method. As such, it can be expected to outline the advantages of the GPP method over the simpler direct bias correction method.
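A minimal sketch of this linear bias correction on synthetic anomalies; the systematic-error coefficients (true_a, true_b) are invented for illustration:

```python
import numpy as np

# synthetic forecast/observed temperature anomalies with a known
# systematic error (true_a, true_b are invented for illustration)
rng = np.random.default_rng(1)
true_a, true_b = 0.8, -1.5
fcst = rng.normal(0.0, 3.0, 31 * 25)  # 31-day window pooled over 25 years
obs = true_a * fcst + true_b + rng.normal(0.0, 0.5, fcst.size)

# fit y = a*x + b between observed and forecast anomalies (least squares)
a, b = np.polyfit(fcst, obs, deg=1)

# the same equation then corrects the anomaly of every ensemble member
corrected = a * fcst + b
```

The fitted (a, b) recover the imposed systematic error, and applying one equation to all members reflects the assumption of a shared bias.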
A bias correction procedure is also applied to the
ensemble precipitation forecast. Linear equations of the form y = ax (where a is the estimated coefficient) are fitted between the observed and forecasted mean precipitation using a 31-day window centered on the day of interest. This differs from the temperature correction in that the linear equation for precipitation is fitted using mean precipitation values rather than daily values, which results in a more reliable estimate of the linear dependence between observed and forecasted values. Moreover, since the distribution of daily precipitation is highly skewed, a fourth-root transformation was applied to precipitation values prior to fitting the linear equations. As for temperature, for a given day, the same linear equation is used for all ensemble members.
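A sketch of the precipitation correction under these assumptions (no-intercept fit in fourth-root space); the multiplicative wet bias and gamma parameters below are invented:

```python
import numpy as np

def fit_precip_correction(fcst_mean, obs_mean):
    """Fit y = a*x (no intercept) between fourth-root-transformed
    observed and forecast mean precipitation."""
    x = fcst_mean ** 0.25
    y = obs_mean ** 0.25
    return np.sum(x * y) / np.sum(x * x)  # least-squares slope through origin

def apply_precip_correction(a, precip):
    """Apply the fitted slope in fourth-root space, then back-transform."""
    return (a * precip ** 0.25) ** 4

# synthetic mean precipitation with an invented multiplicative wet bias
rng = np.random.default_rng(2)
obs_mean = rng.gamma(shape=0.8, scale=5.0, size=365)
fcst_mean = 1.3 * obs_mean
a = fit_precip_correction(fcst_mean, obs_mean)
corrected = apply_precip_correction(a, fcst_mean)
```

For a purely multiplicative bias the slope removes it exactly; real forecast errors would of course leave residual scatter.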
d. Verification of the postprocessing method
Rank histograms permit a quick examination of the
quality of ensemble weather forecasts. Consistent biases
in an ensemble weather forecast result in a sloped rank
histogram, and a lack of variability (underdispersion) is
revealed as a U-shaped, concave, population of the ranks
(Hamill 2001). Thus, the rank histogram is first used to
evaluate ensemble precipitation and temperature fore-
casts. However, a uniform rank histogram is a necessary but not a sufficient criterion for determining the reliability of an ensemble forecast system (Hamill 2001). In addition, some characteristics, such as resolution, are not evaluated by rank histograms. Other verification metrics are thus necessary for testing the predictive power of an ensemble weather forecast. In this study, the
GFS, BC, and GPP ensemble precipitation and temper-
ature forecasts are verified using the Ensemble Verifica-
tion System (EVS) developed by Brown et al. (2010). The
selected verification metrics include two deterministic
metrics for verifying the ensemble mean, and two prob-
abilistic metrics for verifying the distribution. The con-
tinuous ranked probability skill score (CRPSS) and the
Brier skill score (BSS) are also used to verify the skill of
the ensemble forecasts relative to the climatology.
The two deterministic metrics are the mean absolute
error (MAE) and the RMSE. The MAE measures the mean absolute difference between the ensemble mean forecast and the corresponding observation, and the RMSE measures the average square error of the ensemble mean
MARCH 2014 CHEN ET AL. 1113
forecast. The two probabilistic metrics include the BS
and the reliability diagram. The BS measures the average
square error of a probability forecast. It is analogous to
the mean square error of a deterministic forecast. It can
be decomposed into three components: reliability, reso-
lution, and uncertainty. A reliability diagram measures
the accuracy with which a discrete event is forecast by an
ensemble forecast system. The BS and reliability diagram
only verify discrete events in the continuous forecast
distributions. Thus, one or more thresholds have to be
defined to represent cutoff values from which discrete
events are computed. Six thresholds corresponding to the
probability of precipitation and temperature exceeding
10% (lower decile), 33% (lower tercile), 50% (median),
67% (upper tercile), 90% (upper decile), and 95% (95th
percentile) are used in this study. Details of these metrics
can be found in Brown et al. (2010), Demargne et al.
(2010), and in the user manual of the EVS (Brown 2012)
(http://amazon.nws.noaa.gov/ohd/evs/evs.html).
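The rank-histogram diagnostic described above is simple to compute: for each forecast case, count the ensemble members falling below the observation and tally the resulting ranks. A sketch on synthetic data (distributions and sizes are invented) illustrates both a calibrated and an underdispersed ensemble:

```python
import numpy as np

def rank_histogram(obs, ens):
    """Counts of observation ranks within the ensemble (Hamill 2001).
    obs: (n,) observations; ens: (n, m) ensemble members.
    Returns counts over the m + 1 possible ranks."""
    n, m = ens.shape
    ranks = np.sum(ens < obs[:, None], axis=1)  # members strictly below obs
    return np.bincount(ranks, minlength=m + 1)

rng = np.random.default_rng(3)
n, m = 5000, 15
obs = rng.normal(size=n)
# calibrated ensemble: members drawn from the observation distribution
counts_flat = rank_histogram(obs, rng.normal(size=(n, m)))
# underdispersed ensemble: too little spread -> U-shaped histogram
counts_under = rank_histogram(obs, rng.normal(scale=0.5, size=(n, m)))
```

The calibrated case populates all 16 ranks roughly uniformly, while the underdispersed case piles up in the extreme ranks, the U shape seen for the raw GFS forecasts.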
4. Results
Figure 4 presents the rank histograms of GFS and GPP ensemble precipitation forecasts for 1 lead day over the two watersheds. Only wet-day precipitation is used to produce the rank histograms. To allow for a proper comparison with the raw ensemble forecasts, only 15 members are generated using the GPP method in this case. The
results show that the distributions of the raw GFS en-
semble forecasts are highly nonuniform; there is a marked tendency for the distribution to be most populated at the lowest and highest ranks, forming U-shaped rank histograms (Figs. 4a,c). This indicates that the raw
GFS forecasts are considerably underdispersive for both
watersheds. Wet biases are observed for the CDD wa-
tershed and dry biases for theYAMwatershed.However,
after the calibration with the GPP method, the rank his-
tograms aremuch flatter for both watersheds (Figs. 4b,d),
even though only 15 members are generated in this case.
Using more members would result in more uniform rank
histograms.
Figure 5 shows the rank histograms of ensemble tem-
perature forecasts before and after calibration for 1 lead
day over the twowatersheds. Similarly to precipitation, to
allow for a fair comparison with the raw ensemble fore-
casts, only 15 members are generated for the GPP en-
semble forecasts in this case. The results show that the
distribution of raw GFS ensemble forecasts is highly
FIG. 4. Rank histograms of the (left) GFS and (right) GPP ensemble precipitation forecasts for 1 lead day over the (a),(b) CDD and (c),(d) YAM watersheds.
nonuniform (U shaped) for temperature. There is a
marked tendency for the distribution to be most popu-
lated at the extreme ranks, indicating the underdispersion
and cold bias of the raw forecasts over the two water-
sheds. However, rank histograms of calibrated ensemble
forecasts tend to be uniform for both watersheds.
Figures 6 and 7 show the quality of the ensemble mean forecast before and after postprocessing in terms of the MAE and RMSE, respectively, for both precipitation and temperature over both watersheds. Both statistics are computed using all forecast–observation pairs (25 yr × 365 days). Overall, the GFS ensemble mean forecasts
display large errors for both precipitation and temperature for lead times from 1 to 7 days. However, the GPP method consistently improves the quality of the ensemble mean forecasts for all leads. In terms of the MAE, the BC method displays more benefits than the GPP method for precipitation over both watersheds. This is expected, since the BC method specifically accounts for the bias of the GFS forecast. However, in terms of the RMSE, the GPP method consistently performs better than the BC method for precipitation. Since the BC and GPP methods share the same step for removing the bias of the ensemble mean temperature, the MAE and RMSE of the forecast temperature are the same for both.
As displayed in Figs. 6a and 6c, the quality of raw en-
semble mean forecasts decreases slightly with increasing
lead time for precipitation in terms of the MAE. How-
ever, the RMSE of raw ensemble mean forecasts tends to
decrease with an increase in lead time for precipitation
(Figs. 7a,c). After postprocessing, the forecast quality
slightly decreases with the increase in lead days. For the
ensemble mean temperature, there is a progressive de-
cline in forecast quality with increasing lead time in terms
of both MAE and RMSE.
Moreover, the quality of the ensemble mean forecast
at the CDD watershed is consistently better than that
at the YAM watershed for precipitation, suggesting that
watershed size plays an important role. This likely indi-
cates that the numerical weather forecast system is better
at representing precipitation events over a larger area,
since the representation of convective events is very dif-
ficult considering the horizontal resolution of the com-
putational grid. In this work, the observed precipitation
is watershed averaged, and as such, convective precipi-
tation extremes are smoothed over the larger basin. The
FIG. 5. Rank histograms of the (left) GFS and (right) GPP ensemble temperature forecasts for 1 lead day over the
(a),(b) CDD and (c),(d) YAM watersheds.
same extremes would play a more important role in a
smaller watershed.
The skill of ensemble forecasts relative to unskilled
climatology is assessed using the mean continuous ranked
probability skill score (MCRPSS; Fig. 8). The GFS en-
semble precipitation forecasts show negative skill relative
to the climatology over both watersheds. The forecast
skill consistently increases with the forecast leads. This is
caused primarily by the lack of spread (greater sharpness)
in shorter lead ensemble forecasts and the larger spread
in longer lead ensemble forecasts. Even though the BC
method is able to improve the ensemble precipitation
forecast to a certain extent, the skill is still negative for
all 7 lead days. The GPP method considerably increases the skill of the ensemble forecast for both watersheds and is consistently better than the BC method. The skill
of the GPP forecast decreases with increasing lead times,
and is close to zero at the 7-day lead, indicating that the
ensemble weather forecast has reached its predictability
limit. Thus, the calibration of ensemble precipitation
forecasts for a period of 7 lead days is appropriate in this
study.
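The sample CRPS underlying this skill score can be computed directly from the ensemble members. A sketch on synthetic forecasts (the noise levels and sample sizes are invented) compares an informative ensemble against climatology:

```python
import numpy as np

def crps_ensemble(obs, ens):
    """Sample CRPS averaged over forecast cases:
    CRPS = E|X - y| - 0.5 E|X - X'| (Gneiting et al. 2005)."""
    term1 = np.mean(np.abs(ens - obs[:, None]), axis=1)
    term2 = 0.5 * np.mean(np.abs(ens[:, :, None] - ens[:, None, :]),
                          axis=(1, 2))
    return float(np.mean(term1 - term2))

rng = np.random.default_rng(4)
n, m = 2000, 15
truth = rng.normal(size=n)
skillful = truth[:, None] + rng.normal(scale=0.5, size=(n, m))  # informative
climatology = rng.normal(size=(n, m))  # same marginal, no day-to-day info
crps_f = crps_ensemble(truth, skillful)
crps_c = crps_ensemble(truth, climatology)
crpss = 1.0 - crps_f / crps_c  # positive means better than climatology
```

A positive CRPSS means the forecast beats climatology, which is the sense in which the GPP forecasts remain skillful out to 7 days.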
The GFS ensemble temperature forecasts are much
more skillful than their precipitation forecasts for all leads
over both watersheds. Even though the GFS ensemble
temperature forecasts are skillful for the period up to
1 week, they can be further improved by both GPP and
BC methods. The GPP ensemble temperature forecasts
are consistently better than the BC ones for all 7 lead
days and both watersheds, indicating that benefits of
the GPP method not only come from the BC stage, but
also from the variance optimization stage. The BC
stage plays a slightly more important role at improving
the raw forecasts. Moreover, the skill of ensemble tem-
perature forecasts (before and after postprocessing)
consistently decreases with the increase in lead time for
both watersheds.
For probabilistic metrics computed for discrete events,
such as the BS, BSS, and reliability diagrams used in this
study, a number of thresholds have been defined. As
mentioned earlier, six thresholds were used in this study.
Since similar patterns are obtained for all thresholds, only the results for the median-exceedance threshold are presented for illustration for all four metrics.
Figures 9 and 10 show the BS of GFS, GPP, and BC
ensemble precipitation and temperature forecasts for
both watersheds, with leads ranging between 1 and 7 days. The reliability, resolution, and uncertainty components of the BS, and the BSS, which measures the performance of an ensemble weather forecast relative to the climatology,
FIG. 6. MAE of GFS, GPP, and BC ensemble mean (left) precipitation and (right) temperature forecasts for 1–7 lead
days over the (a),(b) CDD and (c),(d) YAM watersheds.
are also presented. The reliability term of the BS measures how close the forecast probabilities are to the true probabilities, with smaller values indicating a better forecast system. The resolution term measures how much the predicted probabilities differ from the climatological average and therefore contribute valuable information. Thus, a larger resolution value suggests a better forecast. By definition, the uncertainty term of the BS is always equal to 0.25 [0.5 × (1 − 0.5)] when using the median as the threshold.
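The decomposition just described can be checked numerically. This sketch builds forecast probabilities from a toy 16-member ensemble for a median-exceedance event (all values invented), and bins on the distinct probability values so that the identity BS = reliability − resolution + uncertainty holds exactly:

```python
import numpy as np

def brier_decomposition(p, o):
    """Murphy decomposition of the Brier score, binning on the distinct
    forecast probability values so BS = rel - res + unc holds exactly.
    p: forecast probabilities; o: binary (0/1) observed outcomes."""
    bs = np.mean((p - o) ** 2)
    obar = o.mean()
    unc = obar * (1.0 - obar)
    rel = res = 0.0
    for pk in np.unique(p):
        mask = p == pk
        nk, ok = mask.sum(), o[mask].mean()
        rel += nk * (pk - ok) ** 2
        res += nk * (ok - obar) ** 2
    return bs, rel / p.size, res / p.size, unc

rng = np.random.default_rng(5)
# toy 16-member ensemble forecasting a median-exceedance event
truth = rng.normal(size=4000)
ens = truth[:, None] + rng.normal(scale=0.7, size=(4000, 16))
p = np.mean(ens > 0.0, axis=1)       # forecast probability of the event
o = (truth > 0.0).astype(float)      # observed binary outcome
bs, rel, res, unc = brier_decomposition(p, o)
```

With the median as the threshold, the uncertainty term comes out near 0.25, as stated above, and an informative ensemble yields a BS below that value.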
In terms of the BS, the ensemble forecasts are less ac-
curate (in overall performance) for precipitation (Fig. 9),
and reasonably accurate for temperature (Fig. 10) for
both watersheds. In terms of the BS, the BC method
performs slightly better for the temperature forecasts, but
shows no improvement for the ensemble precipitation
forecasts. Nevertheless, the GPP method consistently
increases the accuracy for both precipitation and tem-
perature for all 7 lead days for both watersheds, with
a consistent increase in the resolution component of the
BS. In addition, the reliability component of the BS is also improved for the ensemble precipitation forecasts for all lead times. In contrast, the reliability component is slightly degraded for the ensemble temperature forecasts
at all lead times. This is because the raw ensemble
forecasts are very reliable for temperature to begin
with (mean reliability component of 0.005 for the CDD
watershed and 0.003 for the YAM watershed). The
moderate decrease in the BS is due to a relatively large
increase in the resolution component and a slight de-
terioration of the reliability component.
In terms of the BSS, the skill of the GFS ensemble
precipitation forecast is negative for all 7 lead days. The
BC method results in small improvements for preci-
pitation forecasts, but only for the first few lead days. It
then progressively becomes worse than the GFS fore-
casts for the other lead days. The GPP method consid-
erably improves the skill of the ensemble forecast for
both watersheds and is consistently better than the BC
method. The skill of the GPP ensemble forecast de-
creases with increasing lead times, with the BSS being
close to zero at 7 lead days, further indicating that the
ensemble weather forecast retains some skill for a pe-
riod of up to 1 week for precipitation. The BSS shows
high skill in GFS ensemble temperature forecasts, for all
lead times and for both watersheds. The BC method
slightly improves the skill of the ensemble temperature
forecast, but at the expense of the resolution. However,
the GPP ensemble forecast consistently exceeds the skill
of GFS and BC ensemble forecasts. In particular, the
FIG. 7. RMSE of GFS, GPP, and BC ensemble mean (left) precipitation and (right) temperature forecasts for 1–7
lead days over the (a),(b) CDD and (c),(d) YAM watersheds.
BSS of the GPP ensemble forecast at 7 lead days is greater than that of the GFS forecast at 1 lead day for the CDD watershed.
The reliability diagram (Hartmann et al. 2002) is a
graph of the observed frequency of an event plotted
against its forecast probability. It provides information
with respect to the reliability, resolution, skill, and
sharpness of a forecast system. Figure 11 presents the
reliability diagrams of ensemble precipitation and tem-
perature forecasts before and after postprocessing for
a probability threshold exceeding the median for 1 and 3
lead days over both watersheds. Underdispersion of the
raw ensemble precipitation forecast (Fig. 4) is reflected in the reliability diagrams (Figs. 11a,c) for both 1 and 3 lead days for both watersheds, as shown by the slopes of the reliability curves, which are smaller than 1. This indicates
that the GFS ensemble precipitation forecasts are poorly
calibrated with limited skill and resolution. The prob-
ability forecasts derived from the raw ensembles are
overconfident, which can be reflected by the sharpness
(relative frequency of usage). The BC method results in little improvement for the ensemble precipitation forecasts, essentially because it does not account for the
spread. The GPP method results in a dramatic improve-
ment in the reliability of the ensemble precipitation
forecasts for both 1 and 3 lead days, although the sharp-
ness is somewhat lessened. Postprocessing thus improves
the reliability at the expense of the sharpness.
The cold biases of the raw temperature forecast (Fig. 5) are also reflected in the reliability diagrams (Figs. 11b,d) for both 1 and 3 lead days for both watersheds, as displayed by the underforecasting. The ensemble temperature forecasts calibrated using the weather generator–based method are much more reliable than the GFS and BC forecasts for both 1 and 3 lead days over both watersheds, as indicated by the reliability curves, which are very close to the 1:1 reference line. Here again, the improvement in reliability results in a slight decline in the sharpness. The better performance of the GPP method over the BC method is a clear indication that a significant
part of the performance is derived from the variance
optimization stage.
5. Discussion and conclusions
Ensemble weather forecasts generally suffer from bias
and tend to be underdispersive, which limits their pre-
dictive power. Several methods, such as logistic regression,
BMA, and ensemble dressing have been proposed for
postprocessing these ensemble forecasts. These methods
FIG. 8. MCRPSS of GFS, GPP, and BC ensemble (left) precipitation and (right) temperature forecasts for 1–7 lead days over the (a),(b) CDD and (c),(d) YAM watersheds.
are relatively complex to set up and generally aim at es-
timating the underlying predictive PDFs. However, a
series of point values are often more convenient for
practical applications such as ensemble streamflow fore-
casts, which need discrete, autocorrelated time series
over several days in order to run hydrological models.
These discrete, autocorrelated time series of precipitation and temperature need to be physically constrained. For example, temperature changes from one day to the next and the probability of precipitation occurrence are not random. Even if a method is adequate at reconstructing the underlying PDF, there is no simple way to go from the underlying distributions to generating several time series fully consistent with those distributions. The
GPP method presented in this study is significantly sim-
pler to implement than most existing methods, and it can
readily generate an infinite number of discrete, auto-
correlated time series over the forecasting horizon. The
auto- and cross correlation of and between Tmax and
Tmin were specifically taken into account with this
method. Moreover, the GPP method specifically takes
into account the precipitation occurrence biases of en-
semble forecasts. Precipitation amounts are modeled
using a parametric distribution. This underlying assump-
tion allows extreme values outside the range of the ob-
served data to be simulated. A gamma distribution was
used in this study. Other distributions with a heavy tail
(e.g., mixed exponential distribution and hybrid expo-
nential and generalized Pareto distribution) can also be
used to better represent the ensemble spread (C. Li et al.
2012; Z. Li et al. 2013). This is one of the major advan-
tages of this method over analog techniques. Even though
the Hamill et al. (2006) approach also constructed a dis-
tribution from the observations associated with the ana-
logs, the inclusion of dry events makes it impossible to fit
the parametric distributions. Furthermore, the performance of an analog technique is strongly dependent on the reforecast period, especially for extreme precipitation
forecasts. The rarer the events, the more difficult it is to
find appropriate forecast analogs (Hamill et al. 2006).
FIG. 9. BS and its decomposed components [Brier score uncertainty (BSunc), Brier score reliability (BSrel), and Brier score resolution
(BSres)] of (left) GFS, (middle) GPP, and (right) BC ensemble precipitation forecasts for 1–7 lead days over the (a)–(c) CDD and (d)–(f)
YAM watersheds. Probability exceeding 50% (median) is used as the threshold. The BSS of the ensemble forecasts relative to the
climatology is also presented.
However, this problem is avoided by the GPP method, since the extremes of ensemble forecasts are grouped into the last class of the ensemble mean forecast. The
ensemble mean precipitation and mean temperature
anomalies are used as predictors for postprocessing pre-
cipitation and temperature, respectively. Since the spatial
resolution of ensemble weather forecasts (model scale) is
usually lower than the required resolution of real appli-
cations (e.g., watershed scale for hydrological studies),
postprocessing also acts as a downscaling method.
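The gamma-fitting step mentioned above can be sketched with a closed-form approximate maximum-likelihood estimator; the Thom (1958) estimator used here is a common choice in weather generators, and the gamma parameters below are invented for illustration:

```python
import numpy as np

def fit_gamma_thom(x):
    """Approximate maximum-likelihood gamma fit using the Thom (1958)
    estimator, common in weather generators for wet-day amounts."""
    a_stat = np.log(x.mean()) - np.mean(np.log(x))
    shape = (1.0 + np.sqrt(1.0 + 4.0 * a_stat / 3.0)) / (4.0 * a_stat)
    return shape, x.mean() / shape  # (shape, scale)

rng = np.random.default_rng(6)
# synthetic wet-day precipitation amounts (parameters are invented)
wet = rng.gamma(shape=0.8, scale=8.0, size=20000)
shape, scale = fit_gamma_thom(wet)
# sampling from the fitted distribution can yield amounts beyond
# the observed maximum, unlike resampling-based analog approaches
sim = rng.gamma(shape, scale, size=20000)
```

This illustrates the advantage cited in the text: a parametric fit extrapolates beyond the observed record, which pure analog resampling cannot do.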
A simple BC method was used as a benchmark to
demonstrate the performance of the GPP method over
two Canadian watersheds located in the Province of
Quebec. Ensemble weather forecasts were taken from
the GFS retrospective forecast dataset. Much longer time series are available for ensemble reforecasts than for operational forecasts, allowing the postprocessing method to be properly calibrated. Previous studies convincingly showed
that postprocessing using reforecasts achieved substantial
improvements in their skill and reliability (Hagedorn
et al. 2008; Hamill et al. 2008).
The GFS and GPP ensemble weather forecasts were
preliminarily tested using rank histograms. Similarly to
previous studies (Hamill and Colucci 1997; Hagedorn
et al. 2008), the GFS forecasts were found to be biased
and underdispersed, as illustrated by the excess pop-
ulations of the extreme ranks. This underdispersion was
more pronounced at the shorter forecast leads than for
longer forecast leads (results not shown). Uniform rank
histograms could be achieved for both precipitation and
temperature when postprocessed using the GPP method.
The performance of GFS, GPP, and BC ensemble
weather forecasts was further verified using both de-
terministic and probabilistic metrics. The deterministic
metrics (MAE and RMSE) showed large errors in GFS
ensemble mean forecasts for both precipitation and
temperature at all 7 lead days over both watersheds. The
GPP method was able to consistently improve the quality
of the ensemble mean forecasts. The skill of ensemble
weather forecasts relative to the climatology was mea-
sured using MCRPSS. The raw forecast had negative to
near-zero skill at all forecast leads for precipitation. The
GPP method substantially improved the skill of the en-
semble forecasts for precipitation, with MCRPSS being
positive for all 7 lead days. Even though relatively good
skill was observed for the raw ensemble temperature
FIG. 10. As in Fig. 9, but for ensemble temperature forecasts.
forecasts, they could be further improved by applying
the postprocessing method. The performance of the GPP
method was consistently better than that of the BC
method.
Probabilistic metrics computed for discrete events
including the BS, BSS, and reliability diagrams were
further used to verify the overall performance (accu-
racy, skill, reliability, resolution, and sharpness) of en-
semble weather forecasts. Overall, the GPPmethod was
able to consistently improve the accuracy of ensemble
forecasts for both precipitation and temperature over
both watersheds. It also consistently outperformed the
BC method. The GFS ensemble forecasts showed neg-
ative skill for precipitation for all 7 lead days. This in-
dicated that the underdispersed GFS forecasts were even
worse than the climatology for precipitation. However,
with the GPP method, a positive skill was achieved for
a period of up to 7 lead days. With the GPP method, the
skill of the ensemble temperature forecasts was also improved, even though they already showed reasonable skill before postprocessing. Underdispersion of the
GFS ensemble precipitation forecasts was reflected in the
reliability diagrams, indicating that the GFS precipitation
forecasts were poorly calibrated and showed little skill
and resolution. The GPP method markedly improved
their reliability and resolution for all leads over both
watersheds. However, the sharpness was somewhat
diminished. This is consistent with other studies (e.g.,
Hamill et al. 2008) in that the reliability was improved
at the expense of sharpness. The reliability diagrams
showed cold biases for GFS ensemble temperature.
However, the reliability curves were very close to the 1:1
perfect line after postprocessing.
Overall, even though GFS ensemble forecasts are bi-
ased and tend to be underdispersed, their overall per-
formance was considerably improved using the proposed
GPP method. Predictably, the performance of the GPP
method decreased with increasing lead time. For the GFS ensemble reforecasts and the selected basins, 7 days was the maximum useful lead time for precipitation. For temperature,
postprocessing over a longer period may be possible. The
use of the BC method for temperature provided an op-
portunity to separate the advantage of the GPP method
into that from the bias correction and variance optimiza-
tion stages. The better performance of the GPP method
clearly demonstrates the importance of the variance
FIG. 11. Reliability diagrams of GFS, GPP, and BC ensemble precipitation and temperature forecasts for 1 and 3 lead days over the (a),(b)
CDD and (c),(d) YAM watersheds. The probability of exceeding 50% (median) is used as the threshold.
optimization stage, even though the bias correction
carries the largest part of the performance gain.
Owing to restrictions in paper length, the only results shown for the probabilistic metrics (BS, BSS, and reliability diagrams) are those with a threshold exceeding the median value. The use of higher thresholds
is also interesting for many real-world applications. The
ensemble weather forecasts were also tested using higher
thresholds (67%, 90%, and 95%). While the results for the
higher thresholds slightly differ from those obtained using
the median values, the patterns were very similar. Spe-
cifically, the skill of the GPP forecasts decreased slightly
with the increasing threshold, because the GFS forecast
performance gets worse for the higher thresholds. How-
ever, the degree of improvement obtained from the GPP
method increased with the larger thresholds.
The excellent performance of the postprocessing
scheme may partly be due to the choice of basin-averaged meteorological time series. While still much smaller than the numerical model scale, the basin scale (9700 and 3330 km2) is nevertheless more comparable and may result in a better representation of precipitation than the station scale that is more commonly used. The method
performed very well with only one predictor (ensemble
mean for precipitation and ensemble mean anomaly for
temperature). No attempt was made at using additional
predictors. In particular, the use of the ensemble standard
deviation may yield additional improvements.
To obtain the true expectation of a weather generator, a large number of ensemble members need to be generated with the proposed method. Short time series could result in biases due to the random nature of the stochastic process. Thus, a 1000-member ensemble was generated in this study. For hydrological studies, however, running such a large ensemble through a fully distributed model may be time consuming. Nevertheless, as indicated by the rank histograms in Figs. 4 and 5, the ensemble with only 15 members was still better than the GFS forecast.
Therefore, depending on research purposes and hydrol-
ogy model complexity, an ensemble with fewer members
may also be acceptable.
For the real climate system, a correlation exists between precipitation and temperature: mean temperature is generally cooler on wet days. However, the proposed GPP method generated precipitation and temperature independently, which may affect this correlation to a certain extent. To investigate this, the precipitation–temperature correlations were calculated
for observed and forecasted datasets using all 25-yr daily
time series. The correlation coefficients for GFS, GPP,
and BC forecasts were obtained by averaging all co-
efficients over all 7 lead days and all ensemble members.
The correlation coefficients are 0.189, 0.363, 0.161, and
0.227 for observed data, GFS, GPP, and BC forecasts,
respectively, over the CDD watershed. They are equal
to 0.119, 0.239, 0.088, and 0.069, respectively, over the
YAM watershed. These results indicate that the GFS
forecasts overestimated the precipitation–temperature
correlation. However, the precipitation–temperature
correlation was slightly underestimated when using the
GPP method. This is as expected, since the ensemble
precipitation and temperature were generated indepen-
dently. However, it should be noted that any bias cor-
rection or postprocessing method may be expected to
alter the precipitation–temperature correlation, unless
specifically taken into account.
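The member-averaged correlation diagnostic described here amounts to a simple Pearson correlation per member; a sketch with synthetic, weakly coupled daily series (the coupling strength is invented, not the study's values):

```python
import numpy as np

def mean_member_corr(precip, temp):
    """Pearson precipitation-temperature correlation per ensemble
    member (members along axis 1), averaged across members."""
    return float(np.mean([np.corrcoef(precip[:, k], temp[:, k])[0, 1]
                          for k in range(precip.shape[1])]))

rng = np.random.default_rng(8)
n, m = 25 * 365, 15  # 25 yr of daily values, 15 members
temp = rng.normal(size=(n, m))
precip = 0.2 * temp + rng.normal(size=(n, m))  # invented weak coupling
r = mean_member_corr(precip, temp)
```

Comparing such a coefficient for observations against generated ensembles is how the correlation distortion reported above was quantified.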
The goal of this work was to provide a postprocessing
method to improve the ensemble weather forecasts for
ensemble streamflow forecasts in the Province of Quebec.
The proposed method was tested on two watersheds. For
a broader use, it should be tested using more datasets from
different climate zones. In addition, the performance of
a postprocessing method may partly depend on the skill of the raw forecasts. Thus, it may be necessary to test the proposed method using other ensemble weather forecast products such as the ECMWF reforecast. Notably, in the course of this work, a newer version of the GFS reforecast dataset was made available, showing improved skill over the older one used in this study (Hamill et al.
2013). It would be interesting to know how the GPP
method would perform on this newer dataset. Therefore, more comprehensive verifications, including evaluating the proposed method at different locations and using other ensemble weather forecast products, are recommended in future studies.
Acknowledgments. This work is part of a project fun-
ded by the Projet d’Adaptation aux Changements Cli-
matiques (PACC26) of the Province of Quebec, Canada.
The authors thank the Centre d’Expertise Hydrique du
Québec (CEHQ) and the Ouranos Consortium on Re-
gional Climatology and Adaptation for their contribu-
tions to this project. The authors wish to thankDr.Vincent
Fortin of Environment Canada for his insights on ensem-
ble weather forecasts, and Dr. James D. Brown of the
NOAA/National Weather Service, Office of Hydrologic
Development, for assisting us in the use of the Ensemble
Verification System (EVS) and for providing insightful
comments as we prepared this paper. We also thank the
Earth System Research Laboratory, Physical Sciences
Division for providing reforecast products.
REFERENCES
Bertotti, L., J. R. Bidlot, R. Buizza, L. Cavaleri, and M. Janousek,
2011: Deterministic and ensemble-based prediction of Adriatic
Sea sirocco storms leading to ‘acqua alta’ in Venice. Quart. J.
Roy. Meteor. Soc., 137 (659), 1446–1466.
Boucher, M. A., F. Anctil, L. Perreault, and D. Tremblay, 2011:
A comparison between ensemble and deterministic hydro-
logical forecasts in an operational context. Adv. Geosci., 29,
85–94, doi:10.5194/adgeo-29-85-2011.
Brocker, J., and L. A. Smith, 2008: From ensemble forecasts to
predictive distribution functions. Tellus, 60A, 663–678.
Brown, J. D., 2012: Ensemble Verification System (EVS), version
4.0. User’s manual, 107 pp.
——, J. Demargne, D. J. Seo, and Y. Liu, 2010: The Ensemble
Verification System (EVS): A software tool for verifying
ensemble forecasts of hydrometeorological and hydrologic
variables at discrete locations.Environ. Modell. Software, 25,
854–872.
Buizza, R., 1997: Potential forecast skill of ensemble prediction and
spread and skill distributions of the ECMWF ensemble pre-
diction system. Mon. Wea. Rev., 125, 99–119.
——, P. L. Houtekamer, Z. Toth, G. Pellerin, M. Wei, and Y. Zhu,
2005: A comparison of the ECMWF, MSC, and NCEP Global
Ensemble Prediction Systems. Mon. Wea. Rev., 133, 1076–
1097.
Chen, J., F. P. Brissette, and R. Leconte, 2010: A daily stochastic weather generator for preserving low-frequency of climate variability. J. Hydrol., 388, 480–490.
——, ——, and ——, 2011: Assessment and improvement of stochastic weather generators in simulating maximum and minimum temperatures. Trans. ASABE, 54, 1627–1637.
——, ——, ——, and A. Caron, 2012: A versatile weather generator for daily precipitation and temperature. Trans. ASABE, 55, 895–906.
Coulibaly, P., 2003: Impact of meteorological predictions on real-time spring flow forecasting. Hydrol. Processes, 17, 3791–3801.
Cui, B., Z. Toth, Y. Zhu, and D. Hou, 2012: Bias correction for
global ensemble forecast. Wea. Forecasting, 27, 396–410.
Demargne, J., J. Brown, Y. Liu, D. J. Seo, L. Wu, Z. Toth, and
Y. Zhu, 2010: Diagnostic verification of hydrometeorological
and hydrologic ensembles. Atmos. Sci. Lett., 11, 114–122.
Eckel, F. A., and M. K. Walters, 1998: Calibrated probabilistic
quantitative precipitation forecasts based on the MRF en-
semble. Wea. Forecasting, 13, 1132–1147.
Gneiting, T., A. E. Raftery, A. H. Westveld III, and T. Goldman,
2005: Calibrated probabilistic forecasting using ensemble
model output statistics and minimum CRPS estimation. Mon.
Wea. Rev., 133, 1098–1118.
Hagedorn, R., T. Hamill, and J. S. Whitaker, 2008: Probabilistic
forecast calibration using ECMWF and GFS ensemble refor-
ecasts. Part I: Two-meter temperatures. Mon. Wea. Rev., 136,
2608–2619.
Hamill, T. M., 2001: Interpretation of rank histograms for verifying
ensemble forecasts. Mon. Wea. Rev., 129, 550–560.
——, and S. J. Colucci, 1997: Verification of Eta–RSM short-range
ensemble forecasts. Mon. Wea. Rev., 125, 1312–1327.
——, and ——, 1998: Evaluation of Eta-RSM ensemble proba-
bilistic precipitation forecasts. Mon. Wea. Rev., 126, 711–
724.
——, and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. Mon. Wea. Rev., 134, 3209–3229.
——, and ——, 2007: Ensemble calibration of 500-hPa geopotential height and 850-hPa and 2-m temperatures using reforecasts. Mon. Wea. Rev., 135, 3273–3280.
——, ——, and X. Wei, 2004: Ensemble reforecasting: Improving
medium-range forecast skill using retrospective forecasts.
Mon. Wea. Rev., 132, 1434–1447.
——, ——, and S. L. Mullen, 2006: Reforecasts: An important
dataset for improving weather predictions. Bull. Amer. Me-
teor. Soc., 87, 33–46.
——, R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast
calibration using ECMWF and GFS ensemble reforecasts.
Part II: Precipitation. Mon. Wea. Rev., 136, 2620–2632.
——, G. T. Bates, J. S. Whitaker, D. R. Murray, M. Fiorino, T. J.
Galarneau Jr., Y. J. Zhu, and W. Lapenta, 2013: NOAA’s
second-generation global medium-range ensemble reforecast
dataset. Bull. Amer. Meteor. Soc., 94, 1553–1565.
Hartmann, H. C., T. C. Pagano, S. Sorooshian, and R. Bales, 2002:
Confidence builders: Evaluating seasonal climate forecasts
from user perspectives. Bull. Amer. Meteor. Soc., 83, 683–698.
Hutchinson, M. F., D. W. McKenney, K. Lawrence, J. H. Pedlar,
R. F. Hopkinson, E. Milewska, and P. Papadopol, 2009:
Development and testing of Canada-wide interpolated spatial
models of daily minimum–maximum temperature and pre-
cipitation for 1961–2003. J. Appl. Meteor. Climatol., 48, 725–
741.
Li, C., V. P. Singh, and A. K. Mishra, 2012: Simulation of the entire
range of daily precipitation using a hybrid probability dis-
tribution. Water Resour. Res., 48, W03521, doi:10.1029/
2011WR011446.
Li, Z., F. Brissette, and J. Chen, 2013: Finding the most appropriate precipitation probability distribution for stochastic weather generation and hydrological modelling in Nordic watersheds. Hydrol. Processes, 27, 3718–3729, doi:10.1002/hyp.9499.
Nicks, A. D., and G. A. Gander, 1994: CLIGEN: A weather generator for climate inputs to water resource and other models. Proc. Fifth Int. Conf. on Computers in Agriculture, St. Joseph, MI, American Society of Agricultural Engineers, 3–94.
Pellerin, G., L. Lefaivre, P. Houtekamer, and C. Girard, 2003: Increasing the horizontal resolution of ensemble forecasts at CMC. Nonlinear Processes Geophys., 10, 463–468.
Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. Mon. Wea. Rev., 133, 1155–1174.
Richardson, C. W., 1981: Stochastic simulation of daily precipitation, temperature, and solar radiation. Water Resour. Res., 17, 182–190.
Richardson, D. S., 2001: Measures of skill and value of ensemble
prediction systems, their interrelationship and the effect of
sample size. Quart. J. Roy. Meteor. Soc., 127, 2473–2489.
Roulston, M. S., and L. A. Smith, 2003: Combining dynamical and
statistical ensembles. Tellus, 55A, 16–30.
Schmeits, M. J., and K. J. Kok, 2010: A comparison between raw
ensemble output, (modified) Bayesian model averaging, and
extended logistic regression using ECMWF ensemble pre-
cipitation reforecasts. Mon. Wea. Rev., 138, 4199–4211.
Semenov, M. A., and E. M. Barrow, 2002: LARS-WG: A stochastic
weather generator for use in climate impact studies. User
manual, 28 pp. [Available online at www.rothamsted.ac.uk/
mas-models/download/LARS-WG-Manual.pdf.]
Sloughter, J. M., A. E. Raftery, T. Gneiting, and C. Fraley, 2007:
Probabilistic quantitative precipitation forecasting using
Bayesian model averaging. Mon. Wea. Rev., 135, 3209–
3220.
Soltanzadeh, I., M. Azadi, and G. A. Vakili, 2011: Using Bayesian
Model Averaging (BMA) to calibrate probabilistic surface
temperature forecasts over Iran. Ann. Geophys., 29, 1295–1303.
Toth, Z., Y. Zhu, and T. Marchok, 2001: The use of ensembles to
identify forecasts with small and large uncertainty. Wea.
Forecasting, 16, 463–477.
Vrugt, J. A., M. P. Clark, C. G. H. Diks, Q. Duan, and B. A.
Robinson, 2006: Multi-objective calibration of forecast en-
sembles using Bayesian model averaging.Geophys. Res. Lett.,
33, L19817, doi:10.1029/2006GL027126.
Wang, X., and C. Bishop, 2005: Improvement of ensemble re-
liability with a new dressing kernel. Quart. J. Roy. Meteor.
Soc., 131, 965–986.
Whitaker, J. S., X. Wei, and F. Vitart, 2006: Improving week-2
forecasts with multimodel reforecast ensembles. Mon. Wea.
Rev., 134, 2279–2284.
Wilks, D. S., 2005: Statistical Methods in the Atmospheric Sciences.
3rd ed. Academic Press, 467 pp.
——, 2006: Comparison of ensemble-MOS methods in the Lorenz '96 setting. Meteor. Appl., 13, 243–256.
——, 2009: Extending logistic regression to provide full-probability-distribution MOS forecasts. Meteor. Appl., 16, 361–368.
——, and T. M. Hamill, 2007: Comparison of ensemble-MOS methods using GFS reforecasts. Mon. Wea. Rev., 135, 2379–2390.
Wilson, L. J., S. Beauregard, A. E. Raftery, and R. Verret, 2007:
Calibrated surface temperature forecasts from the Canadian
ensemble prediction system using Bayesian Model Averaging.
Mon. Wea. Rev., 135, 1364–1385.