Robust Forecast Superiority Testing with an Application to Assessing Pools of Expert Forecasters*
Valentina Corradi (1), Sainan Jin (2) and Norman R. Swanson (3)
(1) University of Surrey, (2) Singapore Management University, and (3) Rutgers University
September 2020
Abstract
We develop a forecast superiority testing methodology which is robust to the choice of loss function. Following Jin, Corradi and Swanson (JCS: 2017), we rely on a mapping between generic loss forecast evaluation and stochastic dominance principles. However, unlike JCS tests, which are not uniformly valid, and have correct asymptotic size only under the least favorable case, our tests are uniformly asymptotically valid and non-conservative. These properties are derived by first establishing uniform convergence (over the error support) of HAC variance estimators and of their bootstrap counterparts, and by extending the asymptotic validity of generalized moment selection tests to the case of non-vanishing recursive parameter estimation error. Monte Carlo experiments indicate good finite sample performance of the new tests, and an empirical illustration suggests that prior forecast accuracy matters in the Survey of Professional Forecasters. Namely, for our longest forecast horizon (4 quarters ahead), selecting pools of expert forecasters based on prior accuracy results in ensemble forecasts that are superior to those based on forming simple averages and medians from the entire panel of experts.
Keywords: Robust Forecast Evaluation, Many Moment Inequalities, Bootstrap, Estimation Error, Combination Forecasts, Survey of Professional Forecasters.
_________________________
*Valentina Corradi, School of Economics, University of Surrey, Guildford, Surrey, GU2 7XH, UK, [email protected]; Sainan Jin, School of Economics, Singapore Management University, 90 Stamford Road, Singapore 178903, [email protected]; and Norman R. Swanson, Department of Economics, Rutgers University, 75 Hamilton Street, New Brunswick, NJ 08901, USA, [email protected]. We are grateful to Kevin Lee, Patrick Marsh, Luis Martins, James Mitchell, Alessia Paccagnini, Paulo Parente, Ivan Petrella, Valerio Poti, Barbara Rossi, Simon Van Norden, Claudio Zoli, and to the participants at the 2018 NBER-NSF Time Series Conference, the 2016 European Meeting of the Econometric Society, the Conference for 50 Years of Keynes College at Kent University, and seminars at Mannheim University, the University of Nottingham, University College Dublin, Instituto Universitário de Lisboa, Università di Verona and the Warwick Business School for useful comments and suggestions. Additionally, many thanks are owed to Mingmian Cheng for excellent research assistance.
1 Introduction
Forecast accuracy is typically measured in terms of a given loss function, with quadratic and absolute loss
being the most common choices. In recent years, there has been a growing discussion about the choice
of the “right” loss function. Gneiting (2011) stresses the importance of matching the quantity to be
forecasted and the choice of loss function (or scoring rule). The latter is said to be consistent for a given
statistical functional (e.g. the mean or the median), if expected loss is minimized when such a functional
is used. In a recent paper, Patton (2019) shows that if forecasts are based on nested information sets and
on correctly specified models, then in the absence of estimation error, forecast ranking is robust to the
choice of loss function within the class of consistent functions. On the other hand, if any of the above
conditions fail, then model ranking is dependent on the specific loss function used. This is an important
finding, given that it is natural for researchers to focus on the comparison of multiple misspecified models;
immediately implying that model rankings are loss function dependent.
In summary, given the importance of loss function dependence when comparing forecast accuracy, an
issue of key concern to empirical economists is the construction of loss function robust forecast accuracy
tests. A loss function free forecast evaluation criterion of interest should be based on the distribution
of raw forecast errors. Heuristically, one can define the best forecasting model as that producing errors
having a step cumulative distribution function that is equal to zero on the negative real line and equal to
one on the positive real line. Diebold and Shin (2015, 2017) build on this idea, and suggest choosing the
model for which the cumulative distribution of the forecast errors is closest to a step function. This idea
is also discussed in Corradi and Swanson (2013). Jin, Corradi and Swanson (JCS: 2017) establish a one to
one mapping between generalized loss (GL) forecast superiority and first order stochastic dominance, as
well as a one to one mapping between convex loss function (CL) and second order stochastic dominance.1
In particular, they show that the “best” model (regardless of loss function) according to a GL (CL)
function is the one which is first (second) order stochastically dominated on the negative real line and
first (second) order stochastically dominant on the positive real line, when comparing forecast errors. In
this sense, JCS (2017) establish that loss function free tests for forecast superiority can be framed in
terms of tests for stochastic dominance. In this paper, we note that tests for stochastic dominance can be
seen as tests for infinitely many moment inequalities. This allows us to utilize tools recently developed
by Andrews and Shi (2013, 2017) to derive asymptotically uniformly valid and non-conservative forecast superiority tests. Importantly, these tests improve over those introduced in JCS (2017), as the latter are asymptotically non-conservative only in the least favorable case under the null (i.e., when all weak moment inequalities hold with equality). Needless to say, controlling for slack inequalities is crucial when there are infinitely many of them.
1A loss function is a GL function if it is monotonically non-decreasing as the error moves away from zero. Additionally,
CL functions are the subset of convex GL functions.
The implementation of our tests requires that sample moments be standardized by an estimator of the standard deviation. However, forecast errors are typically not martingale difference sequences, either because they are based on dynamically misspecified models or, in the case of subjective predictions, because forecasters do not efficiently use all of the available information. Hence, we require heteroskedasticity
and autocorrelation (HAC) robust variance estimators. In our set-up, each variance estimator depends
on a specific point in the forecasting error support. Thus, in order to introduce our new tests for
forecast superiority, we must establish the consistency of HAC variance estimators uniformly over the
error support. Moreover, in order to carry out inference, using our tests, we also establish uniform
convergence of the HAC variance estimator bootstrap counterparts. Because of the presence of the
lag truncation parameter, uniform convergence of HAC estimators and of their bootstrap analogs does
not follow straightforwardly from uniform convergence of (kernel) nonparametric estimators. To the
best of our knowledge this contribution is a novel addition to the vast literature on HAC covariance
matrix estimation. In the sequel, we focus on the case of judgmental forecasts, in which there is no
parameter estimation error. In a supplemental online appendix, we consider the case of predictions based
on estimated models, and extend all of our results to the case of non vanishing estimation error. This is
accomplished under a recursive estimation scheme, by extending the recursive block bootstrap introduced
in Corradi and Swanson (2007).
Linton, Song and Whang (2010) also develop tests for stochastic dominance which are correctly
asymptotically sized over the boundary of the null, for the pairwise comparison case. A key role in their
asymptotic analysis is played by the contact set (i.e., the set of points over which the two CDFs are equal).
However, the notion of contact set does not extend straightforwardly to the multiple comparison case
considered in this paper. It should also be noted that other papers have addressed the problem of forecast
evaluation in the absence of full specification of the loss function. For example, Patton and Timmermann
(2007) have studied forecast optimality under only generic assumptions on the loss function. However,
they do not address the issue of forecast ranking under (partially) unknown loss. More recently, Barendse
and Patton (2019) introduce forecast multiple comparison under loss functions which are specified only
up to a shape parameter.
We assess the forecast superiority testing methodology discussed in this paper via a series of Monte
Carlo experiments. Simulation results show that our new tests are in some key cases much more accurately
sized and have much higher power than JCS tests. For example, in size experiments where DGPs contain
some models which are worse than the benchmark model, our new tests are substantially better sized
than the tests of JCS (2017). Additionally, our new tests exhibit notable power gains, relative to JCS
tests, in power experiments where DGPs contain some alternative models that dominate the benchmark,
while others are strictly dominated. These findings are as expected, given that JCS tests are undersized,
while our new tests are asymptotically non conservative.
In an empirical illustration, we apply our testing procedure to the Survey of Professional Forecasters
(SPF) dataset. In the SPF, participants are told which variables to forecast and whether they should
provide a point forecast or instead a probability interval, but they are not given a loss function (see
Croushore (1993) for a detailed description of the SPF). In the context of analyzing the predictive content of
the SPF, many papers find evidence of the usefulness of forecast combinations constructed using individual
SPF predictions, under quadratic or absolute loss. For example, Zarnowitz and Braun (1993) find that
using the mean or median provides a consensus forecast with lower average errors than most individual
forecasts. Aiolfi, Capistrán, and Timmermann (2011) and Genre, Kenny, Meyler, and Timmermann
(2013) find that equal weighted averages of SPF and ECB (European Central Bank) SPF forecasts often
outperform model based forecasts. In our illustration, we depart from these papers by noting that the
SPF naturally lends itself to loss function free forecast superiority testing, since participants are not given
loss functions. In light of this, we apply our new tests, and show that forecast averages (and medians)
from small pools of survey participants ranked according to recent forecast performance are preferred to
forecast averages based on the entire pool of experts, for our longest forecast horizon (1-year ahead). We
thus conclude that simple average and median forecasts can in some cases be “beaten”, regardless of loss
function.
The rest of the paper is organized as follows. Section 2 outlines the set-up and introduces our new
tests. Section 3 establishes the asymptotic properties of the tests in the context of generalized moment
selection. Section 4 contains the results of our Monte Carlo experiments, and Section 5 contains the
results of our analysis of GDP growth forecasts from the SPF. Finally, Section 6 provides a number of
concluding remarks. Proofs are gathered in an appendix. In a supplemental appendix, we establish the
asymptotic properties of our new tests in the context of non-vanishing parameter estimation error, for
the recursive estimation schemes.
2 Forecast Superiority Tests
Assume that we have a time series of forecast errors for each model/forecaster. Namely, we observe $u_{j,t}$, for $j = 1, \dots, m$ and $t = 1, \dots, T$, where $m$ denotes the number of models/forecasters, and $T$ denotes the number of observations. As stated earlier, we focus on the case in which we can ignore estimation error, such as when forecasts are judgmental or subjective. Surveys including the SPF are leading examples of judgmental forecasts. The case of non-vanishing recursive estimation error is analyzed in the supplemental appendix. Hereafter, the sequence $u_{1,t}$, $t = 1, \dots, T$, is called the "benchmark". In the context of the SPF, an example of a relevant benchmark against which to compare all other sequences is the consensus forecast, constructed as the simple arithmetic average of individual forecasts in the survey. Our goal is to test whether there exists some competing forecast that is superior to the benchmark for any loss function, $f$, satisfying Assumption A0.
Assumption A0: (i) $f \in \mathcal{L}^{GL}$ if $f: \mathbb{R} \rightarrow \mathbb{R}^{+}$ is continuously differentiable, except for finitely many points, with derivative $f'$ such that $f'(u) \leq 0$ for all $u \leq 0$ and $f'(u) \geq 0$ for all $u \geq 0$; (ii) $f \in \mathcal{L}^{CL}$ if $f$ is a convex function belonging to $\mathcal{L}^{GL}$.

Note that $\mathcal{L}^{GL}$ includes most of the loss functions commonly used by practitioners, including asymmetric loss, and it basically coincides with the notion of generalized loss in Granger (1999). The only restriction is that the loss depends solely on the forecast errors. This rules out the class of loss functions considered in, e.g., Section 3 of Patton and Timmermann (2007).

Hereafter, let $F_j(x)$ denote the cumulative distribution function (CDF) of the forecast error $u_{j,t}$. Also, define $s(x) = 1$ if $x \geq 0$, $s(x) = -1$ if $x < 0$. Propositions 2.2 and 2.3 in JCS (2017) establish the following results.

1. For any $f \in \mathcal{L}^{GL}$: $E(f(u_{1,t})) \leq E(f(u_{2,t}))$ if and only if $(F_2(x) - F_1(x)) s(x) \leq 0$ for all $x \in \mathcal{X}$.

2. For any $f \in \mathcal{L}^{CL}$: $E(f(u_{1,t})) \leq E(f(u_{2,t}))$ if and only if
$\left( \int_{-\infty}^{x} (F_1(v) - F_2(v)) \, dv \, 1(x < 0) + \int_{x}^{\infty} (F_2(v) - F_1(v)) \, dv \, 1(x \geq 0) \right) \leq 0$ for all $x \in \mathcal{X}$.
The first statement establishes a mapping between GL forecast superiority and first order stochastic dominance (FOSD). In particular, $u_{1,t}$ is not GL dominated by $u_{2,t}$ if $F_1(x)$ lies below $F_2(x)$ on the negative real line, and lies above $F_2(x)$ on the positive real line. Indeed, this ensures that we choose the forecast whose CDF has larger mass around zero. Likewise, the second statement establishes a mapping between CL superiority and second order stochastic dominance.
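To fix ideas, the empirical analogue of the first mapping can be checked directly on data. The sketch below is our own illustration, not part of the paper's testing procedure: function names are hypothetical, there is no studentization and no inference, and the grid-based comparison of empirical CDFs via the sign function $s(\cdot)$ is only meant to make the dominance condition concrete.

```python
import numpy as np

def s(x):
    """Sign function from the text: s(x) = 1 for x >= 0, s(x) = -1 for x < 0."""
    return np.where(x >= 0, 1.0, -1.0)

def ecdf(errors, grid):
    """Empirical CDF of a sample of forecast errors, evaluated on a grid."""
    return (errors[None, :] <= grid[:, None]).mean(axis=1)

def gl_superior_in_sample(e1, e2, n_grid=200):
    """Empirical analogue of the GL condition (F2(x) - F1(x)) s(x) <= 0 for
    all x: returns True when forecast 1's error CDF has (weakly) more mass
    around zero than forecast 2's, at every grid point."""
    lo = min(e1.min(), e2.min())
    hi = max(e1.max(), e2.max())
    grid = np.linspace(lo, hi, n_grid)
    diff = (ecdf(e2, grid) - ecdf(e1, grid)) * s(grid)
    return bool(np.all(diff <= 1e-12))

rng = np.random.default_rng(0)
e1 = rng.normal(0.0, 1.0, 5000)  # benchmark errors
e2 = 3.0 * e1                    # competitor: same errors, scaled up threefold
print(gl_superior_in_sample(e1, e2))  # True: e1 is GL-superior in sample
print(gl_superior_in_sample(e2, e1))  # False
```

Scaling the same error sample makes the in-sample dominance exact, which is why the weak inequality holds at every grid point without sampling noise overturning it near zero.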
In this framework, it follows that testing for loss function robust forecast superiority involves testing:
$H_0^{GL}: \max_{j=2,\dots,m} \left( E(f(u_{1,t})) - E(f(u_{j,t})) \right) \leq 0$, for all $f \in \mathcal{L}^{GL}$,   (2.1)

versus

$H_A^{GL}: \max_{j=2,\dots,m} \left( E(f(u_{1,t})) - E(f(u_{j,t})) \right) > 0$, for some $f \in \mathcal{L}^{GL}$,   (2.2)

with $H_0^{CL}$ and $H_A^{CL}$ defined analogously, replacing $\mathcal{L}^{GL}$ with $\mathcal{L}^{CL}$.

Hereafter, let $\mathcal{X} = \mathcal{X}^- \cup \mathcal{X}^+$ be the union of the supports of $u_{1,t}, \dots, u_{m,t}$. Given the equivalence between GL (CL) forecast superiority and first (second) order stochastic dominance, we can restate $H_0^{GL}$, $H_0^{CL}$, $H_A^{GL}$ and $H_A^{CL}$ as

$H_0^{GL} = H_0^{GL,-} \cap H_0^{GL,+}:$
$\left( F_1(x) - F_j(x) \leq 0, \text{ for } j = 2, \dots, m \text{ and for all } x \in \mathcal{X}^- \right) \cap \left( F_j(x) - F_1(x) \leq 0, \text{ for } j = 2, \dots, m \text{ and for all } x \in \mathcal{X}^+ \right)$

versus

$H_A^{GL} = H_A^{GL,-} \cup H_A^{GL,+}:$
$\left( F_1(x) - F_j(x) > 0, \text{ for some } j = 2, \dots, m \text{ and for some } x \in \mathcal{X}^- \right) \cup \left( F_j(x) - F_1(x) > 0, \text{ for some } j = 2, \dots, m \text{ and for some } x \in \mathcal{X}^+ \right).$

Analogously,

$H_0^{CL} = H_0^{CL,-} \cap H_0^{CL,+}:$
$\left( \int_{-\infty}^{x} (F_1(v) - F_j(v)) \, dv \leq 0, \text{ for } j = 2, \dots, m \text{ and for all } x \in \mathcal{X}^- \right) \cap \left( \int_{x}^{\infty} (F_j(v) - F_1(v)) \, dv \leq 0, \text{ for } j = 2, \dots, m \text{ and for all } x \in \mathcal{X}^+ \right)$

versus

$H_A^{CL} = H_A^{CL,-} \cup H_A^{CL,+}:$
$\left( \int_{-\infty}^{x} (F_1(v) - F_j(v)) \, dv > 0, \text{ for some } j = 2, \dots, m \text{ and for some } x \in \mathcal{X}^- \right) \cup \left( \int_{x}^{\infty} (F_j(v) - F_1(v)) \, dv > 0, \text{ for some } j = 2, \dots, m \text{ and for some } x \in \mathcal{X}^+ \right).$
It is immediate to see that $H_0^{GL}$ and $H_0^{CL}$ can be written as the intersection of $(m-1)$ moment inequalities, which have to hold uniformly over $\mathcal{X}$. This gives rise to an infinite number of moment conditions. Andrews and Shi (2013) develop tests for conditional moment inequalities, and as is well known in the literature on consistent specification testing (e.g., see Bierens (1982, 1990)), a finite number of conditional moments can be transformed into an infinite number of unconditional moments. The same is true in the case of weak inequalities. Andrews and Shi (2017) consider tests for conditional stochastic dominance, which are then characterized by an infinite number of conditional moment inequalities, and so by a "twice" infinite number of unconditional inequalities. Recalling that our interest is in testing GL or CL forecast superiority, as in (2.1) and (2.2), we confine our attention to unconditional testing of stochastic dominance.
Because of the discontinuity at zero in the tests, $H_0^{GL,+}$ ($H_0^{CL,+}$) and $H_0^{GL,-}$ ($H_0^{CL,-}$) should be tested separately, and then one can use Holm (1979) bounds to control the two resulting p-values (see Rules TG and TC in JCS (2017)). In the sequel, for the sake of brevity, but without loss of generality, we focus our discussion on testing $H_0^{GL,+}$ versus $H_A^{GL,+}$, and $H_0^{CL,+}$ versus $H_A^{CL,+}$. However, when defining statistics, some discussion of the statistics associated with the case where $x \in \mathcal{X}^-$ is also given, when needed for clarity of exposition.
We begin by testing GL forecast superiority. Let $m^+(x) = \left( m_2^+(x), \dots, m_m^+(x) \right)'$, with $m_j^+(x) = F_j(x) - F_1(x)$, for $x \geq 0$. Define the empirical analog of $m^+(x)$ as $\widehat{m}_T^+(x) = \left( \widehat{m}_{2,T}^+(x), \dots, \widehat{m}_{m,T}^+(x) \right)'$, and for $x \geq 0$ let

$\widehat{m}_{j,T}^+(x) = \widehat{F}_j(x) - \widehat{F}_1(x),$   (2.3)

where $\widehat{F}_j(x)$ denotes the empirical CDF of $u_{j,t}$. Similarly, let $q^+(x) = \left( q_2^+(x), \dots, q_m^+(x) \right)'$, with $q_j^+(x) = \int_x^\infty (F_j(v) - F_1(v)) \, dv \, 1(x \geq 0)$. Define the empirical analog of $q^+(x)$ as $\widehat{q}_T^+(x) = \left( \widehat{q}_{2,T}^+(x), \dots, \widehat{q}_{m,T}^+(x) \right)'$, and let

$\widehat{q}_{j,T}^+(x) = \int_x^\infty \left( \widehat{F}_j(v) - \widehat{F}_1(v) \right) dv \, 1(x \geq 0) = \frac{1}{T} \sum_{t=1}^T \left( [(u_{1,t} - x)]^+ - [(u_{j,t} - x)]^+ \right),$   (2.4)

where $[y]^+ = \max\{0, y\}$. Further, define

$\Sigma^+(x, x') = \operatorname{acov}\left( \sqrt{T} \widehat{m}_T^+(x), \sqrt{T} \widehat{m}_T^+(x') \right)$   (2.5)

and

$\bar{\Sigma}_T^+(x, x') = \widehat{\Sigma}_T^+(x, x') + \varepsilon I_{m-1},$   (2.6)

where $\varepsilon > 0$, and where $\widehat{\Sigma}_T^+(x, x')$ is the sample analog of $\Sigma^+(x, x')$. In (2.6), the role of the additional $\varepsilon I_{m-1}$ term is to correct for the possible singularity of the covariance estimator, for certain values of $x$. This is the case when we compare forecast errors from nested models. Let $\widehat{g}_{j,t}(x) = 1\{u_{j,t} \leq x\} - \frac{1}{T} \sum_{t=1}^T 1\{u_{j,t} \leq x\}$, so that the $j$-th diagonal element of $\widehat{\Sigma}_T^+(x, x)$ is given by

$\widehat{\sigma}_{j,T}^{2,+}(x) = \frac{1}{T} \sum_{t=1}^T \left( \widehat{g}_{j,t}(x) - \widehat{g}_{1,t}(x) \right)^2 + \frac{2}{T} \sum_{\tau=1}^{l_T} \sum_{t=\tau+1}^T w_\tau \left( \widehat{g}_{j,t}(x) - \widehat{g}_{1,t}(x) \right) \left( \widehat{g}_{j,t-\tau}(x) - \widehat{g}_{1,t-\tau}(x) \right),$   (2.7)

where $w_\tau = 1 - \frac{\tau}{l_T + 1}$, with $l_T \to \infty$ as $T \to \infty$. Also, let $\sigma_j^{2,+}(x, x')$ be the $j$-th diagonal element of $\Sigma^+(x, x')$, and let $\bar{\sigma}_{j,T}^{2,+}(x, x')$ be the $j$-th diagonal element of $\bar{\Sigma}_T^+(x, x')$. Analogously,

$\Sigma_q^+(x, x') = \operatorname{acov}\left( \sqrt{T} \widehat{q}_T^+(x), \sqrt{T} \widehat{q}_T^+(x') \right)$ and $\bar{\Sigma}_{q,T}^+(x, x') = \widehat{\Sigma}_{q,T}^+(x, x') + \varepsilon I_{m-1},$

where $\widehat{\Sigma}_{q,T}^+(x, x')$ is the sample analog of $\Sigma_q^+(x, x')$. Furthermore, $\widehat{\sigma}_{q,j,T}^{2,+}(x)$ is constructed by replacing $\widehat{g}_{1,t}(x)$ and $\widehat{g}_{j,t}(x)$ in the above expression with

$\widehat{h}_{1,t}(x) = [(u_{1,t} - x)]^+ - \frac{1}{T} \sum_{t=1}^T [(u_{1,t} - x)]^+$ and $\widehat{h}_{j,t}(x) = [(u_{j,t} - x)]^+ - \frac{1}{T} \sum_{t=1}^T [(u_{j,t} - x)]^+.$
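As an illustration of the pointwise ingredients above, the sketch below computes $\widehat{m}_{j,T}^+(x)$, $\widehat{q}_{j,T}^+(x)$ and the Bartlett-kernel HAC variance of (2.7) at a single point $x \geq 0$. The helper names are our own, and the code makes no claim to match the authors' implementation; the weights $w_\tau = 1 - \tau/(l_T + 1)$ and the demeaned indicator differences follow the definitions in the text.

```python
import numpy as np

def m_hat_plus(u1, uj, x):
    """Empirical moment (2.3): F_j_hat(x) - F_1_hat(x), for x >= 0."""
    return np.mean(uj <= x) - np.mean(u1 <= x)

def q_hat_plus(u1, uj, x):
    """Empirical moment (2.4): (1/T) sum_t ([u_1t - x]^+ - [u_jt - x]^+)."""
    return np.mean(np.maximum(u1 - x, 0.0) - np.maximum(uj - x, 0.0))

def hac_variance(d, l_T):
    """Bartlett-kernel HAC variance of sqrt(T) times the mean of a demeaned
    series d, as in (2.7): gamma_0 + 2 * sum_{tau=1}^{l_T} w_tau * gamma_tau,
    with w_tau = 1 - tau/(l_T + 1)."""
    T = d.shape[0]
    sigma2 = np.mean(d * d)
    for tau in range(1, l_T + 1):
        w_tau = 1.0 - tau / (l_T + 1.0)
        sigma2 += 2.0 * w_tau * np.sum(d[tau:] * d[:-tau]) / T
    return sigma2

def sigma2_hat_plus(u1, uj, x, l_T):
    """HAC variance (2.7) of sqrt(T) * m_hat_plus(x), built from the demeaned
    indicator differences g_jt(x) - g_1t(x)."""
    g1 = (u1 <= x).astype(float); g1 -= g1.mean()
    gj = (uj <= x).astype(float); gj -= gj.mean()
    return hac_variance(gj - g1, l_T)
```

For the CL moments, one would simply replace the indicator differences with the demeaned hinge terms $\widehat{h}_{j,t}(x)$, exactly as in the text.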
Note that $\widehat{m}_{j,T}^-(x)$, $\widehat{q}_{j,T}^-(x)$, $\widehat{\sigma}_{j,T}^{2,-}(x)$ and $\widehat{\sigma}_{q,j,T}^{2,-}(x)$ can be defined by utilizing the $s(\cdot)$ function. Namely, regardless of whether $x \geq 0$ or $x < 0$, one can construct $\widehat{m}_{j,T}(x) = \left( \widehat{F}_j(x) - \widehat{F}_1(x) \right) s(x)$ and

$\widehat{q}_{j,T}(x) = \int_{-\infty}^x \left( \widehat{F}_1(v) - \widehat{F}_j(v) \right) dv \, 1(x < 0) + \int_x^\infty \left( \widehat{F}_j(v) - \widehat{F}_1(v) \right) dv \, 1(x \geq 0) = \frac{1}{T} \sum_{t=1}^T \left( [(u_{1,t} - x) s(x)]^+ - [(u_{j,t} - x) s(x)]^+ \right),$

$\widehat{\sigma}_{j,T}^2(x) = \frac{1}{T} \sum_{t=1}^T \left( \widehat{g}_{j,t}(x) - \widehat{g}_{1,t}(x) \right)^2 + \frac{2}{T} \sum_{\tau=1}^{l_T} \sum_{t=\tau+1}^T w_\tau \left( \widehat{g}_{j,t}(x) - \widehat{g}_{1,t}(x) \right) s(x) \left( \widehat{g}_{j,t-\tau}(x) - \widehat{g}_{1,t-\tau}(x) \right) s(x),$

and $\widehat{\sigma}_{q,j,T}^2(x)$ by replacing $\widehat{g}_{1,t}(x)$ and $\widehat{g}_{j,t}(x)$ in the above expression with

$\widehat{h}_{1,t}(x) = [(u_{1,t} - x) s(x)]^+ - \frac{1}{T} \sum_{t=1}^T [(u_{1,t} - x) s(x)]^+$ and $\widehat{h}_{j,t}(x) = [(u_{j,t} - x) s(x)]^+ - \frac{1}{T} \sum_{t=1}^T [(u_{j,t} - x) s(x)]^+.$
Given the above framework, our new robust forecast superiority test statistics are:

$T_T^{GL,+} = \int_{x \in \mathcal{X}^+} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\sqrt{T} \widehat{m}_{j,T}^+(x)}{\bar{\sigma}_{j,T}^+(x)} \right\} \right)^2 d\mu(x)$ and $T_T^{GL,-} = \int_{x \in \mathcal{X}^-} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\sqrt{T} \widehat{m}_{j,T}^-(x)}{\bar{\sigma}_{j,T}^-(x)} \right\} \right)^2 d\mu(x),$   (2.8)

and

$T_T^{CL,+} = \int_{x \in \mathcal{X}^+} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\sqrt{T} \widehat{q}_{j,T}^+(x)}{\bar{\sigma}_{q,j,T}^+(x)} \right\} \right)^2 d\mu(x)$ and $T_T^{CL,-} = \int_{x \in \mathcal{X}^-} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\sqrt{T} \widehat{q}_{j,T}^-(x)}{\bar{\sigma}_{q,j,T}^-(x)} \right\} \right)^2 d\mu(x),$   (2.9)

where $\mu$ is a weighting function defined below; $\widehat{m}_{j,T}^+(x)$ and $\widehat{q}_{j,T}^+(x)$ are the $j$-th components of $\widehat{m}_T^+(x)$ and $\widehat{q}_T^+(x)$, as defined in (2.3) and (2.4), respectively. Here, $T_T^{GL,+}$ and $T_T^{CL,+}$ are "sum" functions, as in equation (3.8) in Andrews and Shi (2013), and satisfy their Assumptions S1-S4, which are required to guarantee that convergence is uniform over the null DGPs.2,3 If $m = 2$ and $\bar{\sigma}_{j,T}(x) = 1$ for all $j$ and $x$ (i.e., no standardization), then $T_T^{GL,+}$ is the statistic used in Linton, Song and Whang (2010) for testing FOSD.

2Note that we could have constructed a different "sum" function, using the statistic in (3.9) of Andrews and Shi (2013).
3Recall that one main drawback of the $\max_{j=2,\dots,m} \sup_{x \in \mathcal{X}^+} \sqrt{T} \widehat{m}_{j,T}^+(x)$ statistic in JCS (2017) is that it diverges to $-\infty$ under some sequences of probability measures under the null, thus ruling out uniformity.
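Under simplifying assumptions of our own (a uniform weighting measure $\mu$ on a finite grid, and a plain sample variance plus the $\varepsilon$ term of (2.6) in place of the HAC estimator), the "sum" statistic $T_T^{GL,+}$ in (2.8) can be sketched as follows; the function names are illustrative, not the authors'.

```python
import numpy as np

def t_gl_plus(u_bench, competitors, grid, eps=1e-3):
    """Sketch of (2.8): sum over grid points on X+ of
    sum_j (max{0, sqrt(T) m_hat_j(x) / sigma_bar_j(x)})^2, with mu uniform
    on the grid. The variance here is a simple sample variance of the
    indicator difference, regularized as in (2.6) (a simplification of the
    HAC estimator used in the paper)."""
    T = len(u_bench)
    total = 0.0
    for x in grid:
        for uj in competitors:
            d = (uj <= x).astype(float) - (u_bench <= x).astype(float)
            m = d.mean()                    # (2.3): F_j_hat(x) - F_1_hat(x)
            sigma = np.sqrt(d.var() + eps)  # regularized, cf. (2.6)
            total += max(0.0, np.sqrt(T) * m / sigma) ** 2
    return total / len(grid)                # uniform weights d mu(x)

rng = np.random.default_rng(1)
e = rng.normal(size=2000)
grid = np.linspace(0.0, 6.0, 50)
print(t_gl_plus(e, [3.0 * e], grid))   # 0.0: no moment is positive on X+
```

Only positive moment conditions contribute, so when the benchmark's error CDF weakly dominates every competitor's on the positive support, the statistic is exactly zero.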
Of note is that in our context, potential slackness causes a discontinuity in the pointwise asymptotic
distribution of the statistic.4 This is because the pointwise asymptotic distribution is discontinuous,
unless all moment conditions hold with equality. On the other hand, the finite sample distribution is not
necessarily discontinuous. Thus, in the presence of slackness, the pointwise limiting distribution is not a
good approximation of the finite sample distribution, and critical values based on pointwise asymptotics
may be invalid. This is why we construct tests that are uniformly asymptotically valid (i.e., this is why
we study the limiting distribution of our tests under drifting sequences of probability measures belonging
to the null hypothesis). Moreover, in the infinite dimensional case, there is an additional source of
discontinuity. In particular, the number of moment inequalities which contributes to the statistic varies
across the different values of For example, the key difference between the case of = 2 and 2 is
that in the former case, for each value of there is only one moment inequality which can be binding (or
not). On the other hand, if = 3, say, then for each value of there can be either one or two moment
inequalities which may be binding (or not), and whether or not a particular inequality is binding (or not)
varies over . Under this setup, we require the following assumptions in order to analyze the asymptotic
behavior of our test statistics.
Assumption A1: For $j = 1, \dots, m$, $u_{j,t}$ is strictly stationary and $\phi$-mixing, with mixing coefficients $\phi_\tau = C\tau^{-A}$, where $A \geq 6\gamma/(1 - 2\gamma)$, $0 < \gamma < 1/2$, and $C \geq 1$.
Assumption A2: The union of the supports of $u_{1,t}, \dots, u_{m,t}$ is the compact set $\mathcal{X} = \mathcal{X}^- \cup \mathcal{X}^+$.
Assumption A3: $F_j(x)$ has a continuous bounded density.
Assumption A4: The weighting function $\mu$ has full support on $\mathcal{X}^+$ (or $\mathcal{X}^-$).
We use Assumption A2 in the proof of Lemma 1, where we require $\mathcal{X}^+$ in (2.8) and (2.9) to be a compact set. However, for the case of generalized loss superiority, the union of the supports of $u_{1,t}, \dots, u_{m,t}$ can be unbounded. This is because $\widehat{m}_{j,T}^+(x)$ is bounded, regardless of the boundedness of the support. On the other hand, $\widehat{q}_{j,T}^+(x)$ is bounded only when the union of the supports of the forecast errors is bounded.
3 Asymptotic Properties
3.1 Uniform Convergence of the HAC Estimator
We now turn to a discussion of the estimation of the variance in our forecast superiority test statistics.
If $u_{1,t}, \dots, u_{m,t}$ were martingale difference sequences, then we could still use the sample second moment as a variance estimator, and uniform consistency would follow by application of an appropriate uniform law of large numbers. In our set-up, we can assume that $u_{1,t}, \dots, u_{m,t}$ are martingale difference sequences if either: (i) they are judgmental forecasts from professional forecasters, say, who efficiently use all available information at time $t$ (a strong assumption, which is tested in the forecast rationality literature); or (ii) they are prediction errors from one-step ahead forecasts based on dynamically correctly specified models. With respect to (i), it is worth noting that professional forecasters may be rational, ex-post, according to some loss function (see Elliott, Komunjer and Timmermann (2005, 2008)), although it is not as likely that they are rational according to a generalized loss function. With respect to (ii), it should be noted that at most one model can be dynamically correctly specified for a given information set, and thus $u_{j,t}$ cannot be a martingale difference sequence for all $j = 1, \dots, m$. In light of these facts, we allow for time dependence in the forecast error sequences used in our statistics, and use a HAC variance estimator in (2.8) and (2.9). In order to ensure that the HAC estimators converge uniformly over $\mathcal{X}^+$, it suffices to establish the counterpart of Lemma A1 of Supplement A of Andrews and Shi (2013) for the case of mixing sequences. This is done below.

4By pointwise asymptotic distribution we mean the limiting distribution under a fixed probability measure.
Lemma 1: Let Assumptions A1-A3 hold. Then, if $l_T \approx T^\gamma$, $0 < \gamma < 1/2$, with $\gamma$ defined as in Assumption A1:

(i) $\sup_{x \in \mathcal{X}^+} \left| \widehat{\sigma}_{j,T}^{2,+}(x) - \sigma_j^{2,+}(x) \right| = o_P(1)$, with $\sigma_j^{2,+}(x) = \operatorname{avar}\left( \sqrt{T} \widehat{m}_{j,T}^+(x) \right)$; and

(ii) $\sup_{x \in \mathcal{X}^+} \left| \widehat{\sigma}_{q,j,T}^{2,+}(x) - \sigma_{q,j}^{2,+}(x) \right| = o_P(1)$, with $\sigma_{q,j}^{2,+}(x) = \operatorname{avar}\left( \sqrt{T} \widehat{q}_{j,T}^+(x) \right)$.
Lemma 1 establishes the uniform convergence over $\mathcal{X}^+$ of the HAC estimators. It is the time series counterpart of Lemma A1 in Andrews and Shi (2013). Of note is that we require $\phi$-mixing. This differs from the stationary pointwise HAC variance estimator case studied by Andrews (1991), where $\alpha$-mixing suffices, and where the mixing coefficients decline to zero slightly more slowly than in our Assumption A1. This is because there is a trade-off between the degree of dependence and the rate of growth of the lag truncation parameter in the HAC estimator. Indeed, in the uniform case, the covering number (e.g., see Andrews and Pollard (1994)) grows with both $l_T$ and the degree of dependence, thus leading to a trade-off between the two. For example, in the case of exponentially mixing series, $\gamma$ can be arbitrarily close to $1/2$.
For carrying out inference on our forecast superiority tests, we require a bootstrap analog of the HAC variance estimator, which can be constructed as follows. Using the block bootstrap, make $b$ draws of blocks of length $l$ from $u_{1,t}, \dots, u_{m,t}$, in order to obtain $\left( u_{1,t}^*, \dots, u_{m,t}^* \right)$, $t = 1, \dots, T$, with $T = bl$, where the block size, $l$, is equal to the lag truncation parameter in the HAC estimator described above.5 Now, let $\widehat{g}_{1,t}^*(x) = 1\{u_{1,t}^* \leq x\} - \frac{1}{T} \sum_{t=1}^T 1\{u_{1,t} \leq x\}$, $\widehat{g}_{j,t}^*(x) = 1\{u_{j,t}^* \leq x\} - \frac{1}{T} \sum_{t=1}^T 1\{u_{j,t} \leq x\}$, and

$\widehat{\sigma}_{j,T}^{2*,+}(x) = \frac{1}{b} \sum_{i=1}^b \left( \frac{1}{l^{1/2}} \sum_{t=1}^l \left( \widehat{g}_{j,(i-1)l+t}^*(x) - \widehat{g}_{1,(i-1)l+t}^*(x) \right) \right)^2.$   (3.1)

Define $\widehat{\sigma}_{q,j,T}^{2*,+}(x)$ analogously, replacing $\widehat{g}_{1,t}^*(x)$ with $\widehat{h}_{1,t}^*(x) = \left[ u_{1,t}^* - x \right]^+ - \frac{1}{T} \sum_{t=1}^T \left[ u_{1,t} - x \right]^+$ and $\widehat{g}_{j,t}^*(x)$ with $\widehat{h}_{j,t}^*(x) = \left[ u_{j,t}^* - x \right]^+ - \frac{1}{T} \sum_{t=1}^T \left[ u_{j,t} - x \right]^+$. Additionally, define

$\widehat{\sigma}_{j,j',T}^{2*,+}(x) = \frac{1}{b} \sum_{i=1}^b \left( \frac{1}{l^{1/2}} \sum_{t=1}^l \left( \widehat{g}_{j,(i-1)l+t}^*(x) - \widehat{g}_{1,(i-1)l+t}^*(x) \right) \right) \left( \frac{1}{l^{1/2}} \sum_{t=1}^l \left( \widehat{g}_{j',(i-1)l+t}^*(x) - \widehat{g}_{1,(i-1)l+t}^*(x) \right) \right).$

5We thus use the same notation, $l$, for both the lag truncation parameter and the block length.
The following result holds.
Lemma 2: Let Assumptions A1-A3 hold. Then, if $l \approx T^\gamma$, $0 < \gamma < 1/2$, with $\gamma$ defined as in Assumption A1:

(i) $\sup_{x \in \mathcal{X}^+} \left| \widehat{\sigma}_{j,T}^{2*,+}(x) - E^*\left( \widehat{\sigma}_{j,T}^{2*,+}(x) \right) \right| = o_{P^*}(1)$; and

(ii) $\sup_{x \in \mathcal{X}^+} \left| \widehat{\sigma}_{q,j,T}^{2*,+}(x) - E^*\left( \widehat{\sigma}_{q,j,T}^{2*,+}(x) \right) \right| = o_{P^*}(1)$,

where $o_{P^*}(1)$ denotes convergence to zero according to the bootstrap probability law, $P^*$, conditional on the sample.
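A minimal sketch of the block-level construction in (3.1), under assumptions of our own: random (rather than fixed, consecutive) block starts, and a generic demeaned series in place of the indicator differences; the helper name is hypothetical.

```python
import numpy as np

def block_bootstrap_sigma2(d, l, rng):
    """Bootstrap analogue of the HAC variance, in the spirit of (3.1):
    resample b = T // l blocks of length l from the demeaned series d and
    average the squared l^{-1/2}-normalized block sums."""
    T = d.shape[0]
    b = T // l
    starts = rng.integers(0, T - l + 1, size=b)        # random block starts
    block_sums = np.array([d[s:s + l].sum() for s in starts])
    return float(np.mean((block_sums / np.sqrt(l)) ** 2))

# An alternating series has long-run variance zero: every contiguous block of
# length 2 sums to zero, so the estimator with l = 2 is exactly zero.
d = np.tile([1.0, -1.0], 50)
print(block_bootstrap_sigma2(d, 2, np.random.default_rng(0)))  # 0.0
```

With $l = 1$ the same estimator collapses to the sample second moment of $d$, which for this alternating series is exactly one; the contrast illustrates how blocking captures (here, negative) serial dependence.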
As in our above discussion, when constructing bootstrap counterparts of the statistics defined in (2.8) and (2.9) on both the positive and negative supports of $x$, it suffices to utilize the $s(\cdot)$ function, and to note that $s(x)^2 = 1$. For example, replace $\widehat{h}_{1,t}^*(x)$ with $\widehat{h}_{1,t}^*(x) = \left[ (u_{1,t}^* - x) s(x) \right]^+ - \frac{1}{T} \sum_{t=1}^T \left[ (u_{1,t} - x) s(x) \right]^+$, replace $\widehat{h}_{j,t}^*(x)$ with $\widehat{h}_{j,t}^*(x) = \left[ (u_{j,t}^* - x) s(x) \right]^+ - \frac{1}{T} \sum_{t=1}^T \left[ (u_{j,t} - x) s(x) \right]^+$, and define

$\widehat{\sigma}_{j,j',T}^{2*}(x) = \frac{1}{b} \sum_{i=1}^b \left( \frac{1}{l^{1/2}} \sum_{t=1}^l \left( \widehat{g}_{j,(i-1)l+t}^*(x) - \widehat{g}_{1,(i-1)l+t}^*(x) \right) \right) \left( \frac{1}{l^{1/2}} \sum_{t=1}^l \left( \widehat{g}_{j',(i-1)l+t}^*(x) - \widehat{g}_{1,(i-1)l+t}^*(x) \right) \right)$

and

$\widehat{\sigma}_{q,j,j',T}^{2*}(x) = \frac{1}{b} \sum_{i=1}^b \left( \frac{1}{l^{1/2}} \sum_{t=1}^l \left( \widehat{h}_{j,(i-1)l+t}^*(x) - \widehat{h}_{1,(i-1)l+t}^*(x) \right) \right) \left( \frac{1}{l^{1/2}} \sum_{t=1}^l \left( \widehat{h}_{j',(i-1)l+t}^*(x) - \widehat{h}_{1,(i-1)l+t}^*(x) \right) \right).$
3.2 Inference Using the Bootstrap and Bounding Limiting Distributions
The statistics $T_T^{GL,+}$ and $T_T^{CL,+}$ are highly discontinuous over $x$: exactly which moment conditions, and how many of them, are binding varies over $x$. Hence, $T_T^{GL,+}$ and $T_T^{CL,+}$ do not necessarily have a well defined limiting distribution, and the continuous mapping theorem cannot be applied. However, following the
generalized moment selection (GMS) test approach of Andrews and Shi (2013), we can establish lower and upper bound limiting distributions. Let

$D^+(x) = \operatorname{diag} \Sigma^+(x, x),$

$\nu_T^+(x) = D^+(x)^{-1/2} \left( \sqrt{T} \widehat{m}_{2,T}^+(x), \dots, \sqrt{T} \widehat{m}_{m,T}^+(x) \right)',$   (3.2)

$\Omega^+(x, x') = D^+(x)^{-1/2} \left( \Sigma^+ + \varepsilon I_{m-1} \right)(x, x') D^+(x')^{-1/2},$   (3.3)

and

$h^+(x) = \left( h_2^+(x), \dots, h_m^+(x) \right)',$   (3.4)

where $\nu^+(x)$ is an $(m-1)$-dimensional zero mean Gaussian process with correlation $\Omega^+(x, x')$. Also, let $D_q^+(x)$, $\nu_{q,T}^+(x)$, $\Omega_q^+(x, x')$ and $h_q^+(x)$ be defined analogously, by replacing $\Sigma^+(x, x)$, $\widehat{m}_{2,T}^+(x), \dots, \widehat{m}_{m,T}^+(x)$ with $\Sigma_q^+(x, x)$, $\widehat{q}_{2,T}^+(x), \dots, \widehat{q}_{m,T}^+(x)$. Finally, define

$T_h^{\dagger,GL,+} = \int_{\mathcal{X}^+} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\nu_j^+(x) + h_j^+(x)}{\sqrt{\bar{\sigma}_j^{2,+}(x)}} \right\} \right)^2 d\mu(x),$   (3.5)

where $\bar{\sigma}_j^{2,+}(x)$ is the $j$-th diagonal element of $\bar{\Sigma}^+(x, x)$, and let

$T_{h_\infty}^{\dagger,GL,+} = \int_{\mathcal{X}^+} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\nu_j^+(x) + h_{j,\infty}^+(x)}{\sqrt{\bar{\sigma}_j^{2,+}(x)}} \right\} \right)^2 d\mu(x),$   (3.6)

where $h_{j,\infty}^+(x) = 0$ if $m_j^+(x) = 0$, and $h_{j,\infty}^+(x) = -\infty$ if $m_j^+(x) < 0$. Also, define $T_h^{\dagger,CL,+}$ and $T_{h_\infty}^{\dagger,CL,+}$ analogously, by replacing $\nu_j^+(x)$, $h_j^+(x)$, $h_{j,\infty}^+(x)$ and $\bar{\sigma}_j^{2,+}(x)$ with $\nu_{q,j}^+(x)$, $h_{q,j}^+(x)$, $h_{q,j,\infty}^+(x)$ and $\bar{\sigma}_{q,j}^{2,+}(x)$. Hereafter, let

$\mathcal{P}_0^{GL,+} = \left\{ P : H_0^{GL,+} \text{ holds} \right\},$

so that $\mathcal{P}_0^{GL,+}$ is the collection of DGPs under which the null hypothesis holds. Let $\mathcal{P}_0^{CL,+}$ be defined analogously, with $H_0^{GL,+}$ replaced by $H_0^{CL,+}$. The following result holds.
Theorem 1: Let Assumptions A1-A4 hold. Then:

(i) under $H_0^{GL,+}$, there exists an $\eta > 0$ such that

$\limsup_{T \to \infty} \sup_{P \in \mathcal{P}_0^{GL,+}} \left[ P\left( T_T^{GL,+} > c \right) - P\left( T_h^{\dagger,GL,+} + \eta > c \right) \right] \leq 0$

and

$\liminf_{T \to \infty} \inf_{P \in \mathcal{P}_0^{GL,+}} \left[ P\left( T_T^{GL,+} > c \right) - P\left( T_h^{\dagger,GL,+} - \eta > c \right) \right] \geq 0;$

and

(ii) under $H_0^{CL,+}$, there exists an $\eta > 0$ such that

$\limsup_{T \to \infty} \sup_{P \in \mathcal{P}_0^{CL,+}} \left[ P\left( T_T^{CL,+} > c \right) - P\left( T_h^{\dagger,CL,+} + \eta > c \right) \right] \leq 0$

and

$\liminf_{T \to \infty} \inf_{P \in \mathcal{P}_0^{CL,+}} \left[ P\left( T_T^{CL,+} > c \right) - P\left( T_h^{\dagger,CL,+} - \eta > c \right) \right] \geq 0.$

Theorem 1 provides upper and lower bounds for $P\left( T_T^{GL,+} > c \right)$ and $P\left( T_T^{CL,+} > c \right)$, uniformly over the probabilities under $H_0^{GL,+}$ and $H_0^{CL,+}$, respectively. Note that $h^+(\cdot)$ and $h_q^+(\cdot)$ depend on the degree of slackness, and do not need to converge. Indeed, $T_T^{GL,+}$ and/or $T_T^{CL,+}$ do not have to converge in distribution for this result to hold.
Following Andrews and Shi (2013), we can construct bootstrap critical values which properly mimic the critical values of $T_{h_\infty}^{\dagger,GL,+}$ and $T_{h_\infty}^{\dagger,CL,+}$. We rely on the block bootstrap to capture the dependence in the data when constructing our bootstrap statistics. Consider the case of $T_{h_\infty}^{\dagger,GL,+}$. Let $\left( u_{1,t}^*, \dots, u_{m,t}^* \right)$, $b$ and $l$ be defined as in the previous subsection, and let:

$\widehat{m}_{j,T}^{*,+}(x) = \frac{1}{T} \sum_{t=1}^T \left( 1\{u_{j,t}^* \leq x\} - 1\{u_{1,t}^* \leq x\} \right)$   (3.7)

and

$\nu_T^{*,+}(x) = \sqrt{T} \widehat{D}_T^+(x)^{-1/2} \left( \widehat{m}_T^{*,+}(x) - \widehat{m}_T^+(x) \right),$   (3.8)

with $\widehat{m}_T^{*,+}(x) = \left( \widehat{m}_{2,T}^{*,+}(x), \dots, \widehat{m}_{m,T}^{*,+}(x) \right)'$ and $\widehat{D}_T^+(x) = \operatorname{diag} \widehat{\Sigma}_T^+(x, x)$. Then, define:

$\xi_T^+(x) = \kappa_T^{-1} T^{1/2} \bar{D}_T^+(x)^{-1/2} \widehat{m}_T^+(x),$   (3.9)

with $\kappa_T \to \infty$ as $T \to \infty$. Here, $\xi_{j,T}^+(x)$ is the $j$-th element of $\xi_T^+(x) = \left( \xi_{2,T}^+(x), \dots, \xi_{m,T}^+(x) \right)'$, $\bar{D}_T^+(x) = \operatorname{diag} \bar{\Sigma}_T^+(x, x)$, and

$\varphi_{j,T}^+(x) = \chi_T 1\left\{ \xi_{j,T}^+(x) < -1 \right\},$   (3.10)

with $\chi_T$ a positive sequence, which is bounded away from zero. Thus, $\varphi_{j,T}^+(x) = \chi_T$ when $\widehat{m}_{j,T}^+(x) < -\kappa_T T^{-1/2} \bar{\sigma}_{j,T}^+(x)$ (i.e., when the $j$-th inequality is slack at $x$), and is zero otherwise.

It is clear from the selection rule in (3.10) that we do need an estimator of the variance of the moment conditions, despite the fact that we use bootstrap critical values. In fact, standardization does not play a crucial role in the statistics, as all positive sample moment conditions matter. On the other hand, without the scaling factor in (3.9), the number of non-slack moment conditions would depend on the scale, and hence our bootstrap critical values would no longer be scale invariant. Let

$T_T^{*,GL,+} = \int_{\mathcal{X}^+} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\nu_{j,T}^{*,+}(x) - \varphi_{j,T}^+(x)}{\sqrt{\bar{\omega}_{j,T}^{*,+}(x)}} \right\} \right)^2 d\mu(x),$   (3.11)
where $\bar{\omega}_{j,T}^{*,+}(x)$ is the $j$-th diagonal element of $\bar{D}_T^+(x)^{-1/2} \bar{\Sigma}_T^{*,+}(x, x) \bar{D}_T^+(x)^{-1/2}$, and $\bar{\Sigma}_T^{*,+}(x, x)$ is the bootstrap analog of $\bar{\Sigma}_T^+(x, x)$.6 Note that if $\chi_T$ grows with $T$, then all slack inequalities are discarded, asymptotically. It is immediate to see that $T_T^{*,GL,+}$ is the bootstrap counterpart of $T_h^{\dagger,GL,+}$ in (3.5), with $\varphi_{j,T}^+(x)$ mimicking the contribution of the slackness of inequality $j$ (i.e., of the $j$-th element of $h^+(x)$). However, $\varphi_{j,T}^+(x)$ is not a consistent estimator of $h_j^+(x)$, since the latter cannot be consistently estimated.

Now, consider the case of $T_{h_\infty}^{\dagger,CL,+}$. Let:

$\widehat{q}_{j,T}^{*,+}(x) = \frac{1}{T} \sum_{t=1}^T \left( \left[ u_{1,t}^* - x \right]^+ - \left[ u_{j,t}^* - x \right]^+ \right)$

and define $\nu_{q,T}^{*,+}(x)$, $\widehat{D}_{q,T}^+(x)$, $\xi_{q,T}^+(x)$ and $\varphi_{q,j,T}^+(x)$ analogously to $\nu_T^{*,+}(x)$, $\widehat{D}_T^+(x)$, $\xi_T^+(x)$ and $\varphi_{j,T}^+(x)$, by replacing $\widehat{m}_T^{*,+}(x)$, $\widehat{m}_T^+(x)$ and $\widehat{\Sigma}_T^+(x, x)$ with $\widehat{q}_T^{*,+}(x)$, $\widehat{q}_T^+(x)$ and $\widehat{\Sigma}_{q,T}^+(x, x)$. Then, construct:

$T_T^{*,CL,+} = \int_{\mathcal{X}^+} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\nu_{q,j,T}^{*,+}(x) - \varphi_{q,j,T}^+(x)}{\sqrt{\bar{\omega}_{q,j,T}^{*,+}(x)}} \right\} \right)^2 d\mu(x).$   (3.12)

By comparing (2.8) and (2.9) with (3.11) and (3.12), it is immediate to see that $\widehat{m}_{j,T}^+(x)$ ($\widehat{q}_{j,T}^+(x)$) does not contribute to the test statistic when $\widehat{m}_{j,T}^+(x) < 0$ ($\widehat{q}_{j,T}^+(x) < 0$), while it does not contribute to the bootstrap statistic when $\widehat{m}_{j,T}^+(x) < -\kappa_T T^{-1/2} \bar{\sigma}_{j,T}^+(x)$ ($\widehat{q}_{j,T}^+(x) < -\kappa_T T^{-1/2} \bar{\sigma}_{q,j,T}^+(x)$), with $\kappa_T T^{-1/2} \to 0$. Heuristically, by letting $\kappa_T$ grow with the sample size, we control the rejection rates in a uniform manner.
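The selection rule in (3.9)-(3.10) can be sketched as follows, vectorized over the moment index; the function name is ours, and the tuning sequences $\kappa_T$ and $\chi_T$ are left as user inputs rather than fixed to any particular choice.

```python
import numpy as np

def gms_penalty(m_hat, sigma_bar, T, kappa_T, chi_T):
    """GMS recentering term phi_j(x) of (3.10): a moment is flagged as slack
    when xi_j(x) = kappa_T^{-1} sqrt(T) m_hat_j(x) / sigma_bar_j(x) < -1,
    i.e. when m_hat_j(x) < -kappa_T T^{-1/2} sigma_bar_j(x); slack moments
    are penalized by chi_T in the bootstrap statistic, and binding ones are
    left alone."""
    xi = np.sqrt(T) * np.asarray(m_hat) / (kappa_T * np.asarray(sigma_bar))
    return np.where(xi < -1.0, chi_T, 0.0)

# With T = 100 and kappa_T = 2: a sample moment of -1.0 gives xi = -5 < -1,
# so that inequality is treated as slack and penalized; a moment of 0.0 is
# binding and receives no penalty.
print(gms_penalty([-1.0, 0.0], [1.0, 1.0], T=100, kappa_T=2.0, chi_T=5.0))
```

Because $\kappa_T T^{-1/2} \to 0$, the slackness threshold shrinks toward zero, so in large samples only genuinely slack inequalities are removed from the bootstrap statistic.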
It remains to define the GMS bootstrap critical values. Let $c_{1-\alpha}^{*,GL,+}\left( \varphi_T^+, \bar{\Omega}_T^{*,+} \right)$ be the $(1-\alpha)$-th critical value of $T_T^{*,GL,+}$, based on $B$ bootstrap replications, with $\varphi_T^+$ defined as in (3.10) and $\bar{\Omega}_T^{*,+}(x, x) = \bar{D}_T^+(x)^{-1/2} \bar{\Sigma}_T^{*,+}(x, x) \bar{D}_T^+(x)^{-1/2}$. The $(1-\alpha)$-th GMS bootstrap critical value, $c_{1-\alpha,\eta}^{*,GL,+}\left( \varphi_T^+, \bar{\Omega}_T^{*,+} \right)$, is defined as:

$c_{1-\alpha,\eta}^{*,GL,+}\left( \varphi_T^+, \bar{\Omega}_T^{*,+} \right) = \lim_{B \to \infty} c_{1-\alpha+\eta}^{*,GL,+}\left( \varphi_T^+, \bar{\Omega}_T^{*,+} \right) + \eta,$

for $\eta > 0$, arbitrarily small. Further, $c_{1-\alpha}^{*,CL,+}\left( \varphi_{q,T}^+, \bar{\Omega}_{q,T}^{*,+} \right)$ and $c_{1-\alpha,\eta}^{*,CL,+}\left( \varphi_{q,T}^+, \bar{\Omega}_{q,T}^{*,+} \right)$ are defined analogously.

Here, the constant $\eta$ is used to guarantee uniformity over the infinite dimensional nuisance parameters $h_j^+(x)$, $h_{q,j}^+(x)$, uniformly in $x \in \mathcal{X}^+$, and is termed the infinitesimal uniformity factor by Andrews and Shi (2013). Heuristically, if all moment conditions are slack, then both the statistic and its bootstrap counterpart are zero, and by having $\eta > 0$, though arbitrarily close to zero, we control the asymptotic rejection rate.

Finally, let

$\mathcal{B}^{GL,+} = \left\{ x \in \mathcal{X}^+ \text{ s.t. } h_{j,\infty}^+(x) = 0, \text{ for some } j = 2, \dots, m \right\}$   (3.13)

6Thus, the diagonal elements of $\bar{\Sigma}_T^{*,+}(x, x)$ are the $\widehat{\sigma}_{j,T}^{2*,+}(x)$ described in the previous subsection, while the off-diagonal elements of $\bar{\Sigma}_T^{*,+}(x, x)$ are defined accordingly, as $\widehat{\sigma}_{j,j',T}^{2*,+}(x)$, with $j \neq j'$.
and
B+ =n ∈ X+ s.t. +∞ = 0 for some = 2
o (3.14)
where B+ and B+ define the sets over which at least one moment condition holds with strict equality,and these sets represent the boundaries of +0 and
+0 respectively.
Although we require that the block length grows at the same rate as the lag truncation parameter in Lemma 2 (i.e., we require that $l_P \approx C\,P^{\delta}$, with $0 < \delta < 1/2$ tied to the mixing coefficient in A1), for the asymptotic uniform validity of the bootstrap critical values we require that the block length grows at a rate slower than $P^{1/3}$. This slower rate is required for the bootstrap empirical central limit theorem for a mixing process to hold (see Peligrad (1998)). Needless to say, even in the construction of $\hat\sigma^{2,+}_{j,P}(x)$, we should thus use $l_P = o\left(P^{1/3}\right)$. The following result holds.
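As a rough illustration of the resampling scheme, the sketch below draws one circular block bootstrap sample with a block length growing slower than $P^{1/3}$; the exponent 0.25 is an illustrative choice, not the paper's:

```python
import numpy as np

def block_bootstrap(x, rng, block_len=None):
    """One circular block bootstrap resample of the series x.
    The block length grows slower than P**(1/3), as required for
    the bootstrap empirical CLT (illustrative choice: P**0.25)."""
    x = np.asarray(x)
    P = len(x)
    l = block_len or max(1, int(P ** 0.25))
    n_blocks = int(np.ceil(P / l))
    starts = rng.integers(0, P, size=n_blocks)     # random block starts
    idx = (starts[:, None] + np.arange(l)[None, :]) % P  # wrap around
    return x[idx.ravel()][:P]
```

The circular scheme avoids edge effects by wrapping block indices modulo $P$.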
Theorem 2: Let Assumptions A1-A4 hold, and let $l_P \to \infty$ and $l_P\,P^{-1/3} \to 0$ as $P \to \infty$. Under $H^{+}_{0}$:
(i) if, as $P \to \infty$, $\kappa_P \to \infty$ and $\kappa_P/\sqrt P \to 0$, then
$$\limsup_{P\to\infty}\ \sup_{\mathbb P\in\mathcal P^{+}_{0}} \Pr\left(T^{+}_{P} \geq c^{*,+,0}_{1-\alpha}\left(\hat\varphi^{+}_{P}, \hat\Omega^{*,+}_{P}\right)\right) \leq \alpha;$$
and
(ii) if, as $P \to \infty$, $\kappa_P \to \infty$, $\sqrt P/\kappa_P \to \infty$, and $\mu\left(B^{+}\right) > 0$, then
$$\lim_{\eta\to 0}\ \limsup_{P\to\infty}\ \sup_{\mathbb P\in\mathcal P^{+}_{0}} \Pr\left(T^{+}_{P} \geq c^{*,+,0}_{1-\alpha}\left(\hat\varphi^{+}_{P}, \hat\Omega^{*,+}_{P}\right)\right) = \alpha.$$
Also, under $H^{C,+}_{0}$:
(iii) if, as $P \to \infty$, $\kappa_P \to \infty$ and $\kappa_P/\sqrt P \to 0$, then
$$\limsup_{P\to\infty}\ \sup_{\mathbb P\in\mathcal P^{C,+}_{0}} \Pr\left(T^{C,+}_{P} \geq c^{*,C,+,0}_{1-\alpha}\left(\hat\varphi^{C,+}_{P}, \hat\Omega^{*,C,+}_{P}\right)\right) \leq \alpha;$$
and (iv) if, as $P \to \infty$, $\kappa_P \to \infty$, $\sqrt P/\kappa_P \to \infty$, and $\mu\left(B^{C,+}\right) > 0$, then
$$\lim_{\eta\to 0}\ \limsup_{P\to\infty}\ \sup_{\mathbb P\in\mathcal P^{C,+}_{0}} \Pr\left(T^{C,+}_{P} \geq c^{*,C,+,0}_{1-\alpha}\left(\hat\varphi^{C,+}_{P}, \hat\Omega^{*,C,+}_{P}\right)\right) = \alpha.$$
Statements (i) and (iii) of Theorem 2 establish that inference based on GMS bootstrap critical values is uniformly asymptotically valid. Statements (ii) and (iv) of the theorem establish that inference based on GMS bootstrap critical values is asymptotically non-conservative, whenever $\mu\left(B^{+}\right) > 0$ or $\mu\left(B^{C,+}\right) > 0$ (i.e., whenever at least one moment condition holds with equality over a set of $x \in \mathcal X^{+}$ with non-zero $\mu$-measure). Although the GMS based tests are not similar on the boundary, the degree of non-similarity, which is
$$\lim_{\eta\to 0}\limsup_{P\to\infty}\sup_{\mathbb P\in\mathcal P^{+}_{0}}\Pr\left(T^{+}_{P} \geq c^{*,+,0}_{1-\alpha}\left(\hat\varphi^{+}_{P},\hat\Omega^{*,+}_{P}\right)\right) - \lim_{\eta\to 0}\liminf_{P\to\infty}\inf_{\mathbb P\in\mathcal P^{+}_{0}}\Pr\left(T^{+}_{P} \geq c^{*,+,0}_{1-\alpha}\left(\hat\varphi^{+}_{P},\hat\Omega^{*,+}_{P}\right)\right),$$
is much smaller than that associated with using the “usual” recentered bootstrap. In the case of pairwise comparison (i.e., $n = 2$), Theorem 2(ii) of Linton, Song and Whang (2010) establishes similarity of stochastic dominance tests on a subset of the boundary.
For implementation of the tests discussed in this paper, it thus follows that one can use Holm bounds, as is done in JCS (2017), with modifications due to the presence of the constant $\eta$. Estimate bootstrap $p$-values
$$\hat p^{+}_{T} = \frac{1}{B}\sum_{b=1}^{B} 1\left\{\left(T^{*,+}_{P,b} + \eta\right) \geq T^{+}_{P}\right\} \quad\text{and}\quad \hat p^{-}_{T} = \frac{1}{B}\sum_{b=1}^{B} 1\left\{\left(T^{*,-}_{P,b} + \eta\right) \geq T^{-}_{P}\right\}.$$
Estimate $\hat p^{C,+}_{T}$ and $\hat p^{C,-}_{T}$ in analogous fashion. Then, use the following rules (Holm (1979)):
Rule GL: Reject $H_{0}$ at level $\alpha$ if $\min\left\{\hat p^{+}_{T},\, \hat p^{-}_{T}\right\} \leq (\alpha-\eta)/2$.
Rule CL: Reject $H^{C}_{0}$ at level $\alpha$ if $\min\left\{\hat p^{C,+}_{T},\, \hat p^{C,-}_{T}\right\} \leq (\alpha-\eta)/2$.
3.3 Power against Fixed and Local Alternatives
As our statistics are weighted averages over $\mathcal X^{+}$, they have non-trivial power only if the null is violated over a subset of non-zero $\mu$-measure. This applies both to power against fixed alternatives and to power against $\sqrt P$-local alternatives. In particular, for power against fixed alternatives, we require the following assumption.
Assumption FA: (i) $\mu\left(M^{+}\right) > 0$, where $M^{+} = \left\{x \in \mathcal X^{+} : m^{+}_{j,\infty}(x) > 0 \text{ for some } j = 2,\dots,n\right\}$. (ii) $\mu\left(M^{C,+}\right) > 0$, where $M^{C,+} = \left\{x \in \mathcal X^{+} : m^{C,+}_{j,\infty}(x) > 0 \text{ for some } j = 2,\dots,n\right\}$.
The following result holds.
Theorem 3: Let Assumptions A0-A4 hold.
(i) If Assumption FA(i) holds, then under $H^{+}_{A}$:
$$\lim_{P\to\infty}\Pr\left(T^{+}_{P} \geq c^{*,+,0}_{1-\alpha}\left(\hat\varphi^{+}_{P},\hat\Omega^{*,+}_{P}\right)\right) = 1.$$
(ii) If Assumption FA(ii) holds, then under $H^{C,+}_{A}$:
$$\lim_{P\to\infty}\Pr\left(T^{C,+}_{P} \geq c^{*,C,+,0}_{1-\alpha}\left(\hat\varphi^{C,+}_{P},\hat\Omega^{*,C,+}_{P}\right)\right) = 1.$$
It is immediate to see that we have unit power against fixed alternatives, provided that the null hypothesis is violated, for at least one $j = 2,\dots,n$, over a subset of $\mathcal X^{+}$ of non-zero $\mu$-measure. Now, if we instead used a Kolmogorov type statistic (i.e., replaced the integral over $\mathcal X^{+}$ with the supremum over $\mathcal X^{+}$), then we would not need Assumption FA, and it would suffice to have a violation for some $x$ with possibly zero $\mu$-measure, or in general with zero Lebesgue measure.⁷ However, as pointed out in Supplement B of
⁷The Kolmogorov versions of $T^{+}_{P}$ and $T^{C,+}_{P}$ are:
$$KS^{+}_{P} = \max_{x\in\mathcal X^{+}}\max_{j=2,\dots,n}\left(\max\left\{0,\ \frac{\sqrt P\,\bar m^{+}_{j,P}(x)}{\hat\sigma^{+}_{j,P}(x)}\right\}\right)^{2} \quad\text{and}\quad KS^{C,+}_{P} = \max_{x\in\mathcal X^{+}}\max_{j=2,\dots,n}\left(\max\left\{0,\ \frac{\sqrt P\,\bar m^{C,+}_{j,P}(x)}{\hat\sigma^{C,+}_{j,P}(x)}\right\}\right)^{2}.$$
Andrews and Shi (2013), the statements in parts (ii) and (iv) of Theorem 2 do not apply to Kolmogorov tests, and hence asymptotic non-conservativeness does not necessarily hold. This is because the proofs of those statements use the bounded convergence theorem, which applies to integrals but not to suprema.
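The contrast between the integral-type and Kolmogorov-type constructions can be sketched on a finite grid approximating $\mathcal X^{+}$ (a simplified discretization; names are illustrative):

```python
import numpy as np

def cvm_and_ks(m_bar, sigma_hat, P, weights=None):
    """m_bar, sigma_hat: (n-1, G) arrays of moment estimates and their
    standard errors on a grid of G points covering X+.  Returns the
    integral-type statistic, which averages the squared positive parts
    over the grid, and the sup-type (Kolmogorov) statistic."""
    z = np.sqrt(P) * m_bar / sigma_hat
    contrib = np.maximum(z, 0.0) ** 2            # (max{0, .})^2
    if weights is None:                           # uniform measure mu
        weights = np.full(m_bar.shape[1], 1.0 / m_bar.shape[1])
    cvm = float(np.sum(contrib.sum(axis=0) * weights))
    ks = float(contrib.max())
    return cvm, ks
```

A violation confined to a single grid point moves the sup-type statistic but is averaged away in the integral-type statistic, which is exactly the role of the non-zero $\mu$-measure requirement.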
We now consider the following sequences of local alternatives:
$$H^{+}_{A,P}:\ m^{+}_{j,P}(x) = m^{+}_{j,\infty}(x) + \frac{\delta_{1,j}(x)}{\sqrt P} + o\left(P^{-1/2}\right),\quad\text{for } j = 2,\dots,n,\ x \in \mathcal X^{+},$$
and
$$H^{C,+}_{A,P}:\ m^{C,+}_{j,P}(x) = m^{C,+}_{j,\infty}(x) + \frac{\delta_{2,j}(x)}{\sqrt P} + o\left(P^{-1/2}\right),\quad\text{for } j = 2,\dots,n,\ x \in \mathcal X^{+}.$$
We have $\sqrt P\,\bar m^{+}_{j,P}(x)\,\sigma^{-1,+}_{j}(x) \to h^{+}_{j,\infty}(x) + \delta_{1,j}(x)$ and $\sqrt P\,\bar m^{C,+}_{j,P}(x)\,\sigma^{-1,C,+}_{j}(x) \to h^{C,+}_{j,\infty}(x) + \delta_{2,j}(x)$, as $P \to \infty$. Define,
$$T^{\dagger,+}_{\infty,1} = \int_{\mathcal X^{+}}\sum_{j=2}^{n}\left(\max\left\{0,\ \frac{\nu^{+}_{j}(x) + h^{+}_{j,\infty}(x) + \delta_{1,j}(x)}{\sqrt{\sigma^{2,+}_{j}(x)}}\right\}\right)^{2}\,d\mu(x)$$
and
$$T^{\dagger,C,+}_{\infty,2} = \int_{\mathcal X^{+}}\sum_{j=2}^{n}\left(\max\left\{0,\ \frac{\nu^{C,+}_{j}(x) + h^{C,+}_{j,\infty}(x) + \delta_{2,j}(x)}{\sqrt{\sigma^{2,C,+}_{j}(x)}}\right\}\right)^{2}\,d\mu(x).$$
We require the following assumption.
Assumption LA: (i) $\mu\left(\Lambda^{+}\right) > 0$, where
$$\Lambda^{+} = \left\{x : \sqrt P\,\bar m^{+}_{j,P}(x)\,\sigma^{-1,+}_{j}(x) \to h^{+}_{j,\infty}(x) + \delta_{1,j}(x),\ 0 < h^{+}_{j,\infty}(x) + \delta_{1,j}(x) < \infty \text{ for some } j = 2,\dots,n\right\}.$$
(ii) $\mu\left(\Lambda^{C,+}\right) > 0$, where
$$\Lambda^{C,+} = \left\{x : \sqrt P\,\bar m^{C,+}_{j,P}(x)\,\sigma^{-1,C,+}_{j}(x) \to h^{C,+}_{j,\infty}(x) + \delta_{2,j}(x),\ 0 < h^{C,+}_{j,\infty}(x) + \delta_{2,j}(x) < \infty \text{ for some } j = 2,\dots,n\right\}.$$
The following result holds.
Theorem 4: Let Assumptions A1-A4 hold.
(i) If Assumption LA(i) holds, then under $H^{+}_{A,P}$:
$$\lim_{P\to\infty}\Pr\left(T^{+}_{P} \geq c^{*,+,0}_{1-\alpha}\left(\hat\varphi^{+}_{P},\hat\Omega^{*,+}_{P}\right)\right) = \Pr\left(T^{\dagger,+}_{\infty,1} \geq c_{1-\alpha}\left(h^{+}_{\infty}, \Omega^{+}_{\infty}\right)\right),$$
with $c_{1-\alpha}\left(h^{+}_{\infty}, \Omega^{+}_{\infty}\right)$ denoting the $(1-\alpha)$-th critical value of $T^{\dagger,+}_{\infty,1}$, with $0 < h^{+}_{j,\infty}(x) + \delta_{1,j}(x) < \infty$ for some $j = 2,\dots,n$. (ii) If Assumption LA(ii) holds, then under $H^{C,+}_{A,P}$:
$$\lim_{P\to\infty}\Pr\left(T^{C,+}_{P} \geq c^{*,C,+,0}_{1-\alpha}\left(\hat\varphi^{C,+}_{P},\hat\Omega^{*,C,+}_{P}\right)\right) = \Pr\left(T^{\dagger,C,+}_{\infty,2} \geq c_{1-\alpha}\left(h^{C,+}_{\infty}, \Omega^{C,+}_{\infty}\right)\right),$$
with $c_{1-\alpha}\left(h^{C,+}_{\infty}, \Omega^{C,+}_{\infty}\right)$ denoting the $(1-\alpha)$-th critical value of $T^{\dagger,C,+}_{\infty,2}$, with $0 < h^{C,+}_{j,\infty}(x) + \delta_{2,j}(x) < \infty$ for some $j = 2,\dots,n$.
Theorem 4 establishes that our tests have power against $\sqrt P$-local alternatives, provided that the drifting sequence is bounded away from zero over a subset of $\mathcal X^{+}$ of non-zero $\mu$-measure. Note also that, for a given loss function, the sequence of local alternatives for the White reality check can be defined as:
$$H_{A,P}:\ \max_{j=2,\dots,n}\left(\mathbb E\left[L\left(u_{1,t}\right)\right] - \mathbb E\left[L\left(u_{j,t}\right)\right]\right) = \frac{\delta}{\sqrt P} + o\left(P^{-1/2}\right),\quad \delta > 0. \qquad (3.15)$$
For the sake of simplicity, suppose that $n = 2$ (this is the well known Diebold and Mariano (1995) test framework). Here,
$$0 < \delta = \sqrt P\left(\mathbb E\left[L\left(u_{1,t}\right)\right] - \mathbb E\left[L\left(u_{2,t}\right)\right]\right) + o(1) = \sqrt P\int_{-\infty}^{\infty} L(u)\left(f_{1}(u) - f_{2}(u)\right)du$$
$$= -\sqrt P\int_{-\infty}^{0} L'(u)\left(F_{1}(u) - F_{2}(u)\right)du - \sqrt P\int_{0}^{\infty} L'(u)\left(F_{1}(u) - F_{2}(u)\right)du$$
$$= \int_{-\infty}^{0}\left(h_{-\infty}(u) + \delta_{1}(u)\right)L'(u)\,du + \int_{0}^{\infty}\left(h_{+\infty}(u) + \delta_{1}(u)\right)L'(u)\,du, \qquad (3.16)$$
where $h(u) + \delta_{1}(u)$ denotes the limit of $\sqrt P\left(F_{2,P}(u) - F_{1,P}(u)\right)$, and $\delta_{1} = \delta_{1,1} - \delta_{1,2}$. Hence, $H_{A,P}$ in (3.15) is equivalent to $H^{+}_{A,P} \cap H^{-}_{A,P}$, whenever Assumption A0 holds and $d\mu(x) = L'(x)\,dx$.
Analogously, for any convex loss function which satisfies Assumption A0, $H_{A,P}$ in (3.15) is equivalent to $H^{C,+}_{A,P} \cap H^{C,-}_{A,P}$, whenever $d\mu(x) = L''(x)\,dx$. In fact, it is easy to see that:
$$0 < \delta = \sqrt P\left(\mathbb E\left[L\left(u_{1,t}\right)\right] - \mathbb E\left[L\left(u_{2,t}\right)\right]\right) + o(1) = \sqrt P\int_{-\infty}^{\infty}L(u)\left(f_{1}(u)-f_{2}(u)\right)du$$
$$= -\sqrt P\int_{-\infty}^{0}L'(u)\left(F_{1}(u)-F_{2}(u)\right)du - \sqrt P\int_{0}^{\infty}L'(u)\left(F_{1}(u)-F_{2}(u)\right)du$$
$$= -L'(u)\,\sqrt P\int_{-\infty}^{u}\left(F_{1}(v)-F_{2}(v)\right)dv\,\Big|_{-\infty}^{0} + \sqrt P\int_{-\infty}^{0}L''(u)\left(\int_{-\infty}^{u}\left(F_{1}(v)-F_{2}(v)\right)dv\right)du$$
$$\quad + L'(u)\,\sqrt P\int_{u}^{\infty}\left(F_{1}(v)-F_{2}(v)\right)dv\,\Big|_{0}^{\infty} - \sqrt P\int_{0}^{\infty}L''(u)\left(\int_{u}^{\infty}\left(F_{1}(v)-F_{2}(v)\right)dv\right)du$$
$$= \sqrt P\int_{-\infty}^{0}L''(u)\left(\int_{-\infty}^{u}\left(F_{1}(v)-F_{2}(v)\right)dv\right)du - \sqrt P\int_{0}^{\infty}L''(u)\left(\int_{u}^{\infty}\left(F_{1}(v)-F_{2}(v)\right)dv\right)du$$
$$= \int_{-\infty}^{0}\left(\int_{-\infty}^{u}\left(h_{-\infty}(v)+\delta_{2}(v)\right)dv\right)L''(u)\,du - \int_{0}^{\infty}\left(\int_{u}^{\infty}\left(h_{+\infty}(v)+\delta_{2}(v)\right)dv\right)L''(u)\,du.$$
4 Monte Carlo Experiments
In this section, we evaluate the finite sample performance of GL and CL forecast superiority tests when there are multiple competing sequences of forecast errors, under stationarity. In addition to analyzing the performance of our tests based on $T^{+}_{P}$ and $T^{-}_{P}$ (GL forecast superiority), as well as based on $T^{C,+}_{P}$ and $T^{C,-}_{P}$ (CL forecast superiority), we also analyze the performance of the related test statistics from JCS (2017), here called $S^{+}_{P}$, $S^{-}_{P}$, $S^{C,+}_{P}$, and $S^{C,-}_{P}$. For the sake of brevity, these two classes of tests are called $T$ and $S$ tests, respectively.⁸ For each experiment we carry out 1000 Monte Carlo replications, and the number of bootstrap samples is $B = 500$. Additionally, four different values of the smoothing parameter $\epsilon$ are examined for the $S$ tests, namely $\epsilon \in \{0.20, 0.35, 0.50, 0.60\}$, and four different values of the uniformity constant $\eta$ are examined for the $T$ tests, namely $\eta \in \{0.0015, 0.002, 0.0025, 0.003\}$.⁹ Additionally, for $T$ tests, when constructing $\hat\Sigma^{+}_{P}$ (as well as $\hat\Sigma^{-}_{P}$, etc.), we set the lag truncation parameter $l_P = \mathrm{integer}[P^{0.2}]$, with Bartlett kernel weights $w_s = 1 - s/(l_P + 1)$. Finally, when implementing the bootstrap counterpart of the $T$ tests, we set $\kappa_P = \sqrt{0.3\log(P)}$ and $B_P = \sqrt{0.4\log(P)/\log(\log(P))}$, following Andrews and Shi (2013, 2017).

Sample sizes of $P \in \{300, 600, 900\}$ are generated using each of the following ten data generating processes (DGPs), with independent forecast errors.

DGP1: $e_{1,t} \sim N(0,1)$ and $e_{j,t} \sim N(0,1)$, $j = 2,3$.
DGP2: $e_{1,t} \sim N(0,1)$ and $e_{j,t} \sim N(0,1)$, $j = 2,3,4,5$.
DGP3: $e_{1,t} \sim N(0,1)$, $e_{j,t} \sim N(0,1)$, $j = 2,3$, and $e_{j,t} \sim N(0,1.4^2)$, $j = 4,5$.
DGP4: $e_{1,t} \sim N(0,1)$, $e_{j,t} \sim N(0,1)$, $j = 2,3$, and $e_{j,t} \sim N(0,1.6^2)$, $j = 4,5$.
DGP5: $e_{1,t} \sim N(0,1)$, $e_{j,t} \sim N(0,0.8^2)$, $j = 2,3$, and $e_{j,t} \sim N(0,1.2^2)$, $j = 4,5$.
DGP6: $e_{1,t} \sim N(0,1)$, $e_{j,t} \sim N(0,0.8^2)$, $j = 2,3,4,5$, and $e_{j,t} \sim N(0,1.2^2)$, $j = 6,7,8,9$.
DGP7: $e_{1,t} \sim N(0,1)$, $e_{j,t} \sim N(0,1)$, $j = 2,3$, and $e_{j,t} \sim N(0,0.8^2)$, $j = 4,5$.
DGP8: $e_{1,t} \sim N(0,1)$, $e_{j,t} \sim N(0,1)$, $j = 2,3$, and $e_{j,t} \sim N(0,0.6^2)$, $j = 4,5$.
DGP9: $e_{1,t} \sim N(0,1)$ and $e_{j,t} \sim N(0,0.8^2)$, $j = 2,3,4,5$.
DGP10: $e_{1,t} \sim N(0,1)$ and $e_{j,t} \sim N(0,0.6^2)$, $j = 2,3,4,5$.

Additionally, we conducted experiments using DGPs specified with autocorrelated errors. For the sake of brevity, these findings are reported in the supplemental online appendix. Denoting $\tilde e_{j,t} = \rho\,\tilde e_{j,t-1}\,+$
⁸In the construction of the JCS statistics
$$S^{+}_{P} = \int_{\mathcal X^{+}}\sum_{j=2}^{n}\left(\max\left\{0,\ \frac{\sqrt P\,\bar m^{+}_{j,P}(x)}{\hat\sigma^{+}_{j,P}(x)}\right\}\right)^{2} d\mu(x) \quad\text{and}\quad S^{-}_{P} = \int_{\mathcal X^{-}}\sum_{j=2}^{n}\left(\max\left\{0,\ \frac{\sqrt P\,\bar m^{-}_{j,P}(x)}{\hat\sigma^{-}_{j,P}(x)}\right\}\right)^{2} d\mu(x),$$
we set $\mu(\cdot)$ as in JCS (2017); thus, $\mu(\cdot)$ is still uniform. For inference using these tests, once $\epsilon$ is determined, estimate bootstrap $p$-values $\hat p^{+}_{S} = \frac{1}{B}\sum_{b=1}^{B}1\left\{\left(S^{*,+}_{P,b} + \epsilon\right) \geq S^{+}_{P}\right\}$ and $\hat p^{-}_{S} = \frac{1}{B}\sum_{b=1}^{B}1\left\{\left(S^{*,-}_{P,b} + \epsilon\right) \geq S^{-}_{P}\right\}$. Then, use the following rules (Holm (1979)): Reject $H_{0}$ at level $\alpha$ if $\min\left\{\hat p^{+}_{S}, \hat p^{-}_{S}\right\} \leq (\alpha - \epsilon)/2$. Reject $H^{C}_{0}$ at level $\alpha$ if $\min\left\{\hat p^{C,+}_{S}, \hat p^{C,-}_{S}\right\} \leq (\alpha - \epsilon)/2$.
⁹In JCS (2017), the constant that we call $\epsilon$ is referred to by a different symbol.
$(1-\rho^2)^{1/2}v_{j,t}$, with $v_{j,t} \sim N(0,1)$, $j = 1,\dots,5$, the DGPs for these experiments are as follows.
DGP11: $e_{1,t} = \tilde e_{1,t}$ and $e_{j,t} = \tilde e_{j,t}$, $j = 2,3,4,5$.
DGP12: $e_{1,t} = \tilde e_{1,t}$, $e_{j,t} = \tilde e_{j,t}$, $j = 2,3$, and $e_{j,t} = 1.4\,\tilde e_{j,t}$, $j = 4,5$.
DGP13: $e_{1,t} = \tilde e_{1,t}$, $e_{j,t} = 0.8\,\tilde e_{j,t}$, $j = 2,3$, and $e_{j,t} = 1.2\,\tilde e_{j,t}$, $j = 4,5$.
DGP14: $e_{1,t} = \tilde e_{1,t}$, $e_{j,t} = \tilde e_{j,t}$, $j = 2,3$, and $e_{j,t} = 0.6\,\tilde e_{j,t}$, $j = 4,5$.
In the above setup, DGPs 1-4 and DGPs 11-12 are used to conduct size experiments, while DGPs 5-10 and DGPs 13-14 are used to conduct power experiments. In all cases, $e_{1,t}$ denotes the forecast errors from the benchmark model. Note that DGPs 1-2 correspond to the least favorable elements in the null, while in DGPs 3-4 and DGPs 11-12, some models underperform the benchmark. This is the case where we expect significant improvement when using our new tests instead of JCS tests. In DGPs 5-6 and DGP13, one half of the competing models outperform the benchmark model and the other half underperform. In DGPs 7-8, one half of the competing models outperform, while in DGPs 9-10 and DGP14, the competing models all outperform the benchmark model. The above DGPs are similar to those examined in JCS (2017), and are utilized in our experiments because they clearly illustrate the trade-offs associated with using $T$ and $S$ forecast superiority tests.
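A sketch of one such DGP (here, a DGP13-style design; the AR parameter `rho` is illustrative, since the value used in the paper is reported with the supplemental results) is:

```python
import numpy as np

def simulate_dgp13(P, rho=0.5, burn=100, rng=None):
    """Forecast errors in the style of DGP13: AR(1) errors
    e_tilde_{j,t} = rho*e_tilde_{j,t-1} + sqrt(1-rho^2)*v_{j,t},
    rescaled so models j=2,3 outperform (0.8x) and j=4,5
    underperform (1.2x) the benchmark j=1."""
    rng = rng or np.random.default_rng()
    v = rng.standard_normal((5, P + burn))
    e = np.zeros_like(v)
    for t in range(1, P + burn):
        e[:, t] = rho * e[:, t - 1] + np.sqrt(1 - rho ** 2) * v[:, t]
    e = e[:, burn:]                      # drop burn-in
    scale = np.array([1.0, 0.8, 0.8, 1.2, 1.2])
    return scale[:, None] * e
```

The $(1-\rho^2)^{1/2}$ factor keeps the stationary variance of each $\tilde e_{j,t}$ equal to one, so the scale factors alone determine relative accuracy.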
We now discuss the experimental findings gathered in Tables 1 and 2. All reported results are rejection frequencies based on carrying out the $T$ and $S$ tests using a nominal size equal to 0.1. Turning first to Table 1, note that results in this table are based on $S$ tests. Summarizing, $S$ tests have reasonably good size under DGPs 1-2 (the least favorable case under the null). However, they are often undersized (and in some cases severely so) in some sample size/parameter permutations when some models are worse than the benchmark (see DGPs 3-4), as should be expected given that the tests are not asymptotically correctly sized under these two DGPs. Moreover, in these cases the empirical size is non-monotonic. Turning to Table 2, note that $T$ tests, which are asymptotically non-conservative, often exhibit better size properties under DGPs 3-4 (compare DGPs 3-4 in Tables 1 and 2) than $S$ tests. For example, for the CL forecast superiority test, the empirical size of the $S$ test is 0.020 for all values of $\epsilon$, when $P = 900$ (see Table 1). The analogous value based on implementation of the $T$ test is 0.083 for all values of $\eta$ (see Table 2). Again, it is worth stressing that this finding comes as no surprise, given that the $T$ test is asymptotically non-conservative on the boundary of the null hypotheses, while the $S$ test is conservative. Turning to power, note that the power of the $S$ test is sometimes quite low relative to that of the $T$ test. For example, under DGP7, $S$ test power is 0.445 for the GL forecast superiority test and 0.845 for the CL forecast superiority test, when $P = 300$. The analogous rejection frequencies for the $T$ tests are 0.870 and 0.923 (see Table 2, DGP7, $P = 300$). As expected, thus, $T$ tests exhibit improved power relative to $S$ tests when some models are worse than the benchmark. All of the above findings pertain to the analysis of DGPs 1-10, in which forecast errors are serially uncorrelated. Results for DGPs 11-14, in which errors
are serially correlated, are gathered in the supplemental appendix. The results for these DGPs (see supplemental Tables S1 and S2) are qualitatively the same as those reported above. Finally, it should be pointed out that the $T$ tests are not overly sensitive to the choice of $\eta$, and the empirical size of the $T$ tests appears “best” when $\eta$ is very small, as should be expected. In conclusion, there is a clear performance improvement when comparing our new robust predictive superiority tests with $S$ tests.¹⁰
5 Empirical Illustration: Robust Forecast Evaluation of SPF Expert Pools
In the real-time forecasting literature, predictions from econometric models are often compared with
surveys of expert forecasters.11 Such comparisons are important when assessing the implications asso-
ciated with using econometric models in policy setting contexts, for example. One key survey dataset
collecting expert predictions is the Survey of Professional Forecasters (SPF), which is maintained by
the Philadelphia Federal Reserve Bank (see Croushore (1993)). This dataset, formerly known as the
American Statistical Association/National Bureau of Economic Research Economic Outlook Survey, col-
lects predictions on various key economic indicators (including, for example, nominal GDP growth, real
GDP growth, prices, unemployment, and industrial production). For further discussion of the variables
contained in the SPF, refer to Croushore (1993) and Aiolfi, Capistrán, and Timmermann (2011). The
SPF has been examined in numerous papers. For example, Zarnowitz and Braun (1993) comprehensively
study the SPF, and find, among other things, that use of the mean or median provides a consensus
forecast with lower average errors than most individual forecasts. More recently, Aiolfi, Capistrán, and
Timmermann (2011) consider combinations of SPF survey forecasts, and find that equal weighted aver-
ages of survey forecasts outperform model based forecasts, although in some cases these mean forecasts
can be improved upon by averaging them with mean econometric model-based forecasts. When uti-
lizing European data from the recently released ECB SPF, Genre, Kenny, Meyler, and Timmermann
(2013) again find that it is very difficult to beat the simple average. This well known result pervades
the macroeconometric forecasting literature, and reasons for the success of such simple forecast averaging
¹⁰For a discussion of simulation results based on application of the Diebold and Mariano (DM: 1995) test (in which specific loss functions are utilized) in our experimental setup, refer to JCS (2017). Summarizing from that paper, it is clear that when the loss function is unknown, there is an advantage to using our approach of testing for forecast superiority. However, the DM test for pairwise comparisons, or a reality check test for multiple comparisons, might yield improved power for a given loss function. Indeed, under quadratic loss, JCS (2017) show that when the sample size is small, the DM test has better power performance than $S$ type tests. When the sample size increases, the power difference between the two tests becomes smaller. This is as expected.
¹¹See Fair and Shiller (1990), Swanson and White (1997a,b), Aiolfi, Capistrán and Timmermann (2011), and the references cited therein for further discussion.
are discussed in Timmermann (2006). He notes, among other things, that model misspecification related
to instability (non-stationarities) and estimation error in situations where there are many models and
relatively few observations may account to some degree for the success of simple forecast and model av-
eraging. Our empirical illustration attempts to shed further light on the issue of simple model averaging
and its importance in forecasting macroeconomic variables.
Our approach is to address the issue of forecast averaging and combination (called pooling) by viewing
the problem through the lens of forecast superiority testing. Our use of loss function robust tests is unique
to the SPF literature, to the best of our knowledge. Since we use robust forecast superiority tests, we do
not evaluate pooling by using loss function specific tests, such as those discussed in Diebold and Mariano
(1995), McCracken (2000), Corradi and Swanson (2003), and Clark and McCracken (2013). Additionally,
our approach differs from that taken by Elliott, Timmermann, and Komunjer (2005, 2008), where the
rationality of sequences of forecasts is evaluated by determining whether there exists a particular loss
function under which the forecasts are rational. We instead evaluate predictive accuracy irrespective of
the loss function implicitly used by the forecaster, and determine whether certain forecast combinations
are superior when compared against any loss function, regardless of how the forecasts were constructed.
In our tests, the benchmarks against which we compare our forecast combinations are simple average and
median consensus forecasts. We aim to assess whether the well documented success of these benchmark
combinations remains intact when they are compared against other combinations, under generic loss.12
In all of our experiments, we utilize SPF predictions of nominal GDP growth. The SPF is a quarterly
survey, and the dataset is available at the Philadelphia Federal Reserve Bank (PFRB) website. The
original survey began in 1968:Q4, and PFRB took control of it in 1990:Q2; but from that date, there
are only around 100 quarterly observations prior to 2018:Q1, where we end our sample. In our analysis
we thus use the entire dataset, which, after trimming to account for differing forecast horizons in our
calculations, is 166 observations.13,14
For our analysis, we consider 5 forecast horizons (i.e., $h = 0, 1, 2, 3, 4$). The reason we use $h = 0$ for one of the horizons is that the first horizon for which survey participants predict GDP growth is the quarter in which they are making their predictions. In light of this, forecasts made at $h = 0$ are called nowcasts. Moreover, it is worth noting that nowcasts are very important in policy making settings, since
12For an interesting discussion of machine learning and forecast combination methods, see Lahiri, Peng, and Zhao (2017);
and for a discussion of probability forecasting and calibrated combining using the SPF, see Lahiri, Peng, and Zhao (2015).
In these papers, various cases where consensus combinations do not “win” are discussed.13 It should be noted that the “timing” of the survey was not known with certainty prior to 1990. However, SPF
documentation states that they believe, although are not sure, that the timing of the survey was similar before and after
they took control of it.
¹⁴Note that the number of experts for which forecasts are recorded for each calendar date was approximately 90 during each of the 4 quarters of 1968, while there were only approximately 40 experts in each quarter in 2017. For further
details on the SPF dataset, refer to the documentation at https://www.philadelphiafed.org/research-and-data/real-time-
center/survey-of-professional-forecasters.
first release GDP data are not available until around the middle of the subsequent quarter. The nominal
GDP variable that we examine is called NGDP in the SPF. All test statistics are constructed using
NGDP growth rate prediction errors. In particular, assume that one survey participant makes a forecast of NGDP, say $\hat X_{t+h|\mathcal F_t}$.¹⁵ The associated forecast error is:
$$e_{t+h} = \left\{\ln\left(X_{t+h}\right) - \ln\left(X_{t}\right)\right\} - \left\{\ln\left(\hat X_{t+h|\mathcal F_t}\right) - \ln\left(X_{t}\right)\right\} = \ln\left(X_{t+h}\right) - \ln\left(\hat X_{t+h|\mathcal F_t}\right),$$
where the actual NGDP value, $X_{t+h}$, is reported in the SPF, along with the NGDP predictions of each survey participant. Note that when $h = 0$, $\mathcal F_t$ does not include $X_t$; however, for $h > 0$, $\mathcal F_t$ includes $X_t$. As discussed previously, this is due to the release dates associated with the availability of NGDP data.
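The error computation is simply a difference of log levels, since the $\ln(X_t)$ term common to the actual and predicted growth rates cancels; a minimal sketch:

```python
import numpy as np

def growth_forecast_error(actual, forecast, base):
    """Error in the predicted log growth rate from base period X_t:
    {ln(X_{t+h}) - ln(X_t)} - {ln(Xhat) - ln(X_t)}.  The ln(X_t)
    terms cancel, leaving ln(actual) - ln(forecast)."""
    g_actual = np.log(actual) - np.log(base)
    g_forecast = np.log(forecast) - np.log(base)
    return g_actual - g_forecast
```

A positive error means the expert under-predicted nominal GDP growth over the horizon.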
Figure 1 illustrates some of the key properties of the NGDP data that we utilize. Namely, note that
the distributions of the expert forecasts vary over time, and exhibit interesting skewness and kurtosis
properties (compare Panels A-D of the figure, and the skewness and kurtosis statistics reported below
the plots in the figure). Based on examination of the densities in Figure 1, one might wonder whether
“trimming” experts from the panel, say those experts that provided the forecasts appearing in the left
tails of the distributions, might improve overall predictive accuracy of the panel. Although this question
is not directly addressed in our analysis, we do construct and analyze the performance of various “pools”
formed by trimming experts that exhibit sub-par predictive accuracy, for example.
In addition to constructing $T^{+}_{P}$, $T^{-}_{P}$, $T^{C,+}_{P}$, and $T^{C,-}_{P}$ tests in our empirical investigation, we also test for forecast superiority using the JCS tests discussed above, which have correct size only under the least favorable case under the null. In particular, we construct $S^{+}_{P}$, $S^{-}_{P}$, $S^{C,+}_{P}$, and $S^{C,-}_{P}$ test statistics (see Section 2 and JCS (2017) for further details). All test statistics are calculated using the same parameter values (for $\epsilon$, $\eta$, $\kappa_P$, and $l_P$) as used in our Monte Carlo experiments. However, results are only reported for $\epsilon = 0.20$ and $\eta = 0.002$, since our findings remain unchanged when other values of $\epsilon$ and $\eta$ from our Monte Carlo experiments are used.
Two different benchmark models are considered, including (i) the arithmetic mean prediction from all
participants; and (ii) the median prediction from all participants. Additionally, a variety of alternative
model “groups” are considered. In all alternative models, mean and median predictions are again formed,
but this time using subsets of the total available panel of experts, chosen in a number of ways, as outlined
below.
Group 1 - Experts Chosen Based on Experience: Three expert pools (i.e. three alternative models)
consisting of experts with 1, 3, and 5 years of experience.
In all of the remaining groups of combinations, individuals are ranked according to average absolute
forecast errors, as well as according to average squared forecast errors. Mean (or median) predictions
from these groups are then compared with our benchmark combinations.
¹⁵Here, $\mathcal F_t$ denotes the information set available to the expert forecaster at the time their predictions are made.
Group 2 - Experts Chosen Based on Forecast Accuracy I: Three expert pools consisting of the most accurate expert over the last 1, 3, and 5 years.
Group 3 - Experts Chosen Based on Forecast Accuracy II: Three expert pools consisting of the most accurate group of 3 experts over the last 1, 3, and 5 years.
Group 4 - Experts Chosen Based on Forecast Accuracy III: Three expert pools consisting of the top 10% most accurate group of experts over the last 1, 3, and 5 years.
Group 5 - Experts Chosen Based on Forecast Accuracy IV: Three expert pools consisting of the top 25% most accurate group of experts over the last 1, 3, and 5 years.
Finally, 3 additional groups which combine models from each of Groups 1-5 are analyzed. These
include:
Group 6: Five expert pools, including one pool with experts that have 1 year of experience, and 4
additional pools, one from each of Groups 2-5, all defined over the last 1 year.
Group 7: Five expert pools, including one pool with experts that have 3 years of experience, and 4
additional pools, one from each of Groups 2-5, all defined over the last 3 years.
Group 8: Five expert pools, including one pool with experts that have 5 years of experience, and 4
additional pools, one from each of Groups 2-5, all defined over the last 5 years.
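A sketch of how an accuracy-based pool (e.g., a “top 10%” pool) can be formed recursively is given below (illustrative names; the 1, 3, and 5 year windows correspond to 4, 12, and 20 quarters, and ranking by squared errors is analogous):

```python
import numpy as np

def top_pool_forecast(errors, forecasts, top_frac=0.10, window=4):
    """errors, forecasts: (n_experts, T) arrays.  At each date t, rank
    experts by mean absolute error over the previous `window` quarters
    and average the forecasts of the top `top_frac` fraction."""
    n, T = errors.shape
    k = max(1, int(np.ceil(top_frac * n)))
    pooled = np.full(T, np.nan)
    for t in range(window, T):
        mae = np.nanmean(np.abs(errors[:, t - window:t]), axis=1)
        best = np.argsort(mae)[:k]          # smallest trailing MAE
        pooled[t] = np.nanmean(forecasts[best, t])
    return pooled
```

Only information available prior to date $t$ is used in the ranking, so the pooled forecast is a genuine real-time combination.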
As an example of how testing is performed, note that when implementing the tests using Group 1, there are three alternative models. The same is true when implementing tests using Groups 2-5. For Groups 6-8, tests are implemented using 5 alternative models, where one alternative is taken from each of Groups 1-5. Summarizing, we consider: (i) two benchmark models, against which each group of alternatives is compared; (ii) alternative models that are based on either mean or median pooled forecasts, for Groups 2-8; (iii) forecast accuracy pools, used in Groups 2-8, that are based on either average absolute forecast errors or average squared forecast errors; and (iv) 5 forecast horizons.
We now discuss our empirical findings. In Tables 3-4, statistics are reported for all forecast superiority tests. Entries are $T$, $T^{C}$, $S$, and $S^{C}$ test statistics, reported for forecast horizons $h = 0, 1, 2, 3, 4$. More specifically, $T = T^{+}_{P}$ if $\hat p^{+}_{T} \leq \hat p^{-}_{T}$; otherwise $T = T^{-}_{P}$. The other statistics reported in the tables (i.e., $T^{C}$, $S$, and $S^{C}$) are defined analogously. Rejections of the null of no forecast superiority at a 10% level are denoted by a superscript *. In Table 3, the benchmark model is always the arithmetic mean prediction from all participants, and expert pool forecasts are also arithmetic means. Analogously, in Table 4 the benchmark is the median prediction from all participants, and expert pool forecasts are also medians. To understand the layout of the tables, turn to Table 3, and note that for Group 1, the 4 statistics defined above (i.e., $T$, $T^{C}$, $S$, and $S^{C}$) are given for each forecast horizon, $h = 0, 1, 2, 3$, and 4. Superscripts denote rejection of the null hypothesis based on a particular test. For example, note that in Group 2 there are test rejections for horizons $h = 2$ and 4. Turning to the results summarized in the tables, a number of clear conclusions emerge.
First, the majority of test rejections occur for $h = 4$, as can be seen by inspection of the results in both Tables 3 and 4. In particular, note that for $h = 4$, there are 13 test rejections in Table 3 and 11 test rejections in Table 4, across Groups 1-8. On the other hand, for all other forecast horizons combined (i.e., $h \in \{0, 1, 2, 3\}$), there are 11 test rejections in Table 3 and 8 test rejections in Table 4. This suggests that expert pools which are constructed by “trimming” the least effective experts are most useful for longer horizon forecasts. These findings make sense if one assumes that it is easier to make short term forecasts than long term forecasts. Namely, some experts are simply not “up to the task” when forecasting at longer horizons. Summarizing, our main finding indicates that simple average or median forecasts can be beaten in cases where forecasts are more difficult to make (i.e., at longer horizons).
Second, “experience,” as measured by the length of time an expert has taken part in the SPF, is not a direct indicator of forecast superiority, since there are no rejections of our tests for Group 1 when either mean (see Table 3) or median (see Table 4) forecasts are used in our tests. This does not necessarily mean that experience does not matter, at least indirectly (notice that test rejections sometimes occur for Groups 6-8, where experience and accuracy traits are combined).¹⁶ Finally, note that Tables S1 and S2 in the supplemental appendix report root mean square forecast errors (RMSFEs) from the benchmark and competing models utilized in our empirical analysis. In these tables, we see that in the majority of cases considered, combination forecasts that utilize the mean have lower RMSFEs than when the median is used for constructing combination forecasts. For example, when comparing the benchmark RMSFEs of Group 1 that are reported in Tables S1 and S2, RMSFEs associated with mean combination forecasts (see Table S1) are lower for $h \in \{0, 2, 3, 4\}$ than the RMSFEs associated with median combination forecasts (see Table S2). This is interesting, given the clear asymmetry and long left tails associated with the distributions of expert forecasts exhibited in Figure 1, and suggests that outlier forecasts from “less accurate” experts are not overly influential when using measures of central tendency as ensemble forecasts.
Summarizing, we have direct evidence that judicious selection of pools of experts can lead to loss function robust forecast superiority. However, it should be stressed that in this illustration of the testing techniques developed in this paper, we do not consider various combination methods, including Bayesian model averaging, for example. Additionally, we only look at nominal GDP, although the SPF contains various other variables. Extensions such as these are left to future research.
¹⁶To explore this finding in more detail, we also constructed additional tables that are closely related to Tables 3 and 4, except that in these tables, RMSFEs are reported for all of the models used in each test (see supplemental appendix, Tables S1 and S2). In these tables, we see that combining experience with prior predictive accuracy can lead to lower RMSFEs, relative to the case where the entire pool of experts is used. However, RMSFEs are even lower for various alternative models for which we only use prior predictive accuracy to select expert pools (compare RMSFEs for Groups 3-5 with those for Groups 6-8 in the supplemental tables).
6 Concluding Remarks
We develop uniformly valid forecast superiority tests that are asymptotically non conservative, and that
are robust to the choice of loss function. Our tests are based on principles of stochastic dominance, which
can be interpreted as tests for infinitely many moment inequalities. In light of this, we use tools from
Andrews and Shi (2013, 2017) when developing our tests. The tests build on earlier work due to Jin,
Corradi, and Swanson (2017), and are meant to provide a class of predictive accuracy tests that are not
reliant on a choice of loss function, such as the Diebold and Mariano (1995) test discussed in McCracken
(2000). In developing the new tests, we establish uniform convergence (over error support) of HAC
variance estimators, and of their bootstrap counterparts. In a Supplement, we also extend the theory
of generalized moment selection testing to allow for the presence of non-vanishing parameter estimation
error. In a series of Monte Carlo experiments, we show that finite sample performance of our tests is quite
good, and that the power of our tests dominates that of the tests proposed by JCS (2017). Additionally, we carry
out an empirical analysis of the well known Survey of Professional Forecasters, and show that utilizing
expert pools based on past forecast quality can lead to loss function robust forecast superiority, when
compared with pools that include all survey participants. This finding is particularly prevalent for our
longest forecast horizon (i.e., 1-year ahead).
7 Appendix
Proof of Lemma 1: (i) The proof is the same for all $j$, so we suppress the $j$ subscript. Thus, let $g_t(x) = \left(1\left\{u_{j,t} \leq x\right\} - F_j(x)\right) - \left(1\left\{u_{1,t} \leq x\right\} - F_1(x)\right)$ and define
$$\tilde\sigma^{2,+}_{P}(x) = \frac{1}{P}\sum_{t=1}^{P} g_t^{2}(x) + \frac{2}{P}\sum_{s=1}^{l_P} w_s \sum_{t=s+1}^{P} g_t(x)\,g_{t-s}(x).$$
We first show that $\sup_{x\in\mathcal X^{+}}\left|\tilde\sigma^{2,+}_{P}(x) - \sigma^{2,+}(x)\right| = o_{\mathbb P}(1)$, and then we show that
$$\sup_{x\in\mathcal X^{+}}\left|\tilde\sigma^{2,+}_{P}(x) - \hat\sigma^{2,+}_{P}(x)\right| = o_{\mathbb P}(1). \qquad (7.1)$$
Now,
$$\sup_{x\in\mathcal X^{+}}\left|\tilde\sigma^{2,+}_{P}(x) - \sigma^{2,+}(x)\right| \leq \sup_{x\in\mathcal X^{+}}\left|\frac{1}{P}\sum_{t=1}^{P}\left(g_t^{2}(x) - \mathbb E\,g_t^{2}(x)\right) + \frac{2}{P}\sum_{s=1}^{l_P}w_s\sum_{t=s+1}^{P}\left(g_t(x)g_{t-s}(x) - \mathbb E\left[g_t(x)g_{t-s}(x)\right]\right)\right|$$
$$+ \sup_{x\in\mathcal X^{+}}\left|\sigma^{2,+}(x) - \frac{1}{P}\sum_{t=1}^{P}\mathbb E\,g_t^{2}(x) - \frac{2}{P}\sum_{s=1}^{l_P}w_s\sum_{t=s+1}^{P}\mathbb E\left[g_t(x)g_{t-s}(x)\right]\right|. \qquad (7.2)$$
We begin with the first term on the RHS of (7.2). First note that,
$$\sup_{x\in\mathcal X^{+}}\left|\frac{1}{P}\sum_{t=1}^{P}\left(g_t^{2}(x) - \mathbb E\,g_t^{2}(x)\right) + \frac{2}{P}\sum_{s=1}^{l_P}w_s\sum_{t=s+1}^{P}\left(g_t(x)g_{t-s}(x) - \mathbb E\left[g_t(x)g_{t-s}(x)\right]\right)\right|$$
$$\leq \sup_{x\in\mathcal X^{+}}\,2\sum_{s=0}^{l_P}\left|\frac{1}{P}\sum_{t=s+1}^{P}\left(g_t(x)g_{t-s}(x) - \mathbb E\left[g_t(x)g_{t-s}(x)\right]\right)\right|.$$
Now,
$$\Pr\left(\sup_{x\in\mathcal X^{+}}\,2\sum_{s=0}^{l_P}\left|\frac{1}{P}\sum_{t=s+1}^{P}\left(g_t(x)g_{t-s}(x) - \mathbb E\left[g_t(x)g_{t-s}(x)\right]\right)\right| \geq \varepsilon\right)$$
$$\leq \sum_{s=0}^{l_P}\Pr\left(\sup_{x\in\mathcal X^{+}}\left|\frac{1}{P}\sum_{t=s+1}^{P}\left(g_t(x)g_{t-s}(x) - \mathbb E\left[g_t(x)g_{t-s}(x)\right]\right)\right| \geq \frac{\varepsilon}{2 l_P}\right),$$
so that we need to show that, uniformly in $s$,
$$\Pr\left(\sup_{x\in\mathcal X^{+}}\left|\frac{1}{P}\sum_{t=s+1}^{P}\left(g_t(x)g_{t-s}(x) - \mathbb E\left[g_t(x)g_{t-s}(x)\right]\right)\right| \geq \frac{\varepsilon}{2 l_P}\right) = o\left(l_P^{-1}\right).$$
Given Assumption A2, WLOG, we can set $\mathcal X^{+} = [0,\Delta]$, so that it can be covered by $\Delta\varepsilon_P^{-1}$ balls $I_k$, $k = 1,\dots,\Delta\varepsilon_P^{-1}$, centered at $x_k$ and with radius $\varepsilon_P$. Then,
$$\sup_{x\in\mathcal X^{+}}\left|\frac{1}{P}\sum_{t=s+1}^{P}\left(g_t(x)g_{t-s}(x) - \mathbb E\left[g_t(x)g_{t-s}(x)\right]\right)\right| \leq \max_{k=1,\dots,\Delta\varepsilon_P^{-1}}\left|\frac{1}{P}\sum_{t=s+1}^{P}\left(g_t(x_k)g_{t-s}(x_k) - \mathbb E\left[g_t(x_k)g_{t-s}(x_k)\right]\right)\right|$$
$$+ \max_{k}\sup_{x\in I_k}2\left|\frac{1}{P}\sum_{t=s+1}^{P} g_{t-s}(x)\left(g_t(x)-g_t(x_k)\right) - \frac{1}{P}\sum_{t=s+1}^{P}\mathbb E\left[g_{t-s}(x)\left(g_t(x)-g_t(x_k)\right)\right]\right| + \text{smaller order terms}$$
$$= A_P + B_P.$$
Now,
$$B_P \leq \max_{k}\sup_{x\in I_k}\left|\frac{1}{P}\sum_{t=s+1}^{P} g_{t-s}(x)\left(g_t(x)-g_t(x_k)\right)\right| + \max_{k}\sup_{x\in I_k}\left|\frac{1}{P}\sum_{t=s+1}^{P}\mathbb E\left[g_{t-s}(x)\left(g_t(x)-g_t(x_k)\right)\right]\right| = B_{1,P} + B_{2,P}.$$
Given Assumption A1, and noting that by Cauchy-Schwarz,
$$B_{2,P} \leq \max_{k}\sup_{x\in I_k}\sqrt{\mathbb E\left(g_{t-s}(x)\right)^{2}}\ \max_{k}\sup_{x\in I_k}\sqrt{\mathbb E\left(g_t(x)-g_t(x_k)\right)^{2}} = O\left(\varepsilon_P^{1/2}\right),$$
for some constant $C$. Recalling that $g_t(x) = \left(1\left\{u_{j,t}\leq x\right\} - F_j(x)\right) - \left(1\left\{u_{1,t}\leq x\right\} - F_1(x)\right)$, and that each of the two terms in $g_t(x)$ stays between $-1$ and $1$,
$$B_{1,P} \leq 2\max_{k}\sup_{x\in I_k}\frac{1}{P}\sum_{t=s+1}^{P}\left|g_t(x) - g_t(x_k)\right|$$
$$\leq \frac{2}{P}\sum_{t=1}^{P} 1\left\{x_k - \varepsilon_P \leq u_{1,t} \leq x_k + \varepsilon_P\right\} + \frac{2}{P}\sum_{t=1}^{P} 1\left\{x_k - \varepsilon_P \leq u_{j,t} \leq x_k + \varepsilon_P\right\} + 2\sup_{x\in I_k}\left(\left|F_1(x) - F_1(x_k)\right| + \left|F_j(x) - F_j(x_k)\right|\right)$$
$$= O_{\mathbb P}\left(\varepsilon_P\right).$$
Hence, by the Chebyshev inequality,
$$\Pr\left(B_P \geq \frac{\varepsilon}{2 l_P}\right) = O\left(\varepsilon_P\,l_P^{3}\right) = o(1),$$
for $\varepsilon_P = o\left(l_P^{-3}\right)$.
Now, consider $A_P$. By the Lemma on page 739 of Hansen (2008), applied with the number of grid points $\Delta\varepsilon_P^{-1}$, and recalling that, given Assumption A1,
$$\mathrm{var}\left(\sum_{t=s+1}^{P}\left(g_t(x_k)g_{t-s}(x_k) - \mathbb E\left[g_t(x_k)g_{t-s}(x_k)\right]\right)\right) \leq C\,P,$$
it follows that, for some constant $C$,
$$\Pr\left(\max_{k=1,\dots,\Delta\varepsilon_P^{-1}}\left|\frac{1}{P}\sum_{t=s+1}^{P}\left(g_t(x_k)g_{t-s}(x_k) - \mathbb E\left[g_t(x_k)g_{t-s}(x_k)\right]\right)\right| \geq \frac{\varepsilon}{2 l_P}\right)$$
$$\leq \Delta\varepsilon_P^{-1}\,\Pr\left(\left|\sum_{t=s+1}^{P}\left(g_t(x_k)g_{t-s}(x_k) - \mathbb E\left[g_t(x_k)g_{t-s}(x_k)\right]\right)\right| \geq \frac{P\,\varepsilon}{2 l_P}\right) = o(1),$$
where the last equality follows because both the exponential and the polynomial tail terms in Hansen's bound are $o(1)$, given the mixing condition in Assumption A1 and the rate conditions on $\varepsilon_P$ and $l_P$.
We now consider the second term on the RHS of (7.2). Note that
$$\begin{aligned}
&\sup_{u\in\mathcal{X}^{+}}\left|\sigma^{2}(u)-\frac{1}{P}\sum_{t=1}^{P}E\big(f_t^{2}(u)\big)-2\sum_{s=1}^{l_P}w_{s}\,\frac{1}{P}\sum_{t=s+1}^{P}E\big(f_t(u)f_{t-s}(u)\big)\right|\\
&\le 2\sup_{u\in\mathcal{X}^{+}}\left|\sum_{s=1}^{l_P}(1-w_{s})\,\frac{1}{P}\sum_{t=s+1}^{P}E\big(f_t(u)f_{t-s}(u)\big)\right|\qquad(7.3)\\
&\quad+2\sup_{u\in\mathcal{X}^{+}}\left|\sum_{s=l_P+1}^{P-1}\frac{1}{P}\sum_{t=s+1}^{P}E\big(f_t(u)f_{t-s}(u)\big)\right|
\end{aligned}$$
The first term on the RHS of (7.3) is $o(1)$ by the same argument as that used in Theorem 2 of Newey and West (1987). Also, by Lemma 6.17 in White (1984), for $r>2$,
$$\big|E\big(f_t(u)f_{t-s}(u)\big)\big|\le C\,\alpha_s^{1-2/r}\,\mathrm{var}\big(f_t(u)\big)^{1/2}\,E\big\|f_t(u)\big\|_{r}$$
and
$$\begin{aligned}
&\sup_{u\in\mathcal{X}^{+}}\left|\sum_{s=l_P+1}^{P-1}\frac{1}{P}\sum_{t=s+1}^{P}E\big(f_t(u)f_{t-s}(u)\big)\right|\\
&\le\sup_{u\in\mathcal{X}^{+}}\mathrm{var}\big(f_t(u)\big)^{1/2}\,E\big\|f_t(u)\big\|_{r}\,C\sum_{s=l_P+1}^{\infty}\alpha_s^{1-2/r}=o(1)
\end{aligned}$$
as $\sum_{s}\alpha_s^{1-2/r}<\infty$, given Assumption A1, and noting that $r$ can be taken arbitrarily large because of the boundedness of $f_t(u)$.
Finally, by the same argument as that used in the proof of (7.2), for all $j$,
$$\sup_{u\in\mathcal{X}^{+}}\frac{1}{P}\sum_{t=1}^{P}\big(1\{e_{j,t}\le u\}-F_j(u)\big)=o_P(1)$$
The statement in (7.1) follows immediately.

(ii) By noting that,
$$\begin{aligned}
[e_{t}-u]^{+}-[\hat e_{t}-u]^{+}&=(e_{t}-\hat e_{t})1\{\hat e_{t}\ge u\}+(\hat e_{t}-u)\big(1\{e_{t}\ge u\}-1\{\hat e_{t}\ge u\}\big)\\
&\quad+(e_{t}-\hat e_{t})\big(1\{e_{t}\ge u\}-1\{\hat e_{t}\ge u\}\big)
\end{aligned}$$
the statement follows by the same argument as that used in part (i) of the proof.
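The positive-part decomposition used in part (ii) is a purely algebraic identity, and can be checked numerically. In the sketch below, `e` and `e_hat` are hypothetical stand-ins for the true and estimated forecast errors.

```python
import numpy as np

def pos(x):
    """Positive part [x]^+."""
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)
e = rng.normal(size=1000)                     # stand-in for true errors e_t
e_hat = e + rng.normal(scale=0.3, size=1000)  # stand-in for estimated errors
u = 0.25

lhs = pos(e - u) - pos(e_hat - u)
diff_ind = (e >= u).astype(float) - (e_hat >= u).astype(float)
rhs = ((e - e_hat) * (e_hat >= u)
       + (e_hat - u) * diff_ind
       + (e - e_hat) * diff_ind)
max_gap = np.max(np.abs(lhs - rhs))           # numerically zero
```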
Proof of Lemma 2: For notational simplicity, we suppress the $j$ subscript. Also, we suppress the superscripts $+$ and $++$, as the proof follows by an analogous argument. Note that
$$\begin{aligned}
&\sup_{u\in\mathcal{X}^{+}}\left|\widehat\sigma^{*2}(u)-E^{*}\big(\widehat\sigma^{*2}(u)\big)\right|\\
&\le\sup_{u\in\mathcal{X}^{+}}\sum_{i=1}^{b}\left|\left(\frac{1}{\sqrt{l}}\sum_{t=1}^{l}f^{*}_{(i-1)l+t}(u)\right)^{2}-E^{*}\left(\left(\frac{1}{\sqrt{l}}\sum_{t=1}^{l}f^{*}_{(i-1)l+t}(u)\right)^{2}\right)\right|\\
&=\sup_{u\in\mathcal{X}^{+}}\sum_{i=1}^{b}\left|\frac{1}{l}\sum_{t=1}^{l}\sum_{t'=1}^{l}\Big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)-E^{*}\big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)\big)\Big)\right|
\end{aligned}$$
Now,
$$\begin{aligned}
&\Pr\left(\sup_{u\in\mathcal{X}^{+}}\sum_{i=1}^{b}\left|\frac{1}{l}\sum_{t=1}^{l}\sum_{t'=1}^{l}\Big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)-E^{*}\big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)\big)\Big)\right|\ge\varepsilon_1\right)\\
&\le\sum_{i=1}^{b}\Pr\left(\sup_{u\in\mathcal{X}^{+}}\left|\frac{1}{l}\sum_{t=1}^{l}\sum_{t'=1}^{l}\Big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)-E^{*}\big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)\big)\Big)\right|\ge\frac{\varepsilon_1}{b}\right)
\end{aligned}$$
It suffices to show that, uniformly in $i$,
$$\Pr\left(\sup_{u\in\mathcal{X}^{+}}\left|\frac{1}{l}\sum_{t=1}^{l}\sum_{t'=1}^{l}\Big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)-E^{*}\big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)\big)\Big)\right|\ge\frac{\varepsilon_1}{b}\right)=o\big(b^{-1}\big)$$
This follows using the same "covering numbers" argument used in the proof of Lemma 1.
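The bootstrap second moment bounded above can be sketched with a simple moving-blocks construction. The block length, the resampling scheme, and the simulated $f_t(u)$ below are illustrative assumptions, not the paper's exact bootstrap.

```python
import numpy as np

def block_bootstrap_variance(f, block_len, rng):
    """Sketch of a block-bootstrap variance: draw whole blocks of the
    process f_t(u), scale each block sum by 1/sqrt(block_len), and
    average the squares. Returns one estimate per grid point u."""
    P = f.shape[0]
    n_blocks = P // block_len
    starts = rng.integers(0, P - block_len + 1, size=n_blocks)
    block_sums = np.stack(
        [f[s:s + block_len].sum(axis=0) / np.sqrt(block_len) for s in starts]
    )
    return (block_sums ** 2).mean(axis=0)

rng = np.random.default_rng(2)
f = rng.normal(size=(400, 11))     # stand-in for f_t(u) on an 11-point u-grid
v_star = block_bootstrap_variance(f, block_len=20, rng=rng)
```

Resampling whole blocks preserves the within-block dependence of $f_t(u)$, which is what allows the bootstrap variance to mimic the HAC variance uniformly in $u$.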
Proof of Theorem 1: We again suppress the superscripts $+$ and $++$, as the proof follows by the same argument. We need to show that the statement in Lemma A1 in the Supplemental Appendix of Andrews and Shi (2013) holds. Then, the proof of the theorem will follow using the same arguments as those used in the proof of their Theorem 1, as the proof is the same for independent and dependent observations. In fact, our set-up differs from that of Andrews and Shi (2013) only because we have dependent observations, and because we scale the statistic by a Newey-West variance estimator. For the rest of the proof, our set-up is simpler, as we can fix their instrument function at a given value, say zero. It suffices to show that:

(i) $\widehat\nu_{P}(\cdot)\Rightarrow\nu(\cdot)$, as a process indexed by $u\in\mathcal{X}^{+}$, where $\nu(\cdot)$ is a zero-mean, $(k-1)$-dimensional Gaussian process with covariance kernel given by $\Sigma(u,u')$;

(ii) $\sup_{u,u'\in\mathcal{X}^{+}}\big\|\widehat\Sigma(u,u')-\Sigma(u,u')\big\|=o_P(1)$.

Now, statement (ii) follows directly from Lemma 1. It remains to show that (i) holds. The key difference between the independent and the dependent cases is that in the former we can rely on the concept of manageability, while in the latter we cannot. Nevertheless, (i) follows if we can show that $\widehat\nu_{P}(\cdot)$ satisfies an empirical-process central limit theorem. Given A1-A3, this follows from Lemma A2 in Jin, Corradi and Swanson (2017).
Proof of Theorem 2: (i) For notational simplicity, we omit the superscript $+$. The proof of this theorem mirrors the proof of Theorem 2(a) in the Supplement of Andrews and Shi (2013). Let $c_{0,1-\alpha}(\varphi)$ be the critical value of $T^{\dagger}_{P}$, as defined in (3.5). Given Theorem 1(i), it follows that, for all $\delta>0$,
$$\limsup_{P\to\infty}\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\big(T_{P}\ge c_{0,1-\alpha}(\varphi)+\delta\big)\le\alpha$$
The statement follows if we can show that
$$\limsup_{P\to\infty}\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(c^{*}_{0,1-\alpha}\big(\widehat\varphi^{*}\big)\le c_{0,1-\alpha}\big(\widehat\varphi^{*}\big)\Big)=0\qquad(7.4)$$
with $c_{0,1-\alpha}(\widehat\varphi^{*})$ defined as $c_{0,1-\alpha}(\varphi)$, but with $\widehat\varphi^{*}$ an argument of this function rather than $\varphi(\cdot)$; and if we can show that
$$\limsup_{P\to\infty}\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(c_{0,1-\alpha}\big(\widehat\varphi^{*}\big)\le c_{0,1-\alpha}(\varphi)\Big)=0\qquad(7.5)$$
For $P\to\infty$ and $\eta\to0$, $\beta_{P}\to\infty$ and $\beta_{P}\kappa_{P}^{-1}\to0$,
$$\begin{aligned}
&\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(c^{*}_{0,1-\alpha}\big(\widehat\varphi^{*}\big)\le c_{0,1-\alpha}\big(\widehat\varphi^{*}\big)\Big)\\
&\le\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(-\widehat\varphi^{*}_{j}(u)\le\varphi_{j}(u)\text{ for some }u\in\mathcal{X}^{+}\text{ and some }j=2,\ldots,k\Big)\\
&\le\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(\varphi_{j}(u)>\kappa_{P}\beta_{P}^{-1}\text{ AND }-\eta\le\widehat\varphi^{*}_{j}(u)\text{ for some }u\in\mathcal{X}^{+}\text{ and }j=2,\ldots,k\Big)\\
&\le\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(\kappa_{P}^{-1}P^{1/2}\widehat\sigma_{j}^{-1}(u)\big(\bar m_{j}(u)-m_{j}(u)\big)+\kappa_{P}^{-1}P^{1/2}\widehat\sigma_{j}^{-1}(u)m_{j}(u)>\beta_{P}^{-1}\\
&\qquad\qquad\text{ AND }-\eta\le\widehat\varphi^{*}_{j}(u)\text{ for some }u\in\mathcal{X}^{+}\text{ and }j=2,\ldots,k\Big)\\
&\le\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(-\eta+\kappa_{P}^{-1}P^{1/2}\widehat\sigma_{j}^{-1}(u)m_{j}(u)>\beta_{P}^{-1}\text{ AND }-\eta\le\widehat\varphi^{*}_{j}(u)\text{ for some }u\in\mathcal{X}^{+}\text{ and }j=2,\ldots,k\Big)\\
&\quad+\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(\kappa_{P}^{-1}P^{1/2}\widehat\sigma_{j}^{-1}(u)\big(\bar m_{j}(u)-m_{j}(u)\big)>\eta\text{ for some }u\in\mathcal{X}^{+}\text{ and }j=2,\ldots,k\Big)\\
&=o(1)
\end{aligned}$$
This establishes that (7.4) holds. Finally, (7.5) follows from Lemma 1 and Lemma 2.
(ii) Recall that $c^{*}_{0,1-\alpha}(\widehat\varphi^{*})$ is the $(1-\alpha)$-percentile of $T^{*}_{P}$, as defined in (3.11); and define $c_{0,1-\alpha}(\bar\nu)$ to be the $(1-\alpha)$-percentile of $\bar T_{P}$, where
$$\bar T_{P}=\max_{u\in\mathcal{X}^{+}}\sum_{j=2}^{k}\left(\max\left\{0,\frac{\bar\nu_{j}(u)-\varphi_{j}(u)}{\sqrt{\widehat\omega_{j}(u)}}\right\}\right)^{2}$$
with $\bar\nu=(\bar\nu_{2},\ldots,\bar\nu_{k})'$ a $(k-1)$-dimensional Gaussian process, with mean zero and covariance $\widehat\Omega(u,u')=\widehat D^{-1/2}(u)\Sigma(u,u')\widehat D^{-1/2}(u')$. Finally, let $\nu=(\nu_{2},\ldots,\nu_{k})'$ be a $(k-1)$-dimensional Gaussian process, with mean zero and covariance $\Omega(u,u')=D^{-1/2}(u)\Sigma(u,u')D^{-1/2}(u')$. We first need to show that
$$c^{*}_{0,1-\alpha}\big(\widehat\varphi^{*}\big)-c_{0,1-\alpha}\big(\bar\nu\big)=o_P(1)\qquad(7.6)$$
and then to prove that the statement holds when replacing $c^{*}_{0,1-\alpha}(\widehat\varphi^{*})$ with $c_{0,1-\alpha}(\bar\nu)$.

From Lemma 2, $\widehat\Sigma^{*}(u,u')-\widehat\Sigma(u,u')=o_{P^{*}}(1)$, and so $\widehat\Omega^{*}(u,u')-\widehat\Omega(u,u')=o_{P^{*}}(1)$. Then, by Theorem 2.3 in Peligrad (1998),
$$\widehat\nu^{*}_{P}\overset{*}{\Longrightarrow}\bar\nu,\quad\text{a.s.-}\omega$$
where $\overset{*}{\Longrightarrow}$ denotes weak convergence, conditional on the sample. As $\widehat\nu_{P}\Longrightarrow\nu$, (7.6) follows. Given Assumption A4, by Lemma B3 in the Supplement of Andrews and Shi (2013), the distribution of $T^{\dagger}_{\infty}$, as defined in (3.6), is continuous. It is also strictly increasing, and its $(1-\alpha)$-quantile is strictly positive, for all $\alpha<1/2$. The statement then follows by the same argument as that used in the proof of Theorem 2(b) in the Supplement of Andrews and Shi (2013).
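Operationally, the bootstrap critical value compared in (7.6) is just an empirical percentile of the resampled statistics. A minimal sketch follows; the chi-square draws standing in for the $T^{*}_{P}$ replications and the small uplift `eta` (common in moment-selection procedures) are illustrative assumptions.

```python
import numpy as np

def bootstrap_critical_value(boot_stats, alpha=0.05, eta=1e-6):
    """(1 - alpha) empirical percentile of the bootstrap statistics,
    plus a small uplift eta."""
    return np.quantile(boot_stats, 1.0 - alpha) + eta

rng = np.random.default_rng(3)
boot_stats = rng.chisquare(df=3, size=999)   # stand-in for T*_P draws
cv = bootstrap_critical_value(boot_stats)
sample_stat = 12.0                           # stand-in for the sample statistic
reject_null = bool(sample_stat > cv)         # one-sided rejection rule
```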
(iii)-(iv) follow by the same arguments as those used in the proofs of (i) and (ii), respectively. In the case of $T^{+}_{P}$, we rely on the stochastic equicontinuity of $\frac{1}{\sqrt{P}}\sum_{t=1}^{P}\big(1\{\hat e_{1,t}\le u\}-1\{e_{1,t}\le u\}\big)$ as $|\hat e_{1,t}-e_{1,t}|\to0$. When considering $T^{++}_{P}$, we need to ensure the stochastic equicontinuity of $\frac{1}{\sqrt{P}}\sum_{t=1}^{P}\big((\hat e_{1,t}-u)^{+}-(e_{1,t}-u)^{+}\big)$. Now,
$$\frac{1}{\sqrt{P}}\sum_{t=1}^{P}\big((\hat e_{1,t}-u)^{+}-(e_{1,t}-u)^{+}\big)=\frac{1}{\sqrt{P}}\sum_{t=1}^{P}(\hat e_{1,t}-e_{1,t})1\{\hat e_{1,t}\ge u\}+\frac{1}{\sqrt{P}}\sum_{t=1}^{P}(e_{1,t}-u)\big(1\{\hat e_{1,t}\ge u\}-1\{e_{1,t}\ge u\}\big)$$
which, given Assumption A2, is stochastically equicontinuous, by the same argument as that used for $T^{+}_{P}$. Hence, Theorem 2.3 in Peligrad (1998) also holds in this case.
Proof of Theorem 3: (i) Without loss of generality, let $\bar{\mathcal{X}}^{+}=\{u\in\mathcal{X}^{+}:m^{+}_{2}(u)>0\}$, and note that, for all $u\in\bar{\mathcal{X}}^{+}$,
$$\max\left\{0,\frac{\sqrt{P}\,\bar m^{+}_{2}(u)}{\widehat\sigma^{+}_{2}(u)}\right\}=\frac{\sqrt{P}\,\bar m^{+}_{2}(u)}{\widehat\sigma^{+}_{2}(u)}$$
Thus,
$$\begin{aligned}
T^{+}_{P}&=\int_{\bar{\mathcal{X}}^{+}}\sum_{j=2}^{k}\left(\max\left\{0,\frac{\sqrt{P}\,\bar m^{+}_{j}(u)}{\widehat\sigma^{+}_{j}(u)}\right\}\right)^{2}dG(u)+\int_{\mathcal{X}^{+}\setminus\bar{\mathcal{X}}^{+}}\sum_{j=2}^{k}\left(\max\left\{0,\frac{\sqrt{P}\,\bar m^{+}_{j}(u)}{\widehat\sigma^{+}_{j}(u)}\right\}\right)^{2}dG(u)\\
&=\int_{\bar{\mathcal{X}}^{+}}\left(\frac{\sqrt{P}\,\bar m^{+}_{2}(u)}{\widehat\sigma^{+}_{2}(u)}\right)^{2}dG(u)+\int_{\bar{\mathcal{X}}^{+}}\sum_{j=3}^{k}\left(\max\left\{0,\frac{\sqrt{P}\,\bar m^{+}_{j}(u)}{\widehat\sigma^{+}_{j}(u)}\right\}\right)^{2}dG(u)\\
&\quad+\int_{\mathcal{X}^{+}\setminus\bar{\mathcal{X}}^{+}}\sum_{j=2}^{k}\left(\max\left\{0,\frac{\sqrt{P}\,\bar m^{+}_{j}(u)}{\widehat\sigma^{+}_{j}(u)}\right\}\right)^{2}dG(u)\\
&=I_{P}+II_{P}+III_{P}
\end{aligned}$$
Now, $I_{P}$ diverges to infinity with probability approaching one, while Theorem 1 ensures that $II_{P}$ and $III_{P}$ are $O_{P}(1)$. Thus, $T^{+}_{P}$ diverges to infinity. As $T^{*+}_{P}$ is $O_{P^{*}}(1)$, conditional on the sample, the statement follows.

(ii) Note that $T^{++}_{P}$ can be treated exactly as $T^{+}_{P}$.
Proof of Theorem 4: (i) Define $T^{\dagger+}_{\infty}$ as in (3.6), but with the vector $m^{+}_{\infty}(u)$ having at least one component strictly bounded away from zero, and finite, for all $u\in\bar{\mathcal{X}}^{+}$. Let $\mathcal{P}_{1,P}$ denote the set of probabilities under the sequence of local alternatives. We have that, for all $x>0$,
$$\limsup_{P\to\infty}\sup_{F\in\mathcal{P}_{1,P}}\Big[\Pr_{F}\big(T^{+}_{P}\ge x\big)-\Pr\big(T^{\dagger+}_{\infty}\ge x\big)\Big]=0$$
and the distribution of $T^{\dagger+}_{\infty}$ is continuous at its $(1-\alpha)$-quantile, for all $0<\alpha<1/2$. Also, note that, for all $u\in\bar{\mathcal{X}}^{+}$, $\varphi^{+}_{j}(u)=0$. The statement then follows by the same argument as that used in the proof of Theorem 2(ii). (ii) By the same argument as in part (i).
8 References
Aiolfi, M., C. Capistrán, and A. Timmermann (2011). Forecast Combinations. In M.P. Clements and
D.F. Hendry (eds.), Oxford Handbook of Economic Forecasting, pp. 355-390, Oxford University
Press, Oxford.
Andrews, D.W.K. (1991). Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation. Econometrica, 59, 817-858.
Andrews, D.W.K. and D. Pollard (1994). An Introduction to Functional Central Limit Theorems for
Dependent Stochastic Processes. International Statistical Review, 62, 119-132.
Andrews, D.W.K. and X. Shi (2013). Inference Based on Conditional Moment Inequalities. Econometrica,
81, 609-666.
Andrews, D.W.K. and X. Shi (2017). Inference Based on Many Conditional Moment Inequalities. Journal
of Econometrics, 196, 275-287.
Barendse, S. and A.J. Patton (2019). Comparing Predictive Accuracy in the Presence of a Loss Function
Shape Parameter. Working Paper, Duke University.
Bierens H.J. (1982). Consistent Model Specification Tests. Journal of Econometrics, 20, 105-134.
Bierens H.J. (1990). A Consistent Conditional Moment Test of Functional Form. Econometrica, 58, 1443-1458.
Clark, T. and M. McCracken (2013). Advances in Forecast Evaluation. In G. Elliott, C.W.J. Granger
and A. Timmermann (eds.), Handbook of Economic Forecasting Vol. 2, pp. 1107-1201, Elsevier,
Amsterdam.
Corradi, V. and N.R. Swanson (2003). Predictive Density Evaluation. In G. Elliott, C.W.J. Granger
and A. Timmermann (eds.), Handbook of Economic Forecasting Vol. 1, pp. 197-284, Elsevier,
Amsterdam.
Corradi, V. and N.R. Swanson (2007). Nonparametric Bootstrap Procedures for Predictive Inference
Based on Recursive Estimation Schemes. International Economic Review, 48, 67-109.
Corradi, V. and N. R. Swanson (2013). A Survey of Recent Advances in Forecast Accuracy Comparison
Testing, with an Extension to Stochastic Dominance. In X. Chen and N.R. Swanson (eds.), Causality,
Prediction, and Specification Analysis: Recent Advances and Future Directions, Essays in
honor of Halbert L. White, Jr., pp. 121-144, Springer, New York.
Croushore, D. (1993). Introducing: The Survey of Professional Forecasters. Federal Reserve Bank of Philadelphia Business Review, November-December, 3-15.
Diebold, F. X. and Mariano, R. S. (1995). Comparing Predictive Accuracy. Journal of Business and
Economic Statistics, 13, 253-263.
Diebold, F.X. and M. Shin (2015). Assessing Point Forecast Accuracy by Stochastic Loss Distance.
Economics Letters, 130, 37-38.
Diebold, F.X. and M. Shin (2017). Assessing Point Forecast Accuracy by Stochastic Error Distance.
Econometric Reviews, 36, 588-598.
Elliott, G., I. Komunjer and A. Timmermann (2005). Estimation and Testing of Forecast Rationality
under Flexible Loss. Review of Economic Studies, 72, 1107-1125.
Elliott, G., I. Komunjer and A. Timmermann (2008). Biases in Macroeconomic Forecasts: Irrationality
of Asymmetric Loss? Journal of the European Economic Association, 6, 122-157.
Fair, R.C. and R.J. Shiller (1990). Comparing Information in Forecasts from Econometric Models. Amer-
ican Economic Review, 80, 375-389.
Genre, V., G. Kenny, A. Meyler, and A. Timmermann (2013). Combining the Forecasts in the ECB
Survey of Professional Forecasters: Can Anything Beat the Simple Average? International Journal of
Forecasting, 29, 108-121.
Gneiting, T. (2011). Making and Evaluating Point Forecasts. Journal of the American Statistical Associ-
ation, 106, 746-762.
Granger, C. W. J. (1999). Outline of Forecast Theory using Generalized Cost Functions. Spanish
Economic Review, 1, 161-173.
Hansen, B.E. (2008). Uniform Convergence Rates for Kernel Estimators with Dependent Data. Econo-
metric Theory, 24, 726-748.
Holm, S. (1979). A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of
Statistics, 6, 65-70.
Jin, S., V. Corradi and N.R. Swanson (2017). Robust Forecast Comparison. Econometric Theory, 33,
1306-1351.
Lahiri, K., H. Peng, and Y. Zhao (2015). Testing the Value of Probability Forecasts for Calibrated
Combining. International Journal of Forecasting, 31, 113-129.
Lahiri, K., H. Peng, and Y. Zhao (2017). Online Learning and Forecast Combination in Unbalanced
Panels. Econometric Reviews, 36, 257-288.
Linton, O., K. Song and Y.J. Whang (2010). An Improved Bootstrap Test of Stochastic Dominance.
Journal of Econometrics, 154, 186-202.
McCracken, M.W. (2000). Robust Out-of-Sample Inference. Journal of Econometrics, 9