Robust Forecast Superiority Testing with an Application to Assessing Pools of Expert Forecasters*
Valentina Corradi (1), Sainan Jin (2) and Norman R. Swanson (3)
(1) University of Surrey, (2) Singapore Management University, and (3) Rutgers University
September 2020
Abstract
We develop a forecast superiority testing methodology which is robust to the choice of loss function. Following Jin, Corradi and Swanson (JCS: 2017), we rely on a mapping between generic loss forecast evaluation and stochastic dominance principles. However, unlike JCS tests, which are not uniformly valid, and have correct asymptotic size only under the least favorable case, our tests are uniformly asymptotically valid and non-conservative. These properties are derived by first establishing uniform convergence (over the error support) of HAC variance estimators and of their bootstrap counterparts, and by extending the asymptotic validity of generalized moment selection tests to the case of non-vanishing recursive parameter estimation error. Monte Carlo experiments indicate good finite sample performance of the new tests, and an empirical illustration suggests that prior forecast accuracy matters in the Survey of Professional Forecasters. Namely, for our longest forecast horizon (4 quarters ahead), selecting pools of expert forecasters based on prior accuracy results in ensemble forecasts that are superior to those based on forming simple averages and medians from the entire panel of experts.
Keywords: Robust Forecast Evaluation, Many Moment Inequalities, Bootstrap, Estimation Error, Combination Forecasts, Survey of Professional Forecasters.
_________________________
*Valentina Corradi, School of Economics, University of Surrey, Guildford, Surrey, GU2 7XH, UK, [email protected]; Sainan Jin, School of Economics, Singapore Management University, 90 Stamford Road, Singapore 178903, [email protected]; and Norman R. Swanson, Department of Economics, Rutgers University, 75 Hamilton Street, New Brunswick, NJ 08901, USA, [email protected]. We are grateful to Kevin Lee, Patrick Marsh, Luis Martins, James Mitchell, Alessia Paccagnini, Paulo Parente, Ivan Petrella, Valerio Poti, Barbara Rossi, Simon Van Norden, Claudio Zoli, and to the participants at the 2018 NBER-NSF Time Series Conference, the 2016 European Meeting of the Econometric Society, the Conference for 50 Years of Keynes College at Kent University, and seminars at Mannheim University, the University of Nottingham, University College Dublin, Instituto Universitário de Lisboa, Università di Verona and the Warwick Business School for useful comments and suggestions. Additionally, many thanks are owed to Mingmian Cheng for excellent research assistance.
1 Introduction
Forecast accuracy is typically measured in terms of a given loss function, with quadratic and absolute loss
being the most common choices. In recent years, there has been a growing discussion about the choice
of the “right” loss function. Gneiting (2011) stresses the importance of matching the quantity to be
forecasted and the choice of loss function (or scoring rule). The latter is said to be consistent for a given
statistical functional (e.g. the mean or the median), if expected loss is minimized when such a functional
is used. In a recent paper, Patton (2019) shows that if forecasts are based on nested information sets and
on correctly specified models, then in the absence of estimation error, forecast ranking is robust to the
choice of loss function within the class of consistent functions. On the other hand, if any of the above
conditions fail, then model ranking is dependent on the specific loss function used. This is an important
finding, given that it is natural for researchers to focus on the comparison of multiple misspecified models;
immediately implying that model rankings are loss function dependent.
In summary, given the importance of loss function dependence when comparing forecast accuracy, an
issue of key concern to empirical economists is the construction of loss function robust forecast accuracy
tests. A loss function free forecast evaluation criterion of interest should be based on the distribution
of raw forecast errors. Heuristically, one can define the best forecasting model as that producing errors
having a step cumulative distribution function that is equal to zero on the negative real line and equal to
one on the positive real line. Diebold and Shin (2015, 2017) build on this idea, and suggest choosing the
model for which the cumulative distribution of the forecast errors is closest to a step function. This idea
is also discussed in Corradi and Swanson (2013). Jin, Corradi and Swanson (JCS: 2017) establish a one to
one mapping between generalized loss (GL) forecast superiority and first order stochastic dominance, as
well as a one to one mapping between convex loss function (CL) and second order stochastic dominance.1
In particular, they show that the “best” model (regardless of loss function) according to a GL (CL)
function is the one which is first (second) order stochastically dominated on the negative real line and
first (second) order stochastically dominant on the positive real line, when comparing forecast errors. In
this sense, JCS (2017) establish that loss function free tests for forecast superiority can be framed in
terms of tests for stochastic dominance. In this paper, we note that tests for stochastic dominance can be
seen as tests for infinitely many moment inequalities. This allows us to utilize tools recently developed
by Andrews and Shi (2013, 2017) to derive asymptotically uniformly valid and non-conservative forecast superiority tests. Importantly, these tests improve over those introduced in JCS (2017), as the latter are asymptotically non-conservative only in the least favorable case under the null (i.e., when all weak moment inequalities hold with equality). Needless to say, controlling for slack inequalities is crucial when there are infinitely many of them.
1A loss function is a GL function if it is monotonically non-decreasing as the error moves away from zero. Additionally,
CL functions are the subset of convex GL functions.
The implementation of our tests requires that sample moments be standardized by an estimator of the standard deviation. However, forecast errors are typically not martingale difference sequences, either because they are based on dynamically misspecified models or, in the case of subjective predictions, because forecasters do not efficiently use all of the available information. Hence, we require heteroskedasticity
and autocorrelation (HAC) robust variance estimators. In our set-up, each variance estimator depends
on a specific point in the forecasting error support. Thus, in order to introduce our new tests for
forecast superiority, we must establish the consistency of HAC variance estimators uniformly over the
error support. Moreover, in order to carry out inference, using our tests, we also establish uniform
convergence of the HAC variance estimator bootstrap counterparts. Because of the presence of the
lag truncation parameter, uniform convergence of HAC estimators and of their bootstrap analogs does
not follow straightforwardly from uniform convergence of (kernel) nonparametric estimators. To the
best of our knowledge this contribution is a novel addition to the vast literature on HAC covariance
matrix estimation. In the sequel, we focus on the case of judgmental forecasts, in which there is no
parameter estimation error. In a supplemental online appendix, we consider the case of predictions based
on estimated models, and extend all of our results to the case of non vanishing estimation error. This is
accomplished under a recursive estimation scheme, by extending the recursive block bootstrap introduced
in Corradi and Swanson (2007).
Linton, Song and Whang (2010) also develop tests for stochastic dominance which are correctly
asymptotically sized over the boundary of the null, for the pairwise comparison case. A key role in their
asymptotic analysis is played by the contact set (i.e., the set of points over which the two CDFs are equal).
However, the notion of contact set does not extend straightforwardly to the multiple comparison case
considered in this paper. It should also be noted that other papers have addressed the problem of forecast
evaluation in the absence of full specification of the loss function. For example, Patton and Timmermann
(2007) have studied forecast optimality under only generic assumptions on the loss function. However,
they do not address the issue of forecast ranking under (partially) unknown loss. More recently, Barendse
and Patton (2019) introduce forecast multiple comparison under loss functions which are specified only
up to a shape parameter.
We assess the forecast superiority testing methodology discussed in this paper via a series of Monte
Carlo experiments. Simulation results show that our new tests are in some key cases much more accurately
sized and have much higher power than JCS tests. For example, in size experiments where DGPs contain
some models which are worse than the benchmark model, our new tests are substantially better sized
than the tests of JCS (2017). Additionally, our new tests exhibit notable power gains, relative to JCS
tests, in power experiments where DGPs contain some alternative models that dominate the benchmark,
while others are strictly dominated. These findings are as expected, given that JCS tests are undersized,
while our new tests are asymptotically non conservative.
In an empirical illustration, we apply our testing procedure to the Survey of Professional Forecasters
(SPF) dataset. In the SPF, participants are told which variables to forecast and whether they should
provide a point forecast or instead a probability interval, but they are not given a loss function (see
Croushore (1993) for a detailed description of the SPF). In the context of analyzing the predictive content of
the SPF, many papers find evidence of the usefulness of forecast combinations constructed using individual
SPF predictions, under quadratic or absolute loss. For example, Zarnowitz and Braun (1993) find that
using the mean or median provides a consensus forecast with lower average errors than most individual
forecasts. Aiolfi, Capistrán, and Timmermann (2011) and Genre, Kenny, Meyler, and Timmermann
(2013) find that equal weighted averages of SPF and ECB (European Central Bank) SPF forecasts often
outperform model based forecasts. In our illustration, we depart from these papers by noting that the
SPF naturally lends itself to loss function free forecast superiority testing, since participants are not given
loss functions. In light of this, we apply our new tests, and show that forecast averages (and medians)
from small pools of survey participants ranked according to recent forecast performance are preferred to
forecast averages based on the entire pool of experts, for our longest forecast horizon (1-year ahead). We
thus conclude that simple average and median forecasts can in some cases be “beaten”, regardless of loss
function.
The rest of the paper is organized as follows. Section 2 outlines the set-up and introduces our new
tests. Section 3 establishes the asymptotic properties of the tests in the context of generalized moment
selection. Section 4 contains the results of our Monte Carlo experiments, and Section 5 contains the
results of our analysis of GDP growth forecasts from the SPF. Finally, Section 6 provides a number of
concluding remarks. Proofs are gathered in an appendix. In a supplemental appendix, we establish the
asymptotic properties of our new tests in the context of non-vanishing parameter estimation error, for
the recursive estimation schemes.
2 Forecast Superiority Tests
Assume that we have a time series of forecast errors for each model/forecaster. Namely, we observe $u_{j,t}$, for $j = 1, \dots, m$ and $t = 1, \dots, T$, where $m$ denotes the number of models/forecasters, and $T$ denotes the number of observations. As stated earlier, we focus on the case in which we can ignore estimation error, such as when forecasts are judgmental or subjective. Surveys including the SPF are leading examples of judgmental forecasts. The case of non-vanishing recursive estimation error is analyzed in the supplemental appendix. Hereafter, the sequence $u_{1,t}$, $t = 1, \dots, T$, is called the "benchmark". In the context of the SPF, an example of a relevant benchmark against which to compare all other sequences is the consensus forecast, constructed as the simple arithmetic average of individual forecasts in the survey. Our goal is to test whether there exists some competing forecast that is superior to the benchmark for any loss function, $f$, satisfying Assumption A0.
Assumption A0: (i) $f \in \mathcal{L}^{GL}$ if $f: \mathbb{R} \rightarrow \mathbb{R}^{+}$ is continuously differentiable, except for finitely many points, with derivative $f'$ such that $f'(u) \leq 0$ for all $u \leq 0$ and $f'(u) \geq 0$ for all $u \geq 0$; (ii) $f \in \mathcal{L}^{CL}$ if $f$ is a convex function belonging to $\mathcal{L}^{GL}$.

Note that $\mathcal{L}^{GL}$ includes most of the loss functions commonly used by practitioners, including asymmetric loss, and it basically coincides with the notion of generalized loss in Granger (1999). The only restriction is that the loss depends solely on the forecast errors. This rules out the class of loss functions considered in, e.g., Section 3 of Patton and Timmermann (2007).

Hereafter, let $F_j(x)$ denote the cumulative distribution function (CDF) of the forecast error $u_{j,t}$. Also, define $s(x) = 1$ if $x \geq 0$, $s(x) = -1$ if $x < 0$. Propositions 2.2 and 2.3 in JCS (2017) establish the following results.

1. For any $f \in \mathcal{L}^{GL}$: $E(f(u_{1,t})) \leq E(f(u_{2,t}))$ if and only if $(F_2(x) - F_1(x)) s(x) \leq 0$ for all $x \in \mathcal{X}$.

2. For any $f \in \mathcal{L}^{CL}$: $E(f(u_{1,t})) \leq E(f(u_{2,t}))$ if and only if
$\left( \int_{-\infty}^{x} (F_1(v) - F_2(v)) \, dv \, 1(x < 0) + \int_{x}^{\infty} (F_2(v) - F_1(v)) \, dv \, 1(x \geq 0) \right) \leq 0$ for all $x \in \mathcal{X}$.
The first statement establishes a mapping between GL forecast superiority and first order stochastic dominance (FOSD). In particular, $u_{1,t}$ is not GL dominated by $u_{2,t}$ if $F_1(x)$ lies below $F_2(x)$ on the negative real line, and lies above $F_2(x)$ on the positive real line. Indeed, this ensures that we choose the forecast whose CDF has larger mass around zero. Likewise, the second statement establishes a mapping between CL superiority and second order stochastic dominance.
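To fix ideas, the empirical analogue of the first mapping can be checked directly on data. The sketch below is our own illustration, not part of the paper's testing procedure: function names are hypothetical, there is no studentization and no inference, and the grid-based comparison of empirical CDFs via the sign function $s(\cdot)$ is only meant to make the dominance condition concrete.

```python
import numpy as np

def s(x):
    """Sign function from the text: s(x) = 1 for x >= 0, s(x) = -1 for x < 0."""
    return np.where(x >= 0, 1.0, -1.0)

def ecdf(errors, grid):
    """Empirical CDF of a sample of forecast errors, evaluated on a grid."""
    return (errors[None, :] <= grid[:, None]).mean(axis=1)

def gl_superior_in_sample(e1, e2, n_grid=200):
    """Empirical analogue of the GL condition (F2(x) - F1(x)) s(x) <= 0 for
    all x: returns True when forecast 1's error CDF has (weakly) more mass
    around zero than forecast 2's, at every grid point."""
    lo = min(e1.min(), e2.min())
    hi = max(e1.max(), e2.max())
    grid = np.linspace(lo, hi, n_grid)
    diff = (ecdf(e2, grid) - ecdf(e1, grid)) * s(grid)
    return bool(np.all(diff <= 1e-12))

rng = np.random.default_rng(0)
e1 = rng.normal(0.0, 1.0, 5000)  # benchmark errors
e2 = 3.0 * e1                    # competitor: same errors, scaled up threefold
print(gl_superior_in_sample(e1, e2))  # True: e1 is GL-superior in sample
print(gl_superior_in_sample(e2, e1))  # False
```

Scaling the same error sample makes the in-sample dominance exact, which is why the weak inequality holds at every grid point without sampling noise overturning it near zero.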
In this framework, it follows that testing for loss function robust forecast superiority involves testing:
$H_0^{GL}: \max_{j=2,\dots,m} \left( E(f(u_{1,t})) - E(f(u_{j,t})) \right) \leq 0$, for all $f \in \mathcal{L}^{GL}$,   (2.1)

versus

$H_A^{GL}: \max_{j=2,\dots,m} \left( E(f(u_{1,t})) - E(f(u_{j,t})) \right) > 0$, for some $f \in \mathcal{L}^{GL}$,   (2.2)

with $H_0^{CL}$ and $H_A^{CL}$ defined analogously, replacing $\mathcal{L}^{GL}$ with $\mathcal{L}^{CL}$.

Hereafter, let $\mathcal{X} = \mathcal{X}^- \cup \mathcal{X}^+$ be the union of the supports of $u_{1,t}, \dots, u_{m,t}$. Given the equivalence between GL (CL) forecast superiority and first (second) order stochastic dominance, we can restate $H_0^{GL}$, $H_0^{CL}$, $H_A^{GL}$ and $H_A^{CL}$ as

$H_0^{GL} = H_0^{GL,-} \cap H_0^{GL,+}:$
$\left( F_1(x) - F_j(x) \leq 0, \text{ for } j = 2, \dots, m \text{ and for all } x \in \mathcal{X}^- \right) \cap \left( F_j(x) - F_1(x) \leq 0, \text{ for } j = 2, \dots, m \text{ and for all } x \in \mathcal{X}^+ \right)$

versus

$H_A^{GL} = H_A^{GL,-} \cup H_A^{GL,+}:$
$\left( F_1(x) - F_j(x) > 0, \text{ for some } j = 2, \dots, m \text{ and for some } x \in \mathcal{X}^- \right) \cup \left( F_j(x) - F_1(x) > 0, \text{ for some } j = 2, \dots, m \text{ and for some } x \in \mathcal{X}^+ \right).$

Analogously,

$H_0^{CL} = H_0^{CL,-} \cap H_0^{CL,+}:$
$\left( \int_{-\infty}^{x} (F_1(v) - F_j(v)) \, dv \leq 0, \text{ for } j = 2, \dots, m \text{ and for all } x \in \mathcal{X}^- \right) \cap \left( \int_{x}^{\infty} (F_j(v) - F_1(v)) \, dv \leq 0, \text{ for } j = 2, \dots, m \text{ and for all } x \in \mathcal{X}^+ \right)$

versus

$H_A^{CL} = H_A^{CL,-} \cup H_A^{CL,+}:$
$\left( \int_{-\infty}^{x} (F_1(v) - F_j(v)) \, dv > 0, \text{ for some } j = 2, \dots, m \text{ and for some } x \in \mathcal{X}^- \right) \cup \left( \int_{x}^{\infty} (F_j(v) - F_1(v)) \, dv > 0, \text{ for some } j = 2, \dots, m \text{ and for some } x \in \mathcal{X}^+ \right).$
It is immediate to see that $H_0^{GL}$ and $H_0^{CL}$ can be written as the intersection of $(m-1)$ moment inequalities, which have to hold uniformly over $\mathcal{X}$. This gives rise to an infinite number of moment conditions. Andrews and Shi (2013) develop tests for conditional moment inequalities, and as is well known in the literature on consistent specification testing (e.g., see Bierens (1982, 1990)), a finite number of conditional moments can be transformed into an infinite number of unconditional moments. The same is true in the case of weak inequalities. Andrews and Shi (2017) consider tests for conditional stochastic dominance, which are then characterized by an infinite number of conditional moment inequalities, and so by a "twice" infinite number of unconditional inequalities. Recalling that our interest is in testing GL or CL forecast superiority, as in (2.1) and (2.2), we confine our attention to unconditional testing of stochastic dominance.
Because of the discontinuity at zero in the tests, $H_0^{GL,+}$ ($H_0^{CL,+}$) and $H_0^{GL,-}$ ($H_0^{CL,-}$) should be tested separately, and then one can use Holm (1979) bounds to control the two resulting p-values (see Rules TG and TC in JCS (2017)). In the sequel, for the sake of brevity, but without loss of generality, we focus our discussion on testing $H_0^{GL,+}$ versus $H_A^{GL,+}$, and $H_0^{CL,+}$ versus $H_A^{CL,+}$. However, when defining statistics, some discussion of the statistics associated with the case where $x \in \mathcal{X}^-$ is also given, when needed for clarity of exposition.
We begin by testing GL forecast superiority. Let $m^+(x) = \left( m_2^+(x), \dots, m_m^+(x) \right)'$, with $m_j^+(x) = F_j(x) - F_1(x)$, for $x \geq 0$. Define the empirical analog of $m^+(x)$ as $\widehat{m}_T^+(x) = \left( \widehat{m}_{2,T}^+(x), \dots, \widehat{m}_{m,T}^+(x) \right)'$, and for $x \geq 0$ let

$\widehat{m}_{j,T}^+(x) = \widehat{F}_j(x) - \widehat{F}_1(x),$   (2.3)

where $\widehat{F}_j(x)$ denotes the empirical CDF of $u_{j,t}$. Similarly, let $q^+(x) = \left( q_2^+(x), \dots, q_m^+(x) \right)'$, with $q_j^+(x) = \int_x^\infty (F_j(v) - F_1(v)) \, dv \, 1(x \geq 0)$. Define the empirical analog of $q^+(x)$ as $\widehat{q}_T^+(x) = \left( \widehat{q}_{2,T}^+(x), \dots, \widehat{q}_{m,T}^+(x) \right)'$, and let

$\widehat{q}_{j,T}^+(x) = \int_x^\infty \left( \widehat{F}_j(v) - \widehat{F}_1(v) \right) dv \, 1(x \geq 0) = \frac{1}{T} \sum_{t=1}^T \left( [(u_{1,t} - x)]^+ - [(u_{j,t} - x)]^+ \right),$   (2.4)

where $[y]^+ = \max\{0, y\}$. Further, define

$\Sigma^+(x, x') = \operatorname{acov}\left( \sqrt{T} \widehat{m}_T^+(x), \sqrt{T} \widehat{m}_T^+(x') \right)$   (2.5)

and

$\bar{\Sigma}_T^+(x, x') = \widehat{\Sigma}_T^+(x, x') + \varepsilon I_{m-1},$   (2.6)

where $\varepsilon > 0$, and where $\widehat{\Sigma}_T^+(x, x')$ is the sample analog of $\Sigma^+(x, x')$. In (2.6), the role of the additional $\varepsilon I_{m-1}$ term is to correct for the possible singularity of the covariance estimator, for certain values of $x$. This is the case when we compare forecast errors from nested models. Let $\widehat{g}_{j,t}(x) = 1\{u_{j,t} \leq x\} - \frac{1}{T} \sum_{t=1}^T 1\{u_{j,t} \leq x\}$, so that the $j$-th diagonal element of $\widehat{\Sigma}_T^+(x, x)$ is given by

$\widehat{\sigma}_{j,T}^{2,+}(x) = \frac{1}{T} \sum_{t=1}^T \left( \widehat{g}_{j,t}(x) - \widehat{g}_{1,t}(x) \right)^2 + \frac{2}{T} \sum_{\tau=1}^{l_T} \sum_{t=\tau+1}^T w_\tau \left( \widehat{g}_{j,t}(x) - \widehat{g}_{1,t}(x) \right) \left( \widehat{g}_{j,t-\tau}(x) - \widehat{g}_{1,t-\tau}(x) \right),$   (2.7)

where $w_\tau = 1 - \frac{\tau}{l_T + 1}$, with $l_T \to \infty$ as $T \to \infty$. Also, let $\sigma_j^{2,+}(x, x')$ be the $j$-th diagonal element of $\Sigma^+(x, x')$, and let $\bar{\sigma}_{j,T}^{2,+}(x, x')$ be the $j$-th diagonal element of $\bar{\Sigma}_T^+(x, x')$. Analogously,

$\Sigma_q^+(x, x') = \operatorname{acov}\left( \sqrt{T} \widehat{q}_T^+(x), \sqrt{T} \widehat{q}_T^+(x') \right)$ and $\bar{\Sigma}_{q,T}^+(x, x') = \widehat{\Sigma}_{q,T}^+(x, x') + \varepsilon I_{m-1},$

where $\widehat{\Sigma}_{q,T}^+(x, x')$ is the sample analog of $\Sigma_q^+(x, x')$. Furthermore, $\widehat{\sigma}_{q,j,T}^{2,+}(x)$ is constructed by replacing $\widehat{g}_{1,t}(x)$ and $\widehat{g}_{j,t}(x)$ in the above expression with

$\widehat{h}_{1,t}(x) = [(u_{1,t} - x)]^+ - \frac{1}{T} \sum_{t=1}^T [(u_{1,t} - x)]^+$ and $\widehat{h}_{j,t}(x) = [(u_{j,t} - x)]^+ - \frac{1}{T} \sum_{t=1}^T [(u_{j,t} - x)]^+.$
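As an illustration of the pointwise ingredients above, the sketch below computes $\widehat{m}_{j,T}^+(x)$, $\widehat{q}_{j,T}^+(x)$ and the Bartlett-kernel HAC variance of (2.7) at a single point $x \geq 0$. The helper names are our own, and the code makes no claim to match the authors' implementation; the weights $w_\tau = 1 - \tau/(l_T + 1)$ and the demeaned indicator differences follow the definitions in the text.

```python
import numpy as np

def m_hat_plus(u1, uj, x):
    """Empirical moment (2.3): F_j_hat(x) - F_1_hat(x), for x >= 0."""
    return np.mean(uj <= x) - np.mean(u1 <= x)

def q_hat_plus(u1, uj, x):
    """Empirical moment (2.4): (1/T) sum_t ([u_1t - x]^+ - [u_jt - x]^+)."""
    return np.mean(np.maximum(u1 - x, 0.0) - np.maximum(uj - x, 0.0))

def hac_variance(d, l_T):
    """Bartlett-kernel HAC variance of sqrt(T) times the mean of a demeaned
    series d, as in (2.7): gamma_0 + 2 * sum_{tau=1}^{l_T} w_tau * gamma_tau,
    with w_tau = 1 - tau/(l_T + 1)."""
    T = d.shape[0]
    sigma2 = np.mean(d * d)
    for tau in range(1, l_T + 1):
        w_tau = 1.0 - tau / (l_T + 1.0)
        sigma2 += 2.0 * w_tau * np.sum(d[tau:] * d[:-tau]) / T
    return sigma2

def sigma2_hat_plus(u1, uj, x, l_T):
    """HAC variance (2.7) of sqrt(T) * m_hat_plus(x), built from the demeaned
    indicator differences g_jt(x) - g_1t(x)."""
    g1 = (u1 <= x).astype(float); g1 -= g1.mean()
    gj = (uj <= x).astype(float); gj -= gj.mean()
    return hac_variance(gj - g1, l_T)
```

For the CL moments, one would simply replace the indicator differences with the demeaned hinge terms $\widehat{h}_{j,t}(x)$, exactly as in the text.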
Note that $\widehat{m}_{j,T}^-(x)$, $\widehat{q}_{j,T}^-(x)$, $\widehat{\sigma}_{j,T}^{2,-}(x)$ and $\widehat{\sigma}_{q,j,T}^{2,-}(x)$ can be defined by utilizing the $s(\cdot)$ function. Namely, regardless of whether $x \geq 0$ or $x < 0$, one can construct $\widehat{m}_{j,T}(x) = \left( \widehat{F}_j(x) - \widehat{F}_1(x) \right) s(x)$ and

$\widehat{q}_{j,T}(x) = \int_{-\infty}^x \left( \widehat{F}_1(v) - \widehat{F}_j(v) \right) dv \, 1(x < 0) + \int_x^\infty \left( \widehat{F}_j(v) - \widehat{F}_1(v) \right) dv \, 1(x \geq 0) = \frac{1}{T} \sum_{t=1}^T \left( [(u_{1,t} - x) s(x)]^+ - [(u_{j,t} - x) s(x)]^+ \right),$

$\widehat{\sigma}_{j,T}^2(x) = \frac{1}{T} \sum_{t=1}^T \left( \widehat{g}_{j,t}(x) - \widehat{g}_{1,t}(x) \right)^2 + \frac{2}{T} \sum_{\tau=1}^{l_T} \sum_{t=\tau+1}^T w_\tau \left( \widehat{g}_{j,t}(x) - \widehat{g}_{1,t}(x) \right) s(x) \left( \widehat{g}_{j,t-\tau}(x) - \widehat{g}_{1,t-\tau}(x) \right) s(x),$

and $\widehat{\sigma}_{q,j,T}^2(x)$ by replacing $\widehat{g}_{1,t}(x)$ and $\widehat{g}_{j,t}(x)$ in the above expression with

$\widehat{h}_{1,t}(x) = [(u_{1,t} - x) s(x)]^+ - \frac{1}{T} \sum_{t=1}^T [(u_{1,t} - x) s(x)]^+$ and $\widehat{h}_{j,t}(x) = [(u_{j,t} - x) s(x)]^+ - \frac{1}{T} \sum_{t=1}^T [(u_{j,t} - x) s(x)]^+.$
Given the above framework, our new robust forecast superiority test statistics are:

$T_T^{GL,+} = \int_{x \in \mathcal{X}^+} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\sqrt{T} \widehat{m}_{j,T}^+(x)}{\bar{\sigma}_{j,T}^+(x)} \right\} \right)^2 d\mu(x)$ and $T_T^{GL,-} = \int_{x \in \mathcal{X}^-} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\sqrt{T} \widehat{m}_{j,T}^-(x)}{\bar{\sigma}_{j,T}^-(x)} \right\} \right)^2 d\mu(x),$   (2.8)

and

$T_T^{CL,+} = \int_{x \in \mathcal{X}^+} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\sqrt{T} \widehat{q}_{j,T}^+(x)}{\bar{\sigma}_{q,j,T}^+(x)} \right\} \right)^2 d\mu(x)$ and $T_T^{CL,-} = \int_{x \in \mathcal{X}^-} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\sqrt{T} \widehat{q}_{j,T}^-(x)}{\bar{\sigma}_{q,j,T}^-(x)} \right\} \right)^2 d\mu(x),$   (2.9)

where $\mu$ is a weighting function defined below; $\widehat{m}_{j,T}^+(x)$ and $\widehat{q}_{j,T}^+(x)$ are the $j$-th components of $\widehat{m}_T^+(x)$ and $\widehat{q}_T^+(x)$, as defined in (2.3) and (2.4), respectively. Here, $T_T^{GL,+}$ and $T_T^{CL,+}$ are "sum" functions, as in equation (3.8) in Andrews and Shi (2013), and satisfy their Assumptions S1-S4, which are required to guarantee that convergence is uniform over the null DGPs.2,3 If $m = 2$ and $\bar{\sigma}_{j,T}(x) = 1$ for all $j$ and $x$ (i.e., no standardization), then $T_T^{GL,+}$ is the statistic used in Linton, Song and Whang (2010) for testing FOSD.

2Note that we could have constructed a different "sum" function, using the statistic in (3.9) of Andrews and Shi (2013).
3Recall that one main drawback of the $\max_{j=2,\dots,m} \sup_{x \in \mathcal{X}^+} \sqrt{T} \widehat{m}_{j,T}^+(x)$ statistic in JCS (2017) is that it diverges to $-\infty$ under some sequences of probability measures under the null, thus ruling out uniformity.
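Under simplifying assumptions of our own (a uniform weighting measure $\mu$ on a finite grid, and a plain sample variance plus the $\varepsilon$ term of (2.6) in place of the HAC estimator), the "sum" statistic $T_T^{GL,+}$ in (2.8) can be sketched as follows; the function names are illustrative, not the authors'.

```python
import numpy as np

def t_gl_plus(u_bench, competitors, grid, eps=1e-3):
    """Sketch of (2.8): sum over grid points on X+ of
    sum_j (max{0, sqrt(T) m_hat_j(x) / sigma_bar_j(x)})^2, with mu uniform
    on the grid. The variance here is a simple sample variance of the
    indicator difference, regularized as in (2.6) (a simplification of the
    HAC estimator used in the paper)."""
    T = len(u_bench)
    total = 0.0
    for x in grid:
        for uj in competitors:
            d = (uj <= x).astype(float) - (u_bench <= x).astype(float)
            m = d.mean()                    # (2.3): F_j_hat(x) - F_1_hat(x)
            sigma = np.sqrt(d.var() + eps)  # regularized, cf. (2.6)
            total += max(0.0, np.sqrt(T) * m / sigma) ** 2
    return total / len(grid)                # uniform weights d mu(x)

rng = np.random.default_rng(1)
e = rng.normal(size=2000)
grid = np.linspace(0.0, 6.0, 50)
print(t_gl_plus(e, [3.0 * e], grid))   # 0.0: no moment is positive on X+
```

Only positive moment conditions contribute, so when the benchmark's error CDF weakly dominates every competitor's on the positive support, the statistic is exactly zero.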
Of note is that in our context, potential slackness causes a discontinuity in the pointwise asymptotic
distribution of the statistic.4 This is because the pointwise asymptotic distribution is discontinuous,
unless all moment conditions hold with equality. On the other hand, the finite sample distribution is not
necessarily discontinuous. Thus, in the presence of slackness, the pointwise limiting distribution is not a
good approximation of the finite sample distribution, and critical values based on pointwise asymptotics
may be invalid. This is why we construct tests that are uniformly asymptotically valid (i.e., this is why
we study the limiting distribution of our tests under drifting sequences of probability measures belonging
to the null hypothesis). Moreover, in the infinite dimensional case, there is an additional source of
discontinuity. In particular, the number of moment inequalities which contributes to the statistic varies
across the different values of For example, the key difference between the case of = 2 and 2 is
that in the former case, for each value of there is only one moment inequality which can be binding (or
not). On the other hand, if = 3, say, then for each value of there can be either one or two moment
inequalities which may be binding (or not), and whether or not a particular inequality is binding (or not)
varies over . Under this setup, we require the following assumptions in order to analyze the asymptotic
behavior of our test statistics.
Assumption A1: For $j = 1, \dots, m$, $u_{j,t}$ is strictly stationary and $\phi$-mixing, with mixing coefficients $\phi_\tau = C\tau^{-A}$, where $A \geq 6\gamma/(1 - 2\gamma)$, $0 < \gamma < 1/2$, and $C \geq 1$.
Assumption A2: The union of the supports of $u_{1,t}, \dots, u_{m,t}$ is the compact set $\mathcal{X} = \mathcal{X}^- \cup \mathcal{X}^+$.
Assumption A3: $F_j(x)$ has a continuous bounded density.
Assumption A4: The weighting function $\mu$ has full support on $\mathcal{X}^+$ (or $\mathcal{X}^-$).
We use Assumption A2 in the proof of Lemma 1, where we require $\mathcal{X}^+$ in (2.8) and (2.9) to be a compact set. However, for the case of generalized loss superiority, the union of the supports of $u_{1,t}, \dots, u_{m,t}$ can be unbounded. This is because $\widehat{m}_{j,T}^+(x)$ is bounded, regardless of the boundedness of the support. On the other hand, $\widehat{q}_{j,T}^+(x)$ is bounded only when the union of the supports of the forecast errors is bounded.
3 Asymptotic Properties
3.1 Uniform Convergence of the HAC Estimator
We now turn to a discussion of the estimation of the variance in our forecast superiority test statistics.
If $u_{1,t}, \dots, u_{m,t}$ were martingale difference sequences, then we could still use the sample second moment as a variance estimator, and uniform consistency would follow by application of an appropriate uniform law of large numbers. In our set-up, we can assume that $u_{1,t}, \dots, u_{m,t}$ are martingale difference sequences if either: (i) they are judgmental forecasts from professional forecasters, say, who efficiently use all available information at time $t$ (a strong assumption, which is tested in the forecast rationality literature); or (ii) they are prediction errors from one-step ahead forecasts based on dynamically correctly specified models. With respect to (i), it is worth noting that professional forecasters may be rational, ex-post, according to some loss function (see Elliott, Komunjer and Timmermann (2005, 2008)), although it is not as likely that they are rational according to a generalized loss function. With respect to (ii), it should be noted that at most one model can be dynamically correctly specified for a given information set, and thus $u_{j,t}$ cannot be a martingale difference sequence for all $j = 1, \dots, m$. In light of these facts, we allow for time dependence in the forecast error sequences used in our statistics, and use a HAC variance estimator in (2.8) and (2.9). In order to ensure that the HAC estimators converge uniformly over $\mathcal{X}^+$, it suffices to establish the counterpart of Lemma A1 of Supplement A of Andrews and Shi (2013) for the case of mixing sequences. This is done below.

4By pointwise asymptotic distribution we mean the limiting distribution under a fixed probability measure.
Lemma 1: Let Assumptions A1-A3 hold. Then, if $l_T \approx T^\gamma$, $0 < \gamma < 1/2$, with $\gamma$ defined as in Assumption A1:

(i) $\sup_{x \in \mathcal{X}^+} \left| \widehat{\sigma}_{j,T}^{2,+}(x) - \sigma_j^{2,+}(x) \right| = o_P(1)$, with $\sigma_j^{2,+}(x) = \operatorname{avar}\left( \sqrt{T} \widehat{m}_{j,T}^+(x) \right)$; and

(ii) $\sup_{x \in \mathcal{X}^+} \left| \widehat{\sigma}_{q,j,T}^{2,+}(x) - \sigma_{q,j}^{2,+}(x) \right| = o_P(1)$, with $\sigma_{q,j}^{2,+}(x) = \operatorname{avar}\left( \sqrt{T} \widehat{q}_{j,T}^+(x) \right)$.
Lemma 1 establishes the uniform convergence over $\mathcal{X}^+$ of the HAC estimators. It is the time series counterpart of Lemma A1 in Andrews and Shi (2013). Of note is that we require $\phi$-mixing. This differs from the stationary pointwise HAC variance estimator case studied by Andrews (1991), where $\alpha$-mixing suffices, and where the mixing coefficients decline to zero slightly more slowly than in our Assumption A1. This is because there is a trade-off between the degree of dependence and the rate of growth of the lag truncation parameter in the HAC estimator. Indeed, in the uniform case, the covering number (e.g., see Andrews and Pollard (1994)) grows with both $l_T$ and the degree of dependence, thus leading to a trade-off between the two. For example, in the case of exponentially mixing series, $\gamma$ can be arbitrarily close to $1/2$.
For carrying out inference on our forecast superiority tests, we require a bootstrap analog of the HAC variance estimator, which can be constructed as follows. Using the block bootstrap, make $b$ draws of blocks of length $l$ from $u_{1,t}, \dots, u_{m,t}$, in order to obtain $\left( u_{1,t}^*, \dots, u_{m,t}^* \right)$, $t = 1, \dots, T$, with $T = bl$, where the block size, $l$, is equal to the lag truncation parameter in the HAC estimator described above.5 Now, let $\widehat{g}_{1,t}^*(x) = 1\{u_{1,t}^* \leq x\} - \frac{1}{T} \sum_{t=1}^T 1\{u_{1,t} \leq x\}$, $\widehat{g}_{j,t}^*(x) = 1\{u_{j,t}^* \leq x\} - \frac{1}{T} \sum_{t=1}^T 1\{u_{j,t} \leq x\}$, and

$\widehat{\sigma}_{j,T}^{2*,+}(x) = \frac{1}{b} \sum_{i=1}^b \left( \frac{1}{l^{1/2}} \sum_{t=1}^l \left( \widehat{g}_{j,(i-1)l+t}^*(x) - \widehat{g}_{1,(i-1)l+t}^*(x) \right) \right)^2.$   (3.1)

Define $\widehat{\sigma}_{q,j,T}^{2*,+}(x)$ analogously, replacing $\widehat{g}_{1,t}^*(x)$ with $\widehat{h}_{1,t}^*(x) = \left[ u_{1,t}^* - x \right]^+ - \frac{1}{T} \sum_{t=1}^T \left[ u_{1,t} - x \right]^+$ and $\widehat{g}_{j,t}^*(x)$ with $\widehat{h}_{j,t}^*(x) = \left[ u_{j,t}^* - x \right]^+ - \frac{1}{T} \sum_{t=1}^T \left[ u_{j,t} - x \right]^+$. Additionally, define

$\widehat{\sigma}_{j,j',T}^{2*,+}(x) = \frac{1}{b} \sum_{i=1}^b \left( \frac{1}{l^{1/2}} \sum_{t=1}^l \left( \widehat{g}_{j,(i-1)l+t}^*(x) - \widehat{g}_{1,(i-1)l+t}^*(x) \right) \right) \left( \frac{1}{l^{1/2}} \sum_{t=1}^l \left( \widehat{g}_{j',(i-1)l+t}^*(x) - \widehat{g}_{1,(i-1)l+t}^*(x) \right) \right).$

5We thus use the same notation, $l$, for both the lag truncation parameter and the block length.
The following result holds.
Lemma 2: Let Assumptions A1-A3 hold. Then, if $l \approx T^\gamma$, $0 < \gamma < 1/2$, with $\gamma$ defined as in Assumption A1:

(i) $\sup_{x \in \mathcal{X}^+} \left| \widehat{\sigma}_{j,T}^{2*,+}(x) - E^*\left( \widehat{\sigma}_{j,T}^{2*,+}(x) \right) \right| = o_{P^*}(1)$; and

(ii) $\sup_{x \in \mathcal{X}^+} \left| \widehat{\sigma}_{q,j,T}^{2*,+}(x) - E^*\left( \widehat{\sigma}_{q,j,T}^{2*,+}(x) \right) \right| = o_{P^*}(1)$,

where $o_{P^*}(1)$ denotes convergence to zero according to the bootstrap probability law, $P^*$, conditional on the sample.
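A minimal sketch of the block-level construction in (3.1), under assumptions of our own: random (rather than fixed, consecutive) block starts, and a generic demeaned series in place of the indicator differences; the helper name is hypothetical.

```python
import numpy as np

def block_bootstrap_sigma2(d, l, rng):
    """Bootstrap analogue of the HAC variance, in the spirit of (3.1):
    resample b = T // l blocks of length l from the demeaned series d and
    average the squared l^{-1/2}-normalized block sums."""
    T = d.shape[0]
    b = T // l
    starts = rng.integers(0, T - l + 1, size=b)        # random block starts
    block_sums = np.array([d[s:s + l].sum() for s in starts])
    return float(np.mean((block_sums / np.sqrt(l)) ** 2))

# An alternating series has long-run variance zero: every contiguous block of
# length 2 sums to zero, so the estimator with l = 2 is exactly zero.
d = np.tile([1.0, -1.0], 50)
print(block_bootstrap_sigma2(d, 2, np.random.default_rng(0)))  # 0.0
```

With $l = 1$ the same estimator collapses to the sample second moment of $d$, which for this alternating series is exactly one; the contrast illustrates how blocking captures (here, negative) serial dependence.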
As in our above discussion, when constructing bootstrap counterparts of the statistics defined in (2.8) and (2.9) on both the positive and negative supports of $x$, it suffices to utilize the $s(\cdot)$ function, and to note that $s(x)^2 = 1$. For example, replace $\widehat{h}_{1,t}^*(x)$ with $\widehat{h}_{1,t}^*(x) = \left[ (u_{1,t}^* - x) s(x) \right]^+ - \frac{1}{T} \sum_{t=1}^T \left[ (u_{1,t} - x) s(x) \right]^+$, replace $\widehat{h}_{j,t}^*(x)$ with $\widehat{h}_{j,t}^*(x) = \left[ (u_{j,t}^* - x) s(x) \right]^+ - \frac{1}{T} \sum_{t=1}^T \left[ (u_{j,t} - x) s(x) \right]^+$, and define

$\widehat{\sigma}_{j,j',T}^{2*}(x) = \frac{1}{b} \sum_{i=1}^b \left( \frac{1}{l^{1/2}} \sum_{t=1}^l \left( \widehat{g}_{j,(i-1)l+t}^*(x) - \widehat{g}_{1,(i-1)l+t}^*(x) \right) \right) \left( \frac{1}{l^{1/2}} \sum_{t=1}^l \left( \widehat{g}_{j',(i-1)l+t}^*(x) - \widehat{g}_{1,(i-1)l+t}^*(x) \right) \right)$

and

$\widehat{\sigma}_{q,j,j',T}^{2*}(x) = \frac{1}{b} \sum_{i=1}^b \left( \frac{1}{l^{1/2}} \sum_{t=1}^l \left( \widehat{h}_{j,(i-1)l+t}^*(x) - \widehat{h}_{1,(i-1)l+t}^*(x) \right) \right) \left( \frac{1}{l^{1/2}} \sum_{t=1}^l \left( \widehat{h}_{j',(i-1)l+t}^*(x) - \widehat{h}_{1,(i-1)l+t}^*(x) \right) \right).$
3.2 Inference Using the Bootstrap and Bounding Limiting Distributions
The statistics $T_T^{GL,+}$ and $T_T^{CL,+}$ are highly discontinuous over $x$: exactly which moment conditions, and how many of them, are binding varies over $x$. Hence, $T_T^{GL,+}$ and $T_T^{CL,+}$ do not necessarily have a well defined limiting distribution, and the continuous mapping theorem cannot be applied. However, following the
generalized moment selection (GMS) test approach of Andrews and Shi (2013), we can establish lower and upper bound limiting distributions. Let

$D^+(x) = \operatorname{diag} \Sigma^+(x, x),$

$\nu_T^+(x) = D^+(x)^{-1/2} \left( \sqrt{T} \widehat{m}_{2,T}^+(x), \dots, \sqrt{T} \widehat{m}_{m,T}^+(x) \right)',$   (3.2)

$\Omega^+(x, x') = D^+(x)^{-1/2} \left( \Sigma^+ + \varepsilon I_{m-1} \right)(x, x') D^+(x')^{-1/2},$   (3.3)

and

$h^+(x) = \left( h_2^+(x), \dots, h_m^+(x) \right)',$   (3.4)

where $\nu^+(x)$ is an $(m-1)$-dimensional zero mean Gaussian process with correlation $\Omega^+(x, x')$. Also, let $D_q^+(x)$, $\nu_{q,T}^+(x)$, $\Omega_q^+(x, x')$ and $h_q^+(x)$ be defined analogously, by replacing $\Sigma^+(x, x)$, $\widehat{m}_{2,T}^+(x), \dots, \widehat{m}_{m,T}^+(x)$ with $\Sigma_q^+(x, x)$, $\widehat{q}_{2,T}^+(x), \dots, \widehat{q}_{m,T}^+(x)$. Finally, define

$T_h^{\dagger,GL,+} = \int_{\mathcal{X}^+} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\nu_j^+(x) + h_j^+(x)}{\sqrt{\bar{\sigma}_j^{2,+}(x)}} \right\} \right)^2 d\mu(x),$   (3.5)

where $\bar{\sigma}_j^{2,+}(x)$ is the $j$-th diagonal element of $\bar{\Sigma}^+(x, x)$, and let

$T_{h_\infty}^{\dagger,GL,+} = \int_{\mathcal{X}^+} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\nu_j^+(x) + h_{j,\infty}^+(x)}{\sqrt{\bar{\sigma}_j^{2,+}(x)}} \right\} \right)^2 d\mu(x),$   (3.6)

where $h_{j,\infty}^+(x) = 0$ if $m_j^+(x) = 0$, and $h_{j,\infty}^+(x) = -\infty$ if $m_j^+(x) < 0$. Also, define $T_h^{\dagger,CL,+}$ and $T_{h_\infty}^{\dagger,CL,+}$ analogously, by replacing $\nu_j^+(x)$, $h_j^+(x)$, $h_{j,\infty}^+(x)$ and $\bar{\sigma}_j^{2,+}(x)$ with $\nu_{q,j}^+(x)$, $h_{q,j}^+(x)$, $h_{q,j,\infty}^+(x)$ and $\bar{\sigma}_{q,j}^{2,+}(x)$. Hereafter, let

$\mathcal{P}_0^{GL,+} = \left\{ P : H_0^{GL,+} \text{ holds} \right\},$

so that $\mathcal{P}_0^{GL,+}$ is the collection of DGPs under which the null hypothesis holds. Let $\mathcal{P}_0^{CL,+}$ be defined analogously, with $H_0^{GL,+}$ replaced by $H_0^{CL,+}$. The following result holds.
Theorem 1: Let Assumptions A1-A4 hold. Then:

(i) under $H_0^{GL,+}$, there exists an $\eta > 0$ such that

$\limsup_{T \to \infty} \sup_{P \in \mathcal{P}_0^{GL,+}} \left[ P\left( T_T^{GL,+} > c \right) - P\left( T_h^{\dagger,GL,+} + \eta > c \right) \right] \leq 0$

and

$\liminf_{T \to \infty} \inf_{P \in \mathcal{P}_0^{GL,+}} \left[ P\left( T_T^{GL,+} > c \right) - P\left( T_h^{\dagger,GL,+} - \eta > c \right) \right] \geq 0;$

and

(ii) under $H_0^{CL,+}$, there exists an $\eta > 0$ such that

$\limsup_{T \to \infty} \sup_{P \in \mathcal{P}_0^{CL,+}} \left[ P\left( T_T^{CL,+} > c \right) - P\left( T_h^{\dagger,CL,+} + \eta > c \right) \right] \leq 0$

and

$\liminf_{T \to \infty} \inf_{P \in \mathcal{P}_0^{CL,+}} \left[ P\left( T_T^{CL,+} > c \right) - P\left( T_h^{\dagger,CL,+} - \eta > c \right) \right] \geq 0.$

Theorem 1 provides upper and lower bounds for $P\left( T_T^{GL,+} > c \right)$ and $P\left( T_T^{CL,+} > c \right)$, uniformly over the probabilities under $H_0^{GL,+}$ and $H_0^{CL,+}$, respectively. Note that $h^+(\cdot)$ and $h_q^+(\cdot)$ depend on the degree of slackness, and do not need to converge. Indeed, $T_T^{GL,+}$ and/or $T_T^{CL,+}$ do not have to converge in distribution for this result to hold.
Following Andrews and Shi (2013), we can construct bootstrap critical values which properly mimic the critical values of $T_{h_\infty}^{\dagger,GL,+}$ and $T_{h_\infty}^{\dagger,CL,+}$. We rely on the block bootstrap to capture the dependence in the data when constructing our bootstrap statistics. Consider the case of $T_{h_\infty}^{\dagger,GL,+}$. Let $\left( u_{1,t}^*, \dots, u_{m,t}^* \right)$, $b$ and $l$ be defined as in the previous subsection, and let:

$\widehat{m}_{j,T}^{*,+}(x) = \frac{1}{T} \sum_{t=1}^T \left( 1\{u_{j,t}^* \leq x\} - 1\{u_{1,t}^* \leq x\} \right)$   (3.7)

and

$\nu_T^{*,+}(x) = \sqrt{T} \widehat{D}_T^+(x)^{-1/2} \left( \widehat{m}_T^{*,+}(x) - \widehat{m}_T^+(x) \right),$   (3.8)

with $\widehat{m}_T^{*,+}(x) = \left( \widehat{m}_{2,T}^{*,+}(x), \dots, \widehat{m}_{m,T}^{*,+}(x) \right)'$ and $\widehat{D}_T^+(x) = \operatorname{diag} \widehat{\Sigma}_T^+(x, x)$. Then, define:

$\xi_T^+(x) = \kappa_T^{-1} T^{1/2} \bar{D}_T^+(x)^{-1/2} \widehat{m}_T^+(x),$   (3.9)

with $\kappa_T \to \infty$ as $T \to \infty$. Here, $\xi_{j,T}^+(x)$ is the $j$-th element of $\xi_T^+(x) = \left( \xi_{2,T}^+(x), \dots, \xi_{m,T}^+(x) \right)'$, $\bar{D}_T^+(x) = \operatorname{diag} \bar{\Sigma}_T^+(x, x)$, and

$\varphi_{j,T}^+(x) = \chi_T 1\left\{ \xi_{j,T}^+(x) < -1 \right\},$   (3.10)

with $\chi_T$ a positive sequence, which is bounded away from zero. Thus, $\varphi_{j,T}^+(x) = \chi_T$ when $\widehat{m}_{j,T}^+(x) < -\kappa_T T^{-1/2} \bar{\sigma}_{j,T}^+(x)$ (i.e., when the $j$-th inequality is slack at $x$), and is zero otherwise.

It is clear from the selection rule in (3.10) that we do need an estimator of the variance of the moment conditions, despite the fact that we use bootstrap critical values. In fact, standardization does not play a crucial role in the statistics, as all positive sample moment conditions matter. On the other hand, without the scaling factor in (3.9), the number of non-slack moment conditions would depend on the scale, and hence our bootstrap critical values would no longer be scale invariant. Let

$T_T^{*,GL,+} = \int_{\mathcal{X}^+} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\nu_{j,T}^{*,+}(x) - \varphi_{j,T}^+(x)}{\sqrt{\bar{\omega}_{j,T}^{*,+}(x)}} \right\} \right)^2 d\mu(x),$   (3.11)
where $\bar{\omega}_{j,T}^{*,+}(x)$ is the $j$-th diagonal element of $\bar{D}_T^+(x)^{-1/2} \bar{\Sigma}_T^{*,+}(x, x) \bar{D}_T^+(x)^{-1/2}$, and $\bar{\Sigma}_T^{*,+}(x, x)$ is the bootstrap analog of $\bar{\Sigma}_T^+(x, x)$.6 Note that if $\chi_T$ grows with $T$, then all slack inequalities are discarded, asymptotically. It is immediate to see that $T_T^{*,GL,+}$ is the bootstrap counterpart of $T_h^{\dagger,GL,+}$ in (3.5), with $\varphi_{j,T}^+(x)$ mimicking the contribution of the slackness of inequality $j$ (i.e., of the $j$-th element of $h^+(x)$). However, $\varphi_{j,T}^+(x)$ is not a consistent estimator of $h_j^+(x)$, since the latter cannot be consistently estimated.

Now, consider the case of $T_{h_\infty}^{\dagger,CL,+}$. Let:

$\widehat{q}_{j,T}^{*,+}(x) = \frac{1}{T} \sum_{t=1}^T \left( \left[ u_{1,t}^* - x \right]^+ - \left[ u_{j,t}^* - x \right]^+ \right)$

and define $\nu_{q,T}^{*,+}(x)$, $\widehat{D}_{q,T}^+(x)$, $\xi_{q,T}^+(x)$ and $\varphi_{q,j,T}^+(x)$ analogously to $\nu_T^{*,+}(x)$, $\widehat{D}_T^+(x)$, $\xi_T^+(x)$ and $\varphi_{j,T}^+(x)$, by replacing $\widehat{m}_T^{*,+}(x)$, $\widehat{m}_T^+(x)$ and $\widehat{\Sigma}_T^+(x, x)$ with $\widehat{q}_T^{*,+}(x)$, $\widehat{q}_T^+(x)$ and $\widehat{\Sigma}_{q,T}^+(x, x)$. Then, construct:

$T_T^{*,CL,+} = \int_{\mathcal{X}^+} \sum_{j=2}^m \left( \max\left\{ 0, \frac{\nu_{q,j,T}^{*,+}(x) - \varphi_{q,j,T}^+(x)}{\sqrt{\bar{\omega}_{q,j,T}^{*,+}(x)}} \right\} \right)^2 d\mu(x).$   (3.12)

By comparing (2.8) and (2.9) with (3.11) and (3.12), it is immediate to see that $\widehat{m}_{j,T}^+(x)$ ($\widehat{q}_{j,T}^+(x)$) does not contribute to the test statistic when $\widehat{m}_{j,T}^+(x) < 0$ ($\widehat{q}_{j,T}^+(x) < 0$), while it does not contribute to the bootstrap statistic when $\widehat{m}_{j,T}^+(x) < -\kappa_T T^{-1/2} \bar{\sigma}_{j,T}^+(x)$ ($\widehat{q}_{j,T}^+(x) < -\kappa_T T^{-1/2} \bar{\sigma}_{q,j,T}^+(x)$), with $\kappa_T T^{-1/2} \to 0$. Heuristically, by letting $\kappa_T$ grow with the sample size, we control the rejection rates in a uniform manner.
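The selection rule in (3.9)-(3.10) can be sketched as follows, vectorized over the moment index; the function name is ours, and the tuning sequences $\kappa_T$ and $\chi_T$ are left as user inputs rather than fixed to any particular choice.

```python
import numpy as np

def gms_penalty(m_hat, sigma_bar, T, kappa_T, chi_T):
    """GMS recentering term phi_j(x) of (3.10): a moment is flagged as slack
    when xi_j(x) = kappa_T^{-1} sqrt(T) m_hat_j(x) / sigma_bar_j(x) < -1,
    i.e. when m_hat_j(x) < -kappa_T T^{-1/2} sigma_bar_j(x); slack moments
    are penalized by chi_T in the bootstrap statistic, and binding ones are
    left alone."""
    xi = np.sqrt(T) * np.asarray(m_hat) / (kappa_T * np.asarray(sigma_bar))
    return np.where(xi < -1.0, chi_T, 0.0)

# With T = 100 and kappa_T = 2: a sample moment of -1.0 gives xi = -5 < -1,
# so that inequality is treated as slack and penalized; a moment of 0.0 is
# binding and receives no penalty.
print(gms_penalty([-1.0, 0.0], [1.0, 1.0], T=100, kappa_T=2.0, chi_T=5.0))
```

Because $\kappa_T T^{-1/2} \to 0$, the slackness threshold shrinks toward zero, so in large samples only genuinely slack inequalities are removed from the bootstrap statistic.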
It remains to define the GMS bootstrap critical values. Let $c_{1-\alpha}^{*,GL,+}\left( \varphi_T^+, \bar{\Omega}_T^{*,+} \right)$ be the $(1-\alpha)$-th critical value of $T_T^{*,GL,+}$, based on $B$ bootstrap replications, with $\varphi_T^+$ defined as in (3.10) and $\bar{\Omega}_T^{*,+}(x, x) = \bar{D}_T^+(x)^{-1/2} \bar{\Sigma}_T^{*,+}(x, x) \bar{D}_T^+(x)^{-1/2}$. The $(1-\alpha)$-th GMS bootstrap critical value, $c_{1-\alpha,\eta}^{*,GL,+}\left( \varphi_T^+, \bar{\Omega}_T^{*,+} \right)$, is defined as:

$c_{1-\alpha,\eta}^{*,GL,+}\left( \varphi_T^+, \bar{\Omega}_T^{*,+} \right) = \lim_{B \to \infty} c_{1-\alpha+\eta}^{*,GL,+}\left( \varphi_T^+, \bar{\Omega}_T^{*,+} \right) + \eta,$

for $\eta > 0$, arbitrarily small. Further, $c_{1-\alpha}^{*,CL,+}\left( \varphi_{q,T}^+, \bar{\Omega}_{q,T}^{*,+} \right)$ and $c_{1-\alpha,\eta}^{*,CL,+}\left( \varphi_{q,T}^+, \bar{\Omega}_{q,T}^{*,+} \right)$ are defined analogously.

Here, the constant $\eta$ is used to guarantee uniformity over the infinite dimensional nuisance parameters $h_j^+(x)$, $h_{q,j}^+(x)$, uniformly in $x \in \mathcal{X}^+$, and is termed the infinitesimal uniformity factor by Andrews and Shi (2013). Heuristically, if all moment conditions are slack, then both the statistic and its bootstrap counterpart are zero, and by having $\eta > 0$, though arbitrarily close to zero, we control the asymptotic rejection rate.

Finally, let

$\mathcal{B}^{GL,+} = \left\{ x \in \mathcal{X}^+ \text{ s.t. } h_{j,\infty}^+(x) = 0, \text{ for some } j = 2, \dots, m \right\}$   (3.13)

6Thus, the diagonal elements of $\bar{\Sigma}_T^{*,+}(x, x)$ are the $\widehat{\sigma}_{j,T}^{2*,+}(x)$ described in the previous subsection, while the off-diagonal elements of $\bar{\Sigma}_T^{*,+}(x, x)$ are defined accordingly, as $\widehat{\sigma}_{j,j',T}^{2*,+}(x)$, with $j \neq j'$.
and
B+ =n ∈ X+ s.t. +∞ = 0 for some = 2
o (3.14)
where B+ and B+ define the sets over which at least one moment condition holds with strict equality,and these sets represent the boundaries of +0 and
+0 respectively.
Although we require that the block length grows at the same rate as the lag truncation parameter in Lemma 2 (i.e., we require that $l_P \approx C\,P^{\delta}$, with $0 < \delta < 1/2$ tied to the mixing coefficient in A1), for the asymptotic uniform validity of the bootstrap critical values we require that the block length grows at a rate slower than $P^{1/3}$. This slower rate is required for the bootstrap empirical central limit theorem for a mixing process to hold (see Peligrad (1998)). Needless to say, even in the construction of $\hat\sigma^{2,+}_{j,P}(x)$, we should thus use $l_P = o\left(P^{1/3}\right)$. The following result holds.
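As a rough illustration of the resampling scheme, the sketch below draws one circular block bootstrap sample with a block length growing slower than $P^{1/3}$; the exponent 0.25 is an illustrative choice, not the paper's:

```python
import numpy as np

def block_bootstrap(x, rng, block_len=None):
    """One circular block bootstrap resample of the series x.
    The block length grows slower than P**(1/3), as required for
    the bootstrap empirical CLT (illustrative choice: P**0.25)."""
    x = np.asarray(x)
    P = len(x)
    l = block_len or max(1, int(P ** 0.25))
    n_blocks = int(np.ceil(P / l))
    starts = rng.integers(0, P, size=n_blocks)     # random block starts
    idx = (starts[:, None] + np.arange(l)[None, :]) % P  # wrap around
    return x[idx.ravel()][:P]
```

The circular scheme avoids edge effects by wrapping block indices modulo $P$.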
Theorem 2: Let Assumptions A1-A4 hold, and let $l_P \to \infty$ and $l_P\,P^{-1/3} \to 0$ as $P \to \infty$. Under $H^{+}_{0}$:
(i) if, as $P \to \infty$, $\kappa_P \to \infty$ and $\kappa_P/\sqrt P \to 0$, then
$$\limsup_{P\to\infty}\ \sup_{\mathbb P\in\mathcal P^{+}_{0}} \Pr\left(T^{+}_{P} \geq c^{*,+,0}_{1-\alpha}\left(\hat\varphi^{+}_{P}, \hat\Omega^{*,+}_{P}\right)\right) \leq \alpha;$$
and
(ii) if, as $P \to \infty$, $\kappa_P \to \infty$, $\sqrt P/\kappa_P \to \infty$, and $\mu\left(B^{+}\right) > 0$, then
$$\lim_{\eta\to 0}\ \limsup_{P\to\infty}\ \sup_{\mathbb P\in\mathcal P^{+}_{0}} \Pr\left(T^{+}_{P} \geq c^{*,+,0}_{1-\alpha}\left(\hat\varphi^{+}_{P}, \hat\Omega^{*,+}_{P}\right)\right) = \alpha.$$
Also, under $H^{C,+}_{0}$:
(iii) if, as $P \to \infty$, $\kappa_P \to \infty$ and $\kappa_P/\sqrt P \to 0$, then
$$\limsup_{P\to\infty}\ \sup_{\mathbb P\in\mathcal P^{C,+}_{0}} \Pr\left(T^{C,+}_{P} \geq c^{*,C,+,0}_{1-\alpha}\left(\hat\varphi^{C,+}_{P}, \hat\Omega^{*,C,+}_{P}\right)\right) \leq \alpha;$$
and (iv) if, as $P \to \infty$, $\kappa_P \to \infty$, $\sqrt P/\kappa_P \to \infty$, and $\mu\left(B^{C,+}\right) > 0$, then
$$\lim_{\eta\to 0}\ \limsup_{P\to\infty}\ \sup_{\mathbb P\in\mathcal P^{C,+}_{0}} \Pr\left(T^{C,+}_{P} \geq c^{*,C,+,0}_{1-\alpha}\left(\hat\varphi^{C,+}_{P}, \hat\Omega^{*,C,+}_{P}\right)\right) = \alpha.$$
Statements (i) and (iii) of Theorem 2 establish that inference based on GMS bootstrap critical values is uniformly asymptotically valid. Statements (ii) and (iv) of the theorem establish that inference based on GMS bootstrap critical values is asymptotically non-conservative, whenever $\mu\left(B^{+}\right) > 0$ or $\mu\left(B^{C,+}\right) > 0$ (i.e., whenever at least one moment condition holds with equality over a set of $x \in \mathcal X^{+}$ with non-zero $\mu$-measure). Although the GMS based tests are not similar on the boundary, the degree of non-similarity, which is
$$\lim_{\eta\to 0}\limsup_{P\to\infty}\sup_{\mathbb P\in\mathcal P^{+}_{0}}\Pr\left(T^{+}_{P} \geq c^{*,+,0}_{1-\alpha}\left(\hat\varphi^{+}_{P},\hat\Omega^{*,+}_{P}\right)\right) - \lim_{\eta\to 0}\liminf_{P\to\infty}\inf_{\mathbb P\in\mathcal P^{+}_{0}}\Pr\left(T^{+}_{P} \geq c^{*,+,0}_{1-\alpha}\left(\hat\varphi^{+}_{P},\hat\Omega^{*,+}_{P}\right)\right),$$
is much smaller than that associated with using the “usual” recentered bootstrap. In the case of pairwise comparison (i.e., $n = 2$), Theorem 2(ii) of Linton, Song and Whang (2010) establishes similarity of stochastic dominance tests on a subset of the boundary.
For implementation of the tests discussed in this paper, it thus follows that one can use Holm bounds, as is done in JCS (2017), with modifications due to the presence of the constant $\eta$. Estimate bootstrap $p$-values
$$\hat p^{+}_{T} = \frac{1}{B}\sum_{b=1}^{B} 1\left\{\left(T^{*,+}_{P,b} + \eta\right) \geq T^{+}_{P}\right\} \quad\text{and}\quad \hat p^{-}_{T} = \frac{1}{B}\sum_{b=1}^{B} 1\left\{\left(T^{*,-}_{P,b} + \eta\right) \geq T^{-}_{P}\right\}.$$
Estimate $\hat p^{C,+}_{T}$ and $\hat p^{C,-}_{T}$ in analogous fashion. Then, use the following rules (Holm (1979)):
Rule GL: Reject $H_{0}$ at level $\alpha$ if $\min\left\{\hat p^{+}_{T},\, \hat p^{-}_{T}\right\} \leq (\alpha-\eta)/2$.
Rule CL: Reject $H^{C}_{0}$ at level $\alpha$ if $\min\left\{\hat p^{C,+}_{T},\, \hat p^{C,-}_{T}\right\} \leq (\alpha-\eta)/2$.
3.3 Power against Fixed and Local Alternatives
As our statistics are weighted averages over $\mathcal X^{+}$, they have non-trivial power only if the null is violated over a subset of non-zero $\mu$-measure. This applies both to power against fixed alternatives and to power against $\sqrt P$-local alternatives. In particular, for power against fixed alternatives, we require the following assumption.
Assumption FA: (i) $\mu\left(M^{+}\right) > 0$, where $M^{+} = \left\{x \in \mathcal X^{+} : m^{+}_{j,\infty}(x) > 0 \text{ for some } j = 2,\dots,n\right\}$. (ii) $\mu\left(M^{C,+}\right) > 0$, where $M^{C,+} = \left\{x \in \mathcal X^{+} : m^{C,+}_{j,\infty}(x) > 0 \text{ for some } j = 2,\dots,n\right\}$.
The following result holds.
Theorem 3: Let Assumptions A0-A4 hold.
(i) If Assumption FA(i) holds, then under $H^{+}_{A}$:
$$\lim_{P\to\infty}\Pr\left(T^{+}_{P} \geq c^{*,+,0}_{1-\alpha}\left(\hat\varphi^{+}_{P},\hat\Omega^{*,+}_{P}\right)\right) = 1.$$
(ii) If Assumption FA(ii) holds, then under $H^{C,+}_{A}$:
$$\lim_{P\to\infty}\Pr\left(T^{C,+}_{P} \geq c^{*,C,+,0}_{1-\alpha}\left(\hat\varphi^{C,+}_{P},\hat\Omega^{*,C,+}_{P}\right)\right) = 1.$$
It is immediate to see that we have unit power against fixed alternatives, provided that the null hypothesis is violated, for at least one $j = 2,\dots,n$, over a subset of $\mathcal X^{+}$ of non-zero $\mu$-measure. Now, if we instead used a Kolmogorov type statistic (i.e., replaced the integral over $\mathcal X^{+}$ with the supremum over $\mathcal X^{+}$), then we would not need Assumption FA, and it would suffice to have a violation for some $x$ with possibly zero $\mu$-measure, or in general with zero Lebesgue measure.⁷ However, as pointed out in Supplement B of
⁷The Kolmogorov versions of $T^{+}_{P}$ and $T^{C,+}_{P}$ are:
$$KS^{+}_{P} = \max_{x\in\mathcal X^{+}}\max_{j=2,\dots,n}\left(\max\left\{0,\ \frac{\sqrt P\,\bar m^{+}_{j,P}(x)}{\hat\sigma^{+}_{j,P}(x)}\right\}\right)^{2} \quad\text{and}\quad KS^{C,+}_{P} = \max_{x\in\mathcal X^{+}}\max_{j=2,\dots,n}\left(\max\left\{0,\ \frac{\sqrt P\,\bar m^{C,+}_{j,P}(x)}{\hat\sigma^{C,+}_{j,P}(x)}\right\}\right)^{2}.$$
Andrews and Shi (2013), the statements in parts (ii) and (iv) of Theorem 2 do not apply to Kolmogorov tests, and hence asymptotic non-conservativeness does not necessarily hold. This is because the proofs of those statements use the bounded convergence theorem, which applies to integrals but not to suprema.
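The contrast between the integral-type and Kolmogorov-type constructions can be sketched on a finite grid approximating $\mathcal X^{+}$ (a simplified discretization; names are illustrative):

```python
import numpy as np

def cvm_and_ks(m_bar, sigma_hat, P, weights=None):
    """m_bar, sigma_hat: (n-1, G) arrays of moment estimates and their
    standard errors on a grid of G points covering X+.  Returns the
    integral-type statistic, which averages the squared positive parts
    over the grid, and the sup-type (Kolmogorov) statistic."""
    z = np.sqrt(P) * m_bar / sigma_hat
    contrib = np.maximum(z, 0.0) ** 2            # (max{0, .})^2
    if weights is None:                           # uniform measure mu
        weights = np.full(m_bar.shape[1], 1.0 / m_bar.shape[1])
    cvm = float(np.sum(contrib.sum(axis=0) * weights))
    ks = float(contrib.max())
    return cvm, ks
```

A violation confined to a single grid point moves the sup-type statistic but is averaged away in the integral-type statistic, which is exactly the role of the non-zero $\mu$-measure requirement.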
We now consider the following sequences of local alternatives:
$$H^{+}_{A,P}:\ m^{+}_{j,P}(x) = m^{+}_{j,\infty}(x) + \frac{\delta_{1,j}(x)}{\sqrt P} + o\left(P^{-1/2}\right),\quad\text{for } j = 2,\dots,n,\ x \in \mathcal X^{+},$$
and
$$H^{C,+}_{A,P}:\ m^{C,+}_{j,P}(x) = m^{C,+}_{j,\infty}(x) + \frac{\delta_{2,j}(x)}{\sqrt P} + o\left(P^{-1/2}\right),\quad\text{for } j = 2,\dots,n,\ x \in \mathcal X^{+}.$$
We have $\sqrt P\,\bar m^{+}_{j,P}(x)\,\sigma^{-1,+}_{j}(x) \to h^{+}_{j,\infty}(x) + \delta_{1,j}(x)$ and $\sqrt P\,\bar m^{C,+}_{j,P}(x)\,\sigma^{-1,C,+}_{j}(x) \to h^{C,+}_{j,\infty}(x) + \delta_{2,j}(x)$, as $P \to \infty$. Define,
$$T^{\dagger,+}_{\infty,1} = \int_{\mathcal X^{+}}\sum_{j=2}^{n}\left(\max\left\{0,\ \frac{\nu^{+}_{j}(x) + h^{+}_{j,\infty}(x) + \delta_{1,j}(x)}{\sqrt{\sigma^{2,+}_{j}(x)}}\right\}\right)^{2}\,d\mu(x)$$
and
$$T^{\dagger,C,+}_{\infty,2} = \int_{\mathcal X^{+}}\sum_{j=2}^{n}\left(\max\left\{0,\ \frac{\nu^{C,+}_{j}(x) + h^{C,+}_{j,\infty}(x) + \delta_{2,j}(x)}{\sqrt{\sigma^{2,C,+}_{j}(x)}}\right\}\right)^{2}\,d\mu(x).$$
We require the following assumption.
Assumption LA: (i) $\mu\left(\Lambda^{+}\right) > 0$, where
$$\Lambda^{+} = \left\{x : \sqrt P\,\bar m^{+}_{j,P}(x)\,\sigma^{-1,+}_{j}(x) \to h^{+}_{j,\infty}(x) + \delta_{1,j}(x),\ 0 < h^{+}_{j,\infty}(x) + \delta_{1,j}(x) < \infty \text{ for some } j = 2,\dots,n\right\}.$$
(ii) $\mu\left(\Lambda^{C,+}\right) > 0$, where
$$\Lambda^{C,+} = \left\{x : \sqrt P\,\bar m^{C,+}_{j,P}(x)\,\sigma^{-1,C,+}_{j}(x) \to h^{C,+}_{j,\infty}(x) + \delta_{2,j}(x),\ 0 < h^{C,+}_{j,\infty}(x) + \delta_{2,j}(x) < \infty \text{ for some } j = 2,\dots,n\right\}.$$
The following result holds.
Theorem 4: Let Assumptions A1-A4 hold.
(i) If Assumption LA(i) holds, then under $H^{+}_{A,P}$:
$$\lim_{P\to\infty}\Pr\left(T^{+}_{P} \geq c^{*,+,0}_{1-\alpha}\left(\hat\varphi^{+}_{P},\hat\Omega^{*,+}_{P}\right)\right) = \Pr\left(T^{\dagger,+}_{\infty,1} \geq c_{1-\alpha}\left(h^{+}_{\infty}, \Omega^{+}_{\infty}\right)\right),$$
with $c_{1-\alpha}\left(h^{+}_{\infty}, \Omega^{+}_{\infty}\right)$ denoting the $(1-\alpha)$-th critical value of $T^{\dagger,+}_{\infty,1}$, with $0 < h^{+}_{j,\infty}(x) + \delta_{1,j}(x) < \infty$ for some $j = 2,\dots,n$. (ii) If Assumption LA(ii) holds, then under $H^{C,+}_{A,P}$:
$$\lim_{P\to\infty}\Pr\left(T^{C,+}_{P} \geq c^{*,C,+,0}_{1-\alpha}\left(\hat\varphi^{C,+}_{P},\hat\Omega^{*,C,+}_{P}\right)\right) = \Pr\left(T^{\dagger,C,+}_{\infty,2} \geq c_{1-\alpha}\left(h^{C,+}_{\infty}, \Omega^{C,+}_{\infty}\right)\right),$$
with $c_{1-\alpha}\left(h^{C,+}_{\infty}, \Omega^{C,+}_{\infty}\right)$ denoting the $(1-\alpha)$-th critical value of $T^{\dagger,C,+}_{\infty,2}$, with $0 < h^{C,+}_{j,\infty}(x) + \delta_{2,j}(x) < \infty$ for some $j = 2,\dots,n$.
Theorem 4 establishes that our tests have power against $\sqrt P$-local alternatives, provided that the drifting sequence is bounded away from zero over a subset of $\mathcal X^{+}$ of non-zero $\mu$-measure. Note also that, for a given loss function, the sequence of local alternatives for the White reality check can be defined as:
$$H_{A,P}:\ \max_{j=2,\dots,n}\left(\mathbb E\left[L\left(u_{1,t}\right)\right] - \mathbb E\left[L\left(u_{j,t}\right)\right]\right) = \frac{\delta}{\sqrt P} + o\left(P^{-1/2}\right),\quad \delta > 0. \qquad (3.15)$$
For the sake of simplicity, suppose that $n = 2$ (this is the well known Diebold and Mariano (1995) test framework). Here,
$$0 < \delta = \sqrt P\left(\mathbb E\left[L\left(u_{1,t}\right)\right] - \mathbb E\left[L\left(u_{2,t}\right)\right]\right) + o(1) = \sqrt P\int_{-\infty}^{\infty} L(u)\left(f_{1}(u) - f_{2}(u)\right)du$$
$$= -\sqrt P\int_{-\infty}^{0} L'(u)\left(F_{1}(u) - F_{2}(u)\right)du - \sqrt P\int_{0}^{\infty} L'(u)\left(F_{1}(u) - F_{2}(u)\right)du$$
$$= \int_{-\infty}^{0}\left(h_{-\infty}(u) + \delta_{1}(u)\right)L'(u)\,du + \int_{0}^{\infty}\left(h_{+\infty}(u) + \delta_{1}(u)\right)L'(u)\,du, \qquad (3.16)$$
where $h(u) + \delta_{1}(u)$ denotes the limit of $\sqrt P\left(F_{2,P}(u) - F_{1,P}(u)\right)$, and $\delta_{1} = \delta_{1,1} - \delta_{1,2}$. Hence, $H_{A,P}$ in (3.15) is equivalent to $H^{+}_{A,P} \cap H^{-}_{A,P}$, whenever Assumption A0 holds and $d\mu(x) = L'(x)\,dx$.
Analogously, for any convex loss function which satisfies Assumption A0, $H_{A,P}$ in (3.15) is equivalent to $H^{C,+}_{A,P} \cap H^{C,-}_{A,P}$, whenever $d\mu(x) = L''(x)\,dx$. In fact, it is easy to see that:
$$0 < \delta = \sqrt P\left(\mathbb E\left[L\left(u_{1,t}\right)\right] - \mathbb E\left[L\left(u_{2,t}\right)\right]\right) + o(1) = \sqrt P\int_{-\infty}^{\infty}L(u)\left(f_{1}(u)-f_{2}(u)\right)du$$
$$= -\sqrt P\int_{-\infty}^{0}L'(u)\left(F_{1}(u)-F_{2}(u)\right)du - \sqrt P\int_{0}^{\infty}L'(u)\left(F_{1}(u)-F_{2}(u)\right)du$$
$$= -L'(u)\,\sqrt P\int_{-\infty}^{u}\left(F_{1}(v)-F_{2}(v)\right)dv\,\Big|_{-\infty}^{0} + \sqrt P\int_{-\infty}^{0}L''(u)\left(\int_{-\infty}^{u}\left(F_{1}(v)-F_{2}(v)\right)dv\right)du$$
$$\quad + L'(u)\,\sqrt P\int_{u}^{\infty}\left(F_{1}(v)-F_{2}(v)\right)dv\,\Big|_{0}^{\infty} - \sqrt P\int_{0}^{\infty}L''(u)\left(\int_{u}^{\infty}\left(F_{1}(v)-F_{2}(v)\right)dv\right)du$$
$$= \sqrt P\int_{-\infty}^{0}L''(u)\left(\int_{-\infty}^{u}\left(F_{1}(v)-F_{2}(v)\right)dv\right)du - \sqrt P\int_{0}^{\infty}L''(u)\left(\int_{u}^{\infty}\left(F_{1}(v)-F_{2}(v)\right)dv\right)du$$
$$= \int_{-\infty}^{0}\left(\int_{-\infty}^{u}\left(h_{-\infty}(v)+\delta_{2}(v)\right)dv\right)L''(u)\,du - \int_{0}^{\infty}\left(\int_{u}^{\infty}\left(h_{+\infty}(v)+\delta_{2}(v)\right)dv\right)L''(u)\,du.$$
4 Monte Carlo Experiments
In this section, we evaluate the finite sample performance of GL and CL forecast superiority tests when there are multiple competing sequences of forecast errors, under stationarity. In addition to analyzing the performance of our tests based on $T^{+}_{P}$ and $T^{-}_{P}$ (GL forecast superiority), as well as based on $T^{C,+}_{P}$ and $T^{C,-}_{P}$ (CL forecast superiority), we also analyze the performance of the related test statistics from JCS (2017), here called $S^{+}_{P}$, $S^{-}_{P}$, $S^{C,+}_{P}$, and $S^{C,-}_{P}$. For the sake of brevity, these two classes of tests are called $T$ and $S$ tests, respectively.⁸ For each experiment we carry out 1000 Monte Carlo replications, and the number of bootstrap samples is $B = 500$. Additionally, four different values of the smoothing parameter $\epsilon$ are examined for the $S$ tests, namely $\epsilon \in \{0.20, 0.35, 0.50, 0.60\}$, and four different values of the uniformity constant $\eta$ are examined for the $T$ tests, namely $\eta \in \{0.0015, 0.002, 0.0025, 0.003\}$.⁹ Additionally, for $T$ tests, when constructing $\hat\Sigma^{+}_{P}$ (as well as $\hat\Sigma^{-}_{P}$, etc.), we set the lag truncation parameter $l_P = \mathrm{integer}[P^{0.2}]$, with Bartlett kernel weights $w_s = 1 - s/(l_P + 1)$. Finally, when implementing the bootstrap counterpart of the $T$ tests, we set $\kappa_P = \sqrt{0.3\log(P)}$ and $B_P = \sqrt{0.4\log(P)/\log(\log(P))}$, following Andrews and Shi (2013, 2017).

Sample sizes of $P \in \{300, 600, 900\}$ are generated using each of the following ten data generating processes (DGPs), with independent forecast errors.

DGP1: $e_{1,t} \sim N(0,1)$ and $e_{j,t} \sim N(0,1)$, $j = 2,3$.
DGP2: $e_{1,t} \sim N(0,1)$ and $e_{j,t} \sim N(0,1)$, $j = 2,3,4,5$.
DGP3: $e_{1,t} \sim N(0,1)$, $e_{j,t} \sim N(0,1)$, $j = 2,3$, and $e_{j,t} \sim N(0,1.4^2)$, $j = 4,5$.
DGP4: $e_{1,t} \sim N(0,1)$, $e_{j,t} \sim N(0,1)$, $j = 2,3$, and $e_{j,t} \sim N(0,1.6^2)$, $j = 4,5$.
DGP5: $e_{1,t} \sim N(0,1)$, $e_{j,t} \sim N(0,0.8^2)$, $j = 2,3$, and $e_{j,t} \sim N(0,1.2^2)$, $j = 4,5$.
DGP6: $e_{1,t} \sim N(0,1)$, $e_{j,t} \sim N(0,0.8^2)$, $j = 2,3,4,5$, and $e_{j,t} \sim N(0,1.2^2)$, $j = 6,7,8,9$.
DGP7: $e_{1,t} \sim N(0,1)$, $e_{j,t} \sim N(0,1)$, $j = 2,3$, and $e_{j,t} \sim N(0,0.8^2)$, $j = 4,5$.
DGP8: $e_{1,t} \sim N(0,1)$, $e_{j,t} \sim N(0,1)$, $j = 2,3$, and $e_{j,t} \sim N(0,0.6^2)$, $j = 4,5$.
DGP9: $e_{1,t} \sim N(0,1)$ and $e_{j,t} \sim N(0,0.8^2)$, $j = 2,3,4,5$.
DGP10: $e_{1,t} \sim N(0,1)$ and $e_{j,t} \sim N(0,0.6^2)$, $j = 2,3,4,5$.

Additionally, we conducted experiments using DGPs specified with autocorrelated errors. For the sake of brevity, these findings are reported in the supplemental online appendix. Denoting $\tilde e_{j,t} = \rho\,\tilde e_{j,t-1}\,+$
⁸In the construction of the JCS statistics
$$S^{+}_{P} = \int_{\mathcal X^{+}}\sum_{j=2}^{n}\left(\max\left\{0,\ \frac{\sqrt P\,\bar m^{+}_{j,P}(x)}{\hat\sigma^{+}_{j,P}(x)}\right\}\right)^{2} d\mu(x) \quad\text{and}\quad S^{-}_{P} = \int_{\mathcal X^{-}}\sum_{j=2}^{n}\left(\max\left\{0,\ \frac{\sqrt P\,\bar m^{-}_{j,P}(x)}{\hat\sigma^{-}_{j,P}(x)}\right\}\right)^{2} d\mu(x),$$
we set $\mu(\cdot)$ as in JCS (2017); thus, $\mu(\cdot)$ is still uniform. For inference using these tests, once $\epsilon$ is determined, estimate bootstrap $p$-values $\hat p^{+}_{S} = \frac{1}{B}\sum_{b=1}^{B}1\left\{\left(S^{*,+}_{P,b} + \epsilon\right) \geq S^{+}_{P}\right\}$ and $\hat p^{-}_{S} = \frac{1}{B}\sum_{b=1}^{B}1\left\{\left(S^{*,-}_{P,b} + \epsilon\right) \geq S^{-}_{P}\right\}$. Then, use the following rules (Holm (1979)): Reject $H_{0}$ at level $\alpha$ if $\min\left\{\hat p^{+}_{S}, \hat p^{-}_{S}\right\} \leq (\alpha - \epsilon)/2$. Reject $H^{C}_{0}$ at level $\alpha$ if $\min\left\{\hat p^{C,+}_{S}, \hat p^{C,-}_{S}\right\} \leq (\alpha - \epsilon)/2$.
⁹In JCS (2017), the constant that we call $\epsilon$ is referred to by a different symbol.
$(1-\rho^2)^{1/2}v_{j,t}$, with $v_{j,t} \sim N(0,1)$, $j = 1,\dots,5$, the DGPs for these experiments are as follows.
DGP11: $e_{1,t} = \tilde e_{1,t}$ and $e_{j,t} = \tilde e_{j,t}$, $j = 2,3,4,5$.
DGP12: $e_{1,t} = \tilde e_{1,t}$, $e_{j,t} = \tilde e_{j,t}$, $j = 2,3$, and $e_{j,t} = 1.4\,\tilde e_{j,t}$, $j = 4,5$.
DGP13: $e_{1,t} = \tilde e_{1,t}$, $e_{j,t} = 0.8\,\tilde e_{j,t}$, $j = 2,3$, and $e_{j,t} = 1.2\,\tilde e_{j,t}$, $j = 4,5$.
DGP14: $e_{1,t} = \tilde e_{1,t}$, $e_{j,t} = \tilde e_{j,t}$, $j = 2,3$, and $e_{j,t} = 0.6\,\tilde e_{j,t}$, $j = 4,5$.
In the above setup, DGPs 1-4 and DGPs 11-12 are used to conduct size experiments, while DGPs 5-10 and DGPs 13-14 are used to conduct power experiments. In all cases, $e_{1,t}$ denotes the forecast errors from the benchmark model. Note that DGPs 1-2 correspond to the least favorable elements in the null, while in DGPs 3-4 and DGPs 11-12, some models underperform the benchmark. This is the case where we expect significant improvement when using our new tests instead of JCS tests. In DGPs 5-6 and DGP13, one half of the competing models outperform the benchmark model and the other half underperform. In DGPs 7-8, one half of the competing models outperform, while in DGPs 9-10 and DGP14, the competing models all outperform the benchmark model. The above DGPs are similar to those examined in JCS (2017), and are utilized in our experiments because they clearly illustrate the trade-offs associated with using $T$ and $S$ forecast superiority tests.
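A sketch of one such DGP (here, a DGP13-style design; the AR parameter `rho` is illustrative, since the value used in the paper is reported with the supplemental results) is:

```python
import numpy as np

def simulate_dgp13(P, rho=0.5, burn=100, rng=None):
    """Forecast errors in the style of DGP13: AR(1) errors
    e_tilde_{j,t} = rho*e_tilde_{j,t-1} + sqrt(1-rho^2)*v_{j,t},
    rescaled so models j=2,3 outperform (0.8x) and j=4,5
    underperform (1.2x) the benchmark j=1."""
    rng = rng or np.random.default_rng()
    v = rng.standard_normal((5, P + burn))
    e = np.zeros_like(v)
    for t in range(1, P + burn):
        e[:, t] = rho * e[:, t - 1] + np.sqrt(1 - rho ** 2) * v[:, t]
    e = e[:, burn:]                      # drop burn-in
    scale = np.array([1.0, 0.8, 0.8, 1.2, 1.2])
    return scale[:, None] * e
```

The $(1-\rho^2)^{1/2}$ factor keeps the stationary variance of each $\tilde e_{j,t}$ equal to one, so the scale factors alone determine relative accuracy.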
We now discuss the experimental findings gathered in Tables 1 and 2. All reported results are rejection frequencies based on carrying out the $T$ and $S$ tests using a nominal size equal to 0.1. Turning first to Table 1, note that results in this table are based on $S$ tests. Summarizing, $S$ tests have reasonably good size under DGPs 1-2 (the least favorable case under the null). However, they are often undersized (and in some cases severely so) in some sample size/parameter permutations when some models are worse than the benchmark (see DGPs 3-4), as should be expected given that the tests are not asymptotically correctly sized under these two DGPs. Moreover, in these cases the empirical size is non-monotonic. Turning to Table 2, note that $T$ tests, which are asymptotically non-conservative, often exhibit better size properties under DGPs 3-4 (compare DGPs 3-4 in Tables 1 and 2) than $S$ tests. For example, for the CL forecast superiority test, the empirical size of the $S$ test is 0.020 for all values of $\epsilon$, when $P = 900$ (see Table 1). The analogous value based on implementation of the $T$ test is 0.083 for all values of $\eta$ (see Table 2). Again, it is worth stressing that this finding comes as no surprise, given that the $T$ test is asymptotically non-conservative on the boundary of the null hypotheses, while the $S$ test is conservative. Turning to power, note that the power of the $S$ test is sometimes quite low relative to that of the $T$ test. For example, under DGP7, $S$ test power is 0.445 for the GL forecast superiority test and 0.845 for the CL forecast superiority test, when $P = 300$. The analogous rejection frequencies for the $T$ tests are 0.870 and 0.923 (see Table 2, DGP7, $P = 300$). As expected, thus, $T$ tests exhibit improved power relative to $S$ tests when some models are worse than the benchmark. All of the above findings pertain to the analysis of DGPs 1-10, in which forecast errors are serially uncorrelated. Results for DGPs 11-14, in which errors
are serially correlated, are gathered in the supplemental appendix. The results for these DGPs (see supplemental Tables S1 and S2) are qualitatively the same as those reported above. Finally, it should be pointed out that the $T$ tests are not overly sensitive to the choice of $\eta$, and the empirical size of the $T$ tests appears “best” when $\eta$ is very small, as should be expected. In conclusion, there is a clear performance improvement when comparing our new robust predictive superiority tests with $S$ tests.¹⁰
5 Empirical Illustration: Robust Forecast Evaluation of SPF Expert Pools
In the real-time forecasting literature, predictions from econometric models are often compared with
surveys of expert forecasters.11 Such comparisons are important when assessing the implications asso-
ciated with using econometric models in policy setting contexts, for example. One key survey dataset
collecting expert predictions is the Survey of Professional Forecasters (SPF), which is maintained by
the Philadelphia Federal Reserve Bank (see Croushore (1993)). This dataset, formerly known as the
American Statistical Association/National Bureau of Economic Research Economic Outlook Survey, col-
lects predictions on various key economic indicators (including, for example, nominal GDP growth, real
GDP growth, prices, unemployment, and industrial production). For further discussion of the variables
contained in the SPF, refer to Croushore (1993) and Aiolfi, Capistrán, and Timmermann (2011). The
SPF has been examined in numerous papers. For example, Zarnowitz and Braun (1993) comprehensively
study the SPF, and find, among other things, that use of the mean or median provides a consensus
forecast with lower average errors than most individual forecasts. More recently, Aiolfi, Capistrán, and
Timmermann (2011) consider combinations of SPF survey forecasts, and find that equal weighted aver-
ages of survey forecasts outperform model based forecasts, although in some cases these mean forecasts
can be improved upon by averaging them with mean econometric model-based forecasts. When uti-
lizing European data from the recently released ECB SPF, Genre, Kenny, Meyler, and Timmermann
(2013) again find that it is very difficult to beat the simple average. This well known result pervades
the macroeconometric forecasting literature, and reasons for the success of such simple forecast averaging
¹⁰For a discussion of simulation results based on application of the Diebold and Mariano (DM: 1995) test (in which specific loss functions are utilized) in our experimental setup, refer to JCS (2017). Summarizing from that paper, it is clear that when the loss function is unknown, there is an advantage to using our approach of testing for forecast superiority. However, the DM test for pairwise comparisons, or a reality check test for multiple comparisons, might yield improved power for a given loss function. Indeed, under quadratic loss, JCS (2017) show that when the sample size is small, the DM test has better power performance than $S$ type tests. When the sample size increases, the power difference between the two tests becomes smaller. This is as expected.
¹¹See Fair and Shiller (1990), Swanson and White (1997a,b), Aiolfi, Capistrán and Timmermann (2011), and the references cited therein for further discussion.
are discussed in Timmermann (2006). He notes, among other things, that model misspecification related
to instability (non-stationarities) and estimation error in situations where there are many models and
relatively few observations may account to some degree for the success of simple forecast and model av-
eraging. Our empirical illustration attempts to shed further light on the issue of simple model averaging
and its importance in forecasting macroeconomic variables.
Our approach is to address the issue of forecast averaging and combination (called pooling) by viewing
the problem through the lens of forecast superiority testing. Our use of loss function robust tests is unique
to the SPF literature, to the best of our knowledge. Since we use robust forecast superiority tests, we do
not evaluate pooling by using loss function specific tests, such as those discussed in Diebold and Mariano
(1995), McCracken (2000), Corradi and Swanson (2003), and Clark and McCracken (2013). Additionally,
our approach differs from that taken by Elliott, Timmermann, and Komunjer (2005, 2008), where the
rationality of sequences of forecasts is evaluated by determining whether there exists a particular loss
function under which the forecasts are rational. We instead evaluate predictive accuracy irrespective of
the loss function implicitly used by the forecaster, and determine whether certain forecast combinations
are superior when compared against any loss function, regardless of how the forecasts were constructed.
In our tests, the benchmarks against which we compare our forecast combinations are simple average and
median consensus forecasts. We aim to assess whether the well documented success of these benchmark
combinations remains intact when they are compared against other combinations, under generic loss.12
In all of our experiments, we utilize SPF predictions of nominal GDP growth. The SPF is a quarterly
survey, and the dataset is available at the Philadelphia Federal Reserve Bank (PFRB) website. The
original survey began in 1968:Q4, and PFRB took control of it in 1990:Q2; but from that date, there
are only around 100 quarterly observations prior to 2018:Q1, where we end our sample. In our analysis
we thus use the entire dataset, which, after trimming to account for differing forecast horizons in our
calculations, is 166 observations.13,14
For our analysis, we consider 5 forecast horizons (i.e., $h = 0, 1, 2, 3, 4$). The reason we use $h = 0$ for one of the horizons is that the first horizon for which survey participants predict GDP growth is the quarter in which they are making their predictions. In light of this, forecasts made at $h = 0$ are called nowcasts. Moreover, it is worth noting that nowcasts are very important in policy making settings, since
12For an interesting discussion of machine learning and forecast combination methods, see Lahiri, Peng, and Zhao (2017);
and for a discussion of probability forecasting and calibrated combining using the SPF, see Lahiri, Peng, and Zhao (2015).
In these papers, various cases where consensus combinations do not “win” are discussed.13 It should be noted that the “timing” of the survey was not known with certainty prior to 1990. However, SPF
documentation states that they believe, although are not sure, that the timing of the survey was similar before and after
they took control of it.
¹⁴Note that the number of experts for which forecasts are recorded for each calendar date was approximately 90 during each of the 4 quarters of 1968, while there were only approximately 40 experts in each quarter in 2017. For further
details on the SPF dataset, refer to the documentation at https://www.philadelphiafed.org/research-and-data/real-time-
center/survey-of-professional-forecasters.
first release GDP data are not available until around the middle of the subsequent quarter. The nominal
GDP variable that we examine is called NGDP in the SPF. All test statistics are constructed using
NGDP growth rate prediction errors. In particular, assume that one survey participant makes a forecast of NGDP, say $\hat X_{t+h|\mathcal F_t}$.¹⁵ The associated forecast error is:
$$e_{t+h} = \left\{\ln\left(X_{t+h}\right) - \ln\left(X_{t}\right)\right\} - \left\{\ln\left(\hat X_{t+h|\mathcal F_t}\right) - \ln\left(X_{t}\right)\right\} = \ln\left(X_{t+h}\right) - \ln\left(\hat X_{t+h|\mathcal F_t}\right),$$
where the actual NGDP value, $X_{t+h}$, is reported in the SPF, along with the NGDP predictions of each survey participant. Note that when $h = 0$, $\mathcal F_t$ does not include $X_t$; however, for $h > 0$, $\mathcal F_t$ includes $X_t$. As discussed previously, this is due to the release dates associated with the availability of NGDP data.
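The error computation is simply a difference of log levels, since the $\ln(X_t)$ term common to the actual and predicted growth rates cancels; a minimal sketch:

```python
import numpy as np

def growth_forecast_error(actual, forecast, base):
    """Error in the predicted log growth rate from base period X_t:
    {ln(X_{t+h}) - ln(X_t)} - {ln(Xhat) - ln(X_t)}.  The ln(X_t)
    terms cancel, leaving ln(actual) - ln(forecast)."""
    g_actual = np.log(actual) - np.log(base)
    g_forecast = np.log(forecast) - np.log(base)
    return g_actual - g_forecast
```

A positive error means the expert under-predicted nominal GDP growth over the horizon.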
Figure 1 illustrates some of the key properties of the NGDP data that we utilize. Namely, note that
the distributions of the expert forecasts vary over time, and exhibit interesting skewness and kurtosis
properties (compare Panels A-D of the figure, and the skewness and kurtosis statistics reported below
the plots in the figure). Based on examination of the densities in Figure 1, one might wonder whether
“trimming” experts from the panel, say those experts that provided the forecasts appearing in the left
tails of the distributions, might improve overall predictive accuracy of the panel. Although this question
is not directly addressed in our analysis, we do construct and analyze the performance of various “pools”
formed by trimming experts that exhibit sub-par predictive accuracy, for example.
In addition to constructing $T^{+}_{P}$, $T^{-}_{P}$, $T^{C,+}_{P}$, and $T^{C,-}_{P}$ tests in our empirical investigation, we also test for forecast superiority using the JCS tests discussed above, which have correct size only under the least favorable case under the null. In particular, we construct $S^{+}_{P}$, $S^{-}_{P}$, $S^{C,+}_{P}$, and $S^{C,-}_{P}$ test statistics (see Section 2 and JCS (2017) for further details). All test statistics are calculated using the same parameter values (for $\epsilon$, $\eta$, $\kappa_P$, and $l_P$) as used in our Monte Carlo experiments. However, results are only reported for $\epsilon = 0.20$ and $\eta = 0.002$, since our findings remain unchanged when other values of $\epsilon$ and $\eta$ from our Monte Carlo experiments are used.
Two different benchmark models are considered, including (i) the arithmetic mean prediction from all
participants; and (ii) the median prediction from all participants. Additionally, a variety of alternative
model “groups” are considered. In all alternative models, mean and median predictions are again formed,
but this time using subsets of the total available panel of experts, chosen in a number of ways, as outlined
below.
Group 1 - Experts Chosen Based on Experience: Three expert pools (i.e. three alternative models)
consisting of experts with 1, 3, and 5 years of experience.
In all of the remaining groups of combinations, individuals are ranked according to average absolute
forecast errors, as well as according to average squared forecast errors. Mean (or median) predictions
from these groups are then compared with our benchmark combinations.
¹⁵Here, $\mathcal F_t$ denotes the information set available to the expert forecaster at the time their predictions are made.
Group 2 - Experts Chosen Based on Forecast Accuracy I: Three expert pools consisting of the most accurate expert over the last 1, 3, and 5 years.
Group 3 - Experts Chosen Based on Forecast Accuracy II: Three expert pools consisting of the most accurate group of 3 experts over the last 1, 3, and 5 years.
Group 4 - Experts Chosen Based on Forecast Accuracy III: Three expert pools consisting of the top 10% most accurate group of experts over the last 1, 3, and 5 years.
Group 5 - Experts Chosen Based on Forecast Accuracy IV: Three expert pools consisting of the top 25% most accurate group of experts over the last 1, 3, and 5 years.
Finally, 3 additional groups which combine models from each of Groups 1-5 are analyzed. These
include:
Group 6: Five expert pools, including one pool with experts that have 1 year of experience, and 4
additional pools, one from each of Groups 2-5, all defined over the last 1 year.
Group 7: Five expert pools, including one pool with experts that have 3 years of experience, and 4
additional pools, one from each of Groups 2-5, all defined over the last 3 years.
Group 8: Five expert pools, including one pool with experts that have 5 years of experience, and 4
additional pools, one from each of Groups 2-5, all defined over the last 5 years.
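A sketch of how an accuracy-based pool (e.g., a “top 10%” pool) can be formed recursively is given below (illustrative names; the 1, 3, and 5 year windows correspond to 4, 12, and 20 quarters, and ranking by squared errors is analogous):

```python
import numpy as np

def top_pool_forecast(errors, forecasts, top_frac=0.10, window=4):
    """errors, forecasts: (n_experts, T) arrays.  At each date t, rank
    experts by mean absolute error over the previous `window` quarters
    and average the forecasts of the top `top_frac` fraction."""
    n, T = errors.shape
    k = max(1, int(np.ceil(top_frac * n)))
    pooled = np.full(T, np.nan)
    for t in range(window, T):
        mae = np.nanmean(np.abs(errors[:, t - window:t]), axis=1)
        best = np.argsort(mae)[:k]          # smallest trailing MAE
        pooled[t] = np.nanmean(forecasts[best, t])
    return pooled
```

Only information available prior to date $t$ is used in the ranking, so the pooled forecast is a genuine real-time combination.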
As an example of how testing is performed, note that when implementing the tests using Group 1, there are three alternative models. The same is true when implementing tests using Groups 2-5. For Groups 6-8, tests are implemented using 5 alternative models, where one alternative is taken from each of Groups 1-5. Summarizing, we consider: (i) two benchmark models, against which each group of alternatives is compared; (ii) alternative models that are based on either mean or median pooled forecasts, for Groups 2-8; (iii) forecast accuracy pools, used in Groups 2-8, that are based on either average absolute forecast errors or average squared forecast errors; and (iv) 5 forecast horizons.
We now discuss our empirical findings. In Tables 3-4, statistics are reported for all forecast superiority tests. Entries are $T$, $T^{C}$, $S$, and $S^{C}$ test statistics, reported for forecast horizons $h = 0, 1, 2, 3, 4$. More specifically, $T = T^{+}_{P}$ if $\hat p^{+}_{T} \leq \hat p^{-}_{T}$; otherwise $T = T^{-}_{P}$. The other statistics reported in the tables (i.e., $T^{C}$, $S$, and $S^{C}$) are defined analogously. Rejections of the null of no forecast superiority at a 10% level are denoted by a superscript *. In Table 3, the benchmark model is always the arithmetic mean prediction from all participants, and expert pool forecasts are also arithmetic means. Analogously, in Table 4 the benchmark is the median prediction from all participants, and expert pool forecasts are also medians. To understand the layout of the tables, turn to Table 3, and note that for Group 1, the 4 statistics defined above (i.e., $T$, $T^{C}$, $S$, and $S^{C}$) are given for each forecast horizon, $h = 0, 1, 2, 3$, and 4. Superscripts denote rejection of the null hypothesis based on a particular test. For example, note that in Group 2 there are test rejections for horizons $h = 2$ and 4. Turning to the results summarized in the tables, a number of clear conclusions emerge.
First, the majority of test rejections occur for $h = 4$, as can be seen by inspection of the results in both Tables 3 and 4. In particular, note that for $h = 4$, there are 13 test rejections in Table 3 and 11 test rejections in Table 4, across Groups 1-8. On the other hand, for all other forecast horizons combined (i.e., $h \in \{0, 1, 2, 3\}$), there are 11 test rejections in Table 3 and 8 test rejections in Table 4. This suggests that expert pools which are constructed by “trimming” the least effective experts are most useful for longer horizon forecasts. These findings make sense if one assumes that it is easier to make short term forecasts than long term forecasts. Namely, some experts are simply not “up to the task” when forecasting at longer horizons. Summarizing, our main finding indicates that simple average or median forecasts can be beaten in cases where forecasts are more difficult to make (i.e., at longer horizons).
Second, “experience,” as measured by the length of time an expert has taken part in the SPF, is not a direct indicator of forecast superiority, since there are no rejections of our tests for Group 1 when either mean (see Table 3) or median (see Table 4) forecasts are used in our tests. This does not necessarily mean that experience does not matter, at least indirectly (notice that test rejections sometimes occur for Groups 6-8, where experience and accuracy traits are combined).¹⁶ Finally, note that Tables S1 and S2 in the supplemental appendix report root mean square forecast errors (RMSFEs) from the benchmark and competing models utilized in our empirical analysis. In these tables, we see that in the majority of cases considered, combination forecasts that utilize the mean have lower RMSFEs than when the median is used for constructing combination forecasts. For example, when comparing the benchmark RMSFEs of Group 1 that are reported in Tables S1 and S2, RMSFEs associated with mean combination forecasts (see Table S1) are lower for $h \in \{0, 2, 3, 4\}$ than the RMSFEs associated with median combination forecasts (see Table S2). This is interesting, given the clear asymmetry and long left tails associated with the distributions of expert forecasts exhibited in Figure 1, and suggests that outlier forecasts from “less accurate” experts are not overly influential when using measures of central tendency as ensemble forecasts.
Summarizing, we have direct evidence that judicious selection of pools of experts can lead to loss function robust forecast superiority. However, it should be stressed that in this illustration of the testing techniques developed in this paper, we do not consider various combination methods, including Bayesian model averaging, for example. Additionally, we only look at nominal GDP, although the SPF contains various other variables. Extensions such as these are left to future research.
¹⁶To explore this finding in more detail, we also constructed additional tables that are closely related to Tables 3 and 4, except that in these tables, RMSFEs are reported for all of the models used in each test (see supplemental appendix, Tables S1 and S2). In these tables, we see that combining experience with prior predictive accuracy can lead to lower RMSFEs, relative to the case where the entire pool of experts is used. However, RMSFEs are even lower for various alternative models for which we only use prior predictive accuracy to select expert pools (compare RMSFEs for Groups 3-5 with those for Groups 6-8 in the supplemental tables).
6 Concluding Remarks
We develop uniformly valid forecast superiority tests that are asymptotically non conservative, and that
are robust to the choice of loss function. Our tests are based on principles of stochastic dominance, which
can be interpreted as tests for infinitely many moment inequalities. In light of this, we use tools from
Andrews and Shi (2013, 2017) when developing our tests. The tests build on earlier work due to Jin,
Corradi, and Swanson (2017), and are meant to provide a class of predictive accuracy tests that are not
reliant on a choice of loss function, such as the Diebold and Mariano (1995) test discussed in McCracken
(2000). In developing the new tests, we establish uniform convergence (over error support) of HAC
variance estimators, and of their bootstrap counterparts. In a Supplement, we also extend the theory
of generalized moment selection testing to allow for the presence of non-vanishing parameter estimation
error. In a series of Monte Carlo experiments, we show that finite sample performance of our tests is quite
good, and that the power of our tests dominates that of the tests proposed by JCS (2017). Additionally, we carry
out an empirical analysis of the well known Survey of Professional Forecasters, and show that utilizing
expert pools based on past forecast quality can lead to loss function robust forecast superiority, when
compared with pools that include all survey participants. This finding is particularly prevalent for our
longest forecast horizon (i.e., 1-year ahead).
7 Appendix
Proof of Lemma 1: (i) The proof is the same for all $j$, so we suppress the $j$ subscript. Thus, let $g_t(x) = \left(1\left\{u_{j,t} \leq x\right\} - F_j(x)\right) - \left(1\left\{u_{1,t} \leq x\right\} - F_1(x)\right)$ and define
$$\tilde\sigma^{2,+}_{P}(x) = \frac{1}{P}\sum_{t=1}^{P} g_t^{2}(x) + \frac{2}{P}\sum_{s=1}^{l_P} w_s \sum_{t=s+1}^{P} g_t(x)\,g_{t-s}(x).$$
We first show that $\sup_{x\in\mathcal X^{+}}\left|\tilde\sigma^{2,+}_{P}(x) - \sigma^{2,+}(x)\right| = o_{\mathbb P}(1)$, and then we show that
$$\sup_{x\in\mathcal X^{+}}\left|\tilde\sigma^{2,+}_{P}(x) - \hat\sigma^{2,+}_{P}(x)\right| = o_{\mathbb P}(1). \qquad (7.1)$$
Now,
$$\sup_{x\in\mathcal X^{+}}\left|\tilde\sigma^{2,+}_{P}(x) - \sigma^{2,+}(x)\right| \leq \sup_{x\in\mathcal X^{+}}\left|\frac{1}{P}\sum_{t=1}^{P}\left(g_t^{2}(x) - \mathbb E\,g_t^{2}(x)\right) + \frac{2}{P}\sum_{s=1}^{l_P}w_s\sum_{t=s+1}^{P}\left(g_t(x)g_{t-s}(x) - \mathbb E\left[g_t(x)g_{t-s}(x)\right]\right)\right|$$
$$+ \sup_{x\in\mathcal X^{+}}\left|\sigma^{2,+}(x) - \frac{1}{P}\sum_{t=1}^{P}\mathbb E\,g_t^{2}(x) - \frac{2}{P}\sum_{s=1}^{l_P}w_s\sum_{t=s+1}^{P}\mathbb E\left[g_t(x)g_{t-s}(x)\right]\right|. \qquad (7.2)$$
We begin with the first term on the RHS of (7.2). First note that,
$$\sup_{x\in\mathcal X^{+}}\left|\frac{1}{P}\sum_{t=1}^{P}\left(g_t^{2}(x) - \mathbb E\,g_t^{2}(x)\right) + \frac{2}{P}\sum_{s=1}^{l_P}w_s\sum_{t=s+1}^{P}\left(g_t(x)g_{t-s}(x) - \mathbb E\left[g_t(x)g_{t-s}(x)\right]\right)\right|$$
$$\leq \sup_{x\in\mathcal X^{+}}\,2\sum_{s=0}^{l_P}\left|\frac{1}{P}\sum_{t=s+1}^{P}\left(g_t(x)g_{t-s}(x) - \mathbb E\left[g_t(x)g_{t-s}(x)\right]\right)\right|.$$
Now,
$$\Pr\left(\sup_{x\in\mathcal X^{+}}\,2\sum_{s=0}^{l_P}\left|\frac{1}{P}\sum_{t=s+1}^{P}\left(g_t(x)g_{t-s}(x) - \mathbb E\left[g_t(x)g_{t-s}(x)\right]\right)\right| \geq \varepsilon\right)$$
$$\leq \sum_{s=0}^{l_P}\Pr\left(\sup_{x\in\mathcal X^{+}}\left|\frac{1}{P}\sum_{t=s+1}^{P}\left(g_t(x)g_{t-s}(x) - \mathbb E\left[g_t(x)g_{t-s}(x)\right]\right)\right| \geq \frac{\varepsilon}{2 l_P}\right),$$
so that we need to show that, uniformly in $s$,
$$\Pr\left(\sup_{x\in\mathcal X^{+}}\left|\frac{1}{P}\sum_{t=s+1}^{P}\left(g_t(x)g_{t-s}(x) - \mathbb E\left[g_t(x)g_{t-s}(x)\right]\right)\right| \geq \frac{\varepsilon}{2 l_P}\right) = o\left(l_P^{-1}\right).$$
Given Assumption A2, WLOG, we can set $\mathcal X^{+} = [0,\Delta]$, so that it can be covered by $\Delta\varepsilon_P^{-1}$ balls $I_k$, $k = 1,\dots,\Delta\varepsilon_P^{-1}$, centered at $x_k$ and with radius $\varepsilon_P$. Then,
$$\sup_{x\in\mathcal X^{+}}\left|\frac{1}{P}\sum_{t=s+1}^{P}\left(g_t(x)g_{t-s}(x) - \mathbb E\left[g_t(x)g_{t-s}(x)\right]\right)\right| \leq \max_{k=1,\dots,\Delta\varepsilon_P^{-1}}\left|\frac{1}{P}\sum_{t=s+1}^{P}\left(g_t(x_k)g_{t-s}(x_k) - \mathbb E\left[g_t(x_k)g_{t-s}(x_k)\right]\right)\right|$$
$$+ \max_{k}\sup_{x\in I_k}2\left|\frac{1}{P}\sum_{t=s+1}^{P} g_{t-s}(x)\left(g_t(x)-g_t(x_k)\right) - \frac{1}{P}\sum_{t=s+1}^{P}\mathbb E\left[g_{t-s}(x)\left(g_t(x)-g_t(x_k)\right)\right]\right| + \text{smaller order terms}$$
$$= A_P + B_P.$$
Now,
$$B_P \leq \max_{k}\sup_{x\in I_k}\left|\frac{1}{P}\sum_{t=s+1}^{P} g_{t-s}(x)\left(g_t(x)-g_t(x_k)\right)\right| + \max_{k}\sup_{x\in I_k}\left|\frac{1}{P}\sum_{t=s+1}^{P}\mathbb E\left[g_{t-s}(x)\left(g_t(x)-g_t(x_k)\right)\right]\right| = B_{1,P} + B_{2,P}.$$
Given Assumption A1, and noting that by Cauchy-Schwarz,
$$B_{2,P} \leq \max_{k}\sup_{x\in I_k}\sqrt{\mathbb E\left(g_{t-s}(x)\right)^{2}}\ \max_{k}\sup_{x\in I_k}\sqrt{\mathbb E\left(g_t(x)-g_t(x_k)\right)^{2}} = O\left(\varepsilon_P^{1/2}\right),$$
for some constant $C$. Recalling that $g_t(x) = \left(1\left\{u_{j,t}\leq x\right\} - F_j(x)\right) - \left(1\left\{u_{1,t}\leq x\right\} - F_1(x)\right)$, and that each of the two terms in $g_t(x)$ stays between $-1$ and $1$,
$$B_{1,P} \leq 2\max_{k}\sup_{x\in I_k}\frac{1}{P}\sum_{t=s+1}^{P}\left|g_t(x) - g_t(x_k)\right|$$
$$\leq \frac{2}{P}\sum_{t=1}^{P} 1\left\{x_k - \varepsilon_P \leq u_{1,t} \leq x_k + \varepsilon_P\right\} + \frac{2}{P}\sum_{t=1}^{P} 1\left\{x_k - \varepsilon_P \leq u_{j,t} \leq x_k + \varepsilon_P\right\} + 2\sup_{x\in I_k}\left(\left|F_1(x) - F_1(x_k)\right| + \left|F_j(x) - F_j(x_k)\right|\right)$$
$$= O_{\mathbb P}\left(\varepsilon_P\right).$$
Hence, by the Chebyshev inequality,
$$\Pr\left(B_P \geq \frac{\varepsilon}{2 l_P}\right) = O\left(\varepsilon_P\,l_P^{3}\right) = o(1),$$
for $\varepsilon_P = o\left(l_P^{-3}\right)$.
Now, consider $A_P$. By the Lemma on page 739 of Hansen (2008), applied with the number of grid points $\Delta\varepsilon_P^{-1}$, and recalling that, given Assumption A1,
$$\mathrm{var}\left(\sum_{t=s+1}^{P}\left(g_t(x_k)g_{t-s}(x_k) - \mathbb E\left[g_t(x_k)g_{t-s}(x_k)\right]\right)\right) \leq C\,P,$$
it follows that, for some constant $C$,
$$\Pr\left(\max_{k=1,\dots,\Delta\varepsilon_P^{-1}}\left|\frac{1}{P}\sum_{t=s+1}^{P}\left(g_t(x_k)g_{t-s}(x_k) - \mathbb E\left[g_t(x_k)g_{t-s}(x_k)\right]\right)\right| \geq \frac{\varepsilon}{2 l_P}\right)$$
$$\leq \Delta\varepsilon_P^{-1}\,\Pr\left(\left|\sum_{t=s+1}^{P}\left(g_t(x_k)g_{t-s}(x_k) - \mathbb E\left[g_t(x_k)g_{t-s}(x_k)\right]\right)\right| \geq \frac{P\,\varepsilon}{2 l_P}\right) = o(1),$$
where the last equality follows because both the exponential and the polynomial tail terms in Hansen's bound are $o(1)$, given the mixing condition in Assumption A1 and the rate conditions on $\varepsilon_P$ and $l_P$.
We now consider the second term on the RHS of (7.2). Note that
$$\begin{aligned}
&\sup_{u\in\mathcal{X}^{+}}\left|\sigma^{2}(u)-\frac{1}{P}\sum_{t=1}^{P}E\big(f_t^{2}(u)\big)-2\sum_{s=1}^{l_P}w_{s}\,\frac{1}{P}\sum_{t=s+1}^{P}E\big(f_t(u)f_{t-s}(u)\big)\right|\\
&\le 2\sup_{u\in\mathcal{X}^{+}}\left|\sum_{s=1}^{l_P}(1-w_{s})\,\frac{1}{P}\sum_{t=s+1}^{P}E\big(f_t(u)f_{t-s}(u)\big)\right|\qquad(7.3)\\
&\quad+2\sup_{u\in\mathcal{X}^{+}}\left|\sum_{s=l_P+1}^{P-1}\frac{1}{P}\sum_{t=s+1}^{P}E\big(f_t(u)f_{t-s}(u)\big)\right|
\end{aligned}$$
The first term on the RHS of (7.3) is $o(1)$ by the same argument as that used in Theorem 2 of Newey and West (1987). Also, by Lemma 6.17 in White (1984), for $r>2$,
$$\big|E\big(f_t(u)f_{t-s}(u)\big)\big|\le C\,\alpha_s^{1-2/r}\,\mathrm{var}\big(f_t(u)\big)^{1/2}\,E\big\|f_t(u)\big\|_{r}$$
and
$$\begin{aligned}
&\sup_{u\in\mathcal{X}^{+}}\left|\sum_{s=l_P+1}^{P-1}\frac{1}{P}\sum_{t=s+1}^{P}E\big(f_t(u)f_{t-s}(u)\big)\right|\\
&\le\sup_{u\in\mathcal{X}^{+}}\mathrm{var}\big(f_t(u)\big)^{1/2}\,E\big\|f_t(u)\big\|_{r}\,C\sum_{s=l_P+1}^{\infty}\alpha_s^{1-2/r}=o(1)
\end{aligned}$$
as $\sum_{s}\alpha_s^{1-2/r}<\infty$, given Assumption A1, and noting that $r$ can be taken arbitrarily large because of the boundedness of $f_t(u)$.
Finally, by the same argument as that used in the proof of (7.2), for all $j$,
$$\sup_{u\in\mathcal{X}^{+}}\frac{1}{P}\sum_{t=1}^{P}\big(1\{e_{j,t}\le u\}-F_j(u)\big)=o_P(1)$$
The statement in (7.1) follows immediately.

(ii) By noting that,
$$\begin{aligned}
[e_{t}-u]^{+}-[\hat e_{t}-u]^{+}&=(e_{t}-\hat e_{t})1\{\hat e_{t}\ge u\}+(\hat e_{t}-u)\big(1\{e_{t}\ge u\}-1\{\hat e_{t}\ge u\}\big)\\
&\quad+(e_{t}-\hat e_{t})\big(1\{e_{t}\ge u\}-1\{\hat e_{t}\ge u\}\big)
\end{aligned}$$
the statement follows by the same argument as that used in part (i) of the proof.
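The positive-part decomposition used in part (ii) is a purely algebraic identity, and can be checked numerically. In the sketch below, `e` and `e_hat` are hypothetical stand-ins for the true and estimated forecast errors.

```python
import numpy as np

def pos(x):
    """Positive part [x]^+."""
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)
e = rng.normal(size=1000)                     # stand-in for true errors e_t
e_hat = e + rng.normal(scale=0.3, size=1000)  # stand-in for estimated errors
u = 0.25

lhs = pos(e - u) - pos(e_hat - u)
diff_ind = (e >= u).astype(float) - (e_hat >= u).astype(float)
rhs = ((e - e_hat) * (e_hat >= u)
       + (e_hat - u) * diff_ind
       + (e - e_hat) * diff_ind)
max_gap = np.max(np.abs(lhs - rhs))           # numerically zero
```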
Proof of Lemma 2: For notational simplicity, we suppress the $j$ subscript. Also, we suppress the superscripts $+$ and $++$, as the proof follows by an analogous argument. Note that
$$\begin{aligned}
&\sup_{u\in\mathcal{X}^{+}}\left|\widehat\sigma^{*2}(u)-E^{*}\big(\widehat\sigma^{*2}(u)\big)\right|\\
&\le\sup_{u\in\mathcal{X}^{+}}\sum_{i=1}^{b}\left|\left(\frac{1}{\sqrt{l}}\sum_{t=1}^{l}f^{*}_{(i-1)l+t}(u)\right)^{2}-E^{*}\left(\left(\frac{1}{\sqrt{l}}\sum_{t=1}^{l}f^{*}_{(i-1)l+t}(u)\right)^{2}\right)\right|\\
&=\sup_{u\in\mathcal{X}^{+}}\sum_{i=1}^{b}\left|\frac{1}{l}\sum_{t=1}^{l}\sum_{t'=1}^{l}\Big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)-E^{*}\big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)\big)\Big)\right|
\end{aligned}$$
Now,
$$\begin{aligned}
&\Pr\left(\sup_{u\in\mathcal{X}^{+}}\sum_{i=1}^{b}\left|\frac{1}{l}\sum_{t=1}^{l}\sum_{t'=1}^{l}\Big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)-E^{*}\big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)\big)\Big)\right|\ge\varepsilon_1\right)\\
&\le\sum_{i=1}^{b}\Pr\left(\sup_{u\in\mathcal{X}^{+}}\left|\frac{1}{l}\sum_{t=1}^{l}\sum_{t'=1}^{l}\Big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)-E^{*}\big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)\big)\Big)\right|\ge\frac{\varepsilon_1}{b}\right)
\end{aligned}$$
It suffices to show that, uniformly in $i$,
$$\Pr\left(\sup_{u\in\mathcal{X}^{+}}\left|\frac{1}{l}\sum_{t=1}^{l}\sum_{t'=1}^{l}\Big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)-E^{*}\big(f^{*}_{(i-1)l+t}(u)f^{*}_{(i-1)l+t'}(u)\big)\Big)\right|\ge\frac{\varepsilon_1}{b}\right)=o\big(b^{-1}\big)$$
This follows using the same "covering numbers" argument used in the proof of Lemma 1.
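The bootstrap second moment bounded above can be sketched with a simple moving-blocks construction. The block length, the resampling scheme, and the simulated $f_t(u)$ below are illustrative assumptions, not the paper's exact bootstrap.

```python
import numpy as np

def block_bootstrap_variance(f, block_len, rng):
    """Sketch of a block-bootstrap variance: draw whole blocks of the
    process f_t(u), scale each block sum by 1/sqrt(block_len), and
    average the squares. Returns one estimate per grid point u."""
    P = f.shape[0]
    n_blocks = P // block_len
    starts = rng.integers(0, P - block_len + 1, size=n_blocks)
    block_sums = np.stack(
        [f[s:s + block_len].sum(axis=0) / np.sqrt(block_len) for s in starts]
    )
    return (block_sums ** 2).mean(axis=0)

rng = np.random.default_rng(2)
f = rng.normal(size=(400, 11))     # stand-in for f_t(u) on an 11-point u-grid
v_star = block_bootstrap_variance(f, block_len=20, rng=rng)
```

Resampling whole blocks preserves the within-block dependence of $f_t(u)$, which is what allows the bootstrap variance to mimic the HAC variance uniformly in $u$.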
Proof of Theorem 1: We again suppress the superscripts $+$ and $++$, as the proof follows by the same argument. We need to show that the statement in Lemma A1 in the Supplemental Appendix of Andrews and Shi (2013) holds. Then, the proof of the theorem will follow using the same arguments as those used in the proof of their Theorem 1, as the proof is the same for independent and dependent observations. In fact, our set-up differs from that of Andrews and Shi (2013) only because we have dependent observations, and because we scale the statistic by a Newey-West variance estimator. For the rest of the proof, our set-up is simpler, as we can fix their instrument function at a given value, say zero. It suffices to show that:

(i) $\widehat\nu_{P}(\cdot)\Rightarrow\nu(\cdot)$, as a process indexed by $u\in\mathcal{X}^{+}$, where $\nu(\cdot)$ is a zero-mean, $(k-1)$-dimensional Gaussian process with covariance kernel given by $\Sigma(u,u')$;

(ii) $\sup_{u,u'\in\mathcal{X}^{+}}\big\|\widehat\Sigma(u,u')-\Sigma(u,u')\big\|=o_P(1)$.

Now, statement (ii) follows directly from Lemma 1. It remains to show that (i) holds. The key difference between the independent and the dependent cases is that in the former we can rely on the concept of manageability, while in the latter we cannot. Nevertheless, (i) follows if we can show that $\widehat\nu_{P}(\cdot)$ satisfies an empirical-process central limit theorem. Given A1-A3, this follows from Lemma A2 in Jin, Corradi and Swanson (2017).
Proof of Theorem 2: (i) For notational simplicity, we omit the superscript $+$. The proof of this theorem mirrors the proof of Theorem 2(a) in the Supplement of Andrews and Shi (2013). Let $c_{0,1-\alpha}(\varphi)$ be the critical value of $T^{\dagger}_{P}$, as defined in (3.5). Given Theorem 1(i), it follows that, for all $\delta>0$,
$$\limsup_{P\to\infty}\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\big(T_{P}\ge c_{0,1-\alpha}(\varphi)+\delta\big)\le\alpha$$
The statement follows if we can show that
$$\limsup_{P\to\infty}\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(c^{*}_{0,1-\alpha}\big(\widehat\varphi^{*}\big)\le c_{0,1-\alpha}\big(\widehat\varphi^{*}\big)\Big)=0\qquad(7.4)$$
with $c_{0,1-\alpha}(\widehat\varphi^{*})$ defined as $c_{0,1-\alpha}(\varphi)$, but with $\widehat\varphi^{*}$ an argument of this function rather than $\varphi(\cdot)$; and if we can show that
$$\limsup_{P\to\infty}\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(c_{0,1-\alpha}\big(\widehat\varphi^{*}\big)\le c_{0,1-\alpha}(\varphi)\Big)=0\qquad(7.5)$$
For $P\to\infty$ and $\eta\to0$, $\beta_{P}\to\infty$ and $\beta_{P}\kappa_{P}^{-1}\to0$,
$$\begin{aligned}
&\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(c^{*}_{0,1-\alpha}\big(\widehat\varphi^{*}\big)\le c_{0,1-\alpha}\big(\widehat\varphi^{*}\big)\Big)\\
&\le\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(-\widehat\varphi^{*}_{j}(u)\le\varphi_{j}(u)\text{ for some }u\in\mathcal{X}^{+}\text{ and some }j=2,\ldots,k\Big)\\
&\le\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(\varphi_{j}(u)>\kappa_{P}\beta_{P}^{-1}\text{ AND }-\eta\le\widehat\varphi^{*}_{j}(u)\text{ for some }u\in\mathcal{X}^{+}\text{ and }j=2,\ldots,k\Big)\\
&\le\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(\kappa_{P}^{-1}P^{1/2}\widehat\sigma_{j}^{-1}(u)\big(\bar m_{j}(u)-m_{j}(u)\big)+\kappa_{P}^{-1}P^{1/2}\widehat\sigma_{j}^{-1}(u)m_{j}(u)>\beta_{P}^{-1}\\
&\qquad\qquad\text{ AND }-\eta\le\widehat\varphi^{*}_{j}(u)\text{ for some }u\in\mathcal{X}^{+}\text{ and }j=2,\ldots,k\Big)\\
&\le\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(-\eta+\kappa_{P}^{-1}P^{1/2}\widehat\sigma_{j}^{-1}(u)m_{j}(u)>\beta_{P}^{-1}\text{ AND }-\eta\le\widehat\varphi^{*}_{j}(u)\text{ for some }u\in\mathcal{X}^{+}\text{ and }j=2,\ldots,k\Big)\\
&\quad+\sup_{F\in\mathcal{P}_{0}}\Pr_{F}\Big(\kappa_{P}^{-1}P^{1/2}\widehat\sigma_{j}^{-1}(u)\big(\bar m_{j}(u)-m_{j}(u)\big)>\eta\text{ for some }u\in\mathcal{X}^{+}\text{ and }j=2,\ldots,k\Big)\\
&=o(1)
\end{aligned}$$
This establishes that (7.4) holds. Finally, (7.5) follows from Lemma 1 and Lemma 2.
(ii) Recall that $c^{*}_{0,1-\alpha}(\widehat\varphi^{*})$ is the $(1-\alpha)$-percentile of $T^{*}_{P}$, as defined in (3.11); and define $c_{0,1-\alpha}(\bar\nu)$ to be the $(1-\alpha)$-percentile of $\bar T_{P}$, where
$$\bar T_{P}=\max_{u\in\mathcal{X}^{+}}\sum_{j=2}^{k}\left(\max\left\{0,\frac{\bar\nu_{j}(u)-\varphi_{j}(u)}{\sqrt{\widehat\omega_{j}(u)}}\right\}\right)^{2}$$
with $\bar\nu=(\bar\nu_{2},\ldots,\bar\nu_{k})'$ a $(k-1)$-dimensional Gaussian process, with mean zero and covariance $\widehat\Omega(u,u')=\widehat D^{-1/2}(u)\Sigma(u,u')\widehat D^{-1/2}(u')$. Finally, let $\nu=(\nu_{2},\ldots,\nu_{k})'$ be a $(k-1)$-dimensional Gaussian process, with mean zero and covariance $\Omega(u,u')=D^{-1/2}(u)\Sigma(u,u')D^{-1/2}(u')$. We first need to show that
$$c^{*}_{0,1-\alpha}\big(\widehat\varphi^{*}\big)-c_{0,1-\alpha}\big(\bar\nu\big)=o_P(1)\qquad(7.6)$$
and then to prove that the statement holds when replacing $c^{*}_{0,1-\alpha}(\widehat\varphi^{*})$ with $c_{0,1-\alpha}(\bar\nu)$.

From Lemma 2, $\widehat\Sigma^{*}(u,u')-\widehat\Sigma(u,u')=o_{P^{*}}(1)$, and so $\widehat\Omega^{*}(u,u')-\widehat\Omega(u,u')=o_{P^{*}}(1)$. Then, by Theorem 2.3 in Peligrad (1998),
$$\widehat\nu^{*}_{P}\overset{*}{\Longrightarrow}\bar\nu,\quad\text{a.s.-}\omega$$
where $\overset{*}{\Longrightarrow}$ denotes weak convergence, conditional on the sample. As $\widehat\nu_{P}\Longrightarrow\nu$, (7.6) follows. Given Assumption A4, by Lemma B3 in the Supplement of Andrews and Shi (2013), the distribution of $T^{\dagger}_{\infty}$, as defined in (3.6), is continuous. It is also strictly increasing, and its $(1-\alpha)$-quantile is strictly positive, for all $\alpha<1/2$. The statement then follows by the same argument as that used in the proof of Theorem 2(b) in the Supplement of Andrews and Shi (2013).
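Operationally, the bootstrap critical value compared in (7.6) is just an empirical percentile of the resampled statistics. A minimal sketch follows; the chi-square draws standing in for the $T^{*}_{P}$ replications and the small uplift `eta` (common in moment-selection procedures) are illustrative assumptions.

```python
import numpy as np

def bootstrap_critical_value(boot_stats, alpha=0.05, eta=1e-6):
    """(1 - alpha) empirical percentile of the bootstrap statistics,
    plus a small uplift eta."""
    return np.quantile(boot_stats, 1.0 - alpha) + eta

rng = np.random.default_rng(3)
boot_stats = rng.chisquare(df=3, size=999)   # stand-in for T*_P draws
cv = bootstrap_critical_value(boot_stats)
sample_stat = 12.0                           # stand-in for the sample statistic
reject_null = bool(sample_stat > cv)         # one-sided rejection rule
```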
(iii)-(iv) follow by the same arguments as those used in the proofs of (i) and (ii), respectively. In the case of $T^{+}_{P}$, we rely on the stochastic equicontinuity of $\frac{1}{\sqrt{P}}\sum_{t=1}^{P}\big(1\{\hat e_{1,t}\le u\}-1\{e_{1,t}\le u\}\big)$ as $|\hat e_{1,t}-e_{1,t}|\to0$. When considering $T^{++}_{P}$, we need to ensure the stochastic equicontinuity of $\frac{1}{\sqrt{P}}\sum_{t=1}^{P}\big((\hat e_{1,t}-u)^{+}-(e_{1,t}-u)^{+}\big)$. Now,
$$\frac{1}{\sqrt{P}}\sum_{t=1}^{P}\big((\hat e_{1,t}-u)^{+}-(e_{1,t}-u)^{+}\big)=\frac{1}{\sqrt{P}}\sum_{t=1}^{P}(\hat e_{1,t}-e_{1,t})1\{\hat e_{1,t}\ge u\}+\frac{1}{\sqrt{P}}\sum_{t=1}^{P}(e_{1,t}-u)\big(1\{\hat e_{1,t}\ge u\}-1\{e_{1,t}\ge u\}\big)$$
which, given Assumption A2, is stochastically equicontinuous, by the same argument as that used for $T^{+}_{P}$. Hence, Theorem 2.3 in Peligrad (1998) also holds in this case.
Proof of Theorem 3: (i) Without loss of generality, let $\bar{\mathcal{X}}^{+}=\{u\in\mathcal{X}^{+}:m^{+}_{2}(u)>0\}$, and note that, for all $u\in\bar{\mathcal{X}}^{+}$,
$$\max\left\{0,\frac{\sqrt{P}\,\bar m^{+}_{2}(u)}{\widehat\sigma^{+}_{2}(u)}\right\}=\frac{\sqrt{P}\,\bar m^{+}_{2}(u)}{\widehat\sigma^{+}_{2}(u)}$$
Thus,
$$\begin{aligned}
T^{+}_{P}&=\int_{\bar{\mathcal{X}}^{+}}\sum_{j=2}^{k}\left(\max\left\{0,\frac{\sqrt{P}\,\bar m^{+}_{j}(u)}{\widehat\sigma^{+}_{j}(u)}\right\}\right)^{2}dG(u)+\int_{\mathcal{X}^{+}\setminus\bar{\mathcal{X}}^{+}}\sum_{j=2}^{k}\left(\max\left\{0,\frac{\sqrt{P}\,\bar m^{+}_{j}(u)}{\widehat\sigma^{+}_{j}(u)}\right\}\right)^{2}dG(u)\\
&=\int_{\bar{\mathcal{X}}^{+}}\left(\frac{\sqrt{P}\,\bar m^{+}_{2}(u)}{\widehat\sigma^{+}_{2}(u)}\right)^{2}dG(u)+\int_{\bar{\mathcal{X}}^{+}}\sum_{j=3}^{k}\left(\max\left\{0,\frac{\sqrt{P}\,\bar m^{+}_{j}(u)}{\widehat\sigma^{+}_{j}(u)}\right\}\right)^{2}dG(u)\\
&\quad+\int_{\mathcal{X}^{+}\setminus\bar{\mathcal{X}}^{+}}\sum_{j=2}^{k}\left(\max\left\{0,\frac{\sqrt{P}\,\bar m^{+}_{j}(u)}{\widehat\sigma^{+}_{j}(u)}\right\}\right)^{2}dG(u)\\
&=I_{P}+II_{P}+III_{P}
\end{aligned}$$
Now, $I_{P}$ diverges to infinity with probability approaching one, while Theorem 1 ensures that $II_{P}$ and $III_{P}$ are $O_{P}(1)$. Thus, $T^{+}_{P}$ diverges to infinity. As $T^{*+}_{P}$ is $O_{P^{*}}(1)$, conditional on the sample, the statement follows.

(ii) Note that $T^{++}_{P}$ can be treated exactly as $T^{+}_{P}$.
Proof of Theorem 4: (i) Define $T^{\dagger+}_{\infty}$ as in (3.6), but with the vector $m^{+}_{\infty}(u)$ having at least one component strictly bounded away from zero, and finite, for all $u\in\bar{\mathcal{X}}^{+}$. Let $\mathcal{P}_{1,P}$ denote the set of probabilities under the sequence of local alternatives. We have that, for all $x>0$,
$$\limsup_{P\to\infty}\sup_{F\in\mathcal{P}_{1,P}}\Big[\Pr_{F}\big(T^{+}_{P}\ge x\big)-\Pr\big(T^{\dagger+}_{\infty}\ge x\big)\Big]=0$$
and the distribution of $T^{\dagger+}_{\infty}$ is continuous at its $(1-\alpha)$-quantile, for all $0<\alpha<1/2$. Also, note that, for all $u\in\bar{\mathcal{X}}^{+}$, $\varphi^{+}_{j}(u)=0$. The statement then follows by the same argument as that used in the proof of Theorem 2(ii). (ii) By the same argument as in part (i).
8 References
Aiolfi, M., C. Capistrán, and A. Timmermann (2011). Forecast Combinations. In M.P. Clements and
D.F. Hendry (eds.), Oxford Handbook of Economic Forecasting, pp. 355-390, Oxford University
Press, Oxford.
Andrews, D.W.K. (1991). Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation. Econometrica, 59, 817-858.
Andrews, D.W.K. and D. Pollard (1994). An Introduction to Functional Central Limit Theorems for
Dependent Stochastic Processes. International Statistical Review, 62, 119-132.
Andrews, D.W.K. and X. Shi (2013). Inference Based on Conditional Moment Inequalities. Econometrica,
81, 609-666.
Andrews, D.W.K. and X. Shi (2017). Inference Based on Many Conditional Moment Inequalities. Journal
of Econometrics, 196, 275-287.
Barendse, S. and A.J. Patton (2019). Comparing Predictive Accuracy in the Presence of a Loss Function
Shape Parameter. Working Paper, Duke University.
Bierens H.J. (1982). Consistent Model Specification Tests. Journal of Econometrics, 20, 105-134.
Bierens H.J. (1990). A Consistent Conditional Moment Test of Functional Form. Econometrica, 58, 1443-1458.
Clark, T. and M. McCracken (2013). Advances in Forecast Evaluation. In G. Elliott, C.W.J. Granger
and A. Timmermann (eds.), Handbook of Economic Forecasting Vol. 2, pp. 1107-1201, Elsevier,
Amsterdam.
Corradi, V. and N.R. Swanson (2003). Predictive Density Evaluation. In G. Elliott, C.W.J. Granger
and A. Timmermann (eds.), Handbook of Economic Forecasting Vol. 1, pp. 197-284, Elsevier,
Amsterdam.
Corradi, V. and N.R. Swanson (2007). Nonparametric Bootstrap Procedures for Predictive Inference
Based on Recursive Estimation Schemes. International Economic Review, 48, 67-109.
Corradi, V. and N. R. Swanson (2013). A Survey of Recent Advances in Forecast Accuracy Comparison
Testing, with an Extension to Stochastic Dominance. In X. Chen and N.R. Swanson (eds.), Causality,
Prediction, and Specification Analysis: Recent Advances and Future Directions, Essays in
honor of Halbert L. White, Jr., pp. 121-144, Springer, New York.
Croushore, D. (1993). Introducing: The Survey of Professional Forecasters. Federal Reserve Bank of Philadelphia Business Review, November-December, 3-15.
Diebold, F. X. and Mariano, R. S. (1995). Comparing Predictive Accuracy. Journal of Business and
Economic Statistics, 13, 253-263.
Diebold, F.X. and M. Shin (2015). Assessing Point Forecast Accuracy by Stochastic Loss Distance.
Economics Letters, 130, 37-38.
Diebold, F.X. and M. Shin (2017). Assessing Point Forecast Accuracy by Stochastic Error Distance.
Econometric Reviews, 36, 588-598.
Elliott, G., I. Komunjer and A. Timmermann (2005). Estimation and Testing of Forecast Rationality
under Flexible Loss. Review of Economic Studies, 72, 1107-1125.
Elliott, G., I. Komunjer and A. Timmermann (2008). Biases in Macroeconomic Forecasts: Irrationality
of Asymmetric Loss? Journal of the European Economic Association, 6, 122-157.
Fair, R.C. and R.J. Shiller (1990). Comparing Information in Forecasts from Econometric Models. Amer-
ican Economic Review, 80, 375-389.
Genre, V., G. Kenny, A. Meyler, and A. Timmermann (2013). Combining the Forecasts in the ECB
Survey of Professional Forecasters: Can Anything Beat the Simple Average? International Journal of
Forecasting, 29, 108-121.
Gneiting, T. (2011). Making and Evaluating Point Forecasts. Journal of the American Statistical Associ-
ation, 106, 746-762.
Granger, C. W. J. (1999). Outline of Forecast Theory using Generalized Cost Functions. Spanish
Economic Review, 1, 161-173.
Hansen, B.E. (2008). Uniform Convergence Rates for Kernel Estimators with Dependent Data. Econo-
metric Theory, 24, 726-748.
Holm, S. (1979). A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of
Statistics, 6, 65-70.
Jin, S., V. Corradi and N.R. Swanson (2017). Robust Forecast Comparison. Econometric Theory, 33,
1306-1351.
Lahiri, K., H. Peng, and Y. Zhao (2015). Testing the Value of Probability Forecasts for Calibrated
Combining. International Journal of Forecasting, 31, 113-129.
Lahiri, K., H. Peng, and Y. Zhao (2017). Online Learning and Forecast Combination in Unbalanced
Panels. Econometric Reviews, 36, 257-288.
Linton, O., K. Song and Y.J. Whang (2010). An Improved Bootstrap Test of Stochastic Dominance.
Journal of Econometrics, 154, 186-202.
McCracken, M.W. (2000). Robust Out-of-Sample Inference. Journal of Econometrics, 9