Optimal Forecast under Structural Breaks with Application to
Equity Premium
Shahnaz Parsaeian∗
October 28, 2019
Abstract
This paper develops an optimal combined estimator to forecast out-of-sample under structural breaks. When it comes to forecasting, using only the post-break observations after the most recent break point may not be optimal, especially when there are not enough observations in the post-break sample. In this paper we propose different estimation methods that exploit the pre-break information. In particular, we show how to combine the estimator using the full sample (i.e., both the pre-break and post-break data) and the estimator using only the post-break sample. The full-sample estimator is inconsistent when there is a break, while it is efficient when there is no break. The post-break estimator is consistent when there is a break, while it is less efficient when there is no break. Hence, depending on the severity of the breaks, the full-sample estimator and the post-break estimator can be combined to balance consistency and efficiency. We derive several Stein-like combined estimators of the full-sample estimator and the post-break estimator to balance the bias-variance trade-off. The combination weight depends on the break severity, which we measure by the Hausman statistic. We derive the optimal combining weight by minimizing the mean square forecast error. We also introduce a semi-parametric estimator which facilitates the idea without having to compute the Hausman statistic in combining full-sample information and post-break sample information. We examine the properties of the proposed methods analytically in theory, numerically by simulation, and empirically in predicting the equity premium.
Keywords: Forecasting, Structural breaks, Stein-like combined estimator, Predicting equity premium
∗Department of Economics, University of California, Riverside, CA 92521. E-mail: spars002@ucr.edu
1 Introduction
In a regression framework, structural breaks can be thought of as a regression equation with a shift in some or all of the regression coefficients and/or the error term. One of the biggest concerns of economists is how to obtain an accurate forecast in the presence of structural break(s). Since the seminal work of Bates and Granger (1969), forecast combination has been an effective way to improve forecasting performance. Especially under model instability, forecast performance can be boosted by forecast combination methods; see Diebold and Pauly (1987), Clements and Hendry (1998, 1999, 2006), Stock and Watson (2004), Pesaran and Timmermann (2005, 2007), Timmermann (2006), Pesaran and Pick (2011), Rossi (2013), and Pesaran et al. (2013), inter alia. In forecast combination, two choices require close attention: first, the selected forecast models, and second, the combination weights. The common method for forecasting under structural breaks is to use a post-break estimator. But this estimator by itself may not be accurate when the break point is close to the end of the sample, so that there are not enough observations. If the distance to the break is short, then the post-break parameters are likely to be poorly estimated relative to those obtained using pre-break data. Therefore, as pointed out by Pesaran and Timmermann (2007) and Pesaran et al. (2013), it is not always optimal to base the forecast only on post-break observations. In fact, using pre-break data biases the forecast but also reduces the forecast error variance, which helps to lower the MSFE.
A key question under the presence of structural breaks is how to include the pre-break data in estimating the parameters of the forecasting model such that a loss function, like the MSFE, is minimized. In this paper we propose a combined estimator of the post-break estimator and the full-sample estimator, which uses all observations in the sample, t = 1, . . . , T. The combination weight takes the form of a James-Stein weight; see Stein (1956) and James and Stein (1961). Maasoumi (1978) introduced a Stein-like estimator for simultaneous equations, and Hansen (2016, 2017) has used this Stein-type weight in different contexts. See Saleh (2006) for a comprehensive treatment of Stein-type estimators.
The current paper does not focus on the classical problem of identifying the break points. Instead, the goal of this paper is to propose an estimator with minimum MSFE under the assumption that a structural break has in fact occurred. For estimation of the break points, we follow Bai and Perron (1998, 2003), who estimate the break dates as global minimizers of the sum of squared errors, a procedure that is consistent. Bai and Perron (2003) propose an efficient dynamic programming algorithm whose computing cost remains small even with a large sample size. We also assume that the break points are bounded away from the beginning and end of the sample, which is crucial for consistency of the estimated break points; see Andrews (1993) and Bai and Perron (1998, 2003).
This paper has two main contributions. First, we introduce the Stein-like combined estimator of the full-sample estimator and the post-break estimator, which trades off bias against forecast error variance. The full-sample estimator (which uses the pre-break and post-break data) is biased under structural breaks, but at the same time it decreases the forecast error variance because it uses more observations. The post-break estimator (which is the standard solution under the structural break model) is unbiased under structural breaks but less efficient. As discussed by Pesaran and Timmermann (2007), Pesaran and Pick (2011), and Pesaran et al. (2013), the forecast performance measured by MSFE can be improved by trading off bias against forecast error variance. Our proposed Stein-like combined estimator exploits this trade-off between the bias and variance of the two estimators. The main difference between our paper and Hansen (2017), who uses the Stein-type weight in a different context, is that we minimize the weighted risk under a general positive semi-definite weight matrix and do not restrict the theoretical results to a specific choice of weight. We show that if we set this weight to the inverse of the difference between the variances of the estimators, then we obtain results similar to Hansen (2017). Besides, we investigate the performance of our combined estimator with an alternative weight κ ∈ [0, 1] and compare the properties of this estimator with the Stein-like combined estimator.
The second contribution of this paper is a semi-parametric method for the combined estimator, which facilitates implementation of our theoretical point that using pre-break data can improve forecasting performance. This method applies a kernel weight to the observations. The motivation for the semi-parametric method comes from Li, Ouyang, and Racine (2013), who focus on categorical variables; examples of categorical variables are race, sex, age group and educational level. Inspired by the categorical data setting, we introduce the semi-parametric method for time series analysis and apply it specifically to structural break models. The advantage of this method is that the combination weights can be found numerically by Cross-Validation (CV) without being affected by the estimation errors of other parameters. We prove theoretically that the weight derived by CV is optimal in the sense of the optimality criterion introduced by K.-C. Li (1987): the CV weight is asymptotically equivalent to the infeasible optimal weight.
We undertake an empirical analysis for the equity premium that compares the forecasting performance of our proposed Stein-like combined estimator and several alternative methods, including the post-break estimator, using monthly, quarterly and annual data frequencies from 1927 to 2017. This analysis, which allows for multiple breaks at unknown dates, confirms the advantage of using pre-break data in estimating the forecasting model relative to forecasting with only the post-break data. Besides, Welch and Goyal (2008) find that it is hard to outperform a simple historical average with more complex models for predicting excess equity returns. We therefore also compare our results with the historical average, and the results confirm the outperformance of our combined estimator over the simple historical average.
The outline of the paper is as follows. Section 2 sets up the model under structural breaks and introduces the Stein-like combined estimator and its risk. For simplicity, we discuss the problem under a single break, but the generalization of the method to multiple breaks is straightforward. Section 3 proposes a combined estimator with weight κ ∈ [0, 1] and compares the asymptotic risk of this estimator with the Stein-like combined estimator. Section 4 introduces the semi-parametric combined estimator. Section 5 discusses an alternative combined estimator proposed by Pesaran et al. (2013). Section 6 extends the model to multiple structural breaks. Section 7 reports Monte Carlo simulations, and the empirical example is in Section 8. Finally, Section 9 concludes.
2 The structural break model
Consider the linear structural break model $y_t = x_t'\beta_t + \sigma_t\varepsilon_t$, in which the $k \times 1$ vector of coefficients, $\beta_t$, and the error variance, $\sigma_t$, are subject to one break ($m = 1$) at time $T_1$:
\[
\beta_t = \begin{cases} \beta_{(1)} & \text{if } 1 < t \leq T_1 \\ \beta_{(2)} & \text{if } T_1 < t < T, \end{cases} \qquad (1)
\]
and
\[
\sigma_t = \begin{cases} \sigma_{(1)} & \text{if } 1 < t \leq T_1 \\ \sigma_{(2)} & \text{if } T_1 < t < T. \end{cases} \qquad (2)
\]
So we can rewrite the model as:
\[
y_t = \begin{cases} x_t'\beta_{(1)} + \sigma_{(1)}\varepsilon_t & \text{if } 1 < t \leq T_1 \\ x_t'\beta_{(2)} + \sigma_{(2)}\varepsilon_t & \text{if } T_1 < t < T, \end{cases} \qquad (3)
\]
where $x_t$ is $k \times 1$, $\varepsilon_t \sim \text{i.i.d.}(0, 1)$, $t \in \{1, \ldots, T\}$, and $T_1$ is the break point with $1 < T_1 < T$. In this setup we have only one break (two regimes). Assume that we know the break point.
2.1 Stein-like combined estimator
Our proposed combined estimator of $\beta$ is
\[
\hat{\beta}_\alpha = \alpha\hat{\beta}_{\text{Full}} + (1 - \alpha)\hat{\beta}_{(2)}, \qquad (4)
\]
where $\hat{\beta}_\alpha$ is the $k \times 1$ combined estimator; $\hat{\beta}_{\text{Full}}$ is the estimator under the assumption that there is no break in the model, so the coefficient is estimated using all observations in the sample, $t \in \{1, \ldots, T\}$; and $\hat{\beta}_{(2)}$ is for the case of a break, estimating the coefficient using only the post-break observations. We call this latter estimator the post-break estimator. Following Hansen (2017), we define the combination weight as:
\[
\alpha = \begin{cases} \dfrac{\tau}{H_T} & \text{if } H_T \geq \tau \\ 1 & \text{if } H_T < \tau, \end{cases} \qquad (5)
\]
where $\tau$ controls the degree of shrinkage and $H_T$ is the Hausman test statistic,
\[
H_T = T\,(\hat{\beta}_{(2)} - \hat{\beta}_{\text{Full}})'(V_{(2)} - V_F)^{-1}(\hat{\beta}_{(2)} - \hat{\beta}_{\text{Full}}), \qquad (6)
\]
where $V_F$ and $V_{(2)}$ are the variances of the full-sample estimator and the post-break estimator, respectively. This type of shrinkage weight was originally introduced by James and Stein (1961). The degree of shrinkage depends on the ratio $\tau/H_T$. When $H_T < \tau$, $\alpha = 1$ and $\hat{\beta}_\alpha = \hat{\beta}_{\text{Full}}$. When $H_T \geq \tau$, $\hat{\beta}_\alpha$ is a weighted average of the full-sample estimator and the post-break estimator.
Note that it can be shown that $\mathrm{cov}(\hat{\beta}_{(2)}, \hat{\beta}_{\text{Full}}) = V_F$, where $V_F < V_{(2)}$; that is, the covariance between the two estimators equals the variance of the efficient estimator. The idea behind the Hausman test is that the efficient estimator, $\hat{\beta}_{\text{Full}}$, must have zero asymptotic covariance with $\hat{\beta}_{(2)} - \hat{\beta}_{\text{Full}}$ under the null. If this condition does not hold, it means that there exists another linear combination with smaller asymptotic variance than $\hat{\beta}_{\text{Full}}$, which is assumed to be asymptotically efficient. Since this condition holds for our combined estimator, we can apply the Hausman test. See Hausman (1978) for more discussion.
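As a concrete illustration, the following sketch computes the combined estimator in equations (4)-(6) for a single known break date. It is a minimal implementation under simplifying assumptions (within-regime homoskedasticity, regime variances estimated from OLS residuals); the function name and interface are hypothetical and not part of the paper.

import numpy as np

def stein_combined(X, y, T1, tau):
    """Stein-like combined estimator (eqs. 4-6): alpha*b_full + (1-alpha)*b_post."""
    X1, y1, X2, y2 = X[:T1], y[:T1], X[T1:], y[T1:]
    k = X.shape[1]

    # Regime-wise OLS and residual variances (ingredients of the two-step GLS)
    b1 = np.linalg.solve(X1.T @ X1, X1.T @ y1)
    b_post = np.linalg.solve(X2.T @ X2, X2.T @ y2)            # post-break estimator
    s1 = np.sum((y1 - X1 @ b1) ** 2) / (len(y1) - k)
    s2 = np.sum((y2 - X2 @ b_post) ** 2) / (len(y2) - k)

    # Full-sample feasible GLS estimator (no break in beta, break in variance allowed)
    A = X1.T @ X1 / s1 + X2.T @ X2 / s2
    b_full = np.linalg.solve(A, X1.T @ y1 / s1 + X2.T @ y2 / s2)

    # Finite-sample covariance matrices; the factor T in eq. (6) is absorbed here
    V_full = np.linalg.inv(A)
    V_post = s2 * np.linalg.inv(X2.T @ X2)

    # Hausman statistic (eq. 6) and Stein weight (eq. 5)
    d = b_post - b_full
    H = float(d @ np.linalg.solve(V_post - V_full, d))
    alpha = 1.0 if H < tau else tau / H

    return alpha * b_full + (1 - alpha) * b_post, alpha, H

With the weight matrix W = (V_(2) - V_F)^{-1}, Remark 2 below suggests the choice tau = k - 2, so a natural call is stein_combined(X, y, T1, tau=X.shape[1] - 2).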
2.2 Asymptotic distribution
In this part, we develop the asymptotic distributions of the estimators under a local asymptotic framework. Assume that the break size is local to zero; specifically, $\beta_{(1)} = \beta_{(2)} + \frac{\delta_1}{\sqrt{T}}$, in which $\delta_1$ captures the break size in the coefficients. We adopt this local-to-zero assumption in order to obtain a meaningful asymptotic distribution for the full-sample estimator; otherwise the risk explodes.
2.2.1 Full-sample estimator: There is no break in the model
The full-sample estimator is constructed under the null hypothesis that there is no break in the model, $\beta_{(1)} = \beta_{(2)}$, so it uses all of the observations to estimate $\beta$. This is in line with the fact that for small break sizes, ignoring the break and estimating the coefficient with all of the observations usually results in a better forecast (lower MSFE); see Boot and Pick (2017). Under the local alternative, the break size is $\beta_{(1)} = \beta_{(2)} + \frac{\delta_1}{\sqrt{T}}$. We denote this full-sample estimator by $\hat{\beta}_{\text{Full}}$ and estimate the coefficient by Generalized Least Squares (GLS) as:
\[
\begin{aligned}
\hat{\beta}_{\text{Full}} &= \big(X'\Omega^{-1}X\big)^{-1}X'\Omega^{-1}Y
= \Big(\sum_{t=1}^{T} x_t x_t'\frac{1}{\sigma_t^2}\Big)^{-1}\sum_{t=1}^{T} x_t y_t\frac{1}{\sigma_t^2} \\
&= \Big(\sum_{t=1}^{T} x_t x_t'\frac{1}{\sigma_t^2}\Big)^{-1}\Big[\sum_{t=1}^{T_1} x_t x_t'\frac{\beta_{(1)}}{\sigma_{(1)}^2} + \sum_{t=T_1+1}^{T} x_t x_t'\frac{\beta_{(2)}}{\sigma_{(2)}^2} + \sum_{t=1}^{T}\frac{x_t\sigma_t\varepsilon_t}{\sigma_t^2}\Big] \\
&= \Big(\sum_{t=1}^{T} x_t x_t'\frac{1}{\sigma_t^2}\Big)^{-1}\Big[\sum_{t=1}^{T_1} x_t x_t'\frac{\beta_{(1)}}{\sigma_{(1)}^2} - \sum_{t=1}^{T_1} x_t x_t'\frac{\beta_{(2)}}{\sigma_{(1)}^2} + \sum_{t=1}^{T_1} x_t x_t'\frac{\beta_{(2)}}{\sigma_{(1)}^2} + \sum_{t=T_1+1}^{T} x_t x_t'\frac{\beta_{(2)}}{\sigma_{(2)}^2} + \sum_{t=1}^{T}\frac{x_t\sigma_t\varepsilon_t}{\sigma_t^2}\Big] \\
&= \Big(\sum_{t=1}^{T} x_t x_t'\frac{1}{\sigma_t^2}\Big)^{-1}\Big[\sum_{t=1}^{T_1} x_t x_t'\frac{\beta_{(1)} - \beta_{(2)}}{\sigma_{(1)}^2} + \sum_{t=1}^{T} x_t x_t'\frac{\beta_{(2)}}{\sigma_t^2} + \sum_{t=1}^{T}\frac{x_t\sigma_t\varepsilon_t}{\sigma_t^2}\Big],
\end{aligned} \qquad (7)
\]
where $\Omega = \mathrm{diag}(\sigma_{(1)}^2, \ldots, \sigma_{(1)}^2, \sigma_{(2)}^2, \ldots, \sigma_{(2)}^2)$ is a $T \times T$ matrix and $X = (x_1, x_2, \ldots, x_T)' = (X_1'\; X_2')'$ is a $T \times k$ matrix of regressors. So $X_1$ is the $T_1 \times k$ matrix of observations before the break point, and $X_2$ is the $(T - T_1) \times k$ matrix of observations after the break point. Assume that $T - T_1 \geq k + 1$, so that we have enough observations in the post-break sample; this assumption is needed for a well-defined estimator, since otherwise the number of unknown parameters would exceed the number of observations. Also, assume that $\frac{X_i'\Omega_i^{-1}X_i}{T_i - T_{i-1}}$, where $i = 1, \ldots, m+1$ and $T_0 = 0$, $T_{m+1} = T$, converges in probability to some nonrandom positive definite matrix, not necessarily the same for all $i$. Thus $\big(\frac{X'\Omega^{-1}X}{T}\big) \overset{p}{\longrightarrow} Q$ and $\big(\frac{X_i'\Omega_i^{-1}X_i}{\Delta T_i}\big) \overset{p}{\longrightarrow} Q_i$, where $\Delta T_i = T_i - T_{i-1}$. Throughout this section, $m = 1$ since we have only one break in the model. By simplification:
\[
\begin{aligned}
\sqrt{T}\big(\hat{\beta}_{\text{Full}} - \beta_{(2)}\big)
&= \Big(\sum_{t=1}^{T}\frac{x_t x_t'}{T\sigma_t^2}\Big)^{-1}\Big[\sum_{t=1}^{T_1} x_t x_t'\frac{b_1\sqrt{T}\big(\beta_{(1)} - \beta_{(2)}\big)}{T b_1\sigma_{(1)}^2} + \sum_{t=1}^{T}\frac{x_t\sigma_t\varepsilon_t}{\sqrt{T}\sigma_t^2}\Big] \\
&= \Big(\frac{X'\Omega^{-1}X}{T}\Big)^{-1}\Big(\frac{X_1'\Omega_1^{-1}X_1}{T b_1}\Big)b_1\delta_1 + \Big(\frac{X'\Omega^{-1}X}{T}\Big)^{-1}\Big(\sum_{t=1}^{T}\frac{x_t\sigma_t\varepsilon_t}{\sqrt{T}\sigma_t^2}\Big)
\overset{d}{\longrightarrow} N\big(Q^{-1}Q_1 b_1\delta_1,\; Q^{-1}\big),
\end{aligned} \qquad (8)
\]
where $V_F = \big(\frac{X'\Omega^{-1}X}{T}\big)^{-1} \overset{p}{\longrightarrow} Q^{-1}$, $\big(\frac{X_1'\Omega_1^{-1}X_1}{T_1}\big) \overset{p}{\longrightarrow} Q_1$, $b_1 = \frac{T_1}{T}$ is the proportion of pre-break observations, and $\Omega_1 = \mathrm{diag}(\sigma_{(1)}^2, \ldots, \sigma_{(1)}^2)$ is a $T_1 \times T_1$ matrix. Note that in practice we need to estimate the unknown parameters $\sigma_{(1)}^2$ and $\sigma_{(2)}^2$. To do so, we can use the two-step GLS estimator: first obtain estimates of $\sigma_{(1)}^2$ and $\sigma_{(2)}^2$ from the ordinary least squares residuals of each regime, and then plug them back into equation (7) to estimate the coefficient.
Remark 1: The full-sample estimator is constructed under the null hypothesis that there is no break in the coefficients, $\beta_{(1)} = \beta_{(2)}$, but we allow a break in the variance, $\sigma_{(1)} \neq \sigma_{(2)}$. Because of this variance heteroskedasticity in the full-sample estimator, we use GLS, which is more efficient than ordinary least squares (OLS).
2.2.2 Post-break estimator
As introduced earlier, the post-break estimator focuses only on the observations after the most recent break point:
\[
\hat{\beta}_{(2)} = \big(X_2'\Omega_2^{-1}X_2\big)^{-1}X_2'\Omega_2^{-1}Y_2, \qquad (9)
\]
where $\Omega_2 = \mathrm{diag}(\sigma_{(2)}^2, \ldots, \sigma_{(2)}^2)$ is a $(T - T_1) \times (T - T_1)$ matrix. The post-break estimator is simply the OLS estimator, since $\Omega_2^{-1}$ cancels out in this expression, but we keep the GLS format to be consistent with the full-sample estimator. By simplification:
\[
\sqrt{T}\big(\hat{\beta}_{(2)} - \beta_{(2)}\big)
= \Big(\frac{X_2'\Omega_2^{-1}X_2}{T - T b_1}\Big)^{-1}\Big(\frac{X_2'\Omega_2^{-1}\sigma_{(2)}\varepsilon_2}{\sqrt{1 - b_1}\,\sqrt{T - T b_1}}\Big)
\overset{d}{\longrightarrow} N\Big(0,\; \frac{1}{1 - b_1}Q_2^{-1}\Big), \qquad (10)
\]
where $V_{(2)} = \frac{1}{1 - b_1}\big(\frac{X_2'\Omega_2^{-1}X_2}{T - T_1}\big)^{-1} \overset{p}{\longrightarrow} \frac{1}{1 - b_1}Q_2^{-1}$.
2.2.3 Stein estimator
To derive the distribution of the Stein-like combined estimator, we first write the joint asymptotic distribution of the full-sample estimator and the post-break estimator. Theorem 1 gives this joint distribution, the distribution of the Hausman statistic, and finally the distribution of the Stein-like combined estimator.
Theorem 1: The joint distribution of the full-sample estimator and the post-break estimator is:
\[
\sqrt{T}\begin{bmatrix} \hat{\beta}_{\text{Full}} - \beta_{(2)} \\ \hat{\beta}_{(2)} - \beta_{(2)} \end{bmatrix} \overset{d}{\longrightarrow} V^{1/2}Z, \qquad (11)
\]
where $Z \sim N(\theta, I_{2k})$, $\theta = V^{-1/2}\begin{bmatrix} Q^{-1}Q_1 b_1\delta_1 \\ 0 \end{bmatrix}$, $V = \begin{bmatrix} V_F & V_F \\ V_F & V_{(2)} \end{bmatrix}$, $V_F = \underset{T\to\infty}{\mathrm{plim}}\big(\frac{X'\Omega^{-1}X}{T}\big)^{-1} = Q^{-1}$ and $V_{(2)} = \frac{1}{1 - b_1}\underset{T\to\infty}{\mathrm{plim}}\big(\frac{X_2'\Omega_2^{-1}X_2}{T - T_1}\big)^{-1} = \frac{1}{1 - b_1}Q_2^{-1}$.
Besides, the distribution of the Hausman statistic is:
\[
H_T = T(\hat{\beta}_{(2)} - \hat{\beta}_{\text{Full}})'(V_{(2)} - V_F)^{-1}(\hat{\beta}_{(2)} - \hat{\beta}_{\text{Full}})
\overset{d}{\longrightarrow} Z'V^{1/2}G(V_{(2)} - V_F)^{-1}G'V^{1/2}Z \equiv Z'MZ, \qquad (12)
\]
where $G = (-I \;\; I)'$ and $M \equiv V^{1/2}G(V_{(2)} - V_F)^{-1}G'V^{1/2}$ is an idempotent matrix with rank $k$.
Finally, the distribution of the Stein-like combined estimator is:
\[
\sqrt{T}\big(\hat{\beta}_\alpha - \beta_{(2)}\big) = \sqrt{T}\big(\hat{\beta}_{(2)} - \beta_{(2)}\big) - \alpha\sqrt{T}\big(\hat{\beta}_{(2)} - \hat{\beta}_{\text{Full}}\big)
\overset{d}{\longrightarrow} G_2'V^{1/2}Z - \Big(\frac{\tau}{Z'MZ}\Big)_1 G'V^{1/2}Z, \qquad (13)
\]
where $G_2 = (0 \;\; I)'$ and $(a)_1 = \min[1, a]$.
Based on Theorem 1, the joint asymptotic distribution of the full-sample and post-break estimators is normal. The Hausman statistic has an asymptotic noncentral chi-square distribution; we deal with this non-centrality later, when we calculate the asymptotic risk. Finally, the asymptotic distribution of the Stein-like combined estimator is a function of the normal random vector $Z$ and of the non-centrality parameter, which appears because the Hausman statistic enters the combination weight.
2.3 Asymptotic risk for the Stein-like combined estimator
In this section we derive the asymptotic risk of the Stein-like combined estimator for the case of only one break ($m = 1$). Since our focus is on forecasting, it is reasonable to consider $\beta_{(2)}$ as the true parameter vector in the definition of the risk.
Lemma 1: When an estimator has asymptotic distribution $\sqrt{T}(\hat{\beta} - \beta) \overset{d}{\longrightarrow} \psi$, then the risk of this estimator can be written as $\rho(\hat{\beta}, W) = \mathrm{E}(\psi'W\psi)$. See Lehmann and Casella (1998).
Based on Lemma 1, we can write the asymptotic risk of our estimator. Note that if we set up the risk with $W = X'\Omega^{-1}X$, we obtain the definition of the MSFE. Having that, we can derive the shrinkage weight $\alpha$; in this case, $\alpha$ is the optimal weight that minimizes the MSFE. Then, given the combination weight $\alpha$, we can derive the $h$-step-ahead out-of-sample forecast. Throughout the calculation of the risk we do not plug in $W = X'\Omega^{-1}X$, but solve for general $W$. The risk is:
\[
\begin{aligned}
\rho(\hat{\beta}_\alpha, W) &= \mathrm{E}\big[T(\hat{\beta}_\alpha - \beta_{(2)})'W(\hat{\beta}_\alpha - \beta_{(2)})\big] \\
&= T\,\mathrm{E}\Big[\big(\hat{\beta}_{(2)} - \beta_{(2)}\big) - \alpha\big(\hat{\beta}_{(2)} - \hat{\beta}_{\text{Full}}\big)\Big]'W\Big[\big(\hat{\beta}_{(2)} - \beta_{(2)}\big) - \alpha\big(\hat{\beta}_{(2)} - \hat{\beta}_{\text{Full}}\big)\Big] \\
&= \rho(\hat{\beta}_{(2)}, W) + \tau^2\,\mathrm{E}\Big[\frac{Z'V^{1/2}GWG'V^{1/2}Z}{(Z'MZ)^2}\Big] - 2\tau\,\mathrm{E}\Big[\frac{Z'V^{1/2}GWG_2'V^{1/2}Z}{Z'MZ}\Big] \\
&= \rho(\hat{\beta}_{(2)}, W) + \tau^2\,\mathrm{tr}\Big[WG'V^{1/2}\,\mathrm{E}\big((Z'MZ)^{-2}ZZ'\big)V^{1/2}G\Big] - 2\tau\,\mathrm{tr}\Big[WG_2'V^{1/2}\,\mathrm{E}\big((Z'MZ)^{-1}ZZ'\big)V^{1/2}G\Big].
\end{aligned} \qquad (14)
\]
The challenging part in the calculation of this risk function is that we have to take expectations involving the noncentral chi-square distribution, with noncentrality parameter equal to $\frac{\theta'M\theta}{2}$. To calculate these expectations, we need the following lemmas.
Lemma 2: Let $\chi_p^2(\mu)$ denote a noncentral chi-square random variable with noncentrality parameter $\mu$ and $p$ degrees of freedom, and let $p$ be a positive integer such that $p > 2r$. Then we have:
\[
\mathrm{E}\Big[\big(\chi_p^2(\mu)\big)^{-r}\Big] = 2^{-r} e^{-\mu}\,\frac{\Gamma\big(\tfrac{p}{2} - r\big)}{\Gamma\big(\tfrac{p}{2}\big)}\; {}_1F_1\Big(\frac{p}{2} - r;\, \frac{p}{2};\, \mu\Big),
\]
where ${}_1F_1(\cdot;\cdot;\cdot)$ is the confluent hypergeometric function, defined as ${}_1F_1(a; b; \mu) = \sum_{n=0}^{\infty}\frac{(a)_n\,\mu^n}{(b)_n\, n!}$, with $(a)_n = a(a+1)\cdots(a+n-1)$ and $(a)_0 = 1$. See Ullah (1974) for the proof and details.
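Lemma 2 is easy to verify numerically. The snippet below (illustrative only, not part of the paper) compares the analytical expression with a Monte Carlo average; note that the noncentrality parameter μ used here corresponds to half of NumPy's nonc argument, which follows the sum-of-squared-means convention.

import numpy as np
from scipy.special import gamma, hyp1f1

def inv_moment_ncchi2(p, r, mu):
    # E[(chi^2_p(mu))^{-r}] from Lemma 2, valid for p > 2r
    return 2.0 ** (-r) * np.exp(-mu) * gamma(p / 2 - r) / gamma(p / 2) * hyp1f1(p / 2 - r, p / 2, mu)

rng = np.random.default_rng(0)
p, r, mu = 8, 1, 1.5
draws = rng.noncentral_chisquare(df=p, nonc=2 * mu, size=1_000_000)  # nonc = theta'M theta = 2*mu
print(inv_moment_ncchi2(p, r, mu), np.mean(draws ** (-r)))           # the two numbers should agree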
Lemma 3: The definition of the confluent hypergeometric function implies the following relations:
1. ${}_1F_1(a; b; \mu) = {}_1F_1(a+1; b; \mu) - \frac{\mu}{b}\,{}_1F_1(a+1; b+1; \mu)$,
2. ${}_1F_1(a; b; \mu) = \frac{b-a}{b}\,{}_1F_1(a; b+1; \mu) + \frac{a}{b}\,{}_1F_1(a+1; b+1; \mu)$, and
3. $(b - a - 1)\,{}_1F_1(a; b; \mu) = (b - 1)\,{}_1F_1(a; b-1; \mu) - a\,{}_1F_1(a+1; b+1; \mu)$.
Proof: See Lebedev (1972), p. 262.
Lemma 4: Let the $T \times 1$ vector $Z$ be distributed normally with mean vector $\theta$ and covariance matrix $I_T$, let $M$ be any $T \times T$ idempotent matrix with rank $r$, and let $\phi(\cdot)$ be a Borel measurable function. Then:
\[
\begin{aligned}
\mathrm{E}\big[\phi(Z'MZ)ZZ'\big] &= \mathrm{E}\big[\phi\big(\chi_{r+2}^2(\mu)\big)\big]M + \mathrm{E}\big[\phi\big(\chi_r^2(\mu)\big)\big](I_T - M) \\
&\quad + \mathrm{E}\big[\phi\big(\chi_{r+4}^2(\mu)\big)\big]M\theta\theta'M + \mathrm{E}\big[\phi\big(\chi_r^2(\mu)\big)\big](I_T - M)\theta\theta'(I_T - M) \\
&\quad + \mathrm{E}\big[\phi\big(\chi_{r+2}^2(\mu)\big)\big]\big(\theta\theta'M + M\theta\theta' - 2M\theta\theta'M\big),
\end{aligned}
\]
where $\mu \equiv \frac{\theta'M\theta}{2}$ is the non-centrality parameter. The proof is based on Judge and Bock (1978).
Now, by using Lemmas 2-4, we can simplify the risk in equation (14); Appendix A.1 contains the complete proof. After simplifying the risk we have:
\[
\begin{aligned}
\rho(\hat{\beta}_\alpha, W) = \rho(\hat{\beta}_{(2)}, W)
&- \frac{1}{k-2}\Big\{2\tau\Big(\mathrm{tr}\big(W(V_{(2)} - V_F)\big) - 2\,\frac{\theta'V^{1/2}GWG'V^{1/2}\theta}{\theta'M\theta}\Big) - \tau^2\Big(\frac{\theta'V^{1/2}GWG'V^{1/2}\theta}{\theta'M\theta}\Big)\Big\}\, e^{-\mu}\,{}_1F_1\Big(\frac{k}{2} - 1;\, \frac{k}{2};\, \mu\Big) \\
&- \frac{1}{k-2}\,(\tau^2 + 4\tau)\Big\{\frac{\theta'V^{1/2}GWG'V^{1/2}\theta}{\theta'M\theta} - \frac{\mathrm{tr}\big(W(V_{(2)} - V_F)\big)}{k}\Big\}\, e^{-\mu}\,{}_1F_1\Big(\frac{k}{2} - 1;\, \frac{k}{2} + 1;\, \mu\Big).
\end{aligned} \qquad (15)
\]
Thus the risk of the combined estimator is lower than the risk of the post-break estimator if the terms inside the curly brackets are positive. We use Lemma 5 to derive the required conditions.
Lemma 5: Let $C$ be an $m \times m$ symmetric matrix, with $\lambda_{\min}(C)$ and $\lambda_{\max}(C)$ denoting its minimum and maximum eigenvalues. Then,
\[
\sup_{X}\frac{X'CX}{X'X} = \lambda_{\max}(C), \qquad \inf_{X}\frac{X'CX}{X'X} = \lambda_{\min}(C).
\]
See Rao (1973) for the proof.
Based on Lemma 5, we can derive the necessary conditions under which the risk of the combined estimator is lower than that of the post-break estimator. Given equation (15), the following three conditions are necessary for the results:
1.
\[
2\tau\Big(\mathrm{tr}\big(W(V_{(2)} - V_F)\big) - 2\,\frac{\theta'V^{1/2}GWG'V^{1/2}\theta}{\theta'M\theta}\Big) - \tau^2\Big(\frac{\theta'V^{1/2}GWG'V^{1/2}\theta}{\theta'M\theta}\Big) \geq 0, \qquad (16)
\]
which is satisfied if $0 \leq \tau \leq 2\Big(\frac{\mathrm{tr}\big(W(V_{(2)} - V_F)\big)\,(\theta'M\theta)}{\theta'V^{1/2}GWG'V^{1/2}\theta} - 2\Big)$.
2.
\[
\mathrm{tr}\big(W(V_{(2)} - V_F)\big) - 2\,\frac{\theta'V^{1/2}GWG'V^{1/2}\theta}{\theta'M\theta} > 0, \qquad (17)
\]
which is satisfied if $k > 2$.
3.
\[
\frac{\theta'V^{1/2}GWG'V^{1/2}\theta}{\theta'M\theta} - \frac{\mathrm{tr}\big(W(V_{(2)} - V_F)\big)}{k} > 0, \qquad (18)
\]
which is satisfied if $k > \frac{\mathrm{tr}\big(W(V_{(2)} - V_F)\big)}{\lambda_{\min}\big(W(V_{(2)} - V_F)\big)}$.
See Appendix A.2 for details and proof.
Therefore, the necessary condition for the Stein-like combined estimator to have lower risk than the post-break estimator is that the number of regressors satisfies $k > \max\Big(2,\, \frac{\mathrm{tr}(W(V_{(2)} - V_F))}{\lambda_{\min}(W(V_{(2)} - V_F))}\Big)$. Having that, we can find the optimal $\tau^*$ that minimizes $\rho(\hat{\beta}_\alpha, W)$. For $0 \leq \tau \leq 2\Big(\frac{\mathrm{tr}(W(V_{(2)} - V_F))\,(\theta'M\theta)}{\theta'V^{1/2}GWG'V^{1/2}\theta} - 2\Big)$, the optimal $\tau$, which depends on $W$, is:
\[
\tau^*(W) = \frac{\mathrm{tr}\big(W(V_{(2)} - V_F)\big)\,(\theta'M\theta)}{\theta'V^{1/2}GWG'V^{1/2}\theta} - 2. \qquad (19)
\]
Remark 2: The difference between this paper and Hansen (2017) is that we calculate the risk for a general form of $W$ and derive the optimal $\tau^*$, which depends on $W$. For the special case $W = (V_{(2)} - V_F)^{-1}$, we obtain $\tau^* = k - 2$, which is the well-known result of James and Stein (1961), and $\tau^*$ is positive when the number of regressors is larger than two, $k > 2$. This choice of $W$ is the special case considered in Hansen (2017). In fact, the risk in our paper is derived for any positive definite $W$, and we obtain it theoretically for any given form of $W$.
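For completeness, the algebra behind this special case is one line: with $W = (V_{(2)} - V_F)^{-1}$ and $M = V^{1/2}G(V_{(2)} - V_F)^{-1}G'V^{1/2}$,
\[
\mathrm{tr}\big(W(V_{(2)} - V_F)\big) = \mathrm{tr}(I_k) = k, \qquad \theta'V^{1/2}GWG'V^{1/2}\theta = \theta'M\theta,
\]
so equation (19) gives $\tau^*(W) = \frac{k\,(\theta'M\theta)}{\theta'M\theta} - 2 = k - 2$.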
By plugging the optimal $\tau^*(W)$ back into the risk function in equation (15), we can derive the optimal risk. Theorem 2 summarizes the result for a general $W > 0$.
Theorem 2: If $0 \leq \tau \leq 2\Big(\frac{\mathrm{tr}(W(V_{(2)} - V_F))\,(\theta'M\theta)}{\theta'V^{1/2}GWG'V^{1/2}\theta} - 2\Big)$ and $k > \max\Big(2,\, \frac{\mathrm{tr}(W(V_{(2)} - V_F))}{\lambda_{\min}(W(V_{(2)} - V_F))}\Big)$, then
\[
\begin{aligned}
\rho(\hat{\beta}_\alpha, W) = \rho(\hat{\beta}_{(2)}, W)
&- \Bigg\{\frac{\Big[\mathrm{tr}\big(W(V_{(2)} - V_F)\big)\,(\theta'M\theta) - 2\big(\theta'V^{1/2}GWG'V^{1/2}\theta\big)\Big]^2}{(k-2)\,(\theta'M\theta)\,\big(\theta'V^{1/2}GWG'V^{1/2}\theta\big)}\Bigg\}\times e^{-\mu}\,{}_1F_1\Big(\frac{k}{2} - 1;\, \frac{k}{2};\, \mu\Big) \\
&- \frac{1}{k-2}\Bigg\{\frac{\big[\mathrm{tr}\big(W(V_{(2)} - V_F)\big)\big]^2(\theta'M\theta)^2}{\big(\theta'V^{1/2}GWG'V^{1/2}\theta\big)^2} - 4\Bigg\}\times\frac{\big(\theta'V^{1/2}GWG'V^{1/2}\theta\big)}{\theta'M\theta}\, e^{-\mu}\,{}_1F_1\Big(\frac{k}{2} - 1;\, \frac{k}{2} + 1;\, \mu\Big),
\end{aligned} \qquad (20)
\]
where $\rho(\hat{\beta}_{(2)}, W) = \mathrm{tr}(WV_{(2)})$, and $\mu = \frac{\theta'M\theta}{2}$ is the non-centrality parameter.
Note that in Theorem 2 the terms inside the curly brackets are positive, so the Stein-like combined estimator has a smaller risk than the post-break estimator.
Corollary 2.1: For the special case $W = (V_{(2)} - V_F)^{-1}$, the risk of the Stein-like combined estimator is equal to:
\[
\rho(\hat{\beta}_\alpha, W) = \rho(\hat{\beta}_{(2)}, W) - (k - 2)\, e^{-\mu}\,{}_1F_1\Big(\frac{k}{2} - 1;\, \frac{k}{2};\, \mu\Big), \qquad (21)
\]
and the risk of the combined estimator is less than that of the post-break estimator if $k > 2$.
Remark 3: As $b_1$ increases, $V_{(2)}$ increases, and thus $\tau^*$ increases. Moreover, since $H_T$ depends inversely on $V_{(2)}$, $H_T$ decreases, which results in a larger $\tau^*/H_T$. Thus the full-sample estimator receives a higher weight when $b_1$ is large. Recall that $\hat{\beta}_\alpha = \alpha\hat{\beta}_{\text{Full}} + (1 - \alpha)\hat{\beta}_{(2)}$.
Remark 4: Based on Theorem 2, the gain obtained by using the combined estimator can be derived by taking the difference between $\rho(\hat{\beta}_{(2)}, W)$ and $\rho(\hat{\beta}_\alpha, W)$.
Figure 1, drawn in Matlab, shows the relationship between the break size, which enters through $\theta$ in the above equation, and the difference between the risks. Based on the figure, as $k$ increases, the difference between the risks increases in favor of the Stein-like combined estimator. Also, when the variance of the pre-break data is lower than the variance of the post-break data, $q < 1$, there is more gain from using the pre-break observations. Besides, when the break point is near the end of the sample, $b = 0.9$, the gain obtained from the Stein-like combined estimator is higher than when the break point is near the beginning of the sample, $b = 0.1$. The reason is that when there are not enough observations in the post-break sample, the post-break estimator performs poorly due to the lack of observations.
3 Combined estimator with weight κ ∈ [0 1]
In this section we address the question of why the Stein-type weight is needed as a combination weight. To this end, we consider our combined estimator with an alternative combination weight $\kappa \in [0, 1]$; that is, we combine the post-break and full-sample estimators as:
\[
\hat{\beta}_\kappa = \kappa\hat{\beta}_{\text{Full}} + (1 - \kappa)\hat{\beta}_{(2)}. \qquad (22)
\]
Using the results of Theorem 1, we can derive the asymptotic risk of this combined estimator. Here the calculation of the risk is simple, because there is no random part in the combination weight; thus we do not need to take expectations of the inverse of a chi-square distribution, and in fact we do not need Lemmas 2-4. The asymptotic risk for this combined estimator can be written as:
\[
\begin{aligned}
\rho(\hat{\beta}_\kappa, W) &= \mathrm{E}\big[T(\hat{\beta}_\kappa - \beta_{(2)})'W(\hat{\beta}_\kappa - \beta_{(2)})\big] \\
&= T\,\mathrm{E}\Big[\big(\hat{\beta}_{(2)} - \beta_{(2)}\big) - \kappa\big(\hat{\beta}_{(2)} - \hat{\beta}_{\text{Full}}\big)\Big]'W\Big[\big(\hat{\beta}_{(2)} - \beta_{(2)}\big) - \kappa\big(\hat{\beta}_{(2)} - \hat{\beta}_{\text{Full}}\big)\Big] \\
&= \rho(\hat{\beta}_{(2)}, W) + \kappa^2\,\mathrm{E}\big[Z'V^{1/2}GWG'V^{1/2}Z\big] - 2\kappa\,\mathrm{E}\big[Z'V^{1/2}GWG_2'V^{1/2}Z\big] \\
&= \rho(\hat{\beta}_{(2)}, W) + \kappa^2\,\mathrm{tr}\big[V^{1/2}GWG'V^{1/2}\,\mathrm{E}(ZZ')\big] - 2\kappa\,\mathrm{tr}\big[V^{1/2}GWG_2'V^{1/2}\,\mathrm{E}(ZZ')\big] \\
&= \rho(\hat{\beta}_{(2)}, W) + \kappa^2\,\mathrm{tr}\big[V^{1/2}GWG'V^{1/2}(I_{2k} + \theta\theta')\big] - 2\kappa\,\mathrm{tr}\big[V^{1/2}GWG_2'V^{1/2}(I_{2k} + \theta\theta')\big] \\
&= \rho(\hat{\beta}_{(2)}, W) - \kappa\Big(2\,\mathrm{tr}\big(W(V_{(2)} - V_F)\big) - \kappa\Big[\mathrm{tr}\big(W(V_{(2)} - V_F)\big) + \theta'V^{1/2}GWG'V^{1/2}\theta\Big]\Big).
\end{aligned} \qquad (23)
\]
If $0 \leq \kappa \leq \frac{2\,\mathrm{tr}\big(W(V_{(2)} - V_F)\big)}{\mathrm{tr}\big(W(V_{(2)} - V_F)\big) + \theta'V^{1/2}GWG'V^{1/2}\theta}$, the optimal $\kappa^*$ is:
\[
\kappa^*(W) = \frac{\mathrm{tr}\big(W(V_{(2)} - V_F)\big)}{\mathrm{tr}\big(W(V_{(2)} - V_F)\big) + \theta'V^{1/2}GWG'V^{1/2}\theta}, \qquad (24)
\]
which is derived by minimizing the risk function. Finally, by plugging the optimal $\kappa^*$ into the risk function, we have the following theorem.
Theorem 3: The asymptotic risk for the combined estimator with weight $\kappa$, after plugging the optimal $\kappa^*(W)$ into the risk function, is:
\[
\rho(\hat{\beta}_\kappa, W) = \rho(\hat{\beta}_{(2)}, W) - \frac{\big[\mathrm{tr}\big(W(V_{(2)} - V_F)\big)\big]^2}{\mathrm{tr}\big(W(V_{(2)} - V_F)\big) + \theta'V^{1/2}GWG'V^{1/2}\theta}. \qquad (25)
\]
Theorem 3 shows that under the stated condition for $\kappa$, the risk of this combined estimator is less than that of the post-break estimator.
Remark 5: For the special case $W = (V_{(2)} - V_F)^{-1}$, $\kappa^* = \frac{k}{k + \theta'M\theta}$.
Remark 6: We can compare the combination weights $\alpha$ and $\kappa$ for the special case $W = (V_{(2)} - V_F)^{-1}$. Recall that $\alpha = \frac{\tau}{H_T}$, where $H_T = T(\hat{\beta}_{(2)} - \hat{\beta}_{\text{Full}})'(V_{(2)} - V_F)^{-1}(\hat{\beta}_{(2)} - \hat{\beta}_{\text{Full}}) = Z'MZ$. Also, $\tau^* = k - 2$ when $W = (V_{(2)} - V_F)^{-1}$, so $\alpha = \frac{k - 2}{H_T}$. Besides, $\kappa = \frac{k}{k + H_T}$ under this specification for $W$. So:
\[
\alpha - \kappa = \frac{(k - 1)^2 - 2H_T - 1}{H_T(k + H_T)},
\]
which means that for large $k$ and small break size, $\alpha$ is larger than $\kappa$. Since $\alpha$ is the weight on the full-sample estimator, the intuition is that for a small break size and large $k$, the Stein-like combined estimator gives a larger weight to the full-sample estimator and a lower weight to the post-break estimator. Thus, we expect the Stein-type weight to have a lower risk than the $\kappa$-type weight, because when the break size is small the full-sample estimator performs better than the post-break estimator; sometimes ignoring very small breaks is better than accounting for them. For a large break size (when $H_T$ is large), these weights are almost equal to each other, $\alpha \approx \kappa$.
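The comparison in Remark 6 can be made concrete with a few lines of code. This is only an illustrative sketch of the two weight formulas in the special case W = (V_(2) - V_F)^{-1}, using hypothetical values of k and H_T.

def weights(k, H):
    """Stein weight alpha and kappa weight for W = (V_(2) - V_F)^{-1} (Remarks 2, 5, 6)."""
    alpha = min(1.0, (k - 2) / H)     # alpha = tau*/H_T with tau* = k - 2
    kappa = k / (k + H)               # kappa = k/(k + H_T), with H_T estimating theta'M theta
    return alpha, kappa

for H in (2.0, 10.0, 100.0):          # small H_T ~ small break, large H_T ~ large break
    print(H, weights(k=8, H=H))       # alpha > kappa for small H_T; alpha and kappa close for large H_T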
4 Semi-parametric shrinkage estimator for forecast averaging
Li, Ouyang, and Racine (2013) introduce a semi-parametric method for estimation with categorical data. Inspired by their approach, we propose an estimator for our time-series structural break models. Using the same model as equation (3), define $\tilde{Y}_1 = \Omega_1^{-1/2}Y_1$, $\tilde{Y}_2 = \Omega_2^{-1/2}Y_2$, $\tilde{X}_1 = \Omega_1^{-1/2}X_1$ and $\tilde{X}_2 = \Omega_2^{-1/2}X_2$. Based on this semi-parametric method, we can estimate the post-break coefficients by GLS using the following discrete kernel estimator:
\[
\hat{\beta}_\gamma = \underset{\beta_{(2)}}{\arg\min}\;\sum_{t=1}^{T}\big(\tilde{y}_t - \tilde{x}_t'\beta_{(2)}\big)^2 L(t, \gamma), \qquad (26)
\]
where
\[
L(t, \gamma) = \gamma\,\mathbf{1}(t \leq T_1) + \mathbf{1}(t > T_1) \qquad (27)
\]
and $t$ runs over all of the observations. $\hat{\beta}_\gamma$ is the estimator for the post-break observations. The idea is that, in calculating the post-break estimator, it is worthwhile to use information from the first regime as well. This kernel-weighted estimator gives a constant weight of 1 to the post-break observations and down-weights the pre-break observations by a weight $\gamma \in \mathcal{H} = [0, 1]$. In other words, the post-break coefficient is estimated by weighting observations over the whole sample. When $\gamma = 0$, the estimator uses only the post-break observations, and when $\gamma = 1$, all observations are weighted equally. For $\gamma \in (0, 1)$, we have a combination of pre-break and post-break observations. The first order condition of equation (26) is
\[
\sum_{t=1}^{T} L(t, \gamma)\,\tilde{x}_t\big(\tilde{y}_t - \tilde{x}_t'\beta_{(2)}\big) = 0, \qquad (28)
\]
leading to
\[
\begin{aligned}
\hat{\beta}_\gamma &= \Big(\sum_{t=1}^{T} L(t, \gamma)\,\tilde{x}_t\tilde{x}_t'\Big)^{-1}\sum_{t=1}^{T} L(t, \gamma)\,\tilde{x}_t\tilde{y}_t
= \big(\gamma\tilde{X}_1'\tilde{X}_1 + \tilde{X}_2'\tilde{X}_2\big)^{-1}\big(\gamma\tilde{X}_1'\tilde{Y}_1 + \tilde{X}_2'\tilde{Y}_2\big) \\
&= \Big(\frac{\gamma}{\sigma_{(1)}^2}X_1'X_1 + \frac{1}{\sigma_{(2)}^2}X_2'X_2\Big)^{-1}\Big(\frac{\gamma}{\sigma_{(1)}^2}X_1'Y_1 + \frac{1}{\sigma_{(2)}^2}X_2'Y_2\Big) \\
&= \Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}\Big(\frac{\gamma}{q^2}X_1'X_1\,\hat{\beta}_{(1)} + X_2'X_2\,\hat{\beta}_{(2)}\Big)
= \Delta\hat{\beta}_{(1)} + (I - \Delta)\hat{\beta}_{(2)},
\end{aligned} \qquad (29)
\]
where $\Delta = \big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\big)^{-1}\big(\frac{\gamma}{q^2}X_1'X_1\big)$, $q = \frac{\sigma_{(1)}}{\sigma_{(2)}}$, $\hat{\beta}_{(1)} = (X_1'X_1)^{-1}X_1'Y_1$ and $\hat{\beta}_{(2)} = (X_2'X_2)^{-1}X_2'Y_2$.
We can find $\gamma$ by minimizing the following Cross-Validation (CV) criterion function:
\[
CV(\gamma) = \frac{1}{T - T_1}\sum_{\tau = T_1 + 1}^{T}\big(y_\tau - x_\tau'\hat{\beta}_\gamma^{(-\tau)}\big)^2, \qquad (30)
\]
where $\hat{\beta}_\gamma^{(-\tau)}$ denotes the leave-one-out Cross-Validation (CV) estimate of $\beta$, in which $\tau$ runs over the post-break observations, $\tau = T_1 + 1, \ldots, T$, and each time the $\tau$'th observation is deleted. So:
\[
\hat{\beta}_\gamma^{(-\tau)} = \Big(\sum_{t \neq \tau} L(t, \gamma)\,\tilde{x}_t\tilde{x}_t'\Big)^{-1}\sum_{t \neq \tau} L(t, \gamma)\,\tilde{x}_t\tilde{y}_t, \qquad \text{for } \tau = T_1 + 1, \ldots, T. \qquad (31)
\]
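The CV criterion in equations (30)-(31) can be minimized by a simple grid search over [0, 1]. The sketch below assumes the break date T1 is known and that an estimate q2 of the squared variance ratio σ²_(1)/σ²_(2) is available (e.g., from regime-wise OLS residuals); all names are hypothetical.

import numpy as np

def gamma_by_cv(X, y, T1, q2, grid=np.linspace(0.0, 1.0, 101)):
    """Choose gamma by leave-one-out CV over the post-break observations (eqs. 30-31)."""
    X1, y1, X2, y2 = X[:T1], y[:T1], X[T1:], y[T1:]
    best_gamma, best_cv = None, np.inf
    for g in grid:
        S_full = (g / q2) * X1.T @ X1 + X2.T @ X2
        v_full = (g / q2) * X1.T @ y1 + X2.T @ y2
        cv = 0.0
        for i in range(len(y2)):                       # leave one post-break observation out
            xi, yi = X2[i], y2[i]
            beta = np.linalg.solve(S_full - np.outer(xi, xi), v_full - xi * yi)
            cv += (yi - xi @ beta) ** 2
        cv /= len(y2)
        if cv < best_cv:
            best_gamma, best_cv = g, cv
    # final estimator beta_gamma at the CV-selected weight (eq. 29)
    beta_gamma = np.linalg.solve((best_gamma / q2) * X1.T @ X1 + X2.T @ X2,
                                 (best_gamma / q2) * X1.T @ y1 + X2.T @ y2)
    return best_gamma, beta_gamma

The grid search is crude but transparent; it requires T - T1 to exceed k + 1 so that the leave-one-out systems remain well defined, in line with the assumption made in Section 2.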
Note that the CV for finding $\gamma$ is based on the post-break observations only. The reason is that for $t = 1, \ldots, T_1$ the coefficient is $\beta_{(1)}$, while we are trying to find the estimator for the post-break observations ($t > T_1$) in order to use it for forecasting. Once we have found $\gamma$, we can calculate $\hat{\beta}_\gamma$ and the MSFE accordingly. An advantage of this semi-parametric estimator, $\hat{\beta}_\gamma$, is that $\gamma$ does not depend on an estimate of the break size or of the ratio of the break in the error term. In fact, the estimation error in this estimator is lower than in the previously introduced combined estimator. Since this method finds the weight by CV, it can fit the data better than other methods that also involve the estimation of other parameters. We just need to prove that the $\gamma$ weight derived by CV is optimal. Define
\[
L(\gamma) = \big(\hat{\beta}_\gamma - \beta_{(2)}\big)'X_2'\Omega_2^{-1}X_2\big(\hat{\beta}_\gamma - \beta_{(2)}\big) = \big(\hat{\mu}(\gamma) - \mu\big)'\big(\hat{\mu}(\gamma) - \mu\big)
\]
to be the loss function, where $\mu = \Omega_2^{-1/2}X_2\beta_{(2)}$ and $\hat{\mu}(\gamma) = \Omega_2^{-1/2}X_2\hat{\beta}_\gamma$, and the expected loss is $R(\gamma) = \mathrm{E}[L(\gamma)]$. Theorem 4 establishes the optimality of the $\gamma$ weight.
Theorem 4: As $T \to \infty$,
\[
\frac{L(\hat{\gamma})}{\inf_{\gamma \in \mathcal{H}} L(\gamma)} \overset{p}{\longrightarrow} 1, \qquad (32)
\]
\[
\frac{R(\hat{\gamma})}{\inf_{\gamma \in \mathcal{H}} R(\gamma)} \overset{p}{\longrightarrow} 1. \qquad (33)
\]
Theorem 4 shows that the weight obtained by CV is asymptotically equivalent to the infeasible optimal weight. To prove this theorem, we first need to derive $L(\gamma)$ and $R(\gamma)$. See Zhang et al. (2013) and Ullah et al. (2017).
4.1 Deriving MSFE and the optimality for the cross-validation weight
This part shows the step-by-step procedure needed for the proof of optimality. Rewrite equation (29) as:
\[
\hat{\beta}_\gamma - \beta_{(2)} = \Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}\Big(\frac{\gamma}{q^2}X_1'X_1\lambda + \frac{\gamma}{q^2}X_1'\sigma_{(1)}\varepsilon_1 + X_2'\sigma_{(2)}\varepsilon_2\Big), \qquad (34)
\]
where $\lambda = \beta_{(1)} - \beta_{(2)}$, $\varepsilon_1 = (\varepsilon_1, \ldots, \varepsilon_{T_1})'$, $\varepsilon_2 = (\varepsilon_{T_1+1}, \ldots, \varepsilon_T)'$, and for ease of notation we write $\gamma_{q^2} \equiv \frac{\gamma}{q^2}$. Based on the loss function:
\[
\begin{aligned}
R(\gamma) &= \mathrm{E}\Big[\big(\hat{\beta}_\gamma - \beta_{(2)}\big)'W\big(\hat{\beta}_\gamma - \beta_{(2)}\big)\Big] \\
&= \mathrm{E}\Big[\Big(\frac{\gamma}{q^2}\lambda'X_1'X_1 + \frac{\gamma}{q^2}\sigma_{(1)}\varepsilon_1'X_1 + \sigma_{(2)}\varepsilon_2'X_2\Big)\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}W\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}\Big(\frac{\gamma}{q^2}X_1'X_1\lambda + \frac{\gamma}{q^2}X_1'\sigma_{(1)}\varepsilon_1 + X_2'\sigma_{(2)}\varepsilon_2\Big)\Big] \\
&= \Big(\frac{\gamma}{q^2}\Big)^2\lambda'X_1'X_1\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}W\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_1'X_1\lambda \\
&\quad + \Big(\frac{\gamma}{q^2}\Big)^2\mathrm{E}\Big[\sigma_{(1)}^2\,\varepsilon_1'X_1\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}W\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_1'\varepsilon_1\Big] \\
&\quad + \mathrm{E}\Big[\sigma_{(2)}^2\,\varepsilon_2'X_2\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}W\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_2'\varepsilon_2\Big] \\
&= \lambda'A_1\lambda + \gamma^2\sigma_{(2)}^2\,\mathrm{tr}(A_2) + \sigma_{(2)}^2\,\mathrm{tr}(A_3),
\end{aligned} \qquad (35)
\]
where
\[
A_1 = \Big(\frac{\gamma}{q^2}\Big)^2 X_1'X_1\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}W\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_1'X_1,
\]
\[
A_2 = X_1'X_1\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}W\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}, \qquad
A_3 = X_2'X_2\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}W\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}.
\]
Since $\lambda$ is the break size, which is not known, we can plug in an unbiased estimator for this term instead. The unbiased estimator of $\lambda'A_1\lambda$ is equal to:
\[
\hat{\lambda}'A_1\hat{\lambda} - \sigma_{(1)}^2\,\mathrm{tr}\big(A_1(X_1'X_1)^{-1}\big) - \sigma_{(2)}^2\,\mathrm{tr}\big(A_1(X_2'X_2)^{-1}\big), \qquad (36)
\]
and by plugging it back into equation (35) we have:
\[
\begin{aligned}
R(\gamma) &= \hat{\lambda}'A_1\hat{\lambda} + \sigma_{(1)}^2\,\mathrm{tr}\Big(\Big(\frac{\gamma}{q^2}\Big)^2 A_2 - A_1(X_1'X_1)^{-1}\Big) + \sigma_{(2)}^2\,\mathrm{tr}\big(A_3 - A_1(X_2'X_2)^{-1}\big) \\
&= \hat{\lambda}'A_1\hat{\lambda} + 0 + 2\hat{\sigma}_{(2)}^2\,\mathrm{tr}\Big(W\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}\Big) - \hat{\sigma}_{(2)}^2\,\mathrm{tr}\big((X_2'X_2)^{-1}W\big),
\end{aligned} \qquad (37)
\]
where $\big(\frac{\gamma}{q^2}\big)^2 A_2 - A_1(X_1'X_1)^{-1} = 0$. Also, $\sigma_{(2)}^2\,\mathrm{tr}\big(A_3 - A_1(X_2'X_2)^{-1}\big) = 2\sigma_{(2)}^2\,\mathrm{tr}\big(W\big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\big)^{-1}\big) - \sigma_{(2)}^2\,\mathrm{tr}\big((X_2'X_2)^{-1}W\big)$, and $\hat{\sigma}_{(2)}^2 = \frac{(Y_2 - X_2\hat{\beta}_{(2)})'(Y_2 - X_2\hat{\beta}_{(2)})}{T - T_1 - k}$ is an unbiased estimator of the variance of the post-break observations. Besides, we can rewrite $\lambda'A_1\lambda$ as:
\[
\begin{aligned}
\lambda'A_1\lambda &= \hat{\beta}_{(2)}'W\hat{\beta}_{(2)} - 2\hat{\beta}_{(2)}'W\hat{\beta}_\gamma + \hat{\beta}_\gamma'W\hat{\beta}_\gamma \\
&= \hat{\beta}_{(2)}'X_2'\Omega_2^{-1}X_2\hat{\beta}_{(2)} - 2Y_2'\Omega_2^{-1/2}\Omega_2^{-1/2}X_2\hat{\beta}_\gamma + \big(\hat{\mu}(\gamma) - \tilde{Y}_2 + \tilde{Y}_2\big)'\big(\hat{\mu}(\gamma) - \tilde{Y}_2 + \tilde{Y}_2\big) \\
&= \hat{\beta}_{(2)}'X_2'\Omega_2^{-1}X_2\hat{\beta}_{(2)} - 2\tilde{Y}_2'\hat{\mu}(\gamma) + \big(\hat{\mu}(\gamma) - \tilde{Y}_2\big)'\big(\hat{\mu}(\gamma) - \tilde{Y}_2\big) + \tilde{Y}_2'\tilde{Y}_2 + 2\big(\hat{\mu}(\gamma) - \tilde{Y}_2\big)'\tilde{Y}_2 \\
&= \hat{\beta}_{(2)}'X_2'\Omega_2^{-1}X_2\hat{\beta}_{(2)} + \big(\hat{\mu}(\gamma) - \tilde{Y}_2\big)'\big(\hat{\mu}(\gamma) - \tilde{Y}_2\big) - \tilde{Y}_2'\tilde{Y}_2,
\end{aligned} \qquad (38)
\]
where we plug in $W = X_2'\Omega_2^{-1}X_2$ to derive the MSFE.
Thus, based on equation (37), the MSFE for this semi-parametric estimator, up to terms relevant to $\gamma$, can be written as:
\[
R(\gamma) = \big(\hat{\mu}(\gamma) - \tilde{Y}_2\big)'\big(\hat{\mu}(\gamma) - \tilde{Y}_2\big) + 2\hat{\sigma}_{(2)}^2\,\mathrm{tr}\Big(X_2'\Omega_2^{-1/2}X_2\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}\Big). \qquad (39)
\]
Define $P(\gamma) \equiv X_2\big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\big)^{-1}X_2'$. Thus we can rewrite $\hat{\mu}(\gamma)$ as:
\[
\begin{aligned}
\hat{\mu}(\gamma) = \Omega_2^{-1/2}X_2\hat{\beta}_\gamma
&= \Omega_2^{-1/2}X_2\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}\Big(\frac{\gamma}{q^2}X_1'Y_1 + X_2'Y_2\Big) \\
&= \Omega_2^{-1/2}X_2\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}\Big(\frac{\gamma}{q^2}X_1'Y_1\Big) + \Omega_2^{-1/2}X_2\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_2'Y_2 \\
&= \Omega_2^{-1/2}\big(\mathrm{A} + P(\gamma)Y_2\big),
\end{aligned} \qquad (40)
\]
where $\mathrm{A} \equiv X_2\big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\big)^{-1}\big(\frac{\gamma}{q^2}X_1'Y_1\big)$. Based on this definition, we can rewrite the risk in equation (39) as:
\[
\begin{aligned}
R(\gamma) &= \big(\hat{\mu}(\gamma) - \tilde{Y}_2\big)'\big(\hat{\mu}(\gamma) - \tilde{Y}_2\big) + 2\hat{\sigma}_{(2)}^2\,\mathrm{tr}\Big(X_2'\Omega_2^{-1/2}X_2\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}\Big) \\
&= \big(\hat{\mu}(\gamma) - \Omega_2^{-1/2}Y_2\big)'\big(\hat{\mu}(\gamma) - \Omega_2^{-1/2}Y_2\big) + 2\,\mathrm{tr}\big(P(\gamma)\big) \\
&= \big(\hat{\mu}(\gamma) - \mu - \Omega_2^{-1/2}\sigma_{(2)}\varepsilon_2\big)'\big(\hat{\mu}(\gamma) - \mu - \Omega_2^{-1/2}\sigma_{(2)}\varepsilon_2\big) + 2\,\mathrm{tr}\big(P(\gamma)\big) \\
&= \big(\hat{\mu}(\gamma) - \mu - \varepsilon_2\big)'\big(\hat{\mu}(\gamma) - \mu - \varepsilon_2\big) + 2\,\mathrm{tr}\big(P(\gamma)\big) \\
&= L(\gamma) + \varepsilon_2'\varepsilon_2 - 2\varepsilon_2'\big(\hat{\mu}(\gamma) - \mu\big) + 2\,\mathrm{tr}\big(P(\gamma)\big) \\
&= L(\gamma) + \varepsilon_2'\varepsilon_2 - 2\varepsilon_2'\hat{\mu}(\gamma) + 2\varepsilon_2'\mu + 2\,\mathrm{tr}\big(P(\gamma)\big) \\
&= L(\gamma) + \varepsilon_2'\varepsilon_2 - 2\varepsilon_2'\Omega_2^{-1/2}\big(\mathrm{A} + P(\gamma)Y_2\big) + 2\varepsilon_2'\mu + 2\,\mathrm{tr}\big(P(\gamma)\big) \\
&= L(\gamma) + \varepsilon_2'\varepsilon_2 - 2\varepsilon_2'\Omega_2^{-1/2}\mathrm{A} - 2\varepsilon_2'\Omega_2^{-1/2}P(\gamma)\Omega^{1/2}\mu - 2\varepsilon_2'\Omega_2^{-1/2}P(\gamma)\sigma_{(2)}\varepsilon_2 + 2\varepsilon_2'\mu + 2\,\mathrm{tr}\big(P(\gamma)\big) \\
&= L(\gamma) + \varepsilon_2'\varepsilon_2 - 2\varepsilon_2'\Omega_2^{-1/2}\mathrm{A} - 2\mu'P(\gamma)\varepsilon_2 - 2\varepsilon_2'P(\gamma)\varepsilon_2 + 2\varepsilon_2'\mu + 2\,\mathrm{tr}\big(P(\gamma)\big).
\end{aligned} \qquad (41)
\]
We want to show that the weight $\hat{\gamma}$ obtained from CV is optimal in the sense of Theorem 4, the optimality concept introduced by K.-C. Li (1987): the average squared error based on CV is asymptotically as small as the average squared error of the infeasible best possible estimator. Also notice that:
\[
\begin{aligned}
R(\gamma) = \mathrm{E}\big[L(\gamma)\big] = \mathrm{E}\Big[\big(\hat{\mu}(\gamma) - \mu\big)'\big(\hat{\mu}(\gamma) - \mu\big)\Big]
&= L(\gamma) - \varepsilon_2'P^2(\gamma)\varepsilon_2 - 4\mathrm{A}'\Omega_2^{-1/2}P(\gamma)\varepsilon_2 - 2\big(P(\gamma)\mu - \mu\big)'\big(\Omega_2^{-1/2}\mathrm{A} + P(\gamma)\varepsilon_2\big) \\
&\quad + \gamma\,\mathrm{tr}\big(P(\gamma)\big) + (1 - \gamma)\,\mathrm{tr}\big(P^2(\gamma)\big) - C,
\end{aligned} \qquad (42)
\]
where $C = \gamma^2\sigma_{(2)}^2\,\varepsilon_1'X_1\big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\big)^{-1}X_2'\Omega_2^{-1}X_2\big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\big)^{-1}X_1'\varepsilon_1$. See Appendix A.3 for details. Note that we can rewrite the optimality condition as:
\[
\frac{\hat{R}(\gamma)}{R(\gamma)} = \frac{\hat{R}(\gamma) - R(\gamma) + R(\gamma)}{R(\gamma)} = 1 + \frac{\hat{R}(\gamma) - R(\gamma)}{R(\gamma)}. \qquad (43)
\]
Thus, to prove the optimality condition in Theorem 4, it suffices to show that the terms in $\frac{\hat{R}(\gamma) - R(\gamma)}{R(\gamma)}$ are negligible. In other words, we need to prove that the following conditions hold:
\[
\sup_{\gamma \in \mathcal{H}}\frac{|\varepsilon_2'\Omega_2^{-1/2}\mathrm{A}|}{R(\gamma)} = o_p(1), \qquad (44)
\]
\[
\sup_{\gamma \in \mathcal{H}}\frac{|\mu'P(\gamma)\varepsilon_2|}{R(\gamma)} = o_p(1), \qquad (45)
\]
\[
\sup_{\gamma \in \mathcal{H}}\frac{|\varepsilon_2'P(\gamma)\varepsilon_2|}{R(\gamma)} = o_p(1), \qquad (46)
\]
\[
\sup_{\gamma \in \mathcal{H}}\frac{|\mathrm{tr}(P(\gamma))|}{R(\gamma)} = o_p(1), \qquad (47)
\]
\[
\sup_{\gamma \in \mathcal{H}}\frac{|C|}{R(\gamma)} = o_p(1), \qquad (48)
\]
\[
\sup_{\gamma \in \mathcal{H}}\frac{|\varepsilon_2'P^2(\gamma)\varepsilon_2|}{R(\gamma)} = o_p(1), \qquad (49)
\]
\[
\sup_{\gamma \in \mathcal{H}}\frac{|\mathrm{A}'\Omega_2^{-1/2}P(\gamma)\varepsilon_2|}{R(\gamma)} = o_p(1), \qquad (50)
\]
\[
\sup_{\gamma \in \mathcal{H}}\frac{|(P(\gamma)\mu - \mu)'\Omega_2^{-1/2}\mathrm{A}|}{R(\gamma)} = o_p(1), \qquad (51)
\]
\[
\sup_{\gamma \in \mathcal{H}}\frac{|(P(\gamma)\mu - \mu)'P(\gamma)\varepsilon_2|}{R(\gamma)} = o_p(1), \qquad (52)
\]
\[
\sup_{\gamma \in \mathcal{H}}\frac{|\mathrm{tr}(P(\gamma))|}{R(\gamma)} = o_p(1), \qquad (53)
\]
\[
\sup_{\gamma \in \mathcal{H}}\frac{|\mathrm{tr}(P^2(\gamma))|}{R(\gamma)} = o_p(1). \qquad (54)
\]
Given the above conditions, the proof of Theorem 4 follows; therefore, the weight derived by CV is optimal. See Appendix A.4 for the proof and details of each of the above conditions.
5 Alternative combined estimator proposed by Pesaran et al. (2013)
In a particularly insightful paper, Pesaran et al. (2013), hereafter PPP, show that we can decrease the MSFE under structural breaks by using all observations in the sample instead of only the post-break observations. Their proposed estimator is:
\[
\hat{\beta}_{PPP} = (X'WX)^{-1}(X'WY). \qquad (55)
\]
They derive the optimal weight, $W$, such that the MSFE of the one-step-ahead forecast is minimized, and find that the optimal weight takes one value for the observations before the break point and another value for the post-break observations, $W = \mathrm{diag}(w_{(1)}, \ldots, w_{(1)}, w_{(2)}, \ldots, w_{(2)})$, where $w_{(1)}$ is a fixed weight for the pre-break observations, $w_{(2)}$ is a fixed weight for the post-break observations, and these are defined as:
\[
w_{(1)} = \frac{1}{T}\,\frac{1}{b_1 + (1 - b_1)(q^2 + T b_1\phi^2)}, \qquad
w_{(2)} = \frac{1}{T}\,\frac{q^2 + T b_1\phi^2}{b_1 + (1 - b_1)(q^2 + T b_1\phi^2)}, \qquad (56)
\]
where $b_1 = \frac{T_1}{T}$ is the proportion of observations before the break, $q = \frac{\sigma_{(1)}}{\sigma_{(2)}}$ is the ratio of the break in the error term, $\phi = \frac{x_{T+1}'\lambda}{\sigma_{(2)}^2\,(x_{T+1}'Q^{-1}x_{T+1})^{1/2}}$, $Q = \mathrm{E}(x_t x_t')$ and $\lambda = \beta_{(1)} - \beta_{(2)}$ is the break size. Knowing this form of $W$, we can rewrite their estimator as:
\[
\hat{\beta}_{PPP} = \big(w_{(1)}X_1'X_1 + w_{(2)}X_2'X_2\big)^{-1}\big(w_{(1)}X_1'X_1\hat{\beta}_{(1)} + w_{(2)}X_2'X_2\hat{\beta}_{(2)}\big)
= \Lambda\hat{\beta}_{(1)} + (I - \Lambda)\hat{\beta}_{(2)}, \qquad (57)
\]
which is a combined estimator of the pre-break and post-break estimators with combination weight $\Lambda = \big(w_{(1)}X_1'X_1 + w_{(2)}X_2'X_2\big)^{-1}\big(w_{(1)}X_1'X_1\big)$. See Pesaran et al. (2013) for details.
Remark 7: Looking closely at this estimator and comparing it with the semi-parametric estimator introduced in Section 4, we can see that the two estimators are very close to each other. In fact, if we write the normalized weights for the semi-parametric estimator $\hat{\beta}_\gamma$, we see that this method also gives one value as a weight to the pre-break observations and one value to the post-break observations:
\[
w_{(1),\gamma} = \frac{\big(\frac{\gamma}{q^2}\big)}{T_1\big(\frac{\gamma}{q^2}\big) + T - T_1} = \frac{1}{T}\,\frac{1}{b_1 + (1 - b_1)\big(\frac{q^2}{\gamma}\big)} \quad \text{for } 1 < t \leq T_1, \qquad
w_{(2),\gamma} = \frac{1}{T_1\big(\frac{\gamma}{q^2}\big) + T - T_1} = \frac{1}{T}\,\frac{\big(\frac{q^2}{\gamma}\big)}{b_1 + (1 - b_1)\big(\frac{q^2}{\gamma}\big)} \quad \text{for } T_1 < t < T. \qquad (58)
\]
By comparing these weights with those in equation (56), we can see that PPP's estimator is a special case of this semi-parametric estimator under the condition that
\[
\gamma = \frac{q^2}{q^2 + T_1\phi^2}. \qquad (59)
\]
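A one-line substitution makes this correspondence explicit (this is just algebra on equations (56) and (58), not an additional result): with $\gamma = \frac{q^2}{q^2 + T_1\phi^2}$,
\[
\frac{q^2}{\gamma} = q^2 + T_1\phi^2 = q^2 + T b_1\phi^2,
\]
so that $w_{(1),\gamma} = \frac{1}{T}\,\frac{1}{b_1 + (1 - b_1)(q^2 + T b_1\phi^2)} = w_{(1)}$ and $w_{(2),\gamma} = \frac{1}{T}\,\frac{q^2 + T b_1\phi^2}{b_1 + (1 - b_1)(q^2 + T b_1\phi^2)} = w_{(2)}$.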
6 Extension to multiple breaks
So far we have discussed the case of a single break in the model, but in practice the time series may be subject to multiple breaks. The case of multiple breaks is a straightforward extension of the preceding sections for the Stein-like combined estimator: the combined estimator can always be defined as the combination of $\hat{\beta}_{\text{Full}}$ and the estimator based on the observations after the most recent break point. Here we write the model with two breaks in detail, and then extend it to the case of more than two breaks. Suppose we have two breaks (three regimes) in the model, the break dates are $T_1$ and $T_2$, and we have breaks in both the coefficients and the error terms, such that
\[
y_t = \begin{cases} x_t'\beta_{(1)} + \sigma_{(1)}\varepsilon_t & \text{if } 1 < t \leq T_1 \\ x_t'\beta_{(2)} + \sigma_{(2)}\varepsilon_t & \text{if } T_1 < t \leq T_2 \\ x_t'\beta_{(3)} + \sigma_{(3)}\varepsilon_t & \text{if } T_2 < t < T. \end{cases} \qquad (60)
\]
Thus the combined estimator in the case of two breaks is:
\[
\hat{\beta}_\alpha = \alpha\hat{\beta}_{\text{Full}} + (1 - \alpha)\hat{\beta}_{(3)}, \qquad (61)
\]
where $\alpha$ is defined as in equation (5), in which $H_T = T(\hat{\beta}_{(3)} - \hat{\beta}_{\text{Full}})'(V_{(3)} - V_F)^{-1}(\hat{\beta}_{(3)} - \hat{\beta}_{\text{Full}})$ and $V_{(3)}$ is the variance of the post-break estimator. Clearly $\hat{\beta}_\alpha - \beta_{(3)} = \big(\hat{\beta}_{(3)} - \beta_{(3)}\big) - \alpha\big(\hat{\beta}_{(3)} - \hat{\beta}_{\text{Full}}\big)$, so $\rho(\hat{\beta}_\alpha, W) = \mathrm{E}\big[T(\hat{\beta}_\alpha - \beta_{(3)})'W(\hat{\beta}_\alpha - \beta_{(3)})\big]$. In order to derive the risk, we first need to find the distributions of $\big(\hat{\beta}_{(3)} - \beta_{(3)}\big)$ and $\big(\hat{\beta}_{(3)} - \hat{\beta}_{\text{Full}}\big)$. Let $b_1 = \frac{T_1}{T}$, $b_2 = \frac{T_2}{T}$, and $b_3 = \frac{T_3}{T}$, where $b_1 < b_2 < b_3$.
Suppose that, under the local alternative, the break sizes are $(\beta_{(1)} - \beta_{(2)},\, \beta_{(2)} - \beta_{(3)}) = \big(\frac{\delta_1}{\sqrt{T}},\, \frac{\delta_2}{\sqrt{T}}\big)$. Then the full-sample estimator satisfies:
\[
\begin{aligned}
\sqrt{T}\big(\hat{\beta}_{\text{Full}} - \beta_{(3)}\big)
&= \Big(\sum_{t=1}^{T}\frac{x_t x_t'}{T\sigma_t^2}\Big)^{-1}\Big[\sum_{t=1}^{T_1} x_t x_t'\frac{\sqrt{T}\big(\beta_{(1)} - \beta_{(3)}\big)}{T\sigma_{(1)}^2} + \sum_{t=T_1+1}^{T_2} x_t x_t'\frac{\sqrt{T}\big(\beta_{(2)} - \beta_{(3)}\big)}{T\sigma_{(2)}^2} + \sum_{t=1}^{T}\frac{x_t\sigma_t\varepsilon_t}{\sqrt{T}\sigma_t^2}\Big] \\
&= \Big(\frac{X'\Omega^{-1}X}{T}\Big)^{-1}\Big(\frac{X_1'\Omega_1^{-1}X_1}{Tb_1}\Big)b_1(\delta_1 + \delta_2) + \Big(\frac{X'\Omega^{-1}X}{T}\Big)^{-1}\Big(\frac{X_2'\Omega_2^{-1}X_2}{T(b_2 - b_1)}\Big)(b_2 - b_1)\delta_2 + \Big(\frac{X'\Omega^{-1}X}{T}\Big)^{-1}\sum_{t=1}^{T}\frac{x_t\sigma_t\varepsilon_t}{\sqrt{T}\sigma_t^2} \\
&\overset{d}{\longrightarrow} N\big(Q^{-1}Q_1 b_1(\delta_1 + \delta_2) + Q^{-1}Q_2(b_2 - b_1)\delta_2,\; V_F\big).
\end{aligned} \qquad (62)
\]
Remember that $X = (X_1'\; X_2'\; X_3')'$ is $T \times k$, and $\Omega = \mathrm{diag}\big(\sigma_{(1)}^2, \ldots, \sigma_{(1)}^2, \sigma_{(2)}^2, \ldots, \sigma_{(2)}^2, \sigma_{(3)}^2, \ldots, \sigma_{(3)}^2\big)$ is a $T \times T$ matrix.
When we have two breaks in the model, the post-break estimator satisfies:
\[
\sqrt{T}\big(\hat{\beta}_{(3)} - \beta_{(3)}\big) = \Big(\frac{X_3'\Omega_3^{-1}X_3}{T - T_2}\Big)^{-1}\Big(\frac{X_3'\Omega_3^{-1}\sigma_{(3)}\varepsilon_3}{\sqrt{1 - b_2}\,\sqrt{T - T_2}}\Big)
\overset{d}{\longrightarrow} N\big(0,\; V_{(3)}\big), \qquad (63)
\]
where $V_{(3)} = \frac{1}{1 - b_2}\underset{T\to\infty}{\mathrm{plim}}\big(\frac{X_3'\Omega_3^{-1}X_3}{T - T_2}\big)^{-1} = \frac{1}{1 - b_2}Q_3^{-1}$. As in Theorem 1, we can write the joint asymptotic distribution of the estimators as:
\[
\sqrt{T}\begin{bmatrix} \hat{\beta}_{\text{Full}} - \beta_{(3)} \\ \hat{\beta}_{(3)} - \beta_{(3)} \end{bmatrix} \overset{d}{\longrightarrow} V^{1/2}Z, \qquad (64)
\]
where $Z \sim N(\theta, I_{2k})$, $\theta = V^{-1/2}\begin{bmatrix} Q^{-1}Q_1 b_1(\delta_1 + \delta_2) + Q^{-1}Q_2(b_2 - b_1)\delta_2 \\ 0 \end{bmatrix}$ and $V = \begin{bmatrix} V_F & V_F \\ V_F & V_{(3)} \end{bmatrix}$.
In fact, the main difference between the multiple-break case and the single-break case is the bias term $\theta$, which nevertheless keeps the same dimension as in the single-break case. We can extend this to the case of $m$ breaks occurring at times $T_1, T_2, \ldots, T_m$:
\[
y_t = \begin{cases} x_t'\beta_{(1)} + \sigma_{(1)}\varepsilon_t & \text{if } 1 < t \leq T_1, \\ x_t'\beta_{(2)} + \sigma_{(2)}\varepsilon_t & \text{if } T_1 < t \leq T_2, \\ \;\;\vdots \\ x_t'\beta_{(m+1)} + \sigma_{(m+1)}\varepsilon_t & \text{if } T_m < t < T. \end{cases} \qquad (65)
\]
In the case of $m$ breaks in the model, the $2k \times 1$ vector $\theta$ can be written as:
\[
\theta = V^{-1/2}\begin{bmatrix} Q^{-1}\big(Q_1 b_1(\delta_1 + \cdots + \delta_m) + \cdots + Q_m(b_m - b_{m-1})\delta_m\big) \\ 0 \end{bmatrix}. \qquad (66)
\]
Regarding the extension of the semi-parametric estimator introduced in Section 4, if we have $m$ breaks in the model, we need to estimate $m$ unknown weight parameters by CV, subject to the condition that the weights lie in $[0, 1]$. Once we have found these weights, we can estimate the coefficient $\hat{\beta}_\gamma$.
7 Monte Carlo evidence on forecasting performance
This section provides Monte Carlo results for the Stein-like combined estimator, the combined estimator with weight κ ∈ [0, 1], the semi-parametric estimator with weight γ, PPP's estimator, and the post-break estimator. The goal is to compare the MSFEs of these estimators. To do this, let t = 1, ..., T with T ∈ {100, 200}, q1 = σ(1)/σ(2) ∈ {0.5, 1, 2} and k ∈ {3, 5, 8}. We try different values for T1 that are proportional to the pre-break sample, b1 = T1/T ∈ {0.2, 0.4, 0.6, 0.8}. Suppose xt ∼ N(0, 1) and εt ∼ i.i.d. N(0, 1). The data generating process for the single-break case is:
\[
y_t = \begin{cases} x_t'\beta_{(1)} + q_1\varepsilon_t & \text{if } 1 < t \leq T_1 \\ x_t'\beta_{(2)} + \varepsilon_t & \text{if } T_1 < t \leq T. \end{cases} \qquad (67)
\]
Let β(2) be a vector of ones, and β(1) = β(2) + δ1/√T under the local alternative. Note that the magnitude of the distance between the parameters is determined by the localizing parameter δ1 and the sample size T. Assume that λ ≡ δ1/√T ∈ {0, 0.5, 1} represents different break sizes. The number of replications is 1000.
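For reference, the single-break design in equation (67) and the relative-MSFE comparison can be generated along the following lines. This is a schematic reproduction of the simulation setup, not the authors' original code; it reuses the hypothetical stein_combined helper sketched in Section 2.1 and reports the ratio of out-of-sample MSFEs for one-step-ahead forecasts.

import numpy as np

rng = np.random.default_rng(42)
T, k, q1, b1, lam, reps = 100, 3, 0.5, 0.4, 0.5, 1000
T1 = int(b1 * T)
beta2 = np.ones(k)
beta1 = beta2 + lam                       # lam = delta_1 / sqrt(T) is the break size

mse_post, mse_stein = 0.0, 0.0
for _ in range(reps):
    X = rng.standard_normal((T + 1, k))   # one extra row for the out-of-sample forecast
    eps = rng.standard_normal(T + 1)
    y = np.empty(T)
    y[:T1] = X[:T1] @ beta1 + q1 * eps[:T1]
    y[T1:] = X[T1:T] @ beta2 + eps[T1:T]
    y_next = X[T] @ beta2 + eps[T]        # target to forecast

    b_post = np.linalg.solve(X[T1:T].T @ X[T1:T], X[T1:T].T @ y[T1:])
    b_comb, _, _ = stein_combined(X[:T], y, T1, tau=k - 2)

    mse_post += (y_next - X[T] @ b_post) ** 2
    mse_stein += (y_next - X[T] @ b_comb) ** 2

print("RMSFE of Stein vs post-break:", mse_stein / mse_post)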
We report the results based on the relative MSFE with respect to the post-break estimator, which is the benchmark estimator, i.e.,
\[
\mathrm{RMSFE}_i = \frac{\mathrm{MSFE}(\hat{y}_i)}{\mathrm{MSFE}(\hat{y}_2)}, \qquad (68)
\]
where $\mathrm{MSFE}(\hat{y}_i)$ is $\mathrm{MSFE}(\hat{y}_\alpha)$, $\mathrm{MSFE}(\hat{y}_\kappa)$, $\mathrm{MSFE}(\hat{y}_\gamma)$ or $\mathrm{MSFE}(\hat{y}_{PPP})$. Tables 1 through 6 show the Monte Carlo results for the single-break case.
As we may have more than one break in practice, we also consider multiple breaks in our simulation design. Accordingly, we extend the simulation experiments to allow for two breaks that occur at T1 = T/3 and T2 = 2T/3, respectively. The data generating process for the two-break model is:
\[
y_t = \begin{cases} x_t'\beta_{(1)} + q_1\varepsilon_t & \text{if } 1 < t \leq T_1 \\ x_t'\beta_{(2)} + q_2\varepsilon_t & \text{if } T_1 < t \leq T_2 \\ x_t'\beta_{(3)} + \varepsilon_t & \text{if } T_2 < t \leq T, \end{cases} \qquad (69)
\]
where q2 = σ(2)/σ(3). In order to cover the different possibilities that may arise under multiple breaks, we run the Monte Carlo under different experiments. Table 7 shows the break-point specifications with the assigned experiment numbers. The first experiment has no structural breaks. We allow for both a moderate break (0.5) and a large break (1) in the coefficients in either direction (experiments 2 to 5). We also consider cases in which the direction of the break in the coefficients changes from decreasing to increasing or vice versa (experiments 6 to 9). Experiment 10 represents a partial change in the coefficients, i.e., β(1) = β(2) ≠ β(3). Experiment 11 also shows a partial break, with β(1) ≠ β(2) = β(3). Finally, experiments 12 and 13 represent higher and lower post-break volatility, respectively.
7.1 Simulation results
Tables 1 to 6 show the Monte Carlo results for the single-break case for different sample sizes and different values of q. q = 0.5 means the variance of the pre-break data is less than that of the post-break data; in this case, using the pre-break data in forecasting improves forecast accuracy. q = 1 corresponds to no break in the variance. Finally, q = 2 means that the pre-break data are more volatile than the post-break data, so one cannot gain much by using the pre-break data. Based on the results in the tables, the performance of the Stein-like combined estimator is better than (and in some cases equivalent to) that of the post-break estimator in the sense of having a lower MSFE. As we increase the number of regressors, the MSFE of the Stein-like combined estimator decreases further. When the break size is large (λ = 1), all estimators perform almost the same as the post-break estimator. As expected, when we increase the number of regressors, the semi-parametric estimator performs well, especially for small and moderate break sizes (λ ≤ 0.5), because of its lower estimation error. The PPP estimator has a smaller (almost equal) MSFE than the post-break estimator when k = 3 and the break is of moderate (large) size, but as we increase k it underperforms the post-break estimator for large break sizes. The pattern of results is similar for q = 1 and q = 2, but the MSFEs of the estimators are lower with q = 0.5. Similar results hold as we increase the sample size from 100 to 200, although with more observations the MSFEs of the estimators are closer to each other than with T = 100.
Tables 8 and 9 show the results for the two-break case under the setup specified in Table 7. Based on the results, the Stein-like combined estimator has a lower MSFE than the post-break estimator. The combined estimator with weight γ also performs very well under this setup and even has a lower MSFE than the Stein estimator.
8 Empirical analysis
This section presents an empirical illustration of our method. We consider the application of the forecasting procedure to the equity premium. The data are from Welch and Goyal (2008), of which we examine the part (1927-2018) where data for all variables are available at the relevant frequency. We consider all data frequencies (monthly, quarterly and yearly) to assess the performance of our model, and report the results based on quarterly data in this section. The results with monthly and annual data can be found in Appendices B.1 and B.2, respectively. We refer to Welch and Goyal (2008) for a detailed description of the data and sources.
The equity premium (or market premium) is the return on the stock market minus the return on a short-term risk-free treasury bill. As economic predictors of the equity premium, we consider all 14 variables from Welch and Goyal (2008) for which data are available over 1927:Q1-2018:Q4, giving a time-series dimension T = 368. The 14 variables are: D/P (dividend price ratio), D/Y (dividend yield), E/P (earnings price ratio), D/E (dividend payout ratio), SVAR (stock variance), B/M (book-to-market ratio), NTIS (net equity expansion), TBL (treasury bill rate), LTY (long term yield), LTR (long term return), TMS (term spread), DFY (default yield spread), DFR (default return spread), and INFL (inflation).
We recursively compute one-step-ahead forecasts using the different forecasting methods for the models described in this paper. Each time we expand the window, we first apply Schwarz's Bayesian Information Criterion (BIC) to choose the effective variables, out of the k = 14 available ones, that are critical in predicting the equity premium. In other words, we select the forecasting model using this criterion from all 2^k possible specifications, where k = 14. As we do not put any restriction on the number of predictors, this criterion may select all 14 predictors or none, as the extreme cases. Since BIC has a larger penalty term, it is a good safeguard against in-sample overfitting; see Pesaran and Timmermann (1995). Besides, we identify break points by the sequential procedure introduced by Bai and Perron (1998, 2003), hereafter BP, where we search for up to eight breaks and set the trimming parameter to 0.1 and the significance level to 5%. Using an initial estimation period of n1 = 92 quarters (23 years), forecasts are recursively generated at each point in the out-of-sample period using only the information available at the time the forecast is made. As the selection of the forecast evaluation period is always somewhat arbitrary, we also report the results with larger estimation window sizes of 132 and 172 observations. The results are qualitatively similar when a larger estimation period is used. The baseline forecast uses the observations after the last break identified by the sequential procedure of BP. We compare our proposed Stein-like combined estimator forecast with the BP post-break forecast, the κ-weight combined estimator forecast, and the forecast from the semi-parametric estimator for which we derive the weight (γ) by CV. We also compare our results with the estimator proposed by PPP (2013) and with the historical average (HA) forecast.
Table 10 displays the summary statistics for the equity premium and its predictors. The minimum, maximum and high standard deviations show the higher volatility of D/P, D/Y, E/P and D/E in comparison to the other predictors.
8.1 Forecast evaluation with quarterly data
In order to evaluate the performance of our proposed estimator, we compute its out-of-sample MSFE and compare it with the MSFEs of the other competing estimators. For this purpose, we divide the sample of T observations into two parts. The first n1 observations are used as the in-sample estimation period, and the remaining n2 = T - n1 observations form the out-of-sample period over which we recursively produce one-step-ahead forecasts. We consider different estimation window sizes to check the sensitivity of the estimators to this choice. The MSFE for the predictive regression model (65) over the forecast evaluation period is given by:
\[
\mathrm{MSFE}_i = \frac{1}{n_2}\sum_{t=1}^{n_2}\big(y_{n_1+t} - \hat{y}_{i,\,n_1+t}\big)^2, \qquad (70)
\]
where MSFE_i stands for the different forecasting methods: the BP post-break estimator (MSFE_BP), our proposed Stein-like combined estimator (MSFE_α), the combined estimator with weight κ (MSFE_κ), the semi-parametric estimator (MSFE_γ) and the PPP estimator (MSFE_PPP). We also report the forecast from the historical average (MSFE_HA), where the historical average of the equity premium corresponds to a constant expected equity premium.
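Schematically, the recursive out-of-sample exercise for any of the estimators can be organized as below. This is a simplified sketch: it uses a fixed set of predictors and a generic break-date estimator rather than the full BIC model selection and Bai-Perron sequential procedure described above, and the helper names (estimate_last_break, stein_combined) are hypothetical.

import numpy as np

def recursive_msfe(X, y, n1, estimate_last_break, tau):
    """One-step-ahead recursive forecasts; returns the MSFE as in eq. (70)."""
    T = len(y)
    errors = []
    for t in range(n1, T):
        X_in, y_in = X[:t], y[:t]                   # expanding estimation window
        T1 = estimate_last_break(X_in, y_in)        # most recent break date within the window
        b_comb, _, _ = stein_combined(X_in, y_in, T1, tau)
        errors.append(y[t] - X[t] @ b_comb)         # forecast error for period t
    return np.mean(np.square(errors))

The same loop, with stein_combined replaced by the post-break, κ, γ, or PPP estimator, produces the competing MSFEs entering equations (68) and (71).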
Table 11 shows the recursive out-of-sample forecasting results under different in-sample estimation periods (n1), so the beginning of the forecast evaluation period runs from 1950:Q1 (n1 = 92) through 1970:Q1 (n1 = 172). We evaluate the forecasts for horizons h = 1, 2, 3, 4 quarters. Based on the results in Table 11, the Stein-like combined estimator delivers substantially improved forecasts (lower MSFE) compared to the BP post-break estimator for all horizons. Moreover, the semi-parametric estimator and the combined estimator with weight κ have a lower MSFE than the post-break estimator, but the combined estimator with the Stein-type weight still outperforms these estimators. Our proposed combined estimator performs almost the same as PPP's estimator when h = 1, but for h > 1 the Stein-like combined estimator outperforms the PPP estimator. Moreover, the Stein-like combined estimator has a lower MSFE than HA, which is hard to beat. The good performance of the Stein-type weight can be attributed to the form of its weight, which captures the break size and adjusts the weight between the full-sample estimator and the post-break estimator accordingly.
We also use the out-of-sample R² statistic suggested by Campbell and Thompson (2008) to compare the forecasts based on our proposed Stein-like combined estimator with the alternative models "κ", "γ", "PPP" and "HA". We compute the out-of-sample R² by:
\[
R_i^2 = 1 - \frac{\mathrm{MSFE}_\alpha}{\mathrm{MSFE}_i}, \qquad (71)
\]
where MSFEα is the MSFE for the Stein-like combined estimator and MSFEi shows the MSFE
for each of the introduced models including “κ”, “γ”, “PPP” and “HA”. Obviously positive R2
27
shows the out-performance of our proposed Stein-like combined estimator relative to the benchmark
model, and negative R2 value indicates the under-performance of our Stein estimator. Finally, we
evaluate whether any of the out/under-performance of our model is statistically significant or not
corresponding to testing H0 : R2 = 0 (MSFEi = MSFEα) against the alternative hypothesis that
H1 : R2 > 0 (MSFEi > MSFEα). We report the well known Diebold and Mariano (1995) and
West (1996) statistic for testing the null of equal MSFE (or equal predictive ability). The results
are displayed in Table 12. Based on the results, the Stein-like combined estimator has a larger
positive R2 compare to the κ estimator, semi-parametric estimator and the post-break estimator,
and the value of R2 is significant across different estimation window size at 5% significance level
for all h. Besides, the semi-parametric estimator is performing well in terms of having lower
MSFE which actually this estimator facilitate implementing of our theoretical point that using pre-
break observations improve forecast accuracy. Based on the results, the semi-parametric estimator
performs better for the longer forecast horizons.
8.2 Cumulative difference square forecast error
In order to check whether our proposed Stein-like combined estimator outperforms the other estimators consistently, we plot the cumulative difference of squared forecast errors (CDSFE) for our proposed estimator versus the other models under different forecast horizons. This informative graph is recommended by Welch and Goyal (2008), see also Rapach et al. (2013), to assess the consistency of any outperformance of a model. To interpret the graph, we check whether the CDSFE takes positive values during any period of time. A strong forecasting model should perform well across the majority of the out-of-sample period, not only around one or two short episodes. Thus, positive (negative) values indicate consistent outperformance (underperformance) of our proposed estimator relative to the benchmark model. In particular, rising values of the CDSFE represent a higher MSFE for the benchmark model relative to our model, while flat spots reflect equal performance of the two estimators. The CDSFE is calculated as:
\[
\mathrm{CDSFE}_{i,t} = \sum_{\tau=1}^{t}\big(e_{i,\,n_1+\tau}^2 - e_{\alpha,\,n_1+\tau}^2\big), \qquad (72)
\]
where t = 1, . . . , n2 and we set n1 ∈ {92, 132, 172} as the size of the initial in-sample estimation period, so the out-of-sample forecast evaluation period starts in the early 1950s, 1960s and 1970s, respectively. Also, e²_{i, n1+τ} is the squared forecast error of benchmark estimator i at time τ, and e²_{α, n1+τ} is the squared forecast error of our proposed Stein-like combined estimator at time τ. Figure 2 displays the CDSFE for the different benchmark models with quarterly data. The solid dark blue line shows the CDSFE for the BP post-break estimator. As is clear from the graph, we have increasing positive values for this estimator over 1950-2018 for all forecast horizons h, which shows the consistent outperformance of the Stein-like combined estimator over the BP post-break estimator.
The next best forecast is the one based on the semi-parametric method (green dashed line); the Stein estimator consistently outperforms this semi-parametric estimator. The semi-parametric estimator is not sensitive to the choice of the initial in-sample estimation period and performs consistently regardless of the choice of n1, but for longer forecast horizons, h = 3, 4, the CDSFE becomes smaller, which underlines the strong performance of this estimator at longer horizons.
The outperformance of the Stein-like combined estimator relative to the combined estimator with κ weight (pink dashed line) can be seen in the figures for all h. Based on the graphs, the κ-weight estimator is sensitive to the choice of the initial in-sample estimation period: as we increase n1, the CDSFE becomes smaller, but the Stein-like estimator still outperforms the κ estimator. Also, the Stein-like combined estimator forecast outperforms the forecast based on the PPP estimator (dotted red line) for almost all h, except for h = 4 and n1 ∈ {92, 132}, where we observe an insignificant underperformance.
Finally, the Stein-like combined estimator out-performs the historical average except for h = 4, where we see a period of under-performance around the 2000s. Overall, the Stein-like combined estimator is the best estimator amongst the estimators introduced in this paper.
9 Conclusion
In this paper we introduce a Stein-like combined estimator of the full-sample estimator (which uses all observations in the sample) and the post-break estimator (which uses the observations after the most recent break point). The standard solution for forecasting under structural breaks is to use the post-break estimator, but it has been shown that using pre-break observations can decrease the MSFE. Because our combined estimator uses the pre-break observations, we are able to reduce the variance of the forecast error at the cost of adding some bias. We also introduce a semi-parametric estimator which applies a discrete kernel weight to all observations in the sample: it gives weight one to the post-break observations and down-weights the pre-break observations by a weight γ ∈ [0, 1], which we determine numerically by cross-validation. Both the Stein-like combined estimator and the semi-parametric estimator have lower MSFE than the post-break estimator. Especially in the case of a large number of regressors, the semi-parametric estimator performs well because it does not involve the estimation error of many parameters; we only need to estimate the weight γ. We also compare the performance of our proposed Stein-like combined estimator with two alternative estimators: the combined estimator with weight κ ∈ [0, 1] and the estimator proposed by Pesaran et al. (2013). Based on the simulation results, all of the estimators perform well, but when the number of regressors is high and the break size is large, the MSFE of the PPP estimator exceeds that of the post-break estimator, whereas our proposed estimators do not suffer from this problem. Moreover, the results from the empirical example with the equity premium show that the Stein-like combined estimator performs better than the other estimators introduced in this paper (including the historical average) and is more robust to the frequency of the data and the choice of the initial in-sample estimation window.
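To fix ideas, the sketch below shows one way the combination described above could be assembled from the full-sample and post-break OLS estimates, using a weight of the form α = 1 − τ/HT based on a Hausman-type statistic (the weight reported in Table 11). The break date, the homoskedastic variance estimates, the positive-part truncation, the choice of τ, and the assignment of α to the post-break estimator are simplifying assumptions made only for this illustration; the exact definitions are those given in the body of the paper.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares coefficient vector."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def stein_combined_forecast(X, y, break_idx, x_next, tau=1.0):
    """Illustrative Stein-like combination of the post-break and full-sample
    OLS estimators.  H_T is a Hausman-type distance between the two estimates;
    alpha = 1 - tau / H_T (truncated to [0, 1]) is taken here as the weight on
    the post-break estimator, so a severe break (large H_T) pushes the forecast
    toward the post-break estimator."""
    beta_full = ols(X, y)                        # full-sample estimator
    X2, y2 = X[break_idx:], y[break_idx:]
    beta_post = ols(X2, y2)                      # post-break estimator
    resid = y2 - X2 @ beta_post
    sigma2 = resid @ resid / (len(y2) - X2.shape[1])
    V_post = sigma2 * np.linalg.inv(X2.T @ X2)   # var. of post-break estimator
    V_full = sigma2 * np.linalg.inv(X.T @ X)     # var. of full-sample estimator
    diff = beta_post - beta_full
    H_T = float(diff @ np.linalg.pinv(V_post - V_full) @ diff)  # Hausman stat.
    H_T = max(H_T, 1e-12)                        # guard against division by zero
    alpha = np.clip(1.0 - tau / H_T, 0.0, 1.0)   # shrinkage weight in [0, 1]
    beta_comb = alpha * beta_post + (1.0 - alpha) * beta_full
    return x_next @ beta_comb
```

With no break, H_T is small, α is truncated at zero and the forecast reverts to the efficient full-sample estimator; under a severe break, α approaches one and the forecast relies on the consistent post-break estimator.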
References
Andrews, D.W.K. (1993). “Tests for parameter instability and structural change with unknown
change point.” Econometrica 61, 821-856.
Bai, J. and Perron, P. (1998). “Estimating and testing linear models with multiple structural
changes.” Econometrica 66, 47-78.
Bai, J. and Perron, P. (2003). “Computation and analysis of multiple structural change models.”
Journal of Applied Econometrics 18, 1-22.
Bates, J.M. and C.W.J. Granger (1969). “The Combination of Forecasts.” Operational Research Quarterly 20, 451-468.
Boot, T. and A. Pick (2017). “A near optimal test for structural breaks when forecasting under square error loss.” Tinbergen Institute Discussion Paper No. 17-039/III, Tinbergen Institute, Amsterdam and Rotterdam.
Campbell, J. Y, and S. B. Thompson (2008). “Predicting the Equity Premium Out of Sample:
Can Anything Beat the Historical Average?” Review of Financial Studies 21, 1509-31.
Clements, M.P. and D.F. Hendry (1998). “Forecasting Economic Time Series.” Cambridge
University Press.
Clements, M.P. and D.F. Hendry (1999). “Forecasting Non-stationary Economic Time Series.”
The MIT Press.
Clements, M.P. and D.F. Hendry (2006). “Forecasting with Breaks.” In Elliott, G., C.W.J.
Granger and A. Timmermann (Eds.), Handbook of Economic Forecasting vol. 1. North-
Holland, 605-658
Diebold, F.X. and P. Pauly (1987). “Structural Change and the Combination of Forecasts.”
Journal of Forecasting 6, 21-40.
Diebold, F. X. and R. S. Mariano (1995). “Comparing predictive accuracy.”Journal of Business
and Economic Statistics 13, 253-263.
Hansen, B.E. (2016). “Efficient shrinkage in parametric models.” Journal of Econometrics 190, 115-132.
Hansen, B.E. (2017). “Stein-like 2SLS estimator.” Econometric Reviews 36, 840-852.
Hausman, J.A. (1978). “Specification tests in econometrics.” Econometrica 46, 1251-1271.
James, W., Stein, Charles M. (1961). “Estimation with quadratic loss.” In Proceedings of the
Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 361–380.
Judge, G.G. and M.E. Bock (1978). “The statistical implications of pretest and Stein-rule
estimators in econometrics.” North-Holland, Amsterdam.
Lebedev, N. N. (1972). “Special Functions and Their Applications.” Dover Publications, Inc.;
New York.
Li, K.-C. (1987). “Asymptotic optimality for Cp , CL , cross-validation and generalized cross-
validation: discrete index set.” The Annals of Statistics 15, 958-975.
Li, Q., Ouyang, D., Racine, J. (2013). “Categorical semiparametric varying coefficient models.”
Journal of Applied Econometrics 28,551–579.
Lehmann, E. L., Casella, G. (1998). “Theory of Point Estimation.” 2nd ed. New York: Springer.
Maasoumi, E. (1978). “A modified Stein-like estimator for the reduced form coefficients of
simultaneous equations.” Econometrica 46,695–703.
Pesaran, M.H. and A. Timmermann (2005). “Small Sample Properties of Forecasts from Auto-
regressive Models under Structural Breaks.” Journal of Econometrics 129, 183-217.
Pesaran, M.H., and A. Timmermann (1995). “Predictability of Stock Returns: Robustness and
Economic Significance.” Journal of Finance 50, 1201-28.
Pesaran, M.H., A. Pick and M. Pranovich (2013). “Optimal forecasts in the presence of structural breaks.” Journal of Econometrics 177(2), 134-152.
Pesaran, M.H. and A. Timmermann (2007). “Selection of estimation window in the presence of breaks.” Journal of Econometrics 137, 134-161.
Pesaran, M.H., and Andreas Pick (2011). “Forecast Combination Across Estimation Windows.”
Journal of Business and Economic Statistics, 29:2, 307-318, DOI: 10.1198/ jbes.2010.09018
Rao, C. R. (1973). “Linear Statistical Inference and Its Applications.” New York: John Wiley
and Sons, Inc.
Rapach, D. E., and Zhou, G. (2013). “Forecasting stock returns.” in Graham Elliott and Allan
Timmermann, eds: Handbook of Economic Forecasting, Volume 2 (Elsevier, Amsterdam).
Rossi, B. (2013). “Advances in forecasting under instability.” In G. Elliott, & A. Timmermann
(Eds.), Handbook of Economic Forecasting, Chapter 21, 1203 - 1324. Elsevier volume 2, Part
B.
Saleh, A.K.Md. Ehsanes (2006). “Theory of Preliminary Test and Stein-Type Estimation with Applications.” Wiley, Hoboken.
Stein, Charles M. (1956). “Inadmissibility of the usual estimator for the mean of a multivariate
normal distribution.” In: Proc. Third Berkeley Symp. Math. Statist. Probab., vol. 1 pp.
197–206.
Stock, J. H., and Watson, M. W. (2004). “Combination Forecasts of Output Growth in a Seven-
Country Data Set.” Journal of Forecasting 23, 405–430.
Timmermann, A. (2006). “Forecast Combinations.” In G. Elliott, C.W.J. Granger, and A. Timmermann (Eds.), Handbook of Economic Forecasting (pp. 135-196). Elsevier.
Ullah, A. (1974). “On the sampling distribution of improved estimators for coefficients in linear
regression.” Journal of Econometrics 2, 143-50.
Ullah, A., Wan, A. T. K., Wang, H., Zhang, X., Zou, G. (2017). “A semiparametric generalized
ridge estimator and link with model averaging.” Econometric Reviews 36, 370-384.
Welch, I., and A. Goyal (2008). “A Comprehensive Look at the Empirical Performance of Equity
Premium Prediction.” Review of Financial Studies 21, 1455-508.
West, K.D., (1996). “Asymptotic inference about predictive ability.” Econometrica 64, 1067-1084.
Zhang, X., Wan, A. T. K., Zou, G. (2013). “Model averaging by Jackknife criterion in models
with dependent data.” Journal of Econometrics 174(2), 82–94.
Table 1: Simulation results for a single break, T = 100, q = 0.5
k = 3 k = 5 k = 8
λ RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP
b = 0.2
0 0.988 0.984 0.986 0.996 0.968 0.970 0.975 0.992 0.952 0.962 0.967 0.989
0.5 1.000 1.000 1.001 1.002 0.998 0.997 0.996 1.002 0.996 0.995 0.995 1.008
1 1.000 1.000 1.002 1.006 1.000 1.000 0.999 1.005 0.999 0.999 0.999 1.026
b = 0.4
0 0.985 0.980 0.976 0.988 0.959 0.959 0.957 0.982 0.926 0.935 0.926 0.971
0.5 0.998 0.999 1.000 1.005 0.996 0.994 0.996 1.003 0.993 0.992 0.995 1.000
1 0.999 1.000 1.001 1.003 1.000 1.000 1.000 1.012 0.999 0.999 1.000 1.019
b = 0.6
0 0.975 0.972 0.951 0.977 0.925 0.928 0.917 0.959 0.871 0.884 0.857 0.922
0.5 1.000 1.000 1.006 0.997 0.993 0.991 0.991 0.995 0.986 0.982 0.980 0.993
1 0.999 0.999 1.001 0.999 0.999 1.000 0.996 1.012 0.997 0.997 0.992 1.012
b = 0.8
0 0.926 0.915 0.867 0.923 0.833 0.837 0.797 0.880 0.705 0.732 0.641 0.785
0.5 0.998 0.999 1.008 0.992 0.977 0.974 0.976 0.974 0.953 0.956 0.921 0.929
1 1.000 1.000 1.003 1.007 1.000 1.000 1.002 1.010 0.997 0.994 0.984 1.014
Note: This table reports the relative MSFE. The benchmark forecast is the post-break estimator. The rows labeled b give the ratio of pre-break observations, and the column labeled λ gives the break size. RMSEα ≡ MSFE(yα)/MSFE(y2), RMSEκ ≡ MSFE(yκ)/MSFE(y2), RMSEγ ≡ MSFE(yγ)/MSFE(y2), and RMSEPPP ≡ MSFE(yPPP)/MSFE(y2).
Table 2: Simulation results for a single break, T = 100, q = 1
k = 3 k = 5 k = 8
λ RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP
b = 0.2
0 0.994 0.993 0.992 0.999 0.986 0.987 0.982 0.997 0.979 0.983 0.979 0.995
0.5 1.000 0.999 1.000 1.001 0.999 0.997 0.996 1.004 0.996 0.995 0.995 1.003
1 1.000 1.001 1.002 1.005 1.000 0.999 0.999 1.004 0.999 0.999 0.999 1.010
b = 0.4
0 0.992 0.988 0.987 0.993 0.977 0.977 0.974 0.990 0.957 0.962 0.952 0.983
0.5 0.999 0.998 0.999 1.004 0.996 0.994 0.995 1.001 0.994 0.992 0.994 0.997
1 0.999 1.000 1.001 1.002 1.000 1.000 1.000 1.014 0.998 0.999 1.000 1.011
b = 0.6
0 0.982 0.980 0.963 0.984 0.947 0.949 0.940 0.971 0.902 0.911 0.885 0.939
0.5 1.000 0.999 1.004 0.996 0.991 0.988 0.989 0.994 0.986 0.981 0.979 0.988
1 0.999 0.999 1.000 1.000 0.999 0.998 0.996 1.013 0.996 0.996 0.991 1.009
b = 0.8
0 0.940 0.931 0.882 0.930 0.860 0.859 0.822 0.893 0.737 0.759 0.672 0.803
0.5 0.999 0.985 1.003 0.989 0.973 0.978 0.972 0.973 0.940 0.938 0.918 0.931
1 1.000 1.001 1.003 1.005 0.999 0.999 1.001 1.011 0.994 0.999 0.982 1.013
Note: See the footnote of Table 1.
Table 3: Simulation results for a single break, T = 100, q = 2
k = 3 k = 5 k = 8
λ RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP
b = 0.2
0 0.997 0.998 0.998 1.000 0.995 0.995 0.995 1.001 0.993 0.994 0.993 0.998
0.5 1.000 0.999 1.000 1.000 0.998 0.997 0.996 1.001 0.996 0.996 0.994 0.998
1 1.000 1.000 1.002 1.002 1.000 0.999 0.999 1.002 0.999 0.999 0.999 1.002
b = 0.4
0 0.996 0.995 0.997 0.997 0.991 0.991 0.990 0.996 0.983 0.985 0.986 0.993
0.5 0.999 0.997 0.998 1.000 0.996 0.993 0.994 0.997 0.995 0.993 0.994 0.995
1 0.999 0.999 1.001 1.000 1.000 1.000 1.000 1.010 0.999 0.999 1.000 1.003
b = 0.6
0 0.993 0.986 0.985 0.991 0.976 0.976 0.974 0.986 0.949 0.954 0.946 0.970
0.5 0.999 0.999 1.000 0.995 0.990 0.992 0.986 0.988 0.985 0.980 0.978 0.984
1 0.999 0.999 1.000 0.999 0.998 0.997 0.995 1.005 0.994 0.993 0.991 1.001
b = 0.8
0 0.964 0.957 0.911 0.951 0.910 0.907 0.890 0.925 0.807 0.819 0.772 0.843
0.5 0.994 0.990 0.989 0.982 0.968 0.970 0.969 0.968 0.924 0.926 0.914 0.934
1 1.000 0.999 1.003 1.002 0.998 0.999 1.000 1.009 0.986 0.989 0.979 0.997
Note: See the footnote of Table 1.
Table 4: Simulation results for a single break, T = 200, q = 0.5
k = 3 k = 5 k = 8
λ RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP
b = 0.2
0 0.998 0.997 0.997 0.999 0.991 0.991 0.992 0.996 0.979 0.982 0.984 0.995
0.5 0.999 0.998 0.998 1.000 0.999 0.999 0.998 0.999 0.999 0.999 1.000 1.009
1 1.000 1.000 1.000 1.003 1.000 1.000 1.000 1.008 1.000 1.000 1.000 1.007
b = 0.4
0 0.992 0.986 0.984 0.994 0.972 0.976 0.974 0.988 0.952 0.959 0.953 0.981
0.5 0.998 0.997 0.996 0.996 0.998 0.998 0.997 0.997 0.999 0.999 0.999 1.010
1 1.000 0.999 1.000 1.002 1.000 0.999 1.000 1.005 1.000 1.000 1.000 1.004
b = 0.6
0 0.983 0.975 0.970 0.984 0.955 0.958 0.951 0.976 0.914 0.928 0.916 0.960
0.5 0.997 0.995 0.994 0.995 0.997 0.997 0.996 0.999 0.996 0.996 0.998 1.013
1 1.000 0.999 1.000 1.003 0.999 0.999 1.000 1.007 0.999 0.999 0.999 1.005
b = 0.8
0 0.968 0.958 0.931 0.963 0.903 0.907 0.882 0.931 0.819 0.841 0.784 0.884
0.5 0.996 0.993 0.994 0.996 0.994 0.994 0.995 0.997 0.987 0.984 0.984 1.005
1 1.000 1.000 1.000 1.008 0.998 0.998 1.000 1.006 0.997 0.997 0.995 1.009
Note: See the footnote of Table 1.
Table 5: Simulation results for a single break, T = 200, q = 1
k = 3 k = 5 k = 8
λ RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP
b = 0.2
0 0.999 1.000 1.002 1.000 0.997 0.998 0.999 0.999 0.994 0.994 0.995 0.998
0.5 0.999 0.998 0.997 0.999 0.999 0.999 0.998 0.997 1.000 1.000 1.000 1.002
1 1.000 1.000 1.000 1.002 1.000 1.000 1.000 1.004 1.000 1.000 1.000 1.009
b = 0.4
0 0.998 0.995 0.994 0.997 0.986 0.988 0.986 0.994 0.974 0.978 0.974 0.988
0.5 0.998 0.997 0.996 0.996 0.999 0.998 0.997 0.998 0.999 0.999 1.000 1.005
1 1.000 1.000 1.000 1.001 1.000 1.000 1.000 1.005 1.000 1.000 1.000 1.006
b = 0.6
0 0.992 0.986 0.982 0.991 0.972 0.973 0.967 0.984 0.944 0.953 0.942 0.975
0.5 0.997 0.999 0.994 0.996 0.998 0.997 0.996 0.999 0.997 0.997 0.999 1.014
1 1.000 0.999 1.000 1.002 0.999 0.999 1.000 1.007 0.999 0.999 0.999 1.005
b = 0.8
0 0.978 0.969 0.946 0.973 0.922 0.923 0.900 0.942 0.848 0.863 0.811 0.900
0.5 0.996 0.998 0.993 0.995 0.994 0.995 0.995 0.996 0.987 0.984 0.985 1.005
1 1.000 1.000 1.000 1.008 0.998 0.998 1.001 1.009 0.997 0.997 0.995 1.010
Note: See the footnote of Table 1.
Table 6: Simulation results for a single break, T = 200, q = 2
k = 3 k = 5 k = 8
λ RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP
b = 0.2
0 1.000 1.001 1.003 1.000 1.001 1.000 1.004 1.000 1.000 0.999 1.001 0.999
0.5 0.999 0.999 0.998 0.999 0.999 0.999 0.999 1.000 1.000 1.000 1.000 0.999
1 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.002 1.000 1.000 1.000 0.999
b = 0.4
0 1.001 1.000 1.007 1.000 0.998 0.997 1.000 0.998 0.992 0.993 0.994 0.996
0.5 0.998 0.997 0.996 0.996 0.999 0.998 0.998 0.998 1.000 0.999 1.000 1.000
1 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.004 1.000 1.000 1.001 1.001
b = 0.6
0 0.999 0.996 1.002 0.997 0.990 0.989 0.990 0.994 0.978 0.981 0.982 0.991
0.5 0.997 0.994 0.994 0.996 0.998 0.997 0.997 0.998 0.998 0.997 0.999 1.007
1 1.000 0.999 1.000 1.000 0.999 0.999 1.000 1.004 0.999 0.998 0.999 1.003
b = 0.8
0 0.990 0.988 0.979 0.986 0.957 0.955 0.944 0.964 0.901 0.909 0.880 0.935
0.5 0.996 0.999 0.993 0.993 0.995 0.995 0.996 0.997 0.988 0.989 0.986 1.005
1 1.000 1.000 1.000 1.007 0.998 0.998 1.001 1.009 0.997 0.996 0.996 1.011
Note: See the footnote of Table 1.
Table 7: Break point specifications by experiments
Experiment No. β(1) β(2) β(3) σ(1) σ(2) σ(3)
#1 : No break 1 1 1 1 1 1
#2 : Moderate break in coefficients (decline) 1.5 1 0.5 1 1 1
#3 : Large break in coefficients (decline) 2 1 0 1 1 1
#4 : Moderate break in coefficients (increase) 0.5 1 1.5 1 1 1
#5 : Large break in coefficients (increase) 0 1 2 1 1 1
#6 : Moderate decreasing and increasing break 1.5 1 1.5 1 1 1
#7 : Large decreasing and increasing break 2 1 2 1 1 1
#8 : Moderate increasing and decreasing break 0.5 1 0.5 1 1 1
#9 : Large increasing and decreasing break 0 1 0 1 1 1
#10 : No break, then increasing break 1 1 2 1 1 1
#11 : Decreasing, then no break 2 1 1 1 1 1
#12 : Higher post-break volatility 1 1 1 1 1 2
#13 : Lower post-break volatility 1 1 1 1 1 0.5
Table 8: Simulation results for multiple breaks, T = 100
k = 3 k = 5 k = 8
Experiment RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP
#1 0.951 0.942 0.957 0.998 0.887 0.894 0.906 0.996 0.820 0.840 0.830 0.982
#2 0.999 1.000 0.994 1.039 0.995 0.999 0.983 1.068 0.991 0.997 0.979 1.089
#3 0.999 1.000 1.000 1.010 0.998 0.998 0.995 1.041 0.997 0.998 0.993 1.079
#4 0.999 0.999 0.997 1.008 0.996 0.997 0.992 1.042 0.995 0.996 0.978 1.083
#5 1.000 1.000 1.004 1.055 1.000 1.000 1.004 1.064 1.000 1.000 1.003 1.088
#6 0.983 0.983 0.964 1.004 0.967 0.958 0.921 0.992 0.911 0.909 0.860 0.990
#7 0.998 0.999 0.968 1.011 0.995 0.998 0.942 1.009 0.990 0.992 0.876 1.010
#8 0.991 1.001 0.975 1.001 0.960 0.965 0.929 0.996 0.955 0.960 0.868 0.995
#9 0.994 0.999 0.962 1.009 1.000 1.001 0.935 1.023 1.000 1.001 0.878 1.089
#10 0.999 1.000 1.002 1.009 0.999 0.999 0.997 1.039 0.995 0.996 0.994 1.045
#11 0.984 1.000 0.966 1.011 1.000 1.000 0.934 1.024 1.000 0.999 0.876 1.095
#12 0.934 0.921 0.949 1.011 0.859 0.869 0.892 1.012 0.782 0.808 0.816 1.011
#13 0.965 0.958 0.968 1.000 0.909 0.914 0.921 0.997 0.849 0.863 0.854 0.975
Note: This table reports the relative MSFE with respect to the post-break estimator for multiple breaks when T = 100. The first column shows the experiment number, which corresponds to the break specification in Table 7. RMSEα ≡ MSFE(yα)/MSFE(y2), RMSEκ ≡ MSFE(yκ)/MSFE(y2), RMSEγ ≡ MSFE(yγ)/MSFE(y2), and RMSEPPP ≡ MSFE(yPPP)/MSFE(y2).
Table 9: Simulation results for multiple breaks, T = 200
k = 3 k = 5 k = 8
Experiment RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP
#1 0.982 0.977 0.981 0.999 0.953 0.952 0.959 0.999 0.933 0.934 0.938 0.995
#2 0.999 0.989 0.987 1.002 0.999 1.000 0.972 1.007 0.997 1.000 0.950 1.021
#3 0.999 0.999 0.983 1.007 0.998 1.000 0.974 1.016 0.996 1.000 0.939 1.011
#4 0.999 1.000 1.001 1.008 1.000 1.000 0.994 1.040 1.000 1.000 0.989 1.057
#5 0.999 1.000 1.001 1.038 0.999 1.000 1.000 1.040 0.999 1.001 0.999 1.066
#6 0.999 1.000 0.991 1.002 0.990 0.999 0.969 1.009 0.970 0.979 0.949 1.015
#7 0.999 1.001 0.986 1.001 1.000 1.001 0.973 1.015 1.000 1.000 0.950 1.039
#8 0.995 0.997 0.991 1.007 0.983 0.980 0.976 1.012 0.980 0.980 0.967 1.011
#9 0.998 0.988 0.988 1.007 0.993 0.991 0.976 1.023 0.991 0.989 0.956 1.055
#10 0.994 0.982 1.003 1.015 0.988 0.990 1.003 1.013 0.984 0.988 1.007 1.040
#11 1.000 1.000 0.991 1.007 0.993 0.999 0.965 1.010 0.989 0.990 0.939 1.029
#12 0.979 0.974 0.978 1.000 0.943 0.942 0.950 1.001 0.917 0.920 0.924 0.996
#13 0.983 0.979 0.985 0.998 0.959 0.960 0.969 1.000 0.941 0.944 0.950 0.998
Note: This table reports the relative MSFE with respect to the post-break estimator for multiple breaks when T = 200. For further details, see the footnote of Table 8.
Table 10: Summary statistics for quarterly data
Variables Mean St. dev. Min. Max. Median
Equity Premium 0.0208 0.1117 -0.3937 0.8929 0.0308
D/P -3.3785 0.4660 -4.4932 -1.9039 -3.3567
D/Y -3.3643 0.4600 -4.4966 -2.0331 -3.3291
E/P -2.7414 0.4223 -4.8074 -1.7750 -2.7926
D/E -0.6372 0.3341 -1.2442 1.3795 -0.6304
SVAR 0.0086 0.0147 0.0004 0.1144 0.0039
BM 0.5695 0.2678 0.1252 2.0285 0.5462
NTIS 0.0167 0.0257 -0.0529 0.1635 0.0168
TBL 0.0339 0.0309 0.0001 0.1549 0.0293
LTY 0.0509 0.0279 0.0179 0.1482 0.0420
LTR 0.0144 0.0466 -0.1451 0.2437 0.0094
TMS 0.0170 0.0130 -0.0350 0.0453 0.0175
DFY 0.0113 0.0070 0.0034 0.0559 0.0090
DFR 0.0010 0.0219 -0.1184 0.1629 0.0018
INFL 0.0074 0.0130 -0.0411 0.0909 0.0072
Note: D/P is dividend price ratio, D/Y is dividend yield, E/P is earning price ratio, D/E is dividend-payout ratio, SVAR is stock variance, BM is book-to-market ratio, NTIS is net equity expansion, TBL is treasury bill rate, LTY is long term yield, LTR is long term return, TMS is term spread, DFY is default yield spread, DFR is default return spread, and INFL is inflation.
Table 11: Out-of-sample forecasting performance with quarterly data
h n2 MSFEα MSFEκ MSFEγ MSFEBP MSFEPPP MSFEHA
1 1950:Q1-2018:Q4 0.0067 0.0078 0.0076 0.0088 0.0068 0.0072
1960:Q1-2018:Q4 0.0073 0.0086 0.0081 0.0097 0.0074 0.0078
1970:Q1-2018:Q4 0.0078 0.0086 0.0089 0.0099 0.0080 0.0081
2 1950:Q1-2018:Q4 0.0068 0.0091 0.0083 0.0105 0.0087 0.0073
1960:Q1-2018:Q4 0.0073 0.0100 0.0089 0.0116 0.0096 0.0078
1970:Q1-2018:Q4 0.0077 0.0087 0.0097 0.0105 0.0086 0.0081
3 1950:Q1-2018:Q4 0.0073 0.0113 0.0086 0.0128 0.0239 0.0073
1960:Q1-2018:Q4 0.0079 0.0121 0.0089 0.0138 0.0272 0.0078
1970:Q1-2018:Q4 0.0077 0.0089 0.0093 0.0109 0.0082 0.0077
4 1950:Q1-2018:Q4 0.0081 0.0139 0.0079 0.0155 0.0078 0.0073
1960:Q1-2018:Q4 0.0087 0.0155 0.0085 0.0173 0.0084 0.0078
1970:Q1-2018:Q4 0.0080 0.0097 0.0089 0.0119 0.0087 0.0077
Note: This table reports the MSFE for different estimators. h in the first column shows the forecast horizon. The second column (n2) shows the start date of the out-of-sample period, which always ends at 2018:Q4. Using the initial in-sample estimation period n1, forecasts are recursively generated at each point in the out-of-sample period using only the information available at the time the forecast is made. In the heading of the table, MSFEBP is for the case in which only post-break observations are used. MSFEα represents the results based on our proposed Stein-like combined estimator with weight α = 1 − τ/HT. MSFEκ shows the results for the combined estimator with weight κ ∈ [0, 1]. The MSFEγ column represents the results for the semi-parametric estimator, for which the weight γ is estimated by cross-validation. MSFEPPP represents the results based on the Pesaran et al. (2013) estimator. Finally, MSFEHA represents the results for the historical average.
Table 12: Statistical significance of improved predictive accuracy with quarterly data
h n2 R2κ R2γ R2BP R2PPP R2HA
1 1950:Q1-2018:Q4 0.1410 0.1184 0.2299 0.0147 0.0694(0.0067) (0.0103) (0.0002) (0.3723) (0.1173)
1960:Q1-2018:Q4 0.1512 0.0988 0.2474 0.0135 0.0641(0.0053) (0.0266) (0.0001) (0.4017) (0.1734)
1970:Q1-2018:Q4 0.0930 0.1236 0.2121 0.0250 0.0370(0.0527) (0.0167) (0.0016) (0.3647) (0.3194)
2 1950:Q1-2018:Q4 0.2527 0.1807 0.3524 0.2184 0.0685(0.0002) (0.0040) (0.0000) (0.0943) (0.1935)
1960:Q1-2018:Q4 0.2700 0.1798 0.3707 0.2396 0.0641(0.0002) (0.0075) (0.0000) (0.0969) (0.2347)
1970:Q1-2018:Q4 0.1149 0.2062 0.2667 0.1047 0.0494(0.0103) (0.0028) (0.0001) (0.0734) (0.2795)
3 1950:Q1-2018:Q4 0.3540 0.1512 0.4297 0.6946 0.0000(0.0008) (0.0163) (0.0000) (0.1575) (0.5111)
1960:Q1-2018:Q4 0.3471 0.1124 0.4275 0.7096 -0.0128(0.0016) (0.0494) (0.0000) (0.1581) (0.5392)
1970:Q1-2018:Q4 0.1348 0.1720 0.2936 0.0610 0.0000(0.0232) (0.0029) (0.0001) (0.1908) (0.4626)
4 1950:Q1-2018:Q4 0.4173 -0.0253 0.4774 -0.0385 -0.1096(0.0019) (0.5842) (0.0002) (0.6239) (0.8253)
1960:Q1-2018:Q4 0.4387 -0.0235 0.4971 -0.0357 -0.1154(0.0019) (0.5735) (0.0002) (0.6047) (0.8161)
1970:Q1-2018:Q4 0.1753 0.1011 0.3277 0.0805 -0.0390(0.0136) (0.0649) (0.0003) (0.0995) (0.6279)
Note: This table reports the results for R2. R2i = 1 − MSFEα/MSFEi, where MSFEi stands for the MSFE of the respective estimator (“κ”, “γ”, “BP”, “PPP” or “HA”). R2i shows the proportional reduction in MSFE of the Stein-like combined forecast relative to estimator i; the BP column uses the post-break estimator as the benchmark. To show whether the forecast improvement is significant, we report the p-value based on the Diebold and Mariano (1995) and West (1996) statistic in brackets. This tests the null hypothesis that MSFEi = MSFEα against the alternative that MSFEi > MSFEα (corresponding to H0 : R2i = 0 against HA : R2i > 0). The first column shows the forecast horizon, and n2 in the second column shows the start date of the out-of-sample period, which always ends at 2018:Q4.
Figure 1: Risk gain between the Stein-like combined estimator and the post-break estimator. Panels (a)-(i) correspond to b1 ∈ {0.1, 0.5, 0.9} (rows) and q ∈ {0.5, 1, 2} (columns).
Figure 2: Cumulative difference of squared forecast errors (CDSFE) for quarterly data, for all forecast horizons h = 1, . . . , 4 and n1 = 92, 132, 172 (panels (a)-(l)).
Figure 3: Cumulative difference of squared forecast errors (CDSFE) for monthly data, for all forecast horizons h = 1, . . . , 4 and n1 = 276, 396, 516 (panels (a)-(l)).
A Appendix: Mathematical details
A.1 Asymptotic risk of the Stein-like combined estimator
For the proof of equation (23), we use Lemmas 2-4. For simplicity, define A ≡ G′V 1/2 and Ψ ≡ G′2V 1/2, and let the non-centrality parameter of the chi-square distribution in Lemma 4 be µ ≡ θ′Mθ/2. For clarity, we deal with the second and third terms in equation (23) one by one. The second term can be simplified as:
\begin{align*}
\mathrm{tr}\Big[WA\,E\big((Z'MZ)^{-2}ZZ'\big)A'\Big]
&= E\big[\chi^2_{k+2}(\mu)\big]^{-2}\mathrm{tr}\big(WAMA'\big)
 + E\big[\chi^2_{k}(\mu)\big]^{-2}\mathrm{tr}\big(WA(I_{2K}-M)A'\big) \\
&\quad + E\big[\chi^2_{k+4}(\mu)\big]^{-2}\mathrm{tr}\big(WAM\theta\theta'MA'\big)
 + E\big[\chi^2_{k}(\mu)\big]^{-2}\mathrm{tr}\big(WA(I_{2K}-M)\theta\theta'(I_{2K}-M)A'\big) \\
&\quad + E\big[\chi^2_{k+2}(\mu)\big]^{-2}\mathrm{tr}\big(WA(\theta\theta'M+M\theta\theta'-2M\theta\theta'M)A'\big) \\
&= E\big[\chi^2_{k+2}(\mu)\big]^{-2}\mathrm{tr}\big(WAA'\big) + 0
 + E\big[\chi^2_{k+4}(\mu)\big]^{-2}\mathrm{tr}\big(WA\theta\theta'A'\big) + 0 + 0 \\
&= \Big[\tfrac{1}{4}\,e^{-\mu}\,\frac{\Gamma(\frac{k}{2}-1)}{\Gamma(\frac{k}{2}+1)}\,
   {}_1F_1\big(\tfrac{k}{2}-1;\tfrac{k}{2}+1;\mu\big)\Big]\mathrm{tr}\big(WAA'\big)
 + \Big[\tfrac{1}{4}\,e^{-\mu}\,\frac{\Gamma(\frac{k}{2})}{\Gamma(\frac{k}{2}+2)}\,
   {}_1F_1\big(\tfrac{k}{2};\tfrac{k}{2}+2;\mu\big)\Big]\big(\theta'A'WA\theta\big) \\
&= \Big[\frac{\theta'A'WA\theta}{(k-2)\,\theta'M\theta}\Big]
   \Big[e^{-\mu}\,{}_1F_1\big(\tfrac{k}{2}-1;\tfrac{k}{2};\mu\big)\Big]
 - \Big[\frac{\theta'A'WA\theta}{(k-2)\,\theta'M\theta}
   - \frac{\mathrm{tr}\big(WAA'\big)}{k(k-2)}\Big]
   \Big[e^{-\mu}\,{}_1F_1\big(\tfrac{k}{2}-1;\tfrac{k}{2}+1;\mu\big)\Big],
\tag{a.1}
\end{align*}
where the last equality follows from applying Lemma 3 several times, together with AM = A, MA′ = A′, A(I2K − M)A′ = 0, A(I2K − M)θθ′(I2K − M)A′ = 0, and A(θθ′M + Mθθ′ − 2Mθθ′M)A′ = 0. Recall that M is an idempotent matrix. The third term in equation (23) can be simplified as:
\begin{align*}
\mathrm{tr}\Big[W\Psi\,E\big((Z'MZ)^{-1}ZZ'\big)A'\Big]
&= E\big[\chi^2_{k+2}(\mu)\big]^{-1}\mathrm{tr}\big(W\Psi MA'\big)
 + E\big[\chi^2_{k}(\mu)\big]^{-1}\mathrm{tr}\big(W\Psi(I_{2K}-M)A'\big) \\
&\quad + E\big[\chi^2_{k+4}(\mu)\big]^{-1}\mathrm{tr}\big(W\Psi M\theta\theta'MA'\big)
 + E\big[\chi^2_{k}(\mu)\big]^{-1}\mathrm{tr}\big(W\Psi(I_{2K}-M)\theta\theta'(I_{2K}-M)A'\big) \\
&\quad + E\big[\chi^2_{k+2}(\mu)\big]^{-1}\mathrm{tr}\big(W\Psi(\theta\theta'M+M\theta\theta'-2M\theta\theta'M)A'\big) \\
&= E\big[\chi^2_{k+2}(\mu)\big]^{-1}\mathrm{tr}\big(WAA'\big) + 0
 + E\big[\chi^2_{k+4}(\mu)\big]^{-1}\mathrm{tr}\big(WA\theta\theta'A'\big) + 0
 - E\big[\chi^2_{k+2}(\mu)\big]^{-1}\mathrm{tr}\big(WA\theta\theta'A'\big) \\
&= \Big[\tfrac{1}{2}\,e^{-\mu}\,\frac{\Gamma(\frac{k}{2})}{\Gamma(\frac{k}{2}+1)}\,
   {}_1F_1\big(\tfrac{k}{2};\tfrac{k}{2}+1;\mu\big)\Big]\mathrm{tr}\big(WAA'-WA\theta\theta'A'\big)
 + \Big[\tfrac{1}{2}\,e^{-\mu}\,\frac{\Gamma(\frac{k}{2}+1)}{\Gamma(\frac{k}{2}+2)}\,
   {}_1F_1\big(\tfrac{k}{2}+1;\tfrac{k}{2}+2;\mu\big)\Big]\mathrm{tr}\big(WA\theta\theta'A'\big) \\
&= 2\,e^{-\mu}\Big[\frac{2\,\theta'A'WA\theta}{(\theta'M\theta)(k-2)}
   - \frac{\mathrm{tr}\big(WAA'\big)}{k-2}\Big]
   \times\Big[{}_1F_1\big(\tfrac{k}{2}-1;\tfrac{k}{2};\mu\big)
   - {}_1F_1\big(\tfrac{k}{2}-1;\tfrac{k}{2}+1;\mu\big)\Big],
\tag{a.2}
\end{align*}
where the last equality follows from applying Lemma 3 several times, together with ΨM = A, Ψθ = 0, Ψ(I2K − M)A′ = 0, Ψ(I2K − M)θθ′(I2K − M)A′ = 0, and AA′ = V(2) − VF. Finally, plugging equations (a.1) and (a.2) into equation (23) produces equation (15).
A.2 Proof of equation (17)
We want to show that
\[
\mathrm{tr}\big(W(V_{(2)}-V_F)\big) \;-\; \frac{2\,\theta'V^{1/2}GWG'V^{1/2}\theta}{\theta'M\theta} \;>\; 0.
\]
It is clear that
\[
\mathrm{tr}\big(W(V_{(2)}-V_F)\big) = \mathrm{tr}\big(W(V_{(2)}-V_F)\,I_k\big)
\le \mathrm{tr}(I_k)\,\lambda_{\max}\big(W(V_{(2)}-V_F)\big).
\]
Thus,
\begin{align*}
\mathrm{tr}\big(W(V_{(2)}-V_F)\big) > \sup_{\theta}\frac{2\,\theta'V^{1/2}GWG'V^{1/2}\theta}{\theta'M\theta}
&\;\Rightarrow\;
k\,\lambda_{\max}\big(W(V_{(2)}-V_F)\big) > \mathrm{tr}\big(W(V_{(2)}-V_F)\big) > 2\,\lambda_{\min}\big(W(V_{(2)}-V_F)\big) \\
&\;\Rightarrow\;
k > \frac{2\,\lambda_{\min}\big(W(V_{(2)}-V_F)\big)}{\lambda_{\max}\big(W(V_{(2)}-V_F)\big)}
\;\Rightarrow\; k > 2.
\end{align*}
Proof of equation (18):
We want to show that
\[
\frac{\theta'V^{1/2}GWG'V^{1/2}\theta}{\theta'M\theta} \;-\; \frac{\mathrm{tr}\big(W(V_{(2)}-V_F)\big)}{k} \;>\; 0.
\]
We have
\[
\inf_{\theta}\; k\,\frac{\theta'V^{1/2}GWG'V^{1/2}\theta}{\theta'M\theta} > \mathrm{tr}\big(W(V_{(2)}-V_F)\big),
\qquad\text{therefore}\qquad
k > \frac{\mathrm{tr}\big(W(V_{(2)}-V_F)\big)}{\lambda_{\min}\big(W(V_{(2)}-V_F)\big)}.
\]
A.3 Risk of the semi-parametric estimator
\begin{align*}
R(\gamma) &= E\big[L(\gamma)\big]
= E\Big[\big(\hat{\mu}(\gamma)-\mu\big)'\big(\hat{\mu}(\gamma)-\mu\big)\Big]
= E\Big[\big(\Omega_2^{-1/2}(A+P(\gamma)Y_2)-\mu\big)'\big(\Omega_2^{-1/2}(A+P(\gamma)Y_2)-\mu\big)\Big] \\
&= E\Big[\big(\Omega_2^{-1/2}A+\Omega_2^{-1/2}P(\gamma)\Omega_2^{1/2}\mu+\Omega_2^{-1/2}P(\gamma)\sigma_{(2)}\varepsilon_2-\mu\big)'
        \big(\Omega_2^{-1/2}A+\Omega_2^{-1/2}P(\gamma)\Omega_2^{1/2}\mu+\Omega_2^{-1/2}P(\gamma)\sigma_{(2)}\varepsilon_2-\mu\big)\Big] \\
&= E\Big[\big(\Omega_2^{-1/2}A+P(\gamma)\varepsilon_2+(P(\gamma)\mu-\mu)\big)'
        \big(\Omega_2^{-1/2}A+P(\gamma)\varepsilon_2+(P(\gamma)\mu-\mu)\big)\Big] \\
&= E\Big[A'\Omega_2^{-1}A+\varepsilon_2'P^2(\gamma)\varepsilon_2+(P(\gamma)\mu-\mu)'(P(\gamma)\mu-\mu)
        +2A'\Omega_2^{-1/2}(P(\gamma)\mu-\mu)\Big] \\
&= E\big[A'\Omega_2^{-1}A\big]+\mathrm{tr}\big(P^2(\gamma)\big)+(P(\gamma)\mu-\mu)'(P(\gamma)\mu-\mu) \\
&= \Upsilon+\gamma\,\mathrm{tr}\big(P(\gamma)-P^2(\gamma)\big)+\mathrm{tr}\big(P^2(\gamma)\big)
   +(P(\gamma)\mu-\mu)'(P(\gamma)\mu-\mu) \\
&= \Upsilon+\gamma\,\mathrm{tr}\big(P(\gamma)\big)+(1-\gamma)\,\mathrm{tr}\big(P^2(\gamma)\big)
   +\Big(P(\gamma)\big(\mu+\Omega_2^{-1/2}\sigma_{(2)}\varepsilon_2-\Omega_2^{-1/2}\sigma_{(2)}\varepsilon_2\big)-\mu\Big)'
    \Big(P(\gamma)\big(\mu+\Omega_2^{-1/2}\sigma_{(2)}\varepsilon_2-\Omega_2^{-1/2}\sigma_{(2)}\varepsilon_2\big)-\mu\Big) \\
&= \Upsilon+\gamma\,\mathrm{tr}\big(P(\gamma)\big)+(1-\gamma)\,\mathrm{tr}\big(P^2(\gamma)\big)
   +\Big(\Omega_2^{-1/2}P(\gamma)Y_2+\Omega_2^{-1/2}A-\Omega_2^{-1/2}A-P(\gamma)\varepsilon_2-\mu\Big)'
    \Big(\Omega_2^{-1/2}P(\gamma)Y_2+\Omega_2^{-1/2}A-\Omega_2^{-1/2}A-P(\gamma)\varepsilon_2-\mu\Big) \\
&= \Upsilon+\gamma\,\mathrm{tr}\big(P(\gamma)\big)+(1-\gamma)\,\mathrm{tr}\big(P^2(\gamma)\big)
   +\big(\hat{\mu}(\gamma)-\mu-\Omega_2^{-1/2}A-P(\gamma)\varepsilon_2\big)'
    \big(\hat{\mu}(\gamma)-\mu-\Omega_2^{-1/2}A-P(\gamma)\varepsilon_2\big) \\
&= L(\gamma)+A'\Omega_2^{-1}A+\varepsilon_2'P^2(\gamma)\varepsilon_2
   -2\big(\hat{\mu}(\gamma)-\mu\big)'\big(\Omega_2^{-1/2}A+P(\gamma)\varepsilon_2\big)
   +\Upsilon+\gamma\,\mathrm{tr}\big(P(\gamma)\big)+(1-\gamma)\,\mathrm{tr}\big(P^2(\gamma)\big) \\
&= L(\gamma)+A'\Omega_2^{-1}A+\varepsilon_2'P^2(\gamma)\varepsilon_2-2A'\Omega_2^{-1}A
   -2\big(P(\gamma)\mu-\mu\big)'\big(\Omega_2^{-1/2}A+P(\gamma)\varepsilon_2\big)
   -2\,\varepsilon_2'P^2(\gamma)\varepsilon_2 \\
&\quad -4A'\Omega_2^{-1/2}P(\gamma)\varepsilon_2
   +\Upsilon+\gamma\,\mathrm{tr}\big(P(\gamma)\big)+(1-\gamma)\,\mathrm{tr}\big(P^2(\gamma)\big) \\
&= L(\gamma)-C-\varepsilon_2'P^2(\gamma)\varepsilon_2
   -2\big(P(\gamma)\mu-\mu\big)'\big(\Omega_2^{-1/2}A+P(\gamma)\varepsilon_2\big)
   +\gamma\,\mathrm{tr}\big(P(\gamma)\big)+(1-\gamma)\,\mathrm{tr}\big(P^2(\gamma)\big)
   -4A'\Omega_2^{-1/2}P(\gamma)\varepsilon_2,
\tag{a.3}
\end{align*}
where $E\big[A'\Omega_2^{-1}A\big]=\Upsilon+\gamma\,\mathrm{tr}\big(P(\gamma)-P^2(\gamma)\big)$ and $A'\Omega_2^{-1}A=\Upsilon+C$, with
\[
\Upsilon = \frac{\gamma^2}{q^4}\,\beta_{(1)}'X_1'X_1\Big(\frac{\gamma}{q^2}X_1'X_1+X_2'X_2\Big)^{-1}X_2'\Omega_2^{-1}X_2\Big(\frac{\gamma}{q^2}X_1'X_1+X_2'X_2\Big)^{-1}X_1'X_1\beta_{(1)},
\]
and
\[
C = \gamma^2\sigma_{(2)}^2\,\varepsilon_1'X_1\Big(\frac{\gamma}{q^2}X_1'X_1+X_2'X_2\Big)^{-1}X_2'\Omega_2^{-1}X_2\Big(\frac{\gamma}{q^2}X_1'X_1+X_2'X_2\Big)^{-1}X_1'\varepsilon_1.
\]
A.4 Details for the optimality of the cross validated weight
In order to prove the conditions in equations (44)-(54), define λmax(B) to be the largest eigenvalue
of matrix B. Thus:
\[
\sup_{\gamma\in H}\lambda_{\max}\big(P(\gamma)\big)
= \sup_{\gamma\in H}\lambda_{\max}\Big(X_2\Big(\frac{\gamma}{q^2}X_1'X_1+X_2'X_2\Big)^{-1}X_2'\Big)
\le \lambda_{\max}\big(X_2(X_2'X_2)^{-1}X_2'\big) = 1. \tag{a.4}
\]
Also, assume $X_2'\Omega_2^{-1}X_2/\big(T(1-b)\big) \to Q$. Besides, it can be shown that $R(\gamma) = O_p(T)$, $\mu'\mu = O(T)$, and $X_2'\Omega_2^{-1/2}\varepsilon_2 = O_p(\sqrt{T})$.
Proof of equation (44):
\begin{align*}
\varepsilon_2'\Omega_2^{-1/2}A
&= \frac{\gamma}{q^2}\,\varepsilon_2'\Omega_2^{-1/2}X_2\Big(\frac{\gamma}{q^2}X_1'X_1+X_2'X_2\Big)^{-1}X_1'Y_1 \\
&\le \frac{\gamma}{q^2}\,\varepsilon_2'\Omega_2^{-1/2}X_2\Big(\frac{\gamma}{q^2}X_1'X_1\Big)^{-1}X_1'Y_1 \\
&= \varepsilon_2'\Omega_2^{-1/2}X_2\big(X_1'X_1\big)^{-1}X_1'Y_1 \\
&= \varepsilon_2'\Omega_2^{-1/2}X_2\,\hat{\beta}_{(1)} \\
&= O_p(\sqrt{T})\cdot O_p(1) = O_p(\sqrt{T}).
\tag{a.5}
\end{align*}
Proof of equation (45):
\[
\big(\mu'P(\gamma)\varepsilon_2\big)^2 \le \mu'\mu\;\varepsilon_2'P^2(\gamma)\varepsilon_2
\le \mu'\mu\;\lambda_{\max}\big(P(\gamma)\big)\,\varepsilon_2'P(\gamma)\varepsilon_2 = O_p(T), \tag{a.6}
\]
where the first inequality follows from the Cauchy-Schwarz inequality, $(X'Y)^2 \le (X'X)(Y'Y)$, and the second inequality holds because for any vector $X$ and matrix $A$ we have $X'AX \le X'X\,\lambda_{\max}(A)$, and therefore $X'A^2X \le (X'AX)\,\lambda_{\max}(A)$. Thus $\mu'P(\gamma)\varepsilon_2 = O_p(\sqrt{T})$.
Proof of equation (46):
\[
\sup_{\gamma\in H}\varepsilon_2'P(\gamma)\varepsilon_2
\le \varepsilon_2'\big(X_2(X_2'X_2)^{-1}X_2'\big)\varepsilon_2 = O_p(1). \tag{a.7}
\]
Proof of equation (47):
\[
\sup_{\gamma\in H}\mathrm{tr}\big(P(\gamma)\big) \le \mathrm{tr}\big(X_2(X_2'X_2)^{-1}X_2'\big) = k. \tag{a.8}
\]
Proof of equation (48):
\begin{align*}
C &= \gamma^2\sigma_{(2)}^2\,\varepsilon_1'X_1\Big(\frac{\gamma}{q^2}X_1'X_1+X_2'X_2\Big)^{-1}X_2'\Omega_2^{-1}X_2\Big(\frac{\gamma}{q^2}X_1'X_1+X_2'X_2\Big)^{-1}X_1'\varepsilon_1 \\
&\le \gamma^2\sigma_{(2)}^2\,\varepsilon_1'X_1\Big(\frac{\gamma}{q^2}X_1'X_1\Big)^{-1}X_2'\Omega_2^{-1}X_2\Big(\frac{\gamma}{q^2}X_1'X_1\Big)^{-1}X_1'\varepsilon_1 \\
&= q^2\,\varepsilon_1'\sigma_{(1)}X_1(X_1'X_1)^{-1}X_2'\Omega_2^{-1}X_2(X_1'X_1)^{-1}X_1'\sigma_{(1)}\varepsilon_1 \\
&= q^2\big(\hat{\beta}_{(1)}-\beta_{(1)}\big)'X_2'\Omega_2^{-1}X_2\big(\hat{\beta}_{(1)}-\beta_{(1)}\big) = O_p(1).
\tag{a.9}
\end{align*}
Proof of equation (49):
\[
\sup_{\gamma\in H}\varepsilon_2'P^2(\gamma)\varepsilon_2
\le \lambda_{\max}\big(P(\gamma)\big)\,\varepsilon_2'P(\gamma)\varepsilon_2
\le \varepsilon_2'X_2(X_2'X_2)^{-1}X_2'\varepsilon_2 = O_p(1). \tag{a.10}
\]
Proof of equation (50):
\begin{align*}
A'\Omega_2^{-1/2}P(\gamma)\varepsilon_2
&= \frac{\gamma}{q^2}\,Y_1'X_1\Big(\frac{\gamma}{q^2}X_1'X_1+X_2'X_2\Big)^{-1}X_2'\Omega_2^{-1}X_2\Big(\frac{\gamma}{q^2}X_1'X_1+X_2'X_2\Big)^{-1}X_2'\varepsilon_2 \\
&\le \frac{\gamma}{q^2}\,Y_1'X_1(X_2'X_2)^{-1}X_2'\Omega_2^{-1}X_2\Big(\frac{\gamma}{q^2}X_1'X_1\Big)^{-1}X_2'\varepsilon_2 \\
&= \frac{1}{\sigma_{(2)}}\,Y_1'X_1(X_1'X_1)^{-1}X_2'\varepsilon_2 \\
&= Y_1'X_1(X_1'X_1)^{-1}X_2'\Omega_2^{-1}\varepsilon_2 \\
&= \hat{\beta}_{(1)}'X_2'\Omega_2^{-1}\varepsilon_2 \\
&= O_p(1)\cdot O_p(\sqrt{T}) = O_p(\sqrt{T}).
\tag{a.11}
\end{align*}
Proof of equation (51):
Using the Cauchy-Schwarz inequality we have:
\begin{align*}
\Big[\big(P(\gamma)\mu-\mu\big)'\Omega_2^{-1/2}A\Big]^2
&\le \big(P(\gamma)\mu-\mu\big)'\big(P(\gamma)\mu-\mu\big)\,A'\Omega_2^{-1}A \\
&\le R(\gamma)\,A'\Omega_2^{-1}A \\
&= R(\gamma)\,\frac{\gamma^2}{q^4}\,Y_1'X_1\Big(\frac{\gamma}{q^2}X_1'X_1+X_2'X_2\Big)^{-1}X_2'\Omega_2^{-1}X_2\Big(\frac{\gamma}{q^2}X_1'X_1+X_2'X_2\Big)^{-1}X_1'Y_1 \\
&\le R(\gamma)\,\frac{\gamma^2}{q^4}\,Y_1'X_1\big(X_2'X_2\big)^{-1}X_2'\Omega_2^{-1}X_2\big(X_2'X_2\big)^{-1}X_1'Y_1 \\
&= R(\gamma)\,\frac{\gamma^2}{\sigma_{(1)}^2}\,Y_1'X_1\big(X_2'X_2\big)^{-1}X_1'Y_1 \\
&= O_p(T)\cdot O_p\Big(\frac{1}{T^2}\Big)\cdot O_p(T)\cdot O_p\Big(\frac{1}{T}\Big)\cdot O_p(T) = O_p(1),
\tag{a.12}
\end{align*}
where $Y_1'X_1(X_2'X_2)^{-1}X_1'Y_1 = O_p(T)$. Theorem 1 in Li et al. (2013) proves that the cross-validated parameter converges to zero at the fast rate of 1/T. Because we keep $\gamma^2$ in the above simplification, we need to plug in the estimate of the weight obtained from cross-validation, $\hat{\gamma} = O_p(1/T)$. Therefore, $\big(P(\gamma)\mu-\mu\big)'\Omega_2^{-1/2}A = O_p(1)$.
Proof of equation (52):
\begin{align*}
\Big[\big(P(\gamma)\mu-\mu\big)'P(\gamma)\varepsilon_2\Big]^2
&\le \big(P(\gamma)\mu-\mu\big)'\big(P(\gamma)\mu-\mu\big)\,\varepsilon_2'P^2(\gamma)\varepsilon_2 \\
&\le R(\gamma)\,\varepsilon_2'P^2(\gamma)\varepsilon_2 \\
&\le R(\gamma)\,\lambda_{\max}\big(P(\gamma)\big)\,\varepsilon_2'P(\gamma)\varepsilon_2 \\
&\le R(\gamma)\,\varepsilon_2'P(\gamma)\varepsilon_2 = O_p(T)\cdot O_p(1) = O_p(T).
\tag{a.13}
\end{align*}
So, $\big(P(\gamma)\mu-\mu\big)'P(\gamma)\varepsilon_2 = O_p(\sqrt{T})$.
Proof of equation (53):
\[
\sup_{\gamma\in H}\mathrm{tr}\big(P(\gamma)\big) \le \mathrm{tr}\big(X_2(X_2'X_2)^{-1}X_2'\big) = k = O_p(1). \tag{a.14}
\]
Proof of equation (54):
\begin{align*}
\mathrm{tr}\big(P^2(\gamma)\big)
&= \mathrm{tr}\Big[X_2\Big(\frac{\gamma}{q^2}X_1'X_1+X_2'X_2\Big)^{-1}X_2'X_2\Big(\frac{\gamma}{q^2}X_1'X_1+X_2'X_2\Big)^{-1}X_2'\Big] \\
&\le \mathrm{tr}\Big[X_2(X_2'X_2)^{-1}X_2'X_2(X_2'X_2)^{-1}X_2'\Big] = k = O_p(1).
\tag{a.15}
\end{align*}
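As a practical complement to the results above, the cross-validated weight γ of the semi-parametric estimator can be selected numerically on a grid. The sketch below is a simplified stand-in: it ignores the variance ratio q, uses a plain leave-one-out criterion over the post-break sample, and the function names and grid are our assumptions rather than the exact procedure of the paper (which follows Li et al., 2013).

```python
import numpy as np

def weighted_beta(X1, y1, X2, y2, gamma):
    """Discrete-kernel weighted least squares: pre-break observations receive
    weight gamma in [0, 1], post-break observations receive weight one."""
    XtX = gamma * (X1.T @ X1) + X2.T @ X2
    Xty = gamma * (X1.T @ y1) + X2.T @ y2
    return np.linalg.solve(XtX, Xty)

def cv_gamma(X1, y1, X2, y2, grid=None):
    """Choose gamma by leave-one-out cross-validation on the post-break sample:
    for each candidate gamma, drop one post-break observation at a time, refit,
    and accumulate the squared prediction error."""
    grid = np.linspace(0.0, 1.0, 101) if grid is None else grid
    n2 = len(y2)
    best_gamma, best_cv = None, np.inf
    for g in grid:
        cv = 0.0
        for i in range(n2):
            keep = np.arange(n2) != i
            b = weighted_beta(X1, y1, X2[keep], y2[keep], g)
            cv += (y2[i] - X2[i] @ b) ** 2
        if cv < best_cv:
            best_gamma, best_cv = g, cv
    return best_gamma
```

The selected weight would then be plugged into weighted_beta to produce the semi-parametric forecast.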
B Appendix: Additional empirical results
B.1 Empirical results based on monthly data
The monthly data run from January 1927 to December 2018, which gives a time series dimension of T = 1104. Table 13 reports the MSFEs, which are recursively generated using different initial in-sample estimation window sizes of 276, 396 and 516 observations, so that the beginning of the various forecast evaluation periods runs from Jan-1950 to Jan-1970. The statistical significance of the improvements in predictive accuracy is reported in Table 14. Based on the results, the MSFE of the Stein-like combined estimator is lower than that of all other estimators in the table, with R2 values significant at 5%. As mentioned by Campbell and Thompson (2008), for monthly data an R2 value as small as 0.5% reflects strong out-performance. The out-performance of the Stein-like estimator over the PPP estimator is not significant only for h = 1 and h = 4. Besides, we see the out-performance of the Stein-like estimator over the historical average for all forecast horizons and for any choice of the initial in-sample estimation period. Figure 3 displays the CDSFE for the different benchmark models with monthly data.
Table 13: Out-of-sample forecasting performance with monthly data
h n2 MSFEα MSFEκ MSFEγ MSFEBP MSFEPPP MSFEHA
1 1950:01-2018:12 0.0019 0.0019 0.0021 0.0020 0.0329 0.0031
1960:01-2018:12 0.0020 0.0020 0.0022 0.0021 0.0382 0.0034
1970:01-2018:12 0.0022 0.0022 0.0023 0.0022 0.0458 0.0034
2 1950:01-2018:12 0.0019 0.0020 0.0022 0.0021 0.0027 0.0031
1960:01-2018:12 0.0020 0.0021 0.0023 0.0021 0.0030 0.0034
1970:01-2018:12 0.0022 0.0022 0.0024 0.0023 0.0033 0.0034
3 1950:01-2018:12 0.0019 0.0020 0.0022 0.0021 0.0021 0.0031
1960:01-2018:12 0.0020 0.0021 0.0023 0.0022 0.0023 0.0034
1970:01-2018:12 0.0022 0.0022 0.0025 0.0023 0.0025 0.0034
4 1950:01-2018:12 0.0019 0.0021 0.0022 0.0021 0.0029 0.0031
1960:01-2018:12 0.0021 0.0022 0.0023 0.0023 0.0032 0.0034
1970:01-2018:12 0.0022 0.0023 0.0024 0.0023 0.0036 0.0034
Note: See the footnote for Table 11.
Table 14: Statistical significance of improved predictive accuracy with monthly data
h n2 R2κ R2γ R2BP R2PPP R2HA
1 1950:01-2018:12 0.0000 0.0952 0.0500 0.9422 0.3871(0.3434) (0.0021) (0.0087) (0.1531) (0.0000)
1960:01-2018:12 0.0000 0.0909 0.0476 0.9476 0.4118(0.4209) (0.0065) (0.0429) (0.1531) (0.0000)
1970:01-2018:12 0.0000 0.0435 0.0000 0.9520 0.3529(0.7506) (0.0544) (0.2134) (0.1531) (0.0000)
2 1950:01-2018:12 0.0500 0.1364 0.0952 0.2963 0.3871(0.0482) (0.0000) (0.0007) (0.0463) (0.0000)
1960:01-2018:12 0.0476 0.1304 0.0476 0.3333 0.4118(0.1043) (0.0000) (0.0068) (0.0470) (0.0000)
1970:01-2018:12 0.0000 0.0833 0.0435 0.3333 0.3529(0.7360) (0.0003) (0.1432) (0.0472) (0.0000)
3 1950:01-2018:12 0.0500 0.1364 0.0952 0.0952 0.3871(0.0311) (0.0000) (0.0006) (0.0150) (0.0000)
1960:01-2018:12 0.0476 0.1304 0.0909 0.1304 0.4118(0.0464) (0.0000) (0.0019) (0.0170) (0.0000)
1970:01-2018:12 0.0000 0.1200 0.0435 0.1200 0.3529(0.5624) (0.0000) (0.0543) (0.0181) (0.0000)
4 1950:01-2018:12 0.0952 0.1364 0.0952 0.3448 0.3871(0.0017) (0.0000) (0.0001) (0.1356) (0.0000)
1960:01-2018:12 0.0455 0.0870 0.0870 0.3438 0.3824(0.0034) (0.0000) (0.0004) (0.1373) (0.0000)
1970:01-2018:12 0.0435 0.0833 0.0435 0.3889 0.3529(0.2647) (0.0002) (0.0176) (0.1399) (0.0000)
Note: See the footnote for Table 12
B.2 Empirical results based on annual data
The annual data run from 1927 to the end of 2018, which gives a time series dimension of T = 92. We report the results under different initial in-sample estimation periods, corresponding to the early 1970s (n1 = 43) through the early 1990s (n1 = 63). As with the quarterly and monthly data, Table 15 reports the MSFE under different forecast horizons h. We also report the forecast equivalence test of the MSFEs via R2, and to show whether the improvement is significant we report the p-value based on the Diebold and Mariano (1995) and West (1996) statistic in brackets in Table 16. The results are similar to those obtained with quarterly and monthly data: the Stein-like combined estimator has a lower MSFE compared to the other estimators for all h, except relative to the historical average; however, the under-performance of the Stein-like combined estimator against the historical average is not significant.
Table 15: Out-of-sample forecasting performance with annual data
h n2 MSFEα MSFEκ MSFEγ MSFEBP MSFEPPP MSFEHA
1 1970-2018 0.0300 0.0381 0.0323 0.0323 0.0337 0.0281
1980-2018 0.0265 0.0278 0.0288 0.0288 0.0291 0.0235
1990-2018 0.0277 0.0279 0.0279 0.0279 0.028 0.0278
2 1970-2018 0.0360 0.0516 0.0520 0.0520 0.0356 0.0268
1980-2018 0.0313 0.0506 0.0508 0.0507 0.0259 0.0240
1990-2018 0.0282 0.0290 0.0289 0.0288 0.0285 0.0279
3 1970-2018 0.0288 0.0368 0.0364 0.0363 0.0315 0.0231
1980-2018 0.0270 0.0365 0.0360 0.0359 0.0260 0.0234
1990-2018 0.0268 0.0270 0.0278 0.0278 0.0271 0.0279
4 1970-2018 0.0320 0.0412 0.0407 0.0407 0.3013 0.0226
1980-2018 0.0277 0.0351 0.0356 0.0356 0.3378 0.0240
1990-2018 0.0275 0.0276 0.0284 0.0284 0.0276 0.0290
Note: See the footnote for Table 11.
Table 16: Statistical significance of improved predictive accuracy with annual data
h n2 R2κ R2γ R2BP R2PPP R2HA
1 1970-2018 0.2126 0.0712 0.0712 0.1098 -0.0676(0.1051) (0.2105) (0.2093) (0.1289) (0.6752)
1980-2018 0.0468 0.0799 0.0799 0.0893 -0.1277(0.3129) (0.2162) (0.2189) (0.1810) (0.8534)
1990-2018 0.0072 0.0072 0.0072 0.0107 0.0036(0.4096) (0.3589) (0.3744) (0.2276) (0.4698)
2 1970-2018 0.3023 0.3077 0.3077 -0.0112 -0.3433(0.1354) (0.1293) (0.1304) (0.5332) (0.9643)
1980-2018 0.3814 0.3839 0.3826 -0.2085 -0.3042(0.1242) (0.1222) (0.1232) (0.8772) (0.9437)
1990-2018 0.0276 0.0242 0.0208 0.0105 -0.0108(0.1445) (0.2247) (0.2411) (0.2345) (0.6067)
3 1970-2018 0.2174 0.2088 0.2066 0.0857 -0.2468(0.1074) (0.1184) (0.1215) (0.1845) (0.9468)
1980-2018 0.2603 0.2500 0.2479 -0.0385 -0.1538(0.1050) (0.1148) (0.1178) (0.8068) (0.9334)
1990-2018 0.0074 0.0360 0.0360 0.0111 0.0394(0.2958) (0.0132) (0.0157) (0.1414) (0.1843)
4 1970-2018 0.2233 0.2138 0.2138 0.8938 -0.4159(0.0152) (0.0317) (0.0311) (0.1475) (0.9778)
1980-2018 0.2108 0.2219 0.2219 0.9180 -0.1542(0.0664) (0.0738) (0.0738) (0.1558) (0.8865)
1990-2018 0.0036 0.0317 0.0317 0.0036 0.0517(0.1616) (0.2073) (0.2079) (0.3950) (0.1253)
Note: See the footnote for Table 12.