selection of the optimal length of rolling window in time

Selection of the Optimal Length of Rolling Window inTime-varying Predictive Regression

Yongmiao Hong1, Yuying Sun2,3, Shouyang Wang2,3

1 Department of Economics and Department of Statistical Sciences, Cornell University

2 Academy of Mathematics and Systems Science, Chinese Academy of Sciences

3 Center for Forecasting Science, Chinese Academy of Sciences

Yongmiao Hong1 , Yuying Sun2,3 , Shouyang Wang2,3(AMSS)Selection of the Optimal Rolling Window Length in Time-varying Predictive Regression

June 10, 2017

Overview

1 Motivation

2 Literature Review

3 Framework and Approach

4 Selection of Rolling Window Length

5 Simulation

6 Empirical Applications

7 Conclusion


Motivation

Out-of-sample forecast is important in economics and finance, e.g.,Diebold (1998), Stock and Watson (2003, 2007), Canova (2007), D ′Agostinoand Surico (2009, 2012), Rossi and Inoue (2012), Gonzalez-Rivera (2013).

Why is out-of-sample forecast popular?

Reduce the probability of model overfitting (e.g., Ashley et al. (1980)).In-sample predictive ability , out-of-sample forecasting ability (e.g.,Meese and Rogoff (1983a), Swanson and White (1995)).Forecasting performance by modelling in one period seems to beunrelated to forecasting performance in another period (e.g., Stock andWatson (2003, 2007)).Possibly due to structural changes.


Motivation (Cont’d)

For out-of-sample forecasts, model parameters are usually estimatedusing either rolling or recursive schemes.

Consider a predictive linear regression model

Yt+h = α+ βXt + εt+h, t = 1, · · · , T.

The predictor for h-step ahead is given by

Yt+h = α+ βXt,

where parameter estimators α, β are based on eitherRolling method (fixed sample of size L){t = T − (L − h), · · · , T}→ YT+h.Recursive method (expanding sample): starts at the beginning andadditional observations become available.



Why use rolling estimation?Using updated information⇒ This is implicitly based on the assumption thatunderlying model parameter β = βt is time-varying.Example: U.S. GDP Growth Rate: a time varying AR(1) Model.Quarterly data (1947Q1 to 2014Q4), about 272 number of observations.Predictive model Yt+1 = α0 + α1Yt + εt+1, where Yt = 400ln(Qt/Qt−1), Qt isthe quarterly US GDP in level.

Fig: Estimation of α1 in Different Times

Question: When we forecast the U.S. GDP growth rate in 2015, how manyobservations should we use? Should we use the last 20 years of data or thelast 5 years of data?



For rolling estimation, different window lengths lead to differentforecast performances.

Why does this happen?

Reducing the sample causes the increasing variance and may alsoresult in large mean square forecast errors (e.g., Pesaran andTimmermann (2002)).

Using the earliest data irrelevant to the present data-generationprocess may improve the precision of the forecasts but at the cost ofhigher bias (e.g., Clark and McCracken (2009)).

The optimal window length is a trade-off to balance the bias andvariance.



However, practioners choose different window lengths without atheoretical guidance, e.g., empirical experience, determined byforecasters in an ad hoc manner.

Example 1: Exchange rates forecasting

Monthly data

Meese and Rogoff (1983a), set L=93;Qi and Wu (2003), set L=216;Molodtsova and Papell (2009), set L=120.

Quarterly data

Chinn (1991), set L=45;Gourinchas and Rey (2007), set L=104;Clark and West (2007), set L=67.



Example 2: U.S. GDP growth forecastingMonthly data

Stock and Watson (2012), set L=100;Kim and Swaton (2014), set L=144.

Quarterly dataBanerjee (2006) presented the best quarterly indicator in out-of-sampleforecasting is different from 13 evaluation periods;Clark and McCracken (2009), set L=40.

There is no theoretical guidance to choose the optimal window length inrolling estimation.

Concern: whether the satisfactory results reported were obtained simply bychance? Or by data snooping?



We will develop a method to choose the optimal rolling window length in atime-varying predictive regression.

Parameters, are specified as a smooth function of time with unknown form,which may be consistent with the evidence of instability in finance andeconomics.

Local linear smoothing with data reflection at boundary is used to estimatetime-varying parameters.

Three optimal rolling windows are derived by minimizing different forecastcriteria, including the unconditional, conditional and global mean squareforecast errors (MSFE).

A feasible “cross-validation” is proposed and justfied to be asymptoticallyequivalent to the method based on unconditional MSFE.

The optimal kernel is shown to be the Epanechinikov kernel and so theoptimal window selection based on ordinary least square (OLS) estimationis not fully efficient. The later gives equal weighting to each observation.


Literature Review

Parametric estimation (e.g., Chow (1984), Engle and Waston (1987),Harvey (1989)) relies on the setting of the model.

Designed for discrete breaks. (e.g., Pesaran and Timmermann(2000), Giraitis, Kapetanios and Price (2013)).

(3) Strong assumptions imposed

Strictly stationary regressors (Pesaran and Timmermann (2000,2007)). It rules out lagged dependent variables.Independent errors and exogenous regressors (Pesaran, Pick andPranovich (2011)).Without regressors (Giraitis, Kapetanios and Price (2013)).


Literature Review (Cont’d)

Inoue et al. (2017) proposed how to determine the optimal window lengthin a time-varying predictive regression model:

Yt+h = β(tT) ′Xt + ut+h.

Inoue et al. (2017) considered a rolling OLS estimator:

βL(1) =

(T−h∑

t=T−L+1

XtX′t

)−1( T−h∑t=T−L+1

XtYt+h

), (2.1)

and derived an optimal window length L by minimizing the conditionalMSFE, E((YT+h − YT+h)

2|IT).

The optimal window length is of order T23 and MSFE is O(T− 2

3 ).


Literature Review (Cont’d)

Drawbacks

The rolling OLS estimator has a slow bias (O(lT)), where lt = Lt/Tcausing a slow convergence rate for MSFE.

The equal weighting for every observation in the optimal window maynot be optimal. Recent information may have a larger impact todaythan the remote past information (Engle (1982)).


Framework and Approach

Consider a smooth time-varying predictive regression modelYt+h = X′tβ(

tT) + εt+h. (3.1)

(1) Yt+h is a dependent variable; Xt is a d × 1 vector of locally stationaryexplanatory variables.(2) β : [0, 1]→ Rd is an unknown smooth function of normalized time t

Tinstead of time t. It guarantees that the amount of local informationincreases as the sample size T →∞.(3) E(εt+h|It) = 0, where It = {Xs, Ys}

ts=1.

(4) h is the forecast horizon.


Framework and Approach (Cont’d)

Our basic idea: Estimation of β(·) is similar to smoothed nonparametriccurve estimation.

By assuming that β(·) has a continuous second derivative, for time s “near”to time t, β(s/T) can be approximated by a linear function at any fixed timepoint t

T ∈ [0, 1] as follows:

βs = β( s

T

)' β

( tT

)+ β(1)

( tT

) s − tT

,

where ' denotes the first-order Taylor approximation, β(1)(

tT

)= β′

(tT

).

Hence, for all times s/T “near” the fixed point t/T,

Ys+h = X′sβ( s

T

)+ εs+h ' X′sβ

( tT

)+

s − tT

X′sβ(1)( t

T

)+ εs+h

= Z′stθt + εs+h,

where Zst =

(Xs

( s−tT )Xs

)and θt = θ

(tT

)=

(β(

tT

)β(1)

(tT

) ).



Boundary Bias Problem Due to Out-of-Sample Forecast

We will estimate θt by minimizing a “local” weighted sum of squared forecasterrors.

One-sided estimation in out-of-sample forecast: To estimate βt and forecastYt+h, we may only use past observations up to time t.

One-sided kernel is popular in nonparametric analysis, especially in theestimation of end and jump points, e.g, Zhang and Karunamuni (1998),Dennis (2007). However, it results in a slow bias (order O(lt)).

Alternatively, it is well-known that local linear smoothing can be used toobtain a higher order bias O(l2t ).

However, the scale factors for the bias and variance at the boundary regiondiffer from those of an interior point. e.g., Cai (2007), Chen and Hong (2012).



Data Reflection Method

We follow Chen and Hong (2012) to use “reflection about theboundaries” proposed by Cline and Hart (1991) to deal with boundaryproblem in nonparametric estimation.

More specifically, no symmetric data in the boundary regions offorecast time point t: [t − Lt, t].

Goal: to make the behavior of the local linear estimator at boundarypoints similar to that at interior points so as to improve finite sampleperformance.

Solution: reflection method. (Hall and Wehrly, 1991).



What is Reflection Method?

Step 1: Obtain augmented data in the boundary regions:

Ys = Y2t−s, Xs = X2t−s, t + 1 6 s 6 t + [Lt].

Step 2: Use the union of the original data and the augmented data toestimate βt, where 1 6 t 6 T.

Reflection is equivalent to using a boundary kernel[k(

t+sTlt

)+ k

(t−sTlt

)]/2 in the boundary regions.



With data reflection, our locally weighted sum of squares is given by

t+[Lt]∑s=t−[Lt]

(Ys+h − X ′sβ

(0)( t

T

)− X ′sβ

(1)( t

T

) s − tT

)2

kst

=

t+[Lt]∑s=t−[Lt]

(Ys+h − Z ′stθt

)2 kst,

where θt =(β( t

T

)′ ,β(1)( t

T

))′is a 2d × 1 vector, kst = k

(s−tLt

)and

Lt is the window length and Lt →∞ as T →∞.



Examples of kernelUniform kernel

k(u) =12

1(|u| 6 1),

Epanechniov kernel

k(u) =34(1 − u2)1(|u| 6 1),

Quartic kernel

k(u) =1516

(1 − u2)21(|u| 6 1).

By solving the optimization problem, we obtain the estimator

θt =

t+[Lt]∑s=t−[Lt]

kstZstZ ′st

−1 t+[Lt]∑s=t−[Lt]

kstZstYs+h.



Put e1 = (1, 0)′. Then the local linear estimator for βt is given by

βt = (e ′1 ⊗ Id) θt =

t+[Lt]∑s=t−[Lt]

kstXsX ′s

−1t+[Lt]∑

s=t−[Lt]

kstXsYs+h.

This is a rolling WLS estimator based on the argumented data.

Our local linear estimator is more general than the OLS estimator and theapproach in Inoue et al. (2014).

The asymptotic bias is proportional to l2t∫1−1 k(u)u2du instead of O(lt).

General kernel is used instead of uniform kernel.


Selection of Rolling Window Length

Question: How to choose an optimal window length Lt?

Assumptions

Assumption 1: (i) {Xt, εt} is a locally stationary α−mixing process withmixing coefficients {α(j)} satisfying Σ∞j=1j2α(j)

δ1+δ < C for some

0 < δ < 1, (ii) Σ∞j>1α(j) < C.

Remark: We allow Xt to contain both exogenous and laggeddependent variables. Weaker than assumption in Robinson (1989),Cai (2000, 2005, 2007), Cheng et al. (2015).

Assumption 2: (i) {εt} is a sequence such that E(εt+h|It) = 0,E(ε2

t ) = σ2t = σ2( t

T ) 6 C, where It−1 = {X ′t , X′t−1, . . . , εt−1, εt−2, . . . }.(ii) E(ε2

t+h|It) = σ2It,t = σ

2( tT |It), {σ2

It,t}Tt=1 satisfies 0 < C−1σ2

It,t < 1uniformly in t for some constant C such that 0 < C <∞.

Remark: When h = 1, {εt+h} is a martingale difference sequence.


Selection of Rolling Window Length (Cont’d)

Assumptions

Assumption 3: (i) The d × d matrix Mt = E(XtX ′t ) is finite and positivedefinite; (ii) E(X8

ti) < C <∞ for i = 1, · · · , d and for all t and someconstant C.

Assumption 4: The kernel function k : [−1, 1]→ R+ is a symmetricbounded probability density function. (Chen, Hong (2012))

Assumption 5: For each t, βt = β( t

T

), where β(·) is a twice

continuously differentiable function and β(2)(·) (the second derivativeof β(·)) is continuous in [0, 1].



To choose the optimal window length Lt by minimizing a suitablemean square forecast errors (MSFE), used as forecast criterion.

βt − βt = (

t+[Tlt]∑s=1,s,t

kstXsX ′s )−

t+[Tlt]∑s=1,s,t

kstXsYs+h − βt

= (

t+[Tlt]∑s=1,s,t

kstXsX ′s )−

t+[Tlt]∑s=1,s,t

kstXsX ′s (βs − βt) + (

t+[Tlt]∑s=1,s,t

kstXsX ′s )−

t+[Tlt]∑s=1,s,t

kstXsεs+h

= At + Bt.

The first term determines the bias and the second term determines thevariance.



Criterion 1: Unconditional MSFE

Our interest: which factors determine the optimal length Lt onaverage, i.e., across all possible realizations of Xt at time t, e.g.,Pesaran and Timmermann (2007).

Unconditional mean squared forecast errors (UMSFE):

UMSFE = E(Yt+h − X ′t βt)2

= σ2t+h + E(βt − βt)

′XtX ′t (βt − βt) − 2E(E(εt+h|It)X ′t (βt − βt))

= σ2t+h + E(βt − βt)

′XtX ′t (βt − βt).



Theorem 1. Under Assumptions 1(i), 2(i) and 3-5, the optimal lengthto minimize unconditional MSFE is

Loptt,UMSFE =

{dσ2

t∫1−1 k(u)2du

β(2) ′t Mtβ

(2)t (∫1−1 k(u)u2du)2

} 15

T45 ,

where d = dim(Xt) and Mt = EXtX ′t .

When Xt is strictly stationary,

Loptt,UMSFE =

{dσ2

t∫1−1 k(u)2du

β(2) ′t M−1β

(2)t (∫1−1 k(u)u2du)2

} 15

T45 ,

where M = EXtX ′t .



Criterion II: Conditional MSFE

Our interest: most forecasts at time t are conditional on the dataavailable at time t, e.g., Pesaran and Timmermann (2007), Inoue etal. (2014).

Conditional mean squared forecast errors (CMSFE):

CMSFE = E((Yt+h − X ′t βt)2|It)

= E(ε2t+h|It) + 2E((βt − βt)

′Xtεt+h|It) + (βt − βt)′XtX ′t (βt − βt),

where It = {X1, Y1, X2, Y2, · · · , Xt, Yt}.



Theorem 2. Under Assumptions 1-5, the optimal length to minimizeconditional MSFE is

Loptt,CMSFE =

{dσ2

It,t∫1−1 k(u)2du

β(2) ′t XtX ′t β

(2)t (∫1−1 k(u)u2du)2

} 15

T45 ,

where d = dim(Xt) and Mt = EXtX′t .

When Xt is strictly stationary,

Loptt,CMSFE =

{σ2

It,tEXtM−1Xt∫1−1 k(u)2du

β(2) ′t XtX ′t β

(2)t (∫1−1 k(u)u2du)2

} 15

T45 ,

where M = EXtX′t .



Comparison with Inoue et al. (2017):

Our window length selection Lt ∝ T45 , while Inoue et al. (2017)

proposed Lt ∝ T23 .

Our conditional MSFE is O(T− 45 ), faster than the convergence rate of

O(T− 23 ) of Inoue et al. (2017).

We have a general nonuniform weighting k( s−tT ), while Inoue et al.

(2017) used a uniform weighting. Local WLS is better than local OLS.

We derive the optimal kernel to be the Epanchnikov kernelk(u) = 3

4a(1 − u2

a2 )+ for any a > 0, where x+ is the positive part of x,taking value x when x > 0 and 0 otherwise.



Criterion III: Global MSFE

Our interest: the optimal window length affected the whole sample.One reason is that the coefficient function βt may not have a niceshape and depends on the curvature at all time points in period.

Global mean squared forecast errors (GMSFE):

GMSFE = ΣT−ht=1 (Yt+h − X ′t βt)

2

=

T−h∑t=1

ε2t+h + 2

T−h∑t−1

X ′t (βt − βt)εt+h +

T−h∑t−1

(βt − βt)′XtX ′t (βt − βt).

The optimal rolling window length grows at a rate of sample size T,but does not depend on a specific t.



Theorem 3. Under Assumptions 1-5, the optimal window length tominimize global MSFE is

Lopt =

{d∑T−h

t=1 σ2It,t∫1−1 k(u)2du∑T−h

t=1 β(2) ′t XtX ′t β

(2)t (∫1−1 k(u)u2du)2

} 15

T45 ,

where d = dim(Xt) and Mt = EXtX′t . When Xt is strictly stationary, theexpression of Lopt remains unchanged.



Cross-validation

All three optimal methods to select the window lengths are infeasible inpractice, because they all involve the second derivatives of unknownparameter β(·) among other things.

One could use a “plug-in” method. However, the window length chosen byplug-in may be highly sensitive to the choice of the pilot window length(Leung (2005)).

We propose a direct method without extra complications with pilot estimatorsand without the extra consistent second derivative assumption required toguarantee the pilot estimator work well.

Cross-validation (CV) is a popular data-driven method, e.g., Fryzlewicz,Sapatinas, and Subba Rao (2008), Chen and Hong (2012).

The basic idea is to set one of the data points aside for validation of a modeland use the remaining data to build the model.



Cross-validation

To propose a feasible method by minimizing the weighted averageout-of-sample loss over the cross-validation sample.

Define a “leave-one-out” estimator β−t = (e′1 ⊗ Id)θ−t, where

θ−t = (Σt+[Lt]s=t−[Lt],s,tkstZstZst)

−1Σt+[Lt]s=t−[Lt],s,tkstZstY ′s+h.

Then a data-driven choice of Lt is

Lt,CV = argminc1T4/56Lt6c2T4/5CV(Lt),

where CV(Lt) = ΣTs=1,s,t(Ys+h − X′sβ−s)

2kst/Lt, and c1 and c2 are twoprespecified constants.

Theorem 4. Under Assumptions 1-5, the estimated window length bycross-validation is asymptotically equivalent to the optimal length tominimize the unconditional MSFE, i.e., Lt,CV/Lt,UMSFE

p→ 1.



Optimal Kernel

The mean square forecast errors (MSFE) depend on kernel given inTheorems 1 to 4.

Theorem 5. Under Assumptions 1-5, the nonnegative probabilitydensity function K that minimizes mean square forecast errors is arescaling of the Epanechnikov kernel:

kopt(u) =34a

(1 −u2

a2 )+, for any a > 0.

Epanechnikov kernel is the optimal kernel for kernel densityestimation (Epanechnikov (1969)) and in the robust regression(Lehmann (1983)).

Such a kernel is also used in the Fourier transform by Bochner (1936)and in spectral density estimation by Parzen (1961).



Optimal KernelThe use of data reflection is important!Example: Under UMSFE

With Reflection

Var = T−1l−1t dσ2(

tT)

∫ 1

−1k(u)2du,

Bias =l2t4(

∫ 1

−1k(u)u2du)2β

(2)′t EXtX ′t β

(2)t .

Without Reflection

Var = T−1l−1t dσ2(

tT)µ2

2cv0c − 2µ1cµ2cv1c + µ21cv2c

(µ0cµ2c − µ21c)

2 ,

Bias =l4t4

(µ2

2c − µ1cµ3c

µ0cµ2c − µ21c

)2

β(2)′t EXtX ′t β

(2)t .

where µic =∫c−1 uik(u)du, vic =

∫c−1 uik2(u)du, c→ 0− and

d = dim(Xt).It is possible that without reflection method the optimal kernel isdifferent from Epanechnikov kernel and unknown.Yongmiao Hong1 , Yuying Sun2,3 , Shouyang Wang2,3(AMSS)Selection of the Optimal Rolling Window Length in Time-varying Predictive Regression

Simulation

Simulation Study

Purpose:

Compare our methods with fixed rolling window estimation, recursiveestimation and Inoue et al. (2014) with R = max(2T3/5, 0.2T) andR = min(2T4/5, 0.8T).Compare different results based on different weights (kernel function).Investigate the performance of forecasts based on cross-validation.

DGP 1–Time varying coefficient

Yt+1 = βt(1 + 0.5Xt) + ut+1, t = 1, · · · , T,

where βt = 1.5 − 1.5exp(−3t/T − 0.5)2.

Two cases for {Xt}: (i) Xt ∼ N(0, 1); (ii) Xt = 0.5Xt−1 + νt, νt ∼ i.i.d.N(0, 1).

Three cases for {ut}: (i) ut ∼ i.i.d.N(0, 1);(ii) ut =

√htεt, ht = 0.2 + 0.5ε2

t−1, εt ∼ i.i.d.N(0, 1);(iii) ut =

√htε, ht = 0.2 + 0.5X2

t , ut ∼ i.i.d.N(0, 1).


Simulation (Cont’d)

Simulation Study

The simulation is repeated 1000 times with the sample sizeTall = 150, 300, 550.

The sample is divided into two subsets: the first sub-sample T equals to100, 250, 500, respectively to estimate, the second sub-sample Tf is the last50 data to forecast.

Rolling forecasts with different fixed window lengths, with L = 20, 40, 60,respectively.

We consider four kernel functions including uniform kernel (U), trianglekernel (T), epanechnikov kernel (E), quartic kernel(Q).

DGP1(1, 3): Xt ∼ i.i.d.N(0, 1), ut =√

htεt, ht = 0.2 + 0.5X2t and

εt ∼ i.i.d.N(0, 1).



Forecast Values βt and Yt with 100 Sample in DGP1(1, 3)



Forecast Values βt and Yt with 250 Sample in DGP1(1, 3)Yongmiao Hong1 , Yuying Sun2,3 , Shouyang Wang2,3(AMSS)Selection of the Optimal Rolling Window Length in Time-varying Predictive Regression


Forecast Values βt and Yt with 500 Sample in DGP1(1, 3)Yongmiao Hong1 , Yuying Sun2,3 , Shouyang Wang2,3(AMSS)Selection of the Optimal Rolling Window Length in Time-varying Predictive Regression

Table: Simulation results of DGP1(III)

Uniform Triangle Epanechnikov QuarticBias MSFE Bias MSFE Bias MSFE Bias MSFE

100

Recursive 0.3523 1.3484 0.1990 1.2451 0.2391 1.2669 0.1680 1.2362UMSFE -0.0171 1.1748 -0.0199 1.1760 -0.0211 1.1783 -0.0188 1.1776CMSFE -0.0211 1.1741 -0.0216 1.1818 -0.0224 1.1844 -0.0205 1.1873GMSFE -0.0158 1.1503 -0.0165 1.1621 -0.0166 1.1570 -0.0144 1.1634

L=20 -0.0125 1.5052 -0.0138 1.5401 -0.0142 1.5305 -0.0142 1.5662L=40 0.0052 1.2686 -0.0067 1.3110 -0.0048 1.2962 -0.0076 1.3309L=60 0.0811 1.2396 0.0289 1.2380 0.0383 1.2312 0.0195 1.2449CV -0.0019 1.1370 -0.0153 1.1370 -0.0112 1.1326 -0.0154 1.1377

Inoue et al. -0.1117 1.6391 -0.1117 1.6391 -0.1117 1.6391 -0.1117 1.6391

250


L=20 0.0349 1.5878 0.0384 1.6709 0.0377 1.6500 0.0388 1.6993L=40 0.0259 1.3826 0.0289 1.4317 0.0276 1.4142 0.0288 1.4464L=60 0.0193 1.3326 0.0249 1.3510 0.0235 1.3405 0.0256 1.3581CV -0.0301 1.1907 -0.0105 1.1902 -0.0104 1.1928 -0.0044 1.1941

Inoue et al. -0.1562 1.8217 -0.1562 1.8217 -0.1562 1.8217 -0.1562 1.8217

500

Recursive 0.3895 1.3988 0.2459 1.2860 0.2856 1.3144 0.2158 1.2699UMSFE 0.0227 1.2059 0.0333 1.2071 0.0320 1.2072 0.0351 1.2145CMSFE 0.0223 1.2081 0.0332 1.2095 0.0325 1.2098 0.0352 1.2118GMSFE 0.0301 1.1978 0.0337 1.1929 0.0330 1.1914 0.0344 1.1944

L=20 0.0493 1.6107 0.0473 1.6725 0.0474 1.6572 0.0468 1.6973L=40 0.0426 1.3988 0.0451 1.4513 0.0445 1.4352 0.0455 1.4714L=60 0.0411 1.3310 0.0417 1.3650 0.0411 1.3531 0.0419 1.3755CV -0.0179 1.1958 0.0083 1.1954 0.0021 1.1945 0.0179 1.1957

Inoue et al. -0.0929 1.8646 -0.0929 1.8646 -0.0929 1.8646 -0.0929 1.8646

Note: DGP1 (1,3): Xt ∼ i.i.d.N(0, 1) and ut =√

htεt , ht = 0.2 + 0.5X2t , εt ∼ i.i.d.N(0, 1). Bias is calculated as 1

Tf

∑Tft=1(Yt − Yt),

and MSFE is calculated as 1Tf

∑Tft=1(Yt − Yt)

2 .



Other DGPs:

DGP 2: Univariate model

Yt+1 = βt + ut+1, t = 1, · · · , T,

where ut+1 and βt are considered as three cases: (I) βt = t/T and ut ∼ N(0, 1/12); (II)βt = (t/T)2 and ut ∼ N(0, 9/100); (III) βt = βt−1 +

√2/Tεt, εt ∼ N(0, 1) and ut ∼ N(0, 1).

DGP 3: Bivariate model[Yt+1Xt+1

]=

[(β1t β2t0 0.9

)][YtXt

]+

[uy,t+1ux,t+1

], t = 1, · · · , T,

where the error terms {(uy,t+1, ux,t+1)′} are generated from independent and identical normal

distributions. And we consider four cases for (β1t,β2t)′:

(I) β1t = 0.9I(t 6 0.75T) + 0.5I(t > 0.75T) and β2t = 1;(II) β2t = I(t 6 0.75T) + 2I(t > 0.75T) and β1t = 1;(III) β1t = 0.9 − 0.4(t/T) and β2t = 1;(IV) β1t = 0.9 and β2t = 1 + (t/T)2.


Table: Simulation results of DGP3(I)


100

Recursive -0.0947 1.3620 -0.0174 1.2448 -0.0396 1.2769 0.0069 1.2433UMSFE -0.0529 1.1639 -0.0337 1.1841 -0.0413 1.1911 -0.0213 1.1881CMSFE -0.0832 1.3061 -0.0411 1.2543 -0.0359 1.2589 -0.0224 1.2066GMSFE -0.0399 1.2124 -0.0300 1.1741 -0.0266 1.1735 -0.0128 1.1472

L=20 -0.0500 1.6288 -0.0585 1.8031 -0.0581 1.7529 -0.0521 1.8606L=40 0.1391 1.4065 0.0720 1.3351 0.0918 1.3490 0.0537 1.3522L=60 0.0428 1.3794 0.1102 1.3161 0.1150 1.3491 0.1329 1.3341CV -0.0321 1.2211 -0.0117 1.1663 -0.0121 1.1711 0.0004 1.1452

Inoue et al. 0.0962 1.4424 0.0962 1.4424 0.0962 1.4424 0.0962 1.4424

250

Recursive 0.0160 1.5379 0.0532 1.2768 0.0474 1.3510 0.0512 1.2560UMSFE 0.0249 1.0751 0.0353 0.9929 0.0383 1.0040 0.0206 0.9746CMSFE 0.0361 1.1646 0.0074 1.1886 0.0114 1.1496 -0.0247 1.1883GMSFE 0.0176 0.9782 0.0303 0.9777 0.0248 0.9769 0.0255 0.9792

L=20 0.1046 1.6823 0.1254 1.6950 0.1184 1.6894 0.1414 1.7211L=40 0.0546 1.1646 0.0636 1.2044 0.0597 1.1859 0.0634 1.2410L=60 0.0275 1.0515 0.0482 1.0832 0.0428 1.0719 0.0513 1.0996CV 0.0235 0.9745 0.0256 0.9602 0.0224 0.9592 0.0238 0.9616

Inoue et al. 0.0121 1.5794 0.0121 1.5794 0.0121 1.5794 0.0121 1.5794

500

Recursive -0.0387 1.6462 -0.0065 1.3866 -0.0154 1.4721 0.0022 1.3446UMSFE 0.0281 1.1824 -0.0191 1.0840 -0.0111 1.1039 -0.0008 1.1235CMSFE -0.0469 1.2781 -0.0460 1.2392 -0.0507 1.2301 -0.0298 1.2235GMSFE 0.0066 1.0709 0.0036 1.0701 0.0031 1.0707 0.0019 1.0734

L=20 0.0263 1.6129 0.0021 1.7812 0.0076 1.7374 -0.0054 1.8520L=40 0.0344 1.3409 0.0439 1.3148 0.0414 1.3188 0.0447 1.3353L=60 -0.0145 1.2476 0.0148 1.2589 0.0065 1.2656 0.0212 1.2738CV 0.0069 1.0275 0.0044 1.0399 0.0045 1.0297 0.0053 1.0471

Inoue et al. 0.0110 1.3908 0.0110 1.3908 0.0110 1.3908 0.0110 1.3908


Table: Simulation results of DGP3(III)


100

Recursive -0.0185 1.0760 -0.0193 1.0860 -0.0206 1.0789 -0.0206 1.0879UMSFE -0.0600 1.1757 -0.0404 1.1743 -0.0391 1.1706 -0.0363 1.1798CMSFE -0.0595 1.2063 -0.0458 1.1952 -0.0401 1.2089 -0.0420 1.2011GMSFE -0.0264 1.1330 -0.0391 1.1210 -0.0341 1.1231 -0.0341 1.1300

L=20 -0.0392 1.6055 -0.0532 1.7832 -0.0518 1.7316 -0.0484 1.8424L=40 -0.0070 1.3093 0.0015 1.3456 0.0035 1.3435 0.0019 1.3765L=60 -0.0238 1.1544 -0.0140 1.2201 -0.0171 1.2084 -0.0056 1.2468CV -0.0644 1.1458 -0.0675 1.1034 -0.0708 1.1020 -0.0740 1.1141

Inoue et al. 0.0474 1.4599 0.0474 1.4599 0.0474 1.4599 0.0474 1.4599

250

Recursive 0.0195 0.9630 0.0267 0.9721 0.0252 0.9701 0.0269 0.9754UMSFE 0.0268 1.0023 0.0108 0.9970 0.0111 0.9913 0.0091 0.9934CMSFE 0.0145 1.0210 0.0102 1.0173 0.0124 1.0037 0.0171 1.0093GMSFE 0.0133 1.0431 0.0178 1.0122 0.0152 1.0193 0.0088 1.0039

L=20 0.0992 1.6763 0.1211 1.6783 0.1136 1.6740 0.1366 1.6984L=40 0.0522 1.1586 0.0606 1.1983 0.0566 1.1803 0.0601 1.2344L=60 0.0270 1.0449 0.0465 1.0747 0.0414 1.0639 0.0492 1.0911CV 0.0258 1.0561 0.0202 1.0056 0.0131 1.0209 0.0170 1.0011

Inoue et al. -0.0139 1.4652 -0.0139 1.4652 -0.0139 1.4652 -0.0139 1.4652

500


L=20 0.0159 1.6222 -0.0121 1.7866 -0.0055 1.7429 -0.0210 1.8574L=40 0.0331 1.3416 0.0377 1.3195 0.0365 1.3230 0.0380 1.3383L=60 -0.0134 1.2509 0.0134 1.2624 0.0061 1.2692 0.0197 1.2771CV -0.0127 1.0926 -0.0157 1.0955 -0.0129 1.0931 -0.0159 1.0990

Inoue et al. -0.0115 1.3393 -0.0115 1.3393 -0.0115 1.3393 -0.0115 1.3393



Simulation Summary

The uniform kernel performs worse than other three kernels, especially inthe small sample size.

The trends of optimal window length from Triangle and Quartic kernels,except Uniform kernel, are similar to that from Epanechnikov kernel.

The window length produced by cross-validation yields the best forecasts formodels with smoothly time-varying parameters as well as for models withmultiple discrete breaks.

Under conditional heteroscedastic errors and i.i.d errors, for cross-validationperforms remarkably well.

The UMSFE plug-in almost produces the second smallest MSFE, worsethan cross-validation.

The improvements of UMSFE plug-in over other existing methods tend to beremarkable for larger T and close to cross-validation, confirming theconsistency of Theorem 4.



Simulation ResultsThe estimators of βt from CMSFE plug-in always fluctuate more dramaticallythan the true values and the estimators from UMSFE.

The window length based on Global MSFE plug-in yields the fourth smallestMSFE. This happens because the parameters estimated by GMSFE aremore stable than those based on CMSFE in some cases and little affectedby some abnormal values.

The recursive method performs worse than other methods, including moredata always increase the bias.

The fixed rolling window length hardly perform similarly in different DGPs ordifferent sample sizes.


Empirical Applications

Forecasting GDP Growth and Inflation

Significant in-sample predictability is no guarantee of successfulout-of-sample predictability.

Instability in predictive relationships, e.g., housing price performedwell in 1971-1984 to forecast inflation, not in 1985-1999 (Stock andWatson (2003)).

Some works test and prove that the model’s relative performance haschanged over time (e.g., Clark and McCracken (2001), Rossi,Sekhposyan (2010)).

To check whether we can improve forecasts of GDP and inflationusing the window length selected by our approach.


Empirical Applications (Cont’d)

Focus on the multi-step pseudo out-of-sample forecastingperformance

Exogenous variables:

Asset Prices: federal funds rate (fedfunds); 3-month treasury bill(tb3ms), 10-year treasury bill (t10yr), Term spread (termspread), S&P500 price index (sp500);Real Economic Activity: real gross private domestic investments(Rgpdi), employment (emp), New private housing units authorized bybuilding permits (buildpermit);Price Indices: Producer price index (ppi).

Tall: 1967Q1-2015Q4; Tf : 1984Q1-2015Q4.



GDP growth rate predictive model

Yt+h = µt + αt(B)Xt + βt(B)Yt + εt+h,where

the dependent variable is Yt+h = (400/h)ln(Qt+h/Qt), Xt denotes theexogenous variables, Yt = 400ln(Qt/Qt−1), and Qt is the quarterly real GDPin level.

αt(B) denotes the lag polynomial. αt(B)Xt = α1tXt + · · ·+ αptXt−p+1.

Inflation predictive model

πt+h − πt = µt + αt(L)Xt + βt(L)∆πt + εt+h,

where πt = 400ln(Pt/Pt−1), ∆πt = πt − πt−1, πt+h = h−1∑hi=1 πt+i. Pt is the

quarterly GDP deflator in level, and Xt is the set of exogenous variables.

AR Benchmark models:

Yt+h = µt + βt(B)Yt + ξt+h, t = 1, 2, · · · , T,

πt+h = µt + βt(B)πt + ξt+h. t = 1, 2, · · · , T.



Forecast Evaluation:

rMSFE =

∑Tt=1(Yt − Yt)

2∑Tt=1(Yf ,t − Yt)2

.

rMSFE indicates the ratios of the MSFE produced by the optimal windowsize relative to those produced in rolling fixed 40 window length in the samemodel, the same window size used by Stock and Watson (2001) and Inoueet al. (2017).

Eight window length approaches are: ‘RECURSIVE’: recursive method;‘UMSFE’: unconditional MSFE; ‘CMSFE’: conditional MSFE; ‘GMSFE’:global MSFE; ‘CV’: cross-validation; ‘Inoue et al.’: the feasible method inInoue et al. (2017) with R = max(1.5T2/3, 20) and R = min(6T2/3, T − h);‘L = 25’ and ‘L = 40’: two fixed rolling window lengths.


Table: Forecast Criteria of GDP Growth Rate with Different Window Lengths from 1984Q1 to 2015Q4

h=1 h=2 h=4 h=8

AR

Recursive 0.9786 0.9807 0.9606 0.9669UMSFE 0.9034 0.8884 0.8513 0.8803CMSFE 0.9042 0.8864 0.8561 0.8852GMSFE 0.8711 0.8435 0.7864 0.8430

L=25 1.1173 1.0807 1.0157 0.9166CV 0.8759 0.8290 0.7606 0.7923

Inoue et al. 0.8597 0.8213 0.7784 0.8172

fedfunds


L=25 1.4678 1.2023 0.8060 0.6435L=40 1.1845 1.0603 0.8240 0.7421CV 0.9006 0.8375 0.7568 0.7928

Inoue et al. 0.9387 0.8914 0.8150 0.8473

tb3ms


L=25 1.4064 1.1593 0.8039 0.6274L=40 1.1962 1.1180 0.8599 0.7431CV 0.9319 0.8652 0.7758 0.8027

Inoue et al. 0.9474 0.9019 0.8264 0.8432

sp500


L=25 1.4523 1.3890 1.2148 1.0617L=40 1.2284 1.2708 1.2117 1.1868CV 0.9122 0.8623 0.7714 0.8120

Inoue et al. 0.9370 0.9289 0.8436 0.8588


h=1 h=2 h=4 h=8

rgpdi


L=25 1.5287 1.5054 1.3311 1.1364L=40 1.2326 1.2329 1.1698 1.1757CV 0.9257 0.8781 0.8226 0.8138

Inoue et al. 0.9123 0.8712 0.8104 0.8550

emp


L=25 1.5961 1.5720 1.3257 1.0487L=40 1.2906 1.3396 1.2583 1.2187CV 0.9033 0.8626 0.7770 0.7980

Inoue et al. 0.9170 0.8706 0.8199 0.8394

buidpermit


L=25 1.3909 1.2819 1.1380 1.0071L=40 1.1644 1.1860 1.0700 0.9869CV 0.8731 0.8635 0.7565 0.7393

Inoue et al. 0.8768 0.8687 0.7812 0.7848

ppi


L=25 1.5498 1.3197 1.0309 0.8881L=40 1.2311 1.1556 0.9785 1.0291CV 0.9065 0.8468 0.7398 0.7127

Inoue et al. 0.9460 0.9101 0.7634 0.7846


Table: Forecast Criteria of Inflation with Different Window Lengths from 1984Q1 to 2015Q4

h=1 h=2 h=4 h=8

AR


L=25 1.1535 1.1150 1.1284 1.1025CV 0.8288 0.8386 0.7978 0.7507

Inoue et al. 0.8522 0.8482 0.8156 0.7732

fedfunds


L=25 1.4617 1.4251 1.4062 1.4462L=40 1.1950 1.2084 1.2183 1.2340CV 0.8415 0.8448 0.8080 0.7894

Inoue et al. 0.8637 0.8696 0.8552 0.8259

tb3ms


L=25 1.3574 1.3589 1.3865 1.4482L=40 1.0828 1.1253 1.1618 1.2240CV 0.8427 0.8316 0.7925 0.7845

Inoue et al. 0.8757 0.8584 0.8363 0.8195

sp500


L=25 1.4206 1.3990 1.4157 1.3969L=40 1.2129 1.1735 1.1443 1.1831CV 0.8728 0.8620 0.8024 0.7712

Inoue et al. 0.8923 0.8785 0.8413 0.8264


h=1 h=2 h=4 h=8

rgpi


L=25 1.3877 1.4451 1.4555 1.3995L=40 1.1175 1.1343 1.1458 1.1926CV 0.8437 0.8405 0.8056 0.7781

Inoue et al. 0.8612 0.8433 0.8213 0.8000

emp


L=25 1.3307 1.3335 1.3526 1.3690L=40 0.9885 1.0116 1.0749 1.1449CV 0.7648 0.7607 0.7667 0.7557

Inoue et al. 0.7651 0.7694 0.7802 0.7685

buildpermit


L=25 1.4217 1.3884 1.4016 1.3442L=40 1.1712 1.1828 1.1904 1.1932CV 0.8727 0.8606 0.8394 0.7665

Inoue et al. 0.8761 0.8769 0.8399 0.7919

ppi


L=25 1.5369 1.4565 1.4282 1.3168L=40 1.2689 1.2082 1.2561 1.1988CV 0.8938 0.8672 0.8569 0.8058

Inoue et al. 0.8977 0.8685 0.8439 0.8634


Conclusion

A general approach choosing the optimal rolling window length in atime-varying predictive linear regression model.

We allow regressors to change smoothly over time (Xt can be locallystationary). Thus, autoregression models are included.

Compared to the OLS rolling estimation, our ”reflection about theboundaries” method increases the order of the bias, thus enhancingthe rate of convergence of MSFE. And the order of the length over Tis −1

5 , which outperforms −13 .

Forecasts using the window length selected from Uniform kernelperform worse than other kernels. It implies that the optimal lengthselected from OLS method is not fully efficient.

We propose and justify a feasible cross validation to choose theoptimal length, which is asymptotically equivalent to the infeasiblecounterpart denied from the local unconditional MSFE criterion.


Thanks!


selection of the optimal length of rolling window in time

Documents