
ESE 302 Tony E. Smith


NOTES ON THE AUTOCORRELATION PROBLEM

The following notes illustrate the problem of temporally autocorrelated regression residuals, which may arise when using time-series data and which represents the most common violation of the independent-residual assumption in regression modeling. Here the Durbin-Watson statistic is shown to provide a diagnostic tool for identifying temporal autocorrelation, and the method of two-stage least squares is shown to be one possible method for removing this effect.

This development will utilize the sales forecasting data set, Sales.jmp, in the class directory. The question of interest for this particular data set is whether per capita income levels can be used to predict retail sales. To answer this question, data on annual retail sales, sales, and annual per capita income, pci, in the US have been collected for a period of 15 years. A (Fit Y by X) regression of sales on pci yields the results shown in Figure 1 below. The parameter estimates are seen to be very significant, and the R-squared value is quite impressive. But observe that the residuals appear to exhibit a "cyclical" pattern about the regression line.

Figure 1. Initial Regression of Sales on PCI
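For readers who prefer to work outside JMP, the same simple regression can be reproduced with a few lines of Python. The following is a minimal sketch using statsmodels; it assumes the Sales.jmp table has been exported to a file named sales.csv with columns sales and pci (the file name and column names are illustrative assumptions, not part of the original notes).

```python
# Minimal sketch: reproduce the Fit Y by X regression of sales on pci.
# Assumes Sales.jmp has been exported to "sales.csv" with columns "sales" and "pci".
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("sales.csv")          # 15 annual observations (assumed export)
X = sm.add_constant(data["pci"])         # intercept plus per capita income
ols_fit = sm.OLS(data["sales"], X).fit()

print(ols_fit.summary())                 # parameter estimates, t-ratios, R-squared
residuals = ols_fit.resid                # save the residuals for the plots below
```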


This cyclical pattern suggests that the residuals are not independent, in the sense that neighbors of positive residuals tend to be positive and, similarly, neighbors of negative residuals tend to be negative. This cyclical dependency can be seen even more clearly by plotting the residuals of this regression (click the red triangle next to Linear Fit and select Plot Residuals). In Figure 2 below, only two of the four resulting plots are shown. The top plot, Residual by X, essentially flattens the regression line to a horizontal base line and plots the size of each regression deviation about this line. But since the rows of the data table are ordered by year, it is the second plot, Residual by Row, which actually shows that these residuals are exhibiting a cyclical pattern in time. The reason why this pattern also appears in the upper plot is that the explanatory variable, pci, happens to exhibit the same order, i.e., per capita incomes are uniformly increasing over this 15-year period. More generally, however, such patterns may not be apparent when simply plotting residuals against explanatory variables. This is why the Residual by Row plot is so useful for detecting temporal autocorrelation. (Notice also that this plot works equally well for multiple regressions, since there is exactly one residual for each time point, no matter how many explanatory variables are used.) But even when looking at such plots, the presence of temporally autocorrelated residuals may not always be this obvious, especially when key explanatory variables are missing. So it is desirable to develop statistical tests for identifying significant autocorrelation effects.

1. Durbin-Watson Statistic

The single most commonly used test is the Durbin-Watson test. This test is based on the simple observation that if residuals are autocorrelated, then neighboring residuals should tend to be more similar in value than arbitrary pairs of residuals.

(Figure 2 shows two panels: the Residual by X Plot, plotting sales residuals against pci, and the Residual by Row Plot, plotting sales residuals against row number.)
Figure 2. Residual Plots


This suggests that sums of squared differences between neighboring residuals should tend to be small relative to sums of squared residuals themselves. More formally, if for any given set of residuals, $(\varepsilon_t : t = 1,..,T)$, the ratio

(1)  $D = \dfrac{\sum_{t=2}^{T} (\varepsilon_t - \varepsilon_{t-1})^2}{\sum_{t=1}^{T} \varepsilon_t^2}$

is designated as the Durbin-Watson statistic, $D$, then values of $D$ for temporally autocorrelated residuals should tend to be small. This suggests that a test for such autocorrelation effects can be constructed in terms of this statistic. But before doing so, notice that these regression residuals are indexed by $t$ to denote different time periods. While other types of orderings may in some cases be relevant, we focus exclusively on time orderings. Observe also that the summation in the numerator of $D$ starts at $t = 2$, since period $t-1$ is not defined for $t = 1$. (More generally, for any sequence of $T$ values, there are only $T-1$ successive differences of these values.)

To construct a test of autocorrelation based on $D$, one begins (as always) by asking how $D$ would behave under the null hypothesis of "no autocorrelation". In other words, what would the distribution of $D$ look like if the residuals $(\varepsilon_t : t = 1,..,T)$ satisfied the standard regression assumptions that

(2)  $\varepsilon_t \sim_{iid} N(0, \sigma_\varepsilon^2) \;,\; t = 1,..,T$

While it is difficult to characterize this distribution explicitly, one can in fact calculate its mean explicitly as follows. Under hypothesis (2) it can be shown¹ that values of the ratio, $D$, are statistically independent of the values of its denominator, $\sum_{t=1}^{T} \varepsilon_t^2$, so that by (1),

(3)  $\sum_{t=2}^{T} (\varepsilon_t - \varepsilon_{t-1})^2 = D \cdot \sum_{t=1}^{T} \varepsilon_t^2 \;\Rightarrow\; E\Big[\sum_{t=2}^{T} (\varepsilon_t - \varepsilon_{t-1})^2\Big] = E\Big(D \cdot \sum_{t=1}^{T} \varepsilon_t^2\Big) = E(D)\, E\Big(\sum_{t=1}^{T} \varepsilon_t^2\Big)$

$\;\Rightarrow\; E(D) = \dfrac{E\big[\sum_{t=2}^{T} (\varepsilon_t - \varepsilon_{t-1})^2\big]}{E\big(\sum_{t=1}^{T} \varepsilon_t^2\big)}$

Thus it suffices to calculate the means of the numerator and denominator separately, as follows. Turning first to the denominator, and noting from (2) that

(4)  $\sigma_\varepsilon^2 = \operatorname{var}(\varepsilon_t) = E(\varepsilon_t^2) - [E(\varepsilon_t)]^2 = E(\varepsilon_t^2) \;,\; t = 1,..,T$

¹ This follows from the celebrated Koopmans-Pitman Theorem, which is best understood in terms of Koopmans' original proof (p.18) in Koopmans, T.C. (1942), "Serial Correlation and Quadratic Forms in Normal Variables", Annals of Mathematical Statistics, 13: 14-33.


we see that

(5)  $E\Big(\sum_{t=1}^{T} \varepsilon_t^2\Big) = \sum_{t=1}^{T} E(\varepsilon_t^2) = \sum_{t=1}^{T} \sigma_\varepsilon^2 = T\,\sigma_\varepsilon^2$

Similarly, by expanding the numerator and recalling from independence that

(6)  $0 = \operatorname{cov}(\varepsilon_t, \varepsilon_{t'}) = E(\varepsilon_t \varepsilon_{t'}) - E(\varepsilon_t)E(\varepsilon_{t'}) = E(\varepsilon_t \varepsilon_{t'})$

for all distinct time periods, $t$ and $t'$, we see that

(7)  $E\Big[\sum_{t=2}^{T} (\varepsilon_t - \varepsilon_{t-1})^2\Big] = \sum_{t=2}^{T} E[(\varepsilon_t - \varepsilon_{t-1})^2] = \sum_{t=2}^{T} E[\varepsilon_t^2 - 2\varepsilon_t \varepsilon_{t-1} + \varepsilon_{t-1}^2]$

$= \sum_{t=2}^{T} E(\varepsilon_t^2) - 2\sum_{t=2}^{T} E(\varepsilon_t \varepsilon_{t-1}) + \sum_{t=2}^{T} E(\varepsilon_{t-1}^2) = \sum_{t=2}^{T} \sigma_\varepsilon^2 - 2\sum_{t=2}^{T} (0) + \sum_{t=2}^{T} \sigma_\varepsilon^2 = 2(T-1)\,\sigma_\varepsilon^2$

Thus it follows from (3), (5) and (7) that

(8)  $E(D) = \dfrac{2(T-1)\,\sigma_\varepsilon^2}{T\,\sigma_\varepsilon^2} = \dfrac{2(T-1)}{T} = 2\Big(1 - \dfrac{1}{T}\Big)$
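For instance, with the $T = 15$ annual observations in the present data set, expression (8) gives

$E(D) = 2\Big(1 - \dfrac{1}{15}\Big) = \dfrac{28}{15} \approx 1.87$

which is already close to 2.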

So under hypothesis (2) we see that (for any reasonably sized $T$)

(9)  $E(D) \approx 2$

Moreover, for any positively autocorrelated residuals we have already seen that squared differences, $(\varepsilon_t - \varepsilon_{t-1})^2$, should tend to be small, so that each $E[(\varepsilon_t - \varepsilon_{t-1})^2]$ in the numerator of (1) should also be small. Thus for positively correlated residuals, $E(D)$ should lie between 0 and 2. Finally, if residuals are negatively correlated, so that neighbors tend to have opposite signs, the same argument suggests that the mean squared differences, $E[(\varepsilon_t - \varepsilon_{t-1})^2]$, should tend to be large, so that $E(D)$ should be larger than 2. While negative autocorrelation is of far less interest for our purposes, it is worth noting that in the extreme case where $\varepsilon_t \equiv -\varepsilon_{t-1}$ [so that $\varepsilon_t - \varepsilon_{t-1} \equiv 2\varepsilon_t$], we can actually approximate (1) as follows:


(10)  $E(D) = E\left[\dfrac{\sum_{t=2}^{T} (2\varepsilon_t)^2}{\sum_{t=1}^{T} \varepsilon_t^2}\right] = 4\, E\left[\dfrac{\sum_{t=2}^{T} \varepsilon_t^2}{\sum_{t=1}^{T} \varepsilon_t^2}\right] \approx 4\,(1) = 4$

So in summary, the mean behavior of $D$ can be neatly summarized as in Figure 3 below.

2. Durbin-Watson Test

Using this statistic, we now construct an explicit test for autocorrelation based on the null hypothesis in (2) above. To do so, we begin by estimating this statistic in the obvious way, namely by using the estimated residuals, $(\hat\varepsilon_t : t = 1,..,T)$, obtained from the regression in Figure 1 above. This yields the corresponding test statistic,

(11)  $d = \dfrac{\sum_{t=2}^{T} (\hat\varepsilon_t - \hat\varepsilon_{t-1})^2}{\sum_{t=1}^{T} \hat\varepsilon_t^2}$
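Once the residuals are saved, (11) amounts to a single line of arithmetic. A minimal Python sketch (the function name is illustrative; residuals is assumed to be the saved residual vector from the regression above):

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic (11): the sum of squared successive differences
    of the residuals divided by their sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# e.g., d = durbin_watson(residuals)   # residuals saved from the sales-on-pci fit
```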

While one can of course save these residuals and construct this statistic explicitly, it is not surprising that this construction is available in JMP. Here it is important to emphasize that while the Fit Y by X option was useful for plotting residuals in alternative ways, the Durbin-Watson test is only available in the Fit Model option, which of course allows simple as well as multiple regressions. So the first task here is to redo this regression using Fit Model. Having done so, the Durbin-Watson test can be accessed by clicking the red triangle next to Response Sales, and using the path Row Diagnostics > Durbin-Watson Test. The result will now appear at the bottom of the regression tableau, as shown in Figure 4 below:

Figure 3. Range of E(D) values (positive correlation: 0 ≤ D < 2; independence: D ≈ 2; negative correlation: 2 < D ≤ 4)

Figure 4. Durbin-Watson Test



Here the number on the left is the value of the Durbin-Watson test statistic, $d = 0.8034$. Thus we see that $d$ is less than 2, and indeed, is closer to 0 than it is to 2. From the arguments above, this certainly suggests that these residuals are positively autocorrelated -- but we have yet to develop an actual test of this assertion. To do so, it is important to emphasize that even under the null hypothesis in (2), the distribution of the test statistic, $d$, in (11) is much more complex than the distribution of $D$ in (1). In particular, since each residual estimate, $\hat\varepsilon_t$, depends explicitly on the values of the explanatory data, pci, as well as the sales data, the distribution of $d$ must also depend on this data. In fact this distribution is so complex that it can only be estimated by simulation methods. In the present case, this amounts to sampling many realizations of $(\varepsilon_t : t = 1,..,T)$ from the joint normal distribution in (2), and computing the corresponding values of $d$ for each such realization, as shown by the histogram of 1000 simulated $d$ values in Figure 5 below.² Notice also that the sample mean of the $d$-distribution is larger than 2 (actually $\bar d = 2.15$ in this case). This would appear to contradict (8), which is slightly less than 2. But, as stated above, the distribution of the estimator, $d$, is different from that of $D$ (and in particular, depends on the given data values). Given this simulated distribution, the p-value for a one-sided test of hypothesis (2) can now be calculated by determining the fraction of $d$ samples which do not exceed the observed value, $d = 0.8034$, shown in the figure. This fraction, which can be seen to be very small, is in this case given by:

(12)  p-value = 0.0014

² It can be shown that the sampled values of $d$ are independent of the value of $\sigma^2$ in hypothesis (2), which is here set equal to 1. The actual simulation was programmed and implemented in Matlab.

Figure 5. Durbin-Watson P-values

(Histogram of 1000 simulated d values, with the observed value d = 0.8034 marked.)
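The following Python sketch illustrates the kind of simulation described above (the original was programmed in Matlab, so this is only an illustrative reconstruction). It assumes the pci values are available as a numeric array; under the null hypothesis (2), the intercept, slope, and sigma all cancel out of d, so only the error draws matter.

```python
import numpy as np

def simulate_null_d(x, n_sims=1000, seed=0):
    """Simulate the null distribution of the Durbin-Watson statistic d for a
    simple regression on x: draw iid N(0,1) errors, regress them on (1, x),
    and compute d from the fitted residuals."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(len(x)), np.asarray(x, dtype=float)])
    H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
    M = np.eye(len(x)) - H                        # residual-maker matrix
    d_values = np.empty(n_sims)
    for i in range(n_sims):
        e = rng.standard_normal(len(x))           # errors under the null (sigma = 1)
        r = M @ e                                 # OLS residuals of the simulated sample
        d_values[i] = np.sum(np.diff(r) ** 2) / np.sum(r ** 2)
    return d_values

# one-sided p-value: the fraction of simulated d values at or below the observed d
# d_sim = simulate_null_d(data["pci"].to_numpy())
# p_value = np.mean(d_sim <= 0.8034)
```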


To display such a p-value in JMP, one must click the red triangle next to Durbin-Watson in Figure 4, and then click on "Significant P Value" to obtain the display in Figure 6 below. Notice that the resulting value, 0.0011, is slightly different from (12), because a more complex exact procedure is used in JMP.³ But such small variations have little effect on the result, namely that these residuals are significantly positively autocorrelated.

3. Consequences of Autocorrelation

Before proceeding, it is important to ask what effect autocorrelation has on the regression results. To explore this question graphically, the left panel in Figure 7 below illustrates a linear model with autocorrelated residuals very similar to the present example, where y = sales and where the explanatory variable, x = pci, is increasing in time (so that the x-y plot reveals the autocorrelation). If one estimates this particular model with linear regression, then since the sum of squared residuals is always minimized by definition, the resulting regression line will tend to be closer to

³ The calculation of this p-value in JMP involves the (approximate) integration of a certain complex-valued integral transform. If you want further details, look at the SAS online documentation of their autoreg function at https://support.sas.com/documentation/cdl/en/etsug/68148/HTML/default/viewer.htm#etsug_autoreg_details27.htm, which is not available in the JMP documentation. All that is said in the documentation is that "the computation of this exact probability can be memory and time-intensive if there are many observations". This is why the calculation of this exact p-value is made "optional" in their Durbin-Watson test.

Figure 6. Durbin-Watson P-value Display

Figure 7. Underestimation of Residual Variance


the data points, as shown in the right panel. So it should be clear that this procedure will tend to underestimate the actual sum of squared residuals, and thus produce a root-mean-square estimate, $\hat\sigma_\varepsilon$, which underestimates the true standard error, $\sigma_\varepsilon$, of the residuals in the left panel. But since the standard error of the slope estimator, $\hat\beta_1$, for the regression in Figure 1 was shown in class to have the form,

(13)  $\sigma(\hat\beta_1) = \dfrac{\sigma_\varepsilon}{\sqrt{\sum_t (x_t - \bar x)^2}}$

it follows that the standard error estimator,

(14)  $s_1 = \dfrac{\hat\sigma_\varepsilon}{\sqrt{\sum_t (x_t - \bar x)^2}}$

will also underestimate (13). (A more precise analysis of this issue is given in Appendix 1 below.) Finally, since this estimator appears in the denominator of the associated t-ratio, $t_1 = \hat\beta_1 / s_1$ [and since the numerator, $\hat\beta_1$, continues to provide an unbiased estimate of its true value, $\beta_1$], it follows that $t_1$ will tend to be too large, thus inflating the significance of pci. This simple illustration underscores the most important practical problem with temporal autocorrelation, namely its tendency to make regression results look too significant. So to draw meaningful inferences about the significance of explanatory variables, it is important to correct for such autocorrelation effects.

4. The First-Order Autocorrelation Model

While the Durbin-Watson test provides a fairly general method for detecting autocorrelation, there is no equally general method for correcting this problem. The difficulty is that autocorrelation itself can take many forms. But there is nonetheless one simple probabilistic model of autocorrelation which is sufficiently robust to allow a relatively general correction procedure to be developed. This model, known as the first-order autocorrelation model [or AR(1) model], postulates that time dependencies between residuals are very local in nature. For our present purposes, it is convenient to formalize this notion in a general multivariate setting as follows. For any temporal regression model,

(15)  $y_t = \beta_0 + \sum_{j=1}^{k} \beta_j x_{tj} + u_t \;,\; t = 1,..,T$

the residuals $(u_t : t = 1,..,T)$ are said to exhibit first-order autocorrelation whenever they are related in the following way:

(16)  $u_t = \rho\, u_{t-1} + \varepsilon_t \;,\; t = 2,..,T$


where $(\varepsilon_t : t = 1,..,T)$ is a sequence of iid normal random variables [as in (2) above]. So all dependencies of $u_t$ on the past $(u_{t-1}, u_{t-2}, ...)$ are assumed to be fully captured by the previous period, $u_{t-1}$. The parameter, $\rho$, determines the sign of this autocorrelation effect, and is thus designated the coefficient of autocorrelation. The additional randomness term, $\varepsilon_t$, is by construction independent of the past, and is usually referred to as the innovation occurring in period $t$. Finally, if $\rho = 0$, then the resulting process of innovations yields precisely the standard regression case. So in this setting, hypothesis (2) is equivalent to the (null) hypothesis that $\rho = 0$.

Observe however that condition (16) is not fully complete, and in particular, says nothing about the initial period, $u_1$. It will turn out that the specification of $u_1$ has no effect on the correction scheme we will employ below (and for this reason, it is often set equal to $\varepsilon_1$ for completeness). However, it does make a difference from a theoretical perspective, as can be seen as follows. If we set $u_1 = \varepsilon_1$, then it follows from (16) that

(17)  $\operatorname{var}(u_2) = \operatorname{var}(\rho\,\varepsilon_1 + \varepsilon_2) = \rho^2 \sigma_\varepsilon^2 + \sigma_\varepsilon^2 > \sigma_\varepsilon^2 = \operatorname{var}(u_1)$

Similarly, residual variances must continue to increase each period, which makes very little sense from a behavioral viewpoint. So it is much more natural to suppose that the variances of these correlated residuals remain constant over time, as is implicit in the standard regression assumption (2). This "steady state" assumption is easily modeled by observing that if this constant variance is denoted by $\sigma_u^2$, then by the same argument as in (17) it follows that

(18)  $\operatorname{var}(u_2) = \rho^2 \operatorname{var}(u_1) + \operatorname{var}(\varepsilon_2) \;\Rightarrow\; \sigma_u^2 = \rho^2 \sigma_u^2 + \sigma_\varepsilon^2 \;\Rightarrow\; (1 - \rho^2)\,\sigma_u^2 = \sigma_\varepsilon^2 \;\Rightarrow\; \sigma_u^2 = \dfrac{\sigma_\varepsilon^2}{1 - \rho^2}$

Notice that (18) is only meaningful under the "stationarity condition" that $|\rho| < 1$. Given this condition, it follows that if we now set $u_1 = \varepsilon_1 / \sqrt{1 - \rho^2}$, then by definition the steady-state relation in (18) is automatically satisfied.⁴ So the complete temporal regression model for our purposes is (15) together with the condition that

(19)  $u_1 = \dfrac{\varepsilon_1}{\sqrt{1 - \rho^2}} \;,\quad u_t = \rho\, u_{t-1} + \varepsilon_t \;,\; t > 1$

for some sequence of innovations as in (2).

⁴ Alternatively, if one allows residual variance to keep increasing over successive time periods, then by extending the argument in expression (17), it can be shown that if $|\rho| < 1$, then the limiting residual variance is given by $\sigma_u^2 = \sigma_\varepsilon^2 + \rho^2 \sigma_\varepsilon^2 + \rho^4 \sigma_\varepsilon^2 + \cdots = \big[\sum_{t=1}^{\infty} (\rho^2)^{t-1}\big]\,\sigma_\varepsilon^2 = \dfrac{\sigma_\varepsilon^2}{1 - \rho^2}$, which is exactly (18). So an equivalent interpretation of this steady state is that the observed data is part of a temporal sequence that started "long ago".
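To make the steady-state initialization in (19) concrete, the following Python sketch generates one realization of such a residual process; the particular values of rho and sigma_eps passed to it are arbitrary illustrative choices, not estimates from the data.

```python
import numpy as np

def simulate_ar1_residuals(T, rho, sigma_eps, seed=0):
    """Generate (u_1,...,u_T) from the first-order model (19):
    u_1 = eps_1 / sqrt(1 - rho^2) and u_t = rho*u_{t-1} + eps_t for t > 1,
    so that var(u_t) = sigma_eps^2 / (1 - rho^2) in every period."""
    assert abs(rho) < 1, "stationarity condition |rho| < 1"
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma_eps, size=T)      # iid innovations, as in (2)
    u = np.empty(T)
    u[0] = eps[0] / np.sqrt(1.0 - rho ** 2)       # steady-state initialization
    for t in range(1, T):
        u[t] = rho * u[t - 1] + eps[t]
    return u

# e.g., u = simulate_ar1_residuals(T=15, rho=0.5, sigma_eps=1.0)
```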


5. Correcting for Autocorrelation

If autocorrelation effects are assumed to be first-order in nature, then a natural correction procedure can be motivated by observing that if the value of the autocorrelation coefficient, $\rho$, were known, then such effects could easily be eliminated as follows. Starting with the regression model in (15), we can construct a new model with essentially the same beta coefficients by considering the following lagged variables:

(20)  $Z_t = Y_t - \rho\, Y_{t-1} \;,\; t = 2,..,T$

(21)  $w_{tj} = x_{tj} - \rho\, x_{t-1,j} \;,\; t = 2,..,T \;,\; j = 1,..,k$

With these definitions, it follows from (15) together with (16) that

(22)  $Z_t = Y_t - \rho\, Y_{t-1} = \Big(\beta_0 + \sum_{j=1}^{k} \beta_j x_{tj} + u_t\Big) - \rho\Big(\beta_0 + \sum_{j=1}^{k} \beta_j x_{t-1,j} + u_{t-1}\Big)$

$= (1 - \rho)\beta_0 + \sum_{j=1}^{k} \beta_j (x_{tj} - \rho\, x_{t-1,j}) + (u_t - \rho\, u_{t-1})$

$\;\Rightarrow\; Z_t = \alpha_0 + \sum_{j=1}^{k} \beta_j w_{tj} + \varepsilon_t \;,\; t = 2,..,T$

where $\alpha_0 = (1 - \rho)\beta_0$. The key point to observe here is that by (16) the "innovation" residuals $(\varepsilon_t : t = 2,..,T)$ exhibit no autocorrelation. So by this change of variables, we obtain a standard linear regression model involving the same slope coefficients, $(\beta_j : j = 1,..,k)$, and a simple (known) multiple of the original intercept, $\beta_0$:

(23)  $Z_t = \alpha_0 + \sum_{j=1}^{k} \beta_j w_{tj} + \varepsilon_t \;,\; t = 2,..,T$

(24)  $\varepsilon_t \sim_{iid} N(0, \sigma_\varepsilon^2)$

While this standard model is not operational without knowing the value of $\rho$, it nonetheless suggests that a good approximation can be obtained by finding a reasonable estimate, $\hat\rho$, of $\rho$. Here a natural estimate of $\rho$ can be obtained as follows. If for any given set of data, $(y_t, x_{t1},..,x_{tk})\,,\; t = 1,..,T$, it is true that the residual estimates


(25)  $\hat u_t = y_t - \hat y_t = y_t - \Big(\hat\beta_0 + \sum_{j=1}^{k} \hat\beta_j x_{tj}\Big)$

resulting from regression (15) are first-order autocorrelated, then by replacing each $u_t$ in (16) with its estimate, $\hat u_t$, in (25), it is reasonable to suppose that these estimates should satisfy the relation:

(26)  $\hat u_t = \rho\, \hat u_{t-1} + \varepsilon_t \;,\; t = 2,..,T$

If so, then (26) itself constitutes a "no-intercept" regression model that can be used to estimate $\rho$, with the resulting estimate, $\tilde\rho$, given by

(27)  $\tilde\rho = \dfrac{\sum_{t=2}^{T} \hat u_t\, \hat u_{t-1}}{\sum_{t=2}^{T} \hat u_{t-1}^2}$

While this regression procedure suffers from certain theoretical problems,⁵ it can be shown that (27) yields a consistent estimator of $\rho$ under quite general conditions. However, the estimator often used in practice (and in particular in JMP) is a slight modification of (27) in which the denominator is extended to the full sum of squares, $\sum_{t=1}^{T} \hat u_t^2$, to obtain:⁶

(28)  $\hat\rho = \dfrac{\sum_{t=2}^{T} \hat u_t\, \hat u_{t-1}}{\sum_{t=1}^{T} \hat u_t^2}$

This is precisely the Autocorrelation estimate, $\hat\rho = 0.4821$, shown in the Durbin-Watson output (Figure 4) above. One conceptual advantage of this estimator is its direct relation to the sample estimator, $d$, of the Durbin-Watson statistic in (1), which can be seen by expanding $d$ as follows (using $\hat u_t$ rather than $\hat\varepsilon_t$):

(29)  $d = \dfrac{\sum_{t=2}^{T} (\hat u_t - \hat u_{t-1})^2}{\sum_{t=1}^{T} \hat u_t^2} = \dfrac{\sum_{t=2}^{T} \hat u_t^2}{\sum_{t=1}^{T} \hat u_t^2} + \dfrac{\sum_{t=2}^{T} \hat u_{t-1}^2}{\sum_{t=1}^{T} \hat u_t^2} - 2\,\dfrac{\sum_{t=2}^{T} \hat u_t\, \hat u_{t-1}}{\sum_{t=1}^{T} \hat u_t^2} \approx 1 + 1 - 2\hat\rho$

$\;\Rightarrow\; d \approx 2\,(1 - \hat\rho)$

⁵ In particular, regression model (26) suffers from the "endogeneity" problem that each $\hat u_t$ serves as both the dependent variable in one equation, $\hat u_t = \rho\, \hat u_{t-1} + \varepsilon_t$, and the independent (explanatory) variable in another equation, $\hat u_{t+1} = \rho\, \hat u_t + \varepsilon_{t+1}$.

⁶ Note that $\hat\rho$ is also a consistent estimator, since $\hat\rho = \tilde\rho\,\big(\sum_{t=2}^{T} \hat u_{t-1}^2 \,/\, \sum_{t=1}^{T} \hat u_t^2\big) \xrightarrow{prob} \rho\,(1) = \rho$.


So within the context of the first-order autocorrelation model above, the (more general) Durbin-Watson statistic, $d$, is essentially an equivalent form of this autocorrelation statistic, $\hat\rho$. In particular, with respect to the classification scheme in Figure 3 above, we now have the approximate correspondence:

(30)  $\hat\rho = 1 \;\leftrightarrow\; d = 0 \;,\quad \hat\rho = 0 \;\leftrightarrow\; d = 2 \;,\quad \hat\rho = -1 \;\leftrightarrow\; d = 4$

which is seen to strengthen the interpretation of this classification scheme in terms of corresponding levels of autocorrelation. Given this estimator of $\rho$, we can now approximate the "corrected" model in [(23),(24)] above by simply replacing $\rho$ with $\hat\rho$ in expressions (20) and (21) above to obtain the corresponding approximate lag variables,

(31)  $\hat Z_t = Y_t - \hat\rho\, Y_{t-1} \;,\; t = 2,..,T$

(32)  $\hat w_{tj} = x_{tj} - \hat\rho\, x_{t-1,j} \;,\; t = 2,..,T \;,\; j = 1,..,k$

Finally, using these variables, one can proceed to estimate the corresponding approximated version of model (23):

(33)  $\hat Z_t = \alpha_0 + \sum_{j=1}^{k} \beta_j \hat w_{tj} + \varepsilon_t \;,\; t = 2,..,T$

The resulting slope estimates, $(\hat\beta_1,..,\hat\beta_k)$, in this two-stage regression procedure can now be used to estimate $(\beta_1,..,\beta_k)$ in model (15) above. Similarly, the intercept estimate, $\hat\alpha_0$, can be used to estimate $\beta_0$ by setting

(34)  $\hat\beta_0 = \dfrac{\hat\alpha_0}{1 - \hat\rho}$

But before doing so, the key question to be addressed is whether or not the autocorrelation effects in (16) have been effectively removed by this procedure. This can of course be checked by a second application of the Durbin-Watson test, which we now illustrate in terms of the Sales.jmp example above. To do so, we start by recording the rho_hat estimate, 0.4821, as a new column in the data set, and then construct the transformed variables



(31) and (32) in terms of rho_hat. In particular, the transformed sales variable is here denoted by d_sales (for weighted sales difference), and is constructed as in Figure 8 below, where the Lag operator (under Row functions) identifies the sales value in the previous row. The transformed pci variable, d_pci, is constructed in a similar manner, and Fit Model is now used to regress d_sales on d_pci, yielding the results in Figure 9 below. Notice first that the cyclical pattern of residuals in Figure 1 above appears to have been substantially reduced, though not completely removed. This is typical of such a two-stage regression procedure. Next observe that while d_pci is still seen to be a very significant predictor of d_sales, a comparison of the t Ratios in Figures 1 and 9 shows that this level of significance has indeed been reduced (along with the value of RSquare). This is precisely the expected effect of accounting for the higher variance of $\hat\beta_1$, as detailed in Appendix 1.

Figure 8. Transformed Sales Variable

Figure 9. Regression of Transformed Variables
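The same two-stage construction can be sketched outside JMP as follows. This Python sketch assumes the sales and pci columns are available as numeric arrays (names are illustrative); it estimates rho_hat as in (28), forms the differences (31)-(32), and re-fits by OLS.

```python
import numpy as np
import statsmodels.api as sm

def two_stage_ar1(y, x):
    """Two-stage procedure: (i) OLS of y on x, (ii) estimate rho_hat from the
    residuals as in (28), (iii) form the lagged differences (31)-(32) and
    re-estimate the model by OLS on the transformed variables."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    # Stage 1: ordinary least squares and residuals u_hat
    fit1 = sm.OLS(y, sm.add_constant(x)).fit()
    u = fit1.resid
    rho_hat = np.sum(u[1:] * u[:-1]) / np.sum(u ** 2)     # expression (28)
    # Stage 2: transformed variables d_y = y_t - rho_hat*y_{t-1}, etc.
    d_y = y[1:] - rho_hat * y[:-1]
    d_x = x[1:] - rho_hat * x[:-1]
    fit2 = sm.OLS(d_y, sm.add_constant(d_x)).fit()
    beta0_hat = fit2.params[0] / (1.0 - rho_hat)          # expression (34)
    return rho_hat, fit2, beta0_hat

# e.g., rho_hat, fit2, beta0_hat = two_stage_ar1(data["sales"], data["pci"])
# print(fit2.summary())   # compare t-ratios and R-squared with the first stage
```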


The Durbin-Watson results for this transformed data set are shown in Figure 10 below, along with a Residual-by-Row plot showing the sequence of these transformed residuals in time. Note first from the Durbin-Watson test that the significance, 0.1160, of autocorrelation is now considerably less, and indeed, is no longer even weakly significant (by the usual "prob < 0.10" standard). This is borne out by the Residual-by-Row plot, in which the strong cyclical relation seen in Figure 1 above is no longer evident. So in the present example, it does appear that this two-stage regression procedure has been successful in eliminating (or at least significantly reducing) the effects of autocorrelation. This means that one can have much more faith in these new estimates and significance levels for the beta coefficients. However, it should be emphasized that autocorrelation effects are not always so easily rectified. Indeed, while the significance of autocorrelation is almost always reduced by this two-stage procedure, it may nonetheless continue to be quite significant (say, reduced from .001 to .01). This suggests that it might be worthwhile to repeat the above procedure by "differencing the differences" in the second stage. Such a three-stage regression procedure is detailed in Appendix 2 below. However, it should also be emphasized that there are dangers in doing so. Roughly speaking, taking differences of a data series tends to produce a "rougher" series with larger variance (as we have seen). So taking second differences will tend to increase variance even further. Thus, while autocorrelation effects may be reduced further, this additional variance may render all beta coefficients insignificant as well. In such cases, all that can be concluded is that autocorrelation effects are so strong that no substantive relations between the dependent and explanatory variables can be identified. With this in mind, we now consider an alternative approach in which autocorrelation effects are captured by additional explanatory variables.

Figure 10. Durbin-Watson Test plus Residual Plot


6. An Explanatory Approach to Autocorrelation

In the Sales.jmp example above, it is of interest to ask whether the cyclical fluctuations in residuals might actually be explained by other means. In the present case, it is well known that economic fluctuations (business cycles) often influence spending behavior in a manner which is more rapid than overall changes in per capita income. Because this appears to be consistent with the fluctuations observed in these sales residuals, it is of interest to ask whether there are other explanatory variables that might capture such phenomena. An obvious choice here is the unemployment rate, which is perhaps the single best indicator of the state of the economy. Annual unemployment rates, ur, are readily available at the national level, and are included in Sales.jmp. If this new variable is added to the regression, then the results obtained using Fit Model are shown in Figure 11 below. Note first from the RSquare Adj value that the addition of this unemployment variable has substantially improved the overall fit of the model, and in particular that its highly significant negative beta coefficient suggests that sales are indeed substantially reduced during periods of high unemployment. Moreover, we see from the associated Durbin-Watson test that all autocorrelation effects have been effectively removed from this regression (as also seen by the Residual-by-Row plot). So in this example, we have not only succeeded in eliminating autocorrelation effects, but have also obtained a sharper explanation of the original sales data. However, it should be emphasized that such simple explanations of autocorrelation effects are often not available. So the real power of the "black box" two-stage regression procedure above is that it is always applicable, even when sources of autocorrelation effects are complex, or perhaps not known at all.

Figure 11. Regression Results with Unemployment Added
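For completeness, the corresponding multiple regression can be run outside JMP with one more column; the sketch below assumes the exported table also contains the ur column (file and column names are illustrative, as before).

```python
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("sales.csv")                    # assumed export of Sales.jmp
X2 = sm.add_constant(data[["pci", "ur"]])          # pci plus unemployment rate
fit_ur = sm.OLS(data["sales"], X2).fit()
print(fit_ur.summary())                            # sign of ur, adjusted R-squared
```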


Appendix 1. Consequences of Autocorrelation for Regression

To analyze the consequences of autocorrelation, we focus on the variance of the slope estimate, $\hat\beta_1$, for a simple regression,

(35)  $Y_t = \beta_0 + \beta_1 x_t + u_t \;,\; t = 1,..,T$

with autocorrelated errors satisfying

(36)  $u_1 = \dfrac{\varepsilon_1}{\sqrt{1 - \rho^2}} \;,\quad u_t = \rho\, u_{t-1} + \varepsilon_t \;,\; t = 2,..,T$

where $|\rho| < 1$ and $\varepsilon_t \sim_{iid} N(0, \sigma_\varepsilon^2)\,,\; t = 1,..,T$. To do so, recall first that $\hat\beta_1$ is a linear estimator of the form,

(37)  $\hat\beta_1 = \sum_{t=1}^{T} w_t Y_t$

where

(38)  $w_t = \dfrac{x_t - \bar x}{\sum_{i=1}^{T} (x_i - \bar x)^2} \;,\; t = 1,..,T$

As was shown in class, this implies that $\hat\beta_1$ is an unbiased estimator of $\beta_1$, regardless of the presence of autocorrelation. However, the estimated variance of $\hat\beta_1$ depends on the assumption that the errors $(u_t : t = 1,..,T)$ are independently and identically distributed with mean zero and variance, $\sigma_u^2$, so that, as shown in class,

(39)  $\operatorname{var}(\hat\beta_1) = \sum_{t=1}^{T} w_t^2 \operatorname{var}(Y_t) = \sigma_u^2 \sum_{t=1}^{T} w_t^2 = \dfrac{\sigma_u^2}{\sum_t (x_t - \bar x)^2}$

With these observations, our main objective is to show that the true variance of $\hat\beta_1$ tends to be much larger than (39) in the presence of autocorrelation. To do so, we start by observing that if the $Y_t$'s are not independent, and in particular have autocorrelated errors, then the variance of the sum of random variables in (37) now takes the more general form:

(40)  $\operatorname{var}(\hat\beta_1) = \sum_{t=1}^{T} w_t^2 \operatorname{var}(Y_t) + \sum_{t \neq s} w_t w_s \operatorname{cov}(Y_t, Y_s)$

To analyze this expression further, observe next that


(41)  $\operatorname{cov}(Y_t, Y_s) = \operatorname{cov}(u_t, u_s) = E(u_t u_s) - E(u_t)E(u_s) = E(u_t u_s)$

In particular, this implies that for each $t = 2,..,T$,

(42)  $\operatorname{cov}(u_t, u_{t-1}) = E(u_t u_{t-1}) = E[(\rho\, u_{t-1} + \varepsilon_t)\, u_{t-1}] = \rho\, E(u_{t-1}^2) + E(\varepsilon_t\, u_{t-1})$

But since $u_{t-1}$ is a function of $(\varepsilon_{t-1}, \varepsilon_{t-2}, ...)$, it follows that $\varepsilon_t$ and $u_{t-1}$ are independent, so that

(43)  $\operatorname{cov}(u_t, u_{t-1}) = \rho\, E(u_{t-1}^2) = \rho\, \operatorname{var}(u_{t-1}) \;,\; t = 2,..,T$

Moreover, the stationarity argument in (18) shows that

(44)  $\operatorname{var}(u_t) = \sigma_u^2 \;,\; t = 1,2,..,T$

so that expression (43) takes the simpler form,

(45)  $\operatorname{cov}(u_t, u_{t-1}) = \rho\, \sigma_u^2 \;,\; t = 2,..,T$

By applying the same argument to $u_t$ and $u_{t-2}$, we see that

(46)  $\operatorname{cov}(u_t, u_{t-2}) = E(u_t u_{t-2}) = E[(\rho\, u_{t-1} + \varepsilon_t)\, u_{t-2}] = E\big[\big(\rho(\rho\, u_{t-2} + \varepsilon_{t-1}) + \varepsilon_t\big)\, u_{t-2}\big]$

$= \rho^2 E(u_{t-2}^2) + \rho\, E(\varepsilon_{t-1}) E(u_{t-2}) + E(\varepsilon_t) E(u_{t-2}) = \rho^2 \sigma_u^2 + 0 + 0 = \rho^2 \sigma_u^2$

Thus, by recursive applications of the same argument, it can readily be shown that

(47)  $\operatorname{cov}(u_t, u_{t-k}) = \rho^k \sigma_u^2 \;,\; k = 1,..,t-1 \;,\; t = 2,..,T$

and thus that the second term in (40) can be given the more explicit form,

(48)  $\sum_{t \neq s} w_t w_s \operatorname{cov}(Y_t, Y_s) = 2 \sum_{s < t} w_t w_s \operatorname{cov}(Y_t, Y_s) = 2 \sum_{t=1}^{T} \sum_{k=1}^{t-1} w_t w_{t-k} \operatorname{cov}(Y_t, Y_{t-k}) = 2 \sum_{t=1}^{T} \sum_{k=1}^{t-1} w_t w_{t-k}\, \rho^k \sigma_u^2$

Finally, since $\operatorname{var}(Y_t) = \operatorname{var}(u_t) = \sigma_u^2$ by (41), it follows that (40) can be simplified to

(49)  $\operatorname{var}(\hat\beta_1) = \sigma_u^2 \sum_{t=1}^{T} w_t^2 + 2\,\sigma_u^2 \sum_{t=1}^{T} \sum_{k=1}^{t-1} w_t w_{t-k}\, \rho^k$


To evaluate the $w$ coefficients in this expression, it is convenient to replace the denominator in (38) by the leverage, $L = \sum_t (x_t - \bar x)^2$, so that (49) becomes

(50)  $\operatorname{var}(\hat\beta_1) = \sigma_u^2 \sum_{t=1}^{T} \dfrac{(x_t - \bar x)^2}{L^2} + 2\,\sigma_u^2 \sum_{t=1}^{T} \sum_{k=1}^{t-1} \rho^k\, \dfrac{(x_t - \bar x)(x_{t-k} - \bar x)}{L^2}$

$= \dfrac{\sigma_u^2}{L^2}\left[L + 2 \sum_{t=1}^{T} \sum_{k=1}^{t-1} \rho^k (x_t - \bar x)(x_{t-k} - \bar x)\right] = \dfrac{\sigma_u^2}{L}\left[1 + \dfrac{2}{L} \sum_{t=1}^{T} \sum_{k=1}^{t-1} \rho^k (x_t - \bar x)(x_{t-k} - \bar x)\right]$

Finally, by replacing the leverage, $L$, with its explicit form, we obtain the key result:⁷

(51)  $\operatorname{var}(\hat\beta_1) = \dfrac{\sigma_u^2}{\sum_t (x_t - \bar x)^2}\left[1 + 2 \sum_{t=1}^{T} \sum_{k=1}^{t-1} \rho^k\, \dfrac{(x_t - \bar x)(x_{t-k} - \bar x)}{\sum_t (x_t - \bar x)^2}\right]$

By comparing this expression with (39), it becomes clear that the presence of autocorrelation will tend to increase the variance of $\hat\beta_1$ whenever the summation in brackets is positive. But recall that autocorrelation is almost invariably positive, so that $\rho > 0$. Moreover, since the vast majority of x-processes also exhibit some degree of positive autocorrelation, one can expect the cross products, $(x_t - \bar x)(x_{t-k} - \bar x)$, to be positive for small $k$. Finally, since the (geometrically) decreasing weights, $\rho^k$, ensure that these small-lag terms dominate this summation, it follows that the sum will be positive in most cases, and thus that the true variance of $\hat\beta_1$ will be substantially larger than expression (39).⁸

⁷ This expression is essentially the same as expression (12.2.8) in Gujarati, D.N. and D.C. Porter (2009) Basic Econometrics, 5th Ed., Chapter 12 (available online at https://www.academia.edu/15273562/). The only difference is that their notation expresses the explanatory variable, x, in deviation form (i.e., as deviations from the sample mean).

⁸ A more detailed examination of the potential size of this effect is given in the Gujarati-Porter reference above.
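As a numerical check on (51), the bracketed inflation factor can be evaluated directly for any x-series and any assumed values of ρ and σ_u. A minimal Python sketch (the function name and the particular ρ value in the usage comment are illustrative):

```python
import numpy as np

def slope_variance_ar1(x, rho, sigma_u):
    """True variance of the OLS slope under AR(1) errors, expression (51),
    together with the naive iid formula (39) for comparison."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    L = np.sum(dev ** 2)                          # leverage, L
    naive = sigma_u ** 2 / L                      # expression (39)
    cross = 0.0
    for t in range(len(x)):                       # double sum in (51)
        for k in range(1, t + 1):                 # all lags back to the first observation
            cross += rho ** k * dev[t] * dev[t - k]
    true_var = (sigma_u ** 2 / L) * (1.0 + 2.0 * cross / L)
    return naive, true_var

# e.g., naive, true_var = slope_variance_ar1(data["pci"].to_numpy(), rho=0.48, sigma_u=1.0)
```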


Appendix 2. Extension to Three-Stage Least Squares

As mentioned in the text, the two-stage least squares procedure for removing temporal autocorrelation effects can be extended to a three-stage procedure. This procedure can be formalized as follows.

Three-Stage Model

(52)  $Y_t = \beta_0 + \sum_{j=1}^{k} \beta_j x_{jt} + \varepsilon_t \;,\; t = 1,..,T$

(53)  $\varepsilon_t = \rho\, \varepsilon_{t-1} + u_t \;,\; t = 2,..,T$

(54)  $u_t = \lambda\, u_{t-1} + v_t \;,\; t = 2,..,T$

(55)  $v_t \sim_{iid} N(0, \sigma^2) \;,\; t = 1,..,T$

The key point here is that the "second-order" residuals, $u_t$, are no longer assumed to be independent. Rather, they now depend on previous residual values as well. In this three-stage model, there are two autocorrelation parameters, $\rho$ and $\lambda$, and the resulting "third-order" residuals, $v_t$, are now assumed to be iid normal [expression (55)].

Three-Stage Procedure

As in the two-stage procedure, we start by letting

(56)  $Y_t^{(1)} = Y_t - \rho\, Y_{t-1} \;,\; t = 2,..,T$

(57)  $x_{jt}^{(1)} = x_{jt} - \rho\, x_{j,t-1} \;,\; t = 2,..,T$

so that

(58)  $Y_t^{(1)} = \Big(\beta_0 + \sum_{j=1}^{k} \beta_j x_{jt} + \varepsilon_t\Big) - \rho\Big(\beta_0 + \sum_{j=1}^{k} \beta_j x_{j,t-1} + \varepsilon_{t-1}\Big)$

$= (1 - \rho)\beta_0 + \sum_{j=1}^{k} \beta_j (x_{jt} - \rho\, x_{j,t-1}) + (\varepsilon_t - \rho\, \varepsilon_{t-1})$

$= (1 - \rho)\beta_0 + \sum_{j=1}^{k} \beta_j x_{jt}^{(1)} + u_t \;,\; t = 2,..,T$

Now proceed to stage three by letting

(59)  $Y_t^{(2)} = Y_t^{(1)} - \lambda\, Y_{t-1}^{(1)}$


(60)  $x_{jt}^{(2)} = x_{jt}^{(1)} - \lambda\, x_{j,t-1}^{(1)} \;,\; t = 3,..,T$

so that

(61)  $Y_t^{(2)} = \Big[(1 - \rho)\beta_0 + \sum_{j=1}^{k} \beta_j x_{jt}^{(1)} + u_t\Big] - \lambda\Big[(1 - \rho)\beta_0 + \sum_{j=1}^{k} \beta_j x_{j,t-1}^{(1)} + u_{t-1}\Big]$

$= (1 - \lambda)(1 - \rho)\beta_0 + \sum_{j=1}^{k} \beta_j \big(x_{jt}^{(1)} - \lambda\, x_{j,t-1}^{(1)}\big) + (u_t - \lambda\, u_{t-1})$

$= (1 - \lambda)(1 - \rho)\beta_0 + \sum_{j=1}^{k} \beta_j x_{jt}^{(2)} + v_t \;,\; t = 3,..,T$

So if $\rho$ and $\lambda$ were known, then by (55) we see that the resulting regression in (61), with $T - 2$ samples, has removed all autocorrelation effects. Moreover, since the slope values, $\beta_j$, in (61) are the same as in (52), these initial values can be estimated using (61). Finally, to estimate the unknown parameters, $\rho$ and $\lambda$, we start with the two-stage procedure and obtain a set of estimated regression residuals, $\hat\varepsilon_t$, and a corresponding (modified) $\rho$ estimate,

(62)  $\hat\rho = \dfrac{\sum_{t=3}^{T} \hat\varepsilon_t\, \hat\varepsilon_{t-1}}{\sum_{t=2}^{T} \hat\varepsilon_t^2}$

If the residuals, $\hat u_t = \hat\varepsilon_t - \hat\rho\, \hat\varepsilon_{t-1}$, are uncorrelated (by Durbin-Watson), then we may assume that $\lambda = 0$ in (54) and stop. Otherwise, we proceed to estimate $\lambda$ by the regression:

(63)  $\hat u_t = \lambda\, \hat u_{t-1} + v_t \;,\quad v_t \sim_{iid} N(0, \sigma^2) \;,\; t = 3,..,T$

and obtain the corresponding (modified) least squares estimate

(64)  $\hat\lambda = \dfrac{\sum_{t=3}^{T} \hat u_t\, \hat u_{t-1}}{\sum_{t=2}^{T} \hat u_t^2}$

which is equivalent to iterating the two-stage least squares procedure in JMP. If the residuals, $\hat u_t - \hat\lambda\, \hat u_{t-1}$, are uncorrelated (again by Durbin-Watson), then this procedure has been successful. Otherwise, one could in principle proceed to a "fourth stage". However, as I said in class, so much additional "differencing noise" has already been introduced that the beta estimates of interest in such a fourth stage are not likely to remain significant.
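A rough Python sketch of this three-stage procedure is given below for a single explanatory variable. It follows the logic of (56) through (64), but omits the intermediate Durbin-Watson checks and uses full-sample sums in place of the exact summation limits in (62) and (64); all names are illustrative.

```python
import numpy as np
import statsmodels.api as sm

def three_stage_ar1(y, x):
    """Three-stage sketch: estimate rho_hat from the first-stage residuals,
    form u_hat_t = eps_hat_t - rho_hat*eps_hat_{t-1}, estimate lambda_hat from
    these, then difference twice as in (56)-(60) and re-fit by OLS as in (61)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    # Stage 1: OLS residuals eps_hat of y on x, and rho_hat
    eps = sm.OLS(y, sm.add_constant(x)).fit().resid
    rho_hat = np.sum(eps[1:] * eps[:-1]) / np.sum(eps ** 2)
    # "Second-order" residuals and lambda_hat
    u = eps[1:] - rho_hat * eps[:-1]
    lam_hat = np.sum(u[1:] * u[:-1]) / np.sum(u ** 2)
    # Stages 2 and 3: rho-differences, then lambda-differences, per (56)-(60)
    y1 = y[1:] - rho_hat * y[:-1]
    x1 = x[1:] - rho_hat * x[:-1]
    y2 = y1[1:] - lam_hat * y1[:-1]
    x2 = x1[1:] - lam_hat * x1[:-1]
    fit3 = sm.OLS(y2, sm.add_constant(x2)).fit()
    return rho_hat, lam_hat, fit3

# The slope in fit3 estimates beta_1 in (52); its intercept estimates
# (1 - lam_hat)*(1 - rho_hat)*beta_0, so beta_0 is recovered by dividing that out.
```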


Note finally that expression (52) in this three-stage model looks exactly like the original multiple regression model. So expressions (53) through (55) simply elaborate the error structure of this model. Moreover, if one formally "initializes" this model by including the error variables $\varepsilon_1$ and $u_1$, and replacing (55) by

(65)  $\varepsilon_1, u_1, v_1,.., v_T \sim_{iid} N(0, \sigma^2)$

then it follows in particular that $E(\varepsilon_1) = 0$ and that

(66)  $\varepsilon_2 = \rho\, \varepsilon_1 + u_2 = \rho\, \varepsilon_1 + \lambda\, u_1 + v_2 \;\Rightarrow\; E(\varepsilon_2) = \rho\, E(\varepsilon_1) + \lambda\, E(u_1) + E(v_2) = 0$

Proceeding by induction, one may similarly verify that $E(\varepsilon_t) = 0$ for all $t = 1,..,T$, and thus that the conditional expectation in (52) has the familiar form:

(67)  $E(Y_t \mid x_{1t},.., x_{kt}) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{jt} \;,\; t = 1,..,T$

So all $\beta_j$ values continue to have the same interpretation, i.e., the expected change in $Y_t$ resulting from a unit change in $x_{jt}$, all else being equal.