
Chapter 10

Multicollinearity

10.1 The Nature of Multicollinearity

10.1.1 Extreme Collinearity

The standard OLS assumption that (xi1, xi2, . . . , xik) not be linearly related means that for any (c1, c2, . . . , ck−1)

xik ≠ c1xi1 + c2xi2 + · · · + ck−1xi,k−1 (10.1)

for some i. If the assumption is violated, then we can find (c1, c2, . . . , ck−1) such that

xik = c1xi1 + c2xi2 + · · ·+ ck−1xi,k−1 (10.2)

for all i. Define

X1 = [ x11 · · · x1,k−1 ; x21 · · · x2,k−1 ; . . . ; xn1 · · · xn,k−1 ] (an n × (k − 1) matrix), xk = (x1k, x2k, . . . , xnk)′, and c = (c1, c2, . . . , ck−1)′.

Then extreme collinearity can be represented as

xk = X1c. (10.3)

We have represented extreme collinearity in terms of the last explanatory variable. Since we can always re-order the variables, this choice is without loss of generality, and the analysis could be applied to any non-constant variable by moving it to the last column.
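To make the rank failure concrete, here is a minimal numpy sketch (not from the text; the variable names and numbers are illustrative) in which the third regressor is an exact linear combination of the first two, so X′X is singular and the normal equations have no unique solution.

```python
import numpy as np

# Extreme collinearity: x3 is an exact linear combination of x1 and x2,
# so the columns of X are linearly dependent and X'X is singular.
rng = np.random.default_rng(0)
x1 = rng.normal(size=20)
x2 = rng.normal(size=20)
x3 = 2.0 * x1 - 0.5 * x2               # x3 = c1*x1 + c2*x2 exactly
X = np.column_stack([x1, x2, x3])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))      # 2 rather than 3
print(np.linalg.det(XtX))              # zero up to rounding error
```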

10.1.2 Near Extreme Collinearity

Of course, it is rare, in practice, that an exact linear relationship holds. Instead, we have

xik = c1xi1 + c2xi2 + · · ·+ ck−1xi,k−1 + vi (10.4)


or, more compactly,

xk = X1c + v, (10.5)

where the v's are small relative to the x's. If we think of the v's as random variables they will have small variance (and zero mean if X includes a column of ones).

A convenient way to algebraically express the degree of collinearity is the sample correlation between xik and wi = c1xi1 + c2xi2 + · · · + ck−1xi,k−1, namely

rx,w = cov(xik, wi)/√(var(xik) var(wi)) = cov(wi + vi, wi)/√(var(wi + vi) var(wi)). (10.6)

Clearly, as the variance of vi grows small, this value will go to unity. For near extreme collinearity, we are talking about a high correlation between at least one variable and some linear combination of the others.

We are interested not only in the possibility of high correlation between xik and the linear combination wi = c1xi1 + c2xi2 + · · · + ck−1xi,k−1 for a particular choice of c, but for any choice of the coefficients. The choice which maximizes the correlation is the choice which minimizes the sum of squared deviations ∑_{i=1}^n v²i, that is, least squares. Thus c = (X′1X1)−1X′1xk and w = X1c, and

(rxk,w)² = R²k• (10.7)

is the R² of this regression and hence the maximal squared sample correlation between xik and the other x's.

10.1.3 Absence of Collinearity

At the other extreme, suppose

R2k• = rxk,w = cov(xik, wi ) = 0. (10.8)

That is, xik has zero sample correlation with all linear combinations of the other variables for any ordering of the variables. In terms of the matrices, this requires c = 0 or

X′1xk = 0, (10.9)

regardless of which variable is used as xk. This is called the case of orthogonal regressors, since the various x's are all orthogonal. This extreme is also very rare, in practice. We usually find some degree of collinearity, though not perfect, in any data set.

10.2 Consequences of Multicollinearity

10.2.1 For OLS Estimation

We will first examine the effect of xk being highly collinear upon the estimator β̂k. Now let

xk = X1c + v (10.10)


The OLS estimates are given by the solution of

X′y = X′Xβ̂ = X′(X1 : xk)β̂ = (X′X1 : X′xk)β̂. (10.11)

Applying Cramer's rule to obtain β̂k yields

β̂k = |X′X1 : X′y| / |X′X|. (10.12)

However, as the collinearity becomes more extreme, the columns of X (the rows of X′) become more linearly dependent and

lim_{v→0} β̂k = 0/0, (10.13)

which is indeterminate.

Now, the variance-covariance matrix is

σ²(X′X)−1 = σ² (1/|X′X|) adj(X′X)
          = σ² (1/|X′X|) adj[ (X1 : xk)′(X1 : xk) ]
          = σ² (1/|X′X|) adj[ X′1X1  X′1xk ; x′kX1  x′kxk ]. (10.14)

The variance of β̂k is given by the (k, k) element, so

var(β̂k) = σ² (1/|X′X|) cof(k, k) = σ² |X′1X1| / |X′X|. (10.15)

Thus, for |X′1X1| ≠ 0, we have

lim_{v→0} var(β̂k) = σ²|X′1X1| / 0 = ∞, (10.16)

and the variance of the collinear terms becomes unbounded.

It is instructive to give more structure to the variance of the last coefficient estimate in terms of the sample correlation R²k• given above. First we obtain the covariance of the OLS estimators other than the intercept. Denote X = (ℓ : X∗), where ℓ is an n × 1 vector of ones and X∗ contains the nonconstant columns of X; then

X′X = [ ℓ′ℓ  ℓ′X∗ ; X∗′ℓ  X∗′X∗ ]. (10.17)

Using the results for the inverse of a partitioned matrix, we find that the lower right-hand (k − 1) × (k − 1) submatrix of the inverse is given by

(X∗′X∗ − X∗′ℓ(ℓ′ℓ)−1ℓ′X∗)−1 = (X∗′X∗ − n x̄∗x̄∗′)−1 = [(X∗ − ℓx̄∗′)′(X∗ − ℓx̄∗′)]−1 = (X̃′X̃)−1, (10.18)

where x̄∗′ = ℓ′X∗/n is the row vector of means of the nonconstant variables and X̃ = X∗ − ℓx̄∗′ is the demeaned or deviation form of the data matrix for the nonconstant variables.

We now write X̃ = (X̃1 : x̃k), where x̃k is the last, that is the (k − 1)-th, column; then

X̃′X̃ = [ X̃′1X̃1  X̃′1x̃k ; x̃′kX̃1  x̃′kx̃k ]. (10.19)

Using the results for partitioned inverses again, the (k − 1, k − 1) element of the inverse (X̃′X̃)−1 is given by

(x̃′kx̃k − x̃′kX̃1(X̃′1X̃1)−1X̃′1x̃k)−1 = 1/(x̃′kx̃k − x̃′kX̃1(X̃′1X̃1)−1X̃′1x̃k)
= 1/(e′kek)
= 1/(x̃′kx̃k · e′kek/x̃′kx̃k)
= 1/(x̃′kx̃k (SSEk/SSTk))
= 1/(x̃′kx̃k (1 − R²k•)), (10.20)

where ek = (In − X̃1(X̃′1X̃1)−1X̃′1)x̃k are the OLS residuals from regressing the demeaned xk's on the other demeaned variables, and SSEk, SSTk, and R²k• are the corresponding statistics for this regression. Thus we find

var(β̂k) = σ²[(X′X)−1]kk = σ²/(x̃′kx̃k(1 − R²k•))
= σ²/( ∑_{i=1}^n (xik − x̄k)² (1 − R²k•) )
= σ²/( n · (1/n)∑_{i=1}^n (xik − x̄k)² · (1 − R²k•) ), (10.21)

and the variance of β̂k increases with the noise σ² and the correlation R²k• of xk with the other variables, and decreases with the sample size n and the signal (1/n)∑_{i=1}^n (xik − x̄k)². Since the order of the variables is arbitrary and any could be placed in the k-th position, var(β̂j) is given by the same expression with j replacing k.

Thus, as the collinearity becomes more and more extreme:

• The OLS estimates of the coefficients on the collinear terms become indeterminate. This is just a manifestation of the difficulties in obtaining (X′X)−1.

• The OLS coefficients on the collinear terms become infinitely variable. Their variances become very large as R²k• → 1.

• The OLS estimates are still BLUE and with normal disturbances BUE.Thus, any unbiased estimator will be afflicted with the same problems.

Collinearity does not affect our estimate s² of σ². This is easy to see, since we have shown that

(n − k)s²/σ² ∼ χ²n−k (10.22)


regardless of the values of X, provided X′X is still nonsingular. This is to be contrasted with β̂, where

β̂ ∼ N(β, σ²(X′X)−1) (10.23)

clearly depends on X, and more particularly on the near non-invertibility of X′X.

10.2.2 For Inferences

Provided collinearity does not become extreme, we still have the ratios (β̂j − βj)/√(s²djj) ∼ tn−k, where djj = [(X′X)−1]jj. Although β̂j becomes highly variable as collinearity increases, djj grows correspondingly larger, thereby compensating. Thus under H0: βj = β⁰j, we find (β̂j − β⁰j)/√(s²djj) ∼ tn−k, as is the case in the absence of collinearity. This result, that the null distribution of the ratios is not impacted as collinearity becomes more extreme, seems not to be fully advertised in most texts.

The inferential price exacted by collinearity is loss of power. Under H1: βj = β¹j ≠ β⁰j, we can write

(β̂j − β⁰j)/√(s²djj) = (β̂j − β¹j)/√(s²djj) + (β¹j − β⁰j)/√(s²djj). (10.24)

The first term will continue to follow a tn−k distribution, as argued in the previous paragraph, as collinearity becomes more extreme. However, the second term, which represents a "shift" term, will grow smaller as collinearity becomes more extreme and djj becomes larger. Thus we are less likely to shift the statistic into the tail of the ostensible null distribution and hence less likely to reject the null hypothesis. Formally,

(β̂j − β⁰j)/√(s²djj) ∼ tn−k( (β¹j − β⁰j)/√(σ²djj) ) (10.25)

and the noncentrality parameter becomes smaller and smaller as collinearity becomes more extreme.

Alternatively, the inferential impact can be seen through the impact on confidence intervals. Using the standard approach discussed in the previous chapter, we have [β̂j − a√(s²djj), β̂j + a√(s²djj)] as the 95% confidence interval, where a is the critical value for a .025 tail. Note that as collinearity becomes more extreme and djj becomes larger, the width of the interval becomes larger as well. Thus we see that the estimates are consistent with a larger and larger set of null hypotheses as the collinearity strengthens. In the limit the interval is consistent with any null hypothesis and we have zero power.

We should emphasize that collinearity does not always cause problems with power. The noncentrality parameter in (10.25) can be written

(β¹j − β⁰j)/√(σ²djj) = √n (β¹j − β⁰j) / √( σ² / ((1/n)∑_{i=1}^n (xij − x̄j)²(1 − R²j•)) ),

which clearly depends on factors other than the degree of collinearity. The size of the noncentrality increases with the sample size √n, the difference between the null and alternative hypotheses (β¹j − β⁰j), and the signal-noise ratio (1/n)∑_{i=1}^n (xij − x̄j)²/σ². It is entirely possible for R²j• to be close to one and the noncentrality to still be large. The important question is not whether collinearity is present or extreme but whether it is extreme enough to eliminate the power of our test. This is also a phenomenon that does not seem to be fully appreciated or well-enough advertised in most texts.

We can easily tell when collinearity is not a problem if the coefficients are significant or we reject the null hypothesis under consideration. Only if apparently important variables are insignificantly different from zero or have the wrong sign should we consider the possibility that collinearity is causing problems.

10.2.3 For Prediction

If all we are interested in is prediction of yp given xp1, xp2, . . . , xpk, then we are not particularly interested in whether or not we have isolated the individual effects of each xij. We are interested in predicting the total effect or variation in y.

A good measure of how well the linear relationship captures the total effect or variation is the overall R² statistic. But the R² value is related to s² by

R² = 1 − e′e/((y − ȳ)′(y − ȳ)) = 1 − (n − k)s²/((y − ȳ)′(y − ȳ)), (10.26)

which does not depend upon the collinearity of X.

Thus, we can expect our regressions to predict well, despite collinearity and insignificant coefficients, provided the R² value is high. This depends, of course, upon the collinearity continuing to persist in the future. If the collinearity does not continue, then prediction will become increasingly uncertain. Such uncertainty will be reflected, however, in the estimated standard errors of the forecast and hence wider forecast intervals.

10.2.4 An Illustrative Example

As an illustration of the problems introduced by collinearity, consider the consumption equation

Ct = β0 + β1Yt + β2Wt + ut, (10.27)

where Ct is consumption expenditures at time t, Yt is income at time t, and Wt is wealth at time t. Economic theory suggests that the coefficient on income should be slightly less than one and the coefficient on wealth should be positive. The time-series data for this relationship are given in the following table:


 Ct     Yt     Wt
 70     80     810
 65    100    1009
 90    120    1273
 95    140    1425
110    160    1633
115    180    1876
120    200    2052
140    220    2201
155    240    2435
150    260    2686

Table 9.1: Consumption Data

Applying least squares to this equation and data yields

Ĉt = 24.775 + 0.942 Yt − 0.042 Wt + et,
    (6.752)   (0.823)    (0.081)

where estimated standard errors are given in parentheses. Summary statistics for the regression are: SSE = 324.446, s² = 46.35, and R² = 0.9635. The coefficient estimate for the marginal propensity to consume seems to be a reasonable value; however, it is not significantly different from either zero or one. And the coefficient on wealth is negative, which is not consistent with economic theory. Wrong signs and insignificant coefficient estimates on a priori important variables are the classic symptoms of collinearity. As an indicator of the possible collinearity, the squared correlation between Yt and Wt is .9979, which suggests near extreme collinearity among the explanatory variables.
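The reported results can be checked with a short numpy sketch (a plain least squares fit of (10.27) on the data in Table 9.1; purely illustrative, and it should reproduce the estimates and summary statistics above up to rounding).

```python
import numpy as np

# Consumption data from Table 9.1.
C = np.array([70, 65, 90, 95, 110, 115, 120, 140, 155, 150], dtype=float)
Y = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
W = np.array([810, 1009, 1273, 1425, 1633, 1876, 2052, 2201, 2435, 2686], dtype=float)

X = np.column_stack([np.ones_like(Y), Y, W])
n, k = X.shape

beta = np.linalg.solve(X.T @ X, X.T @ C)             # OLS coefficients
e = C - X @ beta                                     # residuals
s2 = e @ e / (n - k)                                 # s^2
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))   # estimated standard errors
R2 = 1 - (e @ e) / np.sum((C - C.mean()) ** 2)

print(beta)        # intercept, income, and wealth coefficients
print(se, s2, R2)
```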

10.3 Detecting Multicollinearity

10.3.1 When Is Multicollinearity a Problem?

Suppose the regression yields significant coefficients; then collinearity is not a problem, even if present. On the other hand, if a regression has insignificant coefficients, this may be due to collinearity or to the fact that the variables do not, in fact, enter the relationship.

10.3.2 Zero-Order Correlations

If we have a trivariate relationship, say

yt = β1 + β2xt2 + β3xt3 + ut, (10.28)


we can look at the zero-order correlation between x2 and x3. As a rule of thumb, if this (squared) value exceeds the R² of the original regression, then we have a problem of collinearity. If r23 is low, then the regression is likely insignificant.

In the previous example, r²WY = 0.9979, which indicates that Yt is more highly related to Wt than to Ct, and we have a problem. In effect, the variables are so closely related that the regression has difficulty untangling the separate effects of Yt and Wt.

In general (k > 3), when one of the zero-order correlations between the x's is large relative to R² we have a problem.

10.3.3 Partial Regressions

In the general case (k > 3), even if all the zero-order correlations are small, we may still have a problem. For while x1 may not be strongly linearly related to any single xi (i ≠ 1), it may be very highly correlated with some linear combination of the x's.

To test for this possibility, we should run regressions of each xi on all the other x's. If collinearity is present, then one of these regressions will have a high R² (relative to the R² of the complete regression).

For example, when k = 4 and

yt = β1 + β2xt2 + β3xt3 + β4xt4 + ut (10.29)

is the regression, then collinearity is indicated when one of the partial regressions

xt2 = α1 + α3xt3 + α4xt4

xt3 = γ1 + γ2xt2 + γ4xt4

xt4 = δ1 + δ2xt2 + δ3xt3 (10.30)

yields a large R2 relative to the complete regression.

10.3.4 Variance Inflation Factor.

A more informative use of the partial regressions is to directly calculate the impact of collinearity on the variance of the coefficient in question. We know that the variance of coefficient j will be inflated by the factor

vj = 1/(1 − R²j•) (10.31)

as a result of the correlation structure among the explanatory variables. This is sometimes called the variance inflation factor (VIF). If we have the usual unbiased variance estimator for xij, say s²xj = (1/(n−1))∑_{i=1}^n (xij − x̄j)², then we do not need to perform a partial regression but can alternatively calculate the VIF as

vj = (sβ̂j)² (n − 1) s²xj / s², (10.32)

where sβ̂j denotes the usual reported estimated standard error of β̂j.


The square root of this factor gives us a direct measure of how much the estimated standard errors have been inflated as a result of collinearity. In cases of near extreme collinearity these factors will sometimes be in the hundreds or thousands. If it is less than, say, 4, then the standard errors are less than doubled. For the consumption example given above it is 476, which indicates that collinearity between income and wealth is a substantial problem for the precision of the estimates of both the income and wealth coefficients.
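A short sketch of the two routes to the VIF discussed above, continuing the numerical example of Section 10.2.4 (it reuses Y, W, se, and s2 from that snippet; nothing here is prescribed by the text):

```python
import numpy as np

# Route 1: auxiliary regression of W on Y, then 1/(1 - R^2).
Zaux = np.column_stack([np.ones_like(Y), Y])
coef = np.linalg.lstsq(Zaux, W, rcond=None)[0]
resid = W - Zaux @ coef
r2_aux = 1 - resid @ resid / np.sum((W - W.mean()) ** 2)
vif_W = 1.0 / (1.0 - r2_aux)

# Route 2: formula (10.32), using the estimated standard error of the
# wealth coefficient, the sample variance of W, and s^2.
n = len(W)
s2_xW = W.var(ddof=1)
vif_W_alt = se[2] ** 2 * (n - 1) * s2_xW / s2

print(vif_W, vif_W_alt)   # both should be roughly 1/(1 - .9979), i.e. near 476
```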

10.3.5 The F Test

The manifestation of collinearity is that estimators become insignificantly different from zero, due to the inability to untangle the separate effects of the collinear variables. If the insignificance is due to collinearity, the total effect is not confused, as evidenced by the fact that s² is unaffected.

A formal test, accordingly, is to examine whether the total effect of the insignificant (possibly collinear) variables is significant. Thus, we perform an F test of the joint hypothesis that the individually insignificant variables are jointly insignificant.

For example, if the regression

yt = β1 + β2xt2 + β3xt3 + β4xt4 + ut (10.33)

yields insignificant (from zero) estimates of β2, β3, and β4, we use an F test of the joint hypothesis β2 = β3 = β4 = 0. If we reject this joint hypothesis, then the total effect is strong but the individual effects are confused. This is evidence of collinearity. If we accept the null, then we are forced to conclude that the variables are, in fact, insignificant.

For the consumption example considered above, a test of the null hypothesis that the collinear terms (income and wealth) are jointly zero yields an F-statistic value of 92.40, which is very extreme under the null, where the statistic has an F2,7 distribution. Thus the variables are individually insignificant but jointly significant, which indicates that collinearity is, in fact, a problem.
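A short sketch of that joint test for the consumption example, continuing the earlier snippets (it reuses C, e, n, and k from the least squares fit; the restricted model here is the intercept-only regression):

```python
import numpy as np

sse_u = e @ e                               # unrestricted SSE
sse_r = np.sum((C - C.mean()) ** 2)         # restricted SSE: intercept only
q = 2                                       # restrictions: income and wealth both zero
F = ((sse_r - sse_u) / q) / (sse_u / (n - k))
print(F)                                    # should be near the reported 92.40, F(2, 7)
```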

10.3.6 The Condition Number

Belsley, Kuh, and Welsch (1980) suggest an approach that considers the invertibility of X directly. First, we transform each column of X so that the columns are of similar scale in terms of variability, by scaling each column to unit length:

x∗j = xj/√(x′jxj) (10.34)

for j = 1, 2, ..., k. Next we find the eigenvalues of the moment matrix of the so-transformed data matrix by finding the k roots of

det(X∗′X∗ − λIk) = 0. (10.35)

Note that since X∗′X∗ is positive semi-definite the eigenvalues will be nonnegative, with values of zero in the event of singularity and close to zero in the event of near singularity. The condition number of the matrix is taken as the ratio of the largest to the smallest of the eigenvalues:

c = λmax/λmin. (10.36)

Using an analysis of a number of problems, BKW suggest that collinearity is a possible issue when c ≥ 20. For the example the condition number is 166.245, which indicates a very poorly conditioned matrix. Although this approach tells a great deal about the invertibility of X′X and hence the signal, it tells us nothing about the noise level relative to the signal.
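A small numerical sketch of the condition-number computation in (10.34)-(10.36), continuing the earlier snippets and reusing the design matrix X. Note that Belsley, Kuh, and Welsch commonly report the ratio of singular values, which is the square root of the eigenvalue ratio defined in (10.36); the code below follows the definition given in the text.

```python
import numpy as np

Xs = X / np.sqrt(np.sum(X ** 2, axis=0))   # scale each column to unit length
eigvals = np.linalg.eigvalsh(Xs.T @ Xs)    # eigenvalues of the scaled moment matrix
c = eigvals.max() / eigvals.min()          # condition number as defined in (10.36)
print(c)
```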

10.4 Correcting For Collinearity

10.4.1 Additional Observations

Professor Goldberger has quite aptly described multicollinearity as "micronumerosity", or not enough observations. Recall that the shift term depends on the difference between the null and alternative, the signal-noise ratio, and the sample size. For a given signal-noise ratio, unless collinearity is extreme, it can always be overcome by increasing the sample size sufficiently. Moreover, we can sometimes gather more data that, hopefully, will not suffer the collinearity problem. With designed experiments and cross-sections this is particularly the case. With time series data this is not feasible, and in any event gathering more data is time-consuming and expensive.

10.4.2 Independent Estimation

Sometimes we can obtain outside estimates. For example, in the Ando-Modigliani consumption equation

Ct = β0 + β1Yt + β2Wt + ut, (10.37)

we might have a cross-sectional estimate of β1, say β̃1. Then

(Ct − β̃1Yt) = β0 + β2Wt + ut (10.38)

becomes the new problem. Treating β̃1 as known allows estimation of β2 with an increase in precision. It would not reduce the precision of the estimate of β1, which would simply be the cross-sectional estimate. The implied error term, moreover, is more complicated, since it includes the estimation error (β1 − β̃1)Yt, which may be correlated with Wt. Mixed estimation approaches should be used to handle this approach carefully. Note that this is another way to gather more data.

10.4.3 Prior Restrictions

Consider the consumption equation from Klein’s Model I:

Ct = β0 + β1Pt + β2Pt−1 + β3Wt + β4W′t + ut, (10.39)


where Ct is consumption expenditure, Pt is profits, Wt is the private wage bill, and W′t is the government wage bill.

Due to market forces, Wt and W′t will probably move together, and collinearity will be a problem for β3 and β4. However, there is no prior reason to discriminate between Wt and W′t in their effect on Ct. Thus it is reasonable to suppose that Wt and W′t impact Ct in the same way. That is, β3 = β4 = β. The model is now

Ct = β0 + β1Pt + β2Pt−1 + β(Wt + W′t) + ut, (10.40)

which should avoid the collinearity problem.

10.4.4 Ridge Regression

One manifestation of collinearity is that the affected estimates, say β̂1, will be extreme with a high probability. Thus,

∑_{i=1}^k β̂²i = β̂²1 + β̂²2 + · · · + β̂²k = β̂′β̂ (10.41)

will be large with a high probability.

By way of treating the disease by treating its symptoms, we might restrict β′β to be small. Thus, we might reasonably solve

min_β (y − Xβ)′(y − Xβ) subject to β′β ≤ m. (10.42)

Form the Lagrangian (since β′β is large, we must impose the restriction withequality).

L = ( y −Xβ )′( y −Xβ ) + λ(m− β′β )

=

n∑

t=1

(yt −

k∑

i=1

βixti

)2

+ +λ(m−k∑

i=1

β2i ). (10.43)

The first-order conditions yield

∂L/∂βj = −2∑_t (yt − ∑_i βixti)xtj + 2λβj = 0, (10.44)

or

∑_t ytxtj = ∑_t ∑_i xtixtj β̃i + λβ̃j = ∑_i β̃i ∑_t xtixtj + λβ̃j, (10.45)

for j = 1, 2, . . . , k. In matrix form, we have

X′y = (X′X + λIk)β̃. (10.46)


So, we have

β̃ = (X′X + λIk)−1X′y. (10.47)

This is called ridge regression. Substitution yields

β̃ = (X′X + λIk)−1X′y
  = (X′X + λIk)−1X′(Xβ + u)
  = (X′X + λIk)−1X′Xβ + (X′X + λIk)−1X′u (10.48)

and

E[β̃] = (X′X + λIk)−1X′Xβ = Pβ, (10.49)

so ridge regression is biased. Rather obviously, as λ grows large, the expectation "shrinks" towards zero, so the bias is towards zero. Next, we find that

var(β̃) = σ²(X′X + λIk)−1X′X(X′X + λIk)−1 = σ²Q < σ²(X′X)−1. (10.50)

If u ∼ N(0, σ²In), then

β̃ ∼ N(Pβ, σ²Q) (10.51)

and inferences are possible only for Pβ, and hence only for the complete vector.

The rather obvious question in using ridge regression is: what is the best choice for λ? We seek to trade off the increased bias against the reduction in the variance. This may be done by considering the mean squared error (MSE), which is given by

MSE(β̃) = σ²Q + (P − Ik)ββ′(P − Ik)′
        = (X′X + λIk)−1[σ²X′X + λ²ββ′](X′X + λIk)−1.

We might choose to minimize the determinant or the trace of this matrix. Note that either is a decreasing function of λ through the inverses and an increasing function through the term in brackets. Note also that the minimand depends on the true unknown β, which makes it infeasible.

In practice, it is useful to obtain what is called a ridge trace, which plots the estimates, estimated standard errors, and estimated square root of mean squared error (SMSE) as functions of λ. Problematic terms will frequently display a change of sign and a dramatic reduction in the SMSE. If this phenomenon occurs at a sufficiently small value of λ, then the bias will be small, the inflation in SMSE relative to the standard error will be small, and we can conduct inference in something like the usual fashion. In particular, if the estimate of a particular coefficient seems to be significantly different from zero despite the bias toward zero, we can reject the null that it is zero.
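A minimal ridge-trace sketch following (10.47), computed on the consumption data (it reuses C, Y, W from the earlier snippet; centering and standardizing the regressors before penalizing is an added assumption of the sketch, not something prescribed in the text):

```python
import numpy as np

def ridge_path(X, y, lambdas):
    """Ridge estimates (X'X + lambda I)^{-1} X'y over a grid of lambda values."""
    XtX = X.T @ X
    Xty = X.T @ y
    k = X.shape[1]
    return np.array([np.linalg.solve(XtX + lam * np.eye(k), Xty)
                     for lam in lambdas])

Z = np.column_stack([Y, W])
Zs = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)   # standardized regressors
y_c = C - C.mean()                                  # centered dependent variable
lambdas = np.logspace(-3, 2, 20)
trace = ridge_path(Zs, y_c, lambdas)                # one row of coefficients per lambda
print(trace[0], trace[-1])                          # estimates at the smallest and largest lambda
```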


Chapter 11

Stochastic Explanatory Variables

11.1 Nature of Stochastic X

In previous chapters, we made the assumption that the x's are nonstochastic, which means they are not random variables. This assumption was motivated by the control variables in controlled experiments, where we can choose the values of the independent variables. Such a restriction allows us to focus on the role of the disturbances in the process and was most useful in working out the stochastic properties of the estimators and other statistics. Unfortunately, economic data do not usually come to us in this form. In fact, the independent variables are typically random variables, much like the dependent variable, whose values are beyond the control of the researcher.

Consequently we will restate our model and assumptions with an eye toward stochastic x. The model is

yi = x′iβ + ui for i = 1, 2, ..., n. (11.1)

The assumptions with respect to the unconditional moments of the disturbances are the same as before:

(i) E[ui] = 0

(ii) E[u2i ] = σ2

(iii) E[uiuj ] = 0, j 6= i

The assumptions with respect to x must be modified. We replace the assumption of nonstochastic x with an assumption regarding the joint stochastic behavior of ui and xi, which are taken to be jointly i.i.d. Several alternative cases will be introduced regarding the degree of dependence between xi and ui. For stochastic x, the assumption of linearly independent x's implies that the covariance matrix of the x's has full rank and is hence positive definite. Stated formally, we have:

(iv) (ui, xi) jointly i.i.d., with a dependence assumption to be specified below

(v) E[xix′i] = Q p.d.

Notice that the assumption of normality, which was introduced in previous chapters to facilitate inference, has not generally been reintroduced. It will be introduced later, but only for one of the dependence cases. Thus we are effectively relaxing both the nonstochastic regressor and normality assumptions at the same time, except for one case. The motivation for dispensing with the normality assumption will become apparent presently.

We will now examine the various alternative assumptions that will be entertained with respect to the degree of dependence between xi and ui. Beyond giving a complete treatment of the alternatives, there are good reasons to consider each of them at some length.

11.1.1 Independent X

The strongest assumption we can make relative to this relationship is that ui is stochastically independent of xi, so Assumption (iv) becomes

(iv,a) (ui, xi) jointly i.i.d. with ui independent of xi.

This means that the distribution of ui depends in no way on the value of xi, and vice versa. Note that

cov( g(xi), h(ui) ) = 0 (11.2)

for any functions g(·) and h(·), in this case. When combined with the normality assumption below, this alternative is the only dependence assumption under which the inferential results obtained in Chapter 8 continue to apply in finite samples.

11.1.2 Conditional Zero Mean

The next strongest assumption is that ui is mean independent of xi, or E[ui|xi] = 0. We also impose variance independence, so Assumption (iv) becomes

(iv,b) (ui, xi) jointly i.i.d. with E[ui|xi] = 0, E[u²i|xi] = σ².

Note that this assumption implies

cov( g(xi), ui ) = 0 (11.3)

for any function g(·). The independence assumption (iv,a), along with the unconditional statements E[ui] = 0 and E[u²i] = σ², implies conditional zero mean and constant conditional variance, but not the reverse.

We could add conditional normality to this assumption, so ui|xi ∼ N(0, σ²), but this implies that ui is independent of xi, so we are really in the previous case. The usual inferences developed in Chapter 8 will be problematic in finite samples for the current case without normality, but turn out to be appropriate in large samples whether or not the disturbances are normal. Moreover, the least squares estimator will be unbiased and enjoy optimality properties within the class of unbiased estimators.

This assumption is motivated by supposing that our model is simply a statement of conditional expectation, E[yi|xi] = x′iβ, and it sometimes will not be accompanied by the conditional second moment assumption E[u²i|xi] = σ². In this case of nonconstant variance, which will be considered at length in Chapter 13, the large-sample behavior of least squares is the same as for the uncorrelated case considered next.

11.1.3 Uncorrelated X

A weaker assumption is that xi and ui are only uncorrelated, so Assumption (iv) becomes

(iv,c) (ui, xi) jointly i.i.d. with E[xiui] = 0.

This assumption only implies zero covariance between the levels of xi and ui, or, for one element of xi,

cov( xij, ui ) = 0. (11.4)

The properties of β̂ are less accessible in this case. Note that conditional zero mean always implies uncorrelatedness, but not the reverse. It is possible to have random variables that are uncorrelated where neither has constant conditional mean given the other.

This assumption is the weakest of the alternatives under which least squares will continue to be consistent. In general, the conditional second moment will also be nonconstant in this case. Such nonconstant conditional variance is called heteroskedasticity and will be studied at length in Chapter 13.

11.1.4 Correlated X

A priori information sometimes suggests the possibility that ui is correlated with xi, so Assumption (iv) becomes

(iv,d) (ui, xi) jointly i.i.d. with E[xiui] = d, d ≠ 0.

Stated another way,

E[xijui] ≠ 0 (11.5)

for some j. As we shall see below, this can have quite serious implications for the OLS estimates. Consequently we will spend considerable time developing possible solutions and tests for this condition, which is a violation of (iv,c).

Examples of models which are afflicted with this difficulty abound in the applied econometric literature. A leading example is the case of simultaneous equations models, which we will examine later in this chapter. A second leading example occurs when our right-hand-side variables are measured with error. Suppose

yi = α+ βxi + ui (11.6)

is the true model but

x∗i = xi + vi (11.7)

is the only available measurement of xi. If we use x∗i in our regression, then we are estimating the model

yi = α+ β(x∗i − vi ) + ui

= α+ βx∗i + (ui − βvi ). (11.8)

Now, even if the measurement error vi were uncorrelated with the disturbance ui, the right-hand-side variable x∗i will be correlated with the effective disturbance (ui − βvi).
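The attenuation effect of measurement error can be illustrated with a small simulation sketch (the parameter values and sample size are arbitrary assumptions, not from the text):

```python
import numpy as np

# Errors-in-variables simulation for (11.6)-(11.8): regressing y on the
# mismeasured x* biases the slope toward zero, and the bias persists as n grows.
rng = np.random.default_rng(42)
n = 100_000
alpha, beta = 1.0, 2.0
x = rng.normal(0.0, 1.0, n)              # true regressor
u = rng.normal(0.0, 1.0, n)              # structural disturbance
v = rng.normal(0.0, 1.0, n)              # measurement error
y = alpha + beta * x + u
x_star = x + v                           # observed, error-ridden regressor

X = np.column_stack([np.ones(n), x_star])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
# With these variances the plim of the slope is beta*var(x)/(var(x)+var(v)) = 1.0, not 2.0.
print(b_ols[1])
```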

11.2 Consequences of Stochastic X

11.2.1 Consequences for OLS Estimation

Recall that

β̂ = (X′X)−1X′y
  = (X′X)−1X′(Xβ + u)
  = β + (X′X)−1X′u
  = β + ((1/n)X′X)−1 (1/n)X′u
  = β + ( (1/n)∑_{j=1}^n xjx′j )−1 ( (1/n)∑_{i=1}^n xiui ). (11.9)

We will now examine the bias and consistency properties of the estimators under the alternative dependence assumptions.

11.2.1.1 Uncorrelated X

Suppose, under Assumption (iv,c), that xi is only assured of being uncorrelated with ui. Rewrite the second term in (11.9) as

( (1/n)∑_{j=1}^n xjx′j )−1 ( (1/n)∑_{i=1}^n xiui ) = (1/n)∑_{i=1}^n [ ( (1/n)∑_{j=1}^n xjx′j )−1 xi ] ui = (1/n)∑_{i=1}^n wiui.

Note that wi is a function of xi as well as all the other xj, and is nonlinear in xi. Now ui is uncorrelated with the xj for j ≠ i by independence, and with the level of xi by assumption, but it is not necessarily uncorrelated with this nonlinear function of xi. Thus, if the expectation exists,

E[(X′X)−1X′u] ≠ 0, (11.10)

in general, whereupon

E[β̂] = β + E[(X′X)−1X′u] ≠ β. (11.11)

Similarly, we find E[s²] ≠ σ². Thus both β̂ and s² will be biased, although the bias may disappear asymptotically as we will see below. Note that sometimes these expectations are not well defined.

Now, the elements of xix′i and xiui are i.i.d. random variables with expectations Q and 0, respectively. Thus, the law of large numbers guarantees that

(1/n)∑_{i=1}^n xix′i →p E[xix′i] = Q, (11.12)

and

(1/n)∑_{i=1}^n xiui →p E[xiui] = 0. (11.13)

It follows that

plim_{n→∞} β̂ = β + plim_{n→∞}[ ((1/n)X′X)−1 (1/n)X′u ]
            = β + plim_{n→∞}((1/n)X′X)−1 · plim_{n→∞}(1/n)X′u
            = β + Q−1 · 0 = β. (11.14)

Similarly, we can show that

plim_{n→∞} s² = σ². (11.15)

Thus both β̂ and s² are consistent.

11.2.1.2 Conditional Zero Mean

Suppose Assumption (iv,b) is satisfied; then E[ui|xi] = 0 and, by independence across i, we have E[u|X] = 0. It follows that

E[β̂] = β + E[(X′X)−1X′u] = β + E[(X′X)−1X′E(u|X)] = β + E[(X′X)−1X′ · 0] = β,


and OLS is unbiased. Since conditional zero mean implies uncorrelatedness, we have the same consistency results as before, namely

plim_{n→∞} β̂ = β and plim_{n→∞} s² = σ². (11.16)

Now E[u²i|xj] = σ² for i ≠ j by the i.i.d. assumption and E[u²i|xi] = σ² by (iv,b); hence E[uu′|X] = σ²In. It follows that

E[(β̂ − β)(β̂ − β)′|X] = σ²(X′X)−1,

and the Gauss-Markov theorem continues to hold in the sense that least squares is BLUE, given X, within the class of estimators that are linear in y with the linear transformation matrix a function only of X.

Under this conditional covariance assumption, we can show unbiasedness, in a similar fashion, for the variance estimator:

E[s²] = E[e′e/(n − k)] = (1/(n − k)) E[u′(In − X(X′X)−1X′)u] = σ². (11.17)

This estimator, being quadratic in u, and hence in y, can be shown to be the best quadratic unbiased estimator (BQUE) of σ².

11.2.1.3 Independent X

Suppose Assumption (iv,a) holds; then xi is independent of ui. Since (i) and (ii) assure that E[ui] = 0 and E[u²i] = σ², we have conditional zero mean and constant conditional variance and the corresponding unbiasedness results

E[β̂] = β and E[s²] = σ², (11.18)

together with the BLUE and BQUE properties and the consistency results

plim_{n→∞} β̂ = β and plim_{n→∞} s² = σ². (11.19)

11.2.1.4 Correlated X

Finally, suppose Assumption (iv,d) applies, so that xi is correlated with ui and

E[xiui] = d ≠ 0. (11.20)

Obviously, since xi is correlated with ui there is no reason to believe E[(X′X)−1X′u] = 0, so

E[β̂] ≠ β


and the OLS estimator will, in general, be biased. Moreover, this bias will not disappear in large samples, as we will see below. Turning to the possibility of consistency, we see, by the law of large numbers, that

(1/n)∑_{i=1}^n xiui →p E[xiui] = d, (11.21)

whereupon

plim_{n→∞} β̂ = β + Q−1d ≠ β, (11.22)

since Q−1 is nonsingular and d is nonzero. Thus OLS is also inconsistent. It follows that s² will also be biased and inconsistent.

11.2.2 Consequences for Inferences

In previous chapters, the assumption of unconditionally normal disturbances, specifically

u ∼ N(0, σ²In),

was introduced to facilitate inferences. Together with the nonstochastic regressor assumption, it implied that the distribution of the least squares estimator,

β̂ = β + (X′X)−1X′u,

which is linear in the disturbances, is unconditionally normal, specifically

β̂ ∼ N(β, σ²(X′X)−1).

All the inferential results from Chapter 8 based on the t and F distributions followed directly.

If the x's are random variables, however, the unconditional normality of the disturbances is not sufficient to obtain these results. Accordingly, we either strengthen the assumption to conditional normality or abandon normality completely. In this section, we consider these two alternatives respectively for the independence (iv,a) and conditional zero mean (iv,b) cases. The other two cases will be treated in the next section in a more general context, and normality will not be assumed.

11.2.2.1 Conditional Normality - Finite Samples

We first consider the normality assumption for the independence case. Specifically, we add the assumption

(vi) ui ∼ N(0, σ²).

This assumption, together with Assumption (iv,a) for the independent case, implies

ui|xi ∼ N(0, σ²).


Moreover,

ui|xj ∼ N(0, σ²)

for i ≠ j, due to the joint i.i.d. assumption. Thus, under (iv,a) and (vi), we have

u|X ∼ N(0, σ²In),

which does not depend on X, and

β̂|X ∼ N(β, σ²(X′X)−1),

which does depend on the conditioning variables X.

Fortunately, the distributions of the statistics we utilize for inference do not depend on the conditioning values. Specifically, it is easy to see that

(β̂j − βj)/√(σ²djj) | X ∼ N(0, 1),

while we can show that

(n − k)s²/σ² | X ∼ χ²n−k

and is conditionally independent of β̂|X. Whereupon, following the arguments in Chapter 8, we find

(β̂j − βj)/√(s²djj) | X ∼ tn−k,

which does not depend on the conditioning values.

Since the resulting distribution does not depend on x, the unconditional distribution of the usual ratio is the same as before. A similar result applies for all the ratios that were found to follow a t distribution in Chapter 8. Likewise, the statistics that were found to follow an unconditional F distribution in that chapter have the same unconditional distribution here. The non-central distributions that applied under the alternative hypotheses there, and depended on X, will also apply here given X. Thus we see that treating the X matrix as given here is essentially the same as treating it as nonstochastic in Chapter 8, which is hardly surprising.

11.2.2.2 Conditional Non-Normality - Large Samples

In the event that u is not conditionally normal, the estimator will not be conditionally normal, the standardized ratios will not be standard normal, and the feasible standardized ratio will not follow the tn−k distribution in finite samples. Fortunately, we can appeal to the central limit theorem for help in large samples.

We shall first develop the large-sample asymptotic distribution of β̂ under Assumption (iv,b), the case of conditional zero mean and constant conditional variance. The limiting behavior is identical under Assumption (iv,a), the independence assumption, since together with Assumptions (i) and (ii) it implies conditional zero mean with constant conditional variance. Recall that

plim_{n→∞}(β̂ − β) = 0 (11.23)

in this case, so in order to have a nondegenerate distribution we consider

√n(β̂ − β) = ((1/n)X′X)−1 (1/√n)X′u. (11.24)

The j-th element of

(1/√n)X′u = (1/√n)∑_{i=1}^n xiui (11.25)

is

(1/√n)∑_{i=1}^n xijui. (11.26)

Note that the xijui are i.i.d. random variables with

E[xijui] = E[E[xijui|xi]] = E[xijE[ui|xi]] = E[xij · 0] = 0 (11.27)

and

E[(xijui)²] = E[x²iju²i] = E[E[x²iju²i|xi]] = E[x²ijE[u²i|xi]] = E[x²ijσ²] = σ²qjj, (11.28)

where qjj = E[x²ij] is the j-th diagonal element of Q. Thus, according to the central limit theorem,

(1/√n)∑_{i=1}^n xijui →d N(0, σ²qjj). (11.29)

And, more generally,

(1/√n)∑_{i=1}^n xiui = (1/√n)X′u →d N(0, σ²Q). (11.30)

Since (1/n)X′X converges in probability to the fixed matrix Q, we have

√n(β̂ − β) = ((1/n)X′X)−1 (1/√n)X′u →d N(0, σ²Q−1). (11.31)

For inferences,

√n(β̂j − βj) →d N(0, σ²[Q−1]jj) (11.32)

and

√n(β̂j − βj)/√(σ²[Q−1]jj) →d N(0, 1). (11.33)


Unfortunately, neither σ² nor Q−1 is available, so we will have to substitute estimators. We can use s² as a consistent estimator of σ² and

Q̂ = (1/n)X′X (11.34)

as a consistent estimator of Q. Substituting, we have

√n(β̂j − βj)/√(s²[Q̂−1]jj) = √n(β̂j − βj)/√(s²[((1/n)X′X)−1]jj) = (β̂j − βj)/√(s²djj) →d N(0, 1). (11.35)

Thus, the usual statistics we use in conducting t-tests are asymptotically standard normal. This is particularly convenient since the t-distribution converges to the standard normal. The small-sample inferential procedures we learned for the nonstochastic regressor case are appropriate in large samples for the stochastic regressor case, under conditional zero mean and constant conditional variance. And this will be true whether or not conditional normality applies.

In a similar fashion, we can show that the approach introduced in previous chapters for inference on complex hypotheses, which had an F-distribution under normality with nonstochastic regressors, continues to be appropriate in large samples with non-normality and stochastic regressors. For example, consider again the model

y = X1β1 + X2β2 + u (11.36)

with H0: β2 = 0 and H1: β2 ≠ 0. Regression on this unrestricted model yields SSEu, while regression on the restricted model y = X1β1 + u yields SSEr. We form the statistic [(SSEr − SSEu)/k2]/[SSEu/(n − k)], where k is the unrestricted number of regressors and k2 is the number of restrictions. Under conditional normality this statistic will have an Fk2,n−k distribution. Note that asymptotically, as n becomes large, the denominator converges to σ² and the Fk2,n−k distribution converges to a χ²k2 distribution (divided by k2). But, following the arguments of the previous paragraph, this would also be the limiting distribution of this statistic under the conditional zero mean and constant conditional variance assumption even if the disturbances are non-normal.

11.3 Correcting for Correlated X

11.3.1 Instruments

Consider

y = Xβ + u (11.37)

and premultiply by X′ to obtain

X′y = X′Xβ + X′u,
(1/n)X′y = (1/n)X′Xβ + (1/n)X′u. (11.38)


If X is uncorrelated with u, as in Assumption (iv,c), then the last term disappears in large samples:

plim_{n→∞}(1/n)X′y = plim_{n→∞}(1/n)X′Xβ, (11.39)

which may be solved to obtain

β = plim_{n→∞}[ ((1/n)X′X)−1 (1/n)X′y ] = plim_{n→∞}(X′X)−1X′y = plim_{n→∞} β̂. (11.40)

Of course, if X is correlated with u, as in Assumption (iv,d), then

plim_{n→∞}(1/n)X′u = d and plim_{n→∞} β̂ ≠ β. (11.41)

Suppose we can find similarly dimensioned i.i.d. variables zi that are uncorrelated with ui, so that

E[ziui] = 0. (11.42)

Also, suppose that the zi are correlated with xi, so that

E[zix′i] = P (11.43)

with P nonsingular. Such variables are known as instruments for the variables xi. Note that some of the elements of xi may be uncorrelated with ui, in which case the analogous elements of zi will be the same and only the elements corresponding to correlated variables are replaced. Examples will be presented shortly.

We can now summarize these conditions, plus some second-moment conditions that will be needed below, by expanding Assumptions (iv,d) and (v) as

(iv,d) (ui, xi, zi) jointly i.i.d. with E[xiui] = d, E[ziui] = 0.

(v) E[xix′i] = Q, E[zix′i] = P, E[ziz′i] = N, E[u²ixix′i] = M, E[u²iziz′i] = G.

Note that for OLS estimation zi = xi and P = Q, and that for OLS under case (c), d = 0 and G = M.

11.3.2 Instrumental Variable (IV) Estimation

Suppose that, analogous to OLS, we premultiply (11.37) by

Z′ = (z1, z2, . . . , zn) (11.44)

to obtain

Z′y = Z′Xβ + Z′u,
(1/n)Z′y = (1/n)Z′Xβ + (1/n)Z′u. (11.45)


But since E[ziui] = 0, then

plim_{n→∞}(1/n)Z′u = plim_{n→∞}(1/n)∑_{i=1}^n ziui = 0, (11.46)

so

plim_{n→∞}(1/n)Z′y = plim_{n→∞}(1/n)Z′Xβ (11.47)

or

β = plim_{n→∞}[ ((1/n)Z′X)−1 (1/n)Z′y ] = plim_{n→∞}(Z′X)−1Z′y = plim_{n→∞} β̃, (11.48)

where

β̃ = (Z′X)−1Z′y (11.49)

is defined as the instrumental variable (IV) estimator. Note that OLS is an IV estimator with X chosen as the instruments.
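A compact sketch of the IV estimator in (11.49), together with the "standard" covariance estimate discussed later in this section (it reuses y, X, and x from the measurement-error simulation above; using a second noisy measurement as the instrument is an assumption of the sketch, not of the text):

```python
import numpy as np

def iv_estimate(y, X, Z):
    """IV estimator (Z'X)^{-1} Z'y with covariance sigma2 (Z'X)^{-1} Z'Z (X'Z)^{-1}."""
    ZtX = Z.T @ X
    beta_iv = np.linalg.solve(ZtX, Z.T @ y)
    u_hat = y - X @ beta_iv
    sigma2 = u_hat @ u_hat / len(y)
    ZtX_inv = np.linalg.inv(ZtX)
    cov = sigma2 * ZtX_inv @ (Z.T @ Z) @ ZtX_inv.T
    return beta_iv, cov

# Instrument: a second, independent noisy measurement of the true regressor x.
rng = np.random.default_rng(7)
z = x + rng.normal(0.0, 1.0, len(y))
Z = np.column_stack([np.ones(len(y)), z])
beta_iv, cov_iv = iv_estimate(y, X, Z)
print(beta_iv[1], np.sqrt(np.diag(cov_iv)))   # slope close to the true value 2.0
```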

It is instructive to look at a simple example to get a better idea of what is meant by instrumental variables. Consider again the measurement error model (11.6)-(11.8). Suppose that yi is the wage received by an individual and xi is the unobservable variable ability. We have a measurement (with error) of xi, say x∗i, which is the score on an IQ test for the individual, so x′i = (1, x∗i). We have a second, perhaps rougher, measurement (with error) of xi, say zi, which is the score on a knowledge-of-the-work-world test, whereupon z′i = (1, zi). Hopefully, the measurement errors on the regressor x∗i and the instrument zi are uncorrelated, so the conditions for IV to be consistent will be met.

11.3.3 Properties of the IV Estimator

We have just shown that

plim_{n→∞} β̃ = β, (11.50)

so the IV estimator is consistent. In small samples, since

β̃ = (Z′X)−1Z′y = (Z′X)−1Z′(Xβ + u) = β + (Z′X)−1Z′u, (11.51)

we generally have bias, since we are only assured that zi is uncorrelated with ui, not that (Z′X)−1Z′ is uncorrelated with u. The bias, however, will disappear, as demonstrated by the limiting distribution.


Since the estimator is consistent, we need to rescale to obtain the limiting distribution. After some simple rearranging we have

√n(β̃ − β) = ((1/n)Z′X)−1 (1/√n)Z′u (11.52)
           = ((1/n)∑_{i=1}^n zix′i)−1 (1/√n)∑_{i=1}^n ziui. (11.53)

The expression being averaged inside the inverse is i.i.d. and has mean P, so by the law of large numbers (1/n)∑_{i=1}^n zix′i →p E[zix′i] = P. The expression inside the final summation is i.i.d. with mean 0 and covariance G, so, by the central limit theorem, (1/√n)∑_{i=1}^n ziui →d N(0, G). Combining these two results we find

√n(β̃ − β) →d N(0, P−1GP′−1). (11.54)

Note that any bias has disappeared in the limiting distribution. So if the mean of the estimator exists in finite samples, the limit of the bias must be zero and the estimator is asymptotically unbiased.

This limiting normality result for instrumental variables is very general and yields the limiting distribution when Z = X, and hence β̃ = β̂, in each of the cases where OLS was found to be consistent. For independence (Assumption (iv,a)) or conditional zero mean with constant conditional variance (Assumption (iv,b)), we have P = Q and G = σ²Q, whereupon

√n(β̂ − β) →d N(0, σ²Q−1), (11.55)

which is the same as (11.31). For the uncorrelated case (Assumption (iv,c)) or conditional zero mean with non-constant conditional variance, we have P = Q and G = M, whereupon

√n(β̂ − β) →d N(0, Q−1MQ−1). (11.56)

This last result will play an important role in Chapter 13 when we study heteroskedasticity.

In order to make these limiting distribution results operational for inference purposes, we must estimate the covariance matrices. Let

û = y − Xβ̃ (11.57)

be the IV residuals; then, under our assumptions, we can show that

P̂ = (1/n)Z′X →p P,   Ĝ = (1/n)∑_{i=1}^n û²i ziz′i →p G.

So a consistent covariance matrix is provided by P̂−1ĜP̂′−1. For the case where least squares is valid, we have Q̂ from above, and the rather obvious consistent estimator of M is M̂ = (1/n)∑_{i=1}^n û²i xix′i.


In most research that has involved the use of instrumental variables, the disturbances have been assumed to be uncorrelated with the instruments, and the conditional variances (given the values of the instruments) are assumed to be constant. Thus E[u²iziz′i] = G = σ²N and we have

√n(β̃ − β) →d N(0, σ²P−1NP′−1).

Again, the result for OLS under independence or conditional zero mean and constant conditional variance is obtained as a special case when Z = X. Estimators of the covariance components are provided by

N̂ = (1/n)Z′Z →p N,
σ̂² = û′û/n = (1/n)∑_{i=1}^n û²i →p σ². (11.58)

Thus

σ̂²P̂−1N̂P̂′−1 = n · σ̂²[(Z′X)−1Z′Z(X′Z)−1]

is a consistent estimator of the limiting covariance matrix and, with the omission of n, is the standard output of most IV packages.

This covariance estimator (taking account of the scaling by n) can be used in ratios and quadratic forms to obtain asymptotically appropriate statistics. For example,

(β̃j − βj)/√(σ̂²[(Z′X)−1Z′Z(X′Z)−1]jj) →d N(0, 1), (11.59)

which with βj = 0 is the standard ratio printed by IV packages and is asymptotically appropriate for testing the null that the coefficient in question is zero. Similarly,

(Rβ̃ − r)′[σ̂²R(Z′X)−1Z′Z(X′Z)−1R′]−1(Rβ̃ − r) →d χ²q

is asymptotically appropriate for testing the null Rβ = r, where q is the number of restrictions. Note that the scaling by n has cancelled in each case.

11.3.4 Optimal Instruments

The instruments zi cannot be just any variables that are independent of and uncorrelated with ui. They should be as closely related to xi as possible while at the same time remaining uncorrelated with ui.

Looking at the asymptotic covariance matrices P−1GP′−1 or σ²P−1NP′−1, we can see that as zi and xi become unrelated, and hence uncorrelated,

plim_{n→∞}(1/n)Z′X = P (11.60)

goes to zero. The inverse of P consequently grows large and P−1GP′−1 becomes large. Thus the consequence of using zi that are not closely related to xi is


imprecise estimates. It is easy to find variables that are uncorrelated with ui, but sometimes more difficult to find variables that are also sufficiently closely related to xi. Much of the applied econometric literature is dominated by the search for such instruments.

In fact, we can speak of the optimal instruments as being all of xi except the part that is correlated with ui. For models where there is an explicit relationship between ui and xi, the optimal instruments can be found and utilized, at least asymptotically. For example, suppose the data generating process for xi is given by

xi = Πwi + vi, (11.61)

where vi is the part of xi that is linearly related to ui and Πwi is the remainder. If wi is observable, then we can show that a lower bound for the limiting covariance matrix for IV is obtained when we use zi = Πwi as the instruments. If Π is unknown, we can estimate it from the relationship above and use the feasible instruments zi = Π̂wi, which yield the same asymptotically optimal behavior.
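A sketch of the feasible "optimal instrument" construction around (11.61): estimate Π by regressing the regressors on the observable wi and use the fitted values as instruments. With one endogenous regressor and one instrument this is just the familiar two-stage least squares construction and reproduces the simple IV estimate above. The function name and arguments are illustrative, not from the text.

```python
import numpy as np

def feasible_optimal_iv(y, X, Wmat):
    """IV using fitted values Pi_hat * w_i as instruments (a 2SLS-style sketch)."""
    Pi_hat, *_ = np.linalg.lstsq(Wmat, X, rcond=None)   # first stage: X on W
    Zopt = Wmat @ Pi_hat                                # feasible optimal instruments
    return np.linalg.solve(Zopt.T @ X, Zopt.T @ y)

# Reusing y, X, Z from the previous sketch, with W = (1, z) as the exogenous variables:
print(feasible_optimal_iv(y, X, Z))
```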

11.4 Detecting Correlated X

The motivation for using instrumental variables is compelling when the regressors are correlated with the disturbances (d ≠ 0). When the regressors are uncorrelated with the disturbances (d = 0), the motivation for using OLS instead is equally compelling. In fact, for independence (Assumption (iv,a)) and conditional zero mean with constant conditional variance (Assumption (iv,b)), OLS has well demonstrated optimality properties. Accordingly, it behooves us to determine which is the relevant state. We will set this up as a test of the null hypothesis that d = 0 against the alternative that d ≠ 0.

11.4.1 An Incorrect Procedure

With previously encountered problems of OLS we have examined the OLS residuals for signs of the problem. In the present case, where ui being correlated with xi is the problem, we might naturally see if our proxy for ui, the OLS residuals ei, are correlated with xi. Thus, the estimated covariance

(1/n)∑_{i=1}^n xiei = (1/n)X′e (11.62)

might be taken as an indication of any correlation between xi and ui. Unfortunately, one of the properties of OLS guarantees that

X′e = 0 (11.63)

whether or not ui is correlated with xi. Thus, this procedure will not be informative.


11.4.2 A Priori Information

Typically, we know that xi is correlated with ui as a result of the structure of the model, as, for example, in the errors-in-variables model considered above. In such cases, the candidates for instruments are often evident.

Another leading case occurs with simultaneous equation models. For example, consider the consumption equation

Ct = α+ βYt + ut, (11.64)

where income, Yt, is defined by the identity

Yt = Ct +Gt. (11.65)

Substituting (11.64) into (11.65), we obtain

Yt = α+ βYt + ut +Gt, (11.66)

and solving for Yt,

Yt = α/(1 − β) + [1/(1 − β)]Gt + [1/(1 − β)]ut. (11.67)

Rather obviously, Yt is linearly related to, and hence correlated with, ut. A candidate as an instrument for Yt is the exogenous variable Gt.
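A simulation sketch of this simultaneity problem and its IV remedy (the parameter values and sample size are arbitrary assumptions):

```python
import numpy as np

# Keynesian system (11.64)-(11.65): Y_t depends on u_t through the identity,
# so OLS on the consumption function is inconsistent; IV with G_t as the
# instrument recovers the structural slope.
rng = np.random.default_rng(123)
n = 50_000
alpha, beta = 10.0, 0.8
G = rng.uniform(50.0, 150.0, n)                 # exogenous government spending
u = rng.normal(0.0, 5.0, n)
Yt = (alpha + G + u) / (1.0 - beta)             # reduced form (11.67)
Ct = alpha + beta * Yt + u

X = np.column_stack([np.ones(n), Yt])
Z = np.column_stack([np.ones(n), G])
b_ols = np.linalg.lstsq(X, Ct, rcond=None)[0]
b_iv = np.linalg.solve(Z.T @ X, Z.T @ Ct)
print(b_ols[1], b_iv[1])                        # OLS slope biased upward; IV slope near 0.8
```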

11.4.3 An IV Approach

In both the simultaneous equations model and the measurement error model the possibility of the regressors being correlated with the disturbances is only suggested. It is entirely possible that the measured-with-error variables or the right-hand-side endogenous variables are not sufficiently correlated with the disturbances to justify the use of IV. The real question is what is the appropriate metric for "sufficiently correlated". The answer will be based on a careful comparison of the OLS and IV estimators.

Under the null hypothesis d = 0, we know that both OLS and IV will beconsistent, so

plimn→∞

β = β = plimn→∞

β.

Under the alternative hypothesis d ≠ 0, we find that OLS is inconsistent but IV is still consistent, so

plim_{n→∞} β̂ = β + Q⁻¹d ≠ β = plim_{n→∞} β̃.

Thus a test can be formulated on the difference between the two estimators, which converges to Q⁻¹d. Under the null this will be zero but under the alternative it will be nonzero. Recall that √n(β̃ − β) →d N(0, P⁻¹GP⁻¹) and, under the null, √n(β̂ − β) →d N(0, Q⁻¹MQ⁻¹), so we will look at the normalized


difference √n(β̃ − β̂). In order to determine the limiting distribution of the difference, we need to add an additional fourth-moment condition to Assumption (v), specifically E[ui²zix′i] = H. With this assumption, it is straightforward to show, under the null, that

√n(β̃ − β̂) →d N(0, P⁻¹GP⁻¹ − P⁻¹HQ⁻¹ − Q⁻¹HP⁻¹′ + Q⁻¹MQ⁻¹).    (11.68)

Under the alternative hypothesis this scaled difference will diverge at the rate √n.

The most powerful use of this result is based on a quadratic form that jointly tests the entire vector d = 0. Specifically, we find

n·(β̃ − β̂)′(P⁻¹GP⁻¹ − P⁻¹HQ⁻¹ − Q⁻¹HP⁻¹′ + Q⁻¹MQ⁻¹)⁻¹(β̃ − β̂) →d χ²k

if the limiting covariance in (11.68) is nonsingular. It is possible that this matrix is not of full rank, in which case we use the Moore-Penrose generalized inverse to obtain

n·(β̃ − β̂)′(P⁻¹GP⁻¹ − P⁻¹HQ⁻¹ − Q⁻¹HP⁻¹′ + Q⁻¹MQ⁻¹)⁺(β̃ − β̂) →d χ²q,

where the superscript + indicates a generalized inverse and q is the rank of the covariance matrix. To make these statistics feasible, we need to use the estimators of the covariance matrices Q, M, P, and G introduced above and an obvious analogous estimator of H. The limiting distributions will be unchanged.

The approach discussed above is very general and quite feasible, but the covariance matrix used is somewhat more complicated than the tests one sees in the literature. In order to simplify the covariance, a little more structure is required on the relationship between the regressors and instruments, and an additional assumption must be added. Specifically, we can decompose zi into the component linearly related to xi and a residual, whereupon we can write

zi = Bxi + εi

where B = PQ⁻¹ and εi is orthogonal to xi by construction. Since zi is orthogonal to ui by the properties of instruments and, under the null hypothesis, xi is orthogonal to ui, then εi is also orthogonal to ui. Beyond mutual orthogonality of εi, xi, and ui, we need to assume E[ui²εix′i] = 0. Under the null and this additional assumption, we find Q⁻¹HP⁻¹′ = Q⁻¹MQ⁻¹, so

√n(β̃ − β̂) →d N(0, P⁻¹GP⁻¹′ − Q⁻¹MQ⁻¹),    (11.69)

and the covariance of the difference (suitably scaled) is the difference in the covariances. The quadratic form test will be similarly simplified and made feasible by consistently estimating the covariance components.

One case where this assumption is satisfied, and which has received widespread attention in the literature, occurs under the constant conditional variance condition E[ui²|zi, xi] = σ², whereupon E[ui²εix′i] = E[E[ui²εix′i|zi, xi]] = E[E[ui²εix′i|εi, xi]] =


E[σ²εix′i] = 0 by definition. Moreover, the covariance matrix simplifies further since G = σ²N and M = σ²Q, so

√n(β̃ − β̂) →d N(0, σ²P⁻¹NP⁻¹ − σ²Q⁻¹),

and the quadratic form test becomes

n·(β̃ − β̂)′(σ²P⁻¹NP⁻¹ − σ²Q⁻¹)⁻¹(β̃ − β̂) →d χ²k.    (11.70)

Again, it is possible that the weight matrix in the quadratic form is rank deficient, in which case we use the generalized inverse instead and the degrees of freedom equal the rank. A feasible version of the statistic requires the consistent estimators of the components of the covariance that were introduced above.

This class of tests, based on the difference between an estimator that is consistent under both the null and the alternative and another estimator that is consistent only under the null, are called Hausman-type tests. Strictly speaking, the Hausman test compares an estimator that is efficient under the null and inconsistent under the alternative with one that is consistent under both. The earliest example of a test of this type was the Wu test, which was applied to the simultaneous equation model and has a feasible version of the form given in (11.70).
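As an illustration, the following sketch (numpy assumed; hypothetical inputs) computes a Hausman-type contrast between OLS and simple IV under the constant-conditional-variance simplification, using a generalized inverse in case the covariance difference is rank deficient:

```python
import numpy as np

def hausman_test(y, X, Z, s2):
    """Hausman-type contrast of OLS and simple IV.
    A sketch under the homoskedastic simplification; s2 is an estimate of sigma^2."""
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
    # Estimated covariances under constant conditional variance
    V_ols = s2 * np.linalg.inv(X.T @ X)
    V_iv = s2 * np.linalg.inv(Z.T @ X) @ (Z.T @ Z) @ np.linalg.inv(X.T @ Z)
    diff = b_iv - b_ols
    V_diff = V_iv - V_ols                          # may be rank deficient
    stat = diff @ np.linalg.pinv(V_diff) @ diff    # generalized inverse, as in the text
    q = np.linalg.matrix_rank(V_diff)
    return stat, q   # compare stat with a chi-square(q) critical value
```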


Chapter 12

Nonscalar Covariance

12.1 Nature of the Problem

12.1.1 Model and Ideal Conditions

Consider the model

y = Xβ + u,    (12.1)

where y is the n × 1 vector of observations on the dependent variable, X is the n × k matrix of observations on the explanatory variables, and u is the vector of unobservable disturbances.

The ideal conditions are

(i) E[u] = 0

(ii & iii) E[uu′] = σ2In

(iv) X full column rank

(v) X nonstochastic

(vi) [u ∼N(0,σ2In)]

12.1.2 Nonscalar Covariance

Nonscalar covariance means that

E[uu′] = σ2Ω, tr(Ω) = n (12.2)

where Ω is an n × n positive definite matrix such that Ω ≠ In. That is,

E[(u1, u2, . . . , un)′(u1, u2, . . . , un)] = σ² | ω11  ω12  · · ·  ω1n |
                                                 | ω21  ω22  · · ·  ω2n |    (12.3)
                                                 |  ⋮    ⋮    ⋱      ⋮  |
                                                 | ωn1  ωn2  · · ·  ωnn |


A covariance matrix can be nonscalar either by having non-constant diagonal elements or non-zero off-diagonal elements, or both.

12.1.3 Some Examples

12.1.3.1 Serial Correlation

Consider the model

yt = α + βxt + ut,    (12.4)

where

ut = ρut−1 + εt,    (12.5)

and E[εt] = 0, E[εt²] = σ², and E[εtεs] = 0 for all t ≠ s. Here, ut and ut−1 are correlated, so Ω is not diagonal. This is a problem that afflicts a large fraction of time series regressions.

12.1.3.2 Heteroscedasticity

Consider the model

Ci = α + βYi + ui,   i = 1, 2, . . . , n,    (12.6)

where Ci is consumption and Yi is income for individual i. For a cross-section, we might expect more variation in consumption by high-income individuals. Thus, E[ui²] is not constant. This is a problem that afflicts many cross-sectional regressions.

12.1.3.3 Systems of Equations

Consider the joint model

yt1 = x′t1β1 + ut1

yt2 = x′t2β2 + ut2.

If ut1 and ut2 are correlated, then the joint model has a nonscalar covariance. If the error terms ut1 and ut2 are viewed as omitted variables, then it is natural to ask whether common factors have been omitted and hence whether the terms are correlated.

12.2 Consequences of Nonscalar Covariance

12.2.1 For Estimation

The OLS estimates are

β = (X′X)−1X′y

= β + (X′X)−1X′u. (12.7)


Thus,

E[β] = β + (X′X)⁻¹X′E[u] = β,    (12.8)

so OLS is still unbiased (but not BLUE since (ii & iii) are not satisfied). Now

β − β = (X′X)−1X′u. (12.9)

so

E[(β − β)(β − β)′] = (X′X)⁻¹X′E[uu′]X(X′X)⁻¹
                   = σ²(X′X)⁻¹X′ΩX(X′X)⁻¹
                   ≠ σ²(X′X)⁻¹.

The diagonal elements of (X′X)⁻¹X′ΩX(X′X)⁻¹ can be either larger or smaller than the corresponding elements of (X′X)⁻¹. In certain cases we will be able to establish the direction of the inequality.

Suppose

(1/n)X′X →p Q p.d.    (12.10)
(1/n)X′ΩX →p M;

then

(X′X)⁻¹X′ΩX(X′X)⁻¹ = (1/n)[(1/n)X′X]⁻¹[(1/n)X′ΩX][(1/n)X′X]⁻¹ →p 0,

since the bracketed factors converge to Q⁻¹MQ⁻¹ while the leading 1/n goes to zero. So

β →p β,    (12.11)

since β is unbiased and the variances go to zero.

12.2.2 For Inference

Suppose

u ∼ N(0, σ²Ω);    (12.12)

then

β ∼ N(β, σ²(X′X)⁻¹X′ΩX(X′X)⁻¹).    (12.13)

Thus

(βj − βj)/√(σ²[(X′X)⁻¹]jj)  is not distributed as N(0, 1),    (12.14)

since the denominator may be either larger or smaller than √(σ²[(X′X)⁻¹X′ΩX(X′X)⁻¹]jj). Similarly,

(βj − βj)/√(s²[(X′X)⁻¹]jj)  is not distributed as tn−k.    (12.15)

We might say that OLS yields biased and inconsistent estimates of the variance-covariance matrix. This means that our statistics will have incorrect size, so we over- or under-reject a correct null hypothesis.


12.2.3 For Prediction

We seek to predict

y∗ = x∗′β + u∗,    (12.16)

where ∗ indicates an observation outside the sample. The OLS (point) predictor is

y∗ = x∗′β,    (12.17)

which will be unbiased (but not BLUP). Prediction intervals based on σ²(X′X)⁻¹ will be either too wide or too narrow, so the probability content will not be the ostensible value.

12.3 Correcting For Nonscalar Covariance

12.3.1 Generalized Least Squares

Since Ω is positive definite we can write

Ω = PP′    (12.18)

for some n × n nonsingular matrix P (typically upper or lower triangular). Multiplying (12.1) by P⁻¹ yields

P⁻¹y = P⁻¹Xβ + P⁻¹u    (12.19)

or

y∗ = X∗β + u∗,    (12.20)

where y∗ = P⁻¹y, X∗ = P⁻¹X, and u∗ = P⁻¹u. Performing OLS on the transformed model yields the generalized least squares

or GLS estimator

β = (X∗′X∗)⁻¹X∗′y∗    (12.21)
  = ((P⁻¹X)′P⁻¹X)⁻¹(P⁻¹X)′P⁻¹y
  = (X′P⁻¹′P⁻¹X)⁻¹X′P⁻¹′P⁻¹y.

But P⁻¹′P⁻¹ = (PP′)⁻¹ = Ω⁻¹, whereupon we have the alternative representation

β = (X′Ω⁻¹X)⁻¹X′Ω⁻¹y.    (12.22)

This estimator is also known as the Aitken estimator. Note that GLS reduces to OLS when Ω = In.
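A minimal computational sketch of the GLS/Aitken estimator, assuming Ω is known and that numpy and scipy are available:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gls(y, X, Omega):
    """Generalized least squares via the transformation P^{-1}y = P^{-1}X b + P^{-1}u,
    where Omega = P P' (Cholesky factor). A sketch assuming Omega is known."""
    P = cholesky(Omega, lower=True)                   # Omega = P P'
    y_star = solve_triangular(P, y, lower=True)       # P^{-1} y
    X_star = solve_triangular(P, X, lower=True)       # P^{-1} X
    beta, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)   # OLS on transformed model
    return beta
```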

12.3.2 Properties with Known Ω

Suppose that Ω is a known, fixed matrix, then

• E[u∗] = 0


• E[u∗u∗′] = P−1E[uu′]P−1′ = σ2P−1ΩP−1′ = σ2P−1PP′P−1′ = σ2In

• X∗ = P−1X nonstochastic

• X∗ has full column rank

so the transformed model satisfies the ideal model assumptions (i)–(v). Applying previous results for the ideal case to the transformed model, we have

E[β] = β    (12.23)

E[(β − β)(β − β)′] = σ²(X∗′X∗)⁻¹ = σ²(X′Ω⁻¹X)⁻¹    (12.24)

and the GLS estimator is unbiased and BLUE. We assume the transformed model satisfies the asymptotic properties studied in the previous chapter. First, suppose

(1/n)X′Ω⁻¹X = (1/n)X∗′X∗ →p Q∗ p.d.;    (a)

then β →p β. Secondly, suppose

(1/√n)X′Ω⁻¹u = (1/√n)X∗′u∗ →d N(0, σ²Q∗);    (b)

then √n(β − β) →d N(0, σ²Q∗⁻¹). Inference and prediction can proceed as before for the ideal case.

12.3.3 Properties with Unknown Ω

If Ω is unknown then the obvious approach is to estimate it. Bear in mind, however, that there are up to n(n+1)/2 different parameters if we have no restrictions on the matrix. Such a matrix cannot be estimated consistently, since we only have n observations and the number of parameters is increasing faster than the sample size. Accordingly, we look at cases where Ω = Ω(λ) for λ a p × 1 finite-length vector of unknown parameters. The three examples above fall into this category.

Suppose we have an estimator λ̂ (possibly consistent); then we obtain Ω̂ = Ω(λ̂) and the feasible GLS estimator

β = (X′Ω̂⁻¹X)⁻¹X′Ω̂⁻¹y
  = β + (X′Ω̂⁻¹X)⁻¹X′Ω̂⁻¹u.

The small-sample properties of this estimator are problematic since Ω̂ = P̂P̂′ will generally be a function of u, so the regressors of the feasible transformed model X∗ = P̂⁻¹X become stochastic. The feasible GLS estimator will be biased and non-normal in small samples even if the original disturbances were normal.

It might be supposed that if λ̂ is consistent then everything will work out in large samples. Such happiness is not assured, since there are up to n(n+1)/2


possible nonzero elements in Ω which can interact with the x's in a pathological fashion. Suppose that (a) and (b) are satisfied and furthermore

(1/n)[X′Ω(λ̂)⁻¹X − X′Ω(λ)⁻¹X] →p 0    (c)

and

(1/√n)[X′Ω(λ̂)⁻¹u − X′Ω(λ)⁻¹u] →p 0;    (d)

then

√n(β − β) →d N(0, σ²Q∗⁻¹).    (12.25)

Thus in large samples, under (a)–(d), the feasible GLS estimator has the same asymptotic distribution as the true GLS. As such it shares the optimality properties of the latter.

12.3.4 Maximum Likelihood Estimation

Suppose

u ∼ N(0, σ²Ω);    (12.26)

then

y ∼ N(Xβ, σ²Ω)    (12.27)

and

L(β, σ², Ω; y, X) = f(y|X; β, σ², Ω)    (12.28)
                  = (2πσ²)^(−n/2) |Ω|^(−1/2) exp{−(1/(2σ²))(y − Xβ)′Ω⁻¹(y − Xβ)}.

Taking Ω as given, we can maximize L(·) w.r.t. β by minimizing

(y − Xβ)′Ω⁻¹(y − Xβ) = (y − Xβ)′P′⁻¹P⁻¹(y − Xβ)    (12.29)
                      = (y∗ − X∗β)′(y∗ − X∗β).

Thus OLS on the transformed model or the GLS estimator

β = (X′Ω−1X)−1X′Ω−1y (12.30)

is MLE and BUE since it is unbiased.

12.4 Seemingly Unrelated Regressions

12.4.1 Sets of Regression Equations

We consider a model with G agents and a behavioral equation with n observations for each agent. The equation for agent j can be written

yj = Xjβj + uj , (12.31)

Page 37: Multicollinearity - Rice Universitybwbwn/econ510_files/part_3.pdfvariable as collinearity increases, djjgrows correspondingly larger, thereby com-pensating. Thus under H0: j = 0 j,

12.4. SEEMINGLY UNRELATED REGRESSIONS 169

where yj is the n × 1 vector of observations on the dependent variable for agent j, Xj is the n × k matrix of observations on the explanatory variables, and uj is the vector of unobservable disturbances. Writing the G sets of equations as one system yields

    | y1 |   | X1  0  · · ·  0  | | β1 |   | u1 |
    | y2 | = | 0   X2 · · ·  0  | | β2 | + | u2 |    (12.32)
    |  ⋮ |   | ⋮    ⋮   ⋱    ⋮  | |  ⋮ |   |  ⋮ |
    | yG |   | 0   0  · · ·  XG | | βG |   | uG |

or more compactly

y = Xβ + u,    (12.33)

where the definitions are obvious. The individual equations satisfy the usual OLS assumptions

E[uj ] = 0 (12.34)

and

E[uju′j] = σj²In,    (12.35)

but due to common omitted factors we must allow for the possibility that

E[uju′ℓ] = σjℓIn,   j ≠ ℓ.    (12.36)

In matrix notation we have

E[u] = 0    (12.37)

and

E[uu′] = Σ ⊗ In = σ²Ω,    (12.38)

where

Σ = | σ1²  σ12  · · ·  σ1G |
    | σ12  σ2²  · · ·  σ2G |    (12.39)
    |  ⋮    ⋮    ⋱      ⋮  |
    | σG1  σG2  · · ·  σG² |.

12.4.2 SUR Estimation

We can estimate each equation by OLS

βj = (X′jXj)−1X′jyj (12.40)

and as usual the estimators will be unbiased, BLUE (among estimators linear in yj), and under normality

βj ∼ N(βj, σj²(X′jXj)⁻¹).    (12.41)

This procedure, however, ignores the covariances between equations. Treating all equations as a combined system yields

y = Xβ + u (12.42)


where

u ∼ (0, Σ ⊗ In)    (12.43)

is non-scalar. Applying GLS to this model yields

β = (X′(Σ ⊗ In)⁻¹X)⁻¹X′(Σ ⊗ In)⁻¹y    (12.44)
  = (X′(Σ⁻¹ ⊗ In)X)⁻¹X′(Σ⁻¹ ⊗ In)y.

This estimator will be unbiased and BLUE (among estimators linear in y) and will, in general, be efficient relative to equation-by-equation OLS.

If u is multivariate normal then

β ∼ N(β, (X′(Σ⊗ In)−1X)−1). (12.45)

Even if u is not normal, then with reasonable assumptions about the joint behavior of X and u, we have

√n(β − β) →d N(0, [plim (1/n)X′(Σ ⊗ In)⁻¹X]⁻¹).    (12.46)

12.4.3 Diagonal Σ

There are two special cases in which the SUR estimator simplifies to OLS on each equation. The first case is when Σ is diagonal. In this case

Σ = | σ1²  0    · · ·  0   |
    | 0    σ2²  · · ·  0   |    (12.47)
    | ⋮     ⋮    ⋱     ⋮   |
    | 0    0    · · ·  σG² |

and

X′(Σ ⊗ In)⁻¹X =

    | X′1  0   · · ·  0   | | (1/σ1²)In  0          · · ·  0          | | X1  0   · · ·  0  |
    | 0    X′2 · · ·  0   | | 0          (1/σ2²)In  · · ·  0          | | 0   X2  · · ·  0  |    (12.48)
    | ⋮    ⋮    ⋱     ⋮   | | ⋮          ⋮           ⋱     ⋮          | | ⋮   ⋮    ⋱     ⋮  |
    | 0    0   · · ·  X′G | | 0          0          · · ·  (1/σG²)In  | | 0   0   · · ·  XG |

    | (1/σ1²)X′1X1  0             · · ·  0             |
  = | 0             (1/σ2²)X′2X2  · · ·  0             |
    | ⋮             ⋮              ⋱     ⋮             |
    | 0             0             · · ·  (1/σG²)X′GXG  |.

Similarly,

X′(Σ ⊗ In)⁻¹y = | (1/σ1²)X′1y1 |
                | (1/σ2²)X′2y2 |    (12.49)
                |      ⋮       |
                | (1/σG²)X′GyG |,


whereupon

β = | (X′1X1)⁻¹X′1y1 |
    | (X′2X2)⁻¹X′2y2 |    (12.50)
    |       ⋮        |
    | (X′GXG)⁻¹X′GyG |.

So the estimator for each equation is just the OLS estimator for that equation alone.

12.4.4 Identical Regressors

The second case is when each equation has the same set of regressors, i.e. Xj = X, so

X = IG ⊗X. (12.51)

And

β = [(IG ⊗ X′)(Σ⁻¹ ⊗ In)(IG ⊗ X)]⁻¹(IG ⊗ X′)(Σ⁻¹ ⊗ In)y    (12.52)
  = (Σ⁻¹ ⊗ X′X)⁻¹(Σ⁻¹ ⊗ X′)y
  = [Σ ⊗ (X′X)⁻¹](Σ⁻¹ ⊗ X′)y
  = [IG ⊗ (X′X)⁻¹X′]y

    | (X′X)⁻¹X′y1 |
  = | (X′X)⁻¹X′y2 |
    |      ⋮      |
    | (X′X)⁻¹X′yG |.

In both these cases the other equations have nothing to add to the estimation of the equation of interest because either the omitted factors are unrelated or the equation has no additional regressors to help reduce the sum-of-squared errors for the equation of interest.

12.4.5 Unknown Σ

Note that for this case Σ plays the role of λ in the general form Ω = Ω(λ). It has finite length, with G(G+1)/2 unique elements, and can be estimated consistently using OLS residuals. Let

ej = yj −Xjβj

denote the OLS residuals for agent j. Then by the usual arguments

σ̂jℓ = (1/n) Σᵢ₌₁ⁿ eij eiℓ

and

Σ̂ = (σ̂jℓ)


will be consistent. Form the feasible GLS estimator

β = (X′(Σ̂⁻¹ ⊗ In)X)⁻¹X′(Σ̂⁻¹ ⊗ In)y,

which can be shown to satisfy (a)–(d) and will have the same asymptotic distribution as the GLS estimator. This estimator is obtained in two steps: in the first step we estimate all equations by OLS and thereby obtain Σ̂; in the second step we compute the feasible GLS estimator, as sketched below.
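A minimal sketch of this two-step procedure (numpy assumed; y_list and X_list are hypothetical containers holding yj and Xj for each of the G equations):

```python
import numpy as np

def sur_fgls(y_list, X_list):
    """Two-step feasible GLS for seemingly unrelated regressions (a sketch)."""
    G = len(y_list)
    n = len(y_list[0])
    # Step 1: equation-by-equation OLS and residuals
    resid = []
    for yj, Xj in zip(y_list, X_list):
        bj = np.linalg.solve(Xj.T @ Xj, Xj.T @ yj)
        resid.append(yj - Xj @ bj)
    E = np.column_stack(resid)
    Sigma_hat = (E.T @ E) / n                      # estimated Sigma = (sigma_jl)
    # Step 2: GLS on the stacked system with covariance Sigma ⊗ I_n
    y = np.concatenate(y_list)
    X = np.zeros((G * n, sum(Xj.shape[1] for Xj in X_list)))
    col = 0
    for j, Xj in enumerate(X_list):                # block-diagonal regressor matrix
        X[j * n:(j + 1) * n, col:col + Xj.shape[1]] = Xj
        col += Xj.shape[1]
    Omega_inv = np.kron(np.linalg.inv(Sigma_hat), np.eye(n))
    beta = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
    return beta, Sigma_hat
```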


Chapter 13

Heteroskedasticity

13.1 The Nature of Heteroskedasticity

13.1.1 Model and Ideal Conditions

Consider, from Chapter 11, the model with stochastic regressors, written one observation at a time,

yi = x′iβ + ui

for i = 1, 2, . . . , n. The assumptions for the conditional zero mean case (iv,b) were

(i) E[ui] = 0

(ii) E[u2i ] = σ2

(iii) E[uiul] = 0, i ≠ l

(iv,b) (ui,xi) jointly i.i.d. with E[ui|xi] = 0, E[u2i |xi] = σ2

(v) E[xix′i] = Q p.d.

As we pointed out there, if we add a conditional normality assumption,

(vi) ui|xi ∼ N(0, σ2)

then we are really back in (iv,a), the case of ui independent of xi. And conditional on X, the distributional and inferential results obtained in Chapters 7-10 for nonstochastic regressors could be applied.

13.1.2 Heteroskedasticity

We can think about the systematic part of the model x′iβ as being a model of the conditional expectation of yi given xi. So, if the systematic portion is correct, the assumption of conditional zero mean disturbances is quite compelling. It


is not equally compelling that the conditional variance, which is, in general, a function of the conditioning variables, is constant as assumed in (iv,b). In fact, there is some restrictiveness in arguing that the conditional expectation of yi is a function of the conditioning variables but its conditional variance is not. Particularly in cross-sectional models, it is sometimes crucial to entertain the notion that variances are not constant.

The constant conditional variance in (iv,b) is the assumption of conditionally homoskedastic disturbances. If the conditional variance is not constant, then we say the disturbances are conditionally heteroskedastic. Formally, we have

(iv,h) (ui, xi) jointly i.i.d. with E[ui|xi] = 0, E[ui²|xi] = σ²λ²(xi),

where λ²(xi) is not constant. The terminology arises from the Greek root "skedos", or spread, paired with either "homo" for same or "hetero" for varied. So heteroskedasticity literally means varied spreads. Note that (iv,h) does not contradict (ii), since one is a conditional variance and the other is unconditional, but the two together imply E[λ²(xi)] = 1.

Under this relaxed assumption, we need additional higher-order moment assumptions:

(v,h) E[xix′i] = Q p.d., E[ui²xix′i] = E[σ²λ²(xi)xix′i] = M, E[(1/λ²(xi))xix′i] = Q∗ p.d.

The second condition is borrowed from Assumption (iv,c), but the third is introduced specifically for this case. For inference in small samples we will add conditional normality:

(vi,h) ui|xi ∼ N(0, σ²λ²(xi)).

The normality assumption was not added with Assumption (iv,b) since it implied we were in Case (iv,a).

In terms of the matrix notation, the conditional covariance can now be written

E[uu′|X] = σ2Λ

where

Λ = | λ²(x1)   0      · · ·    0     |
    |   0     λ²(x2)  · · ·    0     |
    |   ⋮       ⋮       ⋱      ⋮     |
    |   0       0     · · ·  λ²(xn)  |

and E[tr(Λ)] = n. This is an example of a non-scalar covariance that is diagonal but non-constant along the diagonal.

13.1.3 Some Examples

For example, suppose,

Ci = α+ βYi + γWi + ui


is estimated from cross-sectional data, where i = 1, 2, . . . , n. It is likely that the variance of Ci, and hence of ui, will be larger for individuals with large Yi. Thus ui will be uncorrelated with Yi but its square will be related. Specifically, we would have E[ui²|Yi] = σ²λ²(Yi). We would not usually know the specific form of λ²(·), only that it is a monotonic increasing function.

Sometimes, however, it is possible to ascertain the form of the heteroskedasticity function. Consider an even simpler model of consumption, but with a more complicated panel-data structure:

Csi = α + βYsi + usi,

where the subscript si denotes individual i from state s, for s = 1, 2, . . . , S, and i = 1, 2, . . . , ns. The disturbances usi are assumed to have ideal properties, with σ² denoting the constant variance. Unfortunately, individual-specific data are not available. Instead we have state-level averages

C̄s = α + βȲs + ūs,

where C̄s = Σᵢ₌₁^{ns} Csi/ns, Ȳs = Σᵢ₌₁^{ns} Ysi/ns, and ūs = Σᵢ₌₁^{ns} usi/ns. But now we see that although E[ūs] = 0 and E[ūsūr] = 0 for s ≠ r, E[ūs²] = σ²/ns. Thus the error term will be heteroskedastic with known form, since λs² = 1/ns. Moreover, the non-constant variances are not stochastic.

13.2 Consequences of Heteroskedasticity

13.2.1 For OLS estimation

Consider the OLS estimator

β = (X′X)⁻¹X′y;

substituting y = Xβ + u yields

β = (X′X)⁻¹X′(Xβ + u)
  = (X′X)⁻¹X′Xβ + (X′X)⁻¹X′u
  = β + (X′X)⁻¹X′u

and β is unbiased (but not BLUE) since

E[β|X] = β.

Recall

β − β = (X′X)⁻¹X′u.

So,

E[(β − β)(β − β)′|X] = (X′X)⁻¹X′E[uu′|X]X(X′X)⁻¹
                     = (X′X)⁻¹X′σ²ΛX(X′X)⁻¹
                     = σ²(X′X)⁻¹X′ΛX(X′X)⁻¹


is the conditional covariance matrix of β. Using Assumption (v,h) and applying the law of large numbers, we have (1/n)X′X →p Q = E[xix′i]. It follows that

X′X = Σᵢ₌₁ⁿ xix′i = Op(n),

so (X′X)⁻¹ = Op(1/n). Note that λ²(xi)xix′i is i.i.d. and that E[σ²λ²(xi)xix′i] = M, so σ²(1/n)Σᵢ₌₁ⁿ λ²(xi)xix′i →p M. Thus, it also follows that

X′ΛX = Σᵢ₌₁ⁿ λ²(xi)xix′i = Op(n),

whereupon

(X′X)⁻¹X′ΛX(X′X)⁻¹ = Op(1/n)

goes to zero. Thus, by convergence in quadratic mean, β collapses in distribution to its mean β, which means it is consistent.

Now, in terms of the estimated variance scalar, we find

E[s²|X] = E[Σᵢ₌₁ⁿ ei²/(n − k) | X] = (1/(n − k))E[e′e|X]
        = (1/(n − k))E[u′(In − X(X′X)⁻¹X′)u | X]
        = σ²n/(n − k) − (1/(n − k))E[tr((X′X)⁻¹X′uu′X) | X]
        = σ² + (σ²/(n − k))[k − tr((X′X)⁻¹X′ΛX)].

Depending on the interaction of λ²(xi) and xi in (X′X)⁻¹X′ΛX, we find that s² is, in general, now biased. However, since the second term is Op(1/n), then

limₙ→∞ E[s²] = σ²

and s² is asymptotically unbiased. Moreover, under Assumption (iv,h), we can also show

plimₙ→∞ s² = σ²

and s² is consistent.

13.2.2 For inferences

Suppose ui are conditionally normal or

u|X ∼ N(0, σ2Λ)


then

β|X ∼ N(β, σ²(X′X)⁻¹X′ΛX(X′X)⁻¹).

Clearly, then,

s²(X′X)⁻¹

is not an appropriate estimate of

σ²(X′X)⁻¹X′ΛX(X′X)⁻¹,

since (X′X)⁻¹ ≠ (X′X)⁻¹X′ΛX(X′X)⁻¹, in general. Depending on the interaction of Λ and X, the diagonal elements of

σ²(X′X)⁻¹X′ΛX(X′X)⁻¹

can be either larger or smaller than the diagonal elements of

s²(X′X)⁻¹.

Thus, the estimated variances (diagonal elements of s²(X′X)⁻¹) and standard errors printed by OLS packages can either understate or overstate the true variances.

Consequently, the distribution of

(βj − βj)/√(s²djj),   djj = [(X′X)⁻¹]jj,

can be either fatter or thinner than a tn−k distribution. Thus, our inferences under the null hypothesis will, in general, be incorrectly sized and we will either over-reject or under-reject. This can increase the probability of either Type I or Type II errors.

These problems persist in larger samples. From Case (iv,c) in Chapter 11, we have √n(β − β) →d N(0, Q⁻¹MQ⁻¹) and √n(βj − βj) →d N(0, [Q⁻¹MQ⁻¹]jj). Thus

(βj − βj)/√(s²djj) = √n(βj − βj)/√(s²[((1/n)X′X)⁻¹]jj)
                   = [√n(βj − βj)/√([Q⁻¹MQ⁻¹]jj)] · [√([Q⁻¹MQ⁻¹]jj)/√(s²[((1/n)X′X)⁻¹]jj)]
                   →d N(0, 1) · √([Q⁻¹MQ⁻¹]jj)/√(σ²[Q⁻¹]jj).

The numerator and denominator in the final ratio are, in general, not equal, depending on the interaction of λ²(xi) and xi and hence the form of M. Thus we can either over- or under-reject, even in large samples.


13.2.3 For prediction

Suppose we use

yp = x′pβ

as a predictor of yp = x′pβ + up. Then since Eβ = β,

Eyp = Ex′pβ = Eyp

so yp is still unbiased. However, since β is not BLUE, yp is not BLUP. That is, there are other predictors linear in y that have smaller variance. Moreover, the variance of the predictor will not have the usual form, and the width of prediction intervals based on the usual variance-covariance estimators will be biased.

13.3 Detecting Heteroskedasticity

13.3.1 Graphical Method

In order to detect heteroskedasticity, we usually need some idea of the possible sources of the non-constant variance. Usually, the notion is that λi² is a function of xi, if not constant.

In time series, where the xt usually move smoothly as t increases, a general approach is to obtain the plot of et against t. Under the null hypothesis of homoskedasticity we expect to observe a plot such as the following.

[Figure: plot of OLS residuals et against time t, showing roughly even dispersion throughout.]

That is, the residuals seem equally dispersed over time. If heteroskedasticity occurs, we might observe some pattern in the dispersion as a function of time, e.g.


[Figure: plot of residuals et against time t, with dispersion that grows noticeably as t increases.]

A general technique for either time series or cross-sections is to plot ei against the fitted value ŷi = x′iβ. Under the null hypothesis of homoskedasticity, the dispersion of ei is unrelated to xi and hence (asymptotically) to x′iβ. Thus any pattern such as the following

[Figure: plot of residuals ei against fitted values ŷi, with dispersion that increases with ŷi.]

indicates the presence of heteroskedasticity.

If we have a general idea which xij may influence the variance of ui, we simply plot ei against xij. Suppose, for example, we think E[ui²] may be an increasing function of xij; then we would find a pattern such as

[Figure: plot of residuals ei against the regressor xij, with dispersion that increases with xij.]

Even dispersion for all values of xij, on the other hand, would argue against heteroskedasticity.


13.3.2 Goldfeld-Quandt Test

In order to perform this test, we must have some idea when the heteroskedasticity, if present, will lead to larger variances. That is, we need to be able to determine which observations are likely to have higher variance and which smaller. When this is possible, we reorder the observations and split the sample so that

y = Xβ + u

can be written as

    | y1 |   | X1 |     | u1 |
    | y2 | = | X2 | β + | u2 |,

where u1 corresponds to the n1 observations with larger variances if heteroskedasticity is present. That is,

E[u1u′1|X1] = σ1²In1

and

E[u2u′2|X2] = σ2²In2,

where H0: σ1² = σ2², while H1: σ1² > σ2².

We perform OLS on the two subsamples and obtain the usual variance estimators

s1² = e′1e1/(n1 − k)   and   s2² = e′2e2/(n2 − k),

where e1 and e2 are the OLS residuals from each of the subsamples. Now, by previous results,

(n1 − k)s1²/σ1² ∼ χ²(n1 − k)   and   (n2 − k)s2²/σ2² ∼ χ²(n2 − k),

and the two are independent. Thus, under H0: σ1² = σ2², the ratio of these, each divided by its respective degrees of freedom, yields

s1²/s2² ∼ F(n1 − k, n2 − k).

Under the alternative H1: σ1² > σ2² we find s1²/s2² ∼ F(n1 − k, n2 − k) · σ1²/σ2², and we expect to obtain large values for this statistic.


13.3.3 Breusch-Pagan Test

Suppose

λi² = λ²(z′iα),
z′i = (1, z∗i′),
α′ = (α1, α∗′);

then

u2i = λ2(z′iα) + vi

where E[vi] = 0. This is an example of a single-index model for heteroskedasticity. We assume that λ²(·) is a strictly monotonic function of its argument. Under H0: α∗ = 0 we see that the model is homoskedastic. Moreover, ui² will then be uncorrelated with, and hence have zero covariance with, z∗i, which could be tested by regressing ui² on zi and testing whether all the slopes are zero.

Along these lines, Breusch and Pagan propose the asymptotically equivalent approach of regressing qi = ei² − σ̂² on zi and testing whether the slopes are zero. Using the n·R² test for testing all slopes zero, we find

η∗ = n·R² = q′Z(Z′Z)⁻¹Z′q / [(1/n)Σt(et² − σ̂²)²] = q′Z(Z′Z)⁻¹Z′q / (q′q/n)  →d  χ²(s−1)

under H0, where R² is from the auxiliary regression. The single-index structure and monotonicity assure that qi is likely to be correlated with zi, since ∂λ²(z′iα)/∂α = h′(z′iα)zi, where the scalar function h′(·) is the first derivative of λ²(·) with respect to its single argument. Thus η∗ is likely to become increasingly positive under the alternative and the test will have power. This test will be asymptotically appropriate whether the ui are normal or not.

Under normality, the denominator of this test, which is an estimator of the fourth moment, has a specific structure that can be used to simplify the test. Specifically, since the fourth moment of a normal about its mean is 3σ⁴, we have

η = q′Z(Z′Z)⁻¹Z′q / [2(σ̂²)²]  →d  χ²(s−1)

under H0. The discussion of the power of the previous, more general version of the test applies to the normal-specific version as well. Under conditional normality, this test can be shown to be a Lagrange multiplier test.
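A sketch of the studentized (non-normality-robust) Breusch-Pagan statistic (numpy and scipy assumed; e are OLS residuals and Z contains a constant plus the candidate variance-shifting variables zi):

```python
import numpy as np
from scipy import stats

def breusch_pagan(e, Z):
    """Breusch-Pagan test: regress q_i = e_i^2 - sigma2_hat on z_i and use n*R^2
    in its studentized form q'Z(Z'Z)^{-1}Z'q / (q'q/n). A sketch; Z includes a constant."""
    n = len(e)
    q = e**2 - np.mean(e**2)
    q_hat = Z @ np.linalg.lstsq(Z, q, rcond=None)[0]   # fitted values of the auxiliary regression
    stat = n * (q_hat @ q_hat) / (q @ q)
    df = Z.shape[1] - 1
    p_value = 1 - stats.chi2.cdf(stat, df)
    return stat, p_value
```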


13.3.4 White Test

By the i.i.d. assumption and the law of large numbers, we have

(1/n)Σᵢ₌₁ⁿ ui² →p σ² = E[ui²],

(1/n)Σᵢ₌₁ⁿ ui²xix′i →p E[ui²xix′i] = E[E[ui²xix′i|xi]] = E[σ²λ²(xi)xix′i] = M,

and

Q̂ = (1/n)Σᵢ₌₁ⁿ xix′i →p Q.

For ei = yi − x′iβ, due to consistency of β, we can show

σ̂² = (1/n)Σᵢ₌₁ⁿ ei² = ((n − k)/n)s² →p σ²

and also

M̂ = (1/n)Σᵢ₌₁ⁿ ei²xix′i →p M.

Now, under H0: λi² = 1,

M = σ²Q,

while under H1: λi² non-constant,

M ≠ σ²Q

is possible. Thus, we compare M̂ with σ̂²Q̂, and

M̂ − σ̂²Q̂ = (1/n)Σᵢ ei²xix′i − [(1/n)Σᵢ ei²][(1/n)Σᵢ xix′i] →p 0

under H0. Let wi = ei² and zi = ⟨unique elements of xi ⊗ xi⟩; then under H0:

(1/n)Σᵢ wizi − [(1/n)Σᵢ wi][(1/n)Σᵢ zi] →p 0,

or

(1/n)Σᵢ (wi − w̄)(zi − z̄) →p 0.

Under the alternative this sample covariance will converge in probability to a nonzero value.

Accordingly, White proposes we test whether the slope coefficients in the regression of wi on zi are zero. Let εi be the residuals in this auxiliary regression.


Then, under general conditions, White shows that

R²w = 1 − Σεi²/Σ(wi − w̄)² = [Σ(wi − w̄)² − Σεi²]/Σ(wi − w̄)²,

n·R²w = [Σ(wi − w̄)² − Σεi²]/[Σ(wi − w̄)²/n]  →d  χ²(r−1),

where r = k(k + 1)/2 is the number of unique elements of xi ⊗ xi.

Note that the White test is, strictly speaking, not necessarily a test for heteroskedasticity but a test for whether the OLS standard errors are appropriate for inference purposes. It is possible to have heteroskedasticity, or Λ ≠ In, and still have X′ΛX = X′X, or M = σ²Q, whereupon standard OLS inference is fine. It is also worth noting that the White test is just the Breusch-Pagan test with zi = ⟨unique elements of xi ⊗ xi⟩.
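A sketch of the White test computed through the auxiliary regression of ei² on the unique elements of xi ⊗ xi (numpy and scipy assumed; with dummy regressors some product columns may be collinear and would need to be dropped in practice):

```python
import numpy as np
from scipy import stats

def white_test(e, X):
    """White test: regress squared residuals on the unique elements of x_i ⊗ x_i
    and use n*R^2 from that auxiliary regression (a sketch; X includes a constant)."""
    n, k = X.shape
    # unique elements of x_i ⊗ x_i: all products x_ij * x_il for j <= l
    cols = [X[:, j] * X[:, l] for j in range(k) for l in range(j, k)]
    Z = np.column_stack(cols)
    w = e**2
    w_centered = w - w.mean()
    resid = w - Z @ np.linalg.lstsq(Z, w, rcond=None)[0]
    R2 = 1 - (resid @ resid) / (w_centered @ w_centered)
    stat = n * R2
    df = Z.shape[1] - 1
    return stat, 1 - stats.chi2.cdf(stat, df)
```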

13.4 Correcting for Heteroskedasticity

13.4.1 Weighted Least Squares (known weights)

The only manner in which the model

yi = x′iβ + ui i = 1, 2, . . . , n

violates the ideal conditions is that

E(u2i ) = σ2λ2(xi) = σ2λ2i

is not constant, for λi² known. A logical possibility is to transform the model to eliminate this problem. Specifically, we propose

(1/λi)yi = (1/λi)x′iβ + (1/λi)ui

or

y∗i = x∗i′β + u∗i

where y∗i = yi/λi, x∗i = xi/λi, and u∗i = ui/λi. Then clearly

E[u∗i] = 0
E[u∗i²] = σ²
E[u∗iu∗l] = 0,  i ≠ l
(u∗i, x∗i) jointly i.i.d. with E[u∗i|x∗i] = 0, E[u∗i²|x∗i] = σ²
E[x∗ix∗i′] = Q∗ p.d.


for i = 1, 2, . . . , n, and the assumptions for the conditional zero mean case are satisfied. If we add conditional normality then the transformed disturbances are independent of the transformed regressors and we have the same finite-sample distributions and inferences as in the stochastic regressor case.

Thus, we perform OLS on the transformed system

y∗ = X∗β + u∗

to obtain

β = (X∗′X∗)⁻¹X∗′y∗ = (X′Λ⁻¹X)⁻¹X′Λ⁻¹y.

This estimator is called weighted least squares (WLS) since we have performed least squares after having weighted the observations by wi = 1/λi. This estimator is just generalized least squares (GLS) for the specific problem of heteroskedasticity of known form.
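A minimal sketch of WLS with known weights (numpy assumed; lam holds the known λi):

```python
import numpy as np

def wls(y, X, lam):
    """Weighted least squares with known weights: divide each observation by
    lambda_i and run OLS on the transformed data (a sketch)."""
    w = 1.0 / lam                        # weights w_i = 1 / lambda_i
    y_star = w * y
    X_star = X * w[:, None]
    beta, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    return beta
```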

13.4.2 Properties of WLS (known weights)

Substituting for y∗ in the usual fashion yields

β = (X∗′X∗)⁻¹X∗′(X∗β + u∗)
  = β + (X∗′X∗)⁻¹X∗′u∗.

So

E[β|X∗] = β

and β is unbiased and also BLUE. Similarly, we find, as might be expected,

E[(β − β)(β − β)′|X∗] = σ²(X∗′X∗)⁻¹,

which will go to zero for large n. Thus, β will be consistent.

If the ui are conditionally normal, or

u|X ∼ N(0, σ2Λ)

then

u∗|X∗ ∼ N(0, σ²In)

and

β|X∗ ∼ N(β, σ²(X∗′X∗)⁻¹).

And

(n − k)s∗²/σ² ∼ χ²(n−k),

independent of β, so the usual finite-sample inferences based on the transformed model are correct. Moreover, β is the MLE and hence BUE.


If the ui are non-normal we must rely on asymptotic properties. Fortunately, Assumptions (iv,h) and (v,h) allow us to use the law of large numbers and central limit theorem. Specifically, since (1/λ²(xi))xix′i is i.i.d. and E[(1/λ²(xi))xix′i] = Q∗, we find

(1/n)X∗′X∗ = (1/n)Σᵢ₌₁ⁿ (1/λi²)xix′i = (1/n)X′Λ⁻¹X →p Q∗,    (a)

and, since (1/λi²)xiui is i.i.d. with mean zero and variance E[(1/λi⁴)ui²xix′i] = σ²Q∗, then

(1/√n)X∗′u∗ = (1/√n)Σᵢ₌₁ⁿ (1/λi²)xiui = (1/√n)X′Λ⁻¹u →d N(0, σ²Q∗),    (b)

whereupon

√n(β − β) →d N(0, σ²Q∗⁻¹).

This is just an application of the results from Chapter 11. The conditions (a) and (b) are not always assured and should be verified if possible.

The usual t-ratios and quadratic forms would have complicated properties in small samples but would be appropriate in large samples. Specifically, we have

(βj − βj)/√(s²[(X∗′X∗)⁻¹]jj) →d N(0, 1)

and

[(SSE∗r − SSE∗u)/q] / [SSE∗u/(n − k)] →d χ²q/q,

where SSE∗ denotes the sum-of-squared errors for the transformed model. Conveniently, these are the limiting distributions of the tn−k and Fq,n−k as n grows large.

13.4.3 Estimation with Unknown Weights

The problem with WLS is that λi must be known, which is typically not the case. It frequently occurs, however, that we can estimate λi (as in the Breusch-Pagan formulation), whereupon we simply use λ̂i rather than λi as above. This is known as feasible WLS.

Once we use estimates λ̂i for λi, then x̂∗i = xi/λ̂i is no longer non-stochastic and the properties of β are intractable in small samples. Fortunately, if our estimates λ̂i improve in large samples, we usually find the difference between using λ̂i and λi disappears. Strictly speaking, in addition to (a) and (b) above we need to verify

(1/n)[X′Λ̂⁻¹X − X′Λ⁻¹X] →p 0    (c)

and

(1/√n)[X′Λ̂⁻¹u − X′Λ⁻¹u] →p 0    (d)


for the case at hand before being assured that feasible WLS is asymptotically equivalent to WLS based on known weights. Verification of these conditions will depend on the specific form of λi² = λ²(xi).

In order to perform inference with the feasible WLS estimator we need a consistent covariance estimator. Under Conditions (c) and (d),

Q̂∗ = (1/n)X′Λ̂⁻¹X →p Q∗

and

(βj − βj)/√(s²[(X′Λ̂⁻¹X)⁻¹]jj) →d N(0, 1),

which means the usual ratios reported for the OLS estimates of the feasible transformed model are asymptotically appropriate. This consistent covariance matrix may also be used in quadratic forms to perform inference on more complicated hypotheses.

13.4.4 Heteroskedastic-Consistent Covariances

If we don’t have a good model for the heteroskedasticity, then we can’t doWLS. An alternative, in this case, is to use an appropriate covariance estimatefor OLS. Recall that

√n(β − β)

p−→ N(0,Q−1MQ−1).

Now, under the conditions of (iv,h) and (v,h), s2p−→ σ2, and

Q̂ = (1/n)Σt xtx′t →p Q,
M̂ = (1/n)Σt et²xtx′t →p M,

so

Ĉ = Q̂⁻¹M̂Q̂⁻¹ →p Q⁻¹MQ⁻¹.

And

√n(βi − βi)/√(Ĉii) = (βi − βi)/√([(X′X)⁻¹(Σt et²xtx′t)(X′X)⁻¹]ii) →d N(0, 1).

The denominator in the middle expression above is the reported robust or heteroskedastic-consistent standard error. These are also known as Eicker-White standard errors.
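A sketch of OLS with Eicker-White standard errors via the sandwich formula above (numpy assumed):

```python
import numpy as np

def ols_white_se(y, X):
    """OLS with Eicker-White (heteroskedasticity-consistent) standard errors:
    the sandwich (X'X)^{-1} (sum_t e_t^2 x_t x_t') (X'X)^{-1} (a sketch)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    meat = (X * e[:, None]**2).T @ X          # sum_t e_t^2 x_t x_t'
    cov = XtX_inv @ meat @ XtX_inv            # robust covariance of the estimator
    se = np.sqrt(np.diag(cov))
    return beta, se
```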

The beauty of this approach is that it does not require a model of heteroskedasticity. If we instead use the feasible weighted least squares approach we always stand the chance of misspecification, particularly since our models


seldom suggest a form for the heteroskedasticity. And if we misspecify the heteroskedasticity, the resulting transformed model from WLS will still be heteroskedastic and suffer inferential difficulties. This is just another manifestation of the bias-variance trade-off.


Chapter 14

Serial Correlation

14.1 Review and Introduction

14.1.1 Model and Ideal Conditions

For relevance, we rewrite the model in terms of time series data. Written one period at a time, the model is

yt = β1xt1 + β2xt2 + · · · + βkxtk + ut = Σⱼ₌₁ᵏ βjxtj + ut,

where t = 1, 2, . . . , n. The ideal assumptions are

(i) E[ut] = 0

(ii) E[u2t ] = σ2

(iii) E[utus] = 0, s ≠ t

(iv) xti non-stochastic

(v) (xt1, xt2, . . . , xtk) not linearly related

(vi) ut normally distributed.

This model is the same as before except for the use of the subscript t rather than i, since we are using time-series data.

In matrix notation, since we are not identifying individual observations, the model is the same as before, namely

y = Xβ + u.

And the assumptions are still


(i) E[u] = 0

(ii & iii) E[uu′] = σ2In

(iv & v) X non-stochastic, full column rank

(vi) u ∼ N(0, σ2In).

14.1.2 Serial Correlation

Condition (iii) is the assumption at issue if the disturbances are serially (between periods) correlated. The disturbances are said to be serially correlated if (iii) is violated, i.e.,

E[usut] = ω ≠ 0   for some s ≠ t.

Serial correlation, as a problem, is most frequently encountered in time series analysis. This may be chiefly a result of the fact that we have a stronger notion of proximity of observations, e.g. adjacent periods, than in cross-sectional analysis.

14.1.3 Autoregressive Model

For expositional purposes, we will entertain one of the simplest possible, but still very useful, models of serial correlation. Suppose that the ut are related between periods by

ut = ρut−1 + εt t = 1, 2, . . . , n

where −1 < ρ < 1 and εt satisfy

E[εt] = 0

E[ε2t ] = σ2ε

E[εtεs] = 0,  s ≠ t,

for t = 1, 2, . . . , n. The disturbances ut are said to have been generated by a first-order autoregressive or AR(1) process.

Rewrite (by substituting recursively)

ut = εt + ρut−1
   = εt + ρ(εt−1 + ρut−2)
   = εt + ρεt−1 + ρ²(εt−2 + ρut−3)
   = εt + ρεt−1 + ρ²εt−2 + ρ³εt−3 + · · · .

Since |ρ| < 1 then

E[ut] = 0


While,

E[ut²] = E[(εt + ρεt−1 + ρ²εt−2 + ρ³εt−3 + · · ·)²]
       = E[εt² + ρεtεt−1 + ρ²εtεt−2 + · · · + ρεt−1εt + ρ²εt−1² + ρ³εt−1εt−2 + · · · + ρ²εt−2εt + ρ³εt−2εt−1 + ρ⁴εt−2² + · · ·]
       = σε² + ρ²σε² + ρ⁴σε² + · · ·
       = σε²(1 + ρ² + ρ⁴ + · · ·),

so

σu² = σε²/(1 − ρ²),   since ρ² < 1.

Similarly, we have

E[utut−1] = E[(εt + ρεt−1 + ρ²εt−2 + · · ·)(εt−1 + ρεt−2 + ρ²εt−3 + · · ·)]
          = ρσε² + ρ³σε² + ρ⁵σε² + · · ·
          = ρσε²(1 + ρ² + ρ⁴ + · · ·)
          = ρσu² ≠ 0

and, generally, E[utut−s] = ρ^s σu².

Forming the variances and covariances into a matrix yields

E[uu′] = E | u1²   u1u2  u1u3  · · ·  u1un |
           | u2u1  u2²   u2u3  · · ·  u2un |
           | u3u1  u3u2  u3²   · · ·  u3un |
           |  ⋮     ⋮     ⋮     ⋱      ⋮   |
           | unu1  unu2  unu3  · · ·  un²  |

       = | σu²      σu²ρ     σu²ρ²    · · ·  σu²ρⁿ⁻¹ |
         | σu²ρ     σu²      σu²ρ     · · ·  σu²ρⁿ⁻² |
         | σu²ρ²    σu²ρ     σu²      · · ·  σu²ρⁿ⁻³ |
         |  ⋮        ⋮        ⋮        ⋱       ⋮     |
         | σu²ρⁿ⁻¹  σu²ρⁿ⁻²  σu²ρⁿ⁻³  · · ·  σu²     |

       = σu² | 1      ρ      ρ²     · · ·  ρⁿ⁻¹ |
             | ρ      1      ρ      · · ·  ρⁿ⁻² |
             | ρ²     ρ      1      · · ·  ρⁿ⁻³ |
             | ⋮      ⋮      ⋮       ⋱     ⋮   |
             | ρⁿ⁻¹   ρⁿ⁻²   ρⁿ⁻³   · · ·  1    |  =  σu²Ω(ρ).


Note the structure of this covariance matrix. The elements along the main diagonal are constant, as are the elements all along each off-diagonal. However, the values in the off-diagonals become smaller exponentially as we move away from the main diagonal toward the upper right-hand (and lower left-hand) corner(s). This latter feature is an example of the infinite but decreasing memory of the process. An innovation εt always matters to the process but matters less and less as the process evolves. Such properties are typical of time series.
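A small sketch of constructing Ω(ρ) directly from this pattern (numpy assumed):

```python
import numpy as np

def ar1_omega(n, rho):
    """AR(1) correlation matrix Omega(rho) with (i, j) element rho^|i-j|
    (a sketch; multiply by sigma_u^2 to obtain the full covariance matrix)."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])
```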

It is possible to entertain other models of serial correlation. For example, we could posit a second-order autoregressive AR(2) process

ut = ρ1ut−1 + ρ2ut−2 + εt,   t = 1, 2, . . . , n,

where ρ1 and ρ2 are parameters subject to appropriate stability restrictions. This process would have a much more complicated covariance structure which we will not pursue here. Or we might have a first-order moving-average MA(1) process

ut = εt + λεt−1,   t = 1, 2, . . . , n,

where λ is a parameter governing the covariance between adjacent periods. This process has a simpler but still interesting structure and is sometimes encountered in practice, but it will not be pursued here.

14.1.4 Some causes of Serial Correlation

Most economic time series are subject to inertia. That is, they tend to move smoothly or sluggishly over time. There is a sense in which values of the subscript t that are similar indicate a certain proximity economically, so the values of the variables are expected to be similar. When this occurs, we find the series are serially correlated since adjacent values of the series tend to be large and/or small together. It is not unreasonable to consider the possibility that the disturbances might have similar properties.

Serial correlation can arise if we misspecify the model, for example if we omit variables that are themselves serially correlated. Suppose that the true relationship is

yt = β1xt1 + β2xt2 + β3xt3 + β4xt4 + ut

but we estimate

yt = β1xt1 + β2xt2 + β3xt3 + vt.

Then vt = β4xt4 + ut. Aside from other difficulties, if the xt4 are serially correlated, then vt will be serially correlated, even if the original ut were not! For this reason many econometricians take evidence of serial correlation in a model as an index of misspecification.

Or if the true relationship is nonlinear and we mistakenly specify a linear model, serial correlation can result. Suppose the true relationship is

yt = f(xt; θ) + ut

where f(xt; θ) is a nonlinear relationship, yielding the scatterplot


[Figure: scatterplot of yt against xt around the nonlinear function f(xt; θ), together with the best-fitting line α + βxt.]

If we mistakenly estimate

yt = α+ βxt + vt

then even the "best" fitting line will exhibit correlated residuals, since the residuals will tend to be positive and negative together and the xt move sluggishly.

It is possible to have serial correlation in non-time-series cases. For example, if we have spatially located observations with a careful measure of distance, we can define the concept of spatial serial correlation. Here, we can index the observations by the location measure. There has been substantial work on such models, but the basic approaches used are much the same as for the time-series cases.

14.2 Consequences of Serial Correlation

14.2.1 For OLS Estimation

Consider the OLS estimator

β = (X′X)−1X′y

= (X′X)−1X′(Xβ + u)

= β + (X′X)−1X′u

or

β − β = (X′X)−1X′u

Thus

E[β] = β + (X′X)−1X′E[u] = β

and OLS is unbiased. (But not BLUE).


While

cov(β) = E[(β − β)(β − β)′]

= E[(X′X)−1X′uu′X(X′X)−1]

= (X′X)−1X′σ2uΩX(X′X)−1

= σ2u(X′X)−1X′ΩX(X′X)−1

With the reasonable assumptions (1/n)X′X →p Q and (1/n)X′ΩX →p M, then

limₙ→∞ (X′X)⁻¹X′ΩX(X′X)⁻¹ = 0

and the variances of β collapse about its expectation β, whereupon β is consistent.

Consider

s² = e′e/(n − k),

where

e = y − Xβ

are the OLS residuals. We can show

E[s²] = σu²[n/(n − k) − (1/(n − k))tr((X′X)⁻¹X′ΩX)].

Under the assumptions used above, for large n, the second term goes to zero while n/(n − k) → 1; thus

limₙ→∞ E[s²] = σu².

Moreover, under these same assumptions, consistency of s² follows from consistency of β.

14.2.2 For inferences

Suppose the ut are normal, or

u ∼ N(0, σu²Ω);

then

β ∼ N(β, σu²(X′X)⁻¹X′ΩX(X′X)⁻¹).

Clearly

s²(X′X)⁻¹

is not an appropriate estimate of the covariance of β, namely

σu²(X′X)⁻¹X′ΩX(X′X)⁻¹,

since X′ΩX ≠ X′X in general.



Suppose ρ > 0 and the xt move smoothly, which is typically the case; then we can show

X′ΩX > X′X

(in the sense of exceeding by a positive semi-definite matrix) and

(X′X)−1X′ΩX(X′X)−1 > (X′X)−1X′X(X′X)−1

so typically,

σ2u(X′X)−1X′ΩX(X′X)−1 > s2(X′X)−1

and the usual estimated variances (and standard errors) calculated by OLS packages understate the true variances.

Thus, the ratios

zj = (βj − βj)/√(s²djj)

have a distribution that is fatter than tn−k, since we are dividing by too small a value. Graphically, the density of zj has fatter tails than the tn−k density, so the usual critical value tn−k,0.025 cuts off more than the nominal tail probability. Thus, we end up (apparently) in the tails too often and commit Type I errors with increased probability.

14.2.3 For forecasting

Suppose we use

yp = x′pβ

as a predictor of

yp = x′pβ + up

Then since E[β] = β,

E[yp] = x′pβ = E[yp]

So yp is still unbiased. Since β is no longer BLUE, yp is not BLUP.


14.3 Detection of Serial Correlation

14.3.1 Graphical Techniques

Serial correlation implies some sort of nonzero covariances between adjacent disturbances. This suggests that there will be some type of linear relationship (either positive or negative) between disturbances of adjacent periods. Such patterns of serial relationship are perhaps best detected (at least initially) by plotting the disturbances against time. Since we don't observe the ut = yt − x′tβ, we use the OLS residuals

et = yt − x′tβ,

which will be similar.

Graphical techniques are most useful in detecting first-order serial correlation, i.e.

E[utut−1] = µ ≠ 0.

Let

ρ = cov(ut, ut−1)/√(var(ut)·var(ut−1)) = µ/√(σu²·σu²) = µ/σu².

Then

E[utut−1] = ρσu².

Thus, graphical techniques can be effective in detecting whether or not ρ = 0.

Suppose ρ > 0; then we have positive serial correlation between ut and ut−1.

When this is true, ut will tend to be highly positive (negative) when ut−1 was highly positive (negative). Thus, ut and hence et might be expected to move relatively smoothly, with extreme values occurring together in adjacent periods. For example,

[Figure: plot of residuals et against time t, moving smoothly with runs of positive and negative values (positive serial correlation).]

Suppose ρ < 0; then we have negative serial correlation between ut and ut−1. Here ut will typically be extremely negative (positive) when ut−1 is extremely positive (negative). Thus, we have a jagged pattern with extreme values clustered together, but alternating in sign. For example,


[Figure: plot of residuals et against time t, jagged with adjacent values alternating in sign (negative serial correlation).]

14.3.2 Von Neumann Ratio

Suppose we observe ut; then Von Neumann suggests tests may be based on the statistic

VNR = [Σₜ₌₂ⁿ (ut − ut−1)²/(n − 1)] / [Σₜ₌₁ⁿ ut²/n].

Expanding and invoking a LLN for nonindependent data, we have

VNR = [(1/(n−1))Σt ut² − (2/(n−1))Σt utut−1 + (1/(n−1))Σt ut−1²] / [(1/n)Σt ut²]
    →p (σu² − 2ρσu² + σu²)/σu² = 2(1 − ρ).

Thus, under the null hypothesis of no serial correlation, ρ = 0, VNR →p 2.

In fact, under the null hypothesis and general conditions on ut, Von Neumann shows

VNR ∼ N(2, 4/n)

for large samples. So,

z = (VNR − 2)/√(4/n) = √n(VNR − 2)/2 →d N(0, 1).

Under the alternative (ρ ≠ 0), we expect VNR to converge to values other than 2 and hence z to be extreme. Values in the left-hand tail indicate positive serial correlation (ρ > 0), and right-hand tail values indicate negative serial correlation. Since ut is unobservable, we use et, since the difference disappears in large samples.



14.3.3 Durbin-Watson Test

The VNR is, strictly speaking, appropriate only in large samples. In small samples, Durbin and Watson propose the related statistic

d = Σₜ₌₂ⁿ (et − et−1)² / Σₜ₌₁ⁿ et² = VNR · (n − 1)/n.

Thus, in large samples, the difference between d and VNR will become negligible. Now

d = e′Ae / e′e,

where

A = |  1  −1   0  · · ·   0   0   0 |
    | −1   2  −1  · · ·   0   0   0 |
    |  0  −1   2  · · ·   0   0   0 |
    |  ⋮   ⋮   ⋮   ⋱      ⋮   ⋮   ⋮ |
    |  0   0   0  · · ·  −1   2  −1 |
    |  0   0   0  · · ·   0  −1   1 |.

Under the null hypothesis of no serial correlation, D-W proceed to show

E[d] = [tr(A) − tr(X′AX(X′X)⁻¹)] / (n − k) −→ 2;

thus the distribution (in small samples) of d depends in a complicated way upon X, and we cannot calculate the exact distribution of d without taking explicit account of the values of X. This is, needless to say, quite complicated.

Accordingly, D-W consider two bounding distributions, a and b, for each value of n − k. Here a represents the best-case (most compact) distribution of d while b is the worst-case (fattest) distribution that d can have for any X. For α = 0.05, say, dU is the correct critical value if a is correct, while dL is the critical value if b is correct. Note that we are only looking at the left-hand tail, so we are thinking about a one-sided alternative. Moreover, we have in mind a positive value of ρ under the alternative.

Thus, while we don't precisely know what the appropriate distribution is, we know that if


d > dU:   we are not in the left 0.05 tail, regardless of the values in X;
d < dL:   we are in the left 0.05 tail, regardless of the values in X;
dL < d < dU:   we are in the inconclusive region and may or may not be in the tail, depending on X.

In practice, however, we can often reduce or eliminate the inconclusive region. If ρ > 0 and the xt move smoothly over time, then the true critical value will be close to dU and we can use dU as a critical value in the usual way.
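A one-line sketch of computing d from OLS residuals (numpy assumed):

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic d = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2 from OLS
    residuals (a sketch); values near 2 suggest no first-order serial correlation."""
    return np.sum(np.diff(e)**2) / np.sum(e**2)
```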

One caution on the Durbin-Watson statistic is warranted when the regressors include lagged endogenous variables. Obviously, this violates the assumption of non-stochastic regressors, so we could at best expect the test to be appropriate only in large samples. There are additional complications, however, since in this case the d statistic will be biased upward under the alternative of ρ > 0, so the test will lose power. The null distribution will still be correct in large samples. Thus evidence of serial correlation revealed by the test still applies regardless of the presence of lagged endogenous variables.

14.3.4 A Regression Approach

Suppose that the ut are generated by the first-order autoregressive process

ut = ρut−1 + εt

Substitute for ut and ut−1 from

ut = yt − x′tβ

ut−1 = yt−1 − x′t−1β

to obtainyt − x′tβ = ρ(yt−1 − x′t−1β) + εt

oryt = x′tβ + ρ(yt−1 − x′t−1β) + εt

Under H0 : ρ = 0, then ρβ = 0, thus, a direct test is to use

\[
y_t = \begin{pmatrix} x_t' & y_{t-1} & x_{t-1}' \end{pmatrix}
\begin{pmatrix} \beta \\ \rho \\ \gamma \end{pmatrix} + \varepsilon_t
\]

as an unrestricted regression and

yt = x′tβ + ut

as the restricted regression, since ρ = 0 and γ = 0 under H0. Having performed these two regressions, we form the usual statistic

\[
\frac{(\mathrm{SSR}_r - \mathrm{SSR}_u)/\#\text{rest}}{\mathrm{SSR}_u/(n-k)}
\]


which will have approximately an F_{#rest, n−k} distribution. Under the alternative, ρ ≠ 0 and the statistic would become large. Note that this procedure is only appropriate asymptotically, since the regressors in the unrestricted case include yt−1 and are no longer non-stochastic.
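A minimal sketch of this restricted-versus-unrestricted comparison, assuming simulated data with a single non-stochastic regressor; the helper ols_ssr and the variable names are ours, not part of any standard library.

```python
import numpy as np

def ols_ssr(X, y):
    """OLS sum of squared residuals."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

# Hypothetical data: y_t = 1 + 2 x_t + u_t with AR(1) errors (rho = 0.5).
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u

# Align observations t = 2, ..., n so that lags are available.
X_r = np.column_stack([np.ones(n - 1), x[1:]])                   # restricted: x_t only
X_u = np.column_stack([np.ones(n - 1), x[1:], y[:-1], x[:-1]])   # adds y_{t-1}, x_{t-1}
ssr_r, ssr_u = ols_ssr(X_r, y[1:]), ols_ssr(X_u, y[1:])

n_obs, k_u = X_u.shape
n_restr = k_u - X_r.shape[1]                                     # number of restrictions
F = ((ssr_r - ssr_u) / n_restr) / (ssr_u / (n_obs - k_u))
print("F statistic:", F)   # compare with an F(n_restr, n_obs - k_u) critical value
```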

Rather obviously, due to multicollinearity, this approach breaks down if the regressors include one-period lags of the endogenous variable. In this case, we eliminate the second instance of yt−1 from the list of regressors and consider the unrestricted regression

\[
y_t = \begin{pmatrix} x_t' & x_{t-1}' \end{pmatrix}
\begin{pmatrix} \beta^* \\ \gamma \end{pmatrix} + \varepsilon_t
\]

where yt−1 is included in xt and its coefficient is now β∗j = βj + ρ. Under the null hypothesis, we still have γ = 0 but β∗j ≠ 0 is possible. So now we proceed as before but only test γ = 0 using the F-test.

14.4 Correcting Serial Correlation

14.4.1 With known ρ

The basic regression equation yt = x′tβ + ut can be rewritten as ut = yt − x′tβ, and the autoregressive equation ut = ρut−1 + εt can be rewritten as εt = ut − ρut−1. Substitution of the first into the second now yields

εt = (yt − x′tβ)− ρ(yt−1 − x′t−1β)

which can be rewritten as

\[
\begin{aligned}
(y_t - \rho y_{t-1}) &= (x_t' - \rho x_{t-1}')\beta + \varepsilon_t \\
y_t^* &= x_t^{*\prime}\beta + \varepsilon_t, \qquad t = 2, \ldots, n
\end{aligned}
\]

This procedure is known as the Cochrane-Orcutt transformation. Since x∗t is non-stochastic and εt has ideal properties, this transformed model satisfies the classical assumptions and least squares will have the usual nice properties.

Note, however, that we have lost one observation (the first) in obtaining this transformed model. It is possible to use a different transformation to recover the first observation. Let
\[
\varepsilon_1^* = \sqrt{1-\rho^2}\, u_1 = \sqrt{1-\rho^2}\,(\rho u_0 + \varepsilon_1),
\]
so that \(E[\varepsilon_1^{*2}] = \sigma_\varepsilon^2\) and \(E[\varepsilon_1^*\varepsilon_s^*] = 0\) for \(s \neq 1\). Accordingly, we obtain the following transformations for the first observation

\[
y_1^* = \sqrt{1-\rho^2}\, y_1, \qquad x_1^* = \sqrt{1-\rho^2}\, x_1.
\]

This procedure for recovering the first observation is known as the Prais-Winsten transformation.

Combining the two transformations and using matrix notation, we obtain the GLS transformation,

y∗ = X∗β + ε.


where ε1 = ε∗1. Since ρ is known and X is non-stochastic, the transformed model satisfies all the ideal properties, namely \(E[\varepsilon] = 0\), \(E[\varepsilon\varepsilon'] = \sigma_\varepsilon^2 I_n\), and X∗ non-stochastic with full column rank. Thus, the OLS estimator of the transformed model

\[
\hat{\beta} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^*
\]
is unbiased and efficient. Specifically,
\[
E[\hat{\beta}] = \beta, \qquad \operatorname{cov}(\hat{\beta}) = \sigma_\varepsilon^2 (X^{*\prime}X^*)^{-1}.
\]
And for unknown variance,
\[
\widehat{\operatorname{cov}}(\hat{\beta}) = s_\varepsilon^2 (X^{*\prime}X^*)^{-1}
\]
is unbiased, where \(s_\varepsilon^2\) is the usual variance estimator for the transformed model.
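The transformed-model estimator is easy to compute directly. Below is a minimal numpy sketch of the combined Cochrane-Orcutt/Prais-Winsten transformation with ρ treated as known; gls_transform is a hypothetical helper name and the data are simulated.

```python
import numpy as np

def gls_transform(y, X, rho):
    """Quasi-differences for t = 2..n plus the Prais-Winsten first row."""
    y_star, X_star = np.empty_like(y), np.empty_like(X)
    s = np.sqrt(1.0 - rho ** 2)
    y_star[0], X_star[0] = s * y[0], s * X[0]
    y_star[1:] = y[1:] - rho * y[:-1]
    X_star[1:] = X[1:] - rho * X[:-1]
    return y_star, X_star

# With rho known, OLS on the transformed model is the GLS estimator.
rng = np.random.default_rng(2)
n, rho = 100, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho * u[t - 1] + rng.normal()
y = X @ np.array([1.0, 2.0]) + u

y_star, X_star = gls_transform(y, X, rho)
beta_gls = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
e_star = y_star - X_star @ beta_gls
s2_eps = e_star @ e_star / (n - X.shape[1])
cov_gls = s2_eps * np.linalg.inv(X_star.T @ X_star)
print(beta_gls, np.sqrt(np.diag(cov_gls)))
```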

14.4.2 With estimated ρ

In general, we will not know ρ, so the GLS transformation is not feasible. Suppose we have an estimate \(\hat{\rho}\) that is consistent; then we obtain the feasible transformations

\[
\begin{aligned}
(y_t - \hat{\rho} y_{t-1}) &= (x_t' - \hat{\rho} x_{t-1}')\beta + \varepsilon_t \\
y_t^* &= x_t^{*\prime}\beta + \varepsilon_t, \qquad t = 2, \ldots, n
\end{aligned}
\]

and for the first observation

\[
\begin{aligned}
\sqrt{1-\hat{\rho}^2}\, y_1 &= \sqrt{1-\hat{\rho}^2}\, x_1'\beta + \varepsilon_1 \\
y_1^* &= x_1^{*\prime}\beta + \varepsilon_1.
\end{aligned}
\]

Applying OLS to this transformed model yields the feasible GLS estimator

β = (X∗′X∗)−1X∗′y∗.

Under fairly general conditions, the assumptions (a)-(d) from Chapter 11 can be verified and this estimator has the same large-sample limiting distribution as \(\hat{\beta}\). Note that, since the results only apply in large samples, one can bypass using the Prais-Winsten transformation to recover the first observation and still enjoy the same asymptotic results.

This leaves us with the problem of estimating ρ consistently. If ut were observable, we might estimate ρ in

ut = ρut−1 + εt

directly by OLS

\[
\hat{\rho} = \frac{\sum_{t=2}^{n} u_t u_{t-1}}{\sum_{t=2}^{n} u_t^2}.
\]


Since ut is unobservable, we instead form

\[
\hat{\rho} = \frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=2}^{n} e_t^2}
\]
where et are the OLS residuals. This estimator is, in fact, consistent under general conditions.

An alternative approach is to use the DW statistic to estimate ρ. Recall that the DW and VNR statistics converge in probability to the same value and \(\mathrm{VNR} \xrightarrow{p} 2(1-\rho)\). Rewriting this, we have \((1 - \mathrm{VNR}/2) \xrightarrow{p} \rho\). Thus we have
\[
\hat{\rho} = 1 - \mathrm{DW}/2
\]
as a consistent estimator. We can also take the coefficient of yt−1 in the regression

yt = x′tβ + ρyt−1 + x′t−1γ + εt

as our estimate of ρ. This estimate will also be consistent. Both of these alternative estimators are not only consistent but have the same limiting distribution as the regression approach above.
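Both residual-based estimates of ρ are one-liners. The sketch below, using hypothetical AR(1) residuals, shows the direct estimator and the 1 − DW/2 version giving similar answers.

```python
import numpy as np

def estimate_rho(e):
    """rho_hat = sum_{t>=2} e_t e_{t-1} / sum_{t>=2} e_t^2 from OLS residuals."""
    e = np.asarray(e, dtype=float)
    return np.sum(e[1:] * e[:-1]) / np.sum(e[1:] ** 2)

def estimate_rho_from_dw(e):
    """rho_hat = 1 - DW/2 using the Durbin-Watson statistic of the residuals."""
    e = np.asarray(e, dtype=float)
    dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
    return 1.0 - dw / 2.0

# Example with hypothetical AR(1) residuals (rho = 0.6):
rng = np.random.default_rng(3)
e = np.zeros(500)
for t in range(1, 500):
    e[t] = 0.6 * e[t - 1] + rng.normal()
print(estimate_rho(e), estimate_rho_from_dw(e))   # both should be near 0.6
```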

14.4.3 Maximum Likelihood Estimation

The above treatment leads naturally to estimation of the model by maximum likelihood techniques. In the following, we will disregard the first observation. Since it will turn out to be a nonlinear problem, we can only establish the properties of our estimator asymptotically, in which case the use of one observation is irrelevant. In any event, we can easily modify the following to incorporate the first observation using the Prais-Winsten transformation.

Combining the basic equation yt = x′tβ + ut with the autoregressive equation ut = ρut−1 + εt and rewriting yields
\[
y_t = \rho y_{t-1} + (x_t - \rho x_{t-1})'\beta + \varepsilon_t, \qquad t = 2, \ldots, n.
\]

Suppose that \(\varepsilon_t \sim \text{i.i.d. } N(0, \sigma_\varepsilon^2)\); then

yt|yt−1,xt,xt−1 ∼ N(ρyt−1 + (xt − ρxt−1)′β, σ2ε )

and the different observations of yt are conditionally independent but not identical. Independence is seen by ignoring the xt and looking at the unconditional distribution as the product of conditional distributions

\[
\begin{aligned}
f(y_t, y_{t-1}, y_{t-2}, \ldots) &= f(y_t|y_{t-1})\, f(y_{t-1}, y_{t-2}, y_{t-3}, \ldots) \\
&= f(y_t|y_{t-1})\, f(y_{t-1}|y_{t-2})\, f(y_{t-2}, y_{t-3}, y_{t-4}, \ldots) \\
&= f(y_t|y_{t-1})\, f(y_{t-1}|y_{t-2})\, f(y_{t-2}|y_{t-3}) \cdots
\end{aligned}
\]

The conditional distributions depend only on the previous realization due to the form of the generating equation.


Thus the conditional density of yt given yt−1, xt, xt−1 is
\[
\begin{aligned}
f(y_t|y_{t-1}, x_t, x_{t-1}; \beta, \sigma_\varepsilon^2, \rho)
&= \frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}} \exp\!\left\{ -\frac{1}{2\sigma_\varepsilon^2}\left((y_t - \rho y_{t-1}) - (x_t - \rho x_{t-1})'\beta\right)^2 \right\} \\
&= \frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}} \exp\!\left\{ -\frac{1}{2\sigma_\varepsilon^2}\left(y_t^* - x_t^{*\prime}\beta\right)^2 \right\}
\end{aligned}
\]

Due to conditional independence, the joint density and likelihood function ignoring the first observation is therefore

\[
\begin{aligned}
\mathcal{L} &= \Pr(y_1, y_2, \ldots, y_n) = \Pr(y_2|y_1)\cdot\Pr(y_3|y_2)\cdots\Pr(y_n|y_{n-1}) \\
&= \frac{1}{(2\pi\sigma_\varepsilon^2)^{n/2}} \exp\!\left\{ -\frac{1}{2\sigma_\varepsilon^2}\sum_{t=2}^{n}\left(y_t^* - x_t^{*\prime}\beta\right)^2 \right\}.
\end{aligned}
\]

and the log-likelihood is

\[
\ell = \ln\mathcal{L} = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma_\varepsilon^2 - \frac{1}{2\sigma_\varepsilon^2}\sum_{t=2}^{n}\left(y_t^* - x_t^{*\prime}\beta\right)^2.
\]

Note that the log-likelihood depends on β and ρ only through the summation. Thus, for any \(\sigma_\varepsilon^2\),

\[
\max_{\rho,\beta} \ell \iff \min_{\rho,\beta} \sum_{t=2}^{n}\left(y_t^* - x_t^{*\prime}\beta\right)^2
\]

and, making ρ explicit again, the problem is

\[
\min_{\rho,\beta} \sum_{t=2}^{n}\left((y_t - \rho y_{t-1}) - (x_t - \rho x_{t-1})'\beta\right)^2.
\]

Given a value of ρ, say \(\hat{\rho}\), the problem is to minimize \(\sum_{t=2}^{n}(y_t^* - x_t^{*\prime}\beta)^2\) with respect to β, which yields the familiar feasible GLS result
\[
\hat{\beta} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^*.
\]

And given a value of β, say \(\hat{\beta}\), the problem is to minimize \(\sum_{t=2}^{n}(e_t - \rho e_{t-1})^2\) with respect to ρ, where \(e_t = y_t - x_t'\hat{\beta}\), which also yields a familiar result
\[
\hat{\rho} = \frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=2}^{n} e_t^2}.
\]

In order to find the maximum likelihood solution, we have to satisfy both conditions at once, so we iterate back and forth between them until the estimates stabilize. Convergence is guaranteed since we have a quadratic problem at each stage. The asymptotic limiting distribution is not changed by this further iteration. And we see that the feasible GLS estimator proposed in the previous section is asymptotically fully efficient.
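The back-and-forth maximization can be written as a short loop. The sketch below (a plausible implementation, not taken from any particular package) alternates between the two closed-form solutions until the estimate of ρ stabilizes, dropping the first observation as in the text.

```python
import numpy as np

def iterated_cochrane_orcutt(y, X, max_iter=50, tol=1e-8):
    """Alternate between beta given rho and rho given beta until estimates stabilize."""
    rho = 0.0
    for _ in range(max_iter):
        # beta given rho: OLS on the quasi-differenced data, t = 2..n
        y_star = y[1:] - rho * y[:-1]
        X_star = X[1:] - rho * X[:-1]
        beta = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
        # rho given beta: regress e_t on e_{t-1}, where e_t = y_t - x_t' beta
        e = y - X @ beta
        rho_new = np.sum(e[1:] * e[:-1]) / np.sum(e[1:] ** 2)
        if abs(rho_new - rho) < tol:
            rho = rho_new
            break
        rho = rho_new
    return beta, rho

# Simulated example with rho = 0.6:
rng = np.random.default_rng(4)
n, true_rho = 300, 0.6
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = np.zeros(n)
for t in range(1, n):
    u[t] = true_rho * u[t - 1] + rng.normal()
y = X @ np.array([1.0, 2.0]) + u
print(iterated_cochrane_orcutt(y, X))
```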


Chapter 15

Misspecification

15.1 Introduction

15.1.1 The Model and Ideal Conditions

In the prequel, we have considered the linear relationship

yi = β1xi1 + β2xi2 + . . .+ βkxik + ui

where i = 1, 2, . . . , n. The elements of this model are assumed to satisfy the ideal conditions

(i) E[ui] = 0

(ii) E[u2i ] = σ2

(iii) E[uiul] = 0, l ≠ i

(iv) xij non-stochastic

(v) (xi1, xi2, . . . , xik) not linearly related

(vi) ui normally distributed

15.1.2 Misspecification

The model together with the ideal conditions comprises a complete specification of the joint stochastic behavior of the independent and dependent variables. We might say the model is misspecified if any of the assumptions are violated. Usually, however, we take misspecification to mean the wrong functional form has been chosen for the model. That is, yi is related to xij and ui by some function

yi = g(xi1, . . . , xik, ui)

that differs from the specified linear relationship given above. When any of the ideal conditions are violated, we use the terminology specific to that case.


15.1.3 Types of Misspecification

15.1.3.1 Omitted variables

Suppose the correct model is linear

\[
y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \beta_{k+1} x_{i,k+1} + \beta_{k+2} x_{i,k+2} + \cdots + \beta_{k+\ell} x_{i,k+\ell} + u_i
\]

but we estimate

yi = β1xi1 + β2xi2 + . . .+ βkxik + ui

In matrix notation, the true model is

y = X1β1 + X2β2 + u

but we estimate

y = X1β1 + u

Thus, we obtain estimates

\[
\tilde{\beta}_1 = (X_1'X_1)^{-1}X_1'y
\]

where we use the tilde to denote estimation of β1 based only on the X1 variables.

15.1.3.2 Extraneous variables

Suppose the correct model is

yi = β1xi1 + β2xi2 + . . . + βkxik + ui

but we estimate

\[
y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \beta_{k+1} x_{i,k+1} + \beta_{k+2} x_{i,k+2} + \cdots + \beta_{k+\ell} x_{i,k+\ell} + u_i
\]

In matrix notation, the true model is

y = X1β1 + u

but we estimate

\[
y = X_1\beta_1 + X_2\beta_2 + u^* = X\beta + u^*
\]

where X = (X1 : X2), β′ = (β′1,β′2).

Thus, we obtain the estimates

\[
\hat{\beta} = (X'X)^{-1}X'y
\]


15.1.3.3 Nonlinearity

We specify the model

yi = β1xi1 + β2xi2 + . . .+ βkxik + ui

but the true model is intrinsically nonlinear:

yi = g(xi, ui)

That is, it cannot be made linear in the parameters and disturbances through any transformation.

15.1.3.4 Disturbance Misspecification

There are a number of possible ways to misspecify the way that the disturbance term enters the relationship. However, for simplicity, we will only look at the case of specifying an additive error when it should be multiplicative, and the reverse. Specifically, the correct model is

yi = f(xi)ui

but we specify yi = f(xi) + ui or vice versa.

15.2 Consequences of Misspecification

15.2.1 Omitted variables

For estimation we have

\[
\begin{aligned}
\tilde{\beta}_1 &= (X_1'X_1)^{-1}X_1'y \\
&= (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + u) \\
&= \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'u.
\end{aligned}
\]

In terms of first moments, then

\[
E[\tilde{\beta}_1] = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 \neq \beta_1
\]

and the estimates are biased unless X′1X2 = 0. And, in terms of second moments,

\[
E[(\tilde{\beta}_1 - E[\tilde{\beta}_1])(\tilde{\beta}_1 - E[\tilde{\beta}_1])'] = \sigma^2(X_1'X_1)^{-1}.
\]

Note that \(E[s_1^2] > \sigma^2\), since X2β2 has been thrown in with the disturbances. It is instructive to compare this covariance of the estimator of β1 based on X1 only with the estimator when we also include X2. For this case, we have

\[
\hat{\beta} = (X'X)^{-1}X'y
\]


where \(\hat{\beta} = (\hat{\beta}_1', \hat{\beta}_2')'\). Moreover,

\[
E[(\hat{\beta} - E[\hat{\beta}])(\hat{\beta} - E[\hat{\beta}])'] = \sigma^2(X'X)^{-1}
\]

so

\[
E[(\hat{\beta}_1 - E[\hat{\beta}_1])(\hat{\beta}_1 - E[\hat{\beta}_1])'] = \sigma^2[(X'X)^{-1}]_{11}
\]

where σ²[(X′X)−1]11 denotes the upper left-hand submatrix of σ²(X′X)−1. Using results for partitioned inverses, we find that

\[
\sigma^2[(X'X)^{-1}]_{11} = \sigma^2\left(X_1'X_1 - X_1'X_2(X_2'X_2)^{-1}X_2'X_1\right)^{-1} > \sigma^2(X_1'X_1)^{-1}
\]

where "greater than" means exceeding by a positive definite matrix. Thus the estimates, while biased, are less variable than the OLS estimates based on the complete set of x's.

The findings in the previous paragraph are worthy of elaboration. We find that incorrectly omitting variables, which amounts to incorrectly setting their coefficients to zero, results in possibly biased estimates with smaller variance. This is an example of the classic bias-variance trade-off in the estimation of multiple-parameter models. Imposing incorrect restrictions, or under-parameterizing the model in the case of zero restrictions, results in biased estimates but smaller variance.
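A small simulation makes the bias expression concrete: averaging the short-regression estimates over many replications should come close to β1 + (X′1X1)−1X′1X2β2. The design below is entirely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 2000
beta1, beta2 = np.array([1.0, 2.0]), np.array([1.5])

# Fixed (non-stochastic) regressors, with X2 correlated with X1.
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = 0.8 * X1[:, [1]] + 0.6 * rng.normal(size=(n, 1))

implied_bias = np.linalg.solve(X1.T @ X1, X1.T @ X2) @ beta2
estimates = np.empty((reps, 2))
for r in range(reps):
    u = rng.normal(size=n)
    y = X1 @ beta1 + X2 @ beta2 + u
    estimates[r] = np.linalg.lstsq(X1, y, rcond=None)[0]   # regress on X1 only

print("average of beta1_tilde:", estimates.mean(axis=0))
print("beta1 + implied bias:  ", beta1 + implied_bias)
```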

For inference, the statistic
\[
q = \frac{\tilde{\beta}_j - \beta_j}{\sqrt{s^2 m_{jj}}}
\]
will no longer have the \(t_{n-k}\) distribution under the null hypothesis. Specifically, it will not be centered around 0 (due to bias) and it will be less spread out (since s² is larger). Thus, graphically, we have

[Figure: the density f(q), centered away from zero and more compact, compared with the tn−k density.]

and our confidence intervals constructed from q may well not bracket 0, depending on the size of the bias. This means we will commit a Type I error by rejecting a true null hypothesis with probability greater than the ostensible size.


For forecasting, \(\hat{y}_p\) will be responsive to xp1 in the wrong way (due to the bias of \(\tilde{\beta}_1\)) and unresponsive to xp2. Thus the predictions will be biased and can be systematically off. Moreover, s² will be larger and hence R² will be smaller, so we expect worsened prediction performance.

15.2.2 Extraneous Variable

In a very real sense, the model that includes both X1 and X2 is still correct, except that β2 = 0. Thus, for estimation

\[
E\begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} = \begin{pmatrix} \beta_1 \\ 0 \end{pmatrix}
\]

and both vectors of estimators are unbiased. Moreover, we still have

\[
E[(\hat{\beta} - E[\hat{\beta}])(\hat{\beta} - E[\hat{\beta}])'] = \sigma^2(X'X)^{-1}
\]
and E(s²) = σ². Thus s²(X′X)−1 provides an unbiased estimator of the variances and covariances.

However, \(\hat{\beta}_1\) when also estimating β2 is not efficient, since estimating using only X1 would yield smaller variances. Specifically, from the previous section, we have

\[
\begin{aligned}
E[(\hat{\beta}_1 - E[\hat{\beta}_1])(\hat{\beta}_1 - E[\hat{\beta}_1])'] &= \sigma^2[(X'X)^{-1}]_{11} \\
&= \sigma^2\left(X_1'X_1 - X_1'X_2(X_2'X_2)^{-1}X_2'X_1\right)^{-1} \\
&> \sigma^2(X_1'X_1)^{-1}
\end{aligned}
\]

which is the variance of the estimator when β2 = 0 is imposed. Thus we see that, in the linear model, the price of overparameterizing the model (including extraneous variables with accompanying parameters) is a loss of efficiency. This result continues to hold for estimators in a much more general context. If we know parameters should be zero or otherwise restricted, we should impose the restriction.

For inference, the ratios

\[
q = \frac{\hat{\beta}_j - \beta_j}{\sqrt{s^2[(X'X)^{-1}]_{jj}}}
\]

have the t distribution with n − (k1 + k2) degrees of freedom. If we estimate using only X1, however,

\[
q = \frac{\tilde{\beta}_j - \beta_j}{\sqrt{s^2[(X_1'X_1)^{-1}]_{jj}}}
\]

will have a t distribution with n− k1 d.f.


[Figure: the tn−(k1+k2) density compared with the tn−k1 density.]

Thus, there will be loss of power and the probability of a Type II error (not rejecting a false hypothesis) will be higher. Note, however, that the power, though smaller, will still approach 1 asymptotically and the test remains consistent.

For forecasting, the predictions will be unbiased but more variable. However, the increase in the variance is O(1/n), so it disappears asymptotically. In terms of these second-order impacts, the estimated coefficients will not be estimated as efficiently, so the prediction intervals will be wider. Specifically, the critical values that establish the width of the intervals come from the tn−(k1+k2) distribution rather than the tn−k1. Both converge to the N(0, 1) in large samples.

15.2.3 Nonlinearity

For simplicity we look at a bivariate case. Suppose the underlying model is

yi = f(xi, θ) + ui

for i = 1, 2, ..., n, but we mistakenly estimate the linear relationship

yi = β0 + β1xi + ui
by least squares. The consequences for estimation are similar to the case of omitted variables. Expanding the nonlinear relationship about x0 in a Taylor series yields

\[
\begin{aligned}
f(x_i, \theta) &= f(x_0, \theta) + \frac{\partial f(x_0, \theta)}{\partial x}(x_i - x_0) + R_i \\
&= \left[ f(x_0, \theta) - \frac{\partial f(x_0, \theta)}{\partial x}x_0 \right] + \left[ \frac{\partial f(x_0, \theta)}{\partial x} \right] x_i + R_i \\
&= \alpha + \beta x_i + R_i
\end{aligned}
\]

where \(\alpha = f(x_0,\theta) - \frac{\partial f(x_0,\theta)}{\partial x}x_0\), \(\beta = \frac{\partial f(x_0,\theta)}{\partial x}\), and \(R_i = R(x_i)\) is the remainder term. Thus, estimating the linear model is equivalent to omitting the


remainder term, which will be a function of xi. Note that the values of β and α depend on the point of expansion, so \(\hat{\alpha}\) and \(\hat{\beta}\) cannot be unbiased or consistent for all values.

For example, with a scatter of points generated by a concave nonlinear relationship and then fitted by linear least squares, we might have the following geometric relationship:

[Figure: scatter of points (xi, yi) around a concave function f(xi), together with the fitted line α + βxi.]

It is clear that: (a) the intercept is no longer correct, (b) the function yields the proper value at only two points, and (c) the slope of the estimated function is correct at only one point.

The consequences for inference are similar to the case of omitted variables. Even though a variable may be highly non-linearly related to yi, we may accept βj = 0 on a linear basis. Thus, we may commit Type II errors in choosing the correct variables.

The consequences for prediction are evident. As a consequence of (b), depending on the point of evaluation, we will either systematically over- or under-estimate the regression function when used for conditional prediction. Depending on the nonlinearity, the farther we get from the estimated range, the worse the linear relationship will approximate f(xi).

15.2.4 Misspecified Disturbances

Suppose
yi = f(xi) + ui

is correct but we mistakenly estimate

yi = f(xi)u∗i .

(For example, suppose the true model is \(y_i = \alpha x_i^\beta + u_i\) but we estimate \(y_i = \alpha x_i^\beta u_i^*\), or \(\ln y_i = \ln\alpha + \beta\ln x_i + \ln u_i^*\).) Solving for u∗i in the general case, we obtain
\[
u_i^* = \frac{y_i}{f(x_i)} = 1 + \frac{u_i}{f(x_i)}.
\]


So
\[
E(u_i^*) = 1, \qquad \operatorname{var}(u_i^*) = \frac{\sigma^2}{f(x_i)^2}.
\]

Thus, the disturbances are heteroskedastic. If we transform into logs to obtain a model that is linear in the disturbance, the transformed errors \(\ln(u_i^*) = \ln(1 + u_i/f(x_i))\) will still be heteroskedastic.

Conversely, suppose the true model is

yi = f(xi)ui

with E[ui] = 1, but we mistakenly estimate

yi = f(xi) + u∗i

then
\[
u_i^* = f(x_i)(u_i - 1)
\]
and u∗i is conditionally mean zero but heteroskedastic.

The estimation, inferential, and prediction complications arising from heteroskedasticity are well covered in Chapter 12. Likewise, the detection and correction of this problem are covered quite well there and will not be discussed below.

15.3 Detection of Misspecification

15.3.1 Testing for irrelevant variables

Suppose the model is
y = X1β1 + X2β2 + u

and we suspect that β2 = 0, i.e., X2 does not enter into the regression. We may test each coefficient of β2 directly using a t-test. Specifically, perform
\[
\hat{\beta} = (X'X)^{-1}X'y,
\]

the regression of y on X = (X1 : X2). Now,

\[
\frac{\hat{\beta}_j}{\sqrt{s^2 d_{jj}}} \sim t_{n-k}
\]
under the null hypothesis that βj = 0. Thus, we can use the printed t-values to test whether the coefficients βj = 0.

A more powerful test may be based on the F -distribution. Perform,

\[
\tilde{\beta}_1 = (X_1'X_1)^{-1}X_1'y
\]


as the restricted regression of y on X1 under H0 : β2 = 0. Using \(\hat{\beta}\) from the regression of y on X = (X1 : X2) as the unrestricted regression, we form

\[
\frac{(\mathrm{SSE}_r - \mathrm{SSE}_u)/k_2}{\mathrm{SSE}_u/(n-k)} \sim F_{k_2,\, n-k}
\]

as a test of the joint H0 : β2 = 0.
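For reference, a compact sketch of this joint test; f_test_subset is our own helper name, and the degrees of freedom follow the formula above.

```python
import numpy as np

def f_test_subset(y, X1, X2):
    """F-test of H0: beta2 = 0 in y = X1 b1 + X2 b2 + u."""
    def sse(Z):
        b = np.linalg.lstsq(Z, y, rcond=None)[0]
        e = y - Z @ b
        return e @ e
    X = np.column_stack([X1, X2])
    n, k = X.shape
    k2 = X2.shape[1]
    F = ((sse(X1) - sse(X)) / k2) / (sse(X) / (n - k))
    return F, (k2, n - k)   # statistic and its F degrees of freedom under H0
```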

15.3.2 Testing for omitted variables

Suppose the estimated model is

y = X1β1 + u

but we suspect that X2 may also enter into the relationship. That is, we suspect that the true relationship is

y = X1β1 + X2β2 + u

where β2 ≠ 0. Obviously, we may proceed as in the case of testing for extraneous variables to see whether X2 enters the relationship.

The above procedure is feasible only if we have some idea of what variables have been omitted; that is, we know X2. If X2 is unknown, then we must utilize the OLS residuals

\[
\begin{aligned}
e_1 &= y - \hat{y} \\
&= y - X_1\tilde{\beta}_1 \\
&= y - X_1(X_1'X_1)^{-1}X_1'y \\
&= (I_n - X_1(X_1'X_1)^{-1}X_1')y
\end{aligned}
\]

to detect whether variables have been omitted.

Suppose X2 has been omitted, then

e1 = (In −X1(X′1X1)−1X′1)(X1β1 + X2β2 + u)

= (In −X1(X′1X1)−1X′1)(X2β2 + u)

= M1(X2β2 + u)

Since multiplication by M1 orthogonalizes w.r.t. X1, e1 is linearly related to that part of X2 that is orthogonal to X1.

In time series analysis, we expect x′t2β2 to move smoothly. Consequently, we might expect et1 to exhibit (some) smooth movement over time due to the relationship. That is,


[Figure: time plot of the residuals et1, showing smooth movement over time.]

Suppose xi1 and xi2 move together, i.e., are related in some fashion. The ei will then react nonlinearly to changes in xi1 and hence to \(x_{i1}'\tilde{\beta} = \hat{y}_i\). Thus, a general technique is to plot ei against \(\hat{y}_i = x_{i1}'\tilde{\beta}\). Under H0 : β2 = 0, we expect to find no relationship. If β2 ≠ 0, however, ei may have some (nonlinear) relationship to \(\hat{y}_i\).

[Figure: scatter plot of the residuals et against the fitted values ŷt, showing a nonlinear relationship.]

This procedure may be formalized. Form the regression

\[
e_t = \gamma_1 \hat{y}_t^2 + \gamma_2 \hat{y}_t^3 + v_t
\]

Under H0 : β2 = 0, et should be unrelated to xt1 and to \(x_{t1}'\tilde{\beta} = \hat{y}_t\), i.e., γ1 = γ2 = 0. If this hypothesis is rejected by the regression results, we have evidence of omitted variables.
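A sketch of this formalized check, assuming X1 contains an intercept so that the residuals are mean zero; the returned F ratio is only an approximate guide, and the function name is ours.

```python
import numpy as np

def omitted_variable_check(y, X1):
    """Auxiliary regression e = g1*yhat^2 + g2*yhat^3 + v after OLS of y on X1.
    Returns an approximate F statistic for H0: g1 = g2 = 0."""
    n = len(y)
    b1 = np.linalg.lstsq(X1, y, rcond=None)[0]
    yhat = X1 @ b1
    e = y - yhat
    Z = np.column_stack([yhat ** 2, yhat ** 3])
    g = np.linalg.lstsq(Z, e, rcond=None)[0]
    v = e - Z @ g
    sse_r, sse_u = e @ e, v @ v      # under H0 the auxiliary fit explains nothing
    return ((sse_r - sse_u) / 2) / (sse_u / (n - X1.shape[1] - 2))
```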


15.3.3 Testing for Nonlinearity

Suppose the true regression is nonlinear:

yi = f(xi, θ) + ui

Expanding, as above, about x0 in a Taylor series yields
\[
y_i = \alpha + \beta x_i + R_i + u_i
\]
where Ri = R(xi) is the remainder term. In this case, since ei ≈ R(xi) + ui, ei will be a (nonlinear) function of xi. Thus, plots of ei against xi will reveal a pattern:

[Figure: scatter plot of the residuals ei against xi, showing a nonlinear pattern around zero.]

Of course the pattern must be nonlinear in xi, since the OLS residuals ei are by definition linearly unrelated to the regressors.

An analogous formal test is to examine the regression

\[
e_i = \gamma_0 + \gamma_1 x_i^2 + \gamma_2 x_i^3 + v_i
\]

Under the null hypothesis of linearity, ei is mean-zero and unrelated to xi, whereupon H0 : γ0 = γ1 = γ2 = 0. If nonlinearity is present, we expect to find significant terms. Higher-order polynomials can be entertained, but third order usually suffices. In the multivariate case, setting up all the various second- and third-order terms in the polynomial becomes more involved. Note that the linear term is always omitted and the intercept is included in the null for power purposes.


15.4 Correcting Misspecification

15.4.1 Omitted or Extraneous Variables

If the two competing models are explicitly available, then these two cases are easily handled since one model nests the other. In the case of omitted variables that should be included, the solution is to include the variables if they are available. In the case of extraneous variables, the solution is to eliminate the extraneous variables. The tests proposed in the previous section can be used to guide these decisions.

If we have evidence of omitted variables but are uncertain of the nature of the omitted variables, then we need to undertake a model selection process to determine the correct variables. This process is touched on below.

15.4.2 Nonlinearities

If we have evidence of nonlinearities in the relationship, then the solution involves entertaining a nonlinear regression. Usually, if we are proceeding from an initial linear model, we have no idea of what the nature of the nonlinearity is but do know which variables to use. In the bivariate case, we can simply introduce higher-order polynomial terms, as in the testing approach, except we continue to include a linear term:

\[
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + u_i.
\]

We can start with a safe choice of the order of the polynomial and eliminate insignificant higher-order terms. This process can be very involved when we have a number of variables that are possibly nonlinear. This approach naturally leads one to entertain nonparametric regression, where we allow the regression function to take on a very flexible form. These approaches have slower than normal rates of convergence to the target function and are beyond the scope of our analysis here.

Sometimes, we may have a specific nonlinear function in mind as the alternative. For example, if the model was a Cobb-Douglas function and we decided that the log-linear form was inadequate due to heteroskedasticity, we might be inclined to estimate the nonlinear function directly with additive errors. This is the subject of nonlinear regression, which is given extended treatment in the next chapter.

15.5 Model Selection Approaches

15.5.1 J-Statistic Approach

We begin with the simplified problem of choosing between two completely specified models. Suppose the first model under consideration is

y = X1β1 + u1


but we suspect the alternative

y = X2β2 + u2

may be correct. How can we determine whether the first (or second) model is misspecified? If the matrix X1 includes all the variables in X2, or vice versa, then the models are nested and we are dealing with possible extraneous variables or possibly known omitted variables. These cases are treated above and the decision can be made on the basis of an F-test.

So we are left with the case where the two models are non-nested. Each model includes explanatory variables that are not in the other. Davidson and MacKinnon (D-M) propose we consider a linear combination regression model

\[
\begin{aligned}
y &= (1-\lambda)X_1\beta_1 + \lambda X_2\beta_2 + u \\
&= X_1((1-\lambda)\beta_1) + X_2(\lambda\beta_2) + u \\
&= X_1\beta_1^* + X_2\beta_2^* + u.
\end{aligned}
\]

This formulation allows us to state the problem of which is the correct model in parametric form. Specifically, we have H0 : Model 1 is correct, or λ = 0, and H1 : Model 2 is correct, or λ = 1. A complication arises, however, since neither λ nor (1 − λ) is identifiable; they are subsumed into the scale of the parameter vectors β∗1 and β∗2.

Under the null hypothesis, however, the combined model simplifies to Model 1, which suggests we just treat the combined regression in the usual way and jointly test for β∗2 = 0. The problem is that X1 and X2 may, and likely will, share variables (columns), so extreme collinearity will be a problem and the unrestricted regression is infeasible. However, any linear combination of the columns of X2 will also have a zero coefficient under the null. This suggests the following two-step approach: in the first step, regress y on X2 and calculate the fitted values, say \(\hat{y}_2 = X_2(X_2'X_2)^{-1}X_2'y\); in the second, we regress using the model

\[
y = X_1\beta_1^* + \gamma\hat{y}_2 + u.
\]

Under the null hypothesis, at least asymptotically, γ = 0, so we can just use the usual t-ratio, which is called the "J-statistic" by D-M, to test whether γ = 0. If we fail to reject, we conclude that Model 1 is correct.

If we reject, then we conclude that X2 adds important explanatory power to Model 1. We should now entertain the possibility that Model 2 is the correct model and run the test the other way, reversing the roles of Models 1 and 2. If we don't reject the reversed hypothesis, then we conclude that Model 2 is correct.

A very real possibility, and one that frequently occurs in practice, is that we reject going both ways, so neither Model 1 nor Model 2 is correct. This is a particular problem in large samples, since we know all models are misspecified at some level and this will be revealed asymptotically with consistent tests. Unfortunately, if we are still interested in choosing between the two models, this approach gives us no guidance. Thus the sequel of this section will deal with how to choose between the models when they are both misspecified.
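A minimal sketch of the two-step J-test; in large samples the returned ratio can be compared with standard normal critical values. The function name and data layout are our own choices.

```python
import numpy as np

def j_test(y, X1, X2):
    """Davidson-MacKinnon two-step: add fitted values from Model 2 to Model 1
    and return the t-ratio on them (the J-statistic)."""
    yhat2 = X2 @ np.linalg.lstsq(X2, y, rcond=None)[0]
    Z = np.column_stack([X1, yhat2])
    n, k = Z.shape
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    e = y - Z @ b
    s2 = e @ e / (n - k)
    cov = s2 * np.linalg.inv(Z.T @ Z)
    return b[-1] / np.sqrt(cov[-1, -1])   # compare with N(0,1) in large samples

# Run it both ways: j_test(y, X1, X2) tests Model 1; j_test(y, X2, X1) tests Model 2.
```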


15.5.2 Residual Variance Approach

First, we will develop a metric that can be interpreted as a measure of misspecification. Now, for the first model we obtain

\[
\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'y
\]
\[
\begin{aligned}
e_1 &= y - \hat{y}_1 \\
&= y - X_1\hat{\beta}_1 \\
&= y - X_1(X_1'X_1)^{-1}X_1'y \\
&= (I_n - X_1(X_1'X_1)^{-1}X_1')y \\
&= M_1 y
\end{aligned}
\]

and

\[
s_1^2 = \frac{e_1'e_1}{n-k_1} = \frac{y'M_1'M_1y}{n-k_1} = \frac{y'M_1y}{n-k_1}
\]
Similarly, for the second model, we have

\[
\hat{\beta}_2 = (X_2'X_2)^{-1}X_2'y
\]

and

\[
s_2^2 = \frac{y'M_2y}{n-k_2}, \qquad \text{where } M_2 = I_n - X_2(X_2'X_2)^{-1}X_2'.
\]

Now, if the first model is correct

E[s21] = σ2

while

\[
\begin{aligned}
E[s_2^2] &= E\left[\frac{y'M_2y}{n-k_2}\right] \\
&= E\left[\frac{(X_1\beta_1 + u_1)'M_2(X_1\beta_1 + u_1)}{n-k_2}\right] \\
&= \frac{1}{n-k_2}E\left[(X_1\beta_1 + u_1)'M_2(X_1\beta_1 + u_1)\right] \\
&= \frac{1}{n-k_2}E\left[\beta_1'X_1'M_2X_1\beta_1 + 2u_1'M_2X_1\beta_1 + u_1'M_2u_1\right] \\
&= \frac{\beta_1'X_1'M_2X_1\beta_1}{n-k_2} + \sigma^2 \;\geq\; E[s_1^2]
\end{aligned}
\]

Thus, s² is smallest, on average, for the correct specification. This suggests a criterion where we select the model with the smallest estimated variance. This is the residual-variance criterion. And the expression \(\beta_1'X_1'M_2X_1\beta_1/(n-k_2)\) can be taken as a measure of misspecification in terms of mean squared error.


Provided that both models have the same dependent variable, which was the case above, it is tempting to select between them on the basis of "goodness-of-fit" or R², which tells us the percentage of total variation in the dependent variable that is explained by the model. The problem with this approach is that R² always rises as we add explanatory variables, so we would end up with a model using all possible explanatory variables, whether or not they really enter the relationship. Such a model would not work well out-of-sample if the extra variables move independently of the correctly included variables.

A possible solution is to penalize the statistic for the number of included regressors. Specifically, we propose the adjusted R² statistic

\[
\begin{aligned}
\bar{R}^2 &= 1 - \frac{n-1}{n-k}(1-R^2) \\
&= 1 - \frac{e'e/(n-k)}{\sum_{i=1}^{n}(y_i - \bar{y})^2/(n-1)} \\
&= 1 - \frac{s^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2/(n-1)}
\end{aligned}
\]

Note that ranking models in terms of increasing \(\bar{R}^2\) is equivalent to ranking them in terms of decreasing s², since the other elements in the expression do not change between models. One advantage of the \(\bar{R}^2\) approach is that it is unit- and scale-free.

15.5.3 A Model Selection Statistic

For the residual variance criterion, the rather obvious question is whether one estimated variance is "enough" larger than another to justify selection of one model over another. We will formulate a test statistic as a guide in selecting between any two models. Specifically, let

\[
\begin{aligned}
m_i &= \frac{n}{n-k_2}(y_i - x_{2i}'\hat{\beta}_2)^2 - \frac{n}{n-k_1}(y_i - x_{1i}'\hat{\beta}_1)^2 \\
\bar{m} &= \frac{1}{n}\sum_{i=1}^{n} m_i = s_2^2 - s_1^2 \\
v_m &= \frac{1}{n}\sum_{i=1}^{n}(m_i - \bar{m})^2
\end{aligned}
\]

and define

\[
D_n = \frac{\sqrt{n}\,\bar{m}}{\sqrt{v_m}}
\]
as our statistic of interest. Suppose \(\bar{m} \xrightarrow{p} 0\); then under appropriate conditions \(D_n \xrightarrow{d} N(0,1)\) and tail occurrences (\(|D_n| > 1.96\), say) would be rare events.

Suppose that the true model is denoted by subscript 0; then \(\bar{m} \xrightarrow{p} 0\) implies \(s_2^2 - s_1^2 \xrightarrow{p} 0\) and hence
\[
\frac{\beta_1'X_1'M_0X_1\beta_1}{n-k_0} - \frac{\beta_2'X_2'M_0X_2\beta_2}{n-k_0} \xrightarrow{p} 0.
\]
If, from above,


we take \(\beta_l'X_l'M_0X_l\beta_l/(n-k_0)\) as an index of misspecification of model l relative to the true model 0, then we see that \(\bar{m} \xrightarrow{p} 0\) indicates that the two models are equally misspecified. This leaves us with three possibilities:

(1) Model 1 is "equivalent" to Model 2: \(D_n \xrightarrow{d} N(0,1)\)

(2) Model 1 is "better" than Model 2: \(D_n \xrightarrow{a.s.} +\infty\)

(3) Model 2 is "better" than Model 1: \(D_n \xrightarrow{a.s.} -\infty\).

Large positive values favor Model 1. Large negative values favor Model 2. The intermediate values (e.g., |Dn| ≤ 1.96 at the 5 percent level) comprise an inconclusive region.
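A sketch of the statistic under the setup above; both candidate models are fit by OLS, and the m_i, their mean, and their variance follow the definitions given. The function name is ours.

```python
import numpy as np

def model_selection_Dn(y, X1, X2):
    """D_n = sqrt(n) * mbar / sqrt(v_m), comparing two fitted models."""
    n = len(y)
    def scaled_sq_resid(X):
        k = X.shape[1]
        e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        return (n / (n - k)) * e ** 2
    m = scaled_sq_resid(X2) - scaled_sq_resid(X1)     # m_i, i = 1..n
    mbar = m.mean()
    v_m = np.mean((m - mbar) ** 2)
    return np.sqrt(n) * mbar / np.sqrt(v_m)

# Large positive values favor Model 1 (X1); large negative values favor Model 2 (X2).
```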

15.5.4 Stepwise Regression

In the above model selection approaches, we have reduced the number of competing models to a small number before undertaking statistical inference. Such reductions are usually guided by competing economic theories and experience from previous empirical exercises. Such reductions are really at the heart of econometrics, which is seen as the intersection of data, economic theory, and statistical analysis. The model we end up with really depends on all three contributors to winnow the competing models down to a select few. Such an approach is at the heart of structural and even informed reduced-form modeling.

It is of interest to think about the model selection process from a more data-driven standpoint. This might be the case if economic theory does not give firm guidance as to which explanatory variables enter the relationship for which dependent variable. Such a situation might be characterized by the true model having the form

y = X1β1 + u1

where the variables comprising X1 are not known but are a subset of the available economic variables X. Somehow, we would like to use the data to determine the variables entering the relationship. Unfortunately, some problems arise in any purely data-driven approach, as we will see below.

At this point, it seems appropriate to compare the consequences of incorrectly omitting a variable against incorrectly including an extraneous variable. They are really two sides of the same coin, but the consequences are very different. Omitting the variable incorrectly leads to bias and inconsistency and incorrect size due to an increased probability of a Type I error. Including an extraneous variable increases the probability of a Type II error and hence decreases power, but the test will still be consistent. Thus we are faced with the oft-encountered trade-off between size and power in deciding how to conduct inference.

In the usual inferential exercise we carefully calibrate the test so that the size takes on a prespecified value, such as .05, and take the power that results. Implicitly, this approach places greater value on getting the size correct at the


expense of power, and the consequences of omitted variables should be viewed as more serious than those of including extraneous variables. This suggests the "kitchen-sink" approach when deciding which variables to include in the model under construction. When in doubt, leave it in, would seem to be good advice based on this reasoning.

An approach suggested by this reasoning is the backward elimination stepwise regression approach. In this approach, we start with a regression including all candidate regressors and eliminate them one-by-one on the basis of statistical insignificance. The resulting regression model will have all the regressors statistically significant, at some level, by the measure used. An example of a decision criterion at each step would be to eliminate the variable with the smallest absolute t-ratio until the remaining t-ratios are no smaller than one. At this point the elimination of any of the remaining variables would reduce \(\bar{R}^2\) and not be preferred.

There are a couple of problems with this approach. The first is the fact that we do not necessarily end up with the best set of regressors that are significant. Some variables that were eliminated in early rounds may end up being very important once others have been eliminated. A modification is to entertain all possible competing models that have a given number of explanatory variables, but this becomes very complicated when the number of variables in X is large.

A second problem is that the regression model that results from this process would not have standard statistical properties. This is easily seen. The usual ratios on the estimated coefficients could not have a t distribution, because small values of the ratio have been eliminated in the selection process. For example, all the ratios would be larger in absolute value than, say, 1. Nor could they even be normal in large samples. This is an example of what is called pre-test bias. If parts of a model have been selected on the basis of a previous testing step, tests in the current step will not have the usual properties.

The serious problems with the stepwise regression approach indicated by the non-standard distribution can be seen as resulting from data-mining. Suppose we start out with a variable yi, which we label as the dependent variable, and a large set of other variables, which we label as candidate regressors. In addition, suppose that the candidate regressors have been randomly generated and bear no true relationship to the dependent variable. If the sample size is not large and the number of candidate regressors is large, then just by chance a subset of the regressors will be correlated with the dependent variable.

Specifically, a subset will have a large t-ratio just by chance. This is also easy to see. The true regression coefficients are all zero, so the t-ratios will have a t-distribution and a certain fraction, say 0.16 for 30 degrees of freedom, will have t-ratios larger than 1. But this is purely a result of spurious correlation. Thus we see that the stepwise regression approach is fraught with difficulties. It is really a pure data-mining approach and cannot be expected to yield important economic insights. Instead we should let our choice of variables and models be guided to the extent possible by economic theory and previous econometric results.