
University of Lausanne - École des HEC

LECTURE NOTES

ADVANCED ECONOMETRICS

Preliminary version, do not quote, cite, or reproduce without permission

Professor: Florian Pelgrin

2009-2010

CHAPTER 2: THE MULTIPLE LINEAR REGRESSION MODEL (PART II)


Contents

1 Some concepts in point estimation
  1.1 What is an estimator?
  1.2 What constitutes a "good" estimator?
  1.3 Decision theory
  1.4 Unbiased estimators
  1.5 Best unbiased estimator in a parametric model
  1.6 Best invariant unbiased estimators

2 Statistical properties of the OLS estimator
  2.1 Semi-parametric model
      Nonstochastic regressors
      Stochastic regressors
  2.2 Parametric model
      Fixed regressors
      Stochastic regressors


In this part, some of the statistical properties of the ordinary least squares estimator are reviewed. More specifically, we focus on the unbiasedness and efficiency properties of the ordinary least squares estimator when the multiple linear regression model is parametric or semi-parametric. In the next part, the asymptotic properties of the ordinary least squares estimator are analyzed.

1 Some concepts in point estimation

Before presenting the unbiasedness and efficiency properties of the ordinary least squares estimator, it is important to define some concepts that will be used throughout this part and the course.

Generally speaking, point estimation refers to providing a "best guess" of some quantity of interest. The latter could be the parameters of the regression function, the probability density function in a nonparametric model, etc.

1.1 What is an estimator?

To define an estimator, we consider the simple case in which we have n i.i.d. (independent and identically distributed) random variables $Y_1, Y_2, \dots, Y_n$, where each random variable is characterized by a parametric distribution and thus depends on a parameter (or a vector of parameters), say $\theta$, and we observe the realizations of these random variables. In this respect, one can define an estimator of $\theta$ as follows.

Definition 1. A point estimator is any function $T(Y_1, Y_2, \dots, Y_n)$ of a sample. Any statistic is a point estimator.

Examples: Assume that Y1,· · · ,Yn are i.i.d. N (m, σ2) random variables.

1. The sample mean
\[
\bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i
\]
is a point estimator (or an estimator) of m.

2. The sample variance
\[
S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(Y_i - \bar{Y}_n\right)^2
\]
is a point estimator of $\sigma^2$.


Remarks:

1. The definition does not require any correspondence between the estimator and the parameter to be estimated.

2. In the previous definition, there is no mention regarding the range of the statistic $T(Y_1, \dots, Y_n)$. More specifically, the range of the statistic can be different from that of the parameter.

3. An estimator is a function of the sample: it is a random variable (or vector).

4. An estimate is the realized value of an estimator (i.e., a number) that is obtained when a sample is actually taken. For instance, $\bar{y}_n$ is an estimate of $\bar{Y}_n$ and is given by:
\[
\bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} y_i.
\]
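As a minimal numerical illustration (not part of the original notes) of the distinction between an estimator, which is a random variable, and an estimate, which is a realized number, the following Python sketch draws one sample from a normal distribution and evaluates the two estimators above. The values of m, σ and n are arbitrary choices.

```python
import numpy as np

# Draw one sample of size n from N(m, sigma^2) and compute the estimates
# produced by the sample mean and the sample variance.
rng = np.random.default_rng(0)
m, sigma, n = 2.0, 1.5, 200          # assumed "true" values, chosen arbitrarily

y = rng.normal(loc=m, scale=sigma, size=n)   # one realized sample y_1, ..., y_n

y_bar = y.mean()                      # estimate of m (realization of the sample mean)
s2 = y.var(ddof=1)                    # estimate of sigma^2 (divisor n - 1)

print(f"estimate of m       : {y_bar:.4f}  (true value {m})")
print(f"estimate of sigma^2 : {s2:.4f}  (true value {sigma**2})")
```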

1.2 What constitutes a "good" estimator?

While this question is simple, there is no straightforward answer... A "good" estimator may be characterized in different ways [1]. For instance, one may consider a "good" estimator as one that:

1. ...minimizes a given loss function and has the smallest risk;

2. ...is unbiased, i.e.
\[
E(T(Y)) = \theta \quad \text{or} \quad E(T(Y)) = g(\theta)
\]
where g is known;

3. ...satisfies some asymptotic properties (when the sample size is large);

4. ...is efficient, i.e. has the minimum variance among all estimators of the quantity of interest;

5. ...is the best estimator in a restricted class of estimators that satisfies some desirable properties (search within a subclass), for instance the class of unbiased estimators;

6. ...is the best estimator, which has some appropriate properties, by maximizing or minimizing a criterion (or objective function);

[1] For the sake of simplicity, we will almost always consider a parametric statistical model.


7. ...

Some of these "different" interpretations are briefly reviewed below [2].

1.3 Decision theory

A first answer to our question is found in decision theory, which is a formal theory for comparing statistical procedures.

Let $\hat{\theta}_n = T(Y)$ denote an estimator of $\theta$ and $\theta_0$ the true value of the parameter $\theta$. We can define a loss function, $L(\hat{\theta}_n, \theta_0)$, which tells us the loss that occurs when $\hat{\theta}_n$ is used while the true value is $\theta_0$. As explained before (Chapter 2, Part I), a common loss function is the quadratic one:
\[
L(\hat{\theta}_n, \theta_0) = \left(\hat{\theta}_n - \theta_0\right)^2
\quad \text{or} \quad
L(\hat{\theta}_n, \theta_0) = \left(\hat{\theta}_n - g(\theta_0)\right)^2.
\]

However, other loss functions are possible.

1. Absolute norm:
\[
L_1\left(\hat{\theta}_n, \theta\right) = \left|\hat{\theta}_n - \theta\right|
\]

2. Truncated loss function:
\[
L\left(\hat{\theta}_n, \theta\right) =
\begin{cases}
\left|\hat{\theta}_n - \theta\right| & \text{if } \left|\hat{\theta}_n - \theta\right| \le c \\
c & \text{if } \left|\hat{\theta}_n - \theta\right| > c.
\end{cases}
\]

3. The zero-one loss function:
\[
L\left(\hat{\theta}_n, \theta\right) =
\begin{cases}
0 & \text{if } \hat{\theta}_n = \theta \\
1 & \text{if } \hat{\theta}_n \neq \theta.
\end{cases}
\]

4. The $L_p$ loss function:
\[
L\left(\hat{\theta}_n, \theta\right) = \left|\hat{\theta}_n - \theta\right|^p
\]

5. The Kullback-Leibler loss function:
\[
L\left(\hat{\theta}_n, \theta\right) = \int_{\mathcal{Y}} \log\left(\frac{f(y; \theta)}{f(y; \hat{\theta}_n)}\right) f(y; \theta)\,dy,
\]
where f is the probability density function of y.

[2] Note that the way of presenting these different interpretations is somewhat artificial in the sense that there are some relationships among them.


6. Etc.

In the sequel, we focus on the quadratic loss function. Given the loss function, we can calculate the risk function, i.e. the expected value of the loss, for any $\hat{\theta}_n$.

Definition 2. The risk function, i.e. the expected value of the loss, is defined to be:
\[
R\left(\hat{\theta}_n, \theta_0\right) = E\left[L\left(\hat{\theta}_n, \theta_0\right)\right] = E\left[L\left(T(Y), \theta_0\right)\right]
= \int_{\mathcal{Y}} L\left(T(y), \theta_0\right) f(y; \theta_0)\,dy.
\]

For example, the risk function associated with the quadratic loss function is called the mean squared error (MSE):
\[
R\left(\hat{\theta}_n, \theta\right) = E\left[\left(\hat{\theta}_n - \theta\right)^2\right].
\]

The mean squared error can be written as follows:
\[
R\left(\hat{\theta}_n, \theta\right) = V_{\theta}\left(\hat{\theta}_n\right) + \left[B\left(\hat{\theta}_n\right)\right]^2
\]
where $B\left(\hat{\theta}_n\right) = E\left(\hat{\theta}_n\right) - \theta$ is the bias of the estimator $\hat{\theta}_n$.

In particular, if $E\left(\hat{\theta}_n\right) = \theta$ for all $\theta$, then $\hat{\theta}_n$ is an unbiased estimator of $\theta$ and
\[
R\left(\hat{\theta}_n, \theta\right) = V_{\theta}\left(\hat{\theta}_n\right).
\]

Exercises:

1. Consider a quadratic loss function such that the risk function is given by:
\[
R\left(\hat{\theta}_n, \theta\right) = E_{\theta}\left[\left(\hat{\theta}_n - g(\theta)\right)^2\right].
\]
Show that:
\[
R\left(\hat{\theta}_n, \theta\right) = V_{\theta}\left(\hat{\theta}_n\right) + \left[B\left(\hat{\theta}_n\right)\right]^2
\quad \text{where} \quad
B\left(\hat{\theta}_n\right) = E\left(\hat{\theta}_n\right) - g(\theta).
\]


2. Let $Y_1$ and $Y_2$ be two i.i.d. $\mathcal{P}(\lambda)$ random variables and let $\bar{Y}_2$ and $S_2^2$ denote:
\[
\bar{Y}_2 = \frac{1}{2}\sum_{i=1}^{2} Y_i
\qquad
S_2^2 = \frac{1}{2-1}\sum_{i=1}^{2}\left(Y_i - \bar{Y}_2\right)^2 = \frac{(Y_1 - Y_2)^2}{2}.
\]

(a) Show that:
\[
R\left(\bar{Y}_2, \lambda\right) = \frac{\lambda}{2}
\qquad
R\left(S_2^2, \lambda\right) = \frac{\lambda}{2} + 2\lambda^2.
\]

(b) Conclude.

3. Let $X_1, \dots, X_n$ denote a sequence of i.i.d. $\mathcal{N}(0, \sigma^2)$ random variables. An estimator of $\sigma^2$ is given by:
\[
S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2
\quad \text{where} \quad
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i.
\]

(a) Show that $S_n^2$ is an unbiased estimator of $\sigma^2$.

(b) Determine the root mean squared error of $S_n^2$.

(c) Consider another estimator:
\[
\hat{\sigma}_n^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2.
\]

(i) Is it an unbiased estimator of $\sigma^2$? An asymptotically unbiased estimator of $\sigma^2$?

(ii) Determine the root mean squared error of $\hat{\sigma}_n^2$.

(iii) Compare with question (b). Conclude. (A numerical sketch follows below.)
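The following Python sketch (not part of the original notes) checks the comparison in question (iii) by Monte Carlo: it approximates the bias and root mean squared error of both $S_n^2$ and $\hat{\sigma}_n^2$ for a small sample size. The values of n, $\sigma^2$ and the number of replications are arbitrary.

```python
import numpy as np

# Monte Carlo comparison of the two variance estimators (assumed settings).
rng = np.random.default_rng(1)
n, sigma2, n_rep = 10, 2.0, 200_000

x = rng.normal(scale=np.sqrt(sigma2), size=(n_rep, n))
s2_unbiased = x.var(axis=1, ddof=1)   # S_n^2  (divisor n - 1)
s2_naive = x.var(axis=1, ddof=0)      # hat sigma_n^2  (divisor n)

for name, est in [("S_n^2 (1/(n-1))", s2_unbiased), ("sigma_n^2 (1/n)", s2_naive)]:
    bias = est.mean() - sigma2
    rmse = np.sqrt(np.mean((est - sigma2) ** 2))
    print(f"{name:18s}  bias = {bias:+.4f}   RMSE = {rmse:.4f}")

# Typically the 1/n estimator is biased downward but has the smaller RMSE,
# illustrating that unbiasedness alone does not determine the better estimator.
```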

Taking Definition 2, a "good" estimator is one that has small risk and the best estimator has the smallest risk. Therefore, to compare two (or more) estimators one can compare their risk functions. At first glance, the risk function depends on the unknown true parameter vector $\theta_0$ and thus the previous definition is not operational. However, the risk function (and loss function) can still be redefined as follows:

\[
R\left(\hat{\theta}_n, \theta\right) = E\left[L\left(\hat{\theta}_n, \theta\right)\right] = \int_{\mathcal{Y}} L\left(T(y), \theta\right) f(y; \theta)\,dy.
\]

But there is still another problem: it may happen that no estimator (among different estimators) uniformly dominates the others [3]. Indeed, consider the following example.

Example: Suppose that $X \sim \mathcal{N}(\theta, 1)$ and that the loss function is the quadratic one. Consider two estimators, $\hat{\theta}_1 = X$ and $\hat{\theta}_2 = 3$ (point-mass distribution).

• The risk function of the first estimator is:
\[
R(\hat{\theta}_1, \theta) = E_{\theta}(X - \theta)^2 = 1.
\]

• The risk function of the second estimator is:
\[
R(\hat{\theta}_2, \theta) = E_{\theta}(3 - \theta)^2 = (3 - \theta)^2.
\]

• If $2 < \theta < 4$, then
\[
R(\hat{\theta}_2, \theta) < R(\hat{\theta}_1, \theta).
\]

• Neither estimator uniformly dominates the other.
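A short simulation (an illustrative sketch, not from the notes) makes the non-domination visible by approximating both risk functions on a grid of true values of $\theta$; the grid and the number of replications are arbitrary choices.

```python
import numpy as np

# Approximate the quadratic risk of theta_hat_1 = X and theta_hat_2 = 3
# for X ~ N(theta, 1), on a grid of true parameter values.
rng = np.random.default_rng(2)
thetas = np.linspace(0.0, 6.0, 13)   # arbitrary grid of true values
n_rep = 100_000

for theta in thetas:
    x = rng.normal(loc=theta, scale=1.0, size=n_rep)
    risk1 = np.mean((x - theta) ** 2)        # should be close to 1 for every theta
    risk2 = (3.0 - theta) ** 2               # exact risk of the constant estimator
    better = "theta_hat_2" if risk2 < risk1 else "theta_hat_1"
    print(f"theta = {theta:4.1f}   R1 ~ {risk1:.3f}   R2 = {risk2:.3f}   smaller risk: {better}")

# The constant estimator wins only for 2 < theta < 4: neither dominates uniformly.
```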

In this case, how can we proceed? Different strategies are possible. To this end, let us first define the maximum risk and the Bayes risk.

Definition 3. The maximum risk is defined to be:
\[
\bar{R}\left(\hat{\theta}_n\right) = \sup_{\theta} R\left(\hat{\theta}_n, \theta\right).
\]
The Bayes risk is given by:
\[
r\left(f_0, \hat{\theta}_n\right) = \int R\left(\hat{\theta}_n, \theta\right) f_0(\theta)\,d\theta
\]
where $f_0$ is a prior density function for $\theta$.

Taking this definition, we can define (at least) two strategies.

[3] A given loss function only yields (in general) a pre-ordering and not a total order!


• Strategy 1: Find an estimator that works "well" for a range of values of $\theta$. For instance, if we know that $\theta \in \Theta$, we might try to find an estimator that solves:
\[
\min_{\hat{\theta}_n} \max_{\theta \in \Theta} R\left(\hat{\theta}_n, \theta\right).
\]
This is a so-called minimax estimator or minimax rule.

Remark: More generally, $\hat{\theta}_n$ is minimax if:
\[
\sup_{\theta} R\left(\hat{\theta}_n, \theta\right) = \inf_{\tilde{\theta}} \sup_{\theta} R\left(\tilde{\theta}, \theta\right),
\]
where the infimum is taken over all estimators $\tilde{\theta}$.

• Strategy 2: Find a decision rule that minimizes the Bayes risk.

Definition 4. A decision rule that minimizes the Bayes risk is called a Bayes rule. More specifically, $\hat{\theta}_n$ is a Bayes rule with respect to the prior $f_0$ if:
\[
r\left(f_0, \hat{\theta}_n\right) = \inf_{\tilde{\theta}} r\left(f_0, \tilde{\theta}\right),
\]
where the infimum is taken over all estimators $\tilde{\theta}$.

During this course, we will not consider these strategies. However, it should be stressed that decision theory is an important part of the statistical foundations of econometrics and that a number of concepts introduced in this course are derived from it!

1.4 Unbiased estimators

A "good" estimator may be defined as an unbiased one.

Definition 5. Given a parametric model $(\mathcal{Y}, P_{\theta}, \theta \in \Theta)$, an estimator $T(Y)$ is unbiased for $\theta$ if:
\[
E_{\theta}\left[T(Y)\right] = \theta \quad \text{for all } \theta \in \Theta.
\]

This definition can be generalized to any function of θ.


Definition 6. Given a parametric model $(\mathcal{Y}, P_{\theta}, \theta \in \Theta)$, an estimator $T(Y)$ is unbiased for a function $g(\theta) \in \mathbb{R}^p$ of the parameter $\theta$ if:
\[
E_{\theta}\left[T(Y)\right] = g(\theta) \quad \text{for all } \theta \in \Theta.
\]

Examples:

1. Let $Y_1, \dots, Y_n$ be a random sample from a Bernoulli distribution with parameter p. An unbiased estimator of p is:
\[
T(Y) = \frac{1}{n}\sum_{i=1}^{n} Y_i.
\]

2. Let $Y_1, \dots, Y_n$ be a random sample from the uniform distribution $\mathcal{U}_{[0,\theta]}$. An unbiased estimator of $\theta$ is:
\[
T(Y) = \frac{2}{n}\sum_{i=1}^{n} Y_i.
\]

3. Let $T(Y)$ be an unbiased estimator of $g(\theta)$. Then the linear transformation $A T(Y) + B$ is an unbiased estimator of $A g(\theta) + B$, where A and B are constant (nonrandom) matrices.

4. Consider the multiple linear regression model:
\[
Y = X\beta_0 + u
\]
where $Y \in \mathbb{R}^n$, $X \in \mathcal{M}_{n \times k}$ is nonrandom, $E(u) = 0_{n \times 1}$, and $V(u) = \sigma^2 I_n$. The OLS estimator
\[
T(Y) = \left(X^t X\right)^{-1} X^t Y
\]
is an unbiased estimator of $\beta_0$.

5. Consider the generalized multiple linear regression model:
\[
Y = X\beta_0 + u
\]
where $Y \in \mathbb{R}^n$, $X \in \mathcal{M}_{n \times k}$ is nonrandom, $E(u) = 0_{n \times 1}$, and $V(u) = \sigma^2 \Omega$ (the matrix $\Omega$ is known). The generalized least squares (GLS) estimator
\[
T(Y) = \left(X^t \Omega^{-1} X\right)^{-1} X^t \Omega^{-1} Y
\]
is an unbiased estimator of $\beta_0$.


More generally, we can define the unbiasedness of an estimator conditionally on some random variables. This will be particularly useful in the multiple linear regression model when one assumes a mixture of stochastic and nonstochastic regressors.

Definition 7. $T(X, Y)$ is a conditionally unbiased estimator of $\theta$ if and only if:
\[
E_{\theta}\left[T(Y, X) \mid X = x\right] = \theta \quad \text{for all } \theta \in \Theta \text{ and } \forall x \in \mathcal{X}.
\]

This definition also generalizes to the case of a function g of $\theta$.

Definition 8. $T(X, Y)$ is a conditionally unbiased estimator of $g(\theta)$ if and only if:
\[
E_{\theta}\left[T(Y, X) \mid X = x\right] = g(\theta) \quad \text{for all } \theta \in \Theta \text{ and } \forall x \in \mathcal{X}.
\]

Example: Conditional static linear models. The OLS estimator given by:
\[
\hat{\beta}_n = \left(X^t X\right)^{-1} X^t Y
\]
is unbiased for all $\beta \in \Theta$ and $\forall x \in \mathcal{X}$. We have:
\[
\hat{\beta}_n = \beta_0 + \left(X^t X\right)^{-1} X^t u.
\]
Taking the expectation conditionally on X, we get:
\[
E\left[\hat{\beta}_n \mid X\right] = \beta_0 + E\left[\left(X^t X\right)^{-1} X^t u \mid X\right]
= \beta_0 + \left(X^t X\right)^{-1} X^t E\left[u \mid X\right] = \beta_0.
\]
Therefore, by the law of iterated expectations,
\[
E\left[\hat{\beta}_n\right] = E_X\left[E\left[\hat{\beta}_n \mid X\right]\right] = \beta_0 \quad \forall \beta \in \Theta \text{ and } \forall x \in \mathcal{X}.
\]
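As a hedged numerical check (not in the original notes), the following sketch draws many error vectors for one fixed realization of the design matrix X and verifies that the average of the OLS estimates is close to $\beta_0$, illustrating conditional unbiasedness. The design, $\beta_0$ and $\sigma$ are arbitrary.

```python
import numpy as np

# Conditional unbiasedness of OLS: fix X, redraw u many times, average beta_hat.
rng = np.random.default_rng(3)
n, k = 50, 3
beta0 = np.array([1.0, -2.0, 0.5])        # assumed true coefficients
sigma = 1.0

X = rng.normal(size=(n, k))               # one draw of the regressors, then held fixed
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)

n_rep = 20_000
estimates = np.empty((n_rep, k))
for r in range(n_rep):
    u = rng.normal(scale=sigma, size=n)   # E(u | X) = 0, V(u | X) = sigma^2 I_n
    y = X @ beta0 + u
    estimates[r] = XtX_inv_Xt @ y         # OLS: (X'X)^{-1} X'y

print("average of beta_hat over replications:", estimates.mean(axis=0).round(3))
print("true beta0                            :", beta0)
```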

Remarks:


1. The unbiasedness condition must hold for every possible value of the parameter and not only for some of these values.

For instance, if $E_{\theta}\left[T(Y)\right] = \theta$ only when $\theta = \theta_0$, then the estimator is not unbiased because the unbiasedness condition is not satisfied for every other parameter value.

2. In general, the property of unbiasedness is not preserved after a nonlinear transformation of the estimator.

Proposition 1. Let $T(Y)$ be an unbiased estimator of $\theta$. If h is a nonlinear function, then the equality $E_{\theta}\left[h(T(Y))\right] = h\left[E_{\theta}(T(Y))\right]$ no longer holds in general, so $h(T(Y))$ is not necessarily an unbiased estimator of $h(\theta)$.

One can show the following proposition.

Proposition 2. Let $T(Y)$ be an unbiased estimator of $\theta \in \mathbb{R}$.

(a) If h is convex, then $h(T(Y))$ overestimates $h(\theta)$ on average.

(b) If h is concave, then $h(T(Y))$ underestimates $h(\theta)$ on average.

Example: Let $Y_1, \dots, Y_n$ be i.i.d. with mean m and variance $\sigma^2$. The sample variance
\[
S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(Y_i - \bar{Y}_n\right)^2
\]
is an unbiased estimator of $\sigma^2$. However, $\sqrt{S_n^2}$ is a biased estimator of $\sigma$ and it underestimates the true value (since the square root is concave).

3. Asymptotically unbiased estimators:

Definition 9. The sequence of estimators $\hat{\theta}_n \equiv T_n(Y)$ (with $n \in \mathbb{N}$) is asymptotically unbiased if:
\[
\lim_{n \to \infty} E_{\theta}\left(T_n(Y)\right) = \theta \quad \text{for all } \theta \in \Theta,
\]
where $E_{\theta}$ is defined with respect to $P_{\theta, n}$.


An estimator $\hat{\theta}_n$ is asymptotically unbiased if the sequence of estimators $(\hat{\theta}_n)$ is asymptotically unbiased.

4. Existence of an unbiased estimator:

Proposition 3. If $\theta$ is a non-identified parameter, then there does not exist an unbiased estimator of $\theta$. If $g(\theta)$ is a non-identified parameter function, then there does not exist an unbiased estimator of $g(\theta)$.

A necessary condition for the existence of an unbiased estimator is the identification of the parameter function to be estimated.

5. The unbiasedness condition writes:
\[
E_{\theta}\left[T(Y)\right] = \int_{\mathcal{Y}} T(y)\,\ell(y; \theta)\,dy = g(\theta)
\]
for all $\theta \in \Theta$. Differentiating this condition with respect to $\theta$ yields:
\[
\frac{\partial g}{\partial \theta^t}(\theta)
= \frac{\partial}{\partial \theta^t} \int_{\mathcal{Y}} T(y)\,\ell(y; \theta)\,dy
= \int_{\mathcal{Y}} T(y)\,\frac{\partial \ell}{\partial \theta^t}(y; \theta)\,dy
= \int_{\mathcal{Y}} T(y)\,\frac{\partial \log \ell}{\partial \theta^t}(y; \theta)\,\ell(y; \theta)\,dy
= E_{\theta}\left[T(Y)\,\frac{\partial \log \ell}{\partial \theta^t}(Y; \theta)\right].
\]

While the unbiasedness property may appear to be interesting per se, it is not so much! More specifically,

(a) The absence of bias is not a sufficient criterion to discriminate among competing estimators.

(b) A best unbiased estimator may be inadmissible (e.g., Stein's estimator; see further).

(c) There may exist many unbiased estimators for the same parameter (vector) of interest.


(d) This is also true if one requires that the estimator is asymptotically unbiased.

To illustrate this statement, consider the simple linear regression model:
\[
y_i = x_i \beta_0 + u_i
\]
where $x_i$ is non-random and the error terms $u_i$ are i.i.d. $\mathcal{N}(0, \sigma^2)$. The OLS estimator
\[
\hat{\beta}_n = \left(\sum_{i=1}^{n} x_i^2\right)^{-1} \left(\sum_{i=1}^{n} x_i y_i\right)
\]
is an unbiased estimator of $\beta_0$. However, is it unique? Unfortunately, no! The following three estimators, which are linear with respect to Y, are also unbiased:
\[
\hat{\beta}_{1,n} = \frac{1}{n}\sum_{i=1}^{n} \frac{y_i}{x_i},
\qquad
\hat{\beta}_{2,n} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i},
\qquad
\hat{\beta}_{3,n} = \frac{1}{n-1}\sum_{i=2}^{n} \frac{y_i - y_{i-1}}{x_i - x_{i-1}}.
\]
More generally, any estimator of the form $\sum_{i=1}^{n} w_i y_i$ (linear with respect to Y, with non-random weights $w_i$) such that:
\[
\sum_{i=1}^{n} w_i x_i = 1
\]
is unbiased.

Exercise: Verify the previous condition for $\hat{\beta}_{1,n}$, $\hat{\beta}_{2,n}$, and $\hat{\beta}_{3,n}$.
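To see why unbiasedness alone does not single out an estimator, the sketch below (an illustration under assumed settings, not part of the notes) simulates the four linear unbiased estimators above and compares their Monte Carlo variances; the design points, $\beta_0$ and $\sigma$ are arbitrary. OLS should display the smallest variance, anticipating the Gauss-Markov theorem of Section 1.6.

```python
import numpy as np

# Compare the variances of four linear unbiased estimators of beta_0
# in the simple regression y_i = x_i * beta_0 + u_i (no intercept).
rng = np.random.default_rng(4)
n, beta0, sigma = 30, 2.0, 1.0
x = rng.uniform(0.5, 3.0, size=n)          # fixed, non-zero design points

n_rep = 50_000
results = {"OLS": [], "beta_1": [], "beta_2": [], "beta_3": []}
for _ in range(n_rep):
    y = x * beta0 + rng.normal(scale=sigma, size=n)
    results["OLS"].append(np.sum(x * y) / np.sum(x ** 2))
    results["beta_1"].append(np.mean(y / x))
    results["beta_2"].append(np.sum(y) / np.sum(x))
    results["beta_3"].append(np.mean(np.diff(y) / np.diff(x)))

for name, draws in results.items():
    draws = np.asarray(draws)
    print(f"{name:7s}  mean = {draws.mean():.4f}   variance = {draws.var():.5f}")
# All four means are close to beta_0 = 2, but OLS has the smallest variance.
```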

1.5 Best unbiased estimator in a parametric model

When an estimator is unbiased, its (matrix) quadratic risk function, given by:
\[
R_{\theta}\left(T(Y), \theta\right) = E_{\theta}\left[\left(T(Y) - \theta\right)\left(T(Y) - \theta\right)^t\right],
\]
reduces to its variance-covariance matrix $V_{\theta}(T(Y))$. Therefore, comparing two (or more) unbiased estimators becomes equivalent to comparing their variance-covariance matrices.


Definition 10. Suppose that $T_1(Y)$ and $T_2(Y)$ are two unbiased estimators. $T_1(Y)$ dominates $T_2(Y)$ if and only if:
\[
V_{\theta}(T_2(Y)) \succeq V_{\theta}(T_1(Y)),
\]
i.e. the matrix $V_{\theta}(T_2(Y)) - V_{\theta}(T_1(Y))$ is a positive semi-definite matrix for all $\theta \in \Theta$.

In this respect, a natural question is whether there exists a lower bound for the variance-covariance matrix of unbiased estimators in parametric models. To answer this question, we first need to define the so-called Fisher information matrix.

Definition 11. A parametric model with density $\ell(y; \theta)$, $\theta \in \Theta$, is defined to be regular if:

1. $\Theta$ is an open subset of $\mathbb{R}^p$;

2. $\ell(y; \theta)$ is differentiable with respect to $\theta$;

3. $\int_{\mathcal{Y}} \ell(y; \theta)\,dy$ is differentiable with respect to $\theta$ and
\[
\frac{\partial}{\partial \theta} \int_{\mathcal{Y}} \ell(y; \theta)\,dy = \int_{\mathcal{Y}} \frac{\partial}{\partial \theta} \ell(y; \theta)\,dy;
\]

4. The Fisher information matrix
\[
I(\theta) = E_{\theta}\left[\frac{\partial \log \ell(Y; \theta)}{\partial \theta} \, \frac{\partial \log \ell(Y; \theta)}{\partial \theta^t}\right]
\]
exists and is nonsingular for every $\theta \in \Theta$.

Taking this definition, the following fundamental theorem can be shown.

Theorem 1. Suppose that the parametric model is regular. Every estimator $T(Y)$ that is regular and unbiased for $\theta \in \Theta \subset \mathbb{R}^k$ has a variance-covariance matrix satisfying:
\[
V_{\theta}(T(Y)) \succeq I(\theta)^{-1}.
\]
The quantity $I(\theta)^{-1}$ is called the Frechet-Darmois-Cramer-Rao (FDCR) lower bound.


Remark: $T(Y)$ is regular if it is square integrable,
\[
E_{\theta}\|T(Y)\|^2 < \infty \quad \text{for all } \theta \in \Theta,
\]
and it satisfies:
\[
\frac{\partial}{\partial \theta} \int_{\mathcal{Y}} T(y)\,\ell(y; \theta)\,dy = \int_{\mathcal{Y}} T(y)\,\frac{\partial \ell}{\partial \theta}(y; \theta)\,dy.
\]

The previous theorem can also be stated for a function g of $\theta$.

Theorem 2. Suppose that the parametric model is regular. Every estimator $T(Y)$ that is regular and unbiased for $g(\theta) \in \mathbb{R}^p$ has a variance-covariance matrix satisfying:
\[
V_{\theta}(T(Y)) \succeq \frac{\partial g(\theta)}{\partial \theta^t} \, I(\theta)^{-1} \, \frac{\partial g(\theta)^t}{\partial \theta},
\]
where $\frac{\partial g(\theta)}{\partial \theta^t}$ is the $p \times k$ Jacobian matrix. The quantity $\frac{\partial g(\theta)}{\partial \theta^t} I(\theta)^{-1} \frac{\partial g(\theta)^t}{\partial \theta}$ is also called the Frechet-Darmois-Cramer-Rao lower bound.

We are now in a position to define the efficiency of an estimator.

Definition 12. Assume that the parametric model is regular. A regular unbiased estimator of $\theta$ (respectively $g(\theta)$) is efficient if its variance-covariance matrix equals the FDCR lower bound:
\[
V_{\theta}(T(Y)) = I(\theta)^{-1}
\quad \text{or} \quad
V_{\theta}(T(Y)) = \frac{\partial g(\theta)}{\partial \theta^t} \, I(\theta)^{-1} \, \frac{\partial g(\theta)^t}{\partial \theta}.
\]

Remarks:

1. It is worth noting that we restrict the class of estimators under consideration to the unbiased estimators.

Proposition 4. An efficient estimator of $\theta$ or $g(\theta)$ is optimal in the class of unbiased estimators.


2. An efficient estimator of θ or g(θ) is necessarily unique.

Proposition 5. The best unbiased estimator of $\theta$ or $g(\theta)$ is unique. Moreover, this best unbiased estimator is uncorrelated with the difference between itself and every other unbiased estimator of $\theta$ or $g(\theta)$.

Example: Let $X_1, \dots, X_n$ denote a sequence of i.i.d. $\mathcal{B}(p)$ random variables. The sample mean
\[
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i
\]
is an unbiased estimator of p and is the maximum likelihood estimator of p (see Chapter 3). We get:
\[
V(\bar{X}_n) = \frac{p(1-p)}{n}.
\]
The Fisher information is given by:
\[
E\left[\left(\frac{n\bar{X}_n}{p} - \frac{n - n\bar{X}_n}{1-p}\right)^2\right]
= E\left[\left(\frac{n(\bar{X}_n - p)}{p(1-p)}\right)^2\right]
= \frac{n}{p(1-p)}.
\]
Therefore $V(\bar{X}_n)$ equals the inverse of the Fisher information, i.e. the FDCR lower bound, and $\bar{X}_n$ is efficient.
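A quick Monte Carlo check (illustrative only, with arbitrary values of p and n, not part of the notes) compares the simulated variance of the Bernoulli sample mean with the FDCR lower bound $p(1-p)/n$.

```python
import numpy as np

# Efficiency of the Bernoulli sample mean: simulated variance vs FDCR bound.
rng = np.random.default_rng(5)
p, n, n_rep = 0.3, 40, 200_000

x_bar = rng.binomial(n, p, size=n_rep) / n      # each draw is one sample mean X_bar_n
simulated_var = x_bar.var()
fdcr_bound = p * (1 - p) / n                    # inverse of the Fisher information n / (p(1-p))

print(f"simulated V(X_bar_n) : {simulated_var:.6f}")
print(f"FDCR lower bound     : {fdcr_bound:.6f}")
```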

1.6 Best invariant unbiased estimators

The previous results can only be used to find best unbiased estimators in parametric models. These results no longer apply in the case of semi-parametric models. The class of estimators must then be restricted again, i.e. we impose invariance conditions. More specifically, we are interested in two forms of invariance:

1. When (i) the parameters of interest appear linearly in the first moment and (ii) the class of estimators is restricted to estimators that are linear in the observations;

2. When (i) the parameters of interest appear linearly in the second moment and (ii) the class of estimators is restricted to quadratic estimators.


In the former, the so-called Gauss-Markov theorem defines the best linear unbiased estimators. In the latter, one can define the best quadratic unbiased estimators.

In doing so, consider the (conditional static) linear regression model:
\[
Y = X\beta_0 + u
\]
with $E(u \mid X) = 0$ and $V(u \mid X) = \sigma^2 I_n$, where Y is an n-dimensional vector and X is an $n \times k$ matrix of rank k (or $P(\mathrm{rk}(X) = k) = 1$). The parameter vector of interest is the k-dimensional vector $\beta_0$, which is linearly related to $E(Y \mid X)$. If we now consider the class of unbiased estimators that are linear in Y (conditional on X), one can show the following theorem.

Theorem 3. The ordinary least squares estimator of $\beta_0$ defined by
\[
\hat{\beta}_{OLS} = (X^t X)^{-1} X^t Y
\]
is the best estimator in the class of linear (in Y) unbiased estimators of $\beta_0$. Its variance is:
\[
V\left[\hat{\beta}_{OLS} \mid X\right] = \sigma^2 (X^t X)^{-1}.
\]

This theorem is known as the Gauss-Markov theorem. It can be easily generalized to the case of non-spherical error terms (assuming that the variance-covariance matrix of the error terms is known) as follows.

Theorem 4. Consider the conditional static linear model:
\[
Y = X\beta_0 + u
\]
with $E(u \mid X) = 0$ and $V(u \mid X) = \sigma^2 \Omega_0$, where $\Omega_0$ is a known positive definite matrix, Y is an n-dimensional vector, and X is an $n \times k$ matrix of rank k (or $P(\mathrm{rk}(X) = k) = 1$). The generalized least squares estimator defined by
\[
\hat{\beta}_{GLS} = \left(X^t \Omega_0^{-1} X\right)^{-1} X^t \Omega_0^{-1} Y
\]
is the best estimator in the class of linear unbiased estimators of $\beta_0$. Its variance is given by:
\[
V\left[\hat{\beta}_{GLS} \mid X\right] = \sigma^2 \left(X^t \Omega_0^{-1} X\right)^{-1}.
\]
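The sketch below (an illustration under assumed settings, not part of the notes) compares OLS and GLS in a model with a known, diagonal, heteroskedastic $\Omega_0$: both are unbiased, but GLS should show the smaller variance, as Theorem 4 predicts.

```python
import numpy as np

# OLS vs GLS when V(u | X) = sigma^2 * Omega_0 with Omega_0 known and diagonal.
rng = np.random.default_rng(6)
n, sigma = 100, 1.0
beta0 = np.array([1.0, -1.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one regressor
omega_diag = rng.uniform(0.2, 5.0, size=n)              # known heteroskedasticity pattern
Omega0_inv = np.diag(1.0 / omega_diag)

n_rep = 20_000
ols_draws, gls_draws = [], []
for _ in range(n_rep):
    u = rng.normal(scale=sigma * np.sqrt(omega_diag))    # V(u_i) = sigma^2 * omega_i
    y = X @ beta0 + u
    ols_draws.append(np.linalg.solve(X.T @ X, X.T @ y))
    gls_draws.append(np.linalg.solve(X.T @ Omega0_inv @ X, X.T @ Omega0_inv @ y))

ols_draws, gls_draws = np.array(ols_draws), np.array(gls_draws)
print("mean OLS :", ols_draws.mean(axis=0).round(3), " var(slope):", ols_draws[:, 1].var().round(5))
print("mean GLS :", gls_draws.mean(axis=0).round(3), " var(slope):", gls_draws[:, 1].var().round(5))
```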


Finally, it is also possible to show that the unbiased estimator of $\sigma^2$ is the best quadratic unbiased estimator. In this case, we need some additional assumptions regarding the third and fourth moments of the error terms conditionally on X. This result is stated in the theorem below.

Theorem 5. Consider the (conditional static) linear regression model:
\[
Y = X\beta_0 + u
\]
with $E(u_i \mid X) = 0$, $V(u_i \mid X) = \sigma^2$, $E(u_i^3 \mid X) = 0$, and $E(u_i^4 \mid X) = 3\sigma^4$, where Y is an n-dimensional vector and X is an $n \times k$ matrix of rank k (or $P(\mathrm{rk}(X) = k) = 1$). The estimator of $\sigma^2$ defined by
\[
s^2 = \frac{1}{n-k}\, Y^t M_X Y = \frac{\hat{u}^t \hat{u}}{n-k}
\]
is the best quadratic unbiased estimator of $\sigma^2$.

2 Statistical properties of the OLS estimator

Taking the previous results, we can now study the unbiasedness and efficiency properties of the ordinary least squares estimator. First, we consider the semi-parametric multiple linear regression model and we make the distinction between nonstochastic and stochastic regressors. Then, we analyze these properties in the parametric multiple linear regression model. In the latter, one main advantage is that we can derive the exact distribution of the ordinary least squares estimator.

2.1 Semi-parametric model

Nonstochastic regressors

In the case of nonstochastic regressors, the multiple linear regression model writes:
\[
Y = X\beta + u
\]
where $E(u) = 0_{n \times 1}$, $V(u) = \sigma^2 I_n$, and X is a matrix of nonstochastic regressors with $\mathrm{rk}(X) = k$. This is a semi-parametric specification.

• The ordinary least squares estimator of $\beta$ is given by:
\[
\hat{\beta}_{OLS} = \left(X^t X\right)^{-1} X^t Y.
\]


• The corresponding "naive" least squares estimator of $\sigma^2$ is given by:
\[
\tilde{\sigma}^2 = \frac{\|Y - X\hat{\beta}_{OLS}\|_{I_n}^2}{n}.
\]

On the one hand, the ordinary least squares estimator of $\beta$ is unbiased, uncorrelated with the residuals, and its variance-covariance matrix is given by $\sigma^2(X^t X)^{-1}$.

Proposition 6. The ordinary least squares estimator of $\beta$ satisfies the following properties:
\[
E\left[\hat{\beta}_{OLS}\right] = \beta \quad \text{for all } \beta \in \Theta,
\qquad
V\left[\hat{\beta}_{OLS}\right] = \sigma^2 (X^t X)^{-1},
\]
and
\[
\mathrm{Cov}\left[\hat{\beta}_{OLS}, \hat{u}\right] = 0_{k \times n}
\]
where $\hat{u} = Y - \hat{Y} = M_X Y$.

Proof:

1. By definition, $\hat{\beta}_{OLS} = (X^t X)^{-1} X^t Y$. Therefore,
\[
\hat{\beta}_{OLS} = (X^t X)^{-1} X^t (X\beta + u) = \beta + (X^t X)^{-1} X^t u.
\]
It follows that:
\[
E(\hat{\beta}_{OLS}) = \beta + E\left((X^t X)^{-1} X^t u\right) = \beta + (X^t X)^{-1} X^t E(u) = \beta.
\]

2. By definition,
\[
V\left(\hat{\beta}_{OLS}\right) = E\left[\left(\hat{\beta}_{OLS} - E\left(\hat{\beta}_{OLS}\right)\right)\left(\hat{\beta}_{OLS} - E\left(\hat{\beta}_{OLS}\right)\right)^t\right]
= E\left[\left(\hat{\beta}_{OLS} - \beta\right)\left(\hat{\beta}_{OLS} - \beta\right)^t\right].
\]
Using the previous step, one has:
\[
\hat{\beta}_{OLS} - \beta = (X^t X)^{-1} X^t u.
\]


Therefore, since $\left[(X^t X)^{-1}\right]^t = (X^t X)^{-1}$,
\[
V\left(\hat{\beta}_{OLS}\right) = E\left[(X^t X)^{-1} X^t u u^t X (X^t X)^{-1}\right]
= (X^t X)^{-1} X^t E\left[u u^t\right] X (X^t X)^{-1}
= (X^t X)^{-1} X^t \sigma^2 I_n X (X^t X)^{-1}
= \sigma^2 (X^t X)^{-1} (X^t X) (X^t X)^{-1}
= \sigma^2 (X^t X)^{-1}.
\]

Remark: The result can also be shown in another way. One has $V(Y) = V(X\beta + u) = V(u) = \sigma^2 I_n$. It follows that:
\[
V\left(\hat{\beta}_{OLS}\right) = V\left((X^t X)^{-1} X^t Y\right) = (X^t X)^{-1} X^t V(Y) X (X^t X)^{-1} = \sigma^2 (X^t X)^{-1}.
\]

3. We have:
\[
\mathrm{Cov}\left(\hat{\beta}_{OLS}, \hat{u}\right) = E\left[\left(\hat{\beta}_{OLS} - \beta\right)\hat{u}^t\right]
= E\left[(X^t X)^{-1} X^t u u^t M_X\right]
\]
since $E(\hat{u}) = 0_{n \times 1}$ and $\hat{u} = M_X u$. Therefore,
\[
\mathrm{Cov}\left(\hat{\beta}_{OLS}, \hat{u}\right) = (X^t X)^{-1} X^t E\left[u u^t\right] M_X
= \sigma^2 (X^t X)^{-1} X^t M_X
= 0_{k \times n}
\]
since $M_X X = 0_{n \times k}$.
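The following sketch (illustrative, with an arbitrary fixed design, not part of the notes) checks Proposition 6 numerically: the Monte Carlo variance of the OLS estimator is compared with $\sigma^2 (X^t X)^{-1}$, and the correlation between the estimator and a summary of the residual vector is close to zero.

```python
import numpy as np

# Numerical check of Proposition 6 for a fixed design matrix X.
rng = np.random.default_rng(7)
n, k, sigma = 60, 3, 1.5
beta = np.array([0.5, 2.0, -1.0])
X = rng.normal(size=(n, k))                      # fixed (nonstochastic) regressors
theory_var = sigma**2 * np.linalg.inv(X.T @ X)   # sigma^2 (X'X)^{-1}

n_rep = 30_000
betas = np.empty((n_rep, k))
resid_means = np.empty(n_rep)
for r in range(n_rep):
    u = rng.normal(scale=sigma, size=n)
    y = X @ beta + u
    b = np.linalg.solve(X.T @ X, X.T @ y)
    betas[r] = b
    resid_means[r] = (y - X @ b).mean()          # one linear functional of the residuals

print("simulated V(beta_hat) diag :", np.diag(np.cov(betas.T)).round(4))
print("theoretical diag           :", np.diag(theory_var).round(4))
print("corr(beta_hat_1, mean residual):", np.corrcoef(betas[:, 0], resid_means)[0, 1].round(4))
```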

On the other hand, the naive estimator of $\sigma^2$ is biased. To obtain an unbiased estimator of $\sigma^2$, we need to adjust the denominator (i.e., instead of dividing by n, we adjust for the number of explanatory variables and divide by n - k).

Proposition 7. The "naive" ordinary least squares estimator of $\sigma^2$,
\[
\tilde{\sigma}^2 = \frac{\|Y - X\hat{\beta}_{OLS}\|_{I_n}^2}{n},
\]
is biased. The unbiased ordinary least squares estimator of $\sigma^2$ is given by:
\[
\hat{\sigma}^2 = \frac{\|Y - X\hat{\beta}_{OLS}\|_{I_n}^2}{n - k}
\]
where k is the number of explanatory variables.


Proof: By definition,
\[
\hat{u} = Y - \hat{Y} = Y - P_X Y = (I_n - P_X) Y,
\]
i.e.
\[
\hat{u} = M_X Y = M_X u
\]
(since $M_X X = 0$). Therefore, $\sum_{i=1}^{n} \hat{u}_i^2$ can be written as follows:
\[
\hat{u}^t \hat{u} = u^t M_X^t M_X u = u^t M_X u = \mathrm{Tr}(M_X u u^t).
\]
Therefore,
\[
E[\hat{u}^t \hat{u}] = E\left[\mathrm{Tr}(M_X u u^t)\right]
= \mathrm{Tr}\left[E(M_X u u^t)\right]
= \mathrm{Tr}\left[M_X E(u u^t)\right]
= \mathrm{Tr}\left[M_X \sigma^2 I_n\right]
= \sigma^2 \mathrm{Tr}(M_X)
= \sigma^2 (n - k).
\]
It follows that $\tilde{\sigma}^2$ is a biased estimator of $\sigma^2$ since:
\[
E\left[\tilde{\sigma}^2\right] = \frac{n-k}{n}\,\sigma^2.
\]
Finally, $\hat{\sigma}^2 = \frac{n}{n-k}\,\tilde{\sigma}^2$ is an unbiased estimator of $\sigma^2$.
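A short simulation (illustrative, with arbitrary design and parameter values, not part of the notes) checks the bias factor (n - k)/n of the naive estimator and the unbiasedness of the degrees-of-freedom-corrected estimator.

```python
import numpy as np

# Bias of the naive variance estimator versus the (n - k)-corrected one.
rng = np.random.default_rng(8)
n, k, sigma2 = 25, 4, 2.0
beta = rng.normal(size=k)
X = rng.normal(size=(n, k))                  # fixed design

n_rep = 50_000
naive, corrected = np.empty(n_rep), np.empty(n_rep)
for r in range(n_rep):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    resid = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    ssr = resid @ resid
    naive[r] = ssr / n
    corrected[r] = ssr / (n - k)

print(f"E[naive]     ~ {naive.mean():.4f}   (theory: {(n - k) / n * sigma2:.4f})")
print(f"E[corrected] ~ {corrected.mean():.4f}   (theory: {sigma2:.4f})")
```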

Remark: The variance of $\hat{\sigma}^2$ cannot be derived without some assumptions regarding the third and fourth moments of u [4]. If we assume that such moments exist, then:
\[
V\left(\hat{\sigma}^2\right) = \frac{2\sigma^4}{n-k} + \frac{\sum_i \left(\mu_{4i} - 3\sigma^4\right) m_{X,ii}^2}{(n-k)^2}
\]
where $m_{X,ii}$ is the i-th diagonal element of $M_X$ and $\mu_{4i}$ is the fourth-order moment of $u_i$.

As explained before, unbiasedness is not sufficient per se to discriminate among estimators. We now turn to the efficiency problem. First, we study the efficiency of the ordinary least squares estimator of $\beta$. Then we focus on the efficiency of $\hat{\sigma}^2$.

[4] To define the semi-parametric model, we only make assumptions regarding the first two moments.


Theorem 6. Consider the static multiple linear regression model:
\[
Y = X\beta_0 + u
\]
where $E(u_i) = 0$ and $V(u_i) = \sigma^2$ for all i, Y is an n-dimensional vector and X is an $n \times k$ matrix of rank k. The ordinary least squares estimator of $\beta_0$ defined by
\[
\hat{\beta}_{OLS} = (X^t X)^{-1} X^t Y
\]
is the best estimator in the class of linear (in Y) unbiased estimators of $\beta_0$. Its variance is
\[
V\left[\hat{\beta}_{OLS}\right] = \sigma^2 (X^t X)^{-1}.
\]

Proof: See Theorem 3.

Theorem 7. Consider the static multiple linear regression model:
\[
Y = X\beta_0 + u
\]
where $E(u_i) = 0$, $V(u_i) = \sigma^2$, $E(u_i^3) = 0$, and $E(u_i^4) = 3\sigma^4$ for all i, Y is an n-dimensional vector and X is an $n \times k$ matrix of rank k. The estimator of $\sigma^2$ defined by
\[
s^2 = \frac{1}{n-k}\, Y^t M_X Y = \frac{\hat{u}^t \hat{u}}{n-k}
\]
is the best quadratic unbiased estimator of $\sigma^2$.

Proof: See Theorem 5.

Stochastic regressors

We now study the properties of the estimator of $(\beta^t, \sigma)^t$ when the regressors are stochastic. All in all, the main properties are not altered; only the way of proving the results changes.

In the presence of stochastic regressors, the multiple linear regression model writes:

Y = Xβ + u


where $E(u \mid X) = 0_{n \times 1}$, $V(u \mid X) = \sigma^2 I_n$, and X is a matrix of random regressors with $P(\mathrm{rk}(X) = k) = 1$. This is a semi-parametric specification.

• The ordinary least squares estimator of $\beta$ is given by:
\[
\hat{\beta}_{OLS} = \left(X^t X\right)^{-1} X^t Y.
\]

• The corresponding "naive" least squares estimator of $\sigma^2$ is given by:
\[
\tilde{\sigma}^2 = \frac{\|Y - X\hat{\beta}_{OLS}\|_{I_n}^2}{n}.
\]

Proposition 8. The ordinary least squares estimator of $\beta$ satisfies the following properties:
\[
E\left[\hat{\beta}_{OLS} \mid X\right] = \beta \quad \text{for all } \beta \in \Theta,
\qquad
E\left[\hat{\beta}_{OLS}\right] = \beta \quad \text{for all } \beta \in \Theta,
\]
\[
V\left[\hat{\beta}_{OLS} \mid X\right] = \sigma^2 (X^t X)^{-1},
\qquad
V\left[\hat{\beta}_{OLS}\right] = \sigma^2 E_X\left[(X^t X)^{-1}\right],
\]
and
\[
\mathrm{Cov}\left[\hat{\beta}_{OLS}, \hat{u} \mid X\right] = 0_{k \times n}
\]
where $\hat{u} = Y - \hat{Y} = M_X Y$.

Remark: To obtain the unconditional properties of the ordinary least squares estimator of $\beta$ in the presence of stochastic regressors, one generally proceeds in two steps:

1. Obtain the desired result conditionally on X;

2. Find the unconditional result by averaging (i.e., by integrating over) the conditional distribution.

Selected proofs:

1. Conditional unbiasedness property of $\hat{\beta}_{OLS}$:
\[
E\left(\hat{\beta}_{OLS} \mid X\right) = \beta + E\left((X^t X)^{-1} X^t u \mid X\right) = \beta.
\]


2. Unconditional unbiasedness property of $\hat{\beta}_{OLS}$:
\[
E\left(\hat{\beta}_{OLS}\right) = E_X\left[E\left(\hat{\beta}_{OLS} \mid X\right)\right] = E_X[\beta] = \beta.
\]

3. Conditional variance property of $\hat{\beta}_{OLS}$:
\[
V\left(\hat{\beta}_{OLS} \mid X\right) = E\left[(X^t X)^{-1} X^t u u^t X (X^t X)^{-1} \mid X\right]
= (X^t X)^{-1} X^t E\left[u u^t \mid X\right] X (X^t X)^{-1}
= \sigma^2 (X^t X)^{-1}.
\]

4. Unconditional variance property of $\hat{\beta}_{OLS}$:
\[
V\left(\hat{\beta}_{OLS}\right) = E_X\left[V\left(\hat{\beta}_{OLS} \mid X\right)\right] + V_X\left[E\left(\hat{\beta}_{OLS} \mid X\right)\right]
= E_X\left[V\left(\hat{\beta}_{OLS} \mid X\right)\right]
= \sigma^2 E_X\left[(X^t X)^{-1}\right].
\]

As in the case of nonstochastic regressors, one can also show that $\hat{\sigma}^2$ is an unbiased estimator of $\sigma^2$.

Proposition 9. The unbiased ordinary least squares estimator of $\sigma^2$ is given by:
\[
\hat{\sigma}^2 = \frac{\|Y - X\hat{\beta}_{OLS}\|_{I_n}^2}{n - k}
\]
where k is the number of explanatory variables.

Proof: We get:
\[
(n-k)\, E\left[\hat{\sigma}^2 \mid X\right] = E\left[u^t M_X u \mid X\right]
= E\left[\mathrm{Tr}(M_X u u^t) \mid X\right]
= \mathrm{Tr}\left[M_X E\left(u u^t \mid X\right)\right]
= \sigma^2 \mathrm{Tr}(M_X) = \sigma^2 (n-k).
\]
The result follows.

Finally, we study the efficiency properties.


Theorem 8. Consider the conditional static multiple linear regression model:
\[
Y = X\beta_0 + u
\]
where $E(u_i \mid X) = 0$ and $V(u_i \mid X) = \sigma^2$ for all i, Y is an n-dimensional vector and X is an $n \times k$ matrix of rank k. The ordinary least squares estimator of $\beta_0$ defined by:
\[
\hat{\beta}_{OLS} = (X^t X)^{-1} X^t Y
\]
is the best estimator in the class of linear (in Y) unbiased estimators of $\beta_0$. Its (conditional) variance is:
\[
V\left[\hat{\beta}_{OLS} \mid X\right] = \sigma^2 (X^t X)^{-1}.
\]

Theorem 9. Consider the conditional static multiple linear regression model:
\[
Y = X\beta_0 + u
\]
where $E(u_i \mid X) = 0$, $V(u_i \mid X) = \sigma^2$, $E(u_i^3 \mid X) = 0$, and $E(u_i^4 \mid X) = 3\sigma^4$ for all i, Y is an n-dimensional vector and X is an $n \times k$ matrix of rank k. The estimator of $\sigma^2$ defined by:
\[
s^2 = \frac{1}{n-k}\, Y^t M_X Y = \frac{\hat{u}^t \hat{u}}{n-k}
\]
is the best quadratic unbiased estimator of $\sigma^2$.

Summary

Proposition 10. The unbiasedness results for the ordinary least squares estimators of $\beta$ and $\sigma^2$ and the Gauss-Markov theorem hold whether or not the matrix X is considered as random.

2.2 Parametric model

Instead of defining only the first two moments of the error terms, we now assume that the error terms are normally distributed (parametric model). In this case, the exact (as opposed to asymptotic) distribution of $\hat{\beta}_{OLS}$ and $\hat{\sigma}^2$ can be derived.


Fixed regressors

Proposition 11. Consider the multiple linear regression model:
\[
Y = X\beta + u
\]
where $u \sim \mathcal{N}(0, \sigma^2 I_n)$ and X is a matrix of fixed regressors with $\mathrm{rk}(X) = k$. Then $\hat{\beta}_{OLS}$ and $\frac{(n-k)\hat{\sigma}^2}{\sigma^2}$ are distributed as follows:
\[
\hat{\beta}_{OLS} \sim \mathcal{N}\left(\beta, \sigma^2 (X^t X)^{-1}\right),
\qquad
\frac{(n-k)\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-k).
\]
Moreover, $\hat{\beta}_{OLS}$ and $\frac{(n-k)\hat{\sigma}^2}{\sigma^2}$ are independent.

Proof:

1. Since $Y = X\beta + u$, we get $Y \sim \mathcal{N}(X\beta, \sigma^2 I_n)$. Moreover, $\hat{\beta}_{OLS} = (X^t X)^{-1} X^t Y$, which implies that $\hat{\beta}_{OLS}$ is normally distributed (it is a linear transformation of Y). Therefore, we just need to characterize the first two moments, $E[\hat{\beta}_{OLS}]$ and $V[\hat{\beta}_{OLS}]$. It follows that $\hat{\beta}_{OLS} \sim \mathcal{N}\left(\beta, \sigma^2 (X^t X)^{-1}\right)$.

2. As shown before, $\hat{u} = M_X Y = M_X u$. Therefore, $E[\hat{u}] = M_X E[Y] = M_X X\beta = 0$ (since $M_X X = 0$) and $V[\hat{u}] = \sigma^2 M_X$. It follows that $M_X Y \sim \mathcal{N}(0, \sigma^2 M_X)$, so that $\frac{M_X Y}{\sigma} \sim \mathcal{N}(0, M_X)$. Since $M_X$ is an orthogonal projection matrix, this implies
\[
\left\|\frac{M_X Y}{\sigma}\right\|^2 = \frac{\|M_X Y\|^2}{\sigma^2} \sim \chi^2\left(\mathrm{rk}(M_X)\right) = \chi^2(n-k),
\]
i.e. $\frac{(n-k)\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-k)$.
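As a hedged numerical illustration (arbitrary design and parameters, not part of the notes), the sketch below simulates $(n-k)\hat{\sigma}^2/\sigma^2$ repeatedly and compares its first two sample moments with those of a $\chi^2(n-k)$ distribution (mean n - k, variance 2(n - k)).

```python
import numpy as np

# Check that (n - k) * sigma_hat^2 / sigma^2 behaves like a chi-squared(n - k) variable.
rng = np.random.default_rng(9)
n, k, sigma = 30, 5, 2.0
beta = rng.normal(size=k)
X = rng.normal(size=(n, k))                  # fixed design, rk(X) = k almost surely

n_rep = 100_000
stats = np.empty(n_rep)
for r in range(n_rep):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    resid = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    sigma2_hat = resid @ resid / (n - k)
    stats[r] = (n - k) * sigma2_hat / sigma**2

print(f"sample mean     : {stats.mean():.3f}   (chi2(n-k) mean     : {n - k})")
print(f"sample variance : {stats.var():.3f}   (chi2(n-k) variance : {2 * (n - k)})")
```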

Finally, the efficiency of $\hat{\beta}_{OLS}$ can be established using maximum likelihood theory (see further). One has the following proposition.

Proposition 12. Consider the multiple linear regression model:
\[
Y = X\beta + u
\]
where $u \sim \mathcal{N}(0, \sigma^2 I_n)$ and X is a matrix of fixed regressors with $\mathrm{rk}(X) = k$.

1. The ordinary least squares estimator of $\beta$ is efficient: its variance-covariance matrix equals the inverse of the Fisher information matrix.

2. The unbiased ordinary least squares estimator of $\sigma^2$ is not efficient: there exists no quadratic unbiased estimator of $\sigma^2$ that attains the FDCR lower bound.


Proof: See Chapter 3 (maximum likelihood theory).

Stochastic regressors

Proposition 13. Consider the conditional static multiple linear regression model:
\[
Y = X\beta + u
\]
where $u \mid X \sim \mathcal{N}(0, \sigma^2 I_n)$ and X is a matrix of random regressors with $P(\mathrm{rk}(X) = k) = 1$. Then $\hat{\beta}_{OLS}$ is conditionally distributed as follows:
\[
\hat{\beta}_{OLS} \mid X \sim \mathcal{N}\left(\beta, \sigma^2 (X^t X)^{-1}\right).
\]