
Asymptotic properties

We now look at properties of the OLS estimator that hold in large samples, or more precisely, as the sample size $N$ tends to infinity, keeping the number of explanatory variables $K$ fixed. The idea is that, at least for models estimated using large data samples, these asymptotic results will provide a useful approximation to the behaviour of our estimators and test statistics.

The power of this approach is that we can establish useful asymptotic results under much weaker assumptions than those needed to show that the OLS estimator is unbiased and efficient in the classical linear regression model, or than those needed to establish the exact small sample distributions of the OLS estimator and the t- and F-statistics in the classical linear regression model with Normally distributed errors.

(Weak) Consistency

We first consider a sufficient set of assumptions for the OLS estimator to be consistent in the linear model:

i) $y_i = x_i'\beta + u_i$ for $i = 1, \ldots, N$, or $y = X\beta + u$

ii) The data on $(y_i, x_i)$ are independent over $i = 1, \ldots, N$, with $E(u_i) = 0$ and $E(x_iu_i) = 0$ for all $i = 1, \ldots, N$

iii) $X$ is stochastic and has full rank

iv) The $K \times K$ matrix $M_{XX} = \operatorname{plim}_{N\to\infty}\left(\frac{X'X}{N}\right) = \operatorname{plim}_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N} x_ix_i'$ exists and is non-singular

Then $\operatorname{plim}_{N\to\infty}\hat\beta_{OLS} = \beta$, or $\hat\beta_{OLS} \xrightarrow{P} \beta$.

Recall that a sequence of random variables $\{x_N : N = 1, 2, \ldots\}$ converges in probability to the constant $a$ if, for all $\delta > 0$ and $\varepsilon > 0$, we have

$\Pr(|x_N - a| < \varepsilon) > 1 - \delta$ for all $N > N^*$

or equivalently

$\lim_{N\to\infty} \Pr(|x_N - a| < \varepsilon) = 1$

$\lim_{N\to\infty} \Pr(|x_N - a| > \varepsilon) = 0$

This is denoted by $\operatorname{plim}_{N\to\infty} x_N = a$, or $x_N \xrightarrow{P} a$.
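As a quick numerical illustration of this definition (a minimal simulation sketch added here, not part of the original slides), the snippet below estimates $\Pr(|\bar{x}_N - 0.5| > \varepsilon)$ for the sample mean $\bar{x}_N$ of $N$ Uniform(0, 1) draws; the estimated probability falls towards zero as $N$ grows, so $\bar{x}_N \xrightarrow{P} 0.5$.

```python
# Illustrative sketch: convergence in probability of a sample mean.
import numpy as np

rng = np.random.default_rng(0)
eps, reps = 0.05, 1000

for N in (10, 100, 1000, 10000):
    # one sample mean per replication, each based on N Uniform(0,1) draws
    means = rng.uniform(0.0, 1.0, size=(reps, N)).mean(axis=1)
    prob_outside = np.mean(np.abs(means - 0.5) > eps)
    print(f"N = {N:6d}: estimated Pr(|xbar_N - 0.5| > {eps}) = {prob_outside:.3f}")
```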

Convergence in probability (associated with weak laws of large numbers) is implied by the property of almost sure convergence (associated with strong laws of large numbers). A random variable $x_N$ converges almost surely to $a$ if

$\Pr\left(\lim_{N\to\infty} x_N = a\right) = 1$

This is denoted by $x_N \xrightarrow{a.s.} a$.

These definitions are applied element-by-element to random vectors or matrices. For example, if we have $x_N \xrightarrow{P} a$ and $y_N \xrightarrow{P} b$, then the random vector $(x_N, y_N) \xrightarrow{P} (a, b)$.

An important property of probability limits is known as Slutsky's Theorem. Let the random vector $x_N$ have a probability limit given by the vector $a$, and let $g(\cdot)$ be a real-valued function which is continuous at the point $a$. Then $g(x_N) \xrightarrow{P} g(a)$.

For example, let $x_N = (x_{1N}, x_{2N}) \xrightarrow{P} (a_1, a_2)$ and let $g(x_1, x_2) = x_1x_2$. Then we have $g(x_{1N}, x_{2N}) = x_{1N}x_{2N} \xrightarrow{P} a_1a_2$, or

$\operatorname{plim}_{N\to\infty}(x_{1N}x_{2N}) = \operatorname{plim}_{N\to\infty} x_{1N} \cdot \operatorname{plim}_{N\to\infty} x_{2N}$

The probability limit of the product of two random variables $x_{1N}$ and $x_{2N}$ is the product of their probability limits. Note that this property does not hold in general for expected values. We use this property to prove the consistency of $\hat\beta_{OLS}$ under the stated assumptions.

$\hat\beta_{OLS} = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + u) = (X'X)^{-1}X'X\beta + (X'X)^{-1}X'u = \beta + (X'X)^{-1}X'u$

Equivalently

$\hat\beta_{OLS} = \beta + N(X'X)^{-1}\left(\frac{1}{N}\right)X'u = \beta + \left(\frac{X'X}{N}\right)^{-1}\left(\frac{X'u}{N}\right)$

Now

$\operatorname{plim}_{N\to\infty}\hat\beta_{OLS} = \operatorname{plim}_{N\to\infty}\beta + \left(\operatorname{plim}_{N\to\infty}\left(\frac{X'X}{N}\right)\right)^{-1}\left(\operatorname{plim}_{N\to\infty}\left(\frac{X'u}{N}\right)\right) = \beta + M_{XX}^{-1}\operatorname{plim}_{N\to\infty}\left(\frac{X'u}{N}\right)$

$\operatorname{plim}_{N\to\infty}\hat\beta_{OLS} = \beta + M_{XX}^{-1}\operatorname{plim}_{N\to\infty}\left(\frac{X'u}{N}\right)$

Consistency thus requires that

$\operatorname{plim}_{N\to\infty}\left(\frac{X'u}{N}\right) = \operatorname{plim}_{N\to\infty}\left(\frac{1}{N}\sum_{i=1}^{N}x_iu_i\right) = 0$

This in turn is implied by our assumption that $E(x_iu_i) = 0$. Using a suitable Law of Large Numbers for independent observations over $i = 1, \ldots, N$, we have that

$\operatorname{plim}_{N\to\infty}\left(\frac{1}{N}\sum_{i=1}^{N}x_iu_i\right) = E(x_iu_i) = 0$
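To make the consistency result concrete, here is a minimal Monte Carlo sketch (an illustration added here, with arbitrary simulated parameter values, not code from the slides). The errors are deliberately skewed and heteroskedastic, so Normality and homoskedasticity both fail, yet $E(u_i|x_i) = 0$ holds and the OLS estimates approach the true $\beta$ as $N$ grows.

```python
# Illustrative sketch: OLS consistency under E(x_i u_i) = 0 only.
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([1.0, 2.0])                    # true (intercept, slope)

for N in (100, 10_000, 1_000_000):
    x = rng.normal(size=N)
    X = np.column_stack([np.ones(N), x])
    s = 1.0 + np.abs(x)                        # error scale depends on x
    u = rng.exponential(scale=s, size=N) - s   # E(u|x) = 0, skewed, heteroskedastic
    y = X @ beta + u
    b = np.linalg.solve(X.T @ X, X.T @ y)      # OLS: (X'X)^{-1} X'y
    print(f"N = {N:9d}: beta_hat = {np.round(b, 4)}")
```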

For independent and identically distributed (iid) observations, we can use the Kolmogorov Law of Large Numbers. For independent but not identically distributed observations, we can use the Markov Law of Large Numbers.

For time series models with lagged dependent variables, the observations $(y_i, x_i)$ are not independent over $i = 1, \ldots, N$, since $x_i$ includes $y_{i-1}$, violating assumption (ii). In this case, we can still use Laws of Large Numbers for stationary, ergodic processes to establish conditions under which $\hat\beta_{OLS}$ remains consistent.

[Sections 2.2 and 2.3 of Hayashi (2000) provide the details]

In contrast, we know that $\hat\beta_{OLS}$ is not unbiased in dynamic models with lagged dependent variables, since the strict exogeneity assumption $E(y|X) = X\beta$ was needed to establish unbiasedness.

When we state that $\hat\beta_{OLS}$ is consistent, or $\hat\beta_{OLS} \xrightarrow{P} \beta$:

• we view the sequence of OLS estimates from increasing sample sizes as a sequence of random vectors

• consistency holds for each element of the vector, so that $\hat\beta_k \xrightarrow{P} \beta_k$ for $k = 1, \ldots, K$

Notice that we do not need to assume either conditional homoskedasticity ($V(u_i|x_i) = \sigma^2$) or Normality ($u_i|x_i$ has a Normal distribution) to establish consistency of the OLS estimator. Nor do we require that the conditional expectation of $y_i$ given $x_i$ is linear ($E(y_i|x_i) = x_i'\beta$), let alone the stronger form $E(y|X) = X\beta \leftrightarrow E(u|X) = 0$.

The key assumption for the OLS estimator to be consistent is that the error term $u_i$ is uncorrelated with (orthogonal to) each of the explanatory variables in the vector $x_i$, as implied by the statement $E(x_iu_i) = 0$ (here a $K \times 1$ vector with each element equal to zero).

Note that the conditional independence assumption $E(u_i|x_i) = 0$ implies the orthogonality assumption $E(x_iu_i) = 0$. Using the Law of Iterated Expectations,

$E(x_iu_i) = E_x[E_{u|x}(x_iu_i|x_i)] = E_x[x_iE_{u|x}(u_i|x_i)] = E_x[0] = 0$

if $E(u_i|x_i) = 0$. $E(u_i|x_i) = 0$ also implies that $E(u_i) = 0$, since

$E(u_i) = E_x[E_{u|x}(u_i|x_i)] = E_x[0] = 0$

And $E(u_i|x_i) = 0$ implies that $\operatorname{cov}(x_i, u_i) = 0$, since

$\operatorname{cov}(x_i, u_i) = E(x_iu_i) - E(x_i)E(u_i) = 0 - 0 = 0$

So we have that $E(u_i|x_i) = 0$ implies $\operatorname{cov}(x_i, u_i) = 0$. In general the converse is not true, although we know that if $x_i$ and $u_i$ are both Normally distributed, then zero covariance does imply independence, in which case we have $E(u_i|x_i) = 0$.

The statements

$y_i = x_i'\beta + u_i$

and

$E(u_i) = 0$ and $E(x_iu_i) = 0$

define the linear projection of $y_i$ on $x_i$ to be $E^*(y_i|x_i) = x_i'\beta$. With independent observations on $(y_i, x_i)$ over $i = 1, \ldots, N$, the OLS estimator provides a consistent estimator for the parameters of this linear projection.

More accurately, this property should be called 'weak consistency', since we have only established convergence in probability, but this is commonly abbreviated to the term 'consistency'.

Consistency is a large sample or asymptotic property, i.e. a property that holds in the limit as the sample size $N$ increases to infinity. Unbiasedness is a small sample or finite sample property.

An unbiased estimator (with $E(\hat\beta) = \beta$) will usually be consistent, but need not be: consistency further requires that $V(\hat\beta)$ falls as the sample size increases. Conversely, a consistent estimator need not be unbiased. For example, if $\hat\beta$ is an unbiased and consistent estimator of $\beta$, then $\tilde\beta = \hat\beta + \frac{1}{N}$ is a consistent but biased estimator of $\beta$:

$E(\tilde\beta) = E(\hat\beta) + \frac{1}{N} = \beta + \frac{1}{N} \neq \beta \leftrightarrow$ biased

$\operatorname{plim}_{N\to\infty}(\tilde\beta) = \operatorname{plim}_{N\to\infty}(\hat\beta) + \operatorname{plim}_{N\to\infty}\left(\frac{1}{N}\right) = \beta + 0 = \beta \leftrightarrow$ consistent
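A short numerical sketch of this example (an added illustration on arbitrary simulated data): $\hat\beta$ is the OLS estimator in a model with no intercept, and $\tilde\beta = \hat\beta + 1/N$ is biased but consistent, since the bias term vanishes as $N$ grows.

```python
# Illustrative sketch: a biased but consistent estimator.
import numpy as np

rng = np.random.default_rng(2)
beta = 2.0
for N in (10, 100, 10_000):
    x = rng.normal(size=N)
    y = beta * x + rng.normal(size=N)
    beta_hat = (x @ y) / (x @ x)        # unbiased, consistent OLS (no intercept)
    beta_tilde = beta_hat + 1.0 / N     # biased by 1/N, but still consistent
    print(f"N = {N:6d}: beta_hat = {beta_hat:.4f}, beta_tilde = {beta_tilde:.4f}")
```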

Asymptotic Normality

Consistency is a desirable property for an estimator to have, particularly if we have the luxury of working with large datasets. But it is not sufficient for inference. To obtain confidence intervals and to conduct hypothesis tests, we require an asymptotic distribution result, relating the distribution of the estimator in large samples to a known (and ideally a convenient) distribution.

We start by considering a sufficient set of assumptions for the OLS estimator, suitably scaled, to have a limit distribution that is Normal.

To link this asymptotic distribution result to the finite sample distribution result obtained previously for the classical Normal linear regression model, we first assume that the conditional variance of the error term ($E(u_i^2|x_i)$) satisfies the conditional homoskedasticity assumption $E(u_i^2|x_i) = \sigma^2$ for all $i = 1, \ldots, N$.

This assumption is needed for the OLS estimator to have the particular asymptotic variance derived here, although not to establish that the asymptotic distribution is Normal.

Assumptions

i) $y_i = x_i'\beta + u_i$ for $i = 1, \ldots, N$, or $y = X\beta + u$

ii) The data on $(y_i, x_i)$ are independent over $i = 1, \ldots, N$, with $E(u_i) = 0$, $E(x_iu_i) = 0$ and $E(u_i^2|x_i) = \sigma^2$ for all $i = 1, \ldots, N$

iii) $X$ is stochastic and has full rank

iv) The $K \times K$ matrix $M_{XX} = \operatorname{plim}_{N\to\infty}\left(\frac{X'X}{N}\right) = \operatorname{plim}_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}x_ix_i'$ exists and is non-singular

v) The $K \times 1$ vector $\left(\frac{X'u}{\sqrt{N}}\right) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}x_iu_i \xrightarrow{D} N(0, \sigma^2M_{XX})$

Then $\sqrt{N}(\hat\beta_{OLS} - \beta) \xrightarrow{D} N(0, \sigma^2M_{XX}^{-1})$.
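The following simulation sketch (an added illustration in a scalar model with arbitrary parameter values) draws repeated samples and records $\sqrt{N}(\hat\beta_{OLS} - \beta)$. With $M_{XX} = E(x_i^2) = 1$ and homoskedastic errors with $\sigma^2 = 1$, the theorem predicts these draws are approximately $N(0, 1)$, even though the errors are Uniform rather than Normal.

```python
# Illustrative sketch: limit distribution of sqrt(N)*(beta_hat - beta).
import numpy as np

rng = np.random.default_rng(3)
beta, N, reps = 1.5, 500, 5000
draws = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=N)                            # M_XX = E(x_i^2) = 1
    u = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), N)   # non-Normal, variance 1
    y = beta * x + u
    beta_hat = (x @ y) / (x @ x)
    draws[r] = np.sqrt(N) * (beta_hat - beta)
# Theory: sqrt(N)(beta_hat - beta) ->D N(0, sigma^2 / M_XX) = N(0, 1)
print("mean =", draws.mean().round(3), " variance =", draws.var().round(3))
```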

Recall that a sequence of random variables $\{x_N : N = 1, 2, \ldots\}$ converges in distribution to the random variable $x$ if and only if $F_N \to F$ as $N \to \infty$, or $\lim_{N\to\infty}F_N = F$, where $F_N$ is the cumulative distribution function of $x_N$ and $F$ is the cumulative distribution function of $x$. If $x \sim N(\mu, \sigma^2)$, we write $x_N \xrightarrow{D} N(\mu, \sigma^2)$.

This definition extends to random vectors, with $F_N$ and $F$ interpreted as the cumulative distribution functions of the random vectors $x_N$ and $x$. If the vector $x \sim N(\mu, \Sigma)$, we write $x_N \xrightarrow{D} N(\mu, \Sigma)$.

The distribution to which the random vector $x_N$ converges in the limit as $N \to \infty$ is referred to as the limit distribution or the limiting distribution.

An important property of limit distributions is known as the continuous mapping theorem. Let the random vector $x_N \xrightarrow{D} x$ and let $g(\cdot)$ be a continuous, real-valued function. Then $g(x_N) \xrightarrow{D} g(x)$.

Again, if $x_N = (x_{1N}, x_{2N}) \xrightarrow{D} (x_1, x_2)$ and $g(x_1, x_2) = x_1x_2$, we have that $g(x_{1N}, x_{2N}) = x_{1N}x_{2N} \xrightarrow{D} x_1x_2$.

Other useful results are known variously as the transformation theorem, Cramer's theorem or (again) as Slutsky's theorem. For scalar random variables $x_N$ and $y_N$, these can be stated as follows. If $x_N \xrightarrow{D} x$ and $y_N \xrightarrow{P} y$, where $x$ is a random variable and $y$ is a constant, then

i) $x_N + y_N \xrightarrow{D} x + y$

ii) $x_Ny_N \xrightarrow{D} xy$ (product rule)

iii) $x_N/y_N \xrightarrow{D} x/y$, provided $y \neq 0$

Results (i) and (ii) extend to (conformable) random vectors. If $x_N$ is a $K \times 1$ random vector, $x$ is a $K \times 1$ random vector, $y_N$ is a $K \times 1$ random vector, and $y$ is a $K \times 1$ non-stochastic vector, we have

i) $x_N + y_N \xrightarrow{D} x + y$

ii) $y_N'x_N \xrightarrow{D} y'x$

If $x_N$ is a $K \times 1$ random vector and $A_N$ is a $K \times K$ random matrix, with $x_N \xrightarrow{D} x$ and $A_N \xrightarrow{P} A$, we also have

iii) $A_Nx_N \xrightarrow{D} Ax$

iv) $x_N'A_N^{-1}x_N \xrightarrow{D} x'A^{-1}x$, provided $A^{-1}$ exists

The matrix version of the product rule $A_Nx_N \xrightarrow{D} Ax$ implies that if $x_N \xrightarrow{D} N(0, \Sigma)$ and $A_N \xrightarrow{P} A$, then $A_Nx_N \xrightarrow{D} N(0, A\Sigma A')$. We use this result to show the asymptotic Normality of $\hat\beta_{OLS}$ under the stated assumptions.

Using

$\hat\beta_{OLS} = \beta + \left(\frac{X'X}{N}\right)^{-1}\left(\frac{X'u}{N}\right)$

we have

$\hat\beta_{OLS} - \beta = \left(\frac{X'X}{N}\right)^{-1}\left(\frac{X'u}{N}\right)$

and

$\sqrt{N}\left(\hat\beta_{OLS} - \beta\right) = \left(\frac{X'X}{N}\right)^{-1}\left(\frac{X'u}{\sqrt{N}}\right)$

Now the $K \times K$ matrix $\left(\frac{X'X}{N}\right)^{-1} \xrightarrow{P} M_{XX}^{-1}$ from assumption (iv), and the $K \times 1$ vector $\left(\frac{X'u}{\sqrt{N}}\right) \xrightarrow{D} N(0, \sigma^2M_{XX})$ from assumption (v).

The product $\left(\frac{X'X}{N}\right)^{-1}\left(\frac{X'u}{\sqrt{N}}\right)$ thus has the same limit distribution as $M_{XX}^{-1}\left(\frac{X'u}{\sqrt{N}}\right)$, which gives us

$\left(\frac{X'X}{N}\right)^{-1}\left(\frac{X'u}{\sqrt{N}}\right) \xrightarrow{D} N(0, \sigma^2M_{XX}^{-1}M_{XX}M_{XX}^{-1})$

using the symmetry of $M_{XX}^{-1}$.

Hence

$\sqrt{N}\left(\hat\beta_{OLS} - \beta\right) \xrightarrow{D} N(0, \sigma^2M_{XX}^{-1})$

More primitive assumptions can be used to derive assumption (v): $\left(\frac{X'u}{\sqrt{N}}\right) \xrightarrow{D} N(0, \sigma^2M_{XX})$. Given assumption (ii), with independence and conditional homoskedasticity, we can use the Lindeberg-Levy Central Limit Theorem for independent and identically distributed random vectors.

If $\{z_i : i = 1, 2, \ldots\}$ is a sequence of iid $K \times 1$ random vectors with $E(z_i) = 0$ and $V(z_i) = E(z_iz_i') = \Sigma$ finite, then

$\frac{1}{\sqrt{N}}\sum_{i=1}^{N}z_i \xrightarrow{D} N(0, \Sigma)$

Letting $z_i = x_iu_i$, we have iid $K \times 1$ random vectors with $E(x_iu_i) = 0$. Assuming finite variance, we obtain

$\frac{1}{\sqrt{N}}\sum_{i=1}^{N}x_iu_i = \left(\frac{X'u}{\sqrt{N}}\right) \xrightarrow{D} N(0, V(x_iu_i))$
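A simulation sketch of this step (an added illustration; here $x_i$ and $u_i$ are independent, with t-distributed errors chosen arbitrarily): the scaled sum $\frac{1}{\sqrt{N}}\sum x_iu_i$ should have variance close to $V(x_iu_i) = E(u_i^2)E(x_i^2) = (8/6) \cdot 1$ under these assumptions.

```python
# Illustrative sketch: Lindeberg-Levy CLT applied to z_i = x_i * u_i.
import numpy as np

rng = np.random.default_rng(4)
N, reps = 1000, 5000
sums = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=N)              # E(x_i^2) = 1
    u = rng.standard_t(df=8, size=N)    # non-Normal errors, variance 8/6
    sums[r] = (x * u).sum() / np.sqrt(N)
print("sample variance =", sums.var().round(3), " theory =", round(8 / 6, 3))
```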

Moreover

$V(x_iu_i) = E[(x_iu_i)(x_iu_i)'] = E(x_iu_iu_ix_i') = E(u_i^2x_ix_i')$

since $u_i$ is a scalar. But

$E(u_i^2x_ix_i') = E_x[E_{u|x}(u_i^2x_ix_i'|x_i)] = E_x[x_ix_i'E_{u|x}(u_i^2|x_i)] = E_x[x_ix_i'\sigma^2] = \sigma^2E_x(x_ix_i') = \sigma^2M_{XX}$

using the conditional homoskedasticity assumption $E(u_i^2|x_i) = \sigma^2$. Then $V(x_iu_i) = \sigma^2M_{XX}$ and $\left(\frac{X'u}{\sqrt{N}}\right) \xrightarrow{D} N(0, \sigma^2M_{XX})$, as stated in assumption (v).

The conditional homoskedasticity assumption is only required to obtain this particular expression for the variance of the limit distribution of $\left(\frac{X'u}{\sqrt{N}}\right)$, and hence the implied variance of the limit distribution of $\hat\beta_{OLS}$.

With independence and conditional heteroskedasticity (i.e. $E(u_i^2|x_i) \neq \sigma^2$ for all $i = 1, \ldots, N$), we can use the Liapounov Central Limit Theorem for independent but not identically distributed random vectors, to obtain that the limit distribution of $\left(\frac{X'u}{\sqrt{N}}\right)$ is Normal, but with a different expression for the variance.

For time series models with lagged dependent variables, we cannot assume independent observations. In this case, we can still use Central Limit Theorems for stationary, ergodic Martingale difference sequences to establish conditions under which $\hat\beta_{OLS}$ remains asymptotically Normal.

[Sections 2.2 and 2.3 of Hayashi (2000) provide the details]

Asymptotic inference

(Weak) Consistency: $\hat\beta_{OLS} \xrightarrow{P} \beta$

The parameter vector $\beta$ is not stochastic, so the limit distribution of $\hat\beta_{OLS}$ itself is degenerate. In practice, we do not get to estimate parameter vectors using infinitely large samples. To form confidence intervals and to test hypotheses, we need to approximate the distribution of the estimator in finite samples. Asymptotic distribution results allow us to approximate this distribution accurately if the (finite) sample size $N$ is sufficiently large.

Asymptotic Normality: $\sqrt{N}(\hat\beta_{OLS} - \beta) \xrightarrow{D} N(0, \sigma^2M_{XX}^{-1})$

Appropriately scaled, we have a non-degenerate limit distribution. To use this limit distribution to approximate the distribution of $\hat\beta_{OLS}$ in large samples, we multiply by the scalar $\frac{1}{\sqrt{N}}$ and add the non-stochastic vector $\beta$ to both sides. This gives us the approximation to the distribution of the OLS estimator based on our asymptotic distribution theory, or more simply the 'asymptotic distribution' of $\hat\beta_{OLS}$, as

$\hat\beta_{OLS} \stackrel{a}{\sim} N\left(\beta, \left(\frac{1}{N}\right)\sigma^2M_{XX}^{-1}\right)$

The variance matrix $\left(\frac{1}{N}\right)\sigma^2M_{XX}^{-1}$ is the approximation to the variance of the OLS estimator based on our asymptotic distribution theory, or more simply the 'asymptotic variance' of $\hat\beta_{OLS}$. We denote this by $\operatorname{avar}(\hat\beta_{OLS}) = \left(\frac{1}{N}\right)\sigma^2M_{XX}^{-1}$, or (if it is clear from the context that we are referring to the asymptotic variance) more simply by $V(\hat\beta_{OLS}) = \left(\frac{1}{N}\right)\sigma^2M_{XX}^{-1}$.

*** Note that some authors use the term 'asymptotic variance' and the notation avar() differently ***

Warning: some authors, including Hayashi (2000), use the term 'asymptotic variance' and the notation $\operatorname{avar}(\hat\beta_{OLS})$ to denote the variance of the limit distribution of $\sqrt{N}(\hat\beta_{OLS} - \beta)$, here $\sigma^2M_{XX}^{-1}$, rather than the approximation to the variance of $\hat\beta_{OLS}$ that we obtain from this limit distribution, here $\left(\frac{1}{N}\right)\sigma^2M_{XX}^{-1}$, as we use the term here.

Expressions in Hayashi (2000) which involve the term $\operatorname{avar}()$ or $\widehat{\operatorname{avar}}()$ should be divided by the sample size $N$, or multiplied by $\left(\frac{1}{N}\right)$, to make them consistent with the notation on these slides. See Hayashi (2000) p.95.

Neither $\sigma^2$ nor $M_{XX}$ is known. The obvious choice to estimate $M_{XX} = \operatorname{plim}_{N\to\infty}\left(\frac{X'X}{N}\right)$ is

$\hat{M}_{XX} = \left(\frac{X'X}{N}\right)$

Since $\operatorname{plim}_{N\to\infty}\hat{M}_{XX} = \operatorname{plim}_{N\to\infty}\left(\frac{X'X}{N}\right) = M_{XX}$, this gives us a consistent estimator of $M_{XX}$.

Both the OLS estimator

$\hat\sigma^2 = \frac{1}{N-K}\sum_{i=1}^{N}\hat{u}_i^2 = \frac{\hat{u}'\hat{u}}{N-K}$

and the Maximum Likelihood estimator

$\hat\sigma^2_{ML} = \frac{1}{N}\sum_{i=1}^{N}\hat{u}_i^2 = \frac{\hat{u}'\hat{u}}{N}$

can be shown to be consistent estimators of $\sigma^2$ under our assumptions. Using the OLS estimator of $\sigma^2$ gives us a consistent estimator of $\operatorname{avar}(\hat\beta_{OLS})$ as

$\widehat{\operatorname{avar}}(\hat\beta_{OLS}) = \left(\frac{1}{N}\right)\hat\sigma^2\hat{M}_{XX}^{-1} = \left(\frac{1}{N}\right)\hat\sigma^2\left(\frac{X'X}{N}\right)^{-1} = \left(\frac{1}{N}\right)\hat\sigma^2N(X'X)^{-1} = \hat\sigma^2(X'X)^{-1}$
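The whole chain — OLS estimates, residuals, $\hat\sigma^2$ and $\widehat{\operatorname{avar}}(\hat\beta_{OLS}) = \hat\sigma^2(X'X)^{-1}$ — can be computed in a few lines. This is a sketch on simulated data with arbitrary parameter values, not code from the course:

```python
# Illustrative sketch: computing avar_hat(beta_hat) = sigma2_hat * (X'X)^{-1}.
import numpy as np

rng = np.random.default_rng(5)
N, K = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta = np.array([1.0, 0.5, -0.2])
y = X @ beta + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # OLS estimates
u_hat = y - X @ beta_hat                         # residuals
sigma2_hat = u_hat @ u_hat / (N - K)             # OLS estimator of sigma^2
avar_hat = sigma2_hat * np.linalg.inv(X.T @ X)   # = (1/N) sigma2_hat M_XX_hat^{-1}
se = np.sqrt(np.diag(avar_hat))                  # asymptotic standard errors
print("beta_hat:", np.round(beta_hat, 3), " se:", np.round(se, 3))
```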

Since $\widehat{\operatorname{avar}}(\hat\beta_{OLS}) \xrightarrow{P} \operatorname{avar}(\hat\beta_{OLS})$, we also have

$\hat\beta_{OLS} \stackrel{a}{\sim} N\left(\beta, \widehat{\operatorname{avar}}(\hat\beta_{OLS})\right)$ or $\hat\beta_{OLS} \stackrel{a}{\sim} N\left(\beta, \hat\sigma^2(X'X)^{-1}\right)$

Note that this asymptotic approximation to the distribution of $\hat\beta_{OLS}$ (in the model with conditional homoskedasticity) coincides with the exact result we obtained for the distribution of $\hat\beta_{OLS}$ in the classical linear regression model with Normally distributed errors. There is at least one version of the linear model for which we know this approximation based on the asymptotic distribution theory will be accurate.

In large samples, this asymptotic distribution result provides a motivation for using this expression to estimate the variance of $\hat\beta_{OLS}$, and for treating $\hat\beta_{OLS}$ as a random vector with a Normal distribution, even if

• we have no reason to assume that the error terms $u_i$ are Normally distributed

• or to maintain the linear conditional expectation (strict exogeneity) assumption of the classical linear regression model

This will be useful in practice in sample sizes that are large enough for the asymptotic distribution to provide a good approximation to the distribution of $\hat\beta_{OLS}$, but not so large that the variance of $\hat\beta_{OLS}$ becomes negligibly small.

To interpret this expression for the asymptotic variance of $\hat\beta_{OLS}$ (under conditional homoskedasticity), $\widehat{\operatorname{avar}}(\hat\beta_{OLS}) = \hat\sigma^2(X'X)^{-1}$, it may be helpful to consider the special case with one explanatory variable (denoted $x_i$) and no intercept:

$y_i = x_i\beta + u_i$

where we assume that both $y_i$ and $x_i$ have zero mean, perhaps because the original variables have been transformed into deviations from their sample means. Then $X'X$ is simply $\sum_{i=1}^{N}x_i^2$.

Since all the $x_i^2$ terms are non-negative, $X'X = \sum_{i=1}^{N}x_i^2$ increases with the sample size $N$. By contrast, $\hat\sigma^2 \xrightarrow{P} \sigma^2$, so $\hat\sigma^2$ does not increase with the sample size $N$. In this example, it is clear that

$\widehat{\operatorname{avar}}(\hat\beta_{OLS}) = \hat\sigma^2(X'X)^{-1} = \frac{\hat\sigma^2}{\sum_{i=1}^{N}x_i^2}$

decreases with the sample size $N$.

So although the expression $\widehat{\operatorname{avar}}(\hat\beta_{OLS}) = \hat\sigma^2(X'X)^{-1}$ does not appear to depend on the sample size $N$, our estimate of the asymptotic variance will tend to fall as we add more observations to the sample. The square root of this (scalar) asymptotic variance, termed the asymptotic standard error, also falls as $N$ increases, indicating that we can estimate the parameter $\beta$ more precisely as we add more observations to the sample.

In the limit as $N \to \infty$, we have

$\widehat{\operatorname{avar}}(\hat\beta_{OLS}) = \frac{\hat\sigma^2}{\sum_{i=1}^{N}x_i^2} \to 0$

consistent with our result that the limit distribution of $\hat\beta_{OLS}$ is degenerate.
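A quick numerical check of this point (an added illustrative simulation): as $N$ quadruples, the estimated asymptotic standard error roughly halves, while $se \cdot \sqrt{N}$ stays roughly stable.

```python
# Illustrative sketch: the estimated standard error falls like 1/sqrt(N).
import numpy as np

rng = np.random.default_rng(6)
for N in (100, 400, 1600, 6400):
    x = rng.normal(size=N)                 # zero-mean regressor, no intercept
    y = 2.0 * x + rng.normal(size=N)
    b = (x @ y) / (x @ x)
    s2 = ((y - b * x) ** 2).sum() / (N - 1)
    se = np.sqrt(s2 / (x @ x))             # sqrt( sigma2_hat / sum x_i^2 )
    print(f"N = {N:5d}: se = {se:.4f}   se * sqrt(N) = {se * np.sqrt(N):.3f}")
```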

The observation that $X'X = \sum_{i=1}^{N}x_i^2$ increases with $N$ is perfectly consistent with our assumption that $\left(\frac{X'X}{N}\right) = \frac{1}{N}\sum_{i=1}^{N}x_i^2 \xrightarrow{P} M_{XX}$.

In the scalar case, with $x_i$ having zero mean, $\frac{1}{N}\sum_{i=1}^{N}x_i^2$ is the sample variance of the random variable $x$. With independent observations, $\frac{1}{N}\sum_{i=1}^{N}x_i^2$ is a consistent estimator of the population variance. In this case, our assumption simply requires that the population variance of $x$ exists, and is finite.

With cross-section data, this assumption is usually innocuous. With stationary time series data, this assumption is also satisfied, although this suggests that a different approach will be required for non-stationary time series (e.g. random walks, unit roots, ...).

Now let the scalar $\hat{v}_{kk}$ again denote the $k$th element on the main diagonal of the $K \times K$ (asymptotic) variance matrix $\hat\sigma^2(X'X)^{-1} = \widehat{\operatorname{avar}}(\hat\beta_{OLS})$. Since

$\hat\beta_k \stackrel{a}{\sim} N(\beta_k, \hat{v}_{kk})$

we also have

$t_k = \frac{\hat\beta_k - \beta_k}{\sqrt{\hat{v}_{kk}}} = \frac{\hat\beta_k - \beta_k}{se_k} \stackrel{a}{\sim} N(0, 1)$

Simple hypothesis tests of the form $H_0: \beta_k = \beta_k^0$ can be conducted in large samples by comparing the test statistic $t_k = \frac{\hat\beta_k - \beta_k^0}{se_k}$, computed under the null hypothesis, to critical values of the standard Normal $N(0, 1)$ distribution at the desired significance level.

For example, to test the null $H_0: \beta_k = \beta_k^0$ against the two-sided alternative $H_1: \beta_k \neq \beta_k^0$ at the 5% level of significance, we can compare the absolute value of $t_k = \frac{\hat\beta_k - \beta_k^0}{se_k}$ that we obtain to the critical value 1.96. Strictly speaking, the standard error $se_k = \sqrt{\hat{v}_{kk}}$ should be referred to here as the asymptotic standard error, sometimes denoted $ase_k$. This test is sometimes referred to as an asymptotic t-test.
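Putting the pieces together, an asymptotic t-test can be sketched as follows (an added illustration on simulated data with arbitrary true values; the null tested here, $H_0: \beta_2 = 0$, is false by construction, so the test should typically reject):

```python
# Illustrative sketch: an asymptotic t-test of H0: beta_2 = 0 at the 5% level.
import numpy as np

rng = np.random.default_rng(7)
N, K = 500, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 0.3]) + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat
sigma2_hat = u_hat @ u_hat / (N - K)
se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))

t_k = (beta_hat[1] - 0.0) / se[1]          # test H0: beta_2 = 0
print(f"t = {t_k:.2f}; |t| > 1.96, reject at 5% level: {abs(t_k) > 1.96}")
```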

Why can we treat $t_k$ here as a draw from the standard Normal distribution, even though we have estimated the unknown $\sigma^2$? The limit distribution result

$\sqrt{N}(\hat\beta_{OLS} - \beta) \xrightarrow{D} N(0, \sigma^2M_{XX}^{-1})$

implies that

$z_k = \frac{\hat\beta_k - \beta_k}{\sqrt{v_{kk}}} \xrightarrow{D} N(0, 1)$

where $v_{kk}$ is the (unknown) $k$th element on the main diagonal of $\left(\frac{1}{N}\right)\sigma^2M_{XX}^{-1}$. Using $se_k = \sqrt{\hat{v}_{kk}}$ to estimate the unknown $\sqrt{v_{kk}}$ gives

$t_k = \frac{\hat\beta_k - \beta_k}{\sqrt{\hat{v}_{kk}}} = \left(\frac{\sqrt{v_{kk}}}{\sqrt{\hat{v}_{kk}}}\right)\left(\frac{\hat\beta_k - \beta_k}{\sqrt{v_{kk}}}\right) = \left(\frac{\sqrt{v_{kk}}}{\sqrt{\hat{v}_{kk}}}\right)z_k$

Since $\sqrt{\hat{v}_{kk}}$ is a consistent estimator of $\sqrt{v_{kk}}$, we have $\sqrt{\hat{v}_{kk}} \xrightarrow{P} \sqrt{v_{kk}}$, or $\left(\frac{\sqrt{v_{kk}}}{\sqrt{\hat{v}_{kk}}}\right) \xrightarrow{P} 1$. Then $t_k$ has the same limit distribution as $z_k$, so that we also have

$t_k = \frac{\hat\beta_k - \beta_k}{\sqrt{\hat{v}_{kk}}} = \frac{\hat\beta_k - \beta_k}{se_k} \xrightarrow{D} N(0, 1)$

However, since this test is only justified by an asymptotic approximation, and since we know that a $t(N-K)$ random variable converges in distribution to a $N(0, 1)$ random variable as $N \to \infty$ with $K$ fixed, there is also an asymptotic justification for comparing the test statistic $t_k = \frac{\hat\beta_k - \beta_k^0}{se_k}$, computed under the null hypothesis, to critical values of the $t(N-K)$ distribution.
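To see how small the difference is, compare the two-sided 5% critical values (a short sketch using scipy, with $K = 5$ chosen arbitrarily):

```python
# Illustrative sketch: t(N-K) critical values approach the N(0,1) value.
from scipy.stats import norm, t

K = 5
z_crit = norm.ppf(0.975)                   # N(0,1) critical value, about 1.96
for N in (20, 50, 200, 1000):
    t_crit = t.ppf(0.975, df=N - K)        # t(N-K) critical value
    print(f"N = {N:4d}: t(N-K) = {t_crit:.3f}   N(0,1) = {z_crit:.3f}")
```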

Some econometric packages use the $t(N-K)$ approximation to the distribution of $t_k = \frac{\hat\beta_k - \beta_k^0}{se_k}$ for reporting p-values and confidence intervals. You may have noticed that Stata does this.

In general, the exact distribution of the test statistic is unknown, and depends on features of the data generating process. Both the $N(0, 1)$ approximation and the $t(N-K)$ approximation only have an asymptotic justification, and they coincide in the limit as $N \to \infty$ for given $K$. In large samples, this difference becomes negligibly small.

Testing linear restrictions

The asymptotic distribution of the $p \times 1$ random vector $\hat\theta_{OLS} = H\hat\beta_{OLS}$, where $H$ is a non-stochastic $p \times K$ matrix with $\operatorname{rank}(H) = p$ (full row rank), is given by

$\hat\theta_{OLS} \stackrel{a}{\sim} N\left(H\beta, H\widehat{\operatorname{avar}}(\hat\beta_{OLS})H'\right)$

where $\operatorname{plim}_{N\to\infty}\hat\theta_{OLS} = H\operatorname{plim}_{N\to\infty}\hat\beta_{OLS} = H\beta$ and $\widehat{\operatorname{avar}}(\hat\theta_{OLS}) = H\widehat{\operatorname{avar}}(\hat\beta_{OLS})H' = \hat\sigma^2H(X'X)^{-1}H'$. Then the quadratic form

$w = (\hat\theta_{OLS} - H\beta)'\left[\widehat{\operatorname{avar}}(\hat\theta_{OLS})\right]^{-1}(\hat\theta_{OLS} - H\beta) \stackrel{a}{\sim} \chi^2(p)$

Joint hypothesis tests of the form $H_0: H\beta = \theta_0$ can thus be conducted by computing the test statistic

$w = (\hat\theta_{OLS} - \theta_0)'\left[\widehat{\operatorname{avar}}(\hat\theta_{OLS})\right]^{-1}(\hat\theta_{OLS} - \theta_0)$

which has a $\chi^2(p)$ asymptotic distribution if the null hypothesis is true. We can compare the value of the test statistic we obtain to critical values of the $\chi^2(p)$ distribution at the desired level of significance. For example, if $p = 2$ and we want a test at the 5% significance level, we can compare the computed value of the test statistic $w$ to 5.99, the 95th percentile of the $\chi^2(2)$ distribution.
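A sketch of this Wald test on simulated data (an added illustration with an arbitrary example: $H_0: \beta_2 = \beta_3 = 0$, imposed as true in the data generating process, so $w$ should usually fall below 5.99):

```python
# Illustrative sketch: Wald test of H0: H beta = theta_0 with p = 2 restrictions.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(8)
N, K, p = 400, 3, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(size=N)   # H0 true by construction

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat
avar_hat = (u_hat @ u_hat / (N - K)) * np.linalg.inv(X.T @ X)

H = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])            # H beta = (beta_2, beta_3)'
theta_hat = H @ beta_hat                   # theta_0 = (0, 0)' under H0
w = theta_hat @ np.linalg.solve(H @ avar_hat @ H.T, theta_hat)
print(f"w = {w:.3f}; chi2({p}) 5% critical value = {chi2.ppf(0.95, df=p):.2f}")
```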

Alternatively, since this test only has an asymptotic justification, and we know that $p$ times an $F(p, N-K)$ random variable converges in distribution to a $\chi^2(p)$ random variable as $N \to \infty$ with $K$ fixed, there is also an asymptotic justification for comparing the test statistic $w/p$, computed under the null hypothesis, to critical values of the $F(p, N-K)$ distribution.

Some econometric packages use the $F(p, N-K)$ approximation to the distribution of $w/p$ for reporting p-values. You may also have noticed that Stata does this. In large samples, this difference also becomes negligibly small.

Testing non-linear restrictions

These asymptotic Wald tests of restrictions on the parameter vector $\beta$ can be extended to cover non-linear restrictions. For example, in the model

$y_i = \beta_1 + \beta_2x_{2i} + \beta_3x_{3i} + u_i$

we may be interested in testing the non-linear restriction that $\beta_2\beta_3 = 1 \leftrightarrow \beta_2\beta_3 - 1 = 0$.

This hypothesis can be formulated as

$h\begin{pmatrix}\beta_1\\\beta_2\\\beta_3\end{pmatrix} = \beta_2\beta_3 - 1 = 0$

or $h(\beta) = \beta_2\beta_3 - 1 = 0$, for a vector-valued function $h(\beta)$ with continuous first derivatives. Let $H(\beta)$ be the $p \times K$ matrix of first derivatives $H(\beta) = \partial h(\beta)/\partial\beta'$, so that in our example we have

$H(\beta) = \partial h(\beta)/\partial\beta' = \begin{pmatrix}0 & \beta_3 & \beta_2\end{pmatrix}$

with full row rank (i.e. $\operatorname{rank}(H(\beta)) = 1$ in our example).

We can then use the delta method to show that

$w = h(\hat\beta_{OLS})'\left[H(\hat\beta_{OLS})\widehat{\operatorname{avar}}(\hat\beta_{OLS})H(\hat\beta_{OLS})'\right]^{-1}h(\hat\beta_{OLS}) \stackrel{a}{\sim} \chi^2(p)$

under the null hypothesis that $h(\beta) = 0$. We compute the test statistic $w$ using our estimate of the parameter vector $\hat\beta_{OLS}$ and its asymptotic variance $\widehat{\operatorname{avar}}(\hat\beta_{OLS})$, and compare the result to critical values of the $\chi^2(p)$ distribution.

The Wald test statistic for linear restrictions, stated previously, can be obtained as a special case, setting $h(\beta) = H\beta - \theta_0 = 0$. Note that the representation $h(\beta) = 0$ for non-linear restrictions is not unique.

For example, to test the restriction that $\beta_2\beta_3 = 1$, we may consider the representations

$\beta_2\beta_3 - 1 = 0$

or $\beta_2 - \frac{1}{\beta_3} = 0$

or $\beta_3 - \frac{1}{\beta_2} = 0$

Although Wald test statistics based on different representations of non-linear restrictions are asymptotically equivalent, they can take different values in finite samples, and there is no guarantee that different Wald tests of the same non-linear restriction will give the same outcome in finite samples.
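This finite sample non-invariance is easy to exhibit numerically. The sketch below (an added illustration on simulated data, with arbitrary parameter values satisfying $\beta_2\beta_3 = 1$) computes delta-method Wald statistics for two of the representations above; they generally take different values.

```python
# Illustrative sketch: delta-method Wald tests of beta_2 * beta_3 = 1 under two
# algebraically equivalent representations; the statistics differ in finite samples.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([0.5, 2.0, 0.5]) + rng.normal(size=N)   # beta_2 * beta_3 = 1

b = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ b
V = (u_hat @ u_hat / (N - 3)) * np.linalg.inv(X.T @ X)   # avar_hat(beta_hat)

def wald(h, H):
    """Wald statistic h' [H V H']^{-1} h for a single (p = 1) restriction."""
    return float(h ** 2 / (H @ V @ H))

w1 = wald(b[1] * b[2] - 1.0, np.array([0.0, b[2], b[1]]))            # h = b2*b3 - 1
w2 = wald(b[1] - 1.0 / b[2], np.array([0.0, 1.0, 1.0 / b[2] ** 2]))  # h = b2 - 1/b3
print(f"w1 = {w1:.3f}, w2 = {w2:.3f}; chi2(1) 5% crit = {chi2.ppf(0.95, df=1):.2f}")
```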