Asymptotic properties
We now look at properties of the OLS estimator that hold in large samples or, more precisely, as the sample size N tends to infinity, keeping the number of explanatory variables K fixed
The idea is that, at least for models estimated using large data samples,
these asymptotic results will provide a useful approximation to the behaviour
of our estimators and test statistics
The power of this approach is that we can establish useful asymptotic results under much weaker assumptions than those needed to show that the OLS estimator is unbiased and efficient in the classical linear regression model, or those needed to establish the exact small sample distributions of the OLS estimator and the t- and F-statistics in the classical linear regression model with Normally distributed errors
(Weak) Consistency
We first consider a sufficient set of assumptions for the OLS estimator to be consistent in the linear model
i) yi = x′iβ + ui for i = 1, ..., N, or y = Xβ + u
ii) The data on (yi, xi) are independent over i = 1, ..., N, with E(ui) = 0 and E(xiui) = 0 for all i = 1, ..., N
iii) X is stochastic and full rank
iv) The K × K matrix MXX = plim N→∞ (X′X/N) = plim N→∞ (1/N) Σi xix′i (where Σi denotes summation over i = 1, ..., N) exists and is non-singular
Then plim N→∞ βOLS = β, or βOLS →P β
Recall that a sequence of random variables {xN : N = 1, 2, ...} converges in probability to the constant a if, for all δ > 0 and ε > 0, there exists N* such that
Pr(|xN − a| < ε) > 1 − δ for all N > N*
or equivalently
lim N→∞ Pr(|xN − a| < ε) = 1
lim N→∞ Pr(|xN − a| > ε) = 0
This is denoted by plim N→∞ xN = a, or xN →P a
Convergence in probability (associated with weak laws of large numbers) is
implied by the property of almost sure convergence (associated with strong
laws of large numbers)
A random variable xN converges almost surely to a if
Pr(lim N→∞ xN = a) = 1
This is denoted by xN →a.s. a
These definitions are applied element-by-element to random vectors or matrices
For example, if we have xN →P a and yN →P b, then the random vector (xN, yN) →P (a, b)
An important property of probability limits is known as Slutsky's Theorem
Let the random vector xN have a probability limit given by the vector a,
and let g(.) be a real-valued function which is continuous at the point a
Then g(xN) →P g(a)
For example, let xN = (x1N, x2N) →P (a1, a2) and let g(x1, x2) = x1x2
Then we have g(x1N, x2N) = x1N x2N →P a1a2,
or plim N→∞ (x1N x2N) = [plim N→∞ x1N][plim N→∞ x2N]
The probability limit of the product of two random variables x1N and x2N
is the product of their probability limits
Note that this property does not hold in general for expected values
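A quick numerical sketch (illustrative only; the distributions and seed are made up) contrasts the two properties: the product of the sample means of two correlated variables A and B converges to the product of their probability limits, while E(AB) differs from E(A)E(B):

```python
import numpy as np

rng = np.random.default_rng(0)

# A ~ N(2, 1) and B = A + noise, so the plim of each sample mean is 2
for N in (100, 100_000):
    A = rng.normal(2.0, 1.0, size=N)
    B = A + rng.normal(0.0, 1.0, size=N)
    x1N, x2N = A.mean(), B.mean()
    print(N, x1N * x2N)   # approaches plim(x1N) * plim(x2N) = 2 * 2 = 4

# In contrast, E(AB) = E(A)E(B) + cov(A, B) = 4 + 1 = 5, not 4
A = rng.normal(2.0, 1.0, size=1_000_000)
B = A + rng.normal(0.0, 1.0, size=1_000_000)
print((A * B).mean())   # close to 5
```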
We use this property to prove the consistency of βOLS under the stated
assumptions
βOLS = (X′X)⁻¹X′y = (X′X)⁻¹X′(Xβ + u)
= (X′X)⁻¹X′Xβ + (X′X)⁻¹X′u = β + (X′X)⁻¹X′u
Equivalently
βOLS = β + N(X′X)⁻¹(1/N)X′u = β + (X′X/N)⁻¹(X′u/N)
Now
plim N→∞ βOLS = plim N→∞ β + [plim N→∞ (X′X/N)]⁻¹[plim N→∞ (X′u/N)] = β + MXX⁻¹ plim N→∞ (X′u/N)
Consistency thus requires that
plim N→∞ (X′u/N) = plim N→∞ (1/N) Σi xiui = 0
This in turn is implied by our assumption that E(xiui) = 0
Using a suitable Law of Large Numbers for independent observations over i = 1, ..., N, we have that plim N→∞ (1/N) Σi xiui = E(xiui) = 0
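The consistency result can be illustrated with a small simulation (a sketch, not part of the lecture; the design, seed and parameter values are arbitrary), showing both X′u/N and the estimation error shrinking as N grows:

```python
import numpy as np

rng = np.random.default_rng(42)
beta = np.array([1.0, 2.0])   # illustrative true parameter vector

for N in (100, 10_000, 1_000_000):
    X = np.column_stack([np.ones(N), rng.normal(size=N)])
    u = rng.standard_t(df=5, size=N)   # non-Normal errors: Normality is not needed
    y = X @ beta + u
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    # Both X'u/N and the estimation error shrink towards zero as N grows
    print(N, np.abs(X.T @ u / N).max(), np.abs(b_ols - beta).max())
```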
For independent and identically distributed (iid) observations, we can use
the Kolmogorov Law of Large Numbers
For independent but not identically distributed observations, we can use
the Markov Law of Large Numbers
For time series models with lagged dependent variables, the observations (yi, xi) are not independent over i = 1, ..., N, since xi includes yi−1, violating assumption (ii)
In this case, we can still use Laws of Large Numbers for stationary, ergodic processes to establish conditions under which βOLS remains consistent
[Sections 2.2 and 2.3 of Hayashi (2000) provide the details]
In contrast, we know that βOLS is not unbiased in dynamic models with lagged dependent variables, since the strict exogeneity assumption E(y|X) = Xβ was needed to establish unbiasedness
When we state that βOLS is consistent, or βOLS →P β
• we view the sequence of OLS estimates from increasing sample sizes as a sequence of random vectors
• consistency holds for each element of the vector, so that β̂k →P βk for k = 1, ..., K
Notice that we do not need to assume either conditional homoskedasticity (V(ui|xi) = σ²) or Normality (ui|xi has a Normal distribution) to establish consistency of the OLS estimator
Nor do we require that the conditional expectation of yi given xi is linear
(E(yi|xi) = x′iβ), let alone the stronger form E(y|X) = Xβ ↔ E(u|X) = 0
The key assumption for the OLS estimator to be consistent is that the
error term ui is uncorrelated with (orthogonal to) each of the
explanatory variables in the vector xi, as implied by the statement
E(xiui) = 0 (here a K × 1 vector with each element equal to zero)
Note that the conditional mean independence assumption E(ui|xi) = 0 implies the orthogonality assumption E(xiui) = 0
Using the Law of Iterated Expectations
E(xiui) = Ex[Eu|x(xiui|xi)] = Ex[xiEu|x(ui|xi)] = Ex[0] = 0
if E(ui|xi) = 0
E(ui|xi) = 0 also implies that E(ui) = 0, since
E(ui) = Ex[Eu|x(ui|xi)] = Ex[0] = 0
And E(ui|xi) = 0 implies that cov(xi, ui) = 0, since
cov(xi, ui) = E(xiui)− E(xi)E(ui) = 0− 0 = 0
So we have that E(ui|xi) = 0 implies cov(xi, ui) = 0
In general the converse is not true
Although we know that if xi and ui are jointly Normally distributed, then zero covariance does imply independence, in which case we have E(ui|xi) = 0
The statements
yi = x′iβ + ui
and
E(ui) = 0 and E(xiui) = 0
define the linear projection of yi on xi to be E∗(yi|xi) = x′iβ
With independent observations on (yi, xi) over i = 1, ..., N, the OLS estimator provides a consistent estimator for the parameters of this linear projection
More accurately, this property should be called ‘weak consistency’, since
we have only established convergence in probability
But this is commonly abbreviated to the term ‘consistency’
Consistency is a large sample or asymptotic property, i.e. a property that
holds in the limit as the sample size N increases to infinity
Unbiasedness is a small sample or finite sample property
An unbiased estimator β̂ (with E(β̂) = β) will usually be consistent, but need not be
Consistency follows if, in addition, V(β̂) falls to zero as the sample size increases
A consistent estimator need not be unbiased
For example, if β̂ is an unbiased and consistent estimator of β, then β̃ = β̂ + 1/N is a consistent but biased estimator of β
E(β̃) = E(β̂) + 1/N = β + 1/N ≠ β ↔ biased
plim N→∞ β̃ = plim N→∞ β̂ + plim N→∞ (1/N) = β + 0 = β ↔ consistent
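A simulation sketch of this example (arbitrary design, seed and parameter values) shows the bias of the estimator β̂ + 1/N shrinking at rate 1/N, so the estimator is biased yet consistent:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 2.0          # illustrative true value
reps = 2_000
biases = {}

for N in (10, 100, 10_000):
    estimates = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=N)
        y = beta * x + rng.normal(size=N)
        b_hat = (x @ y) / (x @ x)        # unbiased and consistent OLS estimate
        estimates[r] = b_hat + 1.0 / N   # consistent but biased: add 1/N
    biases[N] = estimates.mean() - beta
    print(N, biases[N])                  # bias is roughly 1/N, vanishing as N grows
```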
Asymptotic Normality
Consistency is a desirable property for an estimator to have, particularly
if we have the luxury of working with large datasets
But consistency alone is not sufficient for inference
To obtain confidence intervals and to conduct hypothesis tests, we require
an asymptotic distribution result, relating the distribution of the estimator
in large samples to a known (and ideally a convenient) distribution
We start by considering a sufficient set of assumptions for the OLS estimator, suitably scaled, to have a limit distribution that is Normal
To link this asymptotic distribution result to the finite sample distribution result obtained previously for the classical Normal linear regression model, we first assume that the conditional variance of the error term (E(ui²|xi)) satisfies the conditional homoskedasticity assumption E(ui²|xi) = σ² for all i = 1, ..., N
This assumption is needed for the OLS estimator to have the particular asymptotic variance derived here, although not to establish that the asymptotic distribution is Normal
Assumptions
i) yi = x′iβ + ui for i = 1, ..., N, or y = Xβ + u
ii) The data on (yi, xi) are independent over i = 1, ..., N, with E(ui) = 0, E(xiui) = 0 and E(ui²|xi) = σ² for all i = 1, ..., N
iii) X is stochastic and full rank
iv) The K × K matrix MXX = plim N→∞ (X′X/N) = plim N→∞ (1/N) Σi xix′i exists and is non-singular
v) The K × 1 vector (X′u/√N) = (1/√N) Σi xiui →D N(0, σ²MXX)
Then √N(βOLS − β) →D N(0, σ²MXX⁻¹)
Recall that a sequence of random variables {xN : N = 1, 2, ...} converges in distribution to the random variable x if and only if
FN(c) → F(c) as N → ∞ at every point c where F is continuous
where FN is the cumulative distribution function of xN and F is the cumulative distribution function of x
If x ∼ N(µ, σ²), we write xN →D N(µ, σ²)
This definition extends to random vectors, with FN and F interpreted as the cumulative distribution functions of the random vectors xN and x
If the vector x ∼ N(µ, Σ), we write xN →D N(µ, Σ)
The distribution to which the random vector xN converges in the limit as
N →∞ is referred to as the limit distribution or the limiting distribution
An important property of limit distributions is known as the continuous
mapping theorem
Let the random vector xN →D x and let g(.) be a continuous, real-valued function
Then g(xN) →D g(x)
Again if xN = (x1N, x2N) →D (x1, x2) and g(x1, x2) = x1x2, we have that g(x1N, x2N) = x1N x2N →D x1x2
Other useful results are known variously as the transformation theorem, Cramer's theorem or (again) as Slutsky's theorem
For scalar random variables xN and yN , these can be stated as
If xN →D x and yN →P y, where x is a random variable and y is a constant, then
i) xN + yN →D x + y
ii) xN yN →D xy (product rule)
iii) xN/yN →D x/y provided y ≠ 0
Results (i) and (ii) extend to (conformable) random vectors
If xN is a K × 1 random vector, x is a K × 1 random vector, yN is a K × 1 random vector, and y is a K × 1 non-stochastic vector, we have
i) xN + yN →D x + y
ii) y′N xN →D y′x
If xN is a K × 1 random vector and AN is a K × K random matrix, with xN →D x and AN →P A, we also have
iii) AN xN →D Ax
iv) x′N AN⁻¹ xN →D x′A⁻¹x provided A⁻¹ exists
The matrix version of the product rule AN xN →D Ax implies that if xN →D N(0, Σ) and AN →P A, then AN xN →D N(0, AΣA′)
We use this result to show the asymptotic Normality of βOLS under the
stated assumptions
Using
βOLS = β + (X′X/N)⁻¹(X′u/N)
we have
βOLS − β = (X′X/N)⁻¹(X′u/N)
and
√N(βOLS − β) = (X′X/N)⁻¹(X′u/√N)
Now the K × K matrix (X′X/N)⁻¹ →P MXX⁻¹ from assumption (iv)
And the K × 1 vector (X′u/√N) →D N(0, σ²MXX) from assumption (v)
The product (X′X/N)⁻¹(X′u/√N) thus has the same limit distribution as MXX⁻¹(X′u/√N), which gives us that
(X′X/N)⁻¹(X′u/√N) →D N(0, σ²MXX⁻¹ MXX MXX⁻¹) = N(0, σ²MXX⁻¹)
using the symmetry of MXX⁻¹
Hence
√N(βOLS − β) →D N(0, σ²MXX⁻¹)
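A simulation sketch (arbitrary design and seed; skewed exponential errors chosen deliberately to be non-Normal) illustrates this limit result: across replications, √N(β̂ − β) behaves like a draw from the Normal limit distribution:

```python
import numpy as np

rng = np.random.default_rng(7)
beta = 0.5                      # illustrative scalar model y = x*beta + u
N, reps = 1_000, 5_000

z = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=N)
    u = rng.exponential(1.0, size=N) - 1.0   # skewed, mean-zero, variance-1 errors
    y = beta * x + u
    b = (x @ y) / (x @ x)
    z[r] = np.sqrt(N) * (b - beta)

# Here sigma^2 = 1 and MXX = E(x^2) = 1, so the limit distribution is N(0, 1)
print(z.mean(), z.std(), np.mean(np.abs(z) < 1.96))
```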
More primitive assumptions can be used to derive assumption (v): (X′u/√N) →D N(0, σ²MXX)
Given assumption (ii), with independence and conditional homoskedasticity, we can use the Lindeberg-Levy Central Limit Theorem for independent and identically distributed random vectors
If {zi : i = 1, 2, ...} is a sequence of iid K × 1 random vectors with E(zi) = 0 and V(zi) = E(ziz′i) = Σ finite, then
(1/√N) Σi zi →D N(0, Σ)
Letting zi = xiui, we have iid K × 1 random vectors with E(xiui) = 0
Assuming finite variance, we obtain
(1/√N) Σi xiui = (X′u/√N) →D N(0, V(xiui))
Moreover
V(xiui) = E[(xiui)(xiui)′] = E(xi ui ui x′i) = E(ui² xix′i)
since ui is a scalar
But
E(ui² xix′i) = Ex[Eu|x(ui² xix′i | xi)] = Ex[xix′i Eu|x(ui²|xi)] = Ex[xix′i σ²] = σ² Ex(xix′i) = σ² MXX
using the conditional homoskedasticity assumption E(ui²|xi) = σ²
Then V(xiui) = σ²MXX and (X′u/√N) →D N(0, σ²MXX), as stated in assumption (v)
The conditional homoskedasticity assumption is only required to obtain this particular expression for the variance of the limit distribution of (X′u/√N), and hence the implied variance of the limit distribution of βOLS
With independence and conditional heteroskedasticity (i.e. E(ui²|xi) varying with xi), we can use the Liapounov Central Limit Theorem for independent but not identically distributed random vectors, to obtain that the limit distribution of (X′u/√N) is Normal, but with a different expression for the variance
For time series models with lagged dependent variables, we cannot assume
independent observations
In this case, we can still use Central Limit Theorems for stationary, ergodic
Martingale difference sequences to establish conditions under which βOLS
remains asymptotically Normal
[Sections 2.2 and 2.3 of Hayashi (2000) provide the details]
Asymptotic inference
(Weak) Consistency: βOLS →P β
The parameter vector β is not stochastic, so the limit distribution of βOLS
itself is degenerate
In practice, we do not get to estimate parameter vectors using infinitely
large samples
To form confidence intervals and to test hypotheses, we need to approxi-
mate the distribution of the estimator in finite samples
Asymptotic distribution results allow us to approximate this distribution
accurately if the (finite) sample size N is suffi ciently large
Asymptotic Normality: √N(βOLS − β) →D N(0, σ²MXX⁻¹)
Appropriately scaled, we have a non-degenerate limit distribution
To use this limit distribution to approximate the distribution of βOLS in large samples, we multiply by the scalar 1/√N and add the non-stochastic vector β to both sides
This gives us the approximation to the distribution of the OLS estimator based on our asymptotic distribution theory, or more simply the ‘asymptotic distribution’ of βOLS, as
βOLS ∼a N(β, (1/N)σ²MXX⁻¹)
The variance matrix (1/N)σ²MXX⁻¹ is the approximation to the variance of the OLS estimator based on our asymptotic distribution theory, or more simply the ‘asymptotic variance’ of βOLS
We denote this by avar(βOLS) = (1/N)σ²MXX⁻¹ or (if it is clear from the context that we are referring to the asymptotic variance) more simply by V(βOLS) = (1/N)σ²MXX⁻¹
*** Note that some authors use the term ‘asymptotic variance’ and the notation avar() differently ***
Warning: some authors, including Hayashi (2000), use the term ‘asymptotic variance’ and the notation avar(βOLS) to denote the variance of the limit distribution of √N(βOLS − β), here σ²MXX⁻¹, rather than the approximation to the variance of βOLS that we obtain from this limit distribution, here (1/N)σ²MXX⁻¹, as we use the term here
Expressions in Hayashi (2000) which involve the term avar() should be divided by the sample size N, or multiplied by (1/N), to make them consistent with the notation on these slides
See Hayashi (2000) p.95
Neither σ² nor MXX is known
The obvious choice to estimate MXX = plim N→∞ (X′X/N) is
M̂XX = (X′X/N)
Since plim N→∞ M̂XX = plim N→∞ (X′X/N) = MXX, this gives us a consistent estimator of MXX
Both the OLS estimator
σ̂² = (1/(N − K)) Σi ûi² = û′û/(N − K)
and the Maximum Likelihood estimator
σ̂²ML = (1/N) Σi ûi² = û′û/N
can be shown to be consistent estimators of σ² under our assumptions
Using the OLS estimator of σ² gives us a consistent estimator of avar(βOLS) as
âvar(βOLS) = (1/N) σ̂² M̂XX⁻¹ = (1/N) σ̂² (X′X/N)⁻¹ = (1/N) σ̂² N(X′X)⁻¹ = σ̂²(X′X)⁻¹
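The chain of formulas above can be sketched in a few lines of code (illustrative design, seed and parameter values; this simply computes σ̂²(X′X)⁻¹ directly and reads off the standard errors):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 500, 2                                   # illustrative sample
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=N)

b = np.linalg.solve(X.T @ X, X.T @ y)           # OLS estimate
u_hat = y - X @ b                               # residuals
sigma2_hat = u_hat @ u_hat / (N - K)            # OLS estimator of sigma^2
avar_hat = sigma2_hat * np.linalg.inv(X.T @ X)  # estimated asymptotic variance
se = np.sqrt(np.diag(avar_hat))                 # asymptotic standard errors
print(b, se)
```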
Since âvar(βOLS) →P avar(βOLS), we also have
βOLS ∼a N(β, âvar(βOLS)) or βOLS ∼a N(β, σ̂²(X′X)⁻¹)
Note that this asymptotic approximation to the distribution of βOLS (in the model with conditional homoskedasticity) coincides with the exact result we obtained for the distribution of βOLS in the classical linear regression model with Normally distributed errors
There is at least one version of the linear model for which we know this approximation based on the asymptotic distribution theory will be accurate
In large samples, this asymptotic distribution result provides a motivation for using this expression to estimate the variance of βOLS, and for treating βOLS as a random vector with a Normal distribution, even if
• we have no reason to assume that the error terms ui are Normally distributed
• or to maintain the linear conditional expectation (strict exogeneity) assumption of the classical linear regression model
This will be useful in practice in sample sizes that are large enough for the asymptotic distribution to provide a good approximation to the distribution of βOLS, but not so large that the variance of βOLS becomes negligibly small
To interpret this expression for the asymptotic variance of βOLS (under conditional homoskedasticity), âvar(βOLS) = σ̂²(X′X)⁻¹, it may be helpful to consider the special case with one explanatory variable (denoted xi) and no intercept
yi = xiβ + ui
where we assume that both yi and xi have zero mean, perhaps because the original variables have been transformed into deviations from their sample means
Then X′X is simply Σi xi²
Since all the xi² terms are non-negative, X′X = Σi xi² increases with the sample size N
σ̂² →P σ², so σ̂² does not increase with the sample size N
In this example, it is clear that
âvar(βOLS) = σ̂²(X′X)⁻¹ = σ̂² / Σi xi²
decreases with the sample size N
So although the expression âvar(βOLS) = σ̂²(X′X)⁻¹ does not appear to depend on the sample size N, our estimate of the asymptotic variance will tend to fall as we add more observations to the sample
The square root of this (scalar) asymptotic variance, termed the asymptotic standard error, also falls as N increases, indicating that we can estimate the parameter β more precisely as we add more observations to the sample
In the limit as N → ∞, we have
âvar(βOLS) = σ̂² / Σi xi² → 0
consistent with our result that the limit distribution of βOLS is degenerate
The observation that X′X = Σi xi² increases with N is perfectly consistent with our assumption that (X′X/N) = (1/N) Σi xi² →P MXX
In the scalar case, with xi having zero mean, (1/N) Σi xi² is the sample variance of the random variable x
With independent observations, (1/N) Σi xi² is a consistent estimator of the population variance
In this case, our assumption simply requires that the population variance of x exists, and is finite
With cross-section data, this assumption is usually innocuous
With stationary time series data, this assumption is also satisfied
Although this suggests that a different approach will be required for non-
stationary time series (e.g. random walks, unit roots, ...)
Now let the scalar v̂kk again denote the kth element on the main diagonal of the K × K estimated (asymptotic) variance matrix σ̂²(X′X)⁻¹ = âvar(βOLS)
Since
β̂k ∼a N(βk, v̂kk)
we also have
tk = (β̂k − βk)/√v̂kk = (β̂k − βk)/sek ∼a N(0, 1)
Simple hypothesis tests of the form H0 : βk = βk⁰ can be conducted in large samples by comparing the test statistic tk = (β̂k − βk⁰)/sek computed under the null hypothesis to critical values of the standard Normal N(0, 1) distribution at the desired significance level
For example, to test the null H0 : βk = βk⁰ against the two-sided alternative H1 : βk ≠ βk⁰ at the 5% level of significance, we can compare the absolute value of tk = (β̂k − βk⁰)/sek that we obtain to the critical value 1.96
Strictly speaking, the standard error sek = √v̂kk should be referred to here as the asymptotic standard error, sometimes denoted asek
This test is sometimes referred to as an asymptotic t-test
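A minimal sketch of this asymptotic t-test (the data-generating process, seed and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)
N, K = 400, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=N)   # true beta_2 = 0.5

b = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ b
se = np.sqrt(np.diag((u_hat @ u_hat / (N - K)) * np.linalg.inv(X.T @ X)))

# Asymptotic t-test of H0: beta_2 = 0 against a two-sided alternative at 5%
t2 = (b[1] - 0.0) / se[1]
reject = abs(t2) > 1.96
print(t2, reject)
```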
Why can we treat tk here as a draw from the standard Normal distribution, even though we have estimated the unknown σ²?
The limit distribution result
√N(βOLS − β) →D N(0, σ²MXX⁻¹)
implies that
zk = (β̂k − βk)/√vkk →D N(0, 1)
where vkk is the (unknown) kth element on the main diagonal of (1/N)σ²MXX⁻¹
Using sek = √v̂kk to estimate the unknown √vkk gives
tk = (β̂k − βk)/√v̂kk = (√vkk/√v̂kk)((β̂k − βk)/√vkk) = (√vkk/√v̂kk) zk
Since v̂kk is a consistent estimator of vkk, we have √v̂kk →P √vkk, or (√vkk/√v̂kk) →P 1
Then tk has the same limit distribution as zk, so that we also have
tk = (β̂k − βk)/√v̂kk = (β̂k − βk)/sek →D N(0, 1)
However, since this test is only justified by an asymptotic approximation, and since we know that a t(N − K) random variable converges in distribution to a N(0, 1) random variable as N → ∞ with K fixed, there is also an asymptotic justification for comparing the test statistic tk = (β̂k − βk⁰)/sek computed under the null hypothesis to critical values of the t(N − K) distribution
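A quick comparison of the two sets of critical values (a sketch using scipy.stats; the sample sizes shown are arbitrary):

```python
from scipy import stats

# Two-sided 5% critical values: t(N - K) with K = 2, against the Normal 1.96
for N in (20, 50, 200, 1000):
    print(N, stats.t.ppf(0.975, df=N - 2))
print("Normal:", stats.norm.ppf(0.975))
# The t critical values shrink towards the Normal value as N grows
```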
Some econometric packages use the t(N − K) approximation to the distribution of tk = (β̂k − βk⁰)/sek for reporting p-values and confidence intervals
You may have noticed that Stata does this
In general, the exact distribution of the test statistic is unknown, and
depends on features of the data generating process
Both the N(0, 1) approximation and the t(N − K) approximation only
have an asymptotic justification, and they coincide in the limit as N → ∞
for given K
In large samples, this difference becomes negligibly small
Testing linear restrictions
The asymptotic distribution of the p × 1 random vector θOLS = HβOLS, where H is a non-stochastic p × K matrix with rank(H) = p (full row rank), is given by
θOLS ∼a N(Hβ, H âvar(βOLS) H′)
where plim N→∞ θOLS = H plim N→∞ βOLS = Hβ
and âvar(θOLS) = H âvar(βOLS) H′ = σ̂²H(X′X)⁻¹H′
Then the quadratic form
w = (θOLS − Hβ)′ [âvar(θOLS)]⁻¹ (θOLS − Hβ) ∼a χ²(p)
Joint hypothesis tests of the form H0 : Hβ = θ0 can thus be conducted by computing the test statistic
w = (θOLS − θ0)′ [âvar(θOLS)]⁻¹ (θOLS − θ0)
which has a χ²(p) asymptotic distribution if the null hypothesis is true
We can compare the value of the test statistic we obtain to critical values of the χ²(p) distribution at the desired level of significance
For example, if p = 2 and we want a test at the 5% significance level, we can compare the computed value of the test statistic w to 5.99, the 95th percentile of the χ²(2) distribution
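A sketch of this Wald test for p = 2 linear restrictions (illustrative design and seed; H0 is true by construction here, so w should usually fall below the 5% critical value):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
N, K, p = 500, 3, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=N)

b = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ b
avar_hat = (u_hat @ u_hat / (N - K)) * np.linalg.inv(X.T @ X)

# H0: beta_2 = 0.5 and beta_3 = 0.5, i.e. H beta = theta0 with p = 2 restrictions
H = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
theta0 = np.array([0.5, 0.5])
diff = H @ b - theta0
w = diff @ np.linalg.solve(H @ avar_hat @ H.T, diff)   # Wald statistic
crit = stats.chi2.ppf(0.95, df=p)                      # approximately 5.99
p_value = 1 - stats.chi2.cdf(w, df=p)
print(w, p_value, w > crit)
```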
Alternatively, since this test only has an asymptotic justification, and we know that p times an F(p, N − K) random variable converges in distribution to a χ²(p) random variable as N → ∞ with K fixed, there is also an asymptotic justification for comparing the test statistic w/p computed under the null hypothesis to critical values of the F(p, N − K) distribution
Some econometric packages use the F(p, N − K) approximation to the distribution of w/p for reporting p-values
You may also have noticed that Stata does this
In large samples, this difference also becomes negligibly small
Testing non-linear restrictions
These asymptotic Wald tests of restrictions on the parameter vector β can
be extended to cover non-linear restrictions
For example, in the model
yi = β1 + β2x2i + β3x3i + ui
we may be interested in testing the non-linear restriction that β2β3 = 1 ↔ β2β3 − 1 = 0
This hypothesis can be formulated as
h(β1, β2, β3) = β2β3 − 1 = 0
or h(β) = β2β3 − 1 = 0 for a vector-valued function h(β) with continuous first derivatives
Let H(β) be the p × K matrix of first derivatives H(β) = ∂h(β)/∂β′, so that in our example we have
H(β) = ∂h(β)/∂β′ = (0  β3  β2)
with full row rank (i.e. rank(H(β)) = 1 in our example)
We can then use the delta method to show that
w = h(βOLS)′ [H(βOLS) âvar(βOLS) H(βOLS)′]⁻¹ h(βOLS) ∼a χ²(p)
under the null hypothesis that h(β) = 0
We compute the test statistic w using our estimate of the parameter vector βOLS and its asymptotic variance âvar(βOLS), and compare the result to critical values of the χ²(p) distribution
The Wald test statistic for linear restrictions, stated previously, can be obtained as a special case, setting h(β) = Hβ − θ0 = 0
Note that the representation h(β) = 0 for non-linear restrictions is not unique
For example, to test the restriction that β2β3 = 1, we may consider the representations
β2β3 − 1 = 0
or β2 − 1/β3 = 0
or β3 − 1/β2 = 0
Although Wald test statistics based on different representations of non-linear restrictions are asymptotically equivalent, they can take different values in finite samples, and there is no guarantee that different Wald tests of the same non-linear restriction will give the same outcome in finite samples
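A sketch (illustrative design and seed, with β2β3 = 1 true by construction) showing that the three representations yield different delta-method Wald statistics in a finite sample:

```python
import numpy as np

rng = np.random.default_rng(9)
N = 300
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([0.0, 2.0, 0.5]) + rng.normal(size=N)   # beta2*beta3 = 1: H0 true

b = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ b
V = (u_hat @ u_hat / (N - 3)) * np.linalg.inv(X.T @ X)   # estimated avar of b

def wald(h, grad):
    """Delta-method Wald statistic for a single (scalar) restriction h(beta) = 0."""
    return h ** 2 / (grad @ V @ grad)

b2, b3 = b[1], b[2]
# Three algebraically equivalent representations of beta2*beta3 = 1
w1 = wald(b2 * b3 - 1.0, np.array([0.0, b3, b2]))
w2 = wald(b2 - 1.0 / b3, np.array([0.0, 1.0, 1.0 / b3 ** 2]))
w3 = wald(b3 - 1.0 / b2, np.array([0.0, 1.0 / b2 ** 2, 1.0]))
print(w1, w2, w3)   # asymptotically equivalent, but unequal in this finite sample
```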