
Applied Microeconometrics with Stata

Nonparametric Econometrics

Spring Term 2011


Contents

I Introduction

I The histogram estimator

I The kernel density estimator

I Nonparametric regression estimators

I Semi- and nonparametric extensions

I Summary and references


Introduction

I Up to now, all regression models relied on functions (or densities) which depend on an unknown finite-dimensional parameter.

I A finite-dimensional parameter is an element of $R^q$ with $q \in N$.

I For example, linear regression models use a linear combination of the covariates ($x'\beta$).

I Nonlinear regression models specify a known function of (a linear index of) the covariates.

I ML theory is based on an assumption about the density of the data, which depends on a finite-dimensional parameter.

I However, if the functional form or the distributional assumption is wrong, the parameter estimators of these models are in general inconsistent.

I To circumvent this kind of misspecification problem, nonparametric methods can be used, which do not impose a functional form.

I In the following, nonparametric density and regression estimators are described, which are the basis for most nonparametric econometric models.

The Histogram Estimator

I Consider a sample $\{X_i\}_{i=1}^n$ of a random variable $X$ with unknown density $f_X(x)$.

I First, the support of $X$ is divided into $K$ intervals $[a_{j-1}, a_j)$ with $\min(X) = a_0 < a_1 < \dots < a_K = \max(X)$.

I The histogram estimator for the probability that $X$ takes a value in the interval $[a_{j-1}, a_j)$ is then defined by

$$\widehat{\Pr}\big(X \in [a_{j-1}, a_j)\big) = \frac{1}{n} \sum_{i=1}^{n} 1\{X_i \in [a_{j-1}, a_j)\}.$$

I Divided by the interval width $a_j - a_{j-1}$, this estimator can be viewed as a (crude) approximation of $f_X$ at some point $x \in [a_{j-1}, a_j)$.

I Furthermore, the histogram estimator is a step function even if $f_X$ is continuous (see the examples on the next slide).

I A solution is to use smaller intervals $[a_{j-1}, a_j)$, which converge in the limit to a point.

I This approach is called a local estimation approach, and is the basic idea of the kernel density estimator.
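The histogram estimator described above can be sketched in a few lines of Python. This is only an illustration under assumptions not found in the slides: the function name `histogram_probabilities`, the use of equally wide intervals, and the simulated normal sample are all hypothetical choices.

```python
import numpy as np

def histogram_probabilities(x, n_bins):
    """Share of observations in each interval [a_{j-1}, a_j) (equal widths assumed)."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)   # a_0 < a_1 < ... < a_K
    # Index j-1 of the interval [a_{j-1}, a_j) that contains each observation.
    idx = np.searchsorted(edges, x, side="right") - 1
    # The maximum equals a_K; assign it to the last interval.
    idx = np.clip(idx, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    probs = counts / x.size                 # (1/n) * sum of indicator functions
    density = probs / np.diff(edges)        # crude density approximation
    return edges, probs, density

rng = np.random.default_rng(0)              # simulated data, for illustration only
sample = rng.normal(size=1000)
edges, probs, density = histogram_probabilities(sample, n_bins=20)
```

The estimated probabilities sum to one by construction, and the rescaled values in `density` integrate to one over the support.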

The Histogram Estimator

The graphs show histograms of the logarithm of wage (from the data set Mroz.dta). The histogram in the top left panel divides the data into 20 classes (the default value of the histogram command of Stata), that in the top right panel uses 30 classes, and those in the bottom left and right panels use 50 and 100 classes, respectively. For the histogram in the bottom right panel, a normal density fitted to the data is added (blue line). Note that the normal density does not seem to fit the data well.

The Kernel Density Estimator

I Consider now the local estimation of a density function.

I For some intuition for the kernel density estimator, note first that the following holds by the definition of a derivative:

$$f(x) = \frac{d}{dx} F(x) = \lim_{h \to 0} \frac{F(x+h) - F(x-h)}{2h}.$$

I The numerator of the last term can be rewritten as

$$F(x+h) - F(x-h) = \Pr\big(X \in (x-h, x+h)\big).$$

I Define now the following kernel function:

$$k(z) = \begin{cases} 1/2 & \text{if } |z| \le 1, \\ 0 & \text{otherwise.} \end{cases}$$

The kernel function can be viewed as a weighting function.

I From this definition of $k(z)$, it follows that

$$k\left(\frac{X_i - x}{h}\right) = \frac{1}{2}\, 1\{X_i \in (x-h, x+h)\}.$$

See the next slides for a proof of this claim.


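I Consider first the following general results: For some $a$ and $\varepsilon > 0$,

$$|a| < \varepsilon \;\Rightarrow\; a < \varepsilon \text{ and } -a < \varepsilon \;\Rightarrow\; -\varepsilon < a < \varepsilon.$$

I The first claim holds as

$$a \le |a| < \varepsilon \Rightarrow a < \varepsilon, \qquad -a \le |a| < \varepsilon \Rightarrow -a < \varepsilon.$$

I By $-a < \varepsilon \Leftrightarrow -\varepsilon < a$, and by combining the inequalities $a < \varepsilon$ and $-\varepsilon < a$, the second claim ($-\varepsilon < a < \varepsilon$) follows. The converse holds as well, since $|a|$ equals either $a$ or $-a$.

I Next, as $h$ is positive, it holds that

$$\left|\frac{X_i - x}{h}\right| < 1 \;\Leftrightarrow\; |X_i - x| < h.$$

I By the general inequality just derived (with $a = X_i - x$ and $\varepsilon = h$), it holds that

$$|X_i - x| < h \;\Leftrightarrow\; -h < X_i - x < h \;\Leftrightarrow\; x - h < X_i < x + h.$$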


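I It follows therefore that

$$k\left(\frac{X_i - x}{h}\right) = \begin{cases} 1/2 & \text{if } |X_i - x| \le h, \\ 0 & \text{otherwise,} \end{cases} \;=\; \frac{1}{2}\, 1\{X_i \in (x-h, x+h)\},$$

where the last equality ignores the boundary points $X_i = x \pm h$, which occur with probability zero for a continuous $X$.

I Now, recall that for some event $A$,

$$\Pr(A) = E[1\{A\}],$$

where the expectation can be estimated by

$$\widehat{E}[1\{A\}] = \frac{1}{n} \sum_{i=1}^{n} 1\{A_i\}.$$

I Using these results, the kernel density estimator of $f_X(x)$ is given by

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{X_i - x}{h}\right).$$

The factor $h^{-1}$ originates from the definition of $f(x)$ as the derivative of the cdf $F(x)$.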
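The kernel density estimator with the uniform kernel can be written down almost literally. The following Python sketch is an illustration only; the function names `uniform_kernel` and `kde` and the four-point toy sample are assumptions made here, not part of the slides.

```python
import numpy as np

def uniform_kernel(z):
    """k(z) = 1/2 if |z| <= 1 and 0 otherwise."""
    return 0.5 * (np.abs(z) <= 1.0)

def kde(x_grid, sample, h, kernel=uniform_kernel):
    """f_hat(x) = 1/(n h) * sum_i k((X_i - x) / h), evaluated at each grid point."""
    x_grid = np.atleast_1d(np.asarray(x_grid, dtype=float))
    sample = np.asarray(sample, dtype=float)
    z = (sample[None, :] - x_grid[:, None]) / h   # one row per evaluation point x
    return kernel(z).sum(axis=1) / (sample.size * h)

sample = np.array([0.0, 0.5, 1.0, 3.0])           # toy data, for illustration only
f_hat = kde([0.5, 3.0], sample, h=1.0)
# At x = 0.5 three observations lie within distance h, so f_hat = 3 * 0.5 / 4 = 0.375.
# At x = 3.0 only one observation does, so f_hat = 0.5 / 4 = 0.125.
```

Each call returns local estimates: the value at a point x uses only the observations within distance h of x.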


I The parameter $h$ is called the bandwidth (or smoothing parameter or window width) of the kernel density estimator.

I The bandwidth is the most important parameter of the kernel density estimation problem.

I Note that $\hat{f}(x)$ is an estimate of the density of $X$ at some given point $x$, that is, it is a local estimate.

I To obtain an estimate of $f$ over the whole support of $X$, estimates at several points $x \in \mathrm{supp}(X)$ are necessary.

I The kernel used above is called the uniform kernel.

I Different kernel functions may be used, as long as they have certain properties (which are listed later).

I Examples of kernel functions are the standard normal density, i.e., $k(z) = \phi(z)$, or the Epanechnikov kernel, which is defined as

$$k(z) = \begin{cases} \frac{3}{4}(1 - z^2) & \text{if } |z| \le 1, \\ 0 & \text{otherwise.} \end{cases}$$

I The choice of the kernel function is of minor importance in practice. The choice of $h$ has a much larger impact on the estimation results.


The graphs show kernel density estimates of the same variable as in the previous example. The top left panel shows the estimate using the default bandwidth of the kdensity command. The graph in the top right panel adds a normal density function (red line) to the default kernel density estimate. The bottom left and right panels use a smaller and a larger bandwidth, respectively (compared to the default value of the kdensity command).


I Consider now the following general properties which have to hold for a kernel function.

I A kernel function $k$ integrates to one, i.e.,

$$\int k(v)\,dv = 1,$$

is symmetric,

$$k(v) = k(-v),$$

and has a finite second moment, i.e.,

$$\int v^2 k(v)\,dv = \kappa_2 < \infty.$$

I Note that by the symmetry condition it holds that

$$\int v\,k(v)\,dv = 0.$$

I Every kernel function with these properties can be used for a kernel-based local estimation method.
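These properties can be checked numerically for the three kernels mentioned so far. The sketch below uses a simple trapezoidal rule on a fine grid; the helper names `integrate` and `kernel_constants` and the grid settings are assumptions chosen for illustration. It also computes the constant $\kappa = \int k^2(v)\,dv$, which appears in the variance of the estimator.

```python
import numpy as np

def uniform(v):
    return 0.5 * (np.abs(v) <= 1.0)

def epanechnikov(v):
    return 0.75 * (1.0 - v**2) * (np.abs(v) <= 1.0)

def gaussian(v):
    return np.exp(-0.5 * v**2) / np.sqrt(2.0 * np.pi)

def integrate(values, grid):
    """Trapezoidal rule for function values given on a grid."""
    return np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid))

def kernel_constants(kernel, lo=-10.0, hi=10.0, m=200001):
    """Numerical approximations of the integrals that characterize a kernel."""
    v = np.linspace(lo, hi, m)
    k = kernel(v)
    return {
        "mass": integrate(k, v),              # should be 1
        "first_moment": integrate(v * k, v),  # should be 0 by symmetry
        "kappa2": integrate(v**2 * k, v),     # finite second moment
        "kappa": integrate(k**2, v),          # integral of the squared kernel
    }

constants = {name: kernel_constants(fn)
             for name, fn in [("uniform", uniform),
                              ("epanechnikov", epanechnikov),
                              ("gaussian", gaussian)]}
```

Analytically, $\kappa_2$ equals 1/3 for the uniform, 1/5 for the Epanechnikov, and 1 for the Gaussian kernel; the numerical values agree up to the discretization error.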


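I To proceed, consider first the definition of the mean squared error.

I The mean squared error $\mathrm{MSE}(\hat\theta)$ of an estimator $\hat\theta$ of $\theta$ is defined as

$$\begin{aligned}
\mathrm{MSE}(\hat\theta) &= E\big[(\hat\theta - \theta)^2\big] \\
&= E\big[(\hat\theta - E[\hat\theta] + E[\hat\theta] - \theta)^2\big] \\
&= E\big[(\hat\theta - E[\hat\theta])^2 + 2(\hat\theta - E[\hat\theta])(E[\hat\theta] - \theta) + (E[\hat\theta] - \theta)^2\big] \\
&= E\big[(\hat\theta - E[\hat\theta])^2\big] + (E[\hat\theta] - \theta)^2,
\end{aligned}$$

as it holds for the cross-product term that

$$E\big[(\hat\theta - E[\hat\theta])(E[\hat\theta] - \theta)\big] = (E[\hat\theta] - \theta)\, E\big[\hat\theta - E[\hat\theta]\big] = 0.$$

I The MSE can be rewritten as

$$\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta) + \big(\mathrm{bias}(\hat\theta)\big)^2,$$

where the bias term is defined by $\mathrm{bias}(\hat\theta) = E[\hat\theta] - \theta$.

I Convergence of the MSE to zero implies convergence in probability (by Markov's inequality).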
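The decomposition of the MSE into variance and squared bias can be illustrated by a small Monte Carlo experiment. The setup below is an assumption made purely for illustration: the estimator is the sample variance that divides by n (which is biased), the data are standard normal, and the expectations are replaced by averages over simulated samples.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 1.0                       # true variance of a standard normal variable
n, reps = 20, 5000                # sample size and number of simulated samples

samples = rng.normal(size=(reps, n))
estimates = samples.var(axis=1)   # divides by n, hence biased downwards

mse = np.mean((estimates - theta) ** 2)   # simulated E[(theta_hat - theta)^2]
variance = np.var(estimates)              # simulated Var(theta_hat)
bias = np.mean(estimates) - theta         # simulated E[theta_hat] - theta

# The identity MSE = Var + bias^2 holds exactly for these simulated moments.
decomposition_gap = mse - (variance + bias ** 2)
```

The simulated bias is negative, in line with the known bias $-\theta/n$ of this estimator.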


I To derive asymptotic properties of the kernel density estimator, the following assumptions are used.

I The data $\{X_i\}_{i=1}^n$ are an iid sample. This is a standard assumption.

I The true density $f(x)$ is three times differentiable, where the derivatives are denoted by $f^{(s)}(x)$, with $s \in \{1, 2, 3\}$. The derivatives of the density are used for a Taylor series expansion.

I The kernel function satisfies the assumptions stated previously. By these properties, some elements of the Taylor series expansion are equal to zero.

I The point $x$ at which the density is estimated is an interior point of the support of $X$. For a point at the boundary of the support of $X$, certain terms of the expansion are not equal to zero, which leads to a larger bias.

I For $n \to \infty$, $h \to 0$ and $nh \to \infty$. This assumption means that the bandwidth $h$ converges to zero more slowly than $1/n$. As $n \to \infty$, observations from a smaller neighbourhood of $x$ are used for estimation (local approach).


I To describe the results, the order notation is useful.

I In the following, let $a_n$ and $b_n$ denote some sequences, and consider the case of $n \to \infty$.

I Boundedness can be described by the $O(\cdot)$-notation:

$$a_n = O(1) \;\Leftrightarrow\; |a_n| < C$$

for some positive constant $C$ and all sufficiently large $n$. More generally,

$$a_n = O(b_n) \;\Leftrightarrow\; \left|\frac{a_n}{b_n}\right| < C.$$

I Convergence to zero is denoted by $a_n = o(b_n) \Leftrightarrow |a_n/b_n| \to 0$. A special case is $b_n \equiv 1$, i.e., $a_n = o(1) \Leftrightarrow a_n \to 0$.

I This notation is used to abbreviate terms which converge faster to zero or to some constant than the other remaining terms.

I For random variables $X_n$, $X_n = O_p(1)$ means that for every $\varepsilon > 0$ there is a constant $M$ such that $\Pr(|X_n| > M) \le \varepsilon$ for all sufficiently large $n$. $X_n \xrightarrow{p} 0$ is denoted by $X_n = o_p(1)$.

I $O_p(Y_n)$ and $o_p(Y_n)$ are defined analogously to $O(b_n)$ and $o(b_n)$.


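I Now, consider the MSE of the kernel density estimator at point $x$, which is given by

$$\mathrm{MSE}\big(\hat f(x)\big) = \mathrm{Var}\big(\hat f(x)\big) + \big(\mathrm{bias}(\hat f(x))\big)^2.$$

I First, the bias term is analyzed, which is equal to

$$E\left[\frac{1}{nh}\sum_{i=1}^{n} k\left(\frac{X_i - x}{h}\right)\right] - f(x) = \frac{1}{h} E\left[k\left(\frac{X_1 - x}{h}\right)\right] - f(x).$$

I This equality follows by the identical distribution of the sample (so that $E[k((X_i - x)/h)] = E[k((X_1 - x)/h)]$ for all $i$):

$$\begin{aligned}
E\left[\frac{1}{nh}\sum_{i=1}^{n} k\left(\frac{X_i - x}{h}\right)\right]
&= \frac{1}{nh}\sum_{i=1}^{n} E\left[k\left(\frac{X_i - x}{h}\right)\right]
= \frac{1}{nh}\sum_{i=1}^{n} E\left[k\left(\frac{X_1 - x}{h}\right)\right] \\
&= \frac{1}{nh}\, n\, E\left[k\left(\frac{X_1 - x}{h}\right)\right]
= \frac{1}{h} E\left[k\left(\frac{X_1 - x}{h}\right)\right].
\end{aligned}$$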


I Next, by the definition of the expectation, the bias can be written as

$$\frac{1}{h} E\left[k\left(\frac{X_1 - x}{h}\right)\right] - f(x) = \frac{1}{h}\int k\left(\frac{x_1 - x}{h}\right) f(x_1)\,dx_1 - f(x).$$

I Consider now the following change of the integration variable:

$$\frac{x_1 - x}{h} = v \;\Leftrightarrow\; x_1 = x + vh.$$

I Using this, the bias can be rewritten as follows (see the next slide for a derivation):

$$\frac{1}{h}\int k(v) f(x + vh)\,h\,dv - f(x) = \int k(v) f(x + vh)\,dv - f(x).$$

I Note that by this reformulation, the properties of the kernel function can be used to derive the properties of the bias term.

I Furthermore, a suitable Taylor expansion of $f(x + vh)$ leads to an expression which is interpretable using the basic assumptions stated above.


I Consider now the following equality just stated:

$$\frac{1}{h}\int k\left(\frac{x_1 - x}{h}\right) f(x_1)\,dx_1 = \frac{1}{h}\int k(v) f(x + vh)\,h\,dv.$$

I To see this, the formula for integration by substitution is used: For suitable functions $w$ and $g$ it holds that

$$\int_{g(a)}^{g(b)} w(x)\,dx = \int_{a}^{b} w(g(t))\,g'(t)\,dt.$$

I For the problem at hand, $w(x_1) = k((x_1 - x)/h) f(x_1)$ and $g(v) = x + vh$ ($= x_1$).

I Let $\bar{I} = \infty$ and $\underline{I} = -\infty$ be the upper and lower bounds of the integral. Therefore, $\bar{I} = g(b)$ and hence $b = g^{-1}(\bar{I}) = (\bar{I} - x)/h$.

I For $\bar{I} = \infty$ (and similarly for $\underline{I} = -\infty$), $b = g^{-1}(\bar{I}) = \infty$. For finite $\bar{I}$ and/or $\underline{I}$, the boundaries may change due to the substitution.

I Using $g'(v) = h$, the equality claimed above follows.


I A Taylor expansion of an $m$-times differentiable function $g(x)$ at some point $x_0$ is given by

$$g(x) = g(x_0) + g^{(1)}(x_0)(x - x_0) + \frac{1}{2!} g^{(2)}(x_0)(x - x_0)^2 + \dots + \frac{1}{m!} g^{(m)}(\xi)(x - x_0)^m,$$

where $g^{(j)}(x_0) = \partial^j g(x)/(\partial x)^j |_{x = x_0}$ and $\xi$ lies between $x$ and $x_0$.

I Now, consider a Taylor expansion of the density $f(x + vh)$ at the point $x$:

$$\begin{aligned}
f(x + vh) &= f(x) + f^{(1)}(x)(x + vh - x) + \frac{1}{2} f^{(2)}(x)(x + vh - x)^2 + \dots \\
&= f(x) + f^{(1)}(x)\,vh + \frac{1}{2} f^{(2)}(x)\,v^2 h^2 + \dots
\end{aligned}$$

I For the next term in the expansion it holds that

$$\frac{1}{3!} f^{(3)}(x)\,v^3 h^3 = O(h^3),$$

as $f^{(3)}(x)$ and $v^3$ are bounded (by assumption).


I Now insert

$$f(x + vh) = f(x) + f^{(1)}(x)\,vh + \frac{1}{2} f^{(2)}(x)\,v^2 h^2 + O(h^3)$$

into the bias expression:

$$\int k(v) f(x + vh)\,dv - f(x) = \int k(v)\left(f(x) + f^{(1)}(x)\,vh + \frac{1}{2} f^{(2)}(x)\,v^2 h^2 + O(h^3)\right)dv - f(x).$$

I By using this expansion, the term $f(x + vh)$ can be expressed by terms in which $f$ and its derivatives are evaluated at $x$ alone, which enables the computation of the integral with respect to $v$.

I The bias is therefore equal to

$$\int k(v) f(x + vh)\,dv - f(x) = f(x)\int k(v)\,dv + f^{(1)}(x)\,h\int v\,k(v)\,dv + \frac{1}{2} f^{(2)}(x)\,h^2\int v^2 k(v)\,dv + O(h^3) - f(x).$$


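I As it holds by the properties of a kernel function that

$$\int k(v)\,dv = 1 \quad\text{and}\quad \int v\,k(v)\,dv = 0,$$

the bias is equal to

$$\mathrm{bias}\big(\hat f(x)\big) = \frac{h^2}{2} f^{(2)}(x)\int v^2 k(v)\,dv + O(h^3).$$

I A similar expression can be derived for the variance of the kernel density estimator at point $x$.

I Combining these results, the MSE of the kernel density estimator is given by

$$\mathrm{MSE}\big(\hat f(x)\big) = \frac{h^4}{4}\big(\kappa_2 f^{(2)}(x)\big)^2 + \frac{\kappa f(x)}{nh} + o\left(h^4 + \frac{1}{nh}\right),$$

where

$$\kappa = \int k^2(v)\,dv \quad\text{and}\quad \kappa_2 = \int v^2 k(v)\,dv.$$

I From this result, pointwise consistency of the kernel density estimator follows (i.e., for a given point $x$): under $h \to 0$ and $nh \to \infty$, the MSE converges to zero.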
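The variance term of the MSE is stated above without derivation. A brief sketch of the argument, using the iid assumption, the same change of variables $v = (x_1 - x)/h$ as for the bias, and $\kappa = \int k^2(v)\,dv$, could run as follows (the intermediate steps are filled in here and are not taken from the slides):

```latex
\begin{aligned}
\mathrm{Var}\big(\hat f(x)\big)
&= \frac{1}{n h^2}\,\mathrm{Var}\!\left(k\!\left(\frac{X_1 - x}{h}\right)\right)
 = \frac{1}{n h^2}\left\{ E\!\left[k^2\!\left(\frac{X_1 - x}{h}\right)\right]
   - \left(E\!\left[k\!\left(\frac{X_1 - x}{h}\right)\right]\right)^{2} \right\} \\
&= \frac{1}{n h}\int k^2(v)\, f(x + vh)\,dv
   - \frac{1}{n}\left(\int k(v)\, f(x + vh)\,dv\right)^{2} \\
&= \frac{\kappa f(x)}{n h} + o\!\left(\frac{1}{n h}\right).
\end{aligned}
```

The second term is of order $1/n$ and is therefore negligible relative to $1/(nh)$ as $h \to 0$.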


I To compute the kernel density estimator, a value of $h$ is needed.

I It can be shown that the leading terms of the MSE of $\hat f(x)$ are minimized by the following bandwidth (obtained by setting the derivative with respect to $h$ to zero):

$$h_{opt} = \left(\frac{\kappa f(x)}{\big(\kappa_2 f^{(2)}(x)\big)^2}\right)^{1/5} n^{-1/5}.$$

I This expression depends on $\kappa$ and $\kappa_2$, which can be computed from the kernel function used, and on the true density function $f(x)$ and its second derivative, which are unknown.

I One simple solution to this problem is to use a pilot bandwidth $h_{pilot}$ to estimate $f(x)$ and $f^{(2)}(x)$ nonparametrically. Of course, this bandwidth also needs to be chosen.

I A solution to this is to assume (for deriving a pilot bandwidth) that $X$ is normally distributed with variance $\sigma^2$. Then one can derive $h_{pilot} \approx 1.06\,\sigma n^{-1/5}$ (Silverman's rule of thumb).

I This expression can be used to estimate the expressions needed to derive $h_{opt}$. Often $h_{pilot}$ is used directly as $h_{opt}$.

I Various other methods for bandwidth choice exist.
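The rule-of-thumb bandwidth is simple to compute. The sketch below is an illustration under assumptions: the function name `silverman_bandwidth` and the simulated sample are hypothetical, $\sigma$ is replaced by the sample standard deviation, and the constant 1.06 is the one usually derived for a Gaussian kernel.

```python
import numpy as np

def silverman_bandwidth(sample):
    """Rule-of-thumb pilot bandwidth h = 1.06 * sigma_hat * n^(-1/5)."""
    sample = np.asarray(sample, dtype=float)
    sigma_hat = sample.std(ddof=1)            # sample standard deviation
    return 1.06 * sigma_hat * sample.size ** (-1.0 / 5.0)

rng = np.random.default_rng(2)                # simulated data, for illustration only
sample = rng.normal(loc=0.0, scale=2.0, size=1000)
h_pilot = silverman_bandwidth(sample)
```

With $\sigma = 2$ and $n = 1000$, the bandwidth is roughly $1.06 \cdot 2 \cdot 1000^{-1/5} \approx 0.53$, up to the sampling error in the estimated standard deviation.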


I Kernel density estimators can also be used to estimate multivariate densities.

I Given a sample $\{X_i\}_{i=1}^n$, where $X_i \in R^q$ and $q > 1$, the density $f(x) = f(x_1, x_2, \dots, x_q)$ can be estimated by

$$\hat f(x) = \frac{1}{n h_1 \cdots h_q} \sum_{i=1}^n K\left(\frac{X_i - x}{h}\right) = \frac{1}{n h_1 \cdots h_q} \sum_{i=1}^n k\left(\frac{X_{i1} - x_1}{h_1}\right) \times \dots \times k\left(\frac{X_{iq} - x_q}{h_q}\right),$$

where $K(\cdot)$ is called a product kernel, and the functions $k(\cdot)$ are univariate kernels (as used previously).

I Under some assumptions, it can be shown that

$$\sqrt{n h_1 \cdots h_q}\left(\hat f(x) - f(x) - \frac{\kappa_2}{2}\sum_{s=1}^q h_s^2 f_{ss}(x)\right) \xrightarrow{d} N\big(0, \kappa^q f(x)\big).$$

Here, $f_{ss}$ denotes the second derivative of the density $f$ with respect to its $s$-th argument.

I Note that an asymptotic bias term occurs here.


I $\hat f(x)$ is (for univariate and multivariate estimators alike) a consistent estimator of $f(x)$.

I From the expression of the asymptotic distribution it follows that the speed of convergence of nonparametric kernel density estimators is much lower than for parametric estimators, and decreases with an increasing number of variables, i.e., with increasing $q$.

I To understand the concept of convergence speed, consider some parametric estimator $\hat\theta$, for which $\hat\theta \xrightarrow{p} \theta_0$ (that is, $\hat\theta - \theta_0 = o_p(1)$).

I Multiplying this difference by $\sqrt{n}$ leads to an expression which tends neither to infinity nor to zero, but to a random variable. $\sqrt{n}$ is called the speed of convergence of the estimator.

I As can be seen from the expression on the previous slide, the kernel density estimator has a much slower speed, as $\sqrt{n h_1 \cdots h_q}$ (or $\sqrt{nh}$ in the univariate case) grows more slowly than $\sqrt{n}$, because the bandwidths converge to zero.

I In practice, this means that nonparametric methods should only be applied if large samples are available.
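To make the loss in convergence speed concrete, assume (as an illustration, not a statement from the slides) that all bandwidths are of the same order and balance the squared bias, of order $h^4$, against the variance, of order $1/(n h^q)$. This gives $h_s \propto n^{-1/(4+q)}$, which for $q = 1$ is the rate $n^{-1/5}$ seen above, and hence:

```latex
\sqrt{n h_1 \cdots h_q} \;\propto\; \sqrt{n \cdot n^{-q/(4+q)}} \;=\; n^{2/(4+q)},
\qquad\text{e.g.}\quad
q = 1:\; n^{2/5}, \qquad
q = 2:\; n^{1/3}, \qquad
q = 4:\; n^{1/4},
\qquad\text{compared to } n^{1/2} \text{ in the parametric case.}
```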

Nonparametric Regression

I Consider the general regression model

$$y_i = g(X_i) + \varepsilon_i,$$

where $g(x) = E[y | X = x]$, and no assumptions on the form of $g(x)$ are imposed.

I The conditional expectation of $y$ is defined by

$$E[y | X = x] = \int y\, f_{Y|X}(y|x)\,dy = \int y\, \frac{f_{Y,X}(y, x)}{f_X(x)}\,dy = \frac{\int y\, f_{Y,X}(y, x)\,dy}{f_X(x)},$$

where the second equality follows by Bayes' law.

I The denominator $f_X(x)$ can be estimated by a nonparametric density estimator.

I Consider now the following kernel estimator of $f_{Y,X}(y, x)$:

$$\hat f_{Y,X}(y, x) = \frac{1}{n h_0 h_1 \cdots h_q} \sum_{i=1}^n K\left(\frac{X_i - x}{h}\right) k\left(\frac{y_i - y}{h_0}\right),$$

where $K(\cdot)$ is the product kernel for multivariate covariates defined earlier.


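I Consider now the numerator of the estimator of $g(x)$, i.e., consider $\int y\, f_{Y,X}(y, x)\,dy$ with $f_{Y,X}$ replaced by the estimator $\hat f_{Y,X}$ just defined:

$$\begin{aligned}
\int y\, \hat f_{Y,X}(y, x)\,dy &= \int y\, \frac{1}{n h_0 h_1 \cdots h_q} \sum_{i=1}^n K\left(\frac{X_i - x}{h}\right) k\left(\frac{y_i - y}{h_0}\right) dy \\
&= \frac{1}{n h_0 h_1 \cdots h_q} \sum_{i=1}^n K\left(\frac{X_i - x}{h}\right) \int y\, k\left(\frac{y_i - y}{h_0}\right) dy.
\end{aligned}$$

I By a change of variables ($(y_i - y)/h_0 = v \Leftrightarrow y = y_i - v h_0$), the right-hand side of this expression can be rewritten as

$$\frac{1}{n h_0 h_1 \cdots h_q} \sum_{i=1}^n K\left(\frac{X_i - x}{h}\right) \int (y_i - v h_0)\, k(v)\, h_0\,dv.$$

I As $\int k(v)\,dv = 1$ and $\int v\,k(v)\,dv = 0$, this simplifies to

$$\frac{1}{n h_1 \cdots h_q} \sum_{i=1}^n K\left(\frac{X_i - x}{h}\right) y_i.$$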


I Using this result, the estimator of $g(x)$ can be rewritten as:

$$\hat g(x) = \frac{\int y\, \hat f_{Y,X}(y, x)\,dy}{\hat f_X(x)} = \frac{\sum_{i=1}^n K\left(\frac{X_i - x}{h}\right) y_i}{\sum_{i=1}^n K\left(\frac{X_i - x}{h}\right)}.$$

I This estimator is called the local constant regression or Nadaraya-Watson estimator.

I Note that $\hat g(x)$ is a weighted average of $y$:

$$\hat g(x) = \sum_{i=1}^n \omega_i y_i,$$

where the weights $\omega_i$ are defined by

$$\omega_i = \frac{K\left(\frac{X_i - x}{h}\right)}{\sum_{j=1}^n K\left(\frac{X_j - x}{h}\right)},$$

and have the properties $\omega_i \ge 0$ and $\sum_{i=1}^n \omega_i = 1$.

I The nonparametric regression estimator can be viewed as a weighted average of $y$, where the weights depend on the distance of the covariates $X_i$ to the point $x$ at which $g(x)$ is estimated.
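The weighted-average form of the Nadaraya-Watson estimator translates directly into code. The following sketch covers the univariate case with a Gaussian kernel; the function names `nw_weights` and `nadaraya_watson` and the five-point toy data set are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def nw_weights(x0, X, h, kernel=gaussian_kernel):
    """Weights omega_i = K((X_i - x0)/h) / sum_j K((X_j - x0)/h)."""
    k = kernel((X - x0) / h)
    return k / k.sum()

def nadaraya_watson(x0, X, y, h, kernel=gaussian_kernel):
    """Local constant estimate g_hat(x0) = sum_i omega_i * y_i."""
    return np.sum(nw_weights(x0, X, h, kernel) * y)

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])       # toy data, for illustration only
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
w = nw_weights(2.0, X, h=1.0)
g_hat = nadaraya_watson(2.0, X, y, h=1.0)
```

The weights are nonnegative, sum to one, and are largest for the observation closest to the evaluation point; the estimate therefore always lies between the smallest and the largest observed value of y.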


I Under some assumptions, the following holds:

$$\sqrt{n h_1 \cdots h_q}\left(\hat g(x) - g(x) - \sum_{s=1}^q h_s^2 B_s(x)\right) \xrightarrow{d} N\left(0, \frac{\kappa^q \sigma^2(x)}{f(x)}\right).$$

I Again, an asymptotic bias term occurs.

I The bandwidth parameter(s) can be determined by several methods, for example by the plug-in approach or by cross-validation.

I $\hat g(x)$ can be expressed as the solution of an optimization problem:

$$\hat g(x) = \arg\min_{\mu} \sum_{i=1}^n (y_i - \mu)^2 K\left(\frac{X_i - x}{h}\right).$$

I That is, the nonparametric regression estimate at point $x$ is the constant of a weighted least squares regression without further covariates. The covariates appear only in the kernel function, i.e., they are only used for computing the weights $K(\cdot)$.

I Better asymptotic properties may be obtained by more general models, which will be discussed next.


I Consider as a first extension of the local constant regression model the local linear regression model.

I Here, a local linear instead of a local constant function is used to approximate the unknown $g(x)$.

I The local linear regression at a point $x$ is determined by the following optimization problem:

$$\hat\delta(x) = \begin{pmatrix} \hat a(x) \\ \hat b(x) \end{pmatrix} = \arg\min_{a, b} \sum_{i=1}^n \big(y_i - a - (X_i - x)'b\big)^2 K\left(\frac{X_i - x}{h}\right).$$

I To derive a closed-form expression of the estimators, define the following $[n \times n]$ and $[n \times (q+1)]$ dimensional matrices, where $q$ is the number of covariates:

$$K_x = \mathrm{diag}\left(K\left(\frac{X_1 - x}{h}\right), \dots, K\left(\frac{X_n - x}{h}\right)\right), \qquad X_x = \big(1, (X_i - x)'\big)_{i=1,\dots,n}.$$

I The optimization problem can be restated as

$$\hat\delta(x) = \arg\min_{\delta} (y - X_x \delta)' K_x (y - X_x \delta).$$


I The solution can be expressed explicitly as

$$\hat\delta(x) = (X_x' K_x X_x)^{-1} X_x' K_x y.$$

I The estimate $\hat g(x)$ of the mean function at point $x$ is equal to $\hat a(x)$.

I A further generalization is to approximate $g(x)$ locally by a polynomial of order $p$.

I For univariate $X$, the estimator of the local parameters $\delta(x) = (\delta_0(x), \dots, \delta_p(x))'$ is given by

$$\hat\delta(x) = \arg\min_{\delta} \sum_{i=1}^n \big(y_i - \delta_0 - (X_i - x)\delta_1 - \dots - (X_i - x)^p \delta_p\big)^2 k\left(\frac{X_i - x}{h}\right).$$

I In this case, the covariate matrix $X_x$ is defined as:

$$X_x = \begin{pmatrix}
1 & X_1 - x & (X_1 - x)^2 & \dots & (X_1 - x)^p \\
1 & X_2 - x & (X_2 - x)^2 & \dots & (X_2 - x)^p \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_n - x & (X_n - x)^2 & \dots & (X_n - x)^p
\end{pmatrix}.$$


I With $K_x$ as defined above, the parameter estimators are given by

$$\hat\delta(x) = (X_x' K_x X_x)^{-1} X_x' K_x y.$$

I Again, $\hat g(x)$ is equal to the (weighted) constant of the polynomial expression above, i.e., $\hat g(x) = \hat\delta_0(x)$, where $\hat\delta_0(x)$ is the first element of the vector $\hat\delta(x)$.

I The additional parameters $\hat\delta_s(x)$ for $s = 1, \dots, p$ are estimators of the scaled derivatives of the regression function, i.e., of

$$\delta_s(x) = \frac{1}{s!} \frac{\partial^s g(x)}{\partial x^s}.$$

I In fact, the polynomial expression used above corresponds to a Taylor approximation of $g(x)$, which explains the factor $\frac{1}{s!}$.

I When estimating the $s$th derivative of $g(x)$, $p - s$ should be odd to lower the bias of the estimate.

I The advantage of local polynomial regression models is their smaller bias compared to local constant regressions.

I Extensions for multivariate covariates exist, which have a somewhat involved notation.
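The closed-form solution $(X_x' K_x X_x)^{-1} X_x' K_x y$ can be computed as a weighted least squares problem at each evaluation point. The sketch below handles the univariate case; the function name `local_polynomial`, the Gaussian kernel, and the noise-free test functions are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def local_polynomial(x0, X, y, h, p=1, kernel=gaussian_kernel):
    """Return delta_hat(x0) for a local polynomial of order p (univariate X)."""
    d = X - x0
    Xx = np.vander(d, N=p + 1, increasing=True)   # columns 1, d, d^2, ..., d^p
    w = kernel(d / h)                             # diagonal elements of K_x
    XtK = Xx.T * w                                # X_x' K_x without an n x n matrix
    return np.linalg.solve(XtK @ Xx, XtK @ y)

X = np.linspace(0.0, 1.0, 50)

# A linear function is reproduced exactly by the local linear estimator:
# delta_0 = g(0.3) = 1.6 and delta_1 = g'(0.3) = 2.
delta_lin = local_polynomial(0.3, X, 1.0 + 2.0 * X, h=0.2, p=1)

# For g(x) = x^2 and p = 2: delta_0 = 0.25, delta_1 = 1, delta_2 = g''/2! = 1.
delta_quad = local_polynomial(0.5, X, X**2, h=0.2, p=2)
```

The first element of the returned vector is the estimate of the regression function, and element $s$ estimates the $s$th derivative divided by $s!$.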


I For all nonparametric methods presented here (kernel density and local regression estimators), it is assumed that the covariates are continuous.

I Therefore, indicator variables and discrete ordered variables cannot be used directly for nonparametric regression.

I One way to use multivariate nonparametric estimators in the presence of discrete variables is to divide the data set into cells defined by the values of the discrete variables and to compute the nonparametric estimators within each cell using the remaining continuous variables.

I As an example, consider the regression estimate for a dependent variable $y$ and two covariates $x_1$ and $x_2$. If $x_1$ and $x_2$ are continuous, $m(y | x_1, x_2)$ is a function with two (continuous) arguments.

I Assume now that $x_2$ is an indicator variable. The procedure above computes two regressions $m(y | x_1, x_2 = 1)$ and $m(y | x_1, x_2 = 0)$.

I The problem with this approach is that the number of observations in each cell can be quite low when there are several discrete variables.

I If the data set contains only discrete variables, the nonparametric regressions correspond to the cell means of $y$.

Semi- and Nonparametric Extensions

I The convergence rate of nonparametric regression estimators decreases with an increasing number of regressors (curse of dimensionality).

I A solution to this problem is to impose some further structure.

I One possibility for such restrictions are additive models, which model the conditional mean of the dependent variable as a sum of univariate unknown functions of the covariates.

I A second class of models which circumvent the curse of dimensionality are semiparametric models, which assume a parametric specification for a part of the model.

I Consider first the basic additive nonparametric model:

$$E[y | x] = g_1(x_1) + g_2(x_2) + \dots + g_k(x_k).$$

I Here, the rate of convergence is equal to that of a univariate nonparametric regression, which is faster than in the case of a general multivariate nonparametric model.

I The unknown functions $g_j(x_j)$ can be estimated by a backfitting algorithm, for example.

Semi- and Nonparametric Extensions

I A second possibility to circumvent the curse of dimensionality is to impose some parametric structure on a part of the model.

I One example of this model class is the partial linear model, which uses an unknown function $g(x_2)$ to specify the true model as

$$y = x_1'\beta + g(x_2) + \varepsilon.$$

I To estimate the model, consider first the expectation of $y$ given $x_2$ (using $E[\varepsilon | x_2] = 0$):

$$E[y | x_2] = E[x_1 | x_2]'\beta + g(x_2).$$

I Now, the following difference does not contain $g(x_2)$ and can be estimated by OLS to obtain an estimator $\hat\beta$ of $\beta$, after replacing the conditional expectations by nonparametric estimates:

$$y - E[y | x_2] = \big(x_1 - E[x_1 | x_2]\big)'\beta + \varepsilon.$$

I The nonparametric part can be estimated by

$$\hat g(x_2) = \hat E[y | x_2] - \hat E[x_1 | x_2]'\hat\beta.$$

I A further example is the single-index model, which uses a linear index of $X$ and an unknown function $g(\cdot)$ to specify the conditional mean as $E[y | x] = g(x'\beta)$.
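The estimation steps for the partial linear model can be sketched on simulated data. Everything specific in this example is an assumption made for illustration: the data-generating process (true $\beta = 2$ and $g(x_2) = \sin(2\pi x_2)$), the Nadaraya-Watson smoother with a Gaussian kernel for the conditional expectations, the ad hoc bandwidth, and the helper name `nw_fit`.

```python
import numpy as np

def nw_fit(x, target, h):
    """Nadaraya-Watson estimates of E[target | x] at all sample points."""
    z = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * z**2)                   # Gaussian kernel (constant cancels)
    return (k @ target) / k.sum(axis=1)

rng = np.random.default_rng(3)                # simulated data, for illustration only
n = 500
x2 = rng.uniform(0.0, 1.0, n)
x1 = x2**2 + rng.normal(size=n)
g_true = np.sin(2.0 * np.pi * x2)
y = 2.0 * x1 + g_true + 0.5 * rng.normal(size=n)

h = 0.05                                      # ad hoc bandwidth, not data-driven

# Step 1: nonparametric estimates of E[y | x2] and E[x1 | x2].
Ey = nw_fit(x2, y, h)
Ex1 = nw_fit(x2, x1, h)

# Step 2: OLS of y - E[y | x2] on x1 - E[x1 | x2] (scalar x1, no constant).
y_res, x1_res = y - Ey, x1 - Ex1
beta_hat = (x1_res @ y_res) / (x1_res @ x1_res)

# Step 3: nonparametric part g_hat(x2) = E_hat[y | x2] - E_hat[x1 | x2] * beta_hat.
g_hat = Ey - Ex1 * beta_hat
```

In this simulation `beta_hat` should be close to 2 and `g_hat` should track the sine function; in applications the bandwidth would have to be chosen by one of the methods mentioned earlier.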

Review and Definitions

I Parametric models depend on a finite-dimensional parameter, which is an element of $R^q$ for $q \in N$.

I Nonparametric models are based on one (or several) parameter(s) of infinite dimension, that is, the parameter(s) cannot be represented by a finite-dimensional value, i.e., by a scalar or a vector.

I The parameter of a nonparametric model is a function, like a conditional expectation (i.e., a regression) or a probability density.

I Nonparametric models weaken functional form assumptions (like the linearity assumption of the OLS model).

I Nonlinear models can also be estimated by OLS or by NLS, however. The difference from nonparametric models is that the structure of the nonlinearity is given by the parametric form, i.e., by assumption.

I A broad class of nonparametric models (all of those presented here) is based on a local estimation approach, where the estimation is based mainly on observations near the point of interest.

I There are also nonparametric models which are not based on a local approach (e.g., the Nelson-Aalen or Kaplan-Meier estimators).

Review and Definitions

I Semiparametric estimators depend on finite- as well as on infinite-dimensional parameters.

I Examples are the partial linear and single-index models presented previously.

I Two-step estimators with nonparametric first-step estimates can also be viewed as semiparametric estimators.

I The parametric part of these estimators is the (scalar) second-step estimator, which is usually an average of the nonparametric first-step estimates evaluated for all observations of the sample.

I Examples are nonparametric matching or reweighting evaluation estimators (see the following lecture on evaluation methods).

I These estimators are averages of nonparametrically estimated functions. Nevertheless, they are $\sqrt{n}$-consistent, as they have the structure of a U-statistic.

I For more on U-statistics, see Pagan and Ullah, Nonparametric Econometrics, Cambridge University Press 1999, p. 358, or Powell, Stock, and Stoker (Econometrica, 1989).

Summary

I Parametric specifications of regression equations and density functions may lead to inconsistent estimates.

I Nonparametric models circumvent the need for distributional or functional form assumptions.

I A large number of nonparametric methods are based on the local estimation approach.

I Density functions can be estimated by kernel density estimators.

I A basic local nonparametric regression estimator is the local constant (or Nadaraya-Watson) regression estimator.

I The local linear and local polynomial regression models are generalizations of the Nadaraya-Watson regression estimator and improve the asymptotic properties.

I Several semi- and nonparametric extensions of the basic models were proposed to address the curse of dimensionality.

Basic and Additional References

I Basic references:

Cameron and Trivedi (2005), ch. 9,

Cameron and Trivedi (2009), sec. 2.6.4-2.6.6.

I General textbooks on nonparametric methods:

Q. Li and J. S. Racine, Nonparametric Econometrics, Princeton University Press, 2007.

A. Pagan and A. Ullah, Nonparametric Econometrics, Cambridge University Press, 1999.

D. Ruppert, M. P. Wand, and R. J. Carroll, Semiparametric Regression, Cambridge University Press, 2003.

A. Yatchew, Semiparametric Regression for the Applied Econometrician, Cambridge University Press, 2003.

I Survey articles:

J. DiNardo and J. L. Tobias, “Nonparametric Density and Regression Estimation,” Journal of Economic Perspectives, vol. 15(4), Fall 2001, p. 11-28.

A. Yatchew, “Nonparametric Regression Techniques in Economics,” Journal of Economic Literature, vol. 36 (1998), p. 669-721.
