
Metrika (2012) 75:455–469. DOI 10.1007/s00184-010-0336-2

Testing parametric conditional distributions using the nonparametric smoothing method

Xu Zheng

Received: 20 March 2007 / Published online: 23 November 2010. © Springer-Verlag 2010

X. Zheng, School of Economics, Antai College of Economics and Management, Shanghai Jiao Tong University, 200052 Shanghai, China. E-mail: [email protected]

Acknowledgment: I thank the two anonymous referees for helpful comments.

Abstract This paper proposes a new goodness-of-fit test for parametric conditional probability distributions using the nonparametric smoothing methodology. An asymptotic normal distribution is established for the test statistic under the null hypothesis of correct specification of the parametric distribution. The test is shown to have power against local alternatives converging to the null at certain rates. The test can be applied to testing for possible misspecifications in a wide variety of parametric models. A bootstrap procedure is provided for obtaining more accurate critical values for the test. Monte Carlo simulations show that the test has good power against some common alternatives.

Keywords Goodness-of-fit test · Parametric conditional distribution · Nonparametric smoothing method · Local alternatives · Bootstrap · Monte Carlo simulations

1 Introduction

A fundamental assumption underlying maximum likelihood based estimation and testing procedures is that the probability distributions of random variables belong to parametric families of known distributions. Misspecification of the parametric distributions may lead to inconsistent estimators and invalid inference.

Tests of parametric distribution functions based on nonparametric smoothing techniques have been investigated by Bickel and Rosenblatt (1973), Rosenblatt (1975), Eubank et al. (1987), Ghosh and Huang (1991), Eubank and LaRiccia (1992), Eubank et al. (1993), Zheng (2000), Zheng (2008), Bickel et al. (2006), and others (see Hart 1997 for a survey of related work). The work in the literature has focused on testing for unconditional distributions of continuous variables. However, in many parametric models, such as models of linear regression, categorical or discrete choice, and censored regression, the conditional distribution function of the response variable, given a vector of covariates, is assumed to belong to a parametric family of known distributions, while the marginal distribution of the covariates is often unspecified. In these models the conditional distributions can be continuous, discrete, or mixed. Andrews (1997) and Bai (2003) extend the Kolmogorov–Smirnov test to conditional distributions for i.i.d. data and time series data, respectively; their tests are thus not based on nonparametric smoothing methods. The main difference between a test based on the nonparametric smoothing method, such as the test proposed in this paper, and a Kolmogorov–Smirnov type of test is that the latter can have power against n^{−1/2} local alternatives, while the former cannot. However, the test of Andrews (1997) has a quite complicated null distribution that depends on unknown nuisance parameters. Thus the asymptotic critical values for the test of Andrews (1997) cannot be tabulated, and a parametric bootstrap procedure is needed to obtain critical values.

This paper proposes a new test for goodness-of-fit of parametric conditional distributions based on the nonparametric smoothing method. The test allows for cases where the conditional distribution can be discrete, continuous, or mixed. The test is based on examining the orthogonality conditions satisfied by the difference between the nonparametric conditional distribution and the null conditional distribution, not on their squared difference as in most other nonparametric smoothing based goodness-of-fit tests, such as that of Eubank and LaRiccia (1992). Our statistic is easier to calculate and analyze because it is already centered at mean zero under the null; tests based on the squared difference need to subtract the asymptotic mean from the test statistic and also need to estimate that mean. Though it is possible to extend the test of Aït-Sahalia et al. (2001) to the conditional distribution case, its present form applies only to testing the regression function and is based on the squared difference between the nonparametric and parametric regressions. The tests of Zheng (2000) and Zheng (2008) are based on the Kullback–Leibler information function and apply only to continuous distributions and discrete conditional distributions, respectively. The test proposed in this paper also permits multivariate conditional distributions. It can be applied to testing a wide variety of parametric models, including linear regression models, discrete choice models, and censored regression models.

The plan of this paper is as follows. Section 2 provides the motivation for the test and derives the test statistic. Section 3 establishes the asymptotic null distribution and the power of the test. Section 4 provides some Monte Carlo simulations for the test. Section 5 concludes the paper.

2 The test statistic

Let Q(y, x) be the joint probability distribution function of a random vector (Y, X) defined on some probability space (Ω, F, P), where (Y, X) takes values in R^m × R^d. We have a random sample {(y_i, x_i)}_{i=1}^n from Q(y, x). Denote the conditional distribution function of Y given X by G(y|x). Let f_1(x) be the marginal density of X with respect to Lebesgue measure on R^d. Denote the marginal distribution functions of X and Y by F_1(x) and F_2(y), respectively.

In many parametric models, the marginal distribution of X is often unspecified, but the conditional distribution function G(y|x) of Y given X is assumed to belong to a parametric family of known probability functions F(y|x, θ) on R^m × R^d × Θ, where Θ ⊂ R^q. Let f(y|x, θ) be the conditional density of Y given X with respect to some σ-finite measure μ under the null hypothesis. Since μ is not restricted to be Lebesgue measure, the null hypothesis allows cases where the dependent variable Y can be discrete, continuous, or mixed.

The null hypothesis to be tested is that the data are generated by the parametric model:

  H_0 : G(y|x) = F(y|x, θ_0) for some θ_0 ∈ Θ.  (2.1)

The alternative consists of all possible departures from the null model. A test of hypothesis (2.1) may be based on the following Cramér–von Mises type of distance measure between G(y|x) and F(y|x, θ):

  D_1 = ∫∫ [G(y|x) − F(y|x, θ)]² dF_1(x) dF_2(y)
      = ∫ E[G(y|X) − F(y|X, θ)]² dF_2(y).  (2.2)

Denote by 1(·) the indicator function, i.e., 1(A) = 1 if event A occurs and 1(A) = 0 otherwise. Noting that G(y|X) − F(y|X, θ) = E{[1(Y ≤ y) − F(y|X, θ)] | X}, D_1 can be written as

  D_1 = ∫ E{[1(Y ≤ y) − F(y|X, θ)] [G(y|X) − F(y|X, θ)]} dF_2(y)
      = ∫∫∫ [1(z ≤ y) − F(y|x, θ)] [G(y|x) − F(y|x, θ)] dQ(z, x) dF_2(y).  (2.3)

By writing D_1 in the form of (2.3), our test is based on the orthogonality between [G(y|x) − F(y|x, θ)] and [1(z ≤ y) − F(y|x, θ)], not on [G(y|x) − F(y|x, θ)]² as in Cramér–von Mises type tests, such as the test of Eubank and LaRiccia (1992). Though we expect Cramér–von Mises type tests would yield similar asymptotic distributions, our test is easier to analyze as it is already centered at mean zero under the null by construction.

We estimate G(y|x) = E[1(Y ≤ y)|X = x] by the kernel smoothing method,

  Ĝ(y|x) = [ Σ_{j=1}^n K((x − x_j)/h) 1(y_j ≤ y) ] / [ Σ_{j=1}^n K((x − x_j)/h) ],  (2.4)

where K(·) is a kernel function and h is a bandwidth. We also smooth the parametric distribution function F(y|x, θ) by the kernel method:

  F̃(y|x, θ) = [ Σ_{j=1}^n K((x − x_j)/h) F(y|x_j, θ) ] / [ Σ_{j=1}^n K((x − x_j)/h) ].  (2.5)

The kernel density estimator f̂_1(x) = Σ_{j=1}^n K((x − x_j)/h)/(nh^d) of f_1(x), which appears in the denominators of (2.4) and (2.5), may cause the test statistic to be ill-behaved. To solve this problem, as in Powell et al. (1989), we weight D_1 by the density function of X to get

  D_2 = ∫∫∫ [1(z ≤ y) − F(y|x, θ)] [G(y|x) − F(y|x, θ)] f_1(x) dQ(z, x) dF_2(y).  (2.6)

θ_0 is estimated by the maximum likelihood estimator θ̂. Q(z, x) and F_2(y) are estimated by the empirical distribution functions Q_n(z, x) and F_{2n}(y), respectively. Substituting these estimators into D_2, we obtain the approximation of D_2,

  D_{2n} = ∫∫∫ [1(z ≤ y) − F(y|x, θ̂)] [Ĝ(y|x) − F̃(y|x, θ̂)] f̂_1(x) dQ_n(z, x) dF_{2n}(y)
         = (1/n³) Σ_{k=1}^n Σ_{i=1}^n Σ_{j=1}^n (1/h^d) K((x_i − x_j)/h) [1(y_i ≤ y_k) − F(y_k|x_i, θ̂)] [1(y_j ≤ y_k) − F(y_k|x_j, θ̂)].  (2.7)

In order to easily obtain the asymptotic distribution of D_{2n}, we drop the negligible terms with i = j to obtain the statistic on which the final test statistic will be based:

  D_n = [1/(n²(n − 1))] Σ_{k=1}^n Σ_{i=1}^n Σ_{j≠i} (1/h^d) K((x_i − x_j)/h) [1(y_i ≤ y_k) − F(y_k|x_i, θ̂)] [1(y_j ≤ y_k) − F(y_k|x_j, θ̂)].  (2.8)
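A direct O(n³) translation of (2.8), reusing the imports and kernel helper above, can be written with a single einsum over the three indices; `F(y_grid, x_i, theta)` is again the assumed interface for the fitted null CDF F(·|x_i, θ̂).

```python
def D_n(y, x, h, F, theta_hat, kernel=epanechnikov_product):
    """Statistic D_n of (2.8); y: (n,), x: (n, d), scalar bandwidth h."""
    n, d = x.shape
    # K((x_i - x_j)/h) / h^d with the diagonal (j = i) removed
    Kmat = kernel((x[:, None, :] - x[None, :, :]) / h) / h**d    # (n, n)
    np.fill_diagonal(Kmat, 0.0)
    # e[i, k] = 1(y_i <= y_k) - F(y_k | x_i, theta_hat)
    Fik = np.stack([F(y, x_i, theta_hat) for x_i in x], axis=0)  # (n, n)
    e = (y[:, None] <= y[None, :]).astype(float) - Fik
    # sum over k, i, j != i of K_ij * e_ik * e_jk
    total = np.einsum('ij,ik,jk->', Kmat, e, e)
    return total / (n**2 * (n - 1))
```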

For a matrix A, let ‖A‖ denote its Euclidean norm, i.e., ‖A‖ = [tr(AA′)]^{1/2}. We impose the following assumptions.

Assumption 1 {(y_1, x_1), (y_2, x_2), …, (y_n, x_n)} is a random sample from a probability distribution Q(y, x) on R^m × R^d. The density function f_1(x) of x_i and its first-order derivatives are uniformly bounded. The conditional distribution function G(y|x) is differentiable with respect to x for every y, and sup_{y∈R^m} ∫ ‖∂G(y|x)/∂x‖ f_1(x) dx < ∞.

Assumption 2 For every y, the parametric conditional distribution function F(y|x, θ) is differentiable with respect to x and twice differentiable with respect to θ on R^m × R^d × Θ, where the parameter space Θ is a compact and convex subset of R^q. Moreover,

  sup_{θ∈Θ} sup_{y∈R^m} ∫ ‖∂F(y|x, θ)/∂x‖ f_1(x) dx < ∞,
  sup_{θ∈Θ} sup_{y∈R^m} ∫ ‖∂F(y|x, θ)/∂θ‖ f_1(x) dx < ∞,
  sup_{θ∈Θ} sup_{y∈R^m} ∫ ‖∂²F(y|x, θ)/∂θ∂θ′‖ f_1(x) dx < ∞.

Assumption 3 There exists an estimator θ̂ of θ_0 such that √n(θ̂ − θ_0) = O_p(1) under the null, where θ_0 is an interior point of Θ.

Assumption 4 K(u) is a nonnegative, bounded, continuous, and symmetric function on R^d such that ∫ K(u) du = 1 and ∫ ‖u‖² K(u) du < ∞.

Note that most joint probability distribution functions used in practice satisfy Assumptions 1 and 2. Assumption 3 is satisfied by the maximum likelihood estimator. Most kernel functions in use satisfy Assumption 4.

3 The asymptotic behavior of the test statistic

The following theorem provides the asymptotic distribution of the statistic D_n under the null hypothesis (all proofs are collected in the Appendix). For vectors a = (a_1, …, a_m)′ and b = (b_1, …, b_m)′, denote a ∧ b = (min(a_1, b_1), …, min(a_m, b_m))′ and a ∨ b = (max(a_1, b_1), …, max(a_m, b_m))′.

Theorem 1 Given Assumptions 1–4, if h → 0 and nh^d → ∞, then under the null hypothesis (2.1),

  nh^{d/2} D_n →_D N(0, σ²),  (3.1)

where

  σ² ≡ 2 ∫ K²(u) du · ∫∫ E{ [G(y_1 ∧ y_2|X)]² [1 − G(y_1 ∨ y_2|X)]² f_1(X) } dF_2(y_1) dF_2(y_2).  (3.2)

σ² can be consistently estimated by

  σ̂² ≡ [2/(n(n − 1))] Σ_{i=1}^n Σ_{j≠i} (1/h^d) K²((x_i − x_j)/h) { ∫ [1(y_i ≤ y) − F(y|x_i, θ̂)] [1(y_j ≤ y) − F(y|x_j, θ̂)] dF_{2n}(y) }².  (3.3)

The final test statistic T_n is based on D_n:

  T_n ≡ nh^{d/2} D_n / σ̂.  (3.4)


It follows from Theorem 1 that T_n →_D N(0, 1) under the null hypothesis.
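Continuing the sketch, the inner integral of (3.3) against dF_{2n} is an average over the sample y-values, so σ̂² and T_n combine the same ingredients (same assumed interfaces and `D_n` as above):

```python
def T_n(y, x, h, F, theta_hat, kernel=epanechnikov_product):
    """Standardized statistic T_n = n h^{d/2} D_n / sigma_hat of (3.4)."""
    n, d = x.shape
    K2 = kernel((x[:, None, :] - x[None, :, :]) / h)**2 / h**d
    np.fill_diagonal(K2, 0.0)
    Fik = np.stack([F(y, x_i, theta_hat) for x_i in x], axis=0)
    e = (y[:, None] <= y[None, :]).astype(float) - Fik            # e[i, k]
    inner = e @ e.T / n           # (3.3)'s integral w.r.t. dF_2n: mean over k
    sigma2_hat = 2.0 * (K2 * inner**2).sum() / (n * (n - 1))
    Dn = D_n(y, x, h, F, theta_hat, kernel)
    return n * h**(d / 2) * Dn / np.sqrt(sigma2_hat)
```

The null is then rejected at level α when T_n exceeds the standard normal critical value.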

To study the power properties of the test, we consider the following sequence of local alternatives:

  H_{1n} : G(y|x) = F(y|x, θ_0) + λ_n Δ(y, x),  (3.5)

where λ_n → 0 as n → ∞ and the function Δ(·, ·) is continuously differentiable and uniformly bounded. The power of our test against the local alternatives (3.5) is given in the following theorem.

Theorem 2 Given Assumptions 1–4, if h → 0 and nh^d → ∞, then under the local alternatives (3.5) with λ_n = n^{−1/2} h^{−d/4},

  T_n →_D N(δ, 1),  (3.6)

where

  δ = ∫ E[Δ²(y, X) f_1(X)] dF_2(y) / σ.  (3.7)

The local power result is similar to that of Eubank and LaRiccia (1992) for testing continuous unconditional distributions. The test has non-trivial power against alternatives converging to the null at rate n^{−1/2} h^{−d/4}. At significance level α, the test rejects the null if T_n > Z_α, where Z_α is the 100(1 − α) percentile of the standard normal distribution. Thus the limiting power of the test against the alternatives (3.5) is 1 − Φ(Z_α − δ), where Φ is the standard normal distribution function. Though the test of Andrews (1997) has power against n^{−1/2} local alternatives, it has a more complicated null distribution than our test; our test has a convenient normal null distribution.
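As a quick numerical illustration of this limiting power formula (a sketch using SciPy's standard normal):

```python
from scipy.stats import norm

def limiting_power(delta, alpha=0.05):
    """Limiting power 1 - Phi(Z_alpha - delta) against the alternatives (3.5)."""
    return 1.0 - norm.cdf(norm.ppf(1.0 - alpha) - delta)

print(limiting_power(2.0))   # drift delta = 2 at the 5% level: ~0.64
```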

Similarly to the test of Eubank and LaRiccia (1992), the power of our test is monotone in δ, a measure of the size of the departure from the null. This implies that the test has the same asymptotic power against alternatives of the same size in any direction. Since a Cramér–von Mises type test based on empirical distributions has not yet been developed for the conditional distribution case, we cannot make direct power comparisons with such a test, as is done in Eubank and LaRiccia (1992). Note that our test statistic is weighted by the density function of X in order to avoid extremely small values in the denominator of the kernel density estimator. An alternative way to solve this problem is to introduce a trimming parameter that trims out extremely small values. Theoretically, there is an optimal weight function in the sense of maximizing power against the local alternatives (3.5), but using the density function as the weight function appears not to cause much of a problem for the size and power of the test, as illustrated in the simulation study in the next section.

4 Bootstrap and simulation studies

In this section, we provide some Monte Carlo simulations for the test proposed in this paper and compare its finite sample performance with the test of Andrews (1997). Initial simulations show that the actual size of our test tends to be below the nominal size in small samples, so we use a bootstrap procedure to obtain more accurate critical values. As in Andrews (1997), the naive bootstrap based on resampling from the empirical distribution does not work in our case, as it does not reflect the underlying conditional distribution under the null. So we use the bootstrap procedure proposed by Andrews (1997).

In the bootstrap procedure of Andrews (1997), the bootstrapped covariates are the same as the observed x_i, and the bootstrapped y*_i comes from the conditional distribution F(y|x_i, θ̂), where F(y|x, θ̂) is the estimated parametric conditional distribution under the null. That is, for each observation (y_i, x_i), we draw y*_i from F(y|x_i, θ̂) with x_i fixed. Then we re-estimate θ by MLE based on the bootstrapped sample {(y*_i, x_i)}_{i=1}^n and denote the resulting estimator by θ̂*. Since there is no need to bootstrap the variance of our test statistic T_n, we only need to bootstrap D_n in (2.8), instead of T_n in (3.4). The bootstrapped D_n is

  D*_n ≡ [1/(n²(n − 1))] Σ_{k=1}^n Σ_{i=1}^n Σ_{j≠i} (1/h^d) K((x_i − x_j)/h) [1(y*_i ≤ y*_k) − F(y*_k|x_i, θ̂*)] [1(y*_j ≤ y*_k) − F(y*_k|x_j, θ̂*)].  (4.1)

To investigate the size and power of our test, we conduct some Monte Carlo simulations to compare it with the test of Andrews (1997). In all of the following models, the covariates x_1 and x_2 are generated from the bivariate normal distribution with variances equal to 1 and covariance equal to 0.7. The error term ε is independent of x.

In models 1a–1f, we consider testing for nonnormal distributions and other types of misspecification in linear regression models. In model 1a, the dependent variable y is generated as y = 1.5 + 10x_1 − 3.5x_2 + ε, where ε has the standard normal distribution. In models 1b–1d, we check whether the test has power against heavy-tailed, symmetric or skewed, nonnormal distributions. In model 1b, y = 1.5 + 10x_1 − 3.5x_2 + ε, where ε has the standard Cauchy distribution. In model 1c, y = 1.5 + 10x_1 − 3.5x_2 + ε, where ε is contaminated normal with distribution function F(ε) = 0.9Φ(ε) + 0.1Φ(ε/10). In model 1d, y = 1.5 + 10x_1 − 3.5x_2 + ε, where ε is skewed and contaminated normal with distribution function F(ε) = 0.9Φ(ε) + 0.1Φ((ε − 5)/10). In models 1e and 1f, we check for misspecifications of the conditional distribution due to incorrect assumptions about the regression function or the conditional variance. In model 1e, y = (1.5 + 10x_1 − 3.5x_2) + (2x_1x_2 + 4x_2²) + ε, where ε ∼ N(0, 1). In model 1f, y = 1.5 + 10x_1 − 3.5x_2 + e^{1+3x_1−4.5x_2} ε, where ε ∼ N(0, 1). The null hypothesis is that the model is linear and homoscedastic with normal errors, i.e., H_0 : F(y|x, θ) = Φ((y − β_0 − β_1x_1 − β_2x_2)/σ), where θ ≡ (β_0, β_1, β_2, σ²)′. Thus model 1a is true and models 1b–1f are false. Under the null, the true θ_0 = (1.5, 10, −3.5, 1)′.

In models 2a–2f, we consider testing for goodness-of-fit of censored regression models. In these models, the latent dependent variables y⁰_i are generated in the same way as in models 1a–1f, respectively. We observe (y_i, x_i), where y_i = max{y⁰_i, 0}. The null hypothesis is that the conditional distribution function of Y given X is F(y|x, θ) = 1(y ≥ 0)Φ((y − β_0 − β_1x_1 − β_2x_2)/σ), where θ = (β_0, β_1, β_2, σ²)′. Thus model 2a is true and models 2b–2f are false. Under the null, the true θ_0 = (1.5, 10, −3.5, 1)′.


Table 1 Percentage of rejections in linear regression models 1a–1f

         Tn                     D*n                    Andrews
n        1%     5%     10%      1%     5%     10%      1%     5%     10%

Model 1a: y = 1.5 + 10x_1 − 3.5x_2 + ε, ε ∼ N(0, 1)
100      0.7    3.6    6.2      1.0    4.4    9.7      0.8    4.4    10.2
200      1.0    3.3    6.7      1.2    4.7    9.7      1.4    5.4    11.0
300      0.4    3.1    6.9      0.5    4.1    9.6      1.0    3.7    8.3

Model 1b: y = 1.5 + 10x_1 − 3.5x_2 + ε, ε ∼ Cauchy
100      89.2   92.8   94.7     96.4   97.5   97.5     90.1   94.2   96.0
200      99.9   99.9   99.9     99.9   100    100      99.4   99.7   99.9
300      100    100    100      100    100    100      100    100    100

Model 1c: y = 1.5 + 10x_1 − 3.5x_2 + ε, F(ε) = 0.9Φ(ε) + 0.1Φ(ε/10)
100      44.8   59.3   68.1     65.9   77.0   80.9     23.6   45.1   56.9
200      88.7   94.4   95.8     94.4   96.8   97.7     34.9   60.5   72.6
300      98.9   99.7   99.9     99.9   100    100      44.1   68.8   80.5

Model 1d: y = 1.5 + 10x_1 − 3.5x_2 + ε, F(ε) = 0.9Φ(ε) + 0.1Φ((ε − 5)/10)
100      58.6   71.6   79.8     80.2   86.7   89.9     41.1   61.9   73.7
200      96.5   98.7   99.5     98.9   99.7   99.9     64.5   83.8   88.9
300      99.9   100    100      100    100    100      77.4   92.0   96.2

Model 1e: y = 1.5 + 10x_1 − 3.5x_2 + 2x_1x_2 + 4x_2² + ε, ε ∼ N(0, 1)
100      100    100    100      100    100    100      100    100    100
200      100    100    100      100    100    100      100    100    100
300      100    100    100      100    100    100      100    100    100

Model 1f: y = 1.5 + 10x_1 − 3.5x_2 + e^{1+3x_1−4.5x_2}ε, ε ∼ N(0, 1)
100      100    100    100      100    100    100      100    100    100
200      100    100    100      100    100    100      100    100    100
300      100    100    100      100    100    100      100    100    100


For the two covariates x_1 and x_2, we choose the bandwidths as h_1 = c · s_{x_1} n^{−1/6} for covariate x_1 and h_2 = c · s_{x_2} n^{−1/6} for covariate x_2, where s_{x_1} and s_{x_2} are the sample standard deviations of x_1 and x_2. Initial simulations show that the size and power of our asymptotic test T_n are sensitive to the parameter c, but c = 1 seems to give high power and reasonable size, so we fix c = 1 when we compare T_n, D*_n, and the test of Andrews (1997). The kernel function K is chosen to be K(u_1, u_2) = K_1(u_1)K_2(u_2), where K_1 and K_2 are the Epanechnikov kernel

  K_1(v) = K_2(v) = 0.75(1 − v²)1(|v| ≤ 1).  (4.2)

Each simulation is based on 1,000 replications. The simulations are conducted for sample sizes n = 100, 200, and 300. We let the significance levels be 0.01, 0.05, and 0.10. The number of bootstrap samples for obtaining critical values is 300.
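In code, the bandwidth rule reads as follows. With a vector bandwidth, the kernel evaluations in the earlier sketches broadcast coordinate-wise, and the scalar normalization h^d should then be read as the product h_1 h_2 (an assumption consistent with the product-kernel construction):

```python
def bandwidths(x, c=1.0):
    """Rule-of-thumb bandwidths h_l = c * s_{x_l} * n^{-1/6}, one per covariate."""
    n = x.shape[0]
    return c * x.std(axis=0, ddof=1) * n ** (-1.0 / 6.0)
```

One Monte Carlo replication then chains the pieces above: generate data, fit the null model by MLE, and compare D_n with B = 300 bootstrap draws.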


Table 2 Percentage of rejections in censored regression models 2a–2f

         Tn                     D*n                    Andrews
n        1%     5%     10%      1%     5%     10%      1%     5%     10%

Model 2a: y = max{1.5 + 10x_1 − 3.5x_2 + ε, 0}, ε ∼ N(0, 1)
100      1.2    4.0    6.1      1.2    5.1    9.2      0.8    5.0    9.3
200      1.2    3.8    6.5      0.9    4.0    10.8     1.2    4.7    9.8
300      0.8    2.9    6.0      0.7    4.1    9.0      0.9    4.9    9.5

Model 2b: y = max{1.5 + 10x_1 − 3.5x_2 + ε, 0}, ε ∼ Cauchy
100      66.4   72.8   76.2     78.5   83.0   85.4     72.7   81.3   85.2
200      90.9   94.8   96.0     96.3   98.1   98.8     91.2   95.8   97.4
300      98.6   99.1   99.5     99.4   99.9   99.9     98.4   99.5   99.7

Model 2c: y = max{1.5 + 10x_1 − 3.5x_2 + ε, 0}, F(ε) = 0.9Φ(ε) + 0.1Φ(ε/10)
100      23.4   33.5   41.8     38.2   52.1   58.3     16.7   35.7   47.4
200      52.4   67.2   74.0     66.0   80.3   85.3     28.6   49.8   64.9
300      77.6   85.7   89.8     85.2   92.2   94.7     36.5   57.5   71.2

Model 2d: y = max{1.5 + 10x_1 − 3.5x_2 + ε, 0}, F(ε) = 0.9Φ(ε) + 0.1Φ((ε − 5)/10)
100      48.4   62.3   69.5     66.7   77.5   82.7     37.9   58.1   68.6
200      88.1   94.1   95.8     94.7   97.5   98.5     55.1   77.6   86.0
300      98.3   99.4   99.6     99.5   99.7   99.7     72.8   89.6   94.9

Model 2e: y = max{1.5 + 10x_1 − 3.5x_2 + 2x_1x_2 + 4x_2² + ε, 0}, ε ∼ N(0, 1)
100      100    100    100      100    100    100      100    100    100
200      100    100    100      100    100    100      100    100    100
300      100    100    100      100    100    100      100    100    100

Model 2f: y = max{1.5 + 10x_1 − 3.5x_2 + e^{1+3x_1−4.5x_2}ε, 0}, ε ∼ N(0, 1)
100      100    100    100      100    100    100      100    100    100
200      100    100    100      100    100    100      100    100    100
300      100    100    100      100    100    100      100    100    100

The empirical sizes and powers are reported in Tables 1 and 2. For reference we also provide simulation results for our asymptotic test T_n, but since T_n is not corrected for size, its comparison with D*_n and the test of Andrews (1997) is not very meaningful, so we focus on the comparison between D*_n and the test of Andrews (1997). From the simulations for models 1a and 2a, we can see that both our bootstrapped test D*_n and the test of Andrews (1997) have adequate sizes: the actual sizes are all within one or two simulation standard errors of the asymptotic sizes. Both tests have high power against the Cauchy distribution in both types of models. Both tests also have very high power against misspecifications in functional form or heteroscedasticity, as in models 1e, 1f, 2e, and 2f. But D*_n has higher power against contaminated normal distributions than the test of Andrews (1997) in models 1c, 1d, 2c, and 2d. Probably due to the censoring of the data, the powers of both tests are somewhat reduced in the censored regression case. How to select the bandwidth via a data-driven method in this testing problem is an open issue left for future research.


5 Conclusion

This paper proposes a new goodness-of-fit test for parametric conditional distribution functions using the nonparametric smoothing method. The test can be applied to testing for possible misspecifications of conditional distributions in a wide variety of parametric models, such as linear regression models, discrete choice models, and censored regression models. We also provide a bootstrap procedure for obtaining critical values of the test in small samples.

Appendix: Proofs

The following lemma shows that D_n can be approximated by a second-order U-statistic.

Lemma Given Assumptions 1–4, if h → 0 and nh^d → ∞, then under the null hypothesis (2.1), D_n = U_n + O_p(n^{−1}), where

  U_n ≡ [1/(n(n − 1))] Σ_{i=1}^n Σ_{j≠i} (1/h^d) K((x_i − x_j)/h) ∫ [1(y_i ≤ y) − F(y|x_i, θ_0)] [1(y_j ≤ y) − F(y|x_j, θ_0)] dF_2(y).

Proof of Lemma Let z_i = (y_i, x_i′)′ and e_i(y, θ) = 1(y_i ≤ y) − F(y|x_i, θ). Let

  D_{3n} ≡ (n)₃^{−1} Σ_{I_3} H_n(z_i, z_j, z_k, θ̂),  (A.1)

where (n)₃ = n(n − 1)(n − 2), I_3 = {(i, j, k) : i ≠ j, j ≠ k, k ≠ i}, and

  H_n(z_i, z_j, z_k, θ) = (1/h^d) K((x_i − x_j)/h) e_i(y_k, θ) e_j(y_k, θ).  (A.2)

The statistic D_n can be written as

  D_n = [(n − 2)/n] D_{3n} + [2/(n²(n − 1))] Σ_{i≠j} (1/h^d) K((x_i − x_j)/h) e_i(y_i, θ̂) e_j(y_i, θ̂).  (A.3)

Let Λ denote a generic constant. For the second term of the above equation, the expectation of its absolute value is bounded by

  Λ [2/(n²(n − 1))] (1/h^d) Σ_{i≠j} E[K((x_i − x_j)/h)] = Λ [2/(n²(n − 1))] (1/h^d) Σ_{i≠j} ∫∫ K(u) f_1(x_i) f_1(x_i − hu) h^d dx_i du = O(n^{−1}).  (A.4)

Thus the second term is of order O_p(n^{−1}). We shall show that D_{3n} = U_n + O_p(n^{−1}); thus D_n = U_n + O_p(n^{−1}).

D_{3n} can be decomposed into three parts:

  D_{3n} = { (n)₃^{−1} Σ_{I_3} (1/h^d) K((x_i − x_j)/h) e_i(y_k, θ_0) e_j(y_k, θ_0) }
    − 2 { (n)₃^{−1} Σ_{I_3} (1/h^d) K((x_i − x_j)/h) e_i(y_k, θ_0) [F(y_k|x_j, θ̂) − F(y_k|x_j, θ_0)] }
    + { (n)₃^{−1} Σ_{I_3} (1/h^d) K((x_i − x_j)/h) [F(y_k|x_i, θ̂) − F(y_k|x_i, θ_0)] [F(y_k|x_j, θ̂) − F(y_k|x_j, θ_0)] }
  ≡ U_{1n} − 2U_{2n} + U_{3n}.  (A.5)

We shall show that U_{1n} = U_n + o_p(n^{−1}), U_{2n} = O_p(n^{−1}), and U_{3n} = O_p(n^{−1}). First we show that U_{1n} = U_n + o_p(n^{−1}). Let

  H_{1n}(z_i, z_j, z_k) = H_n(z_i, z_j, z_k, θ_0) − E[H_n(z_i, z_j, z_k, θ_0)|z_i, z_j].  (A.6)

Noting that E[H_{1n}(·)|z_i] = E[H_{1n}(·)|z_j] = 0, we have

  E[U_{1n} − U_n]² = (n)₃^{−2} Σ_{k_1} Σ_{k_2} E{ Σ_{i_1≠j_1} Σ_{i_2≠j_2} H_{1n}(z_{i_1}, z_{j_1}, z_{k_1}) H_{1n}(z_{i_2}, z_{j_2}, z_{k_2}) }
    = (n)₃^{−2} Σ_k Σ_{i≠j} E[H_{1n}(z_i, z_j, z_k)]²
    ≤ (n)₃^{−2} Σ_k Σ_{i≠j} Λ (1/h^{2d}) E[K²((x_i − x_j)/h)] = O(n^{−3}h^{−d}).  (A.7)

Thus, by Chebyshev's inequality and the condition nh^d → ∞,

  U_{1n} = U_n + O_p(n^{−3/2}h^{−d/2}) = U_n + o_p(n^{−1}).  (A.8)

By taking a Taylor expansion of F(y_k|x_j, θ̂) around θ_0, we have

  U_{2n} = (n)₃^{−1} Σ_{I_3} (1/h^d) K((x_i − x_j)/h) e_i(y_k, θ_0) [∂F(y_k|x_j, θ_0)/∂θ′] (θ̂ − θ_0)
    + (n)₃^{−1} Σ_{I_3} (1/h^d) K((x_i − x_j)/h) e_i(y_k, θ_0) (θ̂ − θ_0)′ [∂²F(y_k|x_j, θ̄)/∂θ∂θ′] (θ̂ − θ_0)
  ≡ Q_{1n}(θ̂ − θ_0) + (θ̂ − θ_0)′ Q_{2n}(θ̂ − θ_0),  (A.9)

where θ̄ is between θ_0 and θ̂. Q_{1n} is a U-statistic with E(Q_{1n}) = 0. By Lemma 3.1 of Powell et al. (1989), Q_{1n} = O_p(n^{−1/2}). For Q_{2n}, we have

  E‖Q_{2n}‖ ≤ Λ E[ h^{−d} (n)₃^{−1} Σ_{I_3} K((x_i − x_j)/h) b(y_k, x_k) ] = Λ h^{−d} ∫ K(u) h^d du · E[b(y_k, x_k)] = O(1).  (A.10)

Since θ̂ − θ_0 = O_p(n^{−1/2}), U_{2n} = O_p(n^{−1}).

For U_{3n}, by taking a Taylor expansion, we have

  |U_{3n}| ≤ | (n)₃^{−1} Σ_{I_3} (1/h^d) K((x_i − x_j)/h) (θ̂ − θ_0)′ [∂F(y_k|x_i, θ̄_1)/∂θ] [∂F(y_k|x_j, θ̄_2)/∂θ′] (θ̂ − θ_0) |
    ≤ Λ ‖θ̂ − θ_0‖² · [1/(n(n − 1))] Σ_{i≠j} (1/h^d) K((x_i − x_j)/h) · (1/n) Σ_{k=1}^n b(y_k, x_k)
    + Λ ‖θ̂ − θ_0‖² · [1/(n(n − 1))] Σ_{i≠j} (1/h^d) K((x_i − x_j)/h)
  = O_p(n^{−1}),  (A.11)

where both θ̄_1 and θ̄_2 are between θ̂ and θ_0. Thus U_{3n} = O_p(n^{−1}).

Summarizing the above results, we have proved that D_n = U_n + O_p(n^{−1}). □

Proof of Theorem 1 By the lemma, we only need to find the asymptotic distribution of U_n. We apply Theorem 1 of Hall (1984) to find the asymptotic null distribution of U_n. U_n is a U-statistic with kernel

  H_n(z_i, z_j) = (1/h^d) K((x_i − x_j)/h) ∫ e_i(y, θ_0) e_j(y, θ_0) dF_2(y),

where z_i = (y_i, x_i). Under the null hypothesis, it is easy to see that U_n is a degenerate U-statistic. Denote G_n(z_1, z_2) = E[H_n(z_3, z_1) H_n(z_3, z_2)|z_1, z_2]. Noting that |∫ e_i(y, θ_0) e_j(y, θ_0) dF_2(y)| ≤ 4 ∫ dF_2(y) ≡ C and E[e_i(y_1, θ_0) e_i(y_2, θ_0)|x_i] = G(y_1 ∧ y_2|x_i) [1 − G(y_1 ∨ y_2|x_i)], we have

  E[G_n²(z_1, z_2)] ≤ (C⁴/h^{4d}) E{ E[ K((x_3 − x_1)/h) K((x_3 − x_2)/h) | z_1, z_2 ] }²
    = (C⁴/h^{4d}) ∫∫ [ ∫ K((x_3 − x_1)/h) K((x_3 − x_2)/h) f_1(x_3) dx_3 ]² f_1(x_1) f_1(x_2) dx_1 dx_2
    = (C⁴/h^{4d}) ∫∫ [ ∫ K(u) K(u + v) h^d du ]² f_1(x_1) f_1(x_1 − hv) h^d dx_1 dv
    = O(1/h^d),  (A.12)

  E[H_n²(z_1, z_2)] = (1/h^{2d}) ∫∫ { ∫∫ K²((x_1 − x_2)/h) G(y_1 ∧ y_2|x_1) [1 − G(y_1 ∨ y_2|x_1)] G(y_1 ∧ y_2|x_2) [1 − G(y_1 ∨ y_2|x_2)] f_1(x_1) f_1(x_2) dx_1 dx_2 } dF_2(y_1) dF_2(y_2)
    = (1/h^{2d}) ∫∫ { ∫∫ K²(u) G(y_1 ∧ y_2|x_1) [1 − G(y_1 ∨ y_2|x_1)] G(y_1 ∧ y_2|x_1 − hu) [1 − G(y_1 ∨ y_2|x_1 − hu)] f_1(x_1) f_1(x_1 − hu) h^d du dx_1 } dF_2(y_1) dF_2(y_2)
    = (1/h^d) ∫ K²(u) du · ∫∫ E{ G²(y_1 ∧ y_2|x_i) [1 − G(y_1 ∨ y_2|x_i)]² f_1(x_i) } dF_2(y_1) dF_2(y_2) + o(1/h^d)
    = σ²/(2h^d) + o(1/h^d) = O(1/h^d),  (A.13)

  E[H_n⁴(z_1, z_2)] ≤ (C⁴/h^{4d}) E[K⁴((x_1 − x_2)/h)] = (C⁴/h^{4d}) ∫∫ K⁴(u) f_1(x) f_1(x − hu) h^d dx du = O(1/h^{3d}).  (A.14)

Thus we have

  { E[G_n²(z_1, z_2)] + n^{−1} E[H_n⁴(z_1, z_2)] } / { E[H_n²(z_1, z_2)] }² ≤ [ O(1/h^d) + n^{−1} O(1/h^{3d}) ] / O(1/h^{2d}) = O(h^d) + O(1/(nh^d)) → 0 as n → ∞.  (A.15)

Thus the condition of Theorem 1 of Hall (1984) is satisfied. By that theorem and (A.13), nh^{d/2} U_n →_D N(0, σ²).

Next we show that σ̂² is a consistent estimator of σ². Denote by σ̄² the U-statistic with the kernel

  H̄_n(z_i, z_j) = (1/h^d) K²((x_i − x_j)/h) [ ∫ e_i(y, θ_0) e_j(y, θ_0) dF_2(y) ]².  (A.16)

Similarly to the lemma, we can show that σ̂² = σ̄² + o_p(1). Similarly to (A.13), we can show that E[H̄_n²(z_i, z_j)] = O(1/h^d) = o(n), so the condition of Lemma 3.1 of Powell et al. (1989) is satisfied. By (A.13), we have E[H̄_n(z_i, z_j)] = σ² + o(1). So by Lemma 3.1 of Powell et al. (1989), σ̄² = σ² + o_p(1). Thus σ̂² = σ² + o_p(1). □

Proof of Theorem 2 Similarly to the lemma, we can show that D_n = U_n + O_p(n^{−1}), where U_n is defined in the lemma. Let u_i(y) = e_i(y, θ_0) − λ_n Δ(y, x_i); then E[u_i(y)|x_i] = 0. U_n can be written as

  U_n = { [1/(n(n − 1))] Σ_{i=1}^n Σ_{j≠i} (1/h^d) K((x_i − x_j)/h) ∫ u_i(y) u_j(y) dF_2(y) }
    + { λ_n [1/(n(n − 1))] Σ_{i=1}^n Σ_{j≠i} (1/h^d) K((x_i − x_j)/h) ∫ u_i(y) Δ(y, x_j) dF_2(y) }
    + { λ_n² [1/(n(n − 1))] Σ_{i=1}^n Σ_{j≠i} (1/h^d) K((x_i − x_j)/h) ∫ Δ(y, x_i) Δ(y, x_j) dF_2(y) }
  ≡ V_{1n} + λ_n V_{2n} + λ_n² V_{3n}.  (A.17)

Similarly to the proof of Theorem 1, we can show that nh^{d/2} V_{1n} →_D N(0, σ²), V_{2n} = O_p(n^{−1/2}), and V_{3n} →_p ∫ E[Δ²(y, X) f_1(X)] dF_2(y) > 0. If λ_n = n^{−1/2} h^{−d/4}, then

  nh^{d/2} λ_n V_{2n} = h^{d/4} √n V_{2n} →_p 0,
  nh^{d/2} λ_n² V_{3n} = V_{3n} →_p ∫ E[Δ²(y, X) f_1(X)] dF_2(y) > 0.  (A.18)

Thus we have

  T_n = nh^{d/2} D_n / σ̂ →_D N(δ, 1),  (A.19)

where δ = ∫ E[Δ²(y, X) f_1(X)] dF_2(y) / σ, as given in (3.7). □

References

Aït-Sahalia Y, Bickel PJ, Stoker TM (2001) Goodness-of-fit tests for regression using kernel methods. J Econom 105:363–412
Andrews DWK (1997) A conditional Kolmogorov test. Econometrica 65:1097–1128
Bai J (2003) Testing parametric conditional distributions of dynamic models. Rev Econ Stat 85:531–549
Bickel PJ, Rosenblatt M (1973) On some global measures of the deviations of density function estimates. Ann Stat 1:1071–1095 [Correction: 3 (1975) 1370]
Bickel PJ, Ritov Y, Stoker TM (2006) Tailor-made tests for goodness of fit to semiparametric hypotheses. Ann Stat 34:721–741
Eubank RL, Hart JD, LaRiccia VN (1993) Testing goodness of fit via nonparametric function estimation techniques. Commun Stat 22:3327–3354
Eubank RL, LaRiccia VN (1992) Asymptotic comparison of Cramér–von Mises and nonparametric function estimation techniques for testing goodness-of-fit. Ann Stat 20:2071–2086
Eubank RL, LaRiccia VN, Rosenstein R (1987) Test statistics derived as components of Pearson's phi-squared distance measure. J Am Stat Assoc 82:816–825
Ghosh BK, Huang W (1991) The power and optimal kernel of the Bickel–Rosenblatt test for goodness-of-fit. Ann Stat 19:999–1009
Hall P (1984) Central limit theorem for integrated square error of multivariate nonparametric density estimators. J Multivar Anal 14:1–16
Hart JD (1997) Nonparametric smoothing and lack-of-fit tests. Springer, New York
Powell JL, Stock JH, Stoker TM (1989) Semiparametric estimation of index coefficients. Econometrica 57:1403–1430
Rosenblatt M (1975) A quadratic measure of deviation of two-dimensional density estimates and a test of independence. Ann Stat 3:1–14 [Correction: 10 (1982) 646]
Zheng JX (2000) A consistent test of conditional parametric distributions. Econom Theory 16:667–691
Zheng X (2008) Testing for discrete choice models. Econ Lett 98:176–184