

  • Estimation and Inference of Semiparametric Models

    using Data from Several Sources

    Moshe Buchinsky ∗ Fanghua Li † Zhipeng Liao ‡

    This Version: July 2016

    Abstract

    This paper studies the estimation and inference of nonlinear econometric models when the economic variables are contained in different data sets. We construct a minimum distance (MD) estimator of the unknown structural parameter of interest when there are some common conditioning variables in the different data sets. The MD estimator is shown to be consistent and asymptotically normally distributed. We provide the specific form of the optimal weight for the MD estimation, and show that the optimally weighted MD estimator has the smallest asymptotic variance among all MD estimators. A consistent estimator of the variance-covariance matrix of the MD estimator, and hence an inference procedure for the unknown parameter, is also provided. The finite sample performances of the MD estimator and the inference procedure are investigated in simulation studies.

    JEL Classification: C12, C14

    Keywords: Conditional Moment Restrictions; Data Combination; Minimum Distance Estimation

    1 Introduction

    There are many cases in empirical micro studies where the data needed to analyze a particular phenomenon are not all available in one data set. Typically, this hampers the possibility of meaningful empirical research. A common practice is therefore to make some simplifying assumptions which permit the researcher to use information from more than one data source. For example, as Blundell et al. (2008) note, this is a crucial difficulty faced by those studying households' consumption and saving behavior, because of the lack of panel data covering household expenditures, income, and saving.

    An important data source for studying consumption is, for example, the Panel Study of Income Dynamics (PSID), a survey that provides longitudinal annual data on households that have been followed since 1968. The PSID collects data on a subset of consumption items, namely food at home and food away from home (with a few gaps in some of the survey years), as well as income. However, a limitation of the PSID is that it does not provide data on wealth.

    ∗Department of Economics, UC Los Angeles, 8373 Bunche Hall, Mail Stop: 147703, Los Angeles, CA 90095. Email: [email protected]
    †Department of Economics, UC Los Angeles, Mail Stop: 147703, Los Angeles, CA 90095.
    ‡Department of Economics, UC Los Angeles, 8379 Bunche Hall, Mail Stop: 147703, Los Angeles, CA 90095. Email: [email protected]

  • In contrast, while a few data sets provide detailed data on income and wealth (e.g., the Health and Retirement Survey (HRS) or the National Longitudinal Study (NLS)), these data sets provide no information on consumption.

    Problems of a similar nature exist in many other countries. For example, in the UK, the Family Expenditure Survey (FES) provides comprehensive data on household expenditures, but it is a cross-sectional data set, and thus the researcher does not get to observe households over time. In contrast, the British Household Panel Survey (BHPS) is a panel data set that collects data on income and wealth, but it provides no information on consumption.1 This is quite puzzling, given the vital need to study consumption decisions jointly with the income and wealth processes.

    Consequently, as is clearly and comprehensively explained in Blundell et al. (2008), studies aimed at understanding consumption behavior, and at testing alternative theories, have resorted to the limited data on food expenditure provided in the PSID. This includes, among others, Hall and Mishkin (1982), Zeldes (1989), Runkle (1991), Shea (1995), Cochrane (1991), Hayashi, Altonji and Kotlikoff (1996), Cox, Ng and Waldkirch (2004), Martin (2003) and Hurst and Stafford (2004), who test many alternative theories. The main problem with all these studies is that they use consumption data on a limited number of goods (largely necessity goods), which puts into question the external validity of the results.

    One approach that has been used in the literature is to form synthetic panel data sets from repeated cross-section data sets in which consumption is reported (e.g., the CEX or the FES). This is done in, for example, Browning, Deaton and Irish (1985) and Attanasio and Weber (1993).

    An alternative empirical approach that has been used occasionally in the literature involves imputing consumption to the PSID households using information on consumption from the CEX. Specifically, Skinner (1987) proposes to impute total consumption in the PSID using the estimated coefficients of a regression of total consumption on a number of consumption items that are reported in both the PSID and the CEX. While this method seems appealing at first sight, it understates the variation in total consumption, since it does not take into account the fact that there are considerable idiosyncratic elements that go into individual decision making. Ziliak (1998) and Browning and Leth-Petersen (2003) provide alternative methods that are variants of the one proposed by Skinner (1987).

    However, this method has a major weakness, in that it ignores, by construction, the dynamics of the individuals' consumption. Avoiding direct control of individual heterogeneity has been shown to pose a major obstacle when modeling individuals' behavior in general.

    The one paper in the literature that provides a method related in spirit to the method proposed here is Blundell et al. (2008).2 The method is similar in nature to that of Skinner (1987), in that the authors impute consumption data for the households in the PSID using regression parameters estimated from the CEX data. The key difference is that Blundell et al. (2006) use a "structural" regression of a standard demand function for food that depends not only on other consumption items, but also on prices and a set of demographic and socio-economic variables of the household. Assuming monotonicity of the demand for food makes it possible to invert this function in order to obtain a structurally based formula for non-durable consumption, which exists in the CEX but is missing in the PSID. Nevertheless, the general problems with the imputation method described above for Skinner's method still apply. If nothing else, the consumption data imputed in this fashion are likely to suffer from the well-known error-in-variables problem. Most importantly, it ignores the inherent individual heterogeneity of

    1Similar problems exist for many other countries, in particular countries in Europe (e.g., France and Spain) that collect detailed data on consumption, income, and wealth, but where the information never exists in a single data set.

    2This method is used extensively in Blundell, Pistaferri and Preston (2008).


  • consumption. In two recent papers, Fan, Sherman and Shum (2014, 2016) address a special case of the problem considered in our paper, namely the case of treatment effects. Under this scenario, the outcome variable and the conditioning variables are observed in two separate data sets, so that the treatment effect parameters are not point-identified. The authors provide sharp bounds on the counterfactual distributions and parameters of interest (see Fan, Sherman and Shum (2016)) and the corresponding inference procedures (see Fan, Sherman and Shum (2014)).

    Our case is more general, and encompasses situations in which some of the variables are available only in one data set, while others are available in a separate data set. The key insight is that there are some variables that appear in both. Under relatively mild regularity conditions, we provide a method that allows one to point-identify the structural parameters of interest in the main data set of interest using the information provided in the other data set. The parameters of interest and the "imputation equation" are estimated simultaneously. We also provide the necessary theory for inference, including cases in which the number of observations in the "imputation" data set does not diverge to infinity at the same rate as that of the main data set of interest.

    The semiparametric estimator proposed in this paper is related to the two-sample instrumental variable (2SIV) estimator studied in Klevmarken (1982), Angrist and Krueger (1992) and Arellano and Meghir (1992), and to the two-sample GMM estimator studied in Ridder and Moffitt (2007). The main difference is that these papers consider estimation of the structural parameters with finitely many moment restrictions, while our paper studies the estimation problem with conditional moment conditions and hence infinitely many unconditional moment restrictions.

    The notation of this paper is standard. For any square matrix A, λmin(A) and λmax(A) denote the smallest and largest eigenvalues of A, respectively. Throughout this paper, we use C to denote a generic finite positive constant which is larger than 1. For any set of real vectors {al}l∈I, where I = {l1, . . . , ldI} is an index set with dI distinct natural numbers, we define (al)l∈I = (al1, . . . , aldI) and (al)′l∈I = (al1, . . . , aldI)′. The notation ‖·‖ denotes the Euclidean norm. A′ refers to the transpose of any matrix A. Ik and 0l are used to denote the k × k identity matrix and the l × l zero matrix, respectively. The symbol A ≡ B means that A is defined as B. The expression an = op(bn) signifies that Pr(|an/bn| ≥ ε) → 0 for all ε > 0 as n goes to infinity; and an = Op(bn) signifies that Pr(|an/bn| ≥ M) → 0 as n and M go to infinity. As usual, "→p" and "→d" denote convergence in probability and convergence in distribution, respectively.

    2 The Model and the Estimators

    We are interested in estimating the following model

    Y = g(X1, X2, θ0) + v (1)

    where X1 and X2 are sets of regressors, g(·, ·, ·) : Rdx1 × Rdx2 × Rdθ → R is a known function, θ0 is the unknown parameter of interest and v is an unobservable residual term. Two data sets are available for estimating the unknown parameter: {(Yi, X′2,i, X′3,i)}i∈I1 and {(X1,i, X′2,i, X′3,i)}i∈I2, where I1 and I2 are two index sets with cardinalities n1 and n2, respectively.

    The unknown parameter θ0 could be conveniently estimated under the conditional moment restriction E[v|X1, X2] = 0 if we had joint observations on (Y, X′1, X′2). However, such a straightforward method is not applicable here because Y and X1 are contained in different data sets. On the other hand, the common


  • variables X2 and X3 contained in both data sets can be useful for identifying and estimating the unknown parameter θ0. For this purpose, we assume that

    E [v|X2, X3] = 0. (2)

    Using the expression in (1) and the conditional mean restriction in (2), we get

    E [ Y | X2, X3] = E [ g(X1, X2, θ0)| X2, X3] , (3)

    which is the key equation for the identification and estimation of θ0. For ease of notation, we write X = (X′1, X′2)′ and Z = (X′2, X′3)′. Then the model can be written as

    Y = g(X, θ0) + v with E [v|Z] = 0. (4)

    For any θ, we define the conditional expectation of g(X, θ) given Z as

    φ(Z, θ) = E [ g(X, θ)| Z] .

    Then we can write Y = φ(Z, θ0) + ε + v = h0(Z) + u,

    where ε = g(X, θ0) − φ(Z, θ0), u ≡ ε + v and h0(Z) = φ(Z, θ0). As the conditioning variable Z is available in both data sets, we have n (n = n1 + n2) observations:

    {Zi}i∈I = {(X′2,i, X′3,i)′}i∈I, where I = I1 ∪ I2. Let Pk(z) = [p1(z), . . . , pk(z)]′ be a k-dimensional vector of basis functions for any positive integer k. For any k and any n, we define Pn,k = (Pk(Zi))i∈I, which is an n × k matrix. Accordingly, we define Pn1,k1 = (Pk1(Zi))i∈I1 and Pn2,k2 = (Pk2(Zi))i∈I2, which are n1 × k1 and n2 × k2 matrices, respectively.

    The conditional mean function h0(Z) = E[Y|Z] can be estimated using the first data set by

    ĥn1(z) = Pk1(z)′(P′n1,k1 Pn1,k1)^{-1} P′n1,k1 Yn1, (5)

    where Yn1 = (Yi)′i∈I1. Using the second data set, we get the following estimator of the conditional mean function φ(Z, θ) for any θ:

    φ̂n2(z, θ) = Pk2(z)′(P′n2,k2 Pn2,k2)^{-1} P′n2,k2 gn2(θ), (6)

    where gn2(θ) = (g(Xi, θ))′i∈I2. Using the estimators of h0(Z) and φ(Z, θ), we can construct an estimator of θ0 via minimum distance (MD) estimation:

    θ̂n = arg minθ∈Θ n^{-1} Σi∈I ŵn(Zi)[ĥn1(Zi) − φ̂n2(Zi, θ)]², (7)

    where Θ denotes the parameter space containing θ0, and ŵn(·) is a non-negative weight function. One simple and straightforward choice of the weight function ŵn(·) is the identity function, i.e., ŵn(z) = 1 for any z. However, as we will show later in this paper, the identity-weighted MD estimator may not have the smallest possible variance.
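To make the construction concrete, the two-step procedure in (5)-(7) can be sketched in a few lines of code for the identity weight. The data generating process, the linear g(x, θ) = xθ, the polynomial basis and the grid search below are illustrative assumptions, not the paper's choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def basis(z, k):
    """Polynomial series P_k(z) = (1, z, ..., z^{k-1}) for a scalar regressor."""
    return np.vander(z, k, increasing=True)

def series_fit(z, y, k):
    """Least-squares series regression; returns a callable z -> fitted E[y | z]."""
    beta = np.linalg.lstsq(basis(z, k), y, rcond=None)[0]
    return lambda z_new: basis(z_new, k) @ beta

# --- illustrative DGP (not the paper's): g(x, theta) = x * theta, theta0 = 1 ---
theta0, n1, n2, k = 1.0, 500, 500, 4
z1 = rng.uniform(-1, 1, n1)                 # data set 1: (Y_i, Z_i), i in I_1
x_latent = z1 + 0.5 * rng.standard_normal(n1)
y = x_latent * theta0 + 0.5 * rng.standard_normal(n1)
z2 = rng.uniform(-1, 1, n2)                 # data set 2: (X_i, Z_i), i in I_2
x2 = z2 + 0.5 * rng.standard_normal(n2)

h_hat = series_fit(z1, y, k)                # (5): estimate h_0(z) = E[Y | Z = z]
z_all = np.concatenate([z1, z2])            # Z is observed in both samples

def md_objective(theta):
    # (6): estimate phi(z, theta) = E[g(X, theta) | Z = z] from the second sample
    phi_hat = series_fit(z2, x2 * theta, k)
    # (7) with identity weight: average squared distance over the pooled Z's
    return np.mean((h_hat(z_all) - phi_hat(z_all)) ** 2)

grid = np.linspace(0.0, 2.0, 2001)          # grid search over Theta = [0, 2]
theta_md = grid[int(np.argmin([md_objective(t) for t in grid]))]
print(round(theta_md, 3))
```

Note that only Z links the two samples: Y never appears together with X, yet θ0 is recovered because E[Y|Z] and E[g(X, θ0)|Z] must coincide under (2).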


  • 3 Asymptotic Properties of the MD Estimator

    In this section, we establish the asymptotic properties of the MD estimator. For any positive integer k, let ξk = supz∈Z ‖Pk(z)‖ and Qk = E[Pk(Z)Pk(Z)′], where Z denotes the support of Z. We first state the sufficient conditions for consistency.

    Assumption 1. (i) {(Yi, Zi)}i∈I1 and {(Xi, Zi)}i∈I2 are independent, each with i.i.d. observations; (ii) Var[Y|Z] < C; (iii) C^{-1} ≤ λmin(Qk) ≤ λmax(Qk) ≤ C for any k; (iv) there exist βh,k ∈ Rk and rh > 0 such that

    supz∈Z |h0(z) − Pk(z)′βh,k| = supz∈Z |h0(z) − h0,k(z)| = O(k^{-rh}); (8)

    (v) maxj=1,2 ξ²kj log(kj) nj^{-1} = o(1) and k1 n1^{-1} + k1^{-1} = o(1).

    Assumption 1 includes mild and standard conditions on nonparametric series estimation of the conditional mean function (see, e.g., Andrews (1991), Newey (1997) and Chen (2007)).

    Define

    Ln(θ) = n^{-1} Σi∈I wn(Zi)|h0(Zi) − φ(Zi, θ)|² and L*n(θ) = E[wn(Z)|h0(Z) − φ(Z, θ)|²]

    for any θ ∈ Θ, where wn(·) is defined in Assumption 2(v) below.

    Assumption 2. (i) supθ∈Θ E[φ²(Z, θ)] < C; (ii) n^{-1} Σi∈I |φ̂n2(Zi, θ) − φ(Zi, θ)|² = op(1) uniformly over θ ∈ Θ; (iii) for any ε > 0, there is ηε > 0 such that

    E[|h0(Z) − φ(Z, θ)|²] > ηε for any θ ∈ Θ with ‖θ − θ0‖ ≥ ε;

    (iv) supθ∈Θ |Ln(θ) − L*n(θ)| = op(1); (v) supz∈Z |ŵn(z) − wn(z)| = Op(δw,n), where δw,n = O(n1^{-1/4} + n2^{-1/4}) and wn(·) is a sequence of non-random functions with C^{-1} ≤ wn(z) ≤ C for any n and any z ∈ Z.

    Assumption 2(i) imposes a uniform finite second moment condition on the function φ(Z, θ). Assumption 2(ii) requires that the nonparametric estimator φ̂n2(Zi, θ) of φ(Zi, θ) is consistent under the empirical L2-norm uniformly over θ ∈ Θ. Assumption 2(iii) is the identification condition for θ0. Assumption 2(iv) is a uniform law of large numbers for the function wn(Zi)|h0(Zi) − φ(Zi, θ)|² indexed by θ. Assumption 2(v) requires that ŵn(·) is approximated by a sequence of non-random functions wn(·) uniformly over z. For the consistency of the MD estimator, it is sufficient to have δw,n = o(1) in Assumption 2(v). The rate condition δw,n = O(n1^{-1/4} + n2^{-1/4}) is needed for deriving the asymptotic normality of the MD estimator. It is clear that Assumption 2(v) holds trivially if ŵn(·) is the identity function.

    Theorem 1. Under Assumptions 1 and 2, we have θ̂n = θ0 + op(1).

    For ease of notation, we define

    gθ(X, θ) = ∂g(X, θ)/∂θ, gθθ(X, θ) = ∂²g(X, θ)/∂θ∂θ′,
    φθ(Z, θ) = E[gθ(X, θ)|Z], φθθ(Z, θ) = E[gθθ(X, θ)|Z],
    φ̂θ,n2(Z, θ) = ∂φ̂n2(Z, θ)/∂θ, φ̂θθ,n2(Z, θ) = ∂²φ̂n2(Z, θ)/∂θ∂θ′.

    By the consistency of θ̂n, there exists a positive sequence δn = o(1) such that θ̂n ∈ Nδn with probability approaching 1, where Nδn = {θ ∈ Θ : ‖θ − θ0‖ ≤ δn}. Define H0,n = E[wn(Z)φθ(Z, θ0)φθ(Z, θ0)′]. Let φθj(z, θ)


  • denote the j-th component of φθ(z, θ). We next state the sufficient conditions for asymptotic normality of θ̂n.

    Assumption 3. The following conditions hold:
    (i) supθ∈Nδn n2^{-1} Σi∈I2 ‖gθθ(Xi, θ)‖² = Op(1);
    (ii) λmin(H0,n) > C^{-1};
    (iii) n^{-1} Σi∈I ‖φ̂θ,n2(Zi, θ0) − φθ(Zi, θ0)‖² = op(n2^{-1/2});
    (iv) E[‖φθ(Z, θ0)‖⁴] < ∞;
    (v) E[u²|Z] > C^{-1}, E[ε²|Z] > C^{-1} and E[u⁴ + ε⁴|Z] < C;
    (vi) supz∈Z |wn(z)φθj(z, θ0) − Pk(z)′βwφj,n,k| = o(1), where βwφj,n,k ∈ Rk (j = 1, . . . , dθ);
    (vii) maxj=1,2 (kj nj^{-1/2} + kj^{-rh} nj^{1/2}) = o(1).

    Assumption 3(i) holds when ‖gθθ(x, θ)‖² < C for any x and any θ in a local neighborhood of θ0. The lower bound on the eigenvalues of H0,n in Assumption 3(ii) ensures the local identification of θ0. Assumption 3(iii) requires that the convergence rate of φ̂θ,n2(Zi, θ0) under the empirical L2-norm is faster than n2^{-1/4}. Assumption 3(iv) imposes a finite fourth moment on the derivative function φθ(Z, θ0). Assumption 3(v) imposes moment conditions on the projection errors u and ε, which are useful for deriving the asymptotic normality of the MD estimator. Assumption 3(vi) requires that the function wn(z)φθj(z, θ0) can be approximated by the basis functions. Assumption 3(vii) imposes restrictions on the number of basis functions and the smoothness of the unknown function h0.

    Let σu²(Z) = E[u²|Z], σε²(Z) = E[ε²|Z] and φwθ,n = (wn(Zi)φθ(Zi, θ0))i∈I. Define

    Σn1 ≡ φ′wθ,n Pn,k1 Q^{-1}n1,k1 Qn1,u Q^{-1}n1,k1 P′n,k1 φwθ,n / (n²n1),

    where Qn1,k1 = n1^{-1} P′n1,k1 Pn1,k1 and Qn1,u = n1^{-1} Σi∈I1 σu²(Zi) Pk1(Zi) Pk1(Zi)′, and

    Σn2 ≡ φ′wθ,n Pn,k2 Q^{-1}n2,k2 Qn2,ε Q^{-1}n2,k2 P′n,k2 φwθ,n / (n²n2),

    where Qn2,k2 = n2^{-1} P′n2,k2 Pn2,k2 and Qn2,ε = n2^{-1} Σi∈I2 σε²(Zi) Pk2(Zi) Pk2(Zi)′.

    Theorem 2. Under Assumptions 1, 2 and 3, we have

    θ̂n − θ0 = Op(n1^{-1/2} + n2^{-1/2}) (9)

    and moreover

    γ′n (H0,n(Σn1 + Σn2)^{-1}H0,n)^{1/2}(θ̂n − θ0) →d N(0, 1) (10)

    for any non-random sequence γn ∈ Rdθ with γ′nγn = 1.

    Remark 1. The first result of Theorem 2, i.e., (9), implies that the convergence rate of the MD estimator is of the order max{n1^{-1/2}, n2^{-1/2}}.


  • Remark 2. By the Cramér-Wold device and Theorem 2, we know that

    (H0,n(Σn1 + Σn2)^{-1}H0,n)^{1/2}(θ̂n − θ0) →d N(0dθ, Idθ), (11)

    which together with the continuous mapping theorem (CMT) implies that

    (θ̂n − θ0)′(H0,n(Σn1 + Σn2)^{-1}H0,n)(θ̂n − θ0) →d χ²(dθ). (12)

    Moreover, let ι*j be the dθ × 1 selection vector whose j-th (j = 1, . . . , dθ) component is 1 and whose remaining components are 0. Define

    γj,n = (H0,n(Σn1 + Σn2)^{-1}H0,n)^{-1/2} ι*j / (ι*′j (H0,n(Σn1 + Σn2)^{-1}H0,n)^{-1} ι*j)^{1/2}, for j = 1, . . . , dθ.

    It is clear that γ′j,nγj,n = 1, and by Theorem 2, we have

    γ′j,n (H0,n(Σn1 + Σn2)^{-1}H0,n)^{1/2}(θ̂n − θ0) = (θ̂j,n − θj,0)/(ι*′j (H0,n(Σn1 + Σn2)^{-1}H0,n)^{-1} ι*j)^{1/2} →d N(0, 1), (13)

    where θ̂j,n = ι*′j θ̂n and θj,0 = ι*′j θ0. The results in (12) and (13) can be used to conduct inference on θj,0 and θ0, respectively, if consistent estimators of H0,n, Σn1 and Σn2 are available.

    4 Optimal Weighting

    In this section, we compare the MD estimators through their finite sample variances. The comparison leads to an optimal weight function which gives the MD estimator with the smallest finite sample variance, as well as the smallest asymptotic variance, among all MD estimators. The following lemma simplifies the finite sample variance-covariance matrix, which facilitates the comparison of the MD estimators.

    Lemma 1. Under Assumptions 1(i), 1(iii), 1(v), 2(v) and 3(iv)-3(vi),

    H0,n^{-1}(Σn1 + Σn2)H0,n^{-1} = Vn,θ(1 + op(1)),

    where Vn,θ = H0,n^{-1} E[wn²(Z)(n1^{-1}σu²(Z) + n2^{-1}σε²(Z))φθ(Z, θ0)φθ(Z, θ0)′] H0,n^{-1}.

    If the sequence of weight functions is set to be

    w*n(Z) = (n1^{-1} + n2^{-1})(n1^{-1}σu²(Z) + n2^{-1}σε²(Z))^{-1}, (14)

    then the finite sample variance of the MD estimator becomes

    V*n,θ = (E[(σu²(Z)/n1 + σε²(Z)/n2)^{-1} φθ(Z, θ0)φθ(Z, θ0)′])^{-1}. (15)

    The next theorem shows that V*n,θ is the smallest asymptotic variance-covariance matrix of the MD estimator.

    Theorem 3. For any sequence of weight functions wn(Z), we have Vn,θ ≥ V*n,θ for any n1 and any n2.
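Theorem 3 is, in essence, a generalized least squares optimality result. The following sketch is not taken from the paper's proof; it records the standard matrix Cauchy-Schwarz step in the notation of Lemma 1, writing φ for φθ(Z, θ0) and σn²(Z) for n1^{-1}σu²(Z) + n2^{-1}σε²(Z):

```latex
\[
V_{n,\theta} = H_{0,n}^{-1}\, E\!\left[w_n^2(Z)\,\sigma_n^2(Z)\,\phi\phi'\right] H_{0,n}^{-1},
\qquad H_{0,n} = E\!\left[w_n(Z)\,\phi\phi'\right].
\]
Set $a = w_n(Z)\sigma_n(Z)\,\phi$ and $b = \sigma_n^{-1}(Z)\,\phi$, so that $E[ab'] = H_{0,n}$.
Since the residual covariance matrix of $a$ given $b$ is positive semi-definite,
\[
E[aa'] - E[ab']\bigl(E[bb']\bigr)^{-1}E[ba'] \succeq 0
\quad\Longrightarrow\quad
E\!\left[w_n^2\sigma_n^2\,\phi\phi'\right] \succeq H_{0,n}\bigl(E[\sigma_n^{-2}\phi\phi']\bigr)^{-1}H_{0,n}.
\]
Pre- and post-multiplying by $H_{0,n}^{-1}$ gives
\[
V_{n,\theta} \succeq \bigl(E[\sigma_n^{-2}(Z)\,\phi\phi']\bigr)^{-1} = V_{n,\theta}^{*},
\]
with equality when $w_n(Z) \propto \sigma_n^{-2}(Z)$, which is exactly the optimal weight in (14).
```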


  • We call the MD estimator whose finite sample variance-covariance matrix equals V*n,θ the optimal MD estimator. To ensure that the optimal MD estimator is feasible, we have to: (i) show that C^{-1} < w*n(z) < C for any z ∈ Z and any n1, n2; and (ii) construct an empirical weight function ŵ*n(z) such that supz∈Z |ŵ*n(z) − w*n(z)| = Op(δw,n), where δw,n = O(n1^{-1/4} + n2^{-1/4}). In the rest of this section, we show that w*n(z) is bounded from above and from below. Construction of the empirical weight function ŵ*n(·) is studied in the next section.

    Lemma 2. Under Assumption 3(v), C^{-1} < w*n(z) < C for any z ∈ Z and any n1, n2.

    5 Estimation of the Variance and Optimal Weighting

    The estimator of the variance-covariance matrix is constructed from its sample analog. Let ûi = Yi − ĥn1(Zi) for any i ∈ I1, and ε̂i = g(Xi, θ̂n) − φ̂n2(Zi, θ̂n) for any i ∈ I2. Define

    Ĥn = n^{-1} Σi∈I ŵn(Zi) φ̂θ,n2(Zi, θ̂n) φ̂θ,n2(Zi, θ̂n)′,

    Σ̂n1 = φ̂′wθ,n Pn,k1 Q^{-1}n1,k1 Q̂n1,u Q^{-1}n1,k1 P′n,k1 φ̂wθ,n / (n²n1),

    Σ̂n2 = φ̂′wθ,n Pn,k2 Q^{-1}n2,k2 Q̂n2,ε Q^{-1}n2,k2 P′n,k2 φ̂wθ,n / (n²n2),

    where φ̂wθ,n = (ŵn(Zi)φ̂θ,n2(Zi, θ̂n))i∈I, Q̂n1,u = n1^{-1} Σi∈I1 ûi² Pk1(Zi) Pk1(Zi)′ and Q̂n2,ε = n2^{-1} Σi∈I2 ε̂i² Pk2(Zi) Pk2(Zi)′. The variance estimator is defined as

    V̂n = Ĥn^{-1}(Σ̂n1 + Σ̂n2)Ĥn^{-1}. (16)

    The following conditions are needed to show the consistency of V̂n and of the empirical optimal weight function constructed later in this section.

    Assumption 4. (i) supθ∈Nδn n2^{-1} Σi∈I2 ‖gθ(Xi, θ)‖² = Op(1); (ii) there exist βu,k ∈ Rk and ru > 0 such that

    supz∈Z |σu²(z) − Pk(z)′βu,k| = O(k^{-ru}); (17)

    (iii) there exist βε,k ∈ Rk and rε > 0 such that

    supz∈Z |σε²(z) − Pk(z)′βε,k| = O(k^{-rε}); (18)

    (iv) maxj=1,2 (ξkj kj^{1/2} nj^{-1/2} + ξkj kj^{-rh}) = o(1); (v) E[‖gθ(X, θ0)‖⁴] ≤ C.

    Assumption 4(i) requires that the sample average of ‖gθ(Xi, θ)‖² is stochastically bounded uniformly over the local neighborhood of θ0. Assumptions 4(ii) and 4(iii) imply that the conditional variances σu²(z) and σε²(z) can be approximated by the basis functions Pk(z). Assumption 4(iv) imposes restrictions on the numbers of basis functions and the smoothness of the conditional variance functions. Assumption 4(v) imposes a finite fourth moment on gθ(X, θ0).

    Theorem 4. Suppose Assumptions 1, 2, 3, 4(i) and 4(iv) hold. If (k1 + k2)δ²w,n = o(1), then we have

    Ĥn^{-1}(Σ̂n1 + Σ̂n2)Ĥn^{-1} = H0,n^{-1}(Σn1 + Σn2)H0,n^{-1}(1 + op(1)) (19)

    and moreover,

    γ′n (Ĥn(Σ̂n1 + Σ̂n2)^{-1}Ĥn)^{1/2}(θ̂n − θ0) →d N(0, 1), (20)

    for any non-random sequence γn ∈ Rdθ with γ′nγn = 1.

    Remark 3. By the consistency of Ĥn(Σ̂n1 + Σ̂n2)^{-1}Ĥn and the CMT,

    (Ĥn(Σ̂n1 + Σ̂n2)^{-1}Ĥn)^{1/2}(θ̂n − θ0) →d N(0dθ, Idθ),

    which together with the CMT implies that

    Wn(θ0) = (θ̂n − θ0)′(Ĥn(Σ̂n1 + Σ̂n2)^{-1}Ĥn)(θ̂n − θ0) →d χ²(dθ). (21)

    Recall that ι*j is the dθ × 1 selection vector whose j-th (j = 1, . . . , dθ) component is 1 and whose remaining components are 0. By the consistency of Ĥn(Σ̂n1 + Σ̂n2)^{-1}Ĥn, we have

    ι*′j (Ĥn(Σ̂n1 + Σ̂n2)^{-1}Ĥn)^{-1} ι*j = ι*′j (H0,n(Σn1 + Σn2)^{-1}H0,n)^{-1} ι*j (1 + op(1)),

    which together with (13) and the CMT implies that

    tj,n(θj,0) = (θ̂j,n − θj,0)/(ι*′j (Ĥn(Σ̂n1 + Σ̂n2)^{-1}Ĥn)^{-1} ι*j)^{1/2} →d N(0, 1). (22)

    The Student-t statistic in (22) and the Wald statistic in (21) can be applied to conduct inference on θj,0 for j = 1, . . . , dθ and joint inference on θ0, respectively.

    Remark 4. Theorem 4 can be applied to conduct inference on θ0 using the identity weighted MD estimator θ̂1,n, defined as

    θ̂1,n = arg minθ∈Θ n^{-1} Σi∈I (ĥn1(Zi) − φ̂n2(Zi, θ))². (23)

    As the identity weight function satisfies Assumption 2(v) and the condition (k1 + k2)δ²w,n = o(1) holds trivially, under Assumptions 1, 2(i)-(iv) and 3, Theorem 2 implies that

    θ̂1,n − θ0 = Op(n1^{-1/2} + n2^{-1/2}). (24)

    The identity weighted MD estimator can be used to construct the empirical weight function, which enables us to construct the optimal MD estimator.

    Let ûi = Yi − ĥn1(Zi) for any i ∈ I1, and ε̃i = g(Xi, θ̂1,n) − φ̂n2(Zi, θ̂1,n) for any i ∈ I2. Define

    ŵ*n(z) = (n1^{-1} + n2^{-1})(n1^{-1} σ̂²n,u(z) + n2^{-1} σ̂²n,ε(z))^{-1}, (25)

    where σ̂²n,u(z) and σ̂²n,ε(z) are the estimators of the conditional variances σu²(z) and σε²(z):

    σ̂²n,u(z) = n1^{-1} Pk1(z)′ Q^{-1}n1,k1 P′n1,k1 Û2,n1 and σ̂²n,ε(z) = n2^{-1} Pk2(z)′ Q^{-1}n2,k2 P′n2,k2 ε̃2,n2, (26)

    where Û2,n1 = (ûi²)′i∈I1 and ε̃2,n2 = (ε̃i²)′i∈I2. The optimal MD estimator is defined as

    θ̂*n = arg minθ∈Θ n^{-1} Σi∈I ŵ*n(Zi)(ĥn1(Zi) − φ̂n2(Zi, θ))². (27)
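A minimal sketch of this two-step construction, under an illustrative linear design with heteroskedastic errors (the DGP, basis and grid below are assumptions, not the paper's; clipping the fitted variances away from zero is an added practical safeguard):

```python
import numpy as np

rng = np.random.default_rng(1)

def P(z, k):
    """Polynomial basis (1, z, ..., z^{k-1}) for a scalar regressor."""
    return np.vander(z, k, increasing=True)

def series_predict(z_fit, v, z_new, k):
    """Series regression of v on P_k(z_fit), evaluated at z_new."""
    beta = np.linalg.lstsq(P(z_fit, k), v, rcond=None)[0]
    return P(z_new, k) @ beta

# illustrative DGP (not the paper's): g(x, theta) = x * theta, theta0 = 1,
# with heteroskedastic noise so that the optimal weight is non-trivial
theta0, n1, n2, k = 1.0, 400, 400, 4
z1 = rng.uniform(-1, 1, n1)
x1 = z1 + 0.5 * rng.standard_normal(n1)
y = x1 * theta0 + (0.3 + 0.5 * z1 ** 2) * rng.standard_normal(n1)
z2 = rng.uniform(-1, 1, n2)
x2 = z2 + 0.5 * rng.standard_normal(n2)
z_all = np.concatenate([z1, z2])
grid = np.linspace(0.0, 2.0, 2001)

def md(weight_fn):
    """Weighted MD estimator, cf. (7)/(23)/(27), by grid search."""
    h_hat = series_predict(z1, y, z_all, k)
    w = weight_fn(z_all)
    obj = [np.mean(w * (h_hat - series_predict(z2, x2 * t, z_all, k)) ** 2)
           for t in grid]
    return grid[int(np.argmin(obj))]

# step 1: identity-weighted estimator, as in (23)
theta1 = md(lambda z: np.ones_like(z))

# step 2: squared residuals and series estimates of the conditional variances, cf. (26);
# clipping the fitted variances is a practical safeguard, not part of the paper
u2 = (y - series_predict(z1, y, z1, k)) ** 2
e2 = (x2 * theta1 - series_predict(z2, x2 * theta1, z2, k)) ** 2
sig2_u = lambda z: np.clip(series_predict(z1, u2, z, k), 1e-3, None)
sig2_e = lambda z: np.clip(series_predict(z2, e2, z, k), 1e-3, None)

def w_star(z):
    """Empirical optimal weight, cf. (25)."""
    return (1 / n1 + 1 / n2) / (sig2_u(z) / n1 + sig2_e(z) / n2)

theta_opt = md(w_star)  # optimal MD estimator, cf. (27)
print(theta1, theta_opt)
```

Both steps reuse the same grid-search routine; only the weight function changes between (23) and (27).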

    To show the optimality of θ̂*n, it is sufficient to show that ŵ*n(Zi) satisfies the high level conditions in Assumption 2(v). For this purpose, we first derive the convergence rates of σ̂²n,u(z) and σ̂²n,ε(z).

    Lemma 3. Under Assumptions 1, 2(i)-(iv), 3 and 4, we have

    supz∈Z |σ̂²n,u(z) − σu²(z)| = Op(ξk1(k1^{1/2} n1^{-1/2} + k1^{-ru}) + ξ²k1 k1^{-2rh}),

    and

    supz∈Z |σ̂²n,ε(z) − σε²(z)| = Op(ξk2(k2^{1/2} n2^{-1/2} + k2^{-rε}) + ξ²k2(k2^{-2rh} + n1^{-1})).

    Remark 5. Under Assumption 4(iv) and

    ξk1 k1^{-ru} + ξk2 k2^{-rε} + ξ²k2 n1^{-1} = o(1), (28)

    Lemma 3 implies that

    supz∈Z |σ̂²n,u(z) − σu²(z)| = op(1) and supz∈Z |σ̂²n,ε(z) − σε²(z)| = op(1), (29)

    which means that σ̂²n,u(z) and σ̂²n,ε(z) are consistent estimators of σu²(z) and σε²(z) under the uniform metric.

    Theorem 5. Under (28) and Assumptions 1, 2(i)-(iv), 3 and 4, we have

    supz∈Z |ŵ*n(z) − w*n(z)| = Op(δw,n),

    where δw,n = maxj=1,2(ξkj kj^{1/2} nj^{-1/2} + ξ²kj kj^{-2rh}) + ξk1 k1^{-ru} + ξk2 k2^{-rε} + ξ²k2 n1^{-1}.

    Remark 6. When power series are used as the basis functions Pk(z), we have ξkj ≤ Ckj. Then the convergence rate δw,n simplifies to

    δw,n = maxj=1,2(kj^{3/2} nj^{-1/2} + kj^{2-2rh}) + k1^{1-ru} + k2^{1-rε} + k2² n1^{-1}.

    Hence in this case δw,n = o(1) if maxj=1,2 kj³ nj^{-1} + k2² n1^{-1} = o(1), rh > 1, ru > 1 and rε > 1. The condition δw,n = O(n1^{-1/4} + n2^{-1/4}) holds when

    maxj=1,2 kj⁶ nj^{-1} + k2^{8/3} n1^{-1} = O(1) and maxj=1,2 nj^{1/4} kj^{2-2rh} + n1^{1/4} k1^{1-ru} + n2^{1/4} k2^{1-rε} = O(1). (30)

    Moreover, (k1 + k2)δ²w,n = o(1) holds under (30) and k2² n1^{-1} = o(1).

    Remark 7. When splines or trigonometric functions are used as the basis functions Pk(z), we have ξkj ≤ Ckj^{1/2}. Then the convergence rate δw,n simplifies to

    δw,n = maxj=1,2(kj nj^{-1/2} + kj^{1-2rh}) + k1^{1/2-ru} + k2^{1/2-rε} + k2 n1^{-1}.

    Hence in this case δw,n = o(1) if maxj=1,2 kj² nj^{-1} + k2² n1^{-1} = o(1), rh > 1/2, ru > 1/2 and rε > 1/2. The condition δw,n = O(n1^{-1/4} + n2^{-1/4}) holds when

    maxj=1,2 kj⁴ nj^{-1} + k2^{8/3} n1^{-1} = O(1) and maxj=1,2 nj^{1/4} kj^{1-2rh} + n1^{1/4} k1^{1/2-ru} + n2^{1/4} k2^{1/2-rε} = O(1). (31)

    Moreover, (k1 + k2)δ²w,n = o(1) holds under (31) and k2² n1^{-1} = o(1).

    6 Monte Carlo Simulation

    In this section, we study the finite sample performances of the MD estimator and the proposed inference method. The simulated data are generated from the following model:

    Yi = g(Xi, θ0) + vi, (32)

    where Yi, Xi and vi are scalar random variables, and g(Xi, θ0) is specified as

    g(Xi, θ0) = Xiθ0 in Model 1, and g(Xi, θ0) = log(1 + Xi²θ0) in Model 2, (33)

    where θ0 = 1 is the unknown parameter. To generate the simulated data, we first generate (X*1,i, X*2,i, vi)′ from the joint normal distribution with mean zero and identity variance-covariance matrix. Let

    Zi = X*2,i(1 + X*2,i²)^{-1/2} and Xi = Zi + X*1,i log(Zi²). (34)

    We assume that (Yi, Zi) are observed together and that (Xi, Zi) are observed together. We generate the first data set {(Yi, Zi)}i∈I1 with sample size n1, and then independently generate the second data set {(Xi, Zi)}i∈I2 with sample size n2. As both the magnitudes of n1 and n2 and their relative magnitude are important for the finite sample properties of the MD estimator, we consider two sampling schemes: equal sampling and unequal sampling. In the equal sampling scheme, we set n1 = n2 = n0, where n0 starts at 50 and increases in increments of 50 up to 1000. In the unequal sampling scheme, we set n1 + n2 = 1000, where n1 starts at 100 and increases in increments of 50 up to 900. For each combination of n1 and n2, we generate 10000 simulated samples to evaluate the performances of the MD estimator and the proposed inference procedure.


  • Figure 6.1. Properties of the MD and the Imputation Estimators (n1 = n2)

    In addition to the MD estimator, we study two alternative estimators based on data imputation. The first estimator (which is called the X-imputed estimator in this section) is defined as

    θ̂X,n1 = arg minθ∈Θ n1^{-1} Σi∈I1 (Yi − g(X̂i, θ))², (35)

    where X̂i = n2^{-1} Pk2(Zi)′ Q^{-1}n2,k2 Σj∈I2 Xj Pk2(Zj) for any i ∈ I1 is the predicted value of Xi in the first data set based on nonparametric regression. The second estimator (which is called the Y-imputed estimator in this section) is defined as

    θ̂Y,n2 = arg minθ∈Θ n2^{-1} Σi∈I2 (Ŷi − g(Xi, θ))², (36)

    where Ŷi = n1^{-1} Pk1(Zi)′ Q^{-1}n1,k1 Σj∈I1 Yj Pk1(Zj) for any i ∈ I2 is the predicted value of Yi in the second data set based on nonparametric regression. In the simulation studies, we set k1 = k2 = 5 and Pk1(Z) = Pk2(Z) = (1, Z, Z², Z³, Z⁴)′. The minimization problems in the MD estimation and in the nonlinear regressions (35) and (36) are solved by grid search over Θ = [0, 2] with equally spaced grid points at intervals of 0.001.
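As a point of reference, one simulated sample of the equal-sampling design can be replicated in a few lines for Model 1. The script below follows (32)-(34) and the identity-weighted MD estimation with the stated basis (1, Z, Z², Z³, Z⁴) and grid search over [0, 2]; it runs a single replication rather than the paper's 10000, and the sample sizes are an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(2016)
theta0, n1, n2 = 1.0, 1000, 1000

def simulate(n):
    """Draw (X, Z, v) following (34), with (X1*, X2*, v) ~ N(0, I3)."""
    x1s, x2s, v = rng.standard_normal((3, n))
    z = x2s / np.sqrt(1.0 + x2s ** 2)
    x = z + x1s * np.log(z ** 2)
    return x, z, v

def P5(z):
    """The basis used in the simulations: P_k(Z) = (1, Z, Z^2, Z^3, Z^4)."""
    return np.vander(z, 5, increasing=True)

def series_predict(z_fit, v, z_new):
    """Series regression of v on P5(z_fit), evaluated at z_new."""
    beta = np.linalg.lstsq(P5(z_fit), v, rcond=None)[0]
    return P5(z_new) @ beta

x_a, z1, v1 = simulate(n1)           # data set 1: only (Y_i, Z_i) are "observed"
y = x_a * theta0 + v1                # Model 1: g(X, theta) = X * theta
x2, z2, _ = simulate(n2)             # data set 2: only (X_i, Z_i) are "observed"

z_all = np.concatenate([z1, z2])
h_hat = series_predict(z1, y, z_all)     # \hat h_{n1}, cf. (5)
m_hat = series_predict(z2, x2, z_all)    # since g is linear in theta in Model 1,
                                         # \hat phi_{n2}(z, t) = t * m_hat, cf. (6)
grid = np.arange(0.0, 2.0001, 0.001)     # grid search over Theta = [0, 2]
obj = [np.mean((h_hat - t * m_hat) ** 2) for t in grid]
theta_md = grid[int(np.argmin(obj))]
print(theta_md)
```

With moderate sample sizes the identity-weighted MD estimate lands near θ0 = 1, in line with the consistency pattern reported in Figure 6.1.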

    Figure 6.2. Properties of the MD and the Imputation Estimators (n1 + n2 = 1000)

The finite sample properties of the identity weighted MD estimator (green dashed line), the optimal weighted MD estimator (black solid line), the X-imputed estimator (blue dotted line) and the Y-imputed estimator (red dash-dotted line) are presented in Figures 6.1 and 6.2. In Figure 6.1, we see that the bias and variance of the two MD estimators converge to zero as both n_1 and n_2 grow. The optimal weighted MD estimator has smaller bias and smaller variance, and hence smaller RMSE, than the identity weighted MD estimator. The improvement of the optimal MD estimator over the identity weighted MD estimator is particularly evident in model 1. The X-imputed estimator has almost the same finite sample bias and variance as the identity weighted MD estimator in the linear model (i.e., model 1). However, it has a large, non-vanishing finite sample bias in model 2, which indicates that


the X-imputed estimator may be inconsistent in general nonlinear models. The Y-imputed estimator has a large, non-vanishing finite sample bias in both model 1 and model 2, which shows that it may be inconsistent in general. The finite sample performance of the MD estimators and the two imputed estimators under the unequal sampling scheme is presented in Figure 6.2. In this figure, we see that when n_1 (or n_2) is small, the finite sample bias and variance of the MD estimators are large regardless of how large n_2 (or n_1) is. This means that the dominant component of the estimation error of the MD estimator comes from the part estimated with the smaller sample, as implied by Theorem 2.

Figure 6.3. Properties of the Confidence Intervals (n1 = n2)

The finite sample properties of the inference procedures based on the identity weighted MD estimator and the optimal weighted MD estimator are presented in Figures 6.3 and 6.4. In Figure 6.3, we see that the finite sample coverage probabilities of the confidence intervals based on the MD estimators converge to the nominal level 0.90 as both n_1 and n_2 increase to 1000. In model 1, the coverage probability of the confidence interval based on the optimal MD estimator is closer to the nominal level than that based on the identity weighted MD estimator for all sample sizes we considered. In model 2, the confidence interval based on the optimal MD estimator is slightly worse than that based on the identity weighted MD estimator when the sample sizes n_1 and n_2 are small, while the coverage probabilities of the two confidence intervals are nearly identical and close to the nominal level once n_1 and n_2 exceed 250. In both models, the average length of the confidence interval based on the optimal MD estimator is much smaller than that of the interval based on the identity weighted MD estimator, because the optimal MD estimator has smaller variance. The finite sample performances of the confidence intervals based on the MD estimators under the unequal sampling scheme


are presented in Figure 6.4. In this figure, we see that when n_1 (or n_2) is small, the coverage probabilities of the confidence intervals of the two MD estimators deviate from the nominal level. The inference procedure based on the identity weighted MD estimator performs poorly in model 1 when the sample size n_2 is small, regardless of the size of the other sample n_1. From Figures 6.3 and 6.4, we also see that the average length of the confidence intervals based on the optimally weighted MD estimator is smaller than that based on the identity weighted MD estimator.

Figure 6.4. Properties of the Confidence Intervals (n1 + n2 = 1000)
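The coverage probabilities and average lengths reported in Figures 6.3 and 6.4 are computed from the simulated samples in the standard way; a small helper, with variable names of our own choosing:

```python
import numpy as np
from statistics import NormalDist

def ci_coverage(estimates, std_errors, theta0, level=0.90):
    """Empirical coverage probability and average length of the two-sided
    confidence intervals theta_hat +/- z * se across simulated samples."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.645 for level = 0.90
    lo = estimates - z * std_errors
    hi = estimates + z * std_errors
    coverage = float(np.mean((lo <= theta0) & (theta0 <= hi)))
    avg_length = float(np.mean(hi - lo))
    return coverage, avg_length
```

With 10000 simulated samples, the Monte Carlo standard error of a coverage estimate near the 0.90 nominal level is about 0.003, so deviations larger than roughly 0.01 are statistically meaningful.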

    7 Conclusion

This paper studies the estimation and inference of nonlinear econometric models when the economic variables of the models are contained in different data sets. We propose a semiparametric MD estimator based on conditional moment restrictions with common conditioning variables that appear in the different data sets. The MD estimator is shown to be consistent and asymptotically normal. We provide the specific form of the optimal weight for the MD estimation, and show that the optimally weighted MD estimator has the smallest asymptotic variance among all MD estimators. A consistent estimator of the variance-covariance matrix of the MD estimator, and hence an inference procedure for the unknown parameter, is also provided. The finite sample performance of the MD estimator and the inference procedure is investigated in simulation studies.


References

    [1] Donald WK Andrews. Asymptotic normality of series estimators for nonparametric and semiparametric regression models. Econometrica: Journal of the Econometric Society, pages 307–345, 1991.

[2] Joshua D Angrist and Alan B Krueger. The effect of age at school entry on educational attainment: an application of instrumental variables with moments from two samples. Journal of the American Statistical Association, 87(418):328–336, 1992.

    [3] Manuel Arellano and Costas Meghir. Female labour supply and on-the-job search: an empirical model estimated using complementary data sets. The Review of Economic Studies, 59(3):537–559, 1992.

    [4] Orazio P Attanasio and Guglielmo Weber. Consumption growth, the interest rate and aggregation. The Review of Economic Studies, 60(3):631–649, 1993.

    [5] Richard Blundell, Luigi Pistaferri, and Ian Preston. Consumption inequality and partial insurance. American Economic Review, 98(5):1887–1921, 2008.

[6] Martin Browning, Angus Deaton, and Margaret Irish. A profitable approach to labor supply and commodity demands over the life-cycle. Econometrica: Journal of the Econometric Society, pages 503–543, 1985.

    [7] Martin Browning and Søren Leth-Petersen. Imputing consumption from income and wealth information. The Economic Journal, 113(488), 2003.

[8] Xiaohong Chen. Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics, 6:5549–5632, 2007.

[9] John H Cochrane. A simple test of consumption insurance. Journal of Political Economy, 99(5):957–976, 1991.

[10] Yanqin Fan, Robert Sherman, and Matthew Shum. Identifying treatment effects under data combination. Econometrica, 82(2):811–822, 2014.

    [11] Yanqin Fan, Robert Sherman, and Matthew Shum. Estimation and inference in an ecological inference model. Journal of Econometric Methods, 5(1):17–48, 2016.

    [12] Robert E Hall and Frederic S Mishkin. The sensitivity of consumption to transitory income: Estimates from panel data on households. Econometrica, 50(2):461–481, 1982.

[13] Fumio Hayashi, Joseph Altonji, and Laurence Kotlikoff. Risk-sharing between and within families. Econometrica: Journal of the Econometric Society, pages 261–294, 1996.

[14] Erik Hurst and Frank P Stafford. Home is where the equity is: Mortgage refinancing and household consumption. Journal of Money, Credit, and Banking, 36(6):985–1014, 2004.

    [15] N Anders Klevmarken. Missing variables and two-stage least-squares estimation from more than one data set. Technical report, Research Institute of Industrial Economics, 1982.

    [16] Robert Martin. Consumption, durable goods, and transaction costs. 2003.


[17] Whitney K Newey. Convergence rates and asymptotic normality for series estimators. Journal of Econometrics, 79(1):147–168, 1997.

[18] Geert Ridder and Robert Moffitt. The econometrics of data combination. Handbook of Econometrics, 6:5469–5547, 2007.

[19] David E Runkle. Liquidity constraints and the permanent-income hypothesis: Evidence from panel data. Journal of Monetary Economics, 27(1):73–98, 1991.

[20] John Shea. Myopia, liquidity constraints, and aggregate consumption: a simple test. Journal of Money, Credit and Banking, 27(3):798–805, 1995.

    [21] Jonathan Skinner. A superior measure of consumption from the panel study of income dynamics. Economics Letters, 23(2):213–216, 1987.

    [22] Andreas Waldkirch, Serena Ng, and Donald Cox. Intergenerational linkages in consumption behavior. Journal of Human Resources, 39(2):355–381, 2004.

    [23] Stephen P Zeldes. Optimal consumption with stochastic income: Deviations from certainty equivalence. The Quarterly Journal of Economics, 104(2):275–298, 1989.

[24] James P Ziliak. Does the choice of consumption measure matter? An application to the permanent-income hypothesis. Journal of Monetary Economics, 41(1):201–216, 1998.


APPENDIX

    A Proof of the Main Results in Section 3

Proof. [Proof of Theorem 1] Define the empirical criterion function of the MD estimation problem as

  \hat L_n(\theta) = n^{-1}\sum_{i\in I} \hat w_n(Z_i)\big(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta)\big)^2 for any \theta \in \Theta. (37)

By Assumptions 2(iii) and 2(v),

  \inf_{\{\theta\in\Theta:\,\|\theta-\theta_0\|\ge\varepsilon\}} L_n^*(\theta) \ge \eta_{C,\varepsilon}, (38)

where \eta_{C,\varepsilon} = C\eta_\varepsilon > 0 is a fixed constant which depends only on \varepsilon. (38) implies that \theta_0 is uniquely identified as the minimizer of L_n^*(\theta). Hence, to prove the consistency of \hat\theta_n, it is sufficient to show that

  \sup_{\theta\in\Theta} |\hat L_n(\theta) - L_n^*(\theta)| = o_p(1). (39)

Note that we can decompose \hat L_n(\theta) as

  \hat L_n(\theta) = n^{-1}\sum_{i\in I} \hat w_n(Z_i)\big(|\hat h_{n_1}(Z_i) - h_0(Z_i)|^2 + |\hat\phi_{n_2}(Z_i,\theta) - \phi(Z_i,\theta)|^2 + |h_0(Z_i) - \phi(Z_i,\theta)|^2\big)
    - 2n^{-1}\sum_{i\in I} \hat w_n(Z_i)(\hat h_{n_1}(Z_i) - h_0(Z_i))(\hat\phi_{n_2}(Z_i,\theta) - \phi(Z_i,\theta))
    - 2n^{-1}\sum_{i\in I} \hat w_n(Z_i)(\hat\phi_{n_2}(Z_i,\theta) - \phi(Z_i,\theta))(h_0(Z_i) - \phi(Z_i,\theta))
    + 2n^{-1}\sum_{i\in I} \hat w_n(Z_i)(\hat h_{n_1}(Z_i) - h_0(Z_i))(h_0(Z_i) - \phi(Z_i,\theta)). (40)

Using Assumption 1(i), one can apply Rudelson's law of large numbers for matrices (see, e.g., Lemma 6.2 in Belloni et al. (2015)) to get

  \|Q_{n,k_j} - Q_{k_j}\| = O_p(n^{-1/2}\xi_{k_j}(\log k_j)^{1/2}) and \|Q_{n_j,k_j} - Q_{k_j}\| = O_p(n_j^{-1/2}\xi_{k_j}(\log k_j)^{1/2}), (41)

where Q_{n,k_j} = n^{-1}\sum_{i\in I} P_{k_j}(Z_i)P_{k_j}'(Z_i), Q_{n_j,k_j} = n_j^{-1}\sum_{i\in I_j} P_{k_j}(Z_i)P_{k_j}'(Z_i), and the convergence is under the operator norm of a matrix. By (41) and Assumptions 1(iii) and 1(v),

    C−1 ≤ λmin(Qn,kj ) ≤ λmax(Qn,kj ) ≤ C and C−1 ≤ λmin(Qnj ,kj ) ≤ λmax(Qnj ,kj ) ≤ C, (42)

with probability approaching 1. Under Assumption 1 and (42), (A.2) in the proof of Theorem 1 in Newey (1997) implies that

  \|\hat\beta_{k_1,n_1} - \beta_{h,k_1}\|^2 = O_p(k_1 n_1^{-1} + k_1^{-2r_h}). (43)
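The matrix law of large numbers in (41), and the resulting eigenvalue bounds in (42), can be illustrated numerically: the operator-norm gap between the sample second-moment matrix of the basis and its population counterpart shrinks as the sample grows, while the eigenvalues stay bounded. A rough sketch; the uniform design and power basis here are our assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4  # number of series terms

def second_moment(n):
    """Sample second-moment matrix Q_{n,k} of the power basis, Z ~ U[-1, 1]."""
    P = np.vander(rng.uniform(-1.0, 1.0, n), k, increasing=True)
    return P.T @ P / n

# proxy for the population matrix Q_k via a very large sample
Q = second_moment(500_000)

# operator-norm gaps ||Q_{n,k} - Q_k|| at increasing n, as in (41)
gaps = [np.linalg.norm(second_moment(n) - Q, 2) for n in (100, 1_000, 10_000)]

# eigenvalues bounded away from 0 and infinity, as in (42)
eigs = np.linalg.eigvalsh(second_moment(1_000))
```

The gap at n = 10000 is an order of magnitude smaller than at n = 100, consistent with the n^{-1/2} rate.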


By the triangle inequality,

  n^{-1}\sum_{i\in I} |\hat h_{n_1}(Z_i) - h_0(Z_i)|^2
    \le 2n^{-1}\sum_{i\in I} |\hat h_{n_1}(Z_i) - h_{0,k_1}(Z_i)|^2 + 2n^{-1}\sum_{i\in I} |h_{0,k_1}(Z_i) - h_0(Z_i)|^2
    \le 2(\hat\beta_{k_1,n_1} - \beta_{h,k_1})' Q_{k_1,n} (\hat\beta_{k_1,n_1} - \beta_{h,k_1}) + 2\sup_{z\in\mathcal Z} |h_{0,k_1}(z) - h_0(z)|^2
    \le 2\lambda_{\max}(Q_{k_1,n}) \|\hat\beta_{k_1,n_1} - \beta_{h,k_1}\|^2 + 2\sup_{z\in\mathcal Z} |h_{0,k_1}(z) - h_0(z)|^2
    = O_p(k_1 n_1^{-1} + k_1^{-2r_h}) = o_p(1), (44)

where the first equality is by Assumption 1(iv), (42) and (43), and the second equality is by Assumption 1(v). By the triangle inequality and Assumption 2(v),

  \sup_{z\in\mathcal Z} |\hat w_n(z)| \le \sup_{z\in\mathcal Z} |\hat w_n(z) - w_n(z)| + \sup_{z\in\mathcal Z} |w_n(z)| < 2C (45)

with probability approaching 1. By (44) and (45),

  n^{-1}\sum_{i\in I} \hat w_n(Z_i)|\hat h_{n_1}(Z_i) - h_0(Z_i)|^2 \le \sup_{z\in\mathcal Z}|\hat w_n(z)|\, n^{-1}\sum_{i\in I} |\hat h_{n_1}(Z_i) - h_0(Z_i)|^2 = o_p(1). (46)

By (45) and Assumption 2(ii),

  \sup_{\theta\in\Theta} n^{-1}\sum_{i\in I} \hat w_n(Z_i)|\hat\phi_{n_2}(Z_i,\theta) - \phi(Z_i,\theta)|^2 \le \sup_{z\in\mathcal Z}|\hat w_n(z)| \sup_{\theta\in\Theta} n^{-1}\sum_{i\in I} |\hat\phi_{n_2}(Z_i,\theta) - \phi(Z_i,\theta)|^2 = o_p(1). (47)

Using (44), (45), Assumption 2(ii) and the Cauchy-Schwarz inequality, we get

  \sup_{\theta\in\Theta} \Big| n^{-1}\sum_{i\in I} \hat w_n(Z_i)(\hat h_{n_1}(Z_i) - h_0(Z_i))(\hat\phi_{n_2}(Z_i,\theta) - \phi(Z_i,\theta)) \Big|
    \le \sup_{z\in\mathcal Z}|\hat w_n(z)| \sqrt{n^{-1}\sum_{i\in I} |\hat h_{n_1}(Z_i) - h_0(Z_i)|^2} \sqrt{\sup_{\theta\in\Theta} n^{-1}\sum_{i\in I} |\hat\phi_{n_2}(Z_i,\theta) - \phi(Z_i,\theta)|^2} = o_p(1). (48)

By Assumption 1(ii), E[h_0^2(Z)] < C, which together with Assumption 2(i) implies that

  \sup_{\theta\in\Theta} E[|h_0(Z) - \phi(Z,\theta)|^2] \le 2E[h_0^2(Z)] + 2\sup_{\theta\in\Theta} E[\phi^2(Z,\theta)] < C. (49)


By (49), Assumptions 2(iv) and 2(v),

  \sup_{\theta\in\Theta} n^{-1}\sum_{i\in I} |h_0(Z_i) - \phi(Z_i,\theta)|^2
    \le C \sup_{\theta\in\Theta} n^{-1}\sum_{i\in I} w_n(Z_i)|h_0(Z_i) - \phi(Z_i,\theta)|^2
    \le (C + o_p(1)) \sup_{\theta\in\Theta} E[w_n(Z_i)|h_0(Z_i) - \phi(Z_i,\theta)|^2]
    \le (C + o_p(1)) \sup_{\theta\in\Theta} E[|h_0(Z_i) - \phi(Z_i,\theta)|^2] = O_p(1). (50)

Using (45), (50), Assumptions 2(ii) and 2(iv), and the Cauchy-Schwarz inequality, we get

  \sup_{\theta\in\Theta} \Big| n^{-1}\sum_{i\in I} \hat w_n(Z_i)(\hat\phi_{n_2}(Z_i,\theta) - \phi(Z_i,\theta))(h_0(Z_i) - \phi(Z_i,\theta)) \Big|
    \le \sup_{z\in\mathcal Z}|\hat w_n(z)| \sup_{\theta\in\Theta} \sqrt{n^{-1}\sum_{i\in I}|\hat\phi_{n_2}(Z_i,\theta) - \phi(Z_i,\theta)|^2} \sqrt{n^{-1}\sum_{i\in I}|h_0(Z_i) - \phi(Z_i,\theta)|^2} = o_p(1). (51)

Similarly, using (44), (45), (50), Assumption 2(iv) and the Cauchy-Schwarz inequality, we get

  \sup_{\theta\in\Theta} \Big| n^{-1}\sum_{i\in I} \hat w_n(Z_i)(\hat h_{n_1}(Z_i) - h_0(Z_i))(h_0(Z_i) - \phi(Z_i,\theta)) \Big| = o_p(1). (52)

Collecting the results in (40), (46), (47), (48), (51) and (52), we get

  \sup_{\theta\in\Theta} \Big| \hat L_n(\theta) - n^{-1}\sum_{i\in I} \hat w_n(Z_i)|h_0(Z_i) - \phi(Z_i,\theta)|^2 \Big| = o_p(1). (53)

By (50) and Assumption 2(v),

  \sup_{\theta\in\Theta} \Big| n^{-1}\sum_{i\in I} (\hat w_n(Z_i) - w_n(Z_i))|h_0(Z_i) - \phi(Z_i,\theta)|^2 \Big|
    \le \sup_{z\in\mathcal Z} |\hat w_n(z) - w_n(z)| \sup_{\theta\in\Theta} n^{-1}\sum_{i\in I} |h_0(Z_i) - \phi(Z_i,\theta)|^2 = o_p(1), (54)

which together with (53) and Assumption 2(iv) implies

  \sup_{\theta\in\Theta} |\hat L_n(\theta) - L_n^*(\theta)| \le \sup_{\theta\in\Theta} |L_n(\theta) - L_n^*(\theta)| + o_p(1) = o_p(1). (55)

    This proves (39) and hence the claim of the theorem.

Lemma 4. Under Assumptions 1(i), 1(iii), 1(v) and 3(i), we have

  \sup_{\theta\in N_{\delta_n}} n^{-1}\sum_{i\in I} \|\hat\phi_{\theta\theta,n_2}(Z_i,\theta)\|^2 = O_p(1).


Proof. [Proof of Lemma 4] By definition,

  \hat\phi_{\theta\theta,n_2}(z,\theta) = n_2^{-1} P_{k_2}'(z) Q_{n_2,k_2}^{-1} \sum_{i\in I_2} P_{k_2}(Z_i) g_{\theta\theta}(X_i,\theta).

Let g_{\theta_{j_1}\theta_{j_2}}(X_i,\theta) denote the (j_1,j_2)-th component of g_{\theta\theta}(X_i,\theta), for any j_1 = 1,\dots,d_\theta and any j_2 = 1,\dots,d_\theta, and let

  \hat\phi_{\theta_{j_1}\theta_{j_2},n_2}(z,\theta) = n_2^{-1} P_{k_2}'(z) Q_{n_2,k_2}^{-1} P_{n_2,k_2}' g_{\theta_{j_1}\theta_{j_2},n_2}(\theta), where g_{\theta_{j_1}\theta_{j_2},n_2}(\theta) = (g_{\theta_{j_1}\theta_{j_2}}(X_i,\theta))_{i\in I_2}'.

Then by definition,

  n^{-1}\sum_{i\in I} \hat\phi_{\theta_{j_1}\theta_{j_2},n_2}^2(Z_i,\theta)
    = n_2^{-2}\, g_{\theta_{j_1}\theta_{j_2},n_2}(\theta)' P_{n_2,k_2} Q_{n_2,k_2}^{-1} Q_{n,k_2} Q_{n_2,k_2}^{-1} P_{n_2,k_2}' g_{\theta_{j_1}\theta_{j_2},n_2}(\theta)
    \le \frac{\lambda_{\max}(Q_{n,k_2})}{\lambda_{\min}(Q_{n_2,k_2})} \frac{g_{\theta_{j_1}\theta_{j_2},n_2}(\theta)' P_{n_2,k_2}(P_{n_2,k_2}'P_{n_2,k_2})^{-1}P_{n_2,k_2}' g_{\theta_{j_1}\theta_{j_2},n_2}(\theta)}{n_2}
    \le \frac{\lambda_{\max}(Q_{n,k_2})}{\lambda_{\min}(Q_{n_2,k_2})} n_2^{-1}\sum_{i\in I_2} (g_{\theta_{j_1}\theta_{j_2}}(X_i,\theta))^2,

which together with (42) (which holds under Assumptions 1(i), 1(iii) and 1(v)) and Assumption 3(i) implies that

  \sup_{\theta\in N_{\delta_n}} n^{-1}\sum_{i\in I} \hat\phi_{\theta_{j_1}\theta_{j_2},n_2}^2(Z_i,\theta) \le \frac{\lambda_{\max}(Q_{n,k_2})}{\lambda_{\min}(Q_{n_2,k_2})} \sup_{\theta\in N_{\delta_n}} n_2^{-1}\sum_{i\in I_2} (g_{\theta_{j_1}\theta_{j_2}}(X_i,\theta))^2 = O_p(1)

for any j_1 = 1,\dots,d_\theta and any j_2 = 1,\dots,d_\theta. This finishes the proof.
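The last inequality in the proof bounds a quadratic form in the projection matrix M = P(P'P)^{-1}P' by the raw sum of squares, which works because M is idempotent with eigenvalues in {0, 1}. A quick numerical check of this property, using an arbitrary polynomial design of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 5
P = np.vander(rng.standard_normal(n), k, increasing=True)  # series design matrix
M = P @ np.linalg.solve(P.T @ P, P.T)                      # projection onto span(P)
g = rng.standard_normal(n)

idempotent = np.allclose(M @ M, M)       # M^2 = M
bounded = g @ M @ g <= g @ g + 1e-9      # g'Mg never exceeds g'g
```

Equality g'Mg = g'g holds only when g already lies in the column space of P.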

Lemma 5. Under Assumptions 1(i), 1(iii), 1(iv), 1(v), 3(v) and 3(vii), we have

  n^{-1}\sum_{i\in I} (\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))^2 = o_p(n_1^{-1/2} + n_2^{-1/2}).

Proof. [Proof of Lemma 5] By (44) (which holds under Assumptions 1(i), 1(iii), 1(iv) and 1(v)),

  n^{-1}\sum_{i\in I} (\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))^2
    \le 2n^{-1}\sum_{i\in I} (\hat h_{n_1}(Z_i) - h_0(Z_i))^2 + 2n^{-1}\sum_{i\in I} (\hat\phi_{n_2}(Z_i,\theta_0) - h_0(Z_i))^2
    = 2n^{-1}\sum_{i\in I} (\hat\phi_{n_2}(Z_i,\theta_0) - h_0(Z_i))^2 + O_p(k_1 n_1^{-1} + k_1^{-2r_h}). (56)

Let \hat\beta_{\phi,n_2} = (P_{n_2,k_2}' P_{n_2,k_2})^{-1} P_{n_2,k_2}' g_{n_2}(\theta_0), where g_{n_2}(\theta_0) = (g(X_i,\theta_0))_{i\in I_2}'. Then

  \|\hat\beta_{\phi,n_2} - \beta_{h,k_2}\|^2 = (g_{n_2}(\theta_0) - H_{n_2,k_2})' P_{n_2,k_2} (P_{n_2,k_2}' P_{n_2,k_2})^{-2} P_{n_2,k_2}' (g_{n_2}(\theta_0) - H_{n_2,k_2})
    \le \frac{(g_{n_2}(\theta_0) - H_{n_2,k_2})' P_{n_2,k_2} (P_{n_2,k_2}' P_{n_2,k_2})^{-1} P_{n_2,k_2}' (g_{n_2}(\theta_0) - H_{n_2,k_2})}{n_2 \lambda_{\min}(Q_{n_2,k_2})}, (57)


where H_{n_2,k_2} = (h_{0,k_2}(Z_i))_{i\in I_2}'. By Assumption 1(iv),

  n_2^{-1} (H_{n_2} - H_{n_2,k_2})' P_{n_2,k_2} (P_{n_2,k_2}' P_{n_2,k_2})^{-1} P_{n_2,k_2}' (H_{n_2} - H_{n_2,k_2})
    \le n_2^{-1} (H_{n_2} - H_{n_2,k_2})'(H_{n_2} - H_{n_2,k_2}) = O(k_2^{-2r_h}), (58)

where H_{n_2} = (h_0(Z_i))_{i\in I_2}'. By Assumptions 1(i), 1(iii) and 3(v),

  E[ n_2^{-1} (g_{n_2}(\theta_0) - H_{n_2})' P_{n_2,k_2} (P_{n_2,k_2}' P_{n_2,k_2})^{-1} P_{n_2,k_2}' (g_{n_2}(\theta_0) - H_{n_2}) \mid \{Z_i\}_{i\in I_2}]
    = n_2^{-1}\, tr\big( (P_{n_2,k_2}' P_{n_2,k_2})^{-1} P_{n_2,k_2}' E[(g_{n_2}(\theta_0) - H_{n_2})(g_{n_2}(\theta_0) - H_{n_2})' \mid \{Z_i\}_{i\in I_2}] P_{n_2,k_2} \big)
    \le \sup_{z\in\mathcal Z} \sigma_\varepsilon^2(z)\, k_2 n_2^{-1} = O(k_2 n_2^{-1}),

which together with the Markov inequality implies that

  n_2^{-1} (g_{n_2}(\theta_0) - H_{n_2})' P_{n_2,k_2} (P_{n_2,k_2}' P_{n_2,k_2})^{-1} P_{n_2,k_2}' (g_{n_2}(\theta_0) - H_{n_2}) = O_p(k_2 n_2^{-1}). (59)

Combining the results in (57), (58) and (59), and then applying (42), we get

  \|\hat\beta_{\phi,n_2} - \beta_{h,k_2}\|^2 = O_p(k_2 n_2^{-1} + k_2^{-2r_h}). (60)

By (42) and (60),

  n^{-1}\sum_{i\in I} (\hat\phi_{n_2}(Z_i,\theta_0) - h_{0,k_2}(Z_i))^2 = (\hat\beta_{\phi,n_2} - \beta_{h,k_2})' Q_{n,k_2} (\hat\beta_{\phi,n_2} - \beta_{h,k_2})
    \le \lambda_{\max}(Q_{n,k_2}) \|\hat\beta_{\phi,n_2} - \beta_{h,k_2}\|^2 = O_p(k_2 n_2^{-1} + k_2^{-2r_h}), (61)

which together with (56) and Assumption 3(vii) proves the claim of the lemma.

Lemma 6. Under Assumptions 1(i), 2(v), 3(iv) and 3(vi), we have

  n^{-1} \phi_{w\theta,n} P_{n,k_1} (P_{n,k_1}' P_{n,k_1})^{-1} P_{n,k_1}' \phi_{w\theta,n}' = E[w_n^2(Z)\phi_\theta(Z,\theta_0)\phi_\theta'(Z,\theta_0)] + o_p(1).

Proof. [Proof of Lemma 6] For j = 1,\dots,d_\theta, let \phi_{w\theta_j,k_1,n}(z,\theta_0) = P_{k_1}'(z)\beta_{w\phi_j,k_1}, \phi_{w\theta_j,k_1,n} = (\phi_{w\theta_j,k_1,n}(Z_i,\theta_0))_{i\in I} and \phi_{w\theta,k_1,n} = (\phi_{w\theta_j,k_1,n}')_{j=1,\dots,d_\theta}'. For ease of notation, define M_{k_1,n} = P_{n,k_1}(P_{n,k_1}' P_{n,k_1})^{-1} P_{n,k_1}'. By definition,

  \phi_{w\theta,n} M_{k_1,n} \phi_{w\theta,n}' = \phi_{w\theta,k_1,n} M_{k_1,n} \phi_{w\theta,k_1,n}' + (\phi_{w\theta,n} - \phi_{w\theta,k_1,n}) M_{k_1,n} (\phi_{w\theta,n} - \phi_{w\theta,k_1,n})'
    + (\phi_{w\theta,n} - \phi_{w\theta,k_1,n}) M_{k_1,n} \phi_{w\theta,k_1,n}' + \phi_{w\theta,k_1,n} M_{k_1,n} (\phi_{w\theta,n} - \phi_{w\theta,k_1,n})'. (62)

For any j = 1,\dots,d_\theta, let \phi_{w\theta_j,n} denote the j-th row of \phi_{w\theta,n}. By the Cauchy-Schwarz inequality, for any j_1 = 1,\dots,d_\theta and any j_2 = 1,\dots,d_\theta,

  |n^{-1}(\phi_{w\theta_{j_1},n} - \phi_{w\theta_{j_1},k_1,n}) M_{k_1,n} (\phi_{w\theta_{j_2},n} - \phi_{w\theta_{j_2},k_1,n})'|^2
    \le n^{-1}(\phi_{w\theta_{j_1},n} - \phi_{w\theta_{j_1},k_1,n}) M_{k_1,n} (\phi_{w\theta_{j_1},n} - \phi_{w\theta_{j_1},k_1,n})'
      \times n^{-1}(\phi_{w\theta_{j_2},n} - \phi_{w\theta_{j_2},k_1,n}) M_{k_1,n} (\phi_{w\theta_{j_2},n} - \phi_{w\theta_{j_2},k_1,n})'
    \le n^{-1}\sum_{i\in I} |w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0) - \phi_{w\theta_{j_1},k_1,n}(Z_i,\theta_0)|^2
      \times n^{-1}\sum_{i\in I} |w_n(Z_i)\phi_{\theta_{j_2}}(Z_i,\theta_0) - \phi_{w\theta_{j_2},k_1,n}(Z_i,\theta_0)|^2 = o(1), (63)

where the last equality is by Assumption 3(vi) and the fact that M_{k_1,n} is an idempotent matrix. (63) then implies that

  n^{-1}(\phi_{w\theta,n} - \phi_{w\theta,k_1,n}) M_{k_1,n} (\phi_{w\theta,n} - \phi_{w\theta,k_1,n})' = o(1). (64)

For any j_1 = 1,\dots,d_\theta and any j_2 = 1,\dots,d_\theta, by definition we can write

  n^{-1}\phi_{w\theta_{j_1},k_1,n} M_{k_1,n} \phi_{w\theta_{j_2},k_1,n}'
    = n^{-1}\beta_{w\phi_{j_1},k_1}' P_{n,k_1}' P_{n,k_1}(P_{n,k_1}'P_{n,k_1})^{-1}P_{n,k_1}' \phi_{w\theta_{j_2},k_1,n}'
    = n^{-1}\sum_{i\in I} \phi_{w\theta_{j_1},k_1,n}(Z_i,\theta_0)\phi_{w\theta_{j_2},k_1,n}(Z_i,\theta_0)
    = n^{-1}\sum_{i\in I} w_n^2(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)\phi_{\theta_{j_2}}(Z_i,\theta_0)
    + n^{-1}\sum_{i\in I} (\phi_{w\theta_{j_1},k_1,n}(Z_i,\theta_0) - w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0))\phi_{w\theta_{j_2},k_1,n}(Z_i,\theta_0)
    + n^{-1}\sum_{i\in I} w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)(\phi_{w\theta_{j_2},k_1,n}(Z_i,\theta_0) - w_n(Z_i)\phi_{\theta_{j_2}}(Z_i,\theta_0)). (65)

By Assumptions 1(i), 2(v), 3(iv) and the Markov inequality, we have

  n^{-1}\sum_{i\in I} w_n^2(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)\phi_{\theta_{j_2}}(Z_i,\theta_0) - E[w_n^2(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)\phi_{\theta_{j_2}}(Z_i,\theta_0)] = O_p(n^{-1/2}), (66)

where under Assumptions 2(v) and 3(iv),

  E[|w_n^2(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)\phi_{\theta_{j_2}}(Z_i,\theta_0)|^2] < C. (67)

By Assumption 3(vi),

  \Big| n^{-1}\sum_{i\in I} (\phi_{w\theta_{j_1},k_1,n}(Z_i,\theta_0) - w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0))(\phi_{w\theta_{j_2},k_1,n}(Z_i,\theta_0) - w_n(Z_i)\phi_{\theta_{j_2}}(Z_i,\theta_0)) \Big|
    \le \max_{j=1,\dots,d_\theta} \sup_{z\in\mathcal Z} |\phi_{w\theta_j,k_1,n}(z,\theta_0) - w_n(z)\phi_{\theta_j}(z,\theta_0)|^2 = o(1), (68)


which implies that

  n^{-1}\sum_{i\in I} (\phi_{w\theta_{j_1},k_1,n}(Z_i,\theta_0) - w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0))\phi_{w\theta_{j_2},k_1,n}(Z_i,\theta_0)
    = n^{-1}\sum_{i\in I} (\phi_{w\theta_{j_1},k_1,n}(Z_i,\theta_0) - w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)) w_n(Z_i)\phi_{\theta_{j_2}}(Z_i,\theta_0) + o_p(1). (69)

By the Cauchy-Schwarz inequality,

  \Big| n^{-1}\sum_{i\in I} (\phi_{w\theta_{j_1},k_1,n}(Z_i,\theta_0) - w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)) w_n(Z_i)\phi_{\theta_{j_2}}(Z_i,\theta_0) \Big|^2
    \le n^{-1}\sum_{i\in I} |\phi_{w\theta_{j_1},k_1,n}(Z_i,\theta_0) - w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)|^2 \; n^{-1}\sum_{i\in I} w_n^2(Z_i)\phi_{\theta_{j_2}}^2(Z_i,\theta_0) = o_p(1), (70)

where the equality is by Assumption 3(vi), (66) and (67). Combining the results in (69) and (70), we get

  n^{-1}\sum_{i\in I} (\phi_{w\theta_{j_1},k_1,n}(Z_i,\theta_0) - w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0))\phi_{w\theta_{j_2},k_1,n}(Z_i,\theta_0) = o_p(1). (71)

Similarly, we can show that

  n^{-1}\sum_{i\in I} w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)(\phi_{w\theta_{j_2},k_1,n}(Z_i,\theta_0) - w_n(Z_i)\phi_{\theta_{j_2}}(Z_i,\theta_0)) = o_p(1). (72)

Collecting the results in (65), (66), (71) and (72), we have

  n^{-1}\phi_{w\theta_{j_1},k_1,n} M_{k_1,n} \phi_{w\theta_{j_2},k_1,n}' = E[w_n^2(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)\phi_{\theta_{j_2}}(Z_i,\theta_0)] + o_p(1) (73)

for any j_1 = 1,\dots,d_\theta and any j_2 = 1,\dots,d_\theta, which implies that

  n^{-1}\phi_{w\theta,k_1,n} M_{k_1,n} \phi_{w\theta,k_1,n}' = E[w_n^2(Z_i)\phi_\theta(Z_i,\theta_0)\phi_\theta'(Z_i,\theta_0)] + o_p(1). (74)

By the Cauchy-Schwarz inequality, for any j_1 = 1,\dots,d_\theta and any j_2 = 1,\dots,d_\theta,

  |n^{-1}(\phi_{w\theta_{j_1},n} - \phi_{w\theta_{j_1},k_1,n}) M_{k_1,n} \phi_{w\theta_{j_2},k_1,n}'|^2
    \le n^{-1}(\phi_{w\theta_{j_1},n} - \phi_{w\theta_{j_1},k_1,n}) M_{k_1,n} (\phi_{w\theta_{j_1},n} - \phi_{w\theta_{j_1},k_1,n})' \; n^{-1}\phi_{w\theta_{j_2},k_1,n} M_{k_1,n} \phi_{w\theta_{j_2},k_1,n}' = o_p(1), (75)

where the equality is by (63), (67) and (73). (75) then implies that

  n^{-1}(\phi_{w\theta,n} - \phi_{w\theta,k_1,n}) M_{k_1,n} \phi_{w\theta,k_1,n}' = o_p(1) (76)

and similarly

  n^{-1}\phi_{w\theta,k_1,n} M_{k_1,n} (\phi_{w\theta,n} - \phi_{w\theta,k_1,n})' = o_p(1). (77)

Combining the results in (62), (64), (74), (76) and (77), we immediately get the claimed result.


Proof. [Proof of Theorem 2] By the definition of \hat\theta_n, we have the first order condition

  n^{-1}\sum_{i\in I} \hat w_n(Z_i)(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\hat\theta_n))\hat\phi_{\theta,n_2}(Z_i,\hat\theta_n) = 0. (78)

Applying a first order expansion to (78), we get

  0 = n^{-1}\sum_{i\in I} \hat w_n(Z_i)(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))\hat\phi_{\theta,n_2}(Z_i,\theta_0)
    + n^{-1}\sum_{i\in I} \hat w_n(Z_i)\hat\phi_{\theta,n_2}(Z_i,\tilde\theta_n)\hat\phi_{\theta,n_2}(Z_i,\tilde\theta_n)'(\hat\theta_n - \theta_0)
    + n^{-1}\sum_{i\in I} \hat w_n(Z_i)(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\tilde\theta_n))\hat\phi_{\theta\theta,n_2}(Z_i,\tilde\theta_n)(\hat\theta_n - \theta_0), (79)

where \tilde\theta_n lies between \hat\theta_n and \theta_0 and may differ across rows. For any j = 1,\dots,d_\theta, by the mean value expansion and the Cauchy-Schwarz inequality,

  |\hat\phi_{\theta_j,n_2}(Z_i,\tilde\theta_{j,n}) - \hat\phi_{\theta_j,n_2}(Z_i,\theta_0)| \le \sup_{\theta\in N_{\delta_n}} \|\hat\phi_{\theta_j\theta,n_2}(Z_i,\theta)\| \, \|\tilde\theta_{j,n} - \theta_0\|,

which together with the triangle inequality and Lemma 4 implies that

  n^{-1}\sum_{i\in I} (\hat\phi_{\theta_j,n_2}(Z_i,\tilde\theta_{j,n}) - \hat\phi_{\theta_j,n_2}(Z_i,\theta_0))^2 \le \sup_{\theta\in N_{\delta_n}} n^{-1}\sum_{i\in I} \|\hat\phi_{\theta_j\theta,n_2}(Z_i,\theta)\|^2 \, \|\tilde\theta_{j,n} - \theta_0\|^2 = o_p(1). (80)

By Assumption 3(iii) and (80),

  n^{-1}\sum_{i\in I} (\hat\phi_{\theta_j,n_2}(Z_i,\tilde\theta_{j,n}) - \phi_{\theta_j}(Z_i,\theta_0))^2
    \le 2n^{-1}\sum_{i\in I} (\hat\phi_{\theta_j,n_2}(Z_i,\tilde\theta_{j,n}) - \hat\phi_{\theta_j,n_2}(Z_i,\theta_0))^2 + 2n^{-1}\sum_{i\in I} (\hat\phi_{\theta_j,n_2}(Z_i,\theta_0) - \phi_{\theta_j}(Z_i,\theta_0))^2 = o_p(1). (81)

By Assumption 3(iv) and the Markov inequality,

  n^{-1}\sum_{i\in I} (\phi_{\theta_j}(Z_i,\theta_0))^2 = O_p(1), (82)

which together with (81) implies that

  n^{-1}\sum_{i\in I} (\hat\phi_{\theta_j,n_2}(Z_i,\tilde\theta_n))^2 = O_p(1) (83)

for any j = 1,\dots,d_\theta. For any j_1 = 1,\dots,d_\theta and any j_2 = 1,\dots,d_\theta, we can use the triangle inequality and


the Cauchy-Schwarz inequality, Assumption 2(v), (81), (82) and (83) to deduce that

  \Big| n^{-1}\sum_{i\in I} \hat w_n(Z_i)\hat\phi_{\theta_{j_1},n_2}(Z_i,\tilde\theta_{j_1,n})\hat\phi_{\theta_{j_2},n_2}(Z_i,\tilde\theta_{j_2,n}) - n^{-1}\sum_{i\in I} w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)\phi_{\theta_{j_2}}(Z_i,\theta_0) \Big|
    \le \Big| n^{-1}\sum_{i\in I} (\hat w_n(Z_i) - w_n(Z_i))\hat\phi_{\theta_{j_1},n_2}(Z_i,\tilde\theta_{j_1,n})\hat\phi_{\theta_{j_2},n_2}(Z_i,\tilde\theta_{j_2,n}) \Big|
    + \Big| n^{-1}\sum_{i\in I} w_n(Z_i)(\hat\phi_{\theta_{j_1},n_2}(Z_i,\tilde\theta_{j_1,n}) - \phi_{\theta_{j_1}}(Z_i,\theta_0))\hat\phi_{\theta_{j_2},n_2}(Z_i,\tilde\theta_{j_2,n}) \Big|
    + \Big| n^{-1}\sum_{i\in I} w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)(\hat\phi_{\theta_{j_2},n_2}(Z_i,\tilde\theta_{j_2,n}) - \phi_{\theta_{j_2}}(Z_i,\theta_0)) \Big| = o_p(1). (84)

Under Assumptions 1(i), 2(v) and 3(iv), we can use the Markov inequality to deduce that

  n^{-1}\sum_{i\in I} w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)\phi_{\theta_{j_2}}(Z_i,\theta_0) = E[w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)\phi_{\theta_{j_2}}(Z_i,\theta_0)] + o_p(1) (85)

for any j_1 = 1,\dots,d_\theta and any j_2 = 1,\dots,d_\theta. Collecting the results in (84) and (85), we get

  n^{-1}\sum_{i\in I} \hat w_n(Z_i)\hat\phi_{\theta,n_2}(Z_i,\tilde\theta_n)\hat\phi_{\theta,n_2}(Z_i,\tilde\theta_n)' = H_{0,n} + o_p(1). (86)

By a second order Taylor expansion, the triangle inequality and the Cauchy-Schwarz inequality,

  n^{-1}\sum_{i\in I} |\phi(Z_i,\tilde\theta_{j,n}) - \phi(Z_i,\theta_0)|^2
    \le 2n^{-1}\sum_{i\in I} \|\phi_\theta(Z_i,\theta_0)\|^2 \|\tilde\theta_{j,n} - \theta_0\|^2 + 2^{-1}\sup_{\theta\in N_{\delta_n}} n^{-1}\sum_{i\in I} \|\phi_{\theta\theta}(Z_i,\theta)\|^2 \|\tilde\theta_{j,n} - \theta_0\|^4 = o_p(1), (87)

where the equality is by (82), Lemma 4 and \|\tilde\theta_{j,n} - \theta_0\| = o_p(1) for any j = 1,\dots,d_\theta. (87) together with Assumption 2(ii) then implies that

  n^{-1}\sum_{i\in I} |\hat\phi_{n_2}(Z_i,\tilde\theta_{j,n}) - \phi(Z_i,\theta_0)|^2
    \le 2n^{-1}\sum_{i\in I} |\hat\phi_{n_2}(Z_i,\tilde\theta_{j,n}) - \phi(Z_i,\tilde\theta_{j,n})|^2 + 2n^{-1}\sum_{i\in I} |\phi(Z_i,\tilde\theta_{j,n}) - \phi(Z_i,\theta_0)|^2 = o_p(1) (88)


for any j = 1,\dots,d_\theta. By the Cauchy-Schwarz inequality,

  \Big| n^{-1}\sum_{i\in I} \hat w_n(Z_i)(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\tilde\theta_n))\hat\phi_{\theta\theta,n_2}(Z_i,\tilde\theta_n) \Big|
    \le 2\sup_{z\in\mathcal Z}|\hat w_n(z)| \max_{j=1,\dots,d_\theta}\sqrt{n^{-1}\sum_{i\in I}\|\hat\phi_{\theta\theta,n_2}(Z_i,\tilde\theta_{j,n})\|^2}
      \times \Big(\sqrt{n^{-1}\sum_{i\in I}|\hat h_{n_1}(Z_i) - h_0(Z_i)|^2} + \max_{j=1,\dots,d_\theta}\sqrt{n^{-1}\sum_{i\in I}|\hat\phi_{n_2}(Z_i,\tilde\theta_{j,n}) - \phi(Z_i,\theta_0)|^2}\Big) = o_p(1), (89)

where the last equality is by (44), (45), Lemma 4 and (88). By definition,

  n^{-1}\sum_{i\in I} \hat w_n(Z_i)(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))\hat\phi_{\theta,n_2}(Z_i,\theta_0)
    = n^{-1}\sum_{i\in I} w_n(Z_i)(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))\phi_\theta(Z_i,\theta_0)
    + n^{-1}\sum_{i\in I} (\hat w_n(Z_i) - w_n(Z_i))(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))\hat\phi_{\theta,n_2}(Z_i,\theta_0)
    + n^{-1}\sum_{i\in I} w_n(Z_i)(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))(\hat\phi_{\theta,n_2}(Z_i,\theta_0) - \phi_\theta(Z_i,\theta_0)). (90)

By Assumption 3(iii), (82) and the triangle inequality,

  n^{-1}\sum_{i\in I} \|\hat\phi_{\theta,n_2}(Z_i,\theta_0)\|^2 \le 2n^{-1}\sum_{i\in I} \|\hat\phi_{\theta,n_2}(Z_i,\theta_0) - \phi_\theta(Z_i,\theta_0)\|^2 + 2n^{-1}\sum_{i\in I} \|\phi_\theta(Z_i,\theta_0)\|^2 = O_p(1). (91)

By the triangle inequality, the Cauchy-Schwarz inequality, (91), Lemma 5, and Assumptions 2(v) and 3(vii),

  \Big| n^{-1}\sum_{i\in I} (\hat w_n(Z_i) - w_n(Z_i))(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))\hat\phi_{\theta,n_2}(Z_i,\theta_0) \Big|
    \le \sup_{z\in\mathcal Z}|\hat w_n(z) - w_n(z)| \sqrt{n^{-1}\sum_{i\in I}(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))^2} \sqrt{n^{-1}\sum_{i\in I}\|\hat\phi_{\theta,n_2}(Z_i,\theta_0)\|^2}
    = o_p(n_1^{-1/2} + n_2^{-1/2}). (92)

By the triangle inequality, the Cauchy-Schwarz inequality, Lemma 5, and Assumptions 2(v), 3(iii) and 3(vii),

  \Big| n^{-1}\sum_{i\in I} w_n(Z_i)(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))(\hat\phi_{\theta,n_2}(Z_i,\theta_0) - \phi_\theta(Z_i,\theta_0)) \Big|
    \le \sup_{z\in\mathcal Z}|w_n(z)| \sqrt{n^{-1}\sum_{i\in I}(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))^2} \sqrt{n^{-1}\sum_{i\in I}\|\hat\phi_{\theta,n_2}(Z_i,\theta_0) - \phi_\theta(Z_i,\theta_0)\|^2}
    = o_p(n_1^{-1/2} + n_2^{-1/2}). (93)


Combining the results in (90), (92) and (93), we get

  n^{-1}\sum_{i\in I} \hat w_n(Z_i)(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))\hat\phi_{\theta,n_2}(Z_i,\theta_0)
    = n^{-1}\sum_{i\in I} w_n(Z_i)(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))\phi_\theta(Z_i,\theta_0) + o_p(n_1^{-1/2} + n_2^{-1/2})
    = n^{-1}\sum_{i\in I} w_n(Z_i)(\hat h_{n_1}(Z_i) - h_0(Z_i))\phi_\theta(Z_i,\theta_0)
    - n^{-1}\sum_{i\in I} w_n(Z_i)(\hat\phi_{n_2}(Z_i,\theta_0) - h_0(Z_i))\phi_\theta(Z_i,\theta_0) + o_p(n_1^{-1/2} + n_2^{-1/2}). (94)

By the definition of \hat h_{n_1}(Z_i), we can write

  n^{-1}\sum_{i\in I} w_n(Z_i)(\hat h_{n_1}(Z_i) - h_0(Z_i))\phi_\theta(Z_i,\theta_0)
    = \frac{\phi_{w\theta,n} P_{n,k_1} (P_{n_1,k_1}' P_{n_1,k_1})^{-1} P_{n_1,k_1}' U_{n_1}}{n}
    + \frac{\phi_{w\theta,n} P_{n,k_1} (P_{n_1,k_1}' P_{n_1,k_1})^{-1} P_{n_1,k_1}' (H_{n_1} - H_{n_1,k_1})}{n} + \frac{\phi_{w\theta,n}(H_n - H_{n,k_1})}{n}, (95)

where H_n = (h_0(Z_i))_{i\in I}', H_{n_1} = (h_0(Z_i))_{i\in I_1}', U_{n_1} = (u_i)_{i\in I_1}', H_{n,k_1} = (h_{0,k_1}(Z_i))_{i\in I}', H_{n_1,k_1} = (h_{0,k_1}(Z_i))_{i\in I_1}', and \phi_{w\theta,n} = (w_n(Z_i)\phi_\theta(Z_i,\theta_0))_{i\in I}. By the Cauchy-Schwarz inequality,

  \Big\| \frac{\phi_{w\theta,n} P_{n,k_1} (P_{n_1,k_1}' P_{n_1,k_1})^{-1} P_{n_1,k_1}' (H_{n_1} - H_{n_1,k_1})}{n} \Big\|^2
    \le \frac{\phi_{w\theta,n} P_{n,k_1} P_{n,k_1}' \phi_{w\theta,n}'}{n^2} (H_{n_1} - H_{n_1,k_1})' P_{n_1,k_1} (P_{n_1,k_1}' P_{n_1,k_1})^{-2} P_{n_1,k_1}' (H_{n_1} - H_{n_1,k_1})
    \le \frac{\lambda_{\max}(Q_{n,k_1})}{\lambda_{\min}(Q_{n_1,k_1})} \frac{\phi_{w\theta,n} P_{n,k_1} (P_{n,k_1}' P_{n,k_1})^{-1} P_{n,k_1}' \phi_{w\theta,n}'}{n}
      \times \frac{(H_{n_1} - H_{n_1,k_1})' P_{n_1,k_1} (P_{n_1,k_1}' P_{n_1,k_1})^{-1} P_{n_1,k_1}' (H_{n_1} - H_{n_1,k_1})}{n_1}
    \le \sup_{z\in\mathcal Z} w_n^2(z) \frac{\lambda_{\max}(Q_{k_1,n})}{\lambda_{\min}(Q_{k_1,n_1})} n^{-1}\sum_{i\in I}\|\phi_\theta(Z_i,\theta_0)\|^2 \times n_1^{-1}\sum_{i\in I_1}|h_{0,k_1}(Z_i) - h_0(Z_i)|^2 = O_p(k_1^{-2r_h}), (96)

where the last equality is by (42), (82), Assumptions 1(iv) and 2(v). By the triangle inequality,

  \Big\| n^{-1}\sum_{i\in I} w_n(Z_i)(h_{0,k_1}(Z_i) - h_0(Z_i))\phi_\theta(Z_i,\theta_0) \Big\|
    \le \sup_{z\in\mathcal Z}|w_n(z)|\, n^{-1}\sum_{i\in I} \|(h_{0,k_1}(Z_i) - h_0(Z_i))\phi_\theta(Z_i,\theta_0)\|
    \le C \sup_{z\in\mathcal Z}|h_0(z) - h_{0,k_1}(z)|\, n^{-1}\sum_{i\in I} \|\phi_\theta(Z_i,\theta_0)\| = O_p(k_1^{-r_h}), (97)

where the last equality is by (82), Assumptions 1(iv) and 2(v). Combining the results in (95), (96) and (97),


we get

  n^{-1}\sum_{i\in I} w_n(Z_i)(\hat h_{n_1}(Z_i) - h_0(Z_i))\phi_\theta(Z_i,\theta_0)
    = \frac{\phi_{w\theta,n} P_{n,k_1} (P_{n_1,k_1}' P_{n_1,k_1})^{-1}}{n} \sum_{i\in I_1} u_i P_{k_1}(Z_i) + O_p(k_1^{-r_h}). (98)

By the definition of \hat\phi_{n_2}(Z_i,\theta_0), we can write

  n^{-1}\sum_{i\in I} w_n(Z_i)(\hat\phi_{n_2}(Z_i,\theta_0) - h_0(Z_i))\phi_\theta(Z_i,\theta_0)
    = \frac{\phi_{w\theta,n} P_{n,k_2} (P_{n_2,k_2}' P_{n_2,k_2})^{-1}}{n} \sum_{i\in I_2} \varepsilon_i P_{k_2}(Z_i)
    + \frac{\phi_{w\theta,n} P_{n,k_2} (P_{n_2,k_2}' P_{n_2,k_2})^{-1} P_{n_2,k_2}' (H_{n_2} - H_{n_2,k_2})}{n} + \frac{\phi_{w\theta,n}(H_n - H_{n,k_2})}{n}, (99)

where H_{n_2,k_2} = (h_{0,k_2}(Z_i))_{i\in I_2}' and H_{n,k_2} = (h_{0,k_2}(Z_i))_{i\in I}'. Using arguments similar to those for (96) and (97), we get

  \frac{\phi_{w\theta,n} P_{n,k_2} (P_{n_2,k_2}' P_{n_2,k_2})^{-1} P_{n_2,k_2}' (H_{n_2} - H_{n_2,k_2})}{n} = O_p(k_2^{-r_h}) and \frac{\phi_{w\theta,n}(H_n - H_{n,k_2})}{n} = O_p(k_2^{-r_h}),

which together with (99) implies that

  n^{-1}\sum_{i\in I} w_n(Z_i)(\hat\phi_{n_2}(Z_i,\theta_0) - h_0(Z_i))\phi_\theta(Z_i,\theta_0)
    = \frac{\phi_{w\theta,n} P_{n,k_2} (P_{n_2,k_2}' P_{n_2,k_2})^{-1}}{n} \sum_{i\in I_2} \varepsilon_i P_{k_2}(Z_i) + O_p(k_2^{-r_h}). (100)

By Assumption 3(v), C^{-1} Q_{n_1,k_1} \le n_1^{-1}\sum_{i\in I_1} \sigma_u^2(Z_i) P_{k_1}(Z_i) P_{k_1}'(Z_i) \le C Q_{n_1,k_1}, which together with (42) implies that

    C−1 < λmin(Qn1,u) ≤ λmax(Qn1,u) < C, (101)

    with probability approaching 1. Similarly, we can show that

    C−1 ≤ λmin(Qn2,ε) ≤ λmax(Qn2,ε) ≤ C (102)

    with probability approaching 1.


Under the i.i.d. assumption,

  E\Big[ \Big\| \frac{\phi_{w\theta,n} P_{n,k_1} (P_{n_1,k_1}' P_{n_1,k_1})^{-1}}{n} \sum_{i\in I_1} u_i P_{k_1}(Z_i) \Big\|^2 \,\Big|\, \{Z_i\}_{i\in I}\Big]
    = \frac{\phi_{w\theta,n} P_{n,k_1} (P_{n_1,k_1}' P_{n_1,k_1})^{-1} Q_{n_1,u} (P_{n_1,k_1}' P_{n_1,k_1})^{-1} P_{n,k_1}' \phi_{w\theta,n}'}{n^2 n_1^{-1}}
    \le \frac{C\lambda_{\max}(Q_{n_1,u})}{\lambda_{\min}^2(Q_{k_1,n_1})} \frac{\phi_{w\theta,n} P_{n,k_1} P_{n,k_1}' \phi_{w\theta,n}'}{n^2 n_1}
    \le \frac{\lambda_{\max}(Q_{n_1,u})\lambda_{\max}(Q_{n,k_1})}{n_1 \lambda_{\min}^2(Q_{n_1,k_1})} \frac{\phi_{w\theta,n} P_{n,k_1} (P_{n,k_1}' P_{n,k_1})^{-1} P_{n,k_1}' \phi_{w\theta,n}'}{n}
    \le \frac{\lambda_{\max}(Q_{n_1,u})\lambda_{\max}(Q_{n,k_1})}{n_1 \lambda_{\min}^2(Q_{n_1,k_1})} \sup_{z\in\mathcal Z} w_n^2(z)\, n^{-1}\sum_{i\in I} \|\phi_\theta(Z_i,\theta_0)\|^2 = O_p(n_1^{-1}), (103)

where the last equality is by (101), (42), (82), Assumptions 2(v) and 3(iv). Combined with the Markov inequality, (103) implies that

    where the last equality is by (101), (42), (82), Assumptions 2(v) and 3(iv). Combined with the Markov inequality, (103) implies that

    (P 0 )−1φwθ,nPn,k1 n1,k1 Pn1,k1 X −1/2 1 (104)uiPk1 (Zi) = Op(n ). n

    i∈I1

    Similarly, we can show that

    (P 0 )−1φwθ,nPn,k2 n2,k2 Pn2,k2 X −1/2 2 (105)εiPk2 (Zi) = Op(n ).

    e

    n i∈I2

By (86) and (89),

  n^{-1}\sum_{i\in I} \hat w_n(Z_i)\big(\hat\phi_{\theta,n_2}(Z_i,\tilde\theta_n)\hat\phi_{\theta,n_2}'(Z_i,\tilde\theta_n) + (\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\tilde\theta_n))\hat\phi_{\theta\theta,n_2}(Z_i,\tilde\theta_n)\big) = H_{0,n} + o_p(1), (106)

which together with (79) implies that

  [H_{0,n} + o_p(1)](\hat\theta_n - \theta_0) = -n^{-1}\sum_{i\in I} \hat w_n(Z_i)(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))\hat\phi_{\theta,n_2}(Z_i,\theta_0). (107)

By (94), (98) and (100), and Assumption 3(vii),

  n^{-1}\sum_{i\in I} \hat w_n(Z_i)(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))\hat\phi_{\theta,n_2}(Z_i,\theta_0)
    = \frac{\phi_{w\theta,n} P_{n,k_1} (P_{n_1,k_1}' P_{n_1,k_1})^{-1}}{n} \sum_{i\in I_1} u_i P_{k_1}(Z_i)
    - \frac{\phi_{w\theta,n} P_{n,k_2} (P_{n_2,k_2}' P_{n_2,k_2})^{-1}}{n} \sum_{i\in I_2} \varepsilon_i P_{k_2}(Z_i) + o_p(n_1^{-1/2} + n_2^{-1/2}), (108)

which together with (104) and (105) implies that

  n^{-1}\sum_{i\in I} \hat w_n(Z_i)(\hat h_{n_1}(Z_i) - \hat\phi_{n_2}(Z_i,\theta_0))\hat\phi_{\theta,n_2}(Z_i,\theta_0) = O_p(n_1^{-1/2} + n_2^{-1/2}). (109)

Using (107), (108) and (109), and then applying Assumption 3(ii), we get

  (\hat\theta_n - \theta_0) = -\frac{H_0^{-1}\phi_{w\theta,n} P_{n,k_1} Q_{n_1,k_1}^{-1}}{n n_1} \sum_{i\in I_1} u_i P_{k_1}(Z_i)
    + \frac{H_0^{-1}\phi_{w\theta,n} P_{n,k_2} Q_{n_2,k_2}^{-1}}{n n_2} \sum_{i\in I_2} \varepsilon_i P_{k_2}(Z_i) + o_p(n_1^{-1/2} + n_2^{-1/2}), (110)

which together with Assumption 3(ii), (104) and (105) implies that \hat\theta_n - \theta_0 = O_p(n_1^{-1/2} + n_2^{-1/2}). By Assumption 3(v), Q_{n_1,u} \ge C^{-1} Q_{n_1,k_1}, which implies that

  n_1\Sigma_{n_1} = \frac{\phi_{w\theta,n} P_{n,k_1} Q_{n_1,k_1}^{-1} Q_{n_1,u} Q_{n_1,k_1}^{-1} P_{n,k_1}' \phi_{w\theta,n}'}{n^2}
    \ge \frac{\phi_{w\theta,n} P_{n,k_1} Q_{n_1,k_1}^{-1} P_{n,k_1}' \phi_{w\theta,n}'}{C n^2 n_1^{-1}}
    \ge \frac{\lambda_{\min}(Q_{n,k_1})}{C\lambda_{\max}(Q_{n_1,k_1})} \frac{\phi_{w\theta,n} P_{n,k_1} (P_{n,k_1}' P_{n,k_1})^{-1} P_{n,k_1}' \phi_{w\theta,n}'}{n}
    \ge \frac{\lambda_{\min}(Q_{n,k_1})}{C\lambda_{\max}(Q_{n_1,k_1})} H_{0,n} + o_p(1), (111)

where the last inequality is by (42), Assumption 2(v) and Lemma 6. Using (42), (111) and Assumption 3(ii), we have

    λmin(n1Σn1 ) ≥ C−1 (112)

    with probability approaching 1. Similarly, we can show that

    λmin(n2Σn2 ) ≥ C−1 (113)

    with probability approaching 1. By (112),

  \lambda_{\max}\big( (n_1^{-1} + n_2^{-1})(\Sigma_{n_1} + \Sigma_{n_2})^{-1} \big) \le \lambda_{\min}^{-1}(n_1\Sigma_{n_1}) + \lambda_{\min}^{-1}(n_2\Sigma_{n_2}) \le 2C (114)

with probability approaching 1. Combining the results in (110), (114) and \lambda_{\min}(H_0) > 0 in Assumption 3(ii), we get

  (H_{0,n}(\Sigma_{n_1} + \Sigma_{n_2})^{-1}H_{0,n})^{1/2}(\hat\theta_n - \theta_0)
    = -\frac{(H_{0,n}(\Sigma_{n_1} + \Sigma_{n_2})^{-1}H_{0,n})^{1/2} H_{0,n}^{-1} \phi_{w\theta,n} P_{n,k_1} Q_{n_1,k_1}^{-1}}{n n_1} \sum_{i\in I_1} u_i P_{k_1}(Z_i)
    + \frac{(H_{0,n}(\Sigma_{n_1} + \Sigma_{n_2})^{-1}H_{0,n})^{1/2} H_{0,n}^{-1} \phi_{w\theta,n} P_{n,k_2} Q_{n_2,k_2}^{-1}}{n n_2} \sum_{i\in I_2} \varepsilon_i P_{k_2}(Z_i) + o_p(n_1^{-1/2} + n_2^{-1/2}). (115)

Define

  \omega_{i,n} = \begin{cases} -\dfrac{(H_{0,n}(\Sigma_{n_1}+\Sigma_{n_2})^{-1}H_{0,n})^{1/2} H_{0,n}^{-1} \phi_{w\theta,n} P_{n,k_1} Q_{n_1,k_1}^{-1} P_{k_1}(Z_i) u_i}{n n_1}, & 1 \le i \le n_1, \\ \dfrac{(H_{0,n}(\Sigma_{n_1}+\Sigma_{n_2})^{-1}H_{0,n})^{1/2} H_{0,n}^{-1} \phi_{w\theta,n} P_{n,k_2} Q_{n_2,k_2}^{-1} P_{k_2}(Z_i) \varepsilon_i}{n n_2}, & n_1 < i \le n. \end{cases}


Then by (110), we can write

  (H_{0,n}(\Sigma_{n_1} + \Sigma_{n_2})^{-1}H_{0,n})^{1/2}(\hat\theta_n - \theta_0) = \sum_{i=1}^{n} \omega_{i,n} + o_p(n_1^{-1/2} + n_2^{-1/2}). (116)

Let F_{i,n} be the sigma field generated by \{\omega_{1,n},\dots,\omega_{i,n},\{Z_i\}_{i\in I}\} for i = 1,\dots,n. Then under Assumption 1(i), E[\gamma_n'\omega_{i,n} \mid F_{i-1,n}] = 0, which means that \{\gamma_n'\omega_{i,n}\}_{i=1}^n is a martingale difference array. We next use the martingale CLT to show the claim. There are two sufficient conditions to verify:

  \sum_{i=1}^{n} E[(\gamma_n'\omega_{i,n})^2 \mid F_{i,n}] \to_p 1; and (117)

  \sum_{i=1}^{n} E[(\gamma_n'\omega_{i,n})^2 I\{|\gamma_n'\omega_{i,n}| > \varepsilon\} \mid F_{i,n}] \to_p 0 for all \varepsilon > 0. (118)

For ease of notation, define D_n = (H_{0,n}(\Sigma_{n_1}+\Sigma_{n_2})^{-1}H_{0,n})^{1/2} H_{0,n}^{-1}. By definition, we have

  \sum_{i=1}^{n} E[(\gamma_n'\omega_{i,n})^2 \mid F_{i,n}] = \gamma_n' \sum_{i=1}^{n} E[\omega_{i,n}\omega_{i,n}' \mid F_{i,n}] \gamma_n
    = \gamma_n' \frac{D_n \phi_{w\theta,n} P_{n,k_1} Q_{n_1,k_1}^{-1} Q_{n_1,u} Q_{n_1,k_1}^{-1} P_{n,k_1}' \phi_{w\theta,n}' D_n'}{n^2 n_1} \gamma_n
    + \gamma_n' \frac{D_n \phi_{w\theta,n} P_{n,k_2} Q_{n_2,k_2}^{-1} Q_{n_2,\varepsilon} Q_{n_2,k_2}^{-1} P_{n,k_2}' \phi_{w\theta,n}' D_n'}{n^2 n_2} \gamma_n
    = \gamma_n' D_n(\Sigma_{n_1} + \Sigma_{n_2}) D_n' \gamma_n = \gamma_n'\gamma_n = 1, (119)

    which proves (117). By the monotonicity of expectation,

$$\sum_{i=1}^{n}E\left[(\gamma_n'\omega_{i,n})^2 I\{|\gamma_n'\omega_{i,n}|>\varepsilon\}\mid\mathcal{F}_{i-1,n}\right]
\leq\frac{1}{\varepsilon^2}\sum_{i=1}^{n}E\left[(\gamma_n'\omega_{i,n})^4\mid\mathcal{F}_{i-1,n}\right]$$
$$=\frac{1}{\varepsilon^2}\sum_{i\in I_1}E\left[\left|\frac{\gamma_n'D_n\phi_{w\theta,n}P_{n,k_1}Q_{n_1,k_1}^{-1}P_{k_1}(Z_i)u_i}{nn_1}\right|^4\,\Big|\,\mathcal{F}_{i-1,n}\right]
+\frac{1}{\varepsilon^2}\sum_{i\in I_2}E\left[\left|\frac{\gamma_n'D_n\phi_{w\theta,n}P_{n,k_2}Q_{n_2,k_2}^{-1}P_{k_2}(Z_i)\varepsilon_i}{nn_2}\right|^4\,\Big|\,\mathcal{F}_{i-1,n}\right]. \quad (120)$$

By Assumptions 2(v) and 3(iv),

$$H_{0,n}=E\left[w_n(Z_i)\phi_\theta(Z_i,\theta_0)\phi_\theta'(Z_i,\theta_0)\right]\leq C. \quad (121)$$

By (66) in the proof of Lemma 6,

$$n^{-1}\phi_{w\theta,n}\phi_{w\theta,n}'=E\left[w_n^2(Z_i)\phi_\theta(Z_i,\theta_0)\phi_\theta'(Z_i,\theta_0)\right]+o_p(1), \quad (122)$$


which together with (121), Assumptions 2(v) and 3(ii) implies that

$$C^{-1}<\lambda_{\max}(n^{-1}\phi_{w\theta,n}\phi_{w\theta,n}')<C \quad (123)$$

with probability approaching 1. By (112),

$$\lambda_{\min}(n_1(\Sigma_{n_1}+\Sigma_{n_2}))\geq\lambda_{\min}(n_1\Sigma_{n_1})\geq C^{-1}. \quad (124)$$

For any $\gamma\in R^{d_\theta}$ with $\gamma'\gamma=1$, we have

$$\frac{\gamma'D_nD_n'\gamma}{n_1}=\frac{\gamma'(H_{0,n}(\Sigma_{n_1}+\Sigma_{n_2})^{-1}H_{0,n})^{1/2}H_{0,n}^{-2}(H_{0,n}(\Sigma_{n_1}+\Sigma_{n_2})^{-1}H_{0,n})^{1/2}\gamma}{n_1}$$
$$\leq\frac{1}{\lambda_{\min}^2(H_{0,n})}\cdot\frac{\gamma'H_{0,n}(\Sigma_{n_1}+\Sigma_{n_2})^{-1}H_{0,n}\gamma}{n_1}
\leq\frac{\gamma'H_{0,n}^2\gamma}{\lambda_{\min}^2(H_{0,n})\lambda_{\min}(n_1(\Sigma_{n_1}+\Sigma_{n_2}))}
\leq\frac{\lambda_{\max}^2(H_{0,n})}{\lambda_{\min}^2(H_{0,n})\lambda_{\min}(n_1(\Sigma_{n_1}+\Sigma_{n_2}))}, \quad (125)$$

which combined with Assumption 3(ii), (121) and (124) implies that

$$\lambda_{\max}(n_1^{-1}D_nD_n')\leq C \quad (126)$$

with probability approaching 1. By Assumption 4(v) and the Cauchy-Schwarz inequality,

$$\frac{1}{\varepsilon^2}\sum_{i\in I_1}E\left[\left|\frac{\gamma_n'D_n\phi_{w\theta,n}P_{n,k_1}Q_{n_1,k_1}^{-1}P_{k_1}(Z_i)u_i}{nn_1}\right|^4\,\Big|\,\mathcal{F}_{i-1,n}\right]
\leq\frac{C}{\varepsilon^2}\sum_{i\in I_1}\frac{\left|\gamma_n'D_n\phi_{w\theta,n}P_{n,k_1}Q_{n_1,k_1}^{-1}P_{k_1}(Z_i)\right|^4}{n^4n_1^4}$$
$$\leq\frac{C\xi_{k_1}^2}{\varepsilon^2}\cdot\frac{\gamma_n'D_n\phi_{w\theta,n}P_{n,k_1}Q_{n_1,k_1}^{-2}P_{n,k_1}'\phi_{w\theta,n}'D_n'\gamma_n}{n^4n_1^4}\sum_{i\in I_1}\left|\gamma_n'D_n\phi_{w\theta,n}P_{n,k_1}Q_{n_1,k_1}^{-1}P_{k_1}(Z_i)\right|^2$$
$$\leq\frac{C\xi_{k_1}^2\lambda_{\max}^2(Q_{n,k_1})}{\varepsilon^2\lambda_{\min}^3(Q_{n_1,k_1})}\cdot\frac{\left(\gamma_n'D_n(n^{-1}\phi_{w\theta,n}\phi_{w\theta,n}')D_n'\gamma_n\right)^2}{n_1^3}$$
$$\leq\frac{C\xi_{k_1}^2\lambda_{\max}^2(Q_{n,k_1})\lambda_{\max}^2(n^{-1}\phi_{w\theta,n}\phi_{w\theta,n}')}{\varepsilon^2\lambda_{\min}^3(Q_{n_1,k_1})}\left(\frac{\gamma_n'D_nD_n'\gamma_n}{n_1}\right)^2\frac{1}{n_1}=o_p(1), \quad (127)$$


where the last equality is by (42), (123), (126) and Assumption 1(v). Similarly, we can show that

$$\frac{1}{\varepsilon^2}\sum_{i\in I_2}E\left[\left|\frac{\gamma_n'D_n\phi_{w\theta,n}P_{n,k_2}Q_{n_2,k_2}^{-1}P_{k_2}(Z_i)\varepsilon_i}{nn_2}\right|^4\,\Big|\,\mathcal{F}_{i-1,n}\right]=o_p(1), \quad (128)$$

which together with (120) and (127) proves (118). As a result, the asymptotic normality of $\hat\theta_n$ follows by the martingale CLT.

    B Proof of the Main Results in Section 4

Lemma 7. Under Assumptions 1(i), 1(iii), 1(v), 2(v) and 3(iv)-3(vi), we have

$$n_1\Sigma_{n_1}=E\left[u^2w_n^2(Z)\phi_\theta(Z,\theta_0)\phi_\theta'(Z,\theta_0)\right]+o_p(1),\quad\text{and}\quad
n_2\Sigma_{n_2}=E\left[\varepsilon^2w_n^2(Z)\phi_\theta(Z,\theta_0)\phi_\theta'(Z,\theta_0)\right]+o_p(1).$$

Proof. [Proof of Lemma 7] For $j=1,\ldots,d_\theta$, let $\tilde\phi_{w\theta_j,k_1,n}(z)=P_{k_1}(z)'Q_{n_1,k_1}^{-1}n^{-1}P_{n,k_1}'\phi_{w\theta_j,n}'$, where $\phi_{w\theta_j,n}$ denotes the $j$-th row of $\phi_{w\theta,n}$. We first show that

$$n^{-1}\sum_{i\in I}\left|\tilde\phi_{w\theta_j,k_1,n}(Z_i)-w_n(Z_i)\phi_{\theta_j}(Z_i,\theta_0)\right|^2=o_p(1) \quad (129)$$

for any $j=1,\ldots,d_\theta$. Let $\tilde\beta_{w\phi_j,k_1}=Q_{n_1,k_1}^{-1}n^{-1}P_{n,k_1}'\phi_{w\theta_j,n}'$. Then

$$n^{-1}\sum_{i\in I}\left|\tilde\phi_{w\theta_j,k_1,n}(Z_i)-w_n(Z_i)\phi_{\theta_j}(Z_i,\theta_0)\right|^2
\leq 2n^{-1}\sum_{i\in I}\left|P_{k_1}'(Z_i)\tilde\beta_{w\phi_j,k_1}-P_{k_1}'(Z_i)\beta_{w\phi_j,k_1,n}\right|^2$$
$$+2n^{-1}\sum_{i\in I}\left|P_{k_1}'(Z_i)\beta_{w\phi_j,k_1,n}-w_n(Z_i)\phi_{\theta_j}(Z_i,\theta_0)\right|^2
=2(\tilde\beta_{w\phi_j,k_1}-\beta_{w\phi_j,k_1,n})'Q_{n,k_1}(\tilde\beta_{w\phi_j,k_1}-\beta_{w\phi_j,k_1,n})+o(1) \quad (130)$$

where the equality is by Assumption 3(vi). Moreover,

$$\tilde\beta_{w\phi_j,k_1}-\beta_{w\phi_j,k_1,n}=Q_{n_1,k_1}^{-1}n^{-1}P_{n,k_1}'(\phi_{w\theta_j,n}-\phi_{w\theta_j,n,k_1})' \quad (131)$$

where $\phi_{w\theta_j,n,k_1}=(P_{k_1}'(Z_i)\beta_{w\phi_j,k_1,n})_{i\in I}$, which implies that

$$(\tilde\beta_{w\phi_j,k_1}-\beta_{w\phi_j,k_1,n})'Q_{n,k_1}(\tilde\beta_{w\phi_j,k_1}-\beta_{w\phi_j,k_1,n})$$
$$\leq\frac{\lambda_{\max}^2(Q_{n,k_1})}{\lambda_{\min}^2(Q_{n_1,k_1})}\cdot\frac{(\phi_{w\theta_j,n}-\phi_{w\theta_j,n,k_1})P_{n,k_1}(P_{n,k_1}'P_{n,k_1})^{-1}P_{n,k_1}'(\phi_{w\theta_j,n}-\phi_{w\theta_j,n,k_1})'}{n}$$
$$\leq\frac{\lambda_{\max}^2(Q_{n,k_1})}{\lambda_{\min}^2(Q_{n_1,k_1})}\,n^{-1}\sum_{i\in I}\left|P_{k_1}'(Z_i)\beta_{w\phi_j,k_1,n}-w_n(Z_i)\phi_{\theta_j}(Z_i,\theta_0)\right|^2=o_p(1) \quad (132)$$

where the equality is by Assumption 3(vi) and (42). Combining the results in (130) and (132), we immediately get (129).


Recall that $\sigma_u^2(z)=E\left[u^2\mid Z=z\right]$. By definition,

$$n_1\Sigma_{n_1}=n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)\tilde\phi_{w\theta,k_1,n}(Z_i)\tilde\phi_{w\theta,k_1,n}(Z_i)' \quad (133)$$

where $\tilde\phi_{w\theta,k_1,n}(z)=(\tilde\phi_{w\theta_j,k_1,n}(z))_{j=1,\ldots,d_\theta}$. First note that

$$n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)\tilde\phi_{w\theta,k_1,n}(Z_i)\tilde\phi_{w\theta,k_1,n}(Z_i)'-n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)w_n^2(Z_i)\phi_\theta(Z_i,\theta_0)\phi_\theta(Z_i,\theta_0)'$$
$$=n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)\left(\tilde\phi_{w\theta,k_1,n}(Z_i)-w_n(Z_i)\phi_\theta(Z_i,\theta_0)\right)w_n(Z_i)\phi_\theta(Z_i,\theta_0)'$$
$$+n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)w_n(Z_i)\phi_\theta(Z_i,\theta_0)\left(\tilde\phi_{w\theta,k_1,n}(Z_i)-w_n(Z_i)\phi_\theta(Z_i,\theta_0)\right)'$$
$$+n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)\left(\tilde\phi_{w\theta,k_1,n}(Z_i)-w_n(Z_i)\phi_\theta(Z_i,\theta_0)\right)\left(\tilde\phi_{w\theta,k_1,n}(Z_i)-w_n(Z_i)\phi_\theta(Z_i,\theta_0)\right)'. \quad (134)$$

For any $j_1=1,\ldots,d_\theta$ and any $j_2=1,\ldots,d_\theta$, by the Cauchy-Schwarz inequality

$$\left|n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)\left(\tilde\phi_{w\theta_{j_1},k_1,n}(Z_i)-w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)\right)\left(\tilde\phi_{w\theta_{j_2},k_1,n}(Z_i)-w_n(Z_i)\phi_{\theta_{j_2}}(Z_i,\theta_0)\right)\right|^2$$
$$\leq n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)\left(\tilde\phi_{w\theta_{j_1},k_1,n}(Z_i)-w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)\right)^2\times n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)\left(\tilde\phi_{w\theta_{j_2},k_1,n}(Z_i)-w_n(Z_i)\phi_{\theta_{j_2}}(Z_i,\theta_0)\right)^2$$
$$\leq Cn_1^{-1}\sum_{i\in I_1}\left(\tilde\phi_{w\theta_{j_1},k_1,n}(Z_i)-w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)\right)^2\times n_1^{-1}\sum_{i\in I_1}\left(\tilde\phi_{w\theta_{j_2},k_1,n}(Z_i)-w_n(Z_i)\phi_{\theta_{j_2}}(Z_i,\theta_0)\right)^2=o_p(1) \quad (135)$$

where the second inequality is by Assumption 3(v) and the equality is by (129). (135) then implies that

$$n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)\left(\tilde\phi_{w\theta,k_1,n}(Z_i)-w_n(Z_i)\phi_\theta(Z_i,\theta_0)\right)\left(\tilde\phi_{w\theta,k_1,n}(Z_i)-w_n(Z_i)\phi_\theta(Z_i,\theta_0)\right)'=o_p(1). \quad (136)$$

For any $j_1=1,\ldots,d_\theta$ and any $j_2=1,\ldots,d_\theta$, by the Cauchy-Schwarz inequality

$$\left|n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)w_n(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)\left(\tilde\phi_{w\theta_{j_2},k_1,n}(Z_i)-w_n(Z_i)\phi_{\theta_{j_2}}(Z_i,\theta_0)\right)\right|^2$$
$$\leq n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)w_n^2(Z_i)\phi_{\theta_{j_1}}^2(Z_i,\theta_0)\times n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)\left(\tilde\phi_{w\theta_{j_2},k_1,n}(Z_i)-w_n(Z_i)\phi_{\theta_{j_2}}(Z_i,\theta_0)\right)^2$$
$$\leq Cn_1^{-1}\sum_{i\in I_1}\phi_{\theta_{j_1}}^2(Z_i,\theta_0)\times n_1^{-1}\sum_{i\in I_1}\left(\tilde\phi_{w\theta_{j_2},k_1,n}(Z_i)-w_n(Z_i)\phi_{\theta_{j_2}}(Z_i,\theta_0)\right)^2=o_p(1) \quad (137)$$

where the second inequality is by Assumptions 2(v) and 3(v), and the equality is by (129) and (82). (137) implies that

$$n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)\left(\tilde\phi_{w\theta,k_1,n}(Z_i)-w_n(Z_i)\phi_\theta(Z_i,\theta_0)\right)w_n(Z_i)\phi_\theta(Z_i,\theta_0)'=o_p(1) \quad (138)$$


and

$$n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)w_n(Z_i)\phi_\theta(Z_i,\theta_0)\left(\tilde\phi_{w\theta,k_1,n}(Z_i)-w_n(Z_i)\phi_\theta(Z_i,\theta_0)\right)'=o_p(1). \quad (139)$$

Combining the results in (134), (136), (138) and (139), we have

$$n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)\tilde\phi_{w\theta,k_1,n}(Z_i)\tilde\phi_{w\theta,k_1,n}(Z_i)'-n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)w_n^2(Z_i)\phi_\theta(Z_i,\theta_0)\phi_\theta(Z_i,\theta_0)'=o_p(1). \quad (140)$$

For any $j_1=1,\ldots,d_\theta$ and any $j_2=1,\ldots,d_\theta$,

$$E\left[\left(\sigma_u^2(Z_i)w_n^2(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)\phi_{\theta_{j_2}}(Z_i,\theta_0)\right)^2\right]
\leq CE\left[\left(\phi_{\theta_{j_1}}(Z_i,\theta_0)\phi_{\theta_{j_2}}(Z_i,\theta_0)\right)^2\right]
\leq C\left(E\left[\phi_{\theta_{j_1}}^4(Z_i,\theta_0)\right]E\left[\phi_{\theta_{j_2}}^4(Z_i,\theta_0)\right]\right)^{1/2}<C \quad (141)$$

where the first inequality is by Assumptions 2(v) and 3(v), the second inequality is by the Hölder inequality, and the last inequality is by Assumption 3(iv). The i.i.d. assumption together with (141) and the Markov inequality implies that

$$n_1^{-1}\sum_{i\in I_1}\sigma_u^2(Z_i)w_n^2(Z_i)\phi_{\theta_{j_1}}(Z_i,\theta_0)\phi_{\theta_{j_2}}(Z_i,\theta_0)-E\left[\sigma_u^2(Z)w_n^2(Z)\phi_{\theta_{j_1}}(Z,\theta_0)\phi_{\theta_{j_2}}(Z,\theta_0)\right]=o_p(1). \quad (142)$$

Collecting the results in (140) and (142), we have

$$n_1\Sigma_{n_1}=E\left[u^2w_n^2(Z)\phi_\theta(Z,\theta_0)\phi_\theta(Z,\theta_0)'\right]+o_p(1) \quad (143)$$

which proves the first claim of the lemma. The proof of the second claim of the lemma is similar and hence omitted.
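The key step (129) above is an $L^2$ convergence of a series least-squares projection onto a growing basis. A minimal sketch of that mechanism (hypothetical and not from the paper: the univariate target `f` stands in for $w_n(z)\phi_{\theta_j}(z,\theta_0)$, and a simple power-series basis is used):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
Z = rng.uniform(size=n)
f = np.exp(Z) * np.sin(3 * Z)  # smooth target, stands in for w_n(z) * phi_theta_j(z, theta_0)

def series_mse(k):
    # P_k(z): power-series basis of dimension k; the projection coefficients
    # beta = (P'P)^{-1} P' f mirror beta_tilde in the proof of Lemma 7.
    P = np.vander(Z, k, increasing=True)          # n x k design matrix, columns 1, z, ..., z^{k-1}
    beta, *_ = np.linalg.lstsq(P, f, rcond=None)
    return np.mean((P @ beta - f) ** 2)           # n^{-1} sum_i |P_k'(Z_i) beta - f(Z_i)|^2

mses = [series_mse(k) for k in (2, 4, 8)]
assert mses[0] > mses[1] > mses[2]                # in-sample MSE shrinks as k grows
print(mses)
```

As the basis dimension $k$ grows (subject to the rate conditions in the assumptions), the sample mean-squared projection error vanishes, which is the content of (129).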

Proof. [Proof of Lemma 1] By Lemma 7,

$$\Sigma_{n_1}+\Sigma_{n_2}=E\left[\left(\frac{\sigma_u^2(Z)}{n_1}+\frac{\sigma_\varepsilon^2(Z)}{n_2}\right)w_n^2(Z)\phi_\theta(Z,\theta_0)\phi_\theta'(Z,\theta_0)\right]+o_p(n_1^{-1}+n_2^{-1}). \quad (144)$$

By Assumptions 2(v), 3(ii) and 3(v),

$$E\left[\left(\frac{\sigma_u^2(Z)}{n_1}+\frac{\sigma_\varepsilon^2(Z)}{n_2}\right)w_n^2(Z)\phi_\theta(Z,\theta_0)\phi_\theta'(Z,\theta_0)\right]>C^{-1}(n_1^{-1}+n_2^{-1}) \quad (145)$$

which together with (144) implies that

$$\Sigma_{n_1}+\Sigma_{n_2}=E\left[\left(\frac{\sigma_u^2(Z)}{n_1}+\frac{\sigma_\varepsilon^2(Z)}{n_2}\right)w_n^2(Z)\phi_\theta(Z,\theta_0)\phi_\theta'(Z,\theta_0)\right](1+o_p(1)). \quad (146)$$

The claim of the lemma then follows by combining (146) and Assumption 3(ii).


Proof. [Proof of Theorem 3] Let $w_n^*(Z)=(n_1^{-1}+n_2^{-1})(n_1^{-1}\sigma_u^2(Z)+n_2^{-1}\sigma_\varepsilon^2(Z))^{-1}$ and $H_{0,n}^*=E[w_n^*(Z)\phi_\theta(Z,\theta_0)\phi_\theta'(Z,\theta_0)]$. Then by definition

$$V_{n,\theta}^*=(n_1^{-1}+n_2^{-1})(H_{0,n}^*)^{-1}. \quad (147)$$

For any $w_n(\cdot)$, define

$$A_n(Z,w_n)=H_{0,n}^{-1}\phi_\theta(Z,\theta_0)w_n(Z)-(H_{0,n}^*)^{-1}\phi_\theta(Z,\theta_0)w_n^*(Z). \quad (148)$$

Then we have

$$V_{n,\theta}-V_{n,\theta}^*=E\left[A_n(Z,w_n)\Omega_n(Z)A_n(Z,w_n)'\right]\geq 0 \quad (149)$$

for any $n_1$ and any $n_2$, where $\Omega_n(Z)=n_1^{-1}\sigma_u^2(Z)+n_2^{-1}\sigma_\varepsilon^2(Z)$ and the inequality is by the fact that $\Omega_n(Z)\in(0,\infty)$ and $A_n(Z,w_n)\Omega_n(Z)A_n(Z,w_n)'$ is a positive semidefinite matrix almost surely.

Proof. [Proof of Lemma 2] By definition $w_n^*(z)^{-1}=(n_1^{-1}\sigma_u^2(z)+n_2^{-1}\sigma_\varepsilon^2(z))(n_1^{-1}+n_2^{-1})^{-1}$. Then for any $n_1$, $n_2$,

$$\min\left\{\inf_{z\in\mathcal{Z}}\sigma_u^2(z),\,\inf_{z\in\mathcal{Z}}\sigma_\varepsilon^2(z)\right\}\leq w_n^*(z)^{-1}\leq\max\left\{\sup_{z\in\mathcal{Z}}\sigma_u^2(z),\,\sup_{z\in\mathcal{Z}}\sigma_\varepsilon^2(z)\right\}$$

which together with Assumption 3(v) proves the claim of the lemma.

    C Proof of the Main Results in Section 5

Lemma 8. Under Assumptions 1, 2, 3 and 4(i), we have

$$n_2^{-1}\sum_{i\in I_2}\left|\hat\phi_{n_2}(Z_i,\hat\theta_n)-\phi(Z_i,\theta_0)\right|^2=O_p(n_1^{-1}+k_2n_2^{-1}+k_2^{-2r_h}) \quad (150)$$

and moreover,

$$\sup_{z\in\mathcal{Z}}\left|\hat\phi_{n_2}(z,\hat\theta_n)-\phi(z,\theta_0)\right|=O_p(\xi_{k_2}n_1^{-1/2}+\xi_{k_2}k_2^{1/2}n_2^{-1/2}+\xi_{k_2}k_2^{-r_h}). \quad (151)$$

Proof. [Proof of Lemma 8] By definition,

$$\hat\phi_{n_2}(z,\hat\theta_n)=P_{k_2}'(z)(P_{k_2,n_2}'P_{k_2,n_2})^{-1}P_{k_2,n_2}'g_{n_2}(\hat\theta_n);$$
$$\hat\phi_{n_2}(z,\theta_0)=P_{k_2}'(z)(P_{k_2,n_2}'P_{k_2,n_2})^{-1}P_{k_2,n_2}'g_{n_2}(\theta_0),$$

where $g_{n_2}(\hat\theta_n)=(g(X_i,\hat\theta_n))_{i\in I_2}'$ and $g_{n_2}(\theta_0)=(g(X_i,\theta_0))_{i\in I_2}'$. By the triangle inequality, for any $z$,

$$\left|\hat\phi_{n_2}(z,\hat\theta_n)-\phi(z,\theta_0)\right|\leq\left|\hat\phi_{n_2}(z,\hat\theta_n)-\hat\phi_{n_2}(z,\theta_0)\right|+\left|\hat\phi_{n_2}(z,\theta_0)-h_{0,k_2}(z)\right|+\left|h_{0,k_2}(z)-h_0(z)\right|. \quad (152)$$

By the mean value expansion and the Cauchy-Schwarz inequality,

$$\left|g(X_i,\hat\theta_n)-g(X_i,\theta_0)\right|\leq\sup_{\theta\in\mathcal{N}_n}\|g_\theta(X_i,\theta)\|\,\|\hat\theta_n-\theta_0\| \quad (153)$$


which together with Assumption 4(i) and (9) in Theorem 2 implies that

$$n_2^{-1}\sum_{i\in I_2}\left|g(X_i,\hat\theta_n)-g(X_i,\theta_0)\right|^2\leq\sup_{\theta\in\mathcal{N}_n}n_2^{-1}\sum_{i\in I_2}\|g_\theta(X_i,\theta)\|^2\,\|\hat\theta_n-\theta_0\|^2=O_p(n_1^{-1}+n_2^{-1}). \quad (154)$$

Under Assumptions 1(i), 1(iii), 1(iv) and 3(v), we can use similar arguments to those in proving (61) to show that

$$n_2^{-1}\sum_{i\in I_2}\left(\hat\phi_{n_2}(Z_i,\theta_0)-h_{0,k_2}(Z_i)\right)^2=O_p(k_2n_2^{-1}+k_2^{-2r_h}). \quad (155)$$

Using (154), we get

$$n_2^{-1}\sum_{i\in I_2}\left|\hat\phi_{n_2}(Z_i,\hat\theta_n)-\hat\phi_{n_2}(Z_i,\theta_0)\right|^2
=\frac{[g_{n_2}(\hat\theta_n)-g_{n_2}(\theta_0)]'P_{k_2,n_2}(P_{k_2,n_2}'P_{k_2,n_2})^{-1}P_{k_2,n_2}'[g_{n_2}(\hat\theta_n)-g_{n_2}(\theta_0)]}{n_2}$$
$$\leq n_2^{-1}\sum_{i\in I_2}\left|g(X_i,\hat\theta_n)-g(X_i,\theta_0)\right|^2=O_p(n_1^{-1}+n_2^{-1}), \quad (156)$$

which together with (152), (155) and Assumption 1(iv) implies that

$$n_2^{-1}\sum_{i\in I_2}\left|\hat\phi_{n_2}(Z_i,\hat\theta_n)-\phi(Z_i,\theta_0)\right|^2=O_p(n_1^{-1}+k_2n_2^{-1}+k_2^{-2r_h}). \quad (157)$$

This proves the first claim of the lemma. By (60) and the Cauchy-Schwarz inequality,

$$\sup_{z\in\mathcal{Z}}\left|\hat\phi_{n_2}(z,\theta_0)-h_{0,k_2}(z)\right|=O_p(\xi_{k_2}k_2^{1/2}n_2^{-1/2}+\xi_{k_2}k_2^{-r_h}). \quad (158)$$

Using (154), we get

$$\sup_{z\in\mathcal{Z}}\left|\hat\phi_{n_2}(z,\hat\theta_n)-\hat\phi_{n_2}(z,\theta_0)\right|
\leq\xi_{k_2}\left(\frac{[g_{n_2}(\hat\theta_n)-g_{n_2}(\theta_0)]'P_{k_2,n_2}(P_{k_2,n_2}'P_{k_2,n_2})^{-1}P_{k_2,n_2}'[g_{n_2}(\hat\theta_n)-g_{n_2}(\theta_0)]}{n_2}\right)^{1/2}$$
$$\leq\xi_{k_2}\left(n_2^{-1}\sum_{i\in I_2}\left|g(X_i,\hat\theta_n)-g(X_i,\theta_0)\right|^2\right)^{1/2}=O_p(\xi_{k_2}n_1^{-1/2}+\xi_{k_2}n_2^{-1/2}), \quad (159)$$

which together with (152), (158) and Assumption 1(iv) implies that

$$\sup_{z\in\mathcal{Z}}\left|\hat\phi_{n_2}(z,\hat\theta_n)-\phi(z,\theta_0)\right|=O_p(\xi_{k_2}n_1^{-1/2}+\xi_{k_2}k_2^{1/2}n_2^{-1/2}+\xi_{k_2}k_2^{-r_h}). \quad (160)$$

This proves the second claim of the lemma.

Lemma 9. Under the conditions of Theorem 4, we have


(i) $\hat H_n=H_{0,n}+o_p(1)$; (ii) $n^{-1}(\hat\phi_{w\theta,n}-\phi_{w\theta,n})'P_{n,k_1}=o_p(1)$; (iii) $n^{-1}(\hat\phi_{w\theta,n}-\phi_{w\theta,n})'P_{n,k_2}=o_p(1)$; (iv) $\sup_{\{\gamma_k\in R^{k_1}:\,\gamma_k'\gamma_k=1\}}\gamma_k'(\hat Q_{n_1,u}-Q_{n_1,u})\gamma_k=o_p(1)$; (v) $\sup_{\{\gamma_k\in R^{k_2}:\,\gamma_k'\gamma_k=1\}}\gamma_k'(\hat Q_{n_2,\varepsilon}-Q_{n_2,\varepsilon})\gamma_k=o_p(1)$.

Proof. [Proof of Lemma 9] (i) The proof of the first claim follows by the consistency of $\hat\theta_n$ and similar arguments in deriving (86). Hence it is omitted.

(ii) By definition,

$$n^{-1}(\hat\phi_{w\theta,n}-\phi_{w\theta,n})'P_{n,k_1}=n^{-1}\sum_{i\in I}(\hat w_n(Z_i)-w_n(Z_i))\hat\phi_{\theta,n_2}(Z_i,\hat\theta_n)P_{k_1}(Z_i)'$$
$$+n^{-1}\sum_{i\in I}w_n(Z_i)\left(\hat\phi_{\theta,n_2}(Z_i,\hat\theta_n)-\phi_\theta(Z_i,\theta_0)\right)P_{k_1}(Z_i)'. \quad (161)$$

By (9) in Theorem 2, Lemma 4 and similar arguments in showing (80),

$$n^{-1}\sum_{i\in I}\left\|\hat\phi_{\theta,n_2}(Z_i,\hat\theta_n)-\hat\phi_{\theta,n_2}(Z_i,\theta_0)\right\|^2
\leq\sup_{\theta\in\mathcal{N}_{\delta_n}}n^{-1}\sum_{i\in I}\left\|\hat\phi_{\theta\theta,n_2}(Z_i,\theta)\right\|^2\left\|\hat\theta_n-\theta_0\right\|^2=O_p(n_1^{-1}+n_2^{-1}), \quad (162)$$

which together with Assumption 3(iii) implies that

$$n^{-1}\sum_{i\in I}\left\|\hat\phi_{\theta,n_2}(Z_i,\hat\theta_n)-\phi_\theta(Z_i,\theta_0)\right\|^2
\leq 2n^{-1}\sum_{i\in I}\left\|\hat\phi_{\theta,n_2}(Z_i,\hat\theta_n)-\hat\phi_{\theta,n_2}(Z_i,\theta_0)\right\|^2$$
$$+2n^{-1}\sum_{i\in I}\left\|\hat\phi_{\theta,n_2}(Z_i,\theta_0)-\phi_\theta(Z_i,\theta_0)\right\|^2=o_p(n_1^{-1/2}+n_2^{-1/2}). \quad (163)$$

By (42), (82), (163), Assumption 4(iv), and the Cauchy-Schwarz inequality,

$$\left\|n^{-1}\sum_{i\in I}(\hat w_n(Z_i)-w_n(Z_i))\hat\phi_{\theta,n_2}(Z_i,\hat\theta_n)P_{k_1}(Z_i)'\right\|^2$$
$$\leq\sup_{z\in\mathcal{Z}}|\hat w_n(z)-w_n(z)|^2\left(n^{-1}\sum_{i\in I}\left\|\hat\phi_{\theta,n_2}(Z_i,\hat\theta_n)\right\|^2\right)\left(n^{-1}\sum_{i\in I}\|P_{k_1}(Z_i)\|^2\right)=o_p(k_1\delta_{w,n}^2)=o_p(1), \quad (164)$$

where $n^{-1}\sum_{i\in I}\|\hat\phi_{\theta,n_2}(Z_i,\hat\theta_n)\|^2=O_p(1)$ is used in the first equality, which is by (82) and (163).


By Assumptions 2(v) and 3(vii), (163) and the Cauchy-Schwarz inequality,

$$\left\|n^{-1}\sum_{i\in I}w_n(Z_i)\left(\hat\phi_{\theta,n_2}(Z_i,\hat\theta_n)-\phi_\theta(Z_i,\theta_0)\right)P_{k_1}(Z_i)'\right\|^2$$
$$\leq\sup_{z\in\mathcal{Z}}|w_n(z)|^2\left(n^{-1}\sum_{i\in I}\left\|\hat\phi_{\theta,n_2}(Z_i,\hat\theta_n)-\phi_\theta(Z_i,\theta_0)\right\|^2\right)\left(n^{-1}\sum_{i\in I}\|P_{k_1}(Z_i)\|^2\right)=o_p(1). \quad (165)$$

Collecting the results in (161), (164) and (165), we immediately prove the second claim of the lemma. (iii) The proof of this result is similar to the arguments in the proof of (ii) and hence is omitted. (iv) By definition,

$$\hat Q_{n_1,u}-Q_{n_1,u}=n_1^{-1}\sum_{i\in I_1}(Y_i-\hat h_{n_1}(Z_i))^2P_{k_1}(Z_i)P_{k_1}'(Z_i)-Q_{n_1,u}$$
$$=n_1^{-1}\sum_{i\in I_1}\left(u_i^2-\sigma_u^2(Z_i)\right)P_{k_1}(Z_i)P_{k_1}'(Z_i)
+n_1^{-1}\sum_{i\in I_1}(\hat h_{n_1}(Z_i)-h_0(Z_i))^2P_{k_1}(Z_i)P_{k_1}'(Z_i)$$
$$-2n_1^{-1}\sum_{i\in I_1}(\hat h_{n_1}(Z_i)-h_0(Z_i))u_iP_{k_1}(Z_i)P_{k_1}'(Z_i). \quad (166)$$

By (42), Assumptions 1(i) and 3(v),

$$E\left[\left\|n_1^{-1}\sum_{i\in I_1}\left(u_i^2-\sigma_u^2(Z_i)\right)P_{k_1}(Z_i)P_{k_1}'(Z_i)\right\|^2\,\Big|\,\{Z_i\}_{i\in I_1}\right]
=n_1^{-2}\sum_{i\in I_1}E\left[\left(u_i^2-\sigma_u^2(Z_i)\right)^2\mid Z_i\right]\left\|P_{k_1}(Z_i)P_{k_1}'(Z_i)\right\|^2$$
$$\leq C\xi_{k_1}^2n_1^{-2}\sum_{i\in I_1}P_{k_1}'(Z_i)P_{k_1}(Z_i)
\leq C\lambda_{\max}(Q_{k_1,n_1})\xi_{k_1}^2k_1n_1^{-1}=O_p(\xi_{k_1}^2k_1n_1^{-1}), \quad (167)$$

which together with the Markov inequality and Assumption 4(iv) implies that

$$n_1^{-1}\sum_{i\in I_1}\left(u_i^2-\sigma_u^2(Z_i)\right)P_{k_1}(Z_i)P_{k_1}'(Z_i)=o_p(1). \quad (168)$$

Using Assumption 1(iv) and (43), we get

$$\sup_{z\in\mathcal{Z}}\left|\hat h_{n_1}(z)-h_0(z)\right|\leq\sup_{z\in\mathcal{Z}}\left|\hat h_{n_1}(z)-h_{0,k_1}(z)\right|+\sup_{z\in\mathcal{Z}}|h_{0,k_1}(z)-h_0(z)|$$
$$\leq\sup_{z\in\mathcal{Z}}\left|(\hat\beta_{k_1,n_1}-\beta_{h,k_1})'P_{k_1}(z)\right|+\sup_{z\in\mathcal{Z}}|h_{0,k_1}(z)-h_0(z)|$$
$$\leq\xi_{k_1}\left\|\hat\beta_{k_1,n_1}-\beta_{h,k_1}\right\|+\sup_{z\in\mathcal{Z}}|h_{0,k_1}(z)-h_0(z)|
=O_p(\xi_{k_1}k_1^{1/2}n_1^{-1/2}+\xi_{k_1}k_1^{-r_h}), \quad (169)$$

where the first inequality is by the triangle inequality and the second inequality is by the Cauchy-Schwarz inequality. Moreover,

$$\sup_{\{\gamma_k\in R^{k_1}:\,\gamma_k'\gamma_k=1\}}n_1^{-1}\sum_{i\in I_1}(\hat h_{n_1}(Z_i)-h_0(Z_i))^2\gamma_k'P_{k_1}(Z_i)P_{k_1}'(Z_i)\gamma_k$$
$$\leq\sup_{z\in\mathcal{Z}}\left|\hat h_{n_1}(z)-h_0(z)\right|^2\sup_{\{\gamma_k\in R^{k_1}:\,\gamma_k'\gamma_k=1\}}n_1^{-1}\sum_{i\in I_1}\gamma_k'P_{k_1}(Z_i)P_{k_1}'(Z_i)\gamma_k$$
$$=\sup_{z\in\mathcal{Z}}\left|\hat h_{n_1}(z)-h_0(z)\right|^2\lambda_{\max}(Q_{k_1,n_1})=O_p(\xi_{k_1}^2k_1n_1^{-1}+\xi_{k_1}^2k_1^{-2r_h})=o_p(1) \quad (170)$$

where the first equality is by (42) and (169), and the last equality is by Assumption 4(iv). By the triangle inequality,

    ��� ��� 2 ≤ sup bhn1 (z) − h0(z) λmax(Qk1,n1 )

    X −1 1 (h0,k1 (Zi) − h0(Zi))uiγk0 Pk1 (Zi)Pk0 1 (Zi)γksup n 0{γk ∈Rk : γ

    −1 (h0,k1 (Zi) − h0(Zi))uiPk1 (Zi)P 0 1 k1

    γk =1} i∈I1k

    (Zi)

    . X (171)≤ n i∈I1

    By Assumptions 1(i), 1(iii), 1(iv) and 3(v), ⎤⎡

    2X

    −1 1 (h0,k1 (Zi) − h0(Zi))uiPk1 (Zi)Pk0 1 ⎣ ⎦E (Zi)n

    i∈I1 #" X ��P 0 k1 ��2−1 2E (h0,k1 (Zi) − h0(Zi))2 u (Zi)Pk1 (Zi)= n i1 i∈I1

    2 � � (ξ2 k1 k

    1−2rh 1Cn

    −1 1

    −1P 0 k1 ξ2 k1 E (172)≤ |h0,k1 (z) − h0(z)| (Zi)Pk1 (Zi) = Op )sup n1

    z∈Z

    which together with (171), Assumption 4(iv), and the Markov inequality implies that

    (Zi)γk

    X −1 1 (h0,k1 (Zi) − h0(Zi))uiγk0 Pk1 (Zi)Pk0 1 (173)(1).sup n = op{γk ∈Rk : γ0 k γk =1} i∈I1

By the Cauchy-Schwarz inequality,

$$\sup_{\{\gamma_k\in R^{k_1}:\,\gamma_k'\gamma_k=1\}}\left|n_1^{-1}\sum_{i\in I_1}(\hat h_{n_1}(Z_i)-h_{0,k_1}(Z_i))u_i\gamma_k'P_{k_1}(Z_i)P_{k_1}'(Z_i)\gamma_k\right|^2$$
$$\leq\left\|\hat\beta_{k_1,n_1}-\beta_{h,k_1}\right\|^2\sup_{\{\gamma_k\in R^{k_1}:\,\gamma_k'\gamma_k=1\}}\sum_{j=1}^{k_1}\left|n_1^{-1}\sum_{i\in I_1}p_j(Z_i)u_i\gamma_k'P_{k_1}(Z_i)P_{k_1}'(Z_i)\gamma_k\right|^2$$
$$\leq\left\|\hat\beta_{k_1,n_1}-\beta_{h,k_1}\right\|^2\sum_{j=1}^{k_1}\left\|n_1^{-1}\sum_{i\in I_1}p_j(Z_i)u_iP_{k_1}(Z_i)P_{k_1}'(Z_i)\right\|^2. \quad (174)$$


By Assumptions 1(i), 1(iii) and 3(v),

$$E\left[\sum_{j=1}^{k_1}\left\|n_1^{-1}\sum_{i\in I_1}p_j(Z_i)u_iP_{k_1}(Z_i)P_{k_1}'(Z_i)\right\|^2\right]
=n_1^{-1}\sum_{j=1}^{k_1}E\left[p_j^2(Z_i)u_i^2\left\|P_{k_1}(Z_i)P_{k_1}'(Z_i)\right\|^2\right]$$
$$\leq C\xi_{k_1}^4n_1^{-1}\sum_{j=1}^{k_1}E\left[p_j^2(Z_i)\right]=O(\xi_{k_1}^4k_1n_1^{-1}), \quad (175)$$

which together with (43), (174), Assumption 4(iv), and the Markov inequality implies that

$$\sup_{\{\gamma_k\in R^{k_1}:\,\gamma_k'\gamma_k=1\}}n_1^{-1}\sum_{i\in I_1}(\hat h_{n_1}(Z_i)-h_{0,k_1}(Z_i))u_i\gamma_k'P_{k_1}(Z_i)P_{k_1}'(Z_i)\gamma_k
=O_p\left(\xi_{k_1}^2k_1^{1/2}n_1^{-1/2}(k_1^{1/2}n_1^{-1/2}+k_1^{-r_h})\right)=o_p(1). \quad (176)$$

Combining the results in (173), (176) and the triangle inequality, we get

$$\sup_{\{\gamma_k\in R^{k_1}:\,\gamma_k'\gamma_k=1\}}n_1^{-1}\sum_{i\in I_1}(\hat h_{n_1}(Z_i)-h_0(Z_i))u_i\gamma_k'P_{k_1}(Z_i)P_{k_1}'(Z_i)\gamma_k=o_p(1). \quad (177)$$

Combining the results in (166), (168), (170) and (177), we prove the fourth claim of the lemma.

(v) By definition,

$$\hat Q_{n_2,\varepsilon}-Q_{n_2,\varepsilon}=n_2^{-1}\sum_{i\in I_2}\left(\varepsilon_i^2-\sigma_\varepsilon^2(Z_i)\right)P_{k_2}(Z_i)P_{k_2}'(Z_i)$$
$$+2n_2^{-1}\sum_{i\in I_2}\left(g(X_i,\hat\theta_n)-\hat\phi(Z_i,\hat\theta_n)-g(X_i,\theta_0)+\phi(Z_i,\theta_0)\right)\varepsilon_iP_{k_2}(Z_i)P_{k_2}'(Z_i)$$
$$+n_2^{-1}\sum_{i\in I_2}\left(g(X_i,\hat\theta_n)-\hat\phi(Z_i,\hat\theta_n)-g(X_i,\theta_0)+\phi(Z_i,\theta_0)\right)^2P_{k_2}(Z_i)P_{k_2}'(Z_i). \quad (178)$$

Using similar arguments in showing (168), we get

$$n_2^{-1}\sum_{i\in I_2}\left(\varepsilon_i^2-E[\varepsilon_i^2\mid Z_i]\right)P_{k_2}(Z_i)P_{k_2}'(Z_i)=O_p(\xi_{k_2}k_2^{1/2}n_2^{-1/2})=o_p(1). \quad (179)$$

By the Cauchy-Schwarz inequality,

$$\sup_{\{\gamma_k\in R^{k_2}:\,\gamma_k'\gamma_k=1\}}n_2^{-1}\sum_{i\in I_2}\left(g(X_i,\hat\theta_n)-g(X_i,\theta_0)\right)^2\gamma_k'P_{k_2}(Z_i)P_{k_2}'(Z_i)\gamma_k$$
$$\leq\xi_{k_2}^2n_2^{-1}\sum_{i\in I_2}\left(g(X_i,\hat\theta_n)-g(X_i,\theta_0)\right)^2=O_p(\xi_{k_2}^2n_1^{-1}+\xi_{k_2}^2n_2^{-1})=o_p(1) \quad (180)$$


where the first equality is by (154) and the second equality is by Assumption 4(iv). Similarly,

$$\sup_{\{\gamma_k\in R^{k_2}:\,\gamma_k'\gamma_k=1\}}n_2^{-1}\sum_{i\in I_2}\left(\hat\phi(Z_i,\hat\theta_n)-\phi(Z_i,\theta_0)\right)^2\gamma_k'P_{k_2}(Z_i)P_{k_2}'(Z_i)\gamma_k$$
$$\leq\xi_{k_2}^2n_2^{-1}\sum_{i\in I_2}\left(\hat\phi(Z_i,\hat\theta_n)-\phi(Z_i,\theta_0)\right)^2=O_p(\xi_{k_2}^2n_1^{-1}+\xi_{k_2}^2k_2n_2^{-1}+\xi_{k_2}^2k_2^{-2r_h})=o_p(1) \quad (181)$$

where the first equality is by (150) and the second equality is by Assumption 4(iv). Collecting the results in (180) and (181), we have

$$\sup_{\{\gamma_k\in R^{k_2}:\,\gamma_k'\gamma_k=1\}}n_2^{-1}\sum_{i\in I_2}\left(g(X_i,\hat\theta_n)-\hat\phi(Z_i,\hat\theta_n)-g(X_i,\theta_0)+\phi(Z_i,\theta_0)\right)^2\gamma_k'P_{k_2}(Z_i)P_{k_2}'(Z_i)\gamma_k$$
$$\leq 2\sup_{\{\gamma_k\in R^{k_2}:\,\gamma_k'\gamma_k=1\}}n_2^{-1}\sum_{i\in I_2}\left(g(X_i,\hat\theta_n)-g(X_i,\theta_0)\right)^2\gamma_k'P_{k_2}(Z_i)P_{k_2}'(Z_i)\gamma_k$$
$$+2\sup_{\{\gamma_k\in R^{k_2}:\,\gamma_k'\gamma_k=1\}}n_2^{-1}\sum_{i\in I_2}\left(\hat\phi(Z_i,\hat\theta_n)-\phi(Z_i,\theta_0)\right)^2\gamma_k'P_{k_2}(Z_i)P_{k_2}'(Z_i)\gamma_k=o_p(1). \quad (182)$$

Next, note that by the Cauchy-Schwarz inequality and the triangle inequality,

$$\left|n_2^{-1}\sum_{i\in I_2}\left(g(X_i,\hat\theta_n)-\hat\phi(Z_i,\hat\theta_n)-g(X_i,\theta_0)+\phi(Z_i,\theta_0)\right)\varepsilon_i\gamma_k'P_{k_2}(Z_i)P_{k_2}'(Z_i)\gamma_k\right|$$