
Inference on Conditional Quantile Processes in Partially Linear Models

Zhongjun Qu*

Boston University

Jungmo Yoon†

Hanyang University

August 3, 2017

Abstract

This paper develops methods for estimating and conducting inference on conditional quantile processes for models featuring both a linear and a nonparametric component. The estimation procedure consists of three steps, where the bandwidth parameter is allowed to vary across quantiles to adapt to data sparsity. For inference, the paper first establishes a Bahadur representation that holds uniformly with respect to both the covariate value and the quantile index. It then shows that the above estimator, after standardization, converges weakly to a continuous Gaussian process. When desirable, the bias term affecting the asymptotic distribution can be estimated with a local quadratic regression; the theory accounts for its estimation uncertainty. Building on these results, the paper shows how to construct uniform confidence bands for the conditional quantile process using (i) the asymptotic distribution and (ii) resampling. The resampling-based band does not require conditional density estimation, which can be desirable when the sample size is relatively small. The paper also illustrates how to test hypotheses related to significance, homogeneity and conditional stochastic dominance. Finally, practically relevant situations related to boundary points and large sample sizes are discussed. The proposed procedures are illustrated with simulations.

Keywords: semiparametric, quantile regression, uniform Bahadur representation, uniform confidence band, resampling

JEL classification: C14, C21

*Department of Economics, Boston University ([email protected]). †College of Economics and Finance, Hanyang University ([email protected]).


1 Introduction

Quantile regression, introduced by Koenker and Bassett (1978), has emerged as a versatile framework for studying response heterogeneity. In practice, it is often desirable to consider a range of quantiles to obtain a complete analysis of stochastic relationships between variables. This motivates the study of conditional quantile processes. Koenker and Portnoy (1987) is a seminal contribution that establishes a uniform Bahadur representation and serves as the foundation for further developments in this area. Koenker and Machado (1999) is another milestone that introduces several inference processes related to the likelihood ratio, Wald and regression rankscore. Koenker and Xiao (2002) further broaden the scope of Koenker and Machado (1999) by building on Khmaladzation (Khmaladze, 1981). Chernozhukov and Fernández-Val (2005) develop resampling as an alternative approach. Angrist, Chernozhukov and Fernández-Val (2006) establish inferential theory in misspecified models. The results in the above studies can be used to study a wide range of issues, including but not restricted to (i) testing alternative model specifications, (ii) testing stochastic dominance, and (iii) detecting treatment effect significance and heterogeneity.

The above literature on quantile processes mainly considers models with parametric conditional quantile functions. However, depending on the application, a nonparametric or semiparametric specification can be desirable. For example, in a regression discontinuity (RD) design setting, it is important to allow a nonparametric relationship between the outcome distribution and the distance from the threshold. In a job training program evaluation application, it is informative to allow the earnings function to depend nonparametrically or semiparametrically on individuals' characteristics. Such flexible specifications are commonly adopted when estimating conditional mean functions; see Hahn, Todd and van der Klaauw (2001, RD design) and Heckman, Ichimura and Todd (1998, program evaluation).

Studies on quantile processes in a semi- or non-parametric setting have remained relatively sparse. Belloni, Chernozhukov, Chetverikov and Fernández-Val (2016) model the conditional quantile function as a series of increasing dimension. They provide several inference procedures for conditional quantile processes and linear functionals, including average partial derivatives. See also Chao, Volgushev and Cheng (2016) for a related study. Qu and Yoon (2015) adopt a local regression approach. They study uniform confidence bands and hypothesis testing for a nonparametrically specified conditional quantile process at fixed covariate values. Guerre and Sabbah (2012) provide a Bahadur representation for a local polynomial estimator that is uniform with respect to both the covariate value and the quantile index, although they do not further study conditional quantile processes.

A nonparametric viewpoint coupled with local regressions allows a researcher to carry out the study at a covariate value of interest without first obtaining a globally adequate approximation to the data. However, it is well known that the resulting estimator is subject to the curse of dimensionality. It is important to go beyond the parametric-nonparametric dichotomy. This motivates us to consider conditional quantile processes in partially linear models. Specifically, we consider models where the conditional quantile function takes the following form: $Q(\tau|x,z) = g(x,\tau) + z'\beta(\tau)$, where $g(x,\tau)$ is a nonparametric component, $x$ is a vector of covariates, $\tau$ is the quantile index, $z'\beta(\tau)$ is a linear component, and $z$ is another set of covariates. The main goals of the paper are to: (i) develop procedures for estimating $Q(\tau|x,z)$ over $\tau\in\mathcal{T}$, where $\mathcal{T}\subset(0,1)$; (ii) construct a uniform confidence band for $Q(\tau|x,z)$ over $\tau\in\mathcal{T}$; and (iii) test hypotheses about $Q(\tau|x,z)$ over $\tau\in\mathcal{T}$. Although the focus is on partially linear models, we conjecture that some parts of the analysis, such as those obtaining a uniform Bahadur representation and constructing a uniform band that acknowledges the estimation bias, can be more generally useful for studying other semiparametric models.

The estimation procedure consists of three steps: (1) The linear component of the model is estimated with local linear or quadratic regressions using all observations in the sample. (2) The nonparametric component is estimated with local linear regressions, conditional on the linear component, using observations local to $x$. (3) The two components are combined, and linear interpolation between quantile levels is applied to obtain a continuous process over $\mathcal{T}$. The two options in Step 1 are intended to offer flexibility in practice. The options are motivated by a trade-off: the local linear option operates under a weak assumption on the smoothness of $g(x,\tau)$ with respect to $x$, but its bandwidth conditions are hard to implement in practice; the local quadratic option requires a stronger smoothness assumption, but its bandwidth choice is clear to make. The inference theory is developed for both options. The estimation procedure builds upon the existing literature. Specifically, the averaging in Step 1 follows Lee (2003), which is also used in Cai and Xiao (2012) when considering models with partially varying coefficients. The linear interpolation follows Neocleous and Portnoy (2008). It permits a tractable inferential theory, as presented later.

For inference, the paper provides four sets of results. The first result is a Bahadur representation that holds uniformly with respect to the covariate value and the quantile index; it can be of independent interest. The second result is weak convergence of the estimator, after standardization, to a continuous Gaussian process. This result shows explicitly how the presence of the linear component affects the asymptotic distribution while not altering the rate of convergence. The third result is an MSE-optimal bandwidth formula for estimating $Q(\tau|x,z)$ over $\mathcal{T}$. The fourth result includes two methods for constructing uniform confidence bands for $Q(\tau|x,z)$. One method is based on simulating the asymptotic approximation, while the other is based on resampling, building on Parzen, Wei and Ying (1994). The methods are complementary. The simulation-based band is computationally less demanding, but requires estimating nuisance parameters. The resampling-based band does not require estimating any nuisance parameter; however, it can be computationally costly for large data sets. The two methods therefore cover a broader spectrum of applications than either one can alone.

As is typically the case with semiparametric estimation, the estimator here is affected by a bias term. The paper considers two options. First, it obtains bandwidth conditions under which the bias vanishes asymptotically; this corresponds to the under-smoothing option often used in the literature. Second, it measures the magnitude of the bias and then accounts for the measurement uncertainty when constructing the confidence band. This option is related to that of Calonico, Cattaneo and Titiunik (2014). When studying inference on the average treatment effect in an RD design setting, they provide a novel formula for the variance of the average treatment effect estimator that accounts for the effect of the bias estimation. Compared with their setting, the inference here is on a process rather than a finite dimensional parameter; therefore, simply modifying the variance estimator will not solve the problem. We make progress by studying the structure of the subgradient conditions (to be made explicit later). Qu and Yoon (2016) follow a similar strategy, but they only consider the asymptotic approximation based band, and only in the RD design setting.

The estimation method and theory cover the situation where $x$ is in the interior of the data support and the sample size is not too big. However, this leaves out the following two situations that are important in empirical research. (1) The value of $x$ may be close to the boundary, such that the kernel extends outside of the data support; or $x$ may represent a threshold such that the researcher wants to estimate two conditional quantiles separately. To address this, the paper derives an asymptotic approximation to the distribution of the estimator when $x$ is close to or on the boundary of the data support. Further, when implementing the confidence band, it provides formulae that are valid irrespective of whether $x$ is an interior or a boundary point. (2) The sample size may make it infeasible to average over all covariate values in Step 1 of the estimation procedure. Even if the sample size is not a concern, the researcher may still be reluctant to average because $\beta(\tau)$ may take on a different value in data regions distant from $x$. To address this, the paper develops a procedure that carries out only Steps 2 and 3 of the estimation procedure. The distribution of this estimator is studied; it converges at the same rate as the original three-step estimator and is therefore not subject to the curse of dimensionality in $z$.

Besides the literature on conditional quantile processes, this paper also relies on the following two overlapping strands of literature: (i) partially linear models, including Robinson (1988), He and Shi (1996), Lee (2003), Wang, Zhu and Zhou (2009), Cai and Xiao (2012), and Sherwood and Wang (2016); (ii) nonparametric quantile regressions, including Stone (1977), Chaudhuri (1991), Chaudhuri, Doksum and Samarov (1997), Yu and Jones (1998), Kong, Linton and Xia (2010), and Guerre and Sabbah (2012). From a broad perspective, this paper contributes to the semiparametric analysis literature, which is becoming increasingly important as rich data sets become increasingly available.

The remainder of the paper is structured as follows. Section 2 describes the model. Section 3 presents the estimation procedure. Section 4 studies its asymptotic properties. Section 5 develops uniform confidence bands; the bias issue is studied in detail. Section 6 illustrates how the results can be used to test hypotheses related to significance, dominance and homogeneity. Section 7 provides two extensions. Section 8 reports finite sample properties, while Section 9 concludes. The proofs are included in two appendices: the main appendix for results in the paper and the supplementary appendix for some auxiliary lemmas. R code that implements the estimation and inference procedures will be available on the authors' website.

The following notation is used. $\|z\|$ denotes the Euclidean norm of a real-valued vector $z$; $1(\cdot)$ is the indicator function. $D[0,1]$ stands for the set of functions on $[0,1]$ that are right continuous and have left limits, equipped with the Skorohod metric. The symbols $\Rightarrow$ and $\to_p$ denote weak convergence under the Skorohod topology and convergence in probability, and $O_p(\cdot)$ and $o_p(\cdot)$ are the usual notation for the orders of stochastic magnitude.

2 The model

Let $(Y, X', Z')$ be a random vector, where $Y$ is an outcome variable with a continuous distribution, and $X$ and $Z$ represent two sets of covariates. The conditional quantile function of $Y$ given $X = x$ and $Z = z$ is assumed to have a partially linear structure, given by
$$Q(\tau|x,z) = g(x,\tau) + z'\beta(\tau) \quad\text{for }\tau\in\mathcal{T}, \tag{1}$$
where $\mathcal{T} = [\tau_1,\tau_2]$ with $0 < \tau_1 \le \tau_2 < 1$. In practice, $\mathcal{T}$ can be chosen depending on the research interest. For example, if the research interest is in the lower part of the distribution, then we can choose $\mathcal{T} = [\varepsilon, 0.5]$ with $\varepsilon$ a small positive number. In the above specification, $g(X,\tau)$ is allowed to be a flexible function of $X$ and $\tau$. The covariates $Z$ affect the quantile function linearly; nevertheless their impact, measured by $\beta(\tau)$, can still vary across quantiles. Throughout the paper, $X$ is assumed to have a continuous distribution, while $Z$ can contain both continuous and discrete components. We expect the model to be useful in applications where the dimension of $X$ is low, while that of $Z$ is more flexible.

As an example, $X$ can correspond to the level of a treatment, and $Z$ includes variables controlling for confounding factors. The model allows the treatment effect to vary flexibly with the treatment level. Given a treatment level, it allows the treatment to have heterogeneous effects at different parts of the conditional distribution. Therefore, it can potentially capture heterogeneity in both directions, with respect to both the covariate value and the quantile.

When restricted to a single quantile, the model is the same as in Lee (2003). When focusing instead on the conditional mean, the model was first developed in Robinson (1988). The goals of the current paper are to estimate and conduct inference on $Q(\tau|x,z)$ over the entire $\tau\in\mathcal{T}$. As a preview, the following issues can be addressed using the methods developed; the formal analysis is in the sections that follow.

Uniform confidence bands. Let $0 < p < 1$ be a coverage level. For given $x$ and $z$, constructing functions $L_p(\tau|x,z)$ and $U_p(\tau|x,z)$ such that asymptotically
$$P\left(Q(\tau|x,z)\in[L_p(\tau|x,z),\,U_p(\tau|x,z)]\ \text{for all }\tau\in\mathcal{T}\right)\ge p.$$

Significance. Testing the null hypothesis of $Q(\tau|x_1,z_1) = Q(\tau|x_2,z_2)$ for all $\tau\in\mathcal{T}$ against the alternative hypothesis of $Q(\tau|x_1,z_1)\ne Q(\tau|x_2,z_2)$ for some $\tau\in\mathcal{T}$, where $(x_1,z_1)$ and $(x_2,z_2)$ are two covariate values specified by the researcher.

Homogeneity. Testing the null hypothesis of $Q(\tau|x_1,z_1) - Q(\tau|x_2,z_2)$ being constant over $\mathcal{T}$ against the alternative hypothesis of $Q(\tau|x_1,z_1) - Q(\tau|x_2,z_2)\ne Q(s|x_1,z_1) - Q(s|x_2,z_2)$ for some $\tau, s\in\mathcal{T}$.

Dominance. Testing the null hypothesis of $Q(\tau|x_1,z_1) - Q(\tau|x_2,z_2)\ge 0$ over $\mathcal{T}$ against the alternative hypothesis of $Q(\tau|x_1,z_1) - Q(\tau|x_2,z_2) < 0$ for some $\tau\in\mathcal{T}$.

The above four issues are empirically important and have appeared in a fair number of studies. Below, we only discuss those that are most closely related to the current study. First, if the

nonparametric component $g(x,\tau)$ is absent, then the theory in Koenker and Machado (1999), Koenker and Xiao (2002) and Chernozhukov and Fernández-Val (2005) can be used to analyze all four issues above. Second, if the parametric component $z'\beta(\tau)$ is absent, then the results in Belloni, Chernozhukov, Chetverikov and Fernández-Val (2016, based on series approximation) and Qu and Yoon (2015, based on local regressions) are applicable. Therefore, the analysis here can be viewed as a partial bridge between the parametric and nonparametric polar cases.

The analysis of Belloni, Chernozhukov, Chetverikov and Fernández-Val (2016), and more generally the series framework, includes the partially linear model as a special case. The main differences between the series framework and the current local regressions framework are as follows. (a) The challenges are different. In the series framework, the conditional quantile function is modeled globally with an increasing number of parameters; the main challenge is specifying a globally adequate approximating model. Here, the conditional quantile function is approximated locally with a few parameters, and the key challenge is determining a suitable bandwidth. (b) The computational cost is different. For the series approach, the estimation can be done in one step using the full sample, and is therefore feasible even for large sample sizes. In the current approach, as seen below, the computation needs to be done in steps, so the cost can be much higher. This difference motivates us to consider an extension that has substantially lower computational cost at the price of a reduction in asymptotic efficiency (see Section 7). (c) The theory for inference is different. For series, if one assumes that the approximation error is small, then inference can proceed as if the estimated linear model were the true model; meanwhile, in practice it is often unclear how to quantify the size of the approximation error. For the current approach, a bias term is explicit in the asymptotic distribution. We show that it is feasible to quantify it when constructing confidence bands.

3 The estimation procedure

This section presents a three-step procedure for estimating $Q(\tau|x,z)$ over $\mathcal{T}$ at given $x$ and $z$. The focus is on the practical side; more technical discussions are left to the later sections.

Let $\{x_i, z_i, y_i\}_{i=1}^n$ be a sample of size $n$, where the dimension of $x_i$ is $d$, while that of $z_i$ is $q$. The conditional quantile function evaluated at $(x_i, z_i)$ is then
$$Q(\tau|x_i,z_i) = g(x_i,\tau) + z_i'\beta(\tau).$$
The procedure allows the user to choose between local linear and local quadratic regressions in the first step. We need some notation to encompass both options. Note that the standard second-order Taylor series approximation to $g(x_i,\tau)$ at $x$ can be written as

$$\beta_0(x,\tau) + \beta_1(x,\tau)'(x_i-x) + \frac{1}{2}\sum_{j=1}^{d}\beta_{j,j}(x,\tau)(x_{i,j}-x_{\cdot,j})^2 + \sum_{j=1}^{d}\sum_{l=j+1}^{d}\beta_{j,l}(x,\tau)(x_{i,j}-x_{\cdot,j})(x_{i,l}-x_{\cdot,l}), \tag{2}$$
where $x_{i,j}$ is the $j$-th element of $x_i$, $x_{\cdot,j}$ is the $j$-th element of $x$,
$$\beta_0(x,\tau) = g(x,\tau)\in\mathbb{R},\qquad \beta_1(x,\tau) = \left.\frac{\partial g(x_i,\tau)}{\partial x_i}\right|_{x_i=x}\in\mathbb{R}^d,\qquad \beta_{j,l}(x,\tau) = \left.\frac{\partial^2 g(x_i,\tau)}{\partial x_{i,j}\partial x_{i,l}}\right|_{x_i=x}\in\mathbb{R}$$
for $j,l = 1,\ldots,d$. To further shorten the notation, let $q(x_i-x)$ and $\beta_2(x,\tau)$ be two $d(d+1)/2$-dimensional vectors such that $q(x_i-x)'\beta_2(x,\tau)$ equals the sum of the two second-order terms in the Taylor approximation. Specifically, let $\beta_2(x,\tau)$ be a $d(d+1)/2$-dimensional vector that includes $\beta_{1,1}(x,\tau)/2, \beta_{2,2}(x,\tau)/2, \ldots, \beta_{d,d}(x,\tau)/2$ as its first $d$ elements, followed by $\beta_{j,l}(x,\tau)$ with $(j,l)$ in lexicographical order. Let $q(x_i-x)$ be a $d(d+1)/2$-dimensional vector that includes $(x_{i,j}-x_{\cdot,j})(x_{i,l}-x_{\cdot,l})$ as its elements, ordered in the same way as in $\beta_2(x,\tau)$. Using the shortened notation, we can rewrite (2) as
$$\beta_0(x,\tau) + (x_i-x)'\beta_1(x,\tau) + q(x_i-x)'\beta_2(x,\tau). \tag{3}$$
Immediately, the first-order Taylor approximation to $g(x_i,\tau)$ at $x$ is given by
$$\beta_0(x,\tau) + (x_i-x)'\beta_1(x,\tau),$$
where the coefficients have the same definitions as before.

The basic idea of the estimation procedure is to first obtain preliminary estimates of $\beta(\tau)$, then average them to obtain a more precise estimate $\bar{\beta}(\tau)$ (as in Lee, 2003), and finally estimate $g(x,\tau)$ conditionally on $\bar{\beta}(\tau)$. Let $K(\cdot)$ be a kernel function, and let $h_n$ and $b_{n,\tau}$ be the two bandwidth parameters used in the first and the second step. Note that the bandwidth $b_{n,\tau}$ is allowed to vary across quantiles to reflect data sparsity at different quantile levels. The procedure is as follows.

Step 1. Partition $\mathcal{T}$ using $m$ equally spaced quantiles $\{\tau_1,\ldots,\tau_m\}$. For each quantile $k\in\{1,\ldots,m\}$ and each $i\in\{1,\ldots,n\}$, minimize
$$\sum_{j=1,\,j\ne i}^{n}\rho_{\tau_k}\left(y_j - a_0 - (x_j-x_i)'a_1 - q(x_j-x_i)'a_2 - z_j'b\right)K\!\left(\frac{x_j-x_i}{h_n}\right) \tag{4}$$
over $a_0\in\mathbb{R}$, $a_1\in\mathbb{R}^d$, $a_2\in\mathbb{R}^{d(d+1)/2}$ and $b\in\mathbb{R}^q$ (this is the local quadratic option); or minimize
$$\sum_{j=1,\,j\ne i}^{n}\rho_{\tau_k}\left(y_j - a_0 - (x_j-x_i)'a_1 - z_j'b\right)K\!\left(\frac{x_j-x_i}{h_n}\right) \tag{5}$$
over $a_0\in\mathbb{R}$, $a_1\in\mathbb{R}^d$ and $b\in\mathbb{R}^q$ (this is the local linear option). In either option, denote the optimal value of $b$ by $\tilde{\beta}(x_i,\tau_k)$. Compute
$$\bar{\beta}(\tau_k) = \frac{1}{n}\sum_{i=1}^{n}\tilde{\beta}(x_i,\tau_k). \tag{6}$$

Step 2. Estimate the nonparametric component at $x$ by minimizing
$$\sum_{j=1}^{n}\rho_{\tau_k}\left(y_j - z_j'\bar{\beta}(\tau_k) - a_0 - (x_j-x)'a_1\right)K\!\left(\frac{x_j-x}{b_{n,\tau_k}}\right) \tag{7}$$
over $a_0\in\mathbb{R}$ and $a_1\in\mathbb{R}^d$. Denote the optimal values of $a_0$ and $a_1$ by $\hat{\alpha}_0(x,\tau_k)$ and $\hat{\alpha}_1(x,\tau_k)$.

Step 3. Apply linear interpolation as in Neocleous and Portnoy (2008) to obtain continuous processes over $\tau\in\mathcal{T}$:
$$\hat{\alpha}_0(x,\tau) = w(\tau)\hat{\alpha}_0(x,\tau_k) + (1-w(\tau))\hat{\alpha}_0(x,\tau_{k+1}),$$
$$\bar{\beta}(\tau) = w(\tau)\bar{\beta}(\tau_k) + (1-w(\tau))\bar{\beta}(\tau_{k+1}),$$
where $w(\tau) = (\tau_{k+1}-\tau)/(\tau_{k+1}-\tau_k)$ if $\tau\in[\tau_k,\tau_{k+1}]$. Finally, compute
$$\hat{Q}(\tau|x,z) = \hat{\alpha}_0(x,\tau) + z'\bar{\beta}(\tau)\quad\text{for any }\tau\in\mathcal{T}.$$
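To make the procedure concrete, the following is a minimal sketch of Steps 1 and 2 for the case $d = 1$, written in R with the quantreg package. The authors' own R code is not reproduced here; the function names, the Epanechnikov kernel and the plain loops are our illustrative choices, and the bandwidths are taken as given.

```r
## A minimal sketch of Steps 1 and 2 for d = 1 using the 'quantreg'
## package. Function names and the Epanechnikov kernel are illustrative
## choices; hn and bn are taken as given.
library(quantreg)

epa <- function(u) 0.75 * pmax(1 - u^2, 0)    # Epanechnikov kernel

## Step 1: leave-one-out local quadratic fits as in (4); the
## z-coefficients are then averaged over i as in (6).
step1_beta <- function(y, x, z, tau, hn) {
  n <- length(y); qz <- ncol(z)
  bhat <- matrix(NA_real_, n, qz)
  for (i in seq_len(n)) {
    w <- epa((x - x[i]) / hn)
    w[i] <- 0                                  # j != i in (4)
    keep <- w > 0
    dx <- (x - x[i])[keep]
    zi <- z[keep, , drop = FALSE]
    fit <- rq(y[keep] ~ dx + I(dx^2) + zi, tau = tau, weights = w[keep])
    bhat[i, ] <- tail(coef(fit), qz)           # tilde{beta}(x_i, tau_k)
  }
  colMeans(bhat)                               # bar{beta}(tau_k)
}

## Step 2: local linear fit of y - z' bar{beta}(tau) around x0, as in (7).
step2_alpha0 <- function(y, x, z, beta_bar, tau, x0, bn) {
  ytil <- drop(y - z %*% beta_bar)
  w <- epa((x - x0) / bn); keep <- w > 0
  dx <- (x - x0)[keep]
  coef(rq(ytil[keep] ~ dx, tau = tau, weights = w[keep]))[1]
}

## Step 3 is linear interpolation over the tau grid, e.g.
## approx(taus, alpha0_grid, xout = tau)$y.
```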

We now discuss some practical aspects of Steps 1 to 3. The two options in Step 1 both lead to tractable theory for inference; they are both considered here in order to offer flexibility in practice. The options are motivated by a trade-off between the requirements for smoothness of $g(x,\tau)$ and for the bandwidth. As seen in the next section, for the local linear option, the inference theory has a fairly mild requirement for smoothness: $g(x,\tau)$ needs to have finite second-order derivatives with respect to $x$ over the data support. However, the bandwidth requirement on $h_n$ is more stringent from a practical perspective. It needs to be of lower order than $b_{n,\tau}$ in order not to contaminate $\hat{Q}(\tau|x,z)$ with a bias term; meanwhile, it should not be too small, in order to estimate $\beta(\tau)$ to the desired precision. Although the bandwidth requirements can be easily stated in theory (see Assumption 9), implementing them in practice is not straightforward. The local quadratic option is the opposite. The inference theory has a stronger requirement on smoothness, i.e., $g(x,\tau)$ needs to have finite third-order derivatives with respect to $x$ over the data support. However, the requirement on $h_n$ is now fairly straightforward to satisfy: it is allowed to take on any value of order $n^{-1/(4+d)}$ (the MSE-optimal rate for a local linear quantile regression) to $n^{-1/(6+d)}$ (the optimal rate for a local quadratic quantile regression).

Step 2 allows the bandwidth to differ across quantiles, as in Qu and Yoon (2015). This is desirable because the data can be sparse near the tails of the conditional distribution. The estimation of $\beta(\tau)$ will not affect the distribution of $\hat{\alpha}_0(x,\tau_k)$ and $\hat{\alpha}_1(x,\tau_k)$ asymptotically. For this reason, we do not allow the bandwidth in the first step to vary across quantiles.

Using a local quadratic regression in Step 1 and a local linear regression in Step 2 also has the following advantages. We can derive (see Corollary 1) and apply the MSE-optimal bandwidth for estimating $Q(\tau|x,z)$ in Step 2. Also, we are able to measure the magnitude of the bias affecting $\hat{Q}(\tau|x,z)$ and take this into account when constructing the confidence band. These results appear hard to achieve if local linear regressions are used in both steps.

If a researcher prefers to use a common bandwidth for both Steps 1 and 2 (i.e., $h_n = b_{n,0.5}$), then the following values are permitted by the theory. In the local quadratic specification, one can use $h_n = b_{n,0.5} = Cn^{-1/(4+d)}$, where the constant $C$ can be determined using the MSE-optimal bandwidth formula (Corollary 1). For the local linear case, one can use $h_n = b_{n,0.5} = cn^{-1/(4+d)-\epsilon}$, where $c$ is a constant that in theory can take on any value and $0 < \epsilon < 4/(d(4+d))$.

Step 3 obtains a continuous process over $\mathcal{T}$ out of $m$ points. As seen later, if $m$ is sufficiently large ($m/(nb_{n,\tau})^{1/4}\to\infty$ as $n\to\infty$), the distribution of $\hat{Q}(\tau|x,z)$ will be as if all quantiles in $\mathcal{T}$ had entered Step 1. Neocleous and Portnoy (2008) have a similar result in a parametric setting.

Steps 1 to 3 do not enforce quantile monotonicity ($\hat{Q}(\tau_1|x,z)\le\hat{Q}(\tau_2|x,z)$ for any $\tau_1\le\tau_2$ in $\mathcal{T}$). This can be enforced by applying rearrangement to $\hat{Q}(\tau|x,z)$ (Chernozhukov, Fernández-Val and Galichon, 2010), computing, for all $\tau\in\mathcal{T}$,
$$\inf\left\{y\in\mathbb{R} : \int_{\mathcal{T}}1\left(\hat{Q}(u|x,z)\le y\right)du \ge \tau-\tau_1\right\},$$
where $\tau_1$ is the lower limit of $\mathcal{T}$. This has no first-order effect on $\hat{Q}(\tau|x,z)$ if $(nb_{n,\tau}^d)^{1/2}(\hat{Q}(\tau|x,z)-Q(\tau|x,z))$ converges to a continuous Gaussian process, which will be the case as seen in the next section. To save notation, we continue to call the rearranged estimator $\hat{Q}(\tau|x,z)$.


4 Asymptotic properties

This section studies the distribution of $\sqrt{nb_{n,\tau}^d}\,(\hat{Q}(\tau|x,z)-Q(\tau|x,z))$. The strategy is to sequentially study the three estimators from Step 1 to Step 3, building each result on the preceding ones. Below, we first define some notation and state the assumptions needed.

Assumption 1 $\{(x_i,z_i,y_i)\}_{i=1}^n$ is a sample of $n$ observations that are i.i.d. as $(X,Z,Y)$.

Let $f(\cdot)$ denote the marginal density of $X$ and $f(\cdot|X,Z)$ the density of $Y$ conditional on $X$ and $Z$. In particular, $f(Q(\tau|X,Z)|X,Z)$ equals the conditional density evaluated at the $\tau$-th conditional quantile, which will be shortened to $f(\tau|X,Z)$ unless confusion can arise. Define $u = [u_1, u_2,\ldots,u_d]'\in\mathbb{R}^d$. By the definition of $q(\cdot)$ in (3),
$$q(u) = [u_1^2,\ldots,u_d^2,\; u_1u_2,\ldots,u_1u_d,\; u_2u_3,\ldots,u_2u_d,\ldots,u_{d-1}u_d]'. \tag{8}$$
Define
$$\bar{u} = \begin{pmatrix}1\\ u\\ q(u)\end{pmatrix}. \tag{9}$$
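For implementations with $d > 1$, the vector $q(u)$ in (8) can be assembled as follows; this small helper is ours, not taken from the authors' code.

```r
## q(u) in (8): squared terms first, then cross products u_j * u_l with
## (j, l), j < l, in lexicographic order.
q_vec <- function(u) {
  d <- length(u)
  cross <- numeric(0)
  if (d > 1) cross <- combn(d, 2, function(jl) u[jl[1]] * u[jl[2]])
  c(u^2, cross)
}
```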

Assumption 2 $X$ and $Z$ each have a compact support, denoted by $S_x$ and $S_z$.

Assumption 3 $f(\tau|x,z)$ is finite over $S_x\times S_z$ at all $\tau\in\mathcal{T}$.

Assumption 4 $Q(\tau|x,z)$ and $\partial Q(\tau|x,z)/\partial x_j$ are Lipschitz continuous with respect to $x$ and $\tau$ over $\mathcal{T}\times S_x\times S_z$ for all $j = 1,\ldots,d$. If the local linear option is used in Step 1: $\partial^2 Q(\tau|x,z)/\partial x_j\partial x_k$ are finite over $\mathcal{T}\times S_x\times S_z$ for all $j,k = 1,\ldots,d$. If the local quadratic option is used in Step 1: $\partial^3 Q(\tau|x,z)/\partial x_j\partial x_k\partial x_l$ are finite and $\partial^2 Q(\tau|x,z)/\partial x_j\partial x_k$ are Lipschitz continuous with respect to $x$ and $\tau$ over $\mathcal{T}\times S_x\times S_z$ for all $j,k,l = 1,\ldots,d$.

Assumption 5 The kernel $K(\cdot)$ is compactly supported, bounded, has finite first-order derivatives and satisfies $K(\cdot)\ge 0$, $\int K(u)du = 1$, $\int uK(u)du = 0$ and $\int\bar{u}\bar{u}'K(u)du < \infty$. There exists a finite $C$ such that $|K(u)-K(v)|\le C\|u-v\|$.


Assumption 6 If the local linear option is used in Step 1: $h_n = O(n^{-1/(4+d)})$ and $\sqrt{nh_n^d}/\log^2 n\to\infty$ as $n\to\infty$. If the local quadratic option is used in Step 1: $h_n = O(n^{-1/(6+d)})$ and $\sqrt{nh_n^d}/\log^2 n\to\infty$ as $n\to\infty$.

Assumption 7 $b_{n,\tau} = c(\tau)b_n$, where $b_n = O(n^{-1/(4+d)})$, $\sqrt{nb_n^d}/\log^2 n\to\infty$ as $n\to\infty$, and $c(\tau)$ is Lipschitz continuous with $c(0.5) = 1$ and $0 < \underline{c}\le c(\tau)\le\bar{c} < \infty$ for all $\tau\in\mathcal{T}$.

Assumptions 1 and 2 are fairly standard; they rule out time series applications. Assumption 4 specifies the smoothness requirement on $Q(\tau|x,z)$ discussed in the previous section. It does not require $\partial^2 Q(\tau|x,z)/\partial x\partial x'\ne 0$, which will be added when determining the MSE-optimal bandwidth.

Assumption 6 is also fairly standard. The rates permit the usual MSE-optimal rates for local linear and quadratic regressions, respectively. This assumption is sufficient for analyzing Step 1 of the estimation procedure; it will be strengthened when analyzing Step 2.

Assumption 7 also permits the usual MSE-optimal rate. It is sufficient for both Steps 1 and 2. The Lipschitz condition is satisfied by the optimal bandwidth derived later.

Now we analyze Step 1 of the estimation procedure. For the local quadratic option, let $\tilde{\theta}(x,\tau)$ be the standardized difference between the estimates and their true values in the Taylor approximation:
$$\tilde{\theta}(x,\tau) = \sqrt{nh_n^d}\begin{pmatrix}\tilde{\beta}_0(x,\tau)-\beta_0(x,\tau)\\ h_n(\tilde{\beta}_1(x,\tau)-\beta_1(x,\tau))\\ h_n^2(\tilde{\beta}_2(x,\tau)-\beta_2(x,\tau))\\ \tilde{\beta}(x,\tau)-\beta(\tau)\end{pmatrix}. \tag{10}$$

The following vector contains all the regressors in the local quadratic regression (divided by the bandwidth when relevant):
$$W_j(x,h_n) = \begin{pmatrix}1\\[2pt] \frac{x_j-x}{h_n}\\[2pt] \frac{q(x_j-x)}{h_n^2}\\[2pt] z_j\end{pmatrix}. \tag{11}$$
For the local linear option, we define $\tilde{\theta}(x,\tau)$ in the same way, but without the elements $h_n^2(\tilde{\beta}_2(x,\tau)-\beta_2(x,\tau))$. In this case,
$$W_j(x,h_n) = \begin{pmatrix}1\\[2pt] \frac{x_j-x}{h_n}\\[2pt] z_j\end{pmatrix}.$$


For both options, let
$$u_j^0(\tau) = y_j - g(x_j,\tau) - z_j'\beta(\tau). \tag{12}$$
Also define
$$M_n(x,\tau) = (nh_n^d)^{-1}\sum_{j=1}^{n}f(\tau|x_j,z_j)\,W_j(x,h_n)W_j(x,h_n)'\,K\!\left(\frac{x_j-x}{h_n}\right),$$
$$S_0(x,\tau) = (nh_n^d)^{-1/2}\sum_{j=1}^{n}\left\{\tau - 1(u_j^0(\tau)\le 0)\right\}W_j(x,h_n)\,K\!\left(\frac{x_j-x}{h_n}\right).$$

Assumption 8 There exists $l > 0$ such that $M_n(x,\tau)$ is finite and its smallest eigenvalue is bounded away from 0 for all $n > l$, uniformly over $S_x\times\mathcal{T}$.

The eigenvalue condition in Assumption 8 ensures identification. It is analogous to requiring $n^{-1}X'X$ to be invertible in a standard linear regression model $Y = X'\theta + u$.

The matrix $M_n(x,\tau)$ to a large extent determines the precision of the estimators $\tilde{\theta}(x,\tau)$ and $\hat{Q}(\tau|x,z)$, so it is worthwhile to take a closer look at it. Its limit depends on the location of $x$ relative to the data support. If $x$ is a fixed value in the interior of $S_x$, then $M_n(x,\tau)$ converges to
$$E\left(f(X)f(\tau|X,Z)\begin{bmatrix}\int\bar{u}\bar{u}'K(u)du & \int\bar{u}K(u)du\,Z'\\ Z\int\bar{u}'K(u)du & ZZ'\end{bmatrix}\,\middle|\;X=x\right) \tag{13}$$
in the local quadratic case and
$$E\left(f(X)f(\tau|X,Z)\begin{bmatrix}\int uu'K(u)du & 0\\ 0 & ZZ'\end{bmatrix}\,\middle|\;X=x\right)$$
in the local linear case. If $x$ is a fixed value on the boundary of $S_x$, or if it is modeled as a sequence that approaches a value on the boundary, then the limit of $M_n(x,\tau)$ will in general be different. In such situations, we call $x$ a boundary point. Following Ruppert and Wand (1994), we model $x$ as
$$x = x_\partial + h_n c\quad\text{for some fixed }c\in\mathrm{supp}(K)\text{ and some fixed }x_\partial\text{ on the boundary of }S_x,$$
and define the following set, which serves as the domain of integration:
$$D_{x,h_n} = \{u\in\mathbb{R}^d : (x+h_n u)\in S_x\}\cap\mathrm{supp}(K).$$
For example, suppose $\mathrm{supp}(K) = [-1,1]$ and $S_x = [0,1]$. If $x = 0$, then $D_{x,h_n} = [0,1]$. If $x = ch_n$ with $c > 0$, then $D_{x,h_n} = [-c,1]$. For a boundary point, $M_n(x,\tau)$ converges to
$$E\left(f(X)f(\tau|X,Z)\begin{bmatrix}\int_{D_{x,h_n}}\bar{u}\bar{u}'K(u)du & \int_{D_{x,h_n}}\bar{u}K(u)du\,Z'\\ Z\int_{D_{x,h_n}}\bar{u}'K(u)du & ZZ'\end{bmatrix}\,\middle|\;X=x\right)$$


in the local quadratic case. In the local linear case, $M_n(x,\tau)$ converges to the same limit as above but with $u$ replacing $\bar{u}$ when defining the integrals.

Lemma 1 Let Assumptions 1 to 6 and 8 hold. Then,
$$\sup_{\tau\in\mathcal{T}}\sup_{x\in S_x}\left\|\tilde{\theta}(x,\tau) - M_n(x,\tau)^{-1}S_0(x,\tau)\right\| = O_p\left((nh_n^d)^{-1/4}\log n + h_n^r\sqrt{nh_n^d}\right),$$
where $r = 2$ in the local linear case and $r = 3$ in the local quadratic case.

This Bahadur representation is uniform over $\mathcal{T}$ and $S_x$. It applies to the nonparametric model (i.e., $Q(\tau|x_i,z_i) = g(x_i,\tau)$) as a special case and can be of independent interest. In the expression, $h_n^r\sqrt{nh_n^d}$ appears because of the difference between $Q(\tau|x_j,z_j)$ and the local approximation, and can therefore be viewed as a bias term, while $(nh_n^d)^{-1/4}\log n$ can be viewed as a variance term. The power $-1/4$, instead of $-1/2$, appears because the indicator function $1(s\le 0)$ is not differentiable at $s = 0$.

Because the Bahadur representation is often a key step towards establishing the asymptotic distribution of a quantile regression estimator, it has appeared in a fair number of studies. A subset of the studies considers semi- or non-parametric models estimated by kernel or local polynomial methods. Chaudhuri (1991, Theorem 3.3) obtains a Bahadur representation for a nonparametric model estimated by local polynomials; the result is pointwise with respect to the covariate value and the quantile index. Chaudhuri, Doksum and Samarov (1997, Lemma 4.1, p.733) and Lee (2003, Lemma 1, p.29) provide representations for nonparametric and partially linear models, while Kong, Linton and Xia (2010) study the general issue of estimating M-regression functions for strongly mixing stationary processes. Their results are uniform with respect to the covariate value but pointwise with respect to the quantile index. Qu and Yoon (2015) obtain a representation that is uniform with respect to the quantile, but pointwise with respect to the covariate value, for a purely nonparametric model.

Guerre and Sabbah (2012) present the first Bahadur representation that is uniform with respect to both $x$ and $\tau$. The main differences of the current result from theirs are twofold. First, the current result is for a semiparametric model. Second, its proof relies on a different strategy. More specifically, the main components of the proof in the appendix are the following. (i) Use Knight's (1998) identity to decompose the criterion function in Step 1 to establish the convergence rate of $\tilde{\theta}(x,\tau)$. This leads to the following result: $P(\sup_{\tau\in\mathcal{T}}\sup_{x\in S_x}\|\tilde{\theta}(x,\tau)\|\le\log n)\to 1$; see Lemma B.3 in the supplementary appendix. This result allows the subsequent analysis to focus on a compact set. (ii) Apply a chaining argument to decompose the stochastic variation of the recentered subgradient over $\mathcal{T}\times S_x\times\{\theta : \|\theta\|\le\log n\}$, and then make use of the monotonicity of the indicator function, say $1(u_j^0(\tau)\le l)$, with respect to both $\tau$ and $l$ to relate different quantiles to each other. This reduces the problem of establishing uniformity over a compact set to establishing it over a grid of points; see Steps 1 and 2 in the proof of Lemma B.1 in the supplementary appendix. (iii) Apply Bernstein's inequality to determine the orders of the various terms from (ii); see Step 3 in the proof of Lemma B.1. This proof can be viewed as a further development of that in Qu and Yoon (2015). The strategy may be useful for analyzing other types of semiparametric conditional quantile models.

The next lemma establishes the order of $\bar{\beta}(\tau) - \beta(\tau)$, where $\bar{\beta}(\tau_k)$ is defined in (6).

Lemma 2 Let the assumptions of Lemma 1 hold. Then, uniformly over $\mathcal{T}$:
$$\bar{\beta}(\tau) - \beta(\tau) = O_p\left(n^{-1/2} + (nh_n^d)^{-3/4}\log n + h_n^r\right),$$
where $r = 2$ in the local linear case and $r = 3$ in the local quadratic case.

The term $n^{-1/2}$ follows from averaging $M_n(x_i,\tau)^{-1}S_0(x_i,\tau)$ over $x_i$. It is always of lower order than $(nb_n^d)^{-1/2}$. Because $\hat{Q}(\tau|x,z)$ cannot converge faster than $(nb_n^d)^{-1/2}$, this term has no first-order effect on $\hat{Q}(\tau|x,z)$ asymptotically. The effect of the remaining two terms on $\hat{Q}(\tau|x,z)$ depends on whether the local linear or quadratic regression is used in Step 1. If the local quadratic regression is used with the bandwidth $h_n\asymp b_n$, then these two terms are both of lower order than $(nb_n^d)^{-1/2}$, provided that Assumptions 6 and 7 also hold. If the local linear regression is used, then for these two terms to be of lower order than $(nb_n^d)^{-1/2}$, we need to further require $(nb_n^d)^{1/2}(nh_n^d)^{-3/4}\log n\to 0$ and $h_n^2(nb_n^d)^{1/2}\to 0$. We strengthen the bandwidth conditions to make these requirements explicit:

Assumption 9 In the local quadratic regression case: $h_n\asymp b_n$. In the local linear regression case: $(nb_n^d)^{1/2}(nh_n^d)^{-3/4}\log n\to 0$ and $(nb_n^d)^{1/2}h_n^2\to 0$.

The next result characterizes the asymptotic properties of $\hat{Q}(\tau|x,z)$.

Theorem 1 Let Assumptions 1 to 9 hold. Assume $m/(nb_n^d)^{1/4}\to\infty$. Assume $x$ and $z$ are fixed values, where $x$ is in the interior of $S_x$. Then
$$\sqrt{nb_{n,\tau}^d}\left(\hat{Q}(\tau|x,z) - Q(\tau|x,z) - b_{n,\tau}^2 B(x,\tau)\right)\Rightarrow G_1(x,\tau),$$
where
$$B(x,\tau) = \frac{1}{2}\,\mathrm{tr}\left(\frac{\partial^2 g(x,\tau)}{\partial x\partial x'}\int uu'K(u)du\right),$$
and $G_1(x,\tau)$ is a mean-zero continuous Gaussian process over $\mathcal{T}$ with
$$E(G_1(x,r)G_1(x,s)) = \frac{r\wedge s - rs}{f(x)E[f(r|X,Z)|X=x]\,E[f(s|X,Z)|X=x]\,(c(r)c(s))^{d/2}}\int K\!\left(\frac{u}{c(r)}\right)K\!\left(\frac{u}{c(s)}\right)du \tag{14}$$
for any $r,s\in\mathcal{T}$; $c(\cdot)$ is defined in Assumption 7, and $f(r|X,Z)$ denotes the conditional density of $Y$ evaluated at the $r$-th conditional quantile.

This result generalizes Qu and Yoon (2015, Theorem 2.1) to the partially linear model. As expected, the limiting distribution is not affected by the estimation of $\beta(\tau)$, and $B(x,\tau)$ is independent of the distribution of $Z$. If $\beta(\tau)$ is constant over $\mathcal{T}$, then $E[f(\tau|X,Z)|X=x] = [\partial g(x,\tau)/\partial\tau]^{-1}$, in which case $Z$ and $\beta(\tau)$ have no effect on $G_1(x,\tau)$, even when $Z$ and $X$ are correlated. The resulting covariance kernel is the same as in Qu and Yoon (2015, Theorem 2.1). If $\beta(\tau)$ is not constant over $\mathcal{T}$, then $E[f(\tau|X,Z)|X=x] = E[\{\partial g(X,\tau)/\partial\tau + Z'(\partial\beta(\tau)/\partial\tau)\}^{-1}|X=x]$, in which case both the shape of $\beta(\tau)$ and the dependence between $Z$ and $X$ will affect $G_1(x,\tau)$.

Besides $E[f(\tau|X,Z)|X=x]$ and $f(x)$, $EG_1(x,\tau)^2$ depends only on the dimension of $X$, the kernel, and its bandwidth. This feature is useful for obtaining an optimal bandwidth that minimizes the mean squared error for estimating $Q(\tau|x,z)$ over $\tau\in\mathcal{T}$.

Corollary 1 Let the assumptions in Theorem 1 hold. Assume $\left|\mathrm{tr}\left(\partial^2 g(x,\tau)/\partial x\partial x'\right)\right| > 0$. Then, for any $\tau\in\mathcal{T}$, the bandwidth that minimizes the (interior) asymptotic MSE of $\hat{Q}(\tau|x,z)$ is
$$h_{n,\tau}^* = \left(\frac{\tau(1-\tau)\,d\int K(u)^2du}{f(x)\left\{E[f(\tau|X,Z)|X=x]\,\mathrm{tr}\left(\frac{\partial^2 g(x,\tau)}{\partial x\partial x'}\int uu'K(u)du\right)\right\}^2}\right)^{1/(4+d)}n^{-1/(4+d)}. \tag{15}$$

This result generalizes Corollary 1 in Qu and Yoon (2015) to the partially linear model. In the proof, we verify that it satisfies the Lipschitz continuity condition in Assumption 7. As expected, the dimension of $Z$ does not affect the bandwidth, but $Z$ can still have an effect through the term $E[f(\tau|X,Z)|X=x]$. The result implies that obtaining an optimal bandwidth for estimating $Q(\tau|x,z)$ over $\mathcal{T}$ is conceptually no more difficult than in the conventional situation of estimating a single quantile.

To compute the bandwidth (15), the main challenge is in estimating $\partial^2 g(x,\tau)/\partial x\partial x'$. We suggest implementing an approximation due to Yu and Jones (1998), which treats $\partial^2 g(x,\tau)/\partial x\partial x'$ as constant across quantiles. Specifically, under such an approximation, we first obtain the optimal bandwidth at the median using (15); then we compute an approximation to $h_{n,\tau}^*$ using
$$\left(\frac{h_{n,\tau}^*}{h_{n,1/2}^*}\right)^{4+d} = 4\tau(1-\tau)\left(\frac{E[f(0.5|X,Z)|X=x]}{E[f(\tau|X,Z)|X=x]}\right)^2.$$
Using a normal reference method (assuming the conditional density to be Gaussian) as in Yu and Jones (1998), the above relationship further simplifies to
$$\left(\frac{h_{n,\tau}^*}{h_{n,1/2}^*}\right)^{4+d} = \frac{2\tau(1-\tau)}{\pi\,\phi(\Phi^{-1}(\tau))^2}, \tag{16}$$
where $\phi$ and $\Phi$ are the density and the cdf of a standard normal random variable.
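In implementation, this normal-reference rule is a one-liner; a minimal sketch in R, where bn_med denotes the Step 2 bandwidth chosen at the median (the function names are ours):

```r
## Quantile-varying bandwidth factor implied by (16); equals 1 at tau = 0.5.
bw_factor <- function(tau, d) {
  (2 * tau * (1 - tau) / (pi * dnorm(qnorm(tau))^2))^(1 / (4 + d))
}
bn_tau <- function(tau, bn_med, d) bn_med * bw_factor(tau, d)
```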

This procedure delivers a sequence of bandwidths that automatically satisfies Assumption 7. The same procedure is also used in Qu and Yoon (2015). Of course, one does not have to use this approximation when $\partial^2 g(x,\tau)/\partial x\partial x'$ can be estimated to the desired precision over $\mathcal{T}$.

Theorem 1 assumes $x$ is an interior point. This simplification leads to transparent expressions

that reveal key features of the estimator. For this reason, in the next section we will continue

to focus on the interior point case when presenting the theoretical results. Meanwhile, we will

accommodate the boundary point situation as follows. First, when discussing the implementation

of the confidence bands, we give formulae that are valid even when $x$ is a boundary point. These

formulae are applied to the R code and used in the simulation study. Second, in Section 7, we

provide a corollary that explicitly characterizes the distribution of $\hat{Q}(\tau|x,z)$ in the boundary point case. The result will show why the formulae used are always valid.

5 Confidence bands

This section considers two approaches to constructing uniform confidence bands for $Q(\tau|x,z)$ over $\mathcal{T}$ at some $x$ and $z$. The first approach applies the asymptotic approximation to the distribution of $\hat{Q}(\tau|x,z)$. The second approach is based on a resampling method, following Parzen, Wei and Ying (1994). For both approaches, we first derive a confidence band that works with undersmoothing. Then, we measure the bias term affecting $\hat{Q}(\tau|x,z)$ and construct a band that accounts for the measurement uncertainty.

We begin by presenting an infeasible confidence band assuming that the nuisance parameters are all known. This result serves as the basis for constructing confidence bands using the asymptotic approximation. It is also needed for proving that the resampling based band is asymptotically valid. Define
$$\sigma_{n,\tau} = \left(nb_{n,\tau}^d\right)^{-1/2}\sqrt{EG_1(x,\tau)^2},$$
where $G_1(x,\tau)$ is defined in Theorem 1.

Corollary 2 Under the conditions of Theorem 1, an asymptotic $p$-percent confidence band for $Q(\tau|x,z)$ over $\tau\in\mathcal{T}$ is given by
$$\left[\hat{Q}(\tau|x,z) - B(x,\tau)b_{n,\tau}^2 - \sigma_{n,\tau}C_p,\;\hat{Q}(\tau|x,z) - B(x,\tau)b_{n,\tau}^2 + \sigma_{n,\tau}C_p\right],$$
where $C_p$ is the $p$-th percentile of $\sup_{\tau\in\mathcal{T}}\left|G_1(x,\tau)/\sqrt{EG_1(x,\tau)^2}\right|$.

5.1 Confidence bands based on asymptotic approximations

A confidence band that is valid with undersmoothing can be obtained by simulating $G_1(x,\tau)$ and then computing $\sigma_{n,\tau}$ and $C_p$. Specifically, by the proof of Theorem 1, $G_1(x,\tau)$ is the limit of
$$e_1'\left[(nb_{n,\tau}^d)^{-1}\sum_{j=1}^{n}f(\tau|x,z_j)K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)\bar{W}_j(x,b_{n,\tau})\bar{W}_j(x,b_{n,\tau})'\right]^{-1}\left(nb_{n,\tau}^d\right)^{-1/2}\sum_{i=1}^{n}\left\{\tau - 1(u_i^0(\tau)\le 0)\right\}\bar{W}_i(x,b_{n,\tau})K\!\left(\frac{x_i-x}{b_{n,\tau}}\right), \tag{17}$$
where
$$\bar{W}_j(x,b_{n,\tau}) = \begin{pmatrix}1\\[2pt] \frac{x_j-x}{b_{n,\tau}}\end{pmatrix}\quad\text{and}\quad e_1 = \begin{pmatrix}1\\ 0_d\end{pmatrix}. \tag{18}$$
Conditional on $\{x_i,z_i\}_{i=1}^n$, the distribution of expression (17) remains unchanged when $u_i^0(\tau)$ is replaced by $u_i - \tau$, where the $u_i$ are i.i.d. Uniform(0,1) random variables independent of $\{x_i,z_i\}_{i=1}^n$. This observation is a slight generalization of the one made by Parzen, Wei and Ying (1994), in the sense that (17) now corresponds to a process over $\mathcal{T}$. Still, expression (17) depends on the nuisance parameter $f(\tau|x,z_j)$. In the simulation section, we replace it with the following estimator (see Koenker, 2005 for a more detailed discussion of this estimator):
$$\hat{f}(\tau|x,z_j) = \frac{2\delta_{n,\tau}}{\hat{Q}(\tau+\delta_{n,\tau}|x,z_j) - \hat{Q}(\tau-\delta_{n,\tau}|x,z_j)}, \tag{19}$$
where $\delta_{n,\tau}$ is a bandwidth parameter. Because $\hat{Q}(\tau|x,z_j)$ converges in probability uniformly over $\mathcal{T}$, $\hat{f}(\tau|x,z_j)$ converges uniformly to $f(\tau|x,z_j)$ if $\delta_{n,\tau}(nb_{n,\tau}^d)^{1/2}\to\infty$.
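Given the fitted quantile process from Step 3, (19) is immediate to compute; a minimal sketch, where Qhat is a function of $\tau$ (e.g. the interpolant from Step 3) and delta is the bandwidth $\delta_{n,\tau}$:

```r
## Difference-quotient estimator (19) of the conditional density.
f_hat <- function(Qhat, tau, delta) {
  2 * delta / (Qhat(tau + delta) - Qhat(tau - delta))
}
```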

In summary, we generate independent copies of
$$e_1'\left[(nb_{n,\tau}^d)^{-1}\sum_{j=1}^{n}\hat{f}(\tau|x,z_j)K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)\bar{W}_j(x,b_{n,\tau})\bar{W}_j(x,b_{n,\tau})'\right]^{-1}\left(nb_{n,\tau}^d\right)^{-1/2}\sum_{i=1}^{n}\left\{\tau - 1(u_i-\tau\le 0)\right\}\bar{W}_i(x,b_{n,\tau})K\!\left(\frac{x_i-x}{b_{n,\tau}}\right) \tag{20}$$
and use the empirical distribution to approximate the distribution of $G_1(x,\tau)$.

Putting the pieces together, the confidence band can be constructed as follows. (i) Simulate (20) keeping $\{x_i,z_i\}_{i=1}^n$ fixed. Repeat this $N$ times and denote the resulting values by $G_1^{(i)}(\tau)$, $i = 1,\ldots,N$. (ii) Compute $s(\tau)^2 = N^{-1}\sum_{i=1}^{N}G_1^{(i)}(\tau)^2$. Then compute $\sup_{\tau\in\mathcal{T}}|G_1^{(i)}(\tau)/s(\tau)|$ for $i = 1,\ldots,N$ and save the $p$-th percentile, denoted $\hat{C}_p$. (iii) Compute $\hat{\sigma}_{n,\tau} = (nb_{n,\tau}^d)^{-1/2}s(\tau)$ and obtain the confidence band as
$$\left[\hat{Q}(\tau|x,z) - \hat{\sigma}_{n,\tau}\hat{C}_p,\;\hat{Q}(\tau|x,z) + \hat{\sigma}_{n,\tau}\hat{C}_p\right]. \tag{21}$$
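Steps (i)-(iii) amount to a few vectorized operations. A minimal sketch, assuming G is an N x m matrix whose rows are simulated copies of (20) on the quantile grid, and n, d, bn_tau (a vector over the grid), Qhat_grid and the coverage level p are given:

```r
## Critical value and band (21) from N simulated copies of (20).
s_tau <- sqrt(colMeans(G^2))                           # s(tau) on the grid
sup_t <- apply(abs(sweep(G, 2, s_tau, "/")), 1, max)   # sup over tau, per draw
Cp    <- quantile(sup_t, p)                            # p-th percentile
sigma <- s_tau / sqrt(n * bn_tau^d)                    # hat{sigma}_{n,tau}
band  <- cbind(lower = Qhat_grid - sigma * Cp,
               upper = Qhat_grid + sigma * Cp)
```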

Corollary 3 Let the conditions of Theorem 1 hold. Assume $nb_{n,\tau}^{d+4}\to 0$ and $\delta_{n,\tau}(nb_{n,\tau}^d)^{1/2}\to\infty$ for all $\tau\in\mathcal{T}$. Then, (21) is an asymptotically valid $p$-percent confidence band for $Q(\tau|x,z)$ over $\tau\in\mathcal{T}$.

The proof follows directly from Corollary 2 and is omitted. A key requirement in the corollary is $nb_{n,\tau}^{d+4}\to 0$. Without it, the bias will in general have a first-order effect on $(nb_{n,\tau}^d)^{1/2}(\hat{Q}(\tau|x,z) - Q(\tau|x,z))$, so the band may not have correct coverage even asymptotically. Even if $nb_{n,\tau}^{d+4}\to 0$ is satisfied, the bias can still have a substantial effect on the distribution in finite samples. Therefore, it is desirable to have a procedure that can take the bias effect into account.

Below, we consider a confidence band that is centered at $\hat{Q}(\tau|x,z) - \hat{B}(x,\tau)b_{n,\tau}^2$, where $\hat{B}(x,\tau)$ is an estimator of the bias term $B(x,\tau)$. Because this method requires estimating the second-order derivative, for consistency we suppose that the local quadratic regression option is used in Step 1 of the estimation procedure in Section 3. We first discuss how to estimate the bias, and then how to account for the estimation uncertainty.

We estimate the bias term $B(x,\tau)$ conditional on $\bar{\beta}(\tau_k)$ as follows. For each $\tau_k\in\{\tau_1,\ldots,\tau_m\}$, minimize
$$\sum_{j=1}^{n}\rho_{\tau_k}\left(y_j - z_j'\bar{\beta}(\tau_k) - \gamma_0 - (x_j-x)'\gamma_1 - q(x_j-x)'\gamma_2\right)K\!\left(\frac{x_j-x}{r_{n,\tau}}\right)$$
over $\gamma_0\in\mathbb{R}$, $\gamma_1\in\mathbb{R}^d$ and $\gamma_2\in\mathbb{R}^{d(d+1)/2}$, where $r_{n,\tau}$ is a bandwidth parameter to be specified later. Apply linear interpolation to obtain
$$\hat{\gamma}_0(x,\tau),\;\hat{\gamma}_1(x,\tau)\;\text{and}\;\hat{\gamma}_2(x,\tau). \tag{22}$$


Compute the bias estimator as
$$\hat{B}(x,\tau) = e_1'\left[(nb_{n,\tau}^d)^{-1}\sum_{j=1}^{n}K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)\bar{W}_j(x,b_{n,\tau})\bar{W}_j(x,b_{n,\tau})'\right]^{-1}\left\{(nb_{n,\tau}^d)^{-1}\sum_{j=1}^{n}\bar{W}_j(x,b_{n,\tau})K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)q\!\left(\frac{x_j-x}{b_{n,\tau}}\right)'\right\}\hat{\gamma}_2(x,\tau), \tag{23}$$
where $\bar{W}_j(x,b_{n,\tau})$ and $e_1$ are defined in (18) and $q(\cdot)$ is given in (3).
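For intuition, at an interior point with $d = 1$, (23) is asymptotically just the local quadratic coefficient times the second kernel moment, since $\hat{\gamma}_2$ estimates $g''(x,\tau)/2$. A simplified sketch under that assumption (the finite-sample formula (23) itself should be used near boundaries):

```r
## Interior-point, d = 1 simplification of (23): B(x, tau) equals
## 0.5 * g''(x, tau) * int u^2 K(u) du, and gamma2_hat estimates g''/2.
mu2_epa <- 0.2                           # int u^2 K(u) du for Epanechnikov
B_hat <- function(gamma2_hat) gamma2_hat * mu2_epa
```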

To account for the estimation uncertainty, we study the joint distribution of $(nb_{n,\tau}^d)^{1/2}(\hat{Q}(\tau|x,z) - \hat{B}(x,\tau)b_{n,\tau}^2 - Q(\tau|x,z))$ and $(nb_{n,\tau}^{d+4})^{1/2}(\hat{B}(x,\tau) - B(x,\tau))$. Let
$$\widetilde{W}_j(x,r_{n,\tau}) = \begin{pmatrix}1\\[2pt] \frac{x_j-x}{r_{n,\tau}}\\[2pt] \frac{q(x_j-x)}{r_{n,\tau}^2}\end{pmatrix}.$$

Assumption 10 The bandwidth satisfies $r_{n,\tau} = \tilde{c}(\tau)r_n$, where $\tilde{c}(\tau)$ is Lipschitz continuous with $\tilde{c}(0.5) = 1$ and $0 < \underline{c}\le\tilde{c}(\tau)\le\bar{c} < \infty$ for all $\tau\in\mathcal{T}$, and $c_1 b_n\le r_n\le c_2 h_n$ for some finite $c_1$ and $c_2$ independent of $n$.

Lemma 3 Let the conditions in Theorem 1 and Assumption 10 hold. Assume a local quadratic regression is used in Step 1 of the estimation procedure. Then, uniformly over $\mathcal{T}$:
$$\sqrt{nb_{n,\tau}^d}\left(\hat{Q}(\tau|x,z) - \hat{B}(x,\tau)b_{n,\tau}^2 - Q(\tau|x,z)\right) = D_1(x,\tau) - D_2(x,\tau) + o_p(1),$$
where
$$D_1(x,\tau) = \frac{(nb_{n,\tau}^d)^{-1/2}\sum_{i=1}^{n}\left\{\tau - 1\left(u_i^0(\tau)\le 0\right)\right\}K\!\left(\frac{x_i-x}{b_{n,\tau}}\right)}{f(x)E[f(\tau|X,Z)|X=x]},$$
$$D_2(x,\tau) = \left(\frac{\sqrt{nb_{n,\tau}^{d+4}}}{\sqrt{nr_{n,\tau}^{d+4}}}\right)\Delta\,\frac{(nr_{n,\tau}^d)^{-1/2}\sum_{i=1}^{n}\left\{\tau - 1\left(u_i^0(\tau)\le 0\right)\right\}\widetilde{W}_i(x,r_{n,\tau})K\!\left(\frac{x_i-x}{r_{n,\tau}}\right)}{f(x)E[f(\tau|X,Z)|X=x]},$$
$$\Delta = \left(\int q(u)K(u)du\right)'e_3'\left(\int\bar{u}\bar{u}'K(u)du\right)^{-1},$$
and $e_3'$ selects the last $d(d+1)/2$ elements of a vector; $q(u)$ and $\bar{u}$ are defined in (8) and (9).

The term $D_1(x,\tau)$ is an approximation to $(nb_{n,\tau}^d)^{1/2}(\hat{Q}(\tau|x,z) - B(x,\tau)b_{n,\tau}^2 - Q(\tau|x,z))$, while $D_2(x,\tau)$ approximates $(nb_{n,\tau}^d)^{1/2}b_{n,\tau}^2(\hat{B}(x,\tau) - B(x,\tau))$. If the two bandwidths $r_{n,\tau}$ and $b_n$ are of the same order (i.e., $r_{n,\tau}/b_n = \kappa(\tau)$ with $0 < \kappa(\tau) < \infty$ independent of $n$ for all $\tau\in\mathcal{T}$), then $D_1(x,\tau)$ and $D_2(x,\tau)$ will be of the same stochastic order. If $r_{n,\tau}/b_n\to\infty$ for all $\tau\in\mathcal{T}$, then $D_2(x,\tau)$ will converge to zero. In this case, including $D_2(x,\tau)$ leads to a refinement of the conventional asymptotic approximation.

The above approximation has two features. First, $D_1(x,\tau)$ and $D_2(x,\tau)$ both depend on $1(u_i^0(\tau)\le 0)$ $(i = 1,\ldots,n)$. Second, they are both conditionally pivotal. These two features imply that the joint distribution of $D_1(x,\tau)$ and $D_2(x,\tau)$ conditional on $\{x_i,z_i\}_{i=1}^n$ can be simulated by drawing $u_i\sim$ i.i.d. Uniform(0,1) and evaluating $D_1(x,\tau)$ and $D_2(x,\tau)$ with $u_i^0(\tau)$ replaced by $u_i-\tau$. That is, the joint distribution can be estimated by simulating (20) for $D_1(x,\tau)$ and simulating
$$\left(\frac{\sqrt{nb_{n,\tau}^{d+4}}}{\sqrt{nr_{n,\tau}^{d+4}}}\right)e_1'\left[(nb_{n,\tau}^d)^{-1}\sum_{j=1}^{n}K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)\bar{W}_j(x,b_{n,\tau})\bar{W}_j(x,b_{n,\tau})'\right]^{-1}\left\{(nb_{n,\tau}^d)^{-1}\sum_{j=1}^{n}\bar{W}_j(x,b_{n,\tau})K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)q\!\left(\frac{x_j-x}{b_{n,\tau}}\right)'\right\} \\ \times\, e_3'\left((nr_{n,\tau}^d)^{-1}\sum_{j=1}^{n}\hat{f}(\tau|x,z_j)K\!\left(\frac{x_j-x}{r_{n,\tau}}\right)\widetilde{W}_j(x,r_{n,\tau})\widetilde{W}_j(x,r_{n,\tau})'\right)^{-1}(nr_{n,\tau}^d)^{-1/2}\sum_{i=1}^{n}\left\{\tau - 1(u_i-\tau\le 0)\right\}\widetilde{W}_i(x,r_{n,\tau})K\!\left(\frac{x_i-x}{r_{n,\tau}}\right) \tag{24}$$
for $D_2(x,\tau)$, where $e_3$ and $e_1$ have the same definitions as before.

Putting the pieces together, the confidence band can be obtained as follows. (i) Simulate (20) and (24) keeping $\{x_i,z_i\}_{i=1}^n$ fixed. Repeat $N$ times and denote the values by $G_1^{(i)}(\tau)$ and $G_2^{(i)}(\tau)$, $i = 1,\ldots,N$. (ii) Compute $s(\tau)^2 = N^{-1}\sum_{i=1}^{N}(G_1^{(i)}(\tau) - G_2^{(i)}(\tau))^2$. Then compute $\sup_{\tau\in\mathcal{T}}|(G_1^{(i)}(\tau) - G_2^{(i)}(\tau))/s(\tau)|$ for all $i = 1,\ldots,N$ and save its $p$-th percentile $\hat{C}_p$. (iii) Compute $\hat{\sigma}_{n,\tau} = (nb_{n,\tau}^d)^{-1/2}s(\tau)$ and obtain the confidence band as
$$\left[\hat{Q}(\tau|x,z) - \hat{B}(x,\tau)b_{n,\tau}^2 - \hat{\sigma}_{n,\tau}\hat{C}_p,\;\hat{Q}(\tau|x,z) - \hat{B}(x,\tau)b_{n,\tau}^2 + \hat{\sigma}_{n,\tau}\hat{C}_p\right], \tag{25}$$
where $\hat{B}(x,\tau)$ is given by (23).

The next result establishes the limit of $(nb_{n,\tau}^d)^{1/2}(\hat{Q}(\tau|x,z) - \hat{B}(x,\tau)b_{n,\tau}^2 - Q(\tau|x,z))$ under two bandwidth sequences: $r_{n,\tau}/b_n = \kappa(\tau)$ with $0 < \kappa(\tau) < \infty$, and $r_{n,\tau}/b_n\to\infty$. The result implies that the band has correct asymptotic coverage under both sequences.


Theorem 2 Let the conditions in Theorem 1 and Assumption 10 hold. Assume a local quadratic regression is used in Step 1 of the estimation procedure. If $r_{n,\tau}/b_n = \kappa(\tau)$ with $0 < \kappa(\tau) < \infty$ over $\mathcal{T}$, then
$$\sqrt{nb_{n,\tau}^d}\left(\hat{Q}(\tau|x,z) - \hat{B}(x,\tau)b_{n,\tau}^2 - Q(\tau|x,z)\right)\Rightarrow G_1(x,\tau) - \left(c(\tau)/\kappa(\tau)\right)^{2+\frac{d}{2}}G_2(x,\tau),$$
where $G_1(x,\tau)$ is defined in Theorem 1 and $G_2(x,\tau)$ is a vector of mean-zero continuous Gaussian processes over $\mathcal{T}$ satisfying
$$E(G_2(x,t)G_2(x,s)') = \frac{t\wedge s - ts}{f(x)\,(\kappa(t)\kappa(s))^{d/2}E[f(t|X,Z)|X=x]\,E[f(s|X,Z)|X=x]}\int\lambda(u,t)\lambda(u,s)'K\!\left(\frac{u}{\kappa(t)}\right)K\!\left(\frac{u}{\kappa(s)}\right)du,$$
$$E(G_1(x,t)G_2(x,s)') = \frac{t\wedge s - ts}{f(x)\,(c(t)\kappa(s))^{d/2}E[f(t|X,Z)|X=x]\,E[f(s|X,Z)|X=x]}\int\lambda(u,s)'K\!\left(\frac{u}{c(t)}\right)K\!\left(\frac{u}{\kappa(s)}\right)du,$$
where $\lambda(u,t) = \left[1,\;u'/\kappa(t),\;q(u)'/\kappa(t)^2\right]'$. If $r_{n,\tau}/b_n\to\infty$, then
$$\sqrt{nb_{n,\tau}^d}\left(\hat{Q}(\tau|x,z) - \hat{B}(x,\tau)b_{n,\tau}^2 - Q(\tau|x,z)\right)\Rightarrow G_1(x,\tau)\quad\text{over }\tau\in\mathcal{T}.$$

The next result considers the same situation as in Theorem 2, but restricting $r_{n,\tau} = b_{n,\tau}$.

Corollary 4 Assume the conditions in Theorem 2 hold with $r_{n,\tau} = b_{n,\tau}$ for all $\tau\in\mathcal{T}$. Define $\bar{Q}(\tau|x,z) = \hat{\gamma}_0(x,\tau) + z'\bar{\beta}(\tau)$, where $\hat{\gamma}_0(x,\tau)$ is given by (22). Then,
$$\sqrt{nb_{n,\tau}^d}\left(\bar{Q}(\tau|x,z) - Q(\tau|x,z)\right)\Rightarrow G_1(x,\tau) - G_2(x,\tau),$$
where $G_1(x,\tau)$ and $G_2(x,\tau)$ are the processes in Theorem 2 with $\kappa(\tau) = c(\tau)$ for all $\tau\in\mathcal{T}$.

Therefore, with $r_{n,\tau} = b_{n,\tau}$, the following two estimators of $Q(\tau|x,z)$ are asymptotically equivalent: (a) the estimator using the local quadratic regression (22), and (b) the estimator using the local linear regression with bias correction. A similar equivalence result is obtained in Calonico, Cattaneo and Titiunik (2014) for estimating the conditional mean function.

5.2 Confidence bands based on resampling

We first give a procedure that requires undersmoothing, and then a procedure that measures the bias and accounts for its uncertainty. These procedures involve only simple, though repetitive, operations. They do not require estimating any conditional or marginal density function.


Recall that the estimation procedure in Section 3 produces $\hat\beta_0(x,\tau)$ and $\hat\beta(\tau)$ as estimates of $g(x,\tau)$ and $\beta(\tau)$. The resampling procedure consists of three steps.

STEP B1: For all $k\in\{1,\dots,m\}$, find $\beta_0^*(x,\tau_k)$ and $\beta_1^*(x,\tau_k)$ that solve

$$(nb^d_{n,\tau_k})^{-1/2}\sum_{j=1}^n\left\{\tau_k-1\!\left(y_j-z_j'\hat\beta(\tau_k)-\beta_0^*(x,\tau_k)-(x_j-x)'\beta_1^*(x,\tau_k)\le0\right)\right\}\bar W_j(x,b_{n,\tau_k})K\!\left(\frac{x_j-x}{b_{n,\tau_k}}\right)=-(nb^d_{n,\tau_k})^{-1/2}\sum_{j=1}^n\left\{\tau_k-1(u_j-\tau_k\le0)\right\}\bar W_j(x,b_{n,\tau_k})K\!\left(\frac{x_j-x}{b_{n,\tau_k}}\right), \tag{26}$$

where $u_j$ are i.i.d. Uniform(0,1) random variables independent of $\{x_i,z_i\}_{i=1}^n$ and $\bar W_j(\cdot,\cdot)$ is defined in (18). As in Parzen, Wei, and Ying (1994), the solutions can be found using augmented quantile regressions. Let $U_{n+1}$ equal the right hand side of (26). Consider a local linear quantile regression with one new observation:

$$\min_{a_0,a_1}\ \sum_{j=1}^n\rho_{\tau_k}\!\left(y_j-z_j'\hat\beta(\tau_k)-a_0-(x_j-x)'a_1\right)K\!\left(\frac{x_j-x}{b_{n,\tau_k}}\right)+\rho_{\tau_k}\!\left(y_{n+1}-z_{n+1}'\hat\beta(\tau_k)-x_{n+1,0}\,a_0-x_{n+1,1}'\,a_1\right),$$

where $z_{n+1}=0$, $\left(x_{n+1,0},\ (x_{n+1,1}/b_{n,\tau_k})'\right)'=-\tau_k^{-1}\sqrt{nb^d_{n,\tau_k}}\,U_{n+1}$, and $y_{n+1}$ is a large number such that $1(y_{n+1}-z_{n+1}'\hat\beta(\tau_k)-x_{n+1,0}\,a_0-x_{n+1,1}'\,a_1\le0)$ is always zero. Repeat the above estimation to obtain $N$ independent copies of $\beta_0^*(x,\tau)$ and call them $\beta_0^{*(j)}(x,\tau)$ $(j=1,\dots,N)$.

STEP B2: Compute $\sigma^*(x,\tau)^2=N^{-1}\sum_{j=1}^N(\beta_0^{*(j)}(x,\tau)-\hat\beta_0(x,\tau))^2$ and, for each $j$, $\sup_{\tau\in\mathcal T}|(\beta_0^{*(j)}(x,\tau)-\hat\beta_0(x,\tau))/\sigma^*(x,\tau)|$. Denote the $p$-th percentile of the latter by $C_p^*$.

STEP B3: Compute the confidence band as

$$\left[\hat Q(\tau|x,z)-\sigma^*(x,\tau)C_p^*,\ \ \hat Q(\tau|x,z)+\sigma^*(x,\tau)C_p^*\right].$$
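STEPs B2 and B3 reduce to elementary array operations once the $N$ draws from STEP B1 are stored. A minimal sketch follows, assuming `beta_star` (an $N\times m$ array of the draws $\beta_0^{*(j)}(x,\tau_k)$), `beta_hat` (the point estimates $\hat\beta_0(x,\tau_k)$), and `Q_hat` are precomputed; these names are ours.

```python
import numpy as np

# STEPs B2-B3: standardize the resampling draws, take the sup over the
# quantile grid draw by draw, and read off the p-th percentile.
def resampling_band(beta_star, beta_hat, Q_hat, p=0.90):
    dev = beta_star - beta_hat                        # (N, m) deviations
    sigma_star = np.sqrt((dev**2).mean(axis=0))       # pointwise std over draws
    sup_stats = np.abs(dev / sigma_star).max(axis=1)  # sup over T, per draw
    Cp_star = np.quantile(sup_stats, p)               # p-th percentile
    return Q_hat - sigma_star * Cp_star, Q_hat + sigma_star * Cp_star
```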

The next result implies that this confidence band is asymptotically valid.

Theorem 3 Under the conditions of Theorem 1, we have

$$\sqrt{nb^d_{n,\tau}}\left(\beta_0^*(x,\tau)-\hat\beta_0(x,\tau)\right)=D_1^*(x,\tau)+o_p(1),$$

where $D_1^*(x,\tau)$ equals $D_1(x,\tau)$ with $u_j^0(\tau)$ replaced by $u_j-\tau$ and the order holds uniformly over $\mathcal T$. Further, $D_1^*(x,\tau)\Rightarrow G_1(x,\tau)$ over $\tau\in\mathcal T$.

The confidence band that reflects the bias estimation uncertainty also consists of three steps. Recall that the bias $B(x,\tau)$ is estimated by $\hat B(x,\tau)$ in (23).

STEP R1: Find $\gamma_0^*(x,\tau_k)$, $\gamma_1^*(x,\tau_k)$ and $\gamma_2^*(x,\tau_k)$ that solve

$$(nr^d_{n,\tau_k})^{-1/2}\sum_{j=1}^n\left\{\tau_k-1\!\left(y_j-z_j'\hat\beta(\tau_k)-\gamma_0^*(x,\tau_k)-(x_j-x)'\gamma_1^*(x,\tau_k)-q(x_j-x)'\gamma_2^*(x,\tau_k)\le0\right)\right\}\widetilde W_j(x,r_{n,\tau_k})K\!\left(\frac{x_j-x}{r_{n,\tau_k}}\right)=-(nr^d_{n,\tau_k})^{-1/2}\sum_{j=1}^n\left\{\tau_k-1(u_j-\tau_k\le0)\right\}\widetilde W_j(x,r_{n,\tau_k})K\!\left(\frac{x_j-x}{r_{n,\tau_k}}\right),$$

where $u_j$ are i.i.d. Uniform(0,1) random variables independent of $\{x_i,z_i\}_{i=1}^n$. Apply linear interpolation to obtain $\gamma_2^*(x,\tau)$ and then compute $B^*(x,\tau)$ using (23) with $\hat\gamma_2(x,\tau)$ replaced by $\gamma_2^*(x,\tau)$. Repeat this $N$ times and denote the estimates by $B^{*(j)}(x,\tau)$ $(j=1,\dots,N)$.

STEP R2: Compute $\sigma^*(x,\tau)^2=N^{-1}\sum_{j=1}^N[(\beta_0^{*(j)}(x,\tau)-\hat\beta_0(x,\tau))-b^2_{n,\tau}(B^{*(j)}(x,\tau)-\hat B(x,\tau))]^2$ and, for each $j$, $\sup_{\tau\in\mathcal T}|[(\beta_0^{*(j)}(x,\tau)-\hat\beta_0(x,\tau))-b^2_{n,\tau}(B^{*(j)}(x,\tau)-\hat B(x,\tau))]/\sigma^*(x,\tau)|$. Call the $p$-th percentile $C_p^*$.

STEP R3: Compute the confidence band as

$$\left[\hat Q(\tau|x,z)-\hat B(x,\tau)b^2_{n,\tau}-\sigma^*(x,\tau)C_p^*,\ \ \hat Q(\tau|x,z)-\hat B(x,\tau)b^2_{n,\tau}+\sigma^*(x,\tau)C_p^*\right].$$

In the procedure, the distributions of $(\beta_0^*(x,\tau)-\hat\beta_0(x,\tau))$ and $(B^*(x,\tau)-\hat B(x,\tau))$ are used to estimate those of $(\hat Q(\tau|x,z)-B(x,\tau)b^2_{n,\tau}-Q(\tau|x,z))$ and $(\hat B(x,\tau)-B(x,\tau))$. The next result implies that this confidence band is asymptotically valid.

Corollary 5 Let the conditions in Theorem 1 and Assumption 10 hold. Assume a local quadratic regression is used in Step 1 of the estimation procedure. Then

$$\sqrt{nb^d_{n,\tau}}\left((\beta_0^*(x,\tau)-\hat\beta_0(x,\tau))-b^2_{n,\tau}\left(B^*(x,\tau)-\hat B(x,\tau)\right)\right)=D_1^*(x,\tau)-D_2^*(x,\tau)+o_p(1),$$

where $D_1^*(x,\tau)$ and $D_2^*(x,\tau)$ equal $D_1(x,\tau)$ and $D_2(x,\tau)$ with $u_j^0(\tau)$ replaced by $u_j-\tau$, and the order holds uniformly over $\mathcal T$.

Between the two approaches, the resampling approach is simpler to implement because it does not require estimating any nuisance parameter directly. But it is also computationally more intensive. Therefore, we may expect the resampling approach to be attractive when the sample size is relatively small, and the asymptotic approximation based approach to be more useful otherwise.


6 Hypothesis tests

This section applies the methods developed so far to test the hypotheses in Section 2. Let $(x_1,z_1)$ and $(x_2,z_2)$ be two sets of covariate values of interest. Let $\hat w(\tau)\ge0$ be a weight function chosen by the user that satisfies $\hat w(\tau)\to_p w(\tau)$ uniformly over $\mathcal T$, where $w(\tau)$ is a deterministic continuous function of $\tau$. For example, $\hat w(\tau)$ can equal 1 over $\mathcal T$. Define

$$\delta(\tau)=Q(\tau|x_1,z_1)-Q(\tau|x_2,z_2).$$

Let

$$\hat\delta(\tau)=\hat Q(\tau|x_1,z_1)-\hat Q(\tau|x_2,z_2)$$

and

$$W(\tau)=\sqrt{nb^d_{n,\tau}}\,\hat w(\tau)\left(\hat\delta(\tau)-b^2_{n,\tau}\left(\hat B(x_1,\tau)-\hat B(x_2,\tau)\right)\right).$$

The three hypotheses outlined in Section 2 can be tested as follows:

$$\text{Significance:}\quad W_S(\mathcal T)=\sup_{\tau\in\mathcal T}|W(\tau)|;$$

$$\text{Homogeneity:}\quad W_H(\mathcal T)=\sup_{\tau\in\mathcal T}\left|W(\tau)-\frac{\sqrt{nb^d_{n,\tau}}\,\hat w(\tau)}{\int_{s\in\mathcal T}\sqrt{nb^d_{n,s}}\,\hat w(s)\,ds}\int_{\tau\in\mathcal T}W(\tau)\,d\tau\right|;$$

$$\text{Dominance:}\quad W_A(\mathcal T)=\sup_{\tau\in\mathcal T}\left|1\left(W(\tau)\le0\right)W(\tau)\right|.$$

To present the asymptotic approximation, define

$$D_3(\tau)=w(\tau)\left\{\left[D_1(x_1,\tau)-D_2(x_1,\tau)\right]-\left[D_1(x_2,\tau)-D_2(x_2,\tau)\right]\right\}, \tag{27}$$

where $D_1(x,\tau)$ and $D_2(x,\tau)$ are defined in Lemma 3.

Corollary 6 Let the conditions in Theorem 1 and Assumption 10 hold. Assume a local quadratic regression is used in Step 1 of the estimation procedure. Also, assume $\hat w(\tau)\to_p w(\tau)$ uniformly over $\mathcal T$. Then:

1. Significance: Under $\delta(\tau)=0$ for all $\tau\in\mathcal T$, $W_S(\mathcal T)-\sup_{\tau\in\mathcal T}|D_3(\tau)|=o_p(1)$.

2. Homogeneity: Under $\delta(\tau)=\delta$ for all $\tau\in\mathcal T$ for some $\delta\in\mathbb R$,

$$W_H(\mathcal T)-\sup_{\tau\in\mathcal T}\left|D_3(\tau)-\frac{\sqrt{nb^d_{n,\tau}}\,w(\tau)}{\int_{s\in\mathcal T}\sqrt{nb^d_{n,s}}\,w(s)\,ds}\int_{\tau\in\mathcal T}D_3(\tau)\,d\tau\right|=o_p(1).$$

3. Dominance: Under the least favorable null hypothesis of $\delta(\tau)=0$ for all $\tau\in\mathcal T$,

$$W_A(\mathcal T)-\sup_{\tau\in\mathcal T}\left|1\left(D_3(\tau)\le0\right)D_3(\tau)\right|=o_p(1).$$

The proof consists of applying Lemma 3 twice, to $(x_1,z_1)$ and $(x_2,z_2)$; it is omitted. In implementation, the distributions of $D_1(x_1,\tau)-D_2(x_1,\tau)$ and $D_1(x_2,\tau)-D_2(x_2,\tau)$ can be simulated using (20) and (24) after replacing $x$ with $x_1$ and $x_2$. The above procedure involves bias estimation. If instead one wishes to work with undersmoothing, then $W(\tau)$ needs to be constructed as $(nb^d_{n,\tau})^{1/2}\hat w(\tau)\hat\delta(\tau)$ and $D_3(\tau)$ as $w(\tau)\{D_1(x_1,\tau)-D_1(x_2,\tau)\}$. The simulation then only needs (20) after replacing $x$ with $x_1$ and $x_2$.
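On a discrete quantile grid, the three statistics reduce to maxima and Riemann sums. A minimal sketch follows, with `W` holding $W(\tau_k)$ on an equally spaced grid with spacing `d_tau` and `a` holding $\sqrt{nb^d_{n,\tau_k}}\,\hat w(\tau_k)$; both are assumed precomputed.

```python
import numpy as np

# Compute the significance, homogeneity, and dominance statistics,
# approximating the integrals over T by Riemann sums on the grid.
def test_statistics(W, a, d_tau):
    WS = np.abs(W).max()
    weight = a / (a.sum() * d_tau)                 # sqrt(n b^d) w / its integral
    WH = np.abs(W - weight * W.sum() * d_tau).max()
    WA = np.abs(np.where(W <= 0, W, 0.0)).max()    # penalize negative parts only
    return WS, WH, WA
```

Critical values are then obtained by applying the same function to the simulated draws of $D_3(\tau)$.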

7 Extensions

This section considers two empirically important situations. The first concerns large sample sizes: we give a simpler procedure that is less efficient but has the same rate of convergence as the estimator in Section 3. The second concerns boundary points: we study the asymptotic properties of the estimator in Section 3, with and without bias correction, at a boundary point. The results show why the implementations in Section 5 are valid for both interior and boundary points.

7.1 Large sample size

Step 1 of the estimation procedure enables us to estimate the parametric component of the model using information from the full sample. If the sample is large (e.g., containing millions of observations), then this step can be computationally infeasible. In such a situation, if a loss in efficiency is considered acceptable, then both components of the model can be estimated jointly using only information local to $x$. That is, we directly solve the following estimation problem:

$$\min_{a_0\in\mathbb R,\ a_1\in\mathbb R^d,\ b\in\mathbb R^q}\ \sum_{j=1}^n\rho_{\tau_k}\!\left(y_j-a_0-(x_j-x)'a_1-z_j'b\right)K\!\left(\frac{x_j-x}{b_{n,\tau_k}}\right).$$

Denote the estimates after linear interpolation by $\hat Q_l(\tau|x,z)$. Here, no averaging over $x_i$ is involved. The main computational cost comes from solving the above minimization problem separately at the $m$ quantile levels.
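Because $\rho_\tau(cu)=c\,\rho_\tau(u)$ for any $c\ge0$, the kernel weights can be absorbed by rescaling the response and the regressors, after which any off-the-shelf quantile regression routine applies. The following minimal sketch uses statsmodels' QuantReg; the helper name and the kernel interface (mapping $(n,d)$ scaled distances to $(n,)$ weights) are our own assumptions, not part of the paper's code.

```python
import numpy as np
import statsmodels.api as sm

# Local joint estimation at a single (x0, tau): minimize the kernel-weighted
# check loss by running quantile regression on weight-rescaled data.
def local_joint_qr(y, X, Z, x0, tau, bw, kernel):
    w = kernel((X - x0) / bw)            # kernel weights, one per observation
    keep = w > 0                         # only locally relevant observations
    design = np.column_stack([np.ones(keep.sum()), X[keep] - x0, Z[keep]])
    fit = sm.QuantReg(w[keep] * y[keep],
                      w[keep][:, None] * design).fit(q=tau)
    return fit.params                    # (a0, a1', b'); a0 estimates the level
```

Looping this over the grid $\tau_1,\dots,\tau_m$ and interpolating the intercepts reproduces the structure of $\hat Q_l(\tau|x,z)$.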


To characterize the distribution of the estimator, let

$$M_l(x,\tau)=E\left(f(X)f(\tau|X,Z)\begin{bmatrix}1&Z'\\ Z&ZZ'\end{bmatrix}\ \middle|\ X=x\right),\qquad J_l(x,\tau)=E\left(f(X)f(\tau|X,Z)\begin{bmatrix}1\\ Z\end{bmatrix}\ \middle|\ X=x\right).$$

Corollary 7 Under the conditions of Theorem 1, uniformly over $\mathcal T$:

$$\sqrt{nb^d_{n,\tau}}\left(\hat Q_l(\tau|x,z)-Q(\tau|x,z)-b^2_{n,\tau}B_l(x,z,\tau)\right)=D_{1,l}(x,z,\tau)+o_p(1),$$

where

$$D_{1,l}(x,z,\tau)=e_z'M_l(x,\tau)^{-1}(nb^d_{n,\tau})^{-1/2}\sum_{j=1}^n\left\{\tau-1(u_j^0(\tau)\le0)\right\}\begin{bmatrix}1\\ z_j\end{bmatrix}K\!\left(\frac{x_j-x}{b_{n,\tau}}\right),$$

$$B_l(x,z,\tau)=e_z'M_l(x,\tau)^{-1}J_l(x,\tau)B(x,\tau),\qquad e_z=[\,1\ \ z'\,]',$$

and $B(x,\tau)$ is defined in Theorem 1.

The distribution of $D_{1,l}(x,z,\tau)$ is conditionally pivotal. It can be simulated as in Section 5. The bias now depends on the relationship between $X$ and $Z$. Nevertheless, it can still be estimated, thereby permitting a bias-corrected estimator. Let $\hat B_l(x,z,\tau)$ denote the estimator of $B_l(x,z,\tau)$ obtained by running the local quadratic regression (22) and applying the formula (23), except that all the parameters are estimated jointly and that $e_1'$ and $\bar W_j(x,b_{n,\tau})$ in (23) are replaced by $[1\ 0_d'\ z']$ and $W_j(x,b_{n,\tau})$ (i.e., with the $Z$ component appended to these two vectors). The following result mirrors Lemma 3 in Section 5. Its proof is similar and omitted.

Corollary 8 Let the conditions in Theorem 1 and Assumption 10 hold. Then, uniformly over $\mathcal T$:

$$\sqrt{nb^d_{n,\tau}}\left(\hat Q_l(\tau|x,z)-\hat B_l(x,z,\tau)b^2_{n,\tau}-Q(\tau|x,z)\right)=D_{1,l}(x,z,\tau)-D_{2,l}(x,z,\tau)+o_p(1),$$

where $D_{1,l}(x,z,\tau)$ is given in Corollary 7,

$$D_{2,l}(x,z,\tau)=\left\{\left(\frac{\sqrt{nb^{d+4}_{n,\tau}}}{\sqrt{nr^{d+4}_{n,\tau}}}\right)e_z'M_l(x,\tau)^{-1}J_l(x,\tau)\left(\int q(u)K(u)du\right)'\right\}\times e_2'M(x,\tau)^{-1}(nr^d_{n,\tau})^{-1/2}\sum_{j=1}^n\left\{\tau-1\left(u_j^0(\tau)\le0\right)\right\}W_j(x,r_{n,\tau})K\!\left(\frac{x_j-x}{r_{n,\tau}}\right),$$

where $e_2$ selects the $(d+2)$-th to the $(d^2+3d+4)/2$-th elements of a vector, $M(x,\tau)$ equals (13), and $W_j(x,r_{n,\tau})$ is defined in (11).


The above two results imply that $\hat Q_l(\tau|x,z)$ and $\hat Q_l(\tau|x,z)-\hat B_l(x,z,\tau)b^2_{n,\tau}$ can be used to construct asymptotic or resampling based confidence bands as in Section 5. The main difference in implementation is that the parametric component is re-estimated whenever the nonparametric component is estimated. We omit the details.

7.2 Boundary point

This subsection provides two results characterizing the distributions of $\hat Q(\tau|x,z)$ and $\hat Q(\tau|x,z)-\hat B_v(x,\tau)b^2_{n,\tau}$ in the boundary point situation.

Corollary 9 Let Assumptions 1 to 9 hold. Assume $m/(nb_n^d)^{1/4}\to\infty$. Then

$$\sqrt{nb^d_{n,\tau}}\left(\hat Q(\tau|x,z)-Q(\tau|x,z)-b^2_{n,\tau}B_v(x,\tau)\right)=D_{1,v}(x,\tau)+o_p(1),$$

where

$$D_{1,v}(x,\tau)=e_1'\left[f(x)E(f(\tau|X,Z)|X=x)N_x(\tau)\right]^{-1}\times\left(nb^d_{n,\tau}\right)^{-1/2}\sum_{i=1}^n\left\{\tau-1\left(u_i^0(\tau)\le0\right)\right\}\bar W_i(x,b_{n,\tau})K\!\left(\frac{x_i-x}{b_{n,\tau}}\right),$$

$$N_x(\tau)=\int_{D_{x,b_{n,\tau}}}\begin{bmatrix}1\\ u\end{bmatrix}\left[\,1\ \ u'\,\right]K(u)\,du, \tag{28}$$

$$B_v(x,\tau)=\frac12\,e_1'N_x(\tau)^{-1}\int_{D_{x,b_{n,\tau}}}u'\frac{\partial^2g(x,\tau)}{\partial x\partial x'}u\begin{bmatrix}1\\ u\end{bmatrix}K(u)\,du,$$

and the domain of integration is determined by the location of $x$ relative to the boundary: $D_{x,b_{n,\tau}}=\{u\in\mathbb R^d:(x+b_{n,\tau}u)\in S_x\}\cap\mathrm{supp}(K)$.

If $D_{x,b_{n,\tau}}=\mathrm{supp}(K)$, then $N_x(\tau)$ is block diagonal and $D_{1,v}(x,\tau)$ and $B_v(x,\tau)$ reduce to $D_1(x,\tau)$ and $B(x,\tau)$ in Lemma 3 and Theorem 1. Therefore, the result covers the interior point situation as a special case. The first part of formula (17) is a consistent estimator for $e_1'[f(x)E(f(\tau|X,Z)|X=x)N_x(\tau)]^{-1}$. This explains why the implementation (17) is valid in both situations.
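Formula (28) only requires integrating kernel moments over the truncated domain $D_{x,b_{n,\tau}}$. As an illustration, the following sketch evaluates $N_x(\tau)$ by simple quadrature when $S_x=[0,1]^d$ and $K$ is the product Epanechnikov kernel used in Section 8; these choices, and the grid size, are assumptions made here for concreteness.

```python
import numpy as np

# Quadrature evaluation of N_x(tau) in (28) for S_x = [0,1]^d and a product
# Epanechnikov kernel with support [-1,1]^d.  Intended for small d.
def N_x(x, b, ngrid=201):
    d = len(x)
    grid = np.linspace(-1.0, 1.0, ngrid)
    mesh = np.meshgrid(*([grid] * d), indexing="ij")
    U = np.column_stack([m.ravel() for m in mesh])          # candidate u points
    inside = np.all((x + b * U >= 0) & (x + b * U <= 1), axis=1)  # u in D_{x,b}
    Ku = np.prod(0.75 * (1 - U**2), axis=1) * inside        # truncated kernel
    basis = np.column_stack([np.ones(len(U)), U])           # [1, u']
    vol = (grid[1] - grid[0])**d
    return (basis[:, :, None] * basis[:, None, :]
            * Ku[:, None, None]).sum(axis=0) * vol
```

For an interior $x$ the indicator never binds and the computed matrix is block diagonal, recovering the interior-point special case noted above.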

Corollary 10 Let Assumptions 1 to 9 hold. Assume $m/(nb_n^d)^{1/4}\to\infty$ and a local quadratic regression is used in Step 1 of the estimation procedure. Then

$$\sqrt{nb^d_{n,\tau}}\left(\hat Q(\tau|x,z)-\hat B_v(x,\tau)b^2_{n,\tau}-Q(\tau|x,z)\right)=D_{1,v}(x,\tau)-D_{2,v}(x,\tau)+o_p(1),$$

where $D_{1,v}(x,\tau)$ is defined in Corollary 9,

$$D_{2,v}(x,\tau)=\left(\frac{\sqrt{nb^{d+4}_{n,\tau}}}{\sqrt{nr^{d+4}_{n,\tau}}}\right)\Pi_v(x,\tau)\left\{(nr^d_{n,\tau})^{-1/2}\sum_{i=1}^n\left\{\tau-1\left(u_i^0(\tau)\le0\right)\right\}\widetilde W_i(x,r_{n,\tau})K\!\left(\frac{x_i-x}{r_{n,\tau}}\right)\right\},$$

$$\Pi_v(x,\tau)=\left[e_1'N_x(\tau)^{-1}\left(\int_{D_{x,b_{n,\tau}}}\begin{bmatrix}1\\ u\end{bmatrix}q(u)'K(u)\,du\right)\right]\times\frac{e_3'}{f(x)E[f(\tau|X,Z)|X=x]}\left(\int_{D_{x,r_{n,\tau}}}\bar u\bar u'K(u)\,du\right)^{-1},$$

and the domain of integration is $D_{x,r_{n,\tau}}=\{u\in\mathbb R^d:(x+r_{n,\tau}u)\in S_x\}\cap\mathrm{supp}(K)$.

If $D_{x,b_{n,\tau}}=D_{x,r_{n,\tau}}=\mathrm{supp}(K)$, then $\Pi_v(x,\tau)$ reduces to $\Pi/[f(x)E(f(\tau|X,Z)|X=x)]$ in Lemma 3. Therefore, the above result also covers the interior point situation as a special case. The first three lines of the expression (24) constitute a consistent estimator for $(\sqrt{nb^{d+4}_{n,\tau}}/\sqrt{nr^{d+4}_{n,\tau}})\Pi_v(x,\tau)$. This explains why the implementation (24) is valid in both situations.

8 Simulations

This section considers three issues: (i) the performance of the bandwidth selection rules, (ii) the performance of the estimators $\hat Q(\tau|x,z)$ and $\hat Q(\tau|x,z)-\hat B(x,\tau)b^2_{n,\tau}$, and (iii) the performance of the uniform confidence bands. Local quadratic regressions are used in the first step of the estimation procedure.

We consider two models. Their conditional quantile functions are given by

$$Q(\tau|x,z)=g(x_1,x_2,\tau)+\beta_1(\tau)z_1+\beta_2(\tau)z_2,$$

where

$$\beta_1(\tau)=\tau\quad\text{and}\quad\beta_2(\tau)=\sqrt{0.2+\tau}.$$

Therefore, some heterogeneity with respect to $\tau$ is allowed in the linear part. The two models differ in the shape of $g(x_1,x_2,\tau)$:

$$\text{Model 1:}\quad g(x_1,x_2,\tau)=\left(0.5+2x_1+\sin(2\pi x_1-0.5)\right)+x_2Q_{e_1}(\tau),$$

$$\text{Model 2:}\quad g(x_1,x_2,\tau)=\log(x_1x_2)+\frac{1}{1+\exp\left(-x_1Q_{e_1}(\tau)-x_2Q_{e_2}(\tau)\right)}+x_2Q_{e_1}(\tau).$$

Model 1 is a location-scale model with nonlinearity in the location. Model 2 is a fairly general nonlinear model. Both have different degrees of curvature at different evaluation points. This poses substantial challenges for estimation and inference. These two functions are the same as DGPs 2 and 3 considered in the simulation section of Qu and Yoon (2015).

The covariates $x_1,x_2,z_1,z_2$ each have a U(0,1) marginal distribution but are allowed to be correlated. Specifically, the pairs $(x_1,z_1)$ and $(x_2,z_2)$ are independent of each other, but within a pair the correlation coefficient equals 0.3. The presence of correlation implies that omitting $z_1$ or $z_2$ may lead to omitted variable bias when estimating $g(x_1,x_2,\tau)$.

Other aspects of the simulation design are as follows. The error terms $e_1$ and $e_2$ are i.i.d. N(0,1) and U(0,1), respectively. The evaluation point $x$ takes on three values: $(0.5,0.5)$, $(0.5,0.75)$, and $(0.75,0.75)$, while $z$ is fixed at $z=(0.5,0.5)$. Among them, $x=(0.75,0.75)$ can be viewed as a boundary point because the selected bandwidths are often significantly bigger than 0.25. The quantile range is $\mathcal T=[0.2,0.8]$. The sample size equals 500 or 1000. The kernel function is the product of univariate Epanechnikov kernels. All subsequent results are based on 1000 replications.
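To make the design concrete, here is a minimal sketch of one Monte Carlo draw from Model 1. Since $Q(\tau|x,z)$ is increasing in $\tau$ here, $y$ can be generated as $Q(u|x,z)$ with $u\sim$ U(0,1). The Gaussian-copula step is our own device for inducing the within-pair correlation (the latent correlation $2\sin(\pi\cdot0.3/6)$ makes the transformed uniforms' correlation equal 0.3); the paper does not specify how the correlated uniforms are generated.

```python
import numpy as np
from scipy.stats import norm

# One draw of (y, x1, x2, z1, z2) from Model 1 with U(0,1) marginals and
# within-pair correlation 0.3 via a Gaussian copula.
def draw_model1(n, rng):
    r = 2 * np.sin(np.pi * 0.3 / 6)                    # latent normal correlation
    L = np.linalg.cholesky(np.array([[1.0, r], [r, 1.0]]))
    x1, z1 = norm.cdf(L @ rng.standard_normal((2, n)))
    x2, z2 = norm.cdf(L @ rng.standard_normal((2, n)))
    u = rng.uniform(size=n)                            # y = Q(u|x,z)
    g = 0.5 + 2 * x1 + np.sin(2 * np.pi * x1 - 0.5) + x2 * norm.ppf(u)
    y = g + u * z1 + np.sqrt(0.2 + u) * z2
    return y, np.column_stack([x1, x2]), np.column_stack([z1, z2])

y, X, Z = draw_model1(500, np.random.default_rng(0))
```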

8.1 Bandwidth selection

First, consider how to estimate the MSE-optimal bandwidth in Corollary 1. This is done in two steps.

Step A. (Obtaining a pilot bandwidth) Apply cross validation to select a bandwidth at the median. (i) For a given candidate bandwidth value, estimate the conditional median at $(x_i,z_i)$ by running a local quadratic regression while dropping $(y_i,x_i,z_i)$. The goodness of fit is determined by the difference between $y_i$ and the estimated conditional median. (ii) Repeat the estimation at different $(x_i,z_i)$ and compute the mean absolute deviation. (iii) The cross-validation bandwidth is the value that minimizes this mean absolute deviation. Call the cross-validation bandwidth $h_{cv}$.

There are two things to note. First, all parameters are estimated jointly in step (i) and the averaging step is not applied to get $\hat\beta(\tau)$; step (i) is therefore essentially the simplified procedure in Section 7.1. Second, step (ii) is applied to the 50% of observations that are closest to $x$. This ensures that the chosen bandwidth is not too influenced by values distant from $x$.
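A schematic implementation of Step A follows, reusing the weight-absorption device from Section 7.1 ($\rho_\tau(cu)=c\rho_\tau(u)$ for $c\ge0$) so that the weighted local quadratic median regression can be handed to statsmodels' QuantReg. The function and kernel interfaces are our own, and computational efficiency is ignored.

```python
import numpy as np
import statsmodels.api as sm

# Step A: leave-one-out cross-validation for the median bandwidth, scored by
# mean absolute deviation on the 50% of points closest to x0.
def cv_bandwidth(y, X, Z, x0, candidates, kernel):
    n, d = X.shape
    eval_idx = np.argsort(np.linalg.norm(X - x0, axis=1))[: n // 2]
    scores = []
    for h in candidates:
        abs_dev = []
        for i in eval_idx:
            others = np.delete(np.arange(n), i)       # drop (y_i, x_i, z_i)
            w = kernel((X[others] - X[i]) / h)
            keep, wk = others[w > 0], w[w > 0]
            du = X[keep] - X[i]
            quad = np.column_stack([du[:, a] * du[:, b]
                                    for a in range(d) for b in range(a, d)])
            design = np.column_stack([np.ones(len(keep)), du, quad, Z[keep]])
            fit = sm.QuantReg(wk * y[keep], wk[:, None] * design).fit(q=0.5)
            med = fit.params[0] + Z[i] @ fit.params[-Z.shape[1]:]
            abs_dev.append(abs(y[i] - med))
        scores.append(np.mean(abs_dev))
    return candidates[int(np.argmin(scores))]
```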

Step B. (Estimating nuisance parameters) Use $h_{cv}$ to compute the quantities needed for the MSE-optimal bandwidth at the median, $b^*_{n,0.5}$. The second order derivatives are estimated from the local quadratic regression of Step A. All other terms can be computed numerically, except $E[f(0.5|X,Z)|X=x]^2$ and $f(x)$. For them, note that

$$E[f(0.5|X,Z)|X=x]^2 f(x)=\frac{\left[E[f(0.5|X,Z)|X=x]\,f(x)\right]^2}{f(x)}.$$

The numerator is estimated by

$$\left[(nh_n^d)^{-1}\sum_{j=1}^n\hat f(0.5|x,z_j)K\!\left(\frac{x_j-x}{h_n}\right)\right]^2$$

and the denominator by

$$(nh_n^d)^{-1}\sum_{j=1}^nK\!\left(\frac{x_j-x}{h_n}\right).$$

Finally, $\hat f(0.5|x,z_j)$ can be computed as described in Section 5.1. Call the resulting MSE-optimal bandwidth $h_{opt}$.
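The two kernel averages above are one-liners once the conditional density estimates are in hand. A minimal sketch, assuming `f_hat` holds the precomputed values $\hat f(0.5|x,z_j)$:

```python
import numpy as np

# Step B kernel averages.  Returns estimates of E[f(0.5|X,Z)|X=x]^2 f(x)
# (the ratio numerator^2 / denominator) and of f(x) (the denominator).
def density_pieces(X, x0, h, f_hat, kernel):
    n, d = X.shape
    Kw = kernel((X - x0) / h)
    denom = Kw.sum() / (n * h**d)             # estimates f(x)
    numer = (f_hat * Kw).sum() / (n * h**d)   # estimates E[f(0.5|X,Z)|X=x] f(x)
    return numer**2 / denom, denom
```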

Now there are two bandwidths: (i) the cross-validated bandwidth $h_{cv}$, and (ii) the MSE-optimal bandwidth $h_{opt}$. We will report two sets of final estimates depending on which bandwidth is used in Step 1:

Bandwidth Option 1: use $h_{cv}$ in the local quadratic regression, i.e., set $h_n=h_{cv}$.

Bandwidth Option 2: use $h_{opt}$ in the local quadratic regression, i.e., set $h_n=h_{opt}$.

This will allow us to examine whether the final estimates are sensitive to the bandwidth used in Step 1.

In Step 2 of the estimation procedure, $\hat\beta(x,\tau)$ is computed using bandwidth $b_{n,\tau}$ for $\tau\in\mathcal T$. To compute these bandwidths, we set $b_{n,0.5}=h_{opt}$ and relate $b_{n,\tau}$ to $b_{n,0.5}$ using Yu and Jones' (1998) rule. For inference (i.e., constructing confidence bands), we maintain $r_{n,\tau}=b_{n,\tau}$ throughout the analysis.
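For reference, Yu and Jones' (1998) rule scales the bandwidth by $\{\tau(1-\tau)/\phi(\Phi^{-1}(\tau))^2\}^{1/5}$ in the univariate case. The sketch below applies the corresponding ratio relative to the median; the exponent $1/(d+4)$ for $d$ covariates is our own extrapolation and should be treated as an assumption.

```python
import numpy as np
from scipy.stats import norm

# Relate b_{n,tau} to b_{n,0.5} following the Yu-Jones scaling.
def yu_jones(b_median, tau, d):
    ratio = tau * (1 - tau) / norm.pdf(norm.ppf(tau)) ** 2
    ratio_med = 0.25 / norm.pdf(0.0) ** 2              # the same factor at tau = 0.5
    return b_median * (ratio / ratio_med) ** (1.0 / (d + 4))
```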

Table 1 reports summary statistics of the selected bandwidths under different models, sample sizes, and evaluation points. They show that the procedure performs fairly well in capturing the curvature of $Q(\tau|x,z)$. Between the two models, bandwidths for Model 2 tend to be bigger than those for Model 1 when compared at the same evaluation point. This reflects that Model 1 is in general more curved than Model 2 when measured by $\mathrm{tr}\left(\partial^2Q(\tau|x,z)/\partial x\partial x'\right)^2$. Such consistency is also observed within each model. Take the two design points $(0.5,0.75)$ and $(0.75,0.75)$ as examples. In Model 1, the curvature at the former is slightly smaller than at the latter, and the selected bandwidths are comparable. In Model 2, the curvature at the former is greater than at the latter, and the bandwidths at $x=(0.75,0.75)$ tend to be bigger. Finally, $h_{cv}$ tends to be bigger than $h_{opt}$. This is expected because $h_{cv}$ is obtained for a local quadratic regression while $h_{opt}$ is optimal for a local linear regression.

8.2 The performance of the estimators

This section examines two issues. (i) What is the finite sample performance of $\hat Q(\tau|x,z)$ and $\hat Q(\tau|x,z)-\hat B(x,\tau)b^2_{n,\tau}$? (ii) Are they sensitive to the bandwidth used in Step 1?


Table 2 reports root mean squared errors (RMSEs) and biases of $\hat Q(\tau|x,z)$ and $\hat Q(\tau|x,z)-b^2_{n,\tau}\hat B(x,\tau)$ under Bandwidth Option 1 and Bandwidth Option 2. The sample size equals 500. The tradeoff implied by the asymptotic theory is clearly present in finite samples. The estimator $\hat Q(\tau|x,z)$ is often substantially biased; however, its RMSE is comparable to or lower than that of $\hat Q(\tau|x,z)-b^2_{n,\tau}\hat B(x,\tau)$.

Comparing the bandwidth options shows the following. First, the RMSEs and biases are always similar. This is encouraging because it suggests that the proposed estimator allows a wide range of bandwidth values without sacrificing its efficacy. Second, a closer inspection suggests that using a larger bandwidth, $h_{cv}$, sometimes produces a smaller RMSE, but the difference is too small to be of any importance. The results with $n=1000$ are similar and omitted.

8.3 Properties of uniform confidence bands

This section examines the following issues: (i) whether confidence bands with bias estimation (i.e., robust bands) show meaningful improvement over conventional bands, and (ii) whether the improvement comes at the cost of a substantially wider band. We start with $n=500$ and then consider $n=1000$.

Table 3 shows coverage rates of several uniform bands at two nominal levels, $p=0.90$ and $0.95$, using Bandwidth Option 1. We start by considering the robust bands "Asy R" (robust band based on the asymptotic approximation) and "Res R" (robust band based on resampling). The coverage rates of "Asy R" are overall close to the nominal level, although some undercoverage exists when $x$ is close to the boundary. At $x=(0.75,0.75)$ and when the bandwidth equals 0.335, there are only approximately 170 observations for estimating the quantile process and the nuisance parameters, so the task is challenging. The coverage rates of "Res R" are higher than those of "Asy R". Some slight undercoverage exists at $(0.75,0.75)$.

The other confidence bands are less satisfactory. "Asy" and "Res" are conventional bands that estimate the bias but do not adjust the standard error. They exhibit significant undercoverage in all cases. "Asy 2" and "Res 2" are conventional bands that ignore the bias, i.e., assume it is zero. They also show undercoverage in most cases. "Asy M" is a modified band proposed in Qu and Yoon (2015). It allows the bias to change the length of the confidence band, but in an ad hoc manner. "Res M" applies the same idea to the resampling based band. The performance of these modified bands is good in Model 2 but less so in Model 1. In summary, these results suggest that accounting for the bias in these two models is not only important but also fairly challenging.

Table 4 is obtained under the same setting as Table 3, except that Bandwidth Option 2 is used. The values are close to those in Table 3. This confirms that the coverage rate is not sensitive to the bandwidth choice in Step 1.

Are robust bands too wide to be informative? In Table 5, the lengths of robust bands are compared to those of conventional bands. The sample size is 500, the nominal level is 90%, and Bandwidth Option 1 is used. The following patterns emerge. The length of a robust band exceeds that of a conventional band by 40% to 81%. The difference in length is greater when a conventional band is shorter, and smaller when a conventional band is already wide. Overall, it appears that robust bands can be informative while having reliable coverage. Lengths from Bandwidth Option 2, or at the 95% level, show similar patterns, so we omit the tables to save space.

We next examine coverage rates and lengths when $n=1000$. Table 6 shows the coverage rates of the uniform bands using Bandwidth Option 1. The coverage rates of the robust bands overall improve relative to $n=500$. At the 90% level, the maximum size distortions of the two bands fall from 8.1% and 5.4% to 5.4% and 4.1%, respectively. The resampling based band continues to have higher coverage than the asymptotic approximation based band. In contrast, the conventional bands remain inadequate.

Table 7 compares the lengths at the 90% nominal level. The difference in lengths between a robust and a conventional band ranges from 45% to 76%. Between the two robust bands, the resampling based band is still wider, though the difference is smaller than when $n=500$. Results using Bandwidth Option 2, or at the 95% nominal level, show similar patterns. They are omitted.

9 Conclusion

This paper has developed methods for analyzing conditional quantile processes in partially linear models. The framework allows the researcher to be flexible about the stochastic relationship between some variables, while still being able to control for a fair number of confounding factors. Two estimation procedures are developed that are suitable for moderate or large sample sizes. Two inference procedures are developed, again with complementary features. It is shown that the methods can be used to test hypotheses related to significance, homogeneity and conditional stochastic dominance. Although the paper has focused on partially linear models, we conjecture that some parts of the analysis, such as those obtaining a uniform Bahadur representation and constructing a confidence band that acknowledges the estimation bias, can be more generally useful for studying other semiparametric models.


References

Angrist, J., V. Chernozhukov, and I. Fernández-Val (2006). Quantile regression under misspecification, with an application to the U.S. wage structure. Econometrica 74(2), 539-563.

Bai, J. (1996). Testing for parameter constancy in linear regressions: An empirical distribution function approach. Econometrica 64(3), 597-622.

Billingsley, P. (1968). Convergence of Probability Measures. Wiley.

Cai, Z. and Z. Xiao (2012). Semiparametric quantile regression estimation in dynamic models with partially varying coefficients. Journal of Econometrics 167(2), 413-425.

Calonico, S., M. D. Cattaneo, and R. Titiunik (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica 82(6), 2295-2326.

Chao, S. K., S. Volgushev, and G. Cheng (2016). Quantile processes for semi and nonparametric regression. Working Paper, Department of Statistics, Purdue University.

Chaudhuri, P. (1991). Nonparametric estimates of regression quantiles and their local Bahadur representation. The Annals of Statistics 19(2), 760-777.

Chaudhuri, P., K. Doksum, and A. Samarov (1997). On average derivative quantile regression. The Annals of Statistics 25(2), 715-744.

Chernozhukov, V. and I. Fernández-Val (2005). Subsampling inference on quantile regression processes. Sankhyā: The Indian Journal of Statistics 67(2), 253-276.

Guerre, E. and C. Sabbah (2012). Uniform bias study and Bahadur representation for local polynomial estimators of the conditional quantile function. Econometric Theory 28(1), 87-129.

Hahn, J., P. Todd, and W. Van der Klaauw (2001). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica 69(1), 201-209.

Hall, P. and C. C. Heyde (1980). Martingale Limit Theory and Its Application. Academic Press.

He, X. and P. Shi (1996). Bivariate tensor-product B-splines in a partly linear model. Journal of Multivariate Analysis 58(2), 162-181.

Heckman, J. J., H. Ichimura, and P. Todd (1998). Matching as an econometric evaluation estimator. The Review of Economic Studies 65(2), 261-294.

Khmaladze, È. V. (1981). Martingale approach in the theory of goodness-of-fit tests. Theory of Probability and Its Applications 26(2), 240-257.

Knight, K. (1998). Limiting distributions for L1 regression estimators under general conditions. The Annals of Statistics 26(2), 755-770.

Koenker, R. (2005). Quantile Regression. Cambridge University Press.

Koenker, R. and G. Bassett, Jr (1978). Regression quantiles. Econometrica 46(1), 33-50.

Koenker, R. and J. A. F. Machado (1999). Goodness of fit and related inference processes for quantile regression. Journal of the American Statistical Association 94(448), 1296-1310.

Koenker, R. and Z. Xiao (2002). Inference on the quantile regression process. Econometrica 70(4), 1583-1612.

Kong, E., O. Linton, and Y. Xia (2010). Uniform Bahadur representation for local polynomial estimates of M-regression and its application to the additive model. Econometric Theory 26(5), 1529-1564.

Lee, S. (2003). Efficient semiparametric estimation of a partially linear quantile regression model. Econometric Theory 19(1), 1-31.

Masry, E. (1996). Multivariate local polynomial regression for time series: Uniform strong consistency and rates. Journal of Time Series Analysis 17(6), 571-599.

Neocleous, T. and S. Portnoy (2008). On monotonicity of regression quantile functions. Statistics & Probability Letters 78(10), 1226-1229.

Oka, T. and Z. Qu (2011). Estimating structural changes in regression quantiles. Journal of Econometrics 162, 248-267.

Parzen, M. I., L. J. Wei, and Z. Ying (1994). A resampling method based on pivotal estimating functions. Biometrika 81(2), 341-350.

Portnoy, S. and R. Koenker (1989). Adaptive L-estimation for linear models. The Annals of Statistics 17(1), 362-381.

Qu, Z. and J. Yoon (2015). Nonparametric estimation and inference on conditional quantile processes. Journal of Econometrics 185(1), 1-19.

Qu, Z. and J. Yoon (2016). Uniform inference on quantile effects under sharp regression discontinuity designs. Working Paper, Department of Economics, Boston University.

Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica 56(4), 931-954.

Ruppert, D. and M. P. Wand (1994). Multivariate locally weighted least squares regression. The Annals of Statistics 22(3), 1346-1370.

Sherwood, B. and L. Wang (2016). Partially linear additive quantile regression in ultra-high dimension. The Annals of Statistics 44(1), 288-317.

Stone, C. J. (1977). Consistent nonparametric regression. The Annals of Statistics 5(4), 595-620.

Wang, H. J., Z. Zhu, and J. Zhou (2009). Quantile regression in partially linear varying coefficient models. The Annals of Statistics 37(6B), 3841-3866.

Yu, K. and M. C. Jones (1998). Local linear quantile regression. Journal of the American Statistical Association 93(441), 228-237.


Appendix A. Proof of Results in the PaperWe de�ne some notation needed for a local quadratic quantile regression. Let a0 2 R; a1 2

Rd; a2 2 Rd(d+1)=2 and b 2 Rq be some generic parameter values. Let

�(x; �) =qnhdn

0BBBBBB@a0 � �0(x; �)

hn (a1 � �1(x; �))

h2n (a2 � �2(x; �))

b� �(�)

1CCCCCCA ;

where �0(x; �); �1(x; �) and �2(x; �) are values in the second order Taylor approximation of thetrue conditional quantile function, see (2) and (3). De�ne

V (x; � ; �) =

nXj=1

��

�u0j (�)� ej (x; �)� (nhdn)�1=2Wj(hn; x)

0��K

�xj � xhn

�(A.1)

�nXj=1

���u0j (�)� ej(x; �)

�K

�xj � xhn

�;

where u0j (�) is de�ned in (12), Wj(hn; x) in (11), and

ej (x; �) = �0(x; �) + (xj � x)0 �1(x; �) + q (xj � x)0 �2(x; �)� g(xj ; �):

Note that �(x; �) minimizes (A.1) if and only if the corresponding values of a0; a1; a2 and b minimize(4). Let S (x; � ; �) be minus the subgradient of (A.1) recentered to have mean zero:

S (x; � ; �) = (nhdn)�1=2

nXj=1

nP�u0j (�) � (nhdn)�1=2Wj(hn; x)

0�+ e(x; �)���xj ; zj�

� 1�u0j (�) � (nhdn)�1=2Wj(hn; x)

0�+ e(x; �)�o

Wj(hn; x)K

�xj � xhn

�;

where P�u0j (�) � s

���xj ; zj� stands for the cumulative distribution function of Y conditional on

X = xj and Z = zj evaluated at g(xj ; �) + z0j�(�) + s. Finally, recall

S0 (x; �) = (nhdn)�1=2

nXj=1

�� � 1

�u0j (�) � 0

�Wj(hn; x)K

�xj � xhn

�: (A.2)

Note that S (x; � ; �) equals S0 (x; �) when � = 0 and e(x; �) = 0.We use the same notation when analyzing a local linear quantile regression, except when de�ning

�(x; �) and ej (x; �) the entries corresponding to the second order terms in the Taylor expansion,h2n (a2 � �2(x; �)) and q (xj � x)

0 �2(x; �), are excluded.Proof of Lemma 1. By Lemma B.3 in the supplementary appendix,

Pr

�sup�2T

supx2Sx

e�(x; �) � log n�! 1: (A.3)

A-1

Page 38: Inference on Conditional Quantile Processes in Partially ...people.bu.edu/qu/partial_linear/Partial_Linear1.pdfconditional quantile processes in partially linear models. Speci–cally,

Therefore, we can restrict the attention to��(x; �) : sup�2T supx2Sx k�(x; �)k � log n

.

Consider

(nhdn)�1=2

nXj=1

n� � 1(u0j (�) � ej(x; �) + (nh

dn)�1=2Wj(hn; x)

0� (x; �)o

(A.4)

�Wj(hn; x)K

�xj � xhn

�:

Adding and subtracting terms, (A.4) can be rewritten as

fS(x; � ; � (x; �))� S0 (x; �)g+ S0 (x; �) (A.5)

+(nhdn)�1=2

nXj=1

n� � P

�u0j (�) � ej(x; �) + (nh

dn)�1=2Wj(hn; x)

0� (x; �)���xj ; zj�o

�Wj(hn; x)K

�xj � xhn

�:

We now evaluate (A.4) and its components in (A.5) at � (x; �) = e� (x; �). At this value, (A.4) isOp((nh

dn)�1=2) uniformly over T and Sx by Theorem 2.1 in Koenker (2005). Because of Lemma B.1

in the supplementary appendix and (A.3), the term in curly brackets in (A.5) is Op((nhdn)�1=4 log n)

uniformly over T and Sx. The term S0 (x; �) does not depend on e� (x; �). Therefore, we only needto further analyze the last term in (A.5).

Apply a second-order Taylor expansion to this term and then evaluate the expression at � (x; �) =e� (x; �). We obtain�(nhdn)�1=2

nXj=1

f (� jxj ; zj) ej (x; �)Wj(hn; x)K

�xj � xhn

0@(nhdn)�1 nXj=1

f (� jxj ; zj)K�xj � xhn

�Wj(hn; x)Wj(hn; x)

0

1Ae� (x; �)�12(nhdn)

�1=2nXj=1

f 0 (eyj jxj ; zj) ej (x; �)2Wj(hn; x)K

�xj � xhn

�12(nhdn)

�1=2

0@(nhdn)�1 nXj=1

f 0 (eyj jxj ; zj) hWj(hn; x)0e� (x; �)i2Wj(hn; x)K

�xj � xhn

�1A ;

where eyi is a value between Q(� jxj ; zj) and Q(� jxj ; zj) + ej(x; �) + (nhdn)�1=2Wj(hn; x)

0e� (x; �).BecauseKj((xj�x)=hn) equals 0 unless xj is in a vanishing neighborhood of x, it su¢ ces to considervalues close to x. Let � be a �nite constant, such thatKj((xj�x)=hn) = 0 whenever kxj � xk > �hn.Within this �-neighborhood, ej(x; �) = O(hrn), where r = 3 in the local quadratic regression case andr = 2 in the local linear regression case. Also, eyj approaches Q(� jxj ; zj) as n ! 1. This impliesthere exists C < 1 such that

f 0 (eyj jxj ; zj) (ej (x; �)2 =h2rn )Wj(hn; x) � 1 (kxj � xk � �hn) � C

A-2

Page 39: Inference on Conditional Quantile Processes in Partially ...people.bu.edu/qu/partial_linear/Partial_Linear1.pdfconditional quantile processes in partially linear models. Speci–cally,

with probability arbitrarily close to 1 in large samples. Applying this result, we have (nhdn)�1=2nXj=1

f 0 (eyj jxj ; zj) ej (x; �)2Wj(hn; x)K

�xj � xhn

� (A.6)

=

(nhdn)�1=2nXj=1

f 0 (eyj jxj ; zj) ej (x; �)2Wj(hn; x) � 1 (kxj � xk � �hn)K

�xj � xhn

� � h2rn (nh

dn)�1=2

nXj=1

f 0 (eyj jxj ; zj) ej (x; �)2h2rnWj(hn; x) � 1 (kxj � xk � �hn)K

�xj � xhn

� � Ch2rn (nh

dn)1=2

0@(nhdn)�1 nXj=1

K

�xj � xhn

�1A= Op

�(nhdn)

1=2h2rn

�uniformly over T and Sx:

Apply similar arguments to the other second order term: (nhdn)�1=20@(nhdn)�1 nX

j=1

f 0 (eyj jxj ; zj) [Wj(hn; x)0e� (x; �)]2K �xj � x

hn

�Wj(hn; x)

1A � C(nhdn)

�1=2 log2 n

0@(nhdn)�1 nXj=1

K

�xj � xhn

�1A= Op

�(nhdn)

�1=2 log2 n�= op

�(nhdn)

�1=4 log n�;

where the log2 n term following the inequality arises because of (A.3).The above results jointly imply

e� (x; �) =

0@(nhdn)�1 nXj=1

f (� jxj ; zj)Wj(hn; x)Wj(hn; x)0K

�xj � xhn

�1A�18<:S0 (x; �)� (nhdn)�1=2

nXj=1

f (� jxj ; zj) ej (x; �)Wj(hn; x)K

�xj � xhn

�+Op

�(nhdn)

� 14 log n+ (nhdn)

1=2h2rn

�o; (A.7)

Further, the order of (nhdn)�1=2Pn

j=1 f (� jxj ; zj) ej (x; �)Wj(hn; x)K ((xj � x)=hn) isOp�(nhdn)

1=2hrn�,

which can be established using the same argument as when analyzing (A.6). Applying this to (A.7),we obtain

e� (x; �) =

0@(nhdn)�1 nXj=1

f (� jxj ; zj)Wj(hn; x)Wj(hn; x)0K

�xj � xhn

�1A�1 S0 (x; �)+Op

�(nhdn)

� 14 log n+ (nhdn)

1=2hrn

�:

A-3

Page 40: Inference on Conditional Quantile Processes in Partially ...people.bu.edu/qu/partial_linear/Partial_Linear1.pdfconditional quantile processes in partially linear models. Speci–cally,

This completes the proof.Proof of Lemma 2. By Lemma 1,

�(�)� �(�) = 1

n (nhdn)1=2

nXi=1

e04Mn(xi; �)�1S0 (xi; �) +Op

�(nhdn)

� 34 log n+ hrn

�;

where e4 selects the last q elements of a vector and S0 (xi; �) is given by (A.2) but with the i-thobservation excluded from the summation. To prove the Lemma, it is su¢ cient to show that

1pn (nhdn)

1=2

nXi=1

Mn(xi; �)�1S0 (xi; �) = Op (1) uniformly over T : (A.8)

For any �xed � , the left hand side of (A.8) converges to a multivariate normal distribution, seeLee (2003). It only remains to verify its tightness as a process of � over T . Applying the de�nitionof S0 (xi; �), it can be rewritten as

1pn (nhdn)

1=2

nXi=1

Mn(xi; �)�1

0@(nhdn)�1=2 nXj=1;j 6=i

�� � 1

�u0j (�) � 0

�Wj(hn; xi)K

�xj � xihn

�1A=

1pn

nXj=1

(1

nhdn

nXi=1

Mn(xi; �)�1Wj(hn; xi)K

�xj � xihn

�)�� � 1

�u0j (�) � 0

�+ op (1) ; (A.9)

where the op (1) term arises because terms with j = i are added to the summation. To shorten theexpression, write the leading term of (A.9) as

U(�) � n�1=2nXj=1

Tj(�)�� � 1

�u0j (�) � 0

�; (A.10)

where

Tj(�) =1

nhdn

nXi=1

Mn(xi; �)�1Wj(hn; xi)K

�xj � xihn

�.

Below, we shall show that for any " > 0 and � > 0, there exists a � > 0 such that

P

sup

� 00;� 02T ;j� 00�� 0j��

U �� 00�� U �� 0� > "

!< �:

Note that for any �; T contains 1=� intervals of length �. Therefore, the above inequality holds iffor any " > 0 and � > 0, there exists a � > 0, such that (see Billingsley 1968, p.58, equation 8.12)

P

sup

�2[�1;�+�1]\TkU (�)� U (�1)k > "

!< �� for any �1 2 T when n is su¢ ciently large. (A.11)

We prove (A.11) using a chaining argument. Let 0 < � < 1=2 be some constant. Partitionthe interval T into small intervals of size cn�1=2�� where c can be any �nite constant greater

A-4

Page 41: Inference on Conditional Quantile Processes in Partially ...people.bu.edu/qu/partial_linear/Partial_Linear1.pdfconditional quantile processes in partially linear models. Speci–cally,

than 1. Denote the number of intervals by �bn = O(n1=2+�). For any �, among these �bn intervals,bn = O(�n1=2+�) of them provide a cover for [�1; � + �1]. For simplicity, assume these bn intervalsstart at �1. Let � j denote the lower limit of the j-th interval. Then, by the triangle inequality:

sup�2[�1;�+�1]\T

kU (�)� U (�1)k (A.12)

� sup1�j�bn

kU (� j)� U (�1)k+ sup1�j�bn

sup�2[�j ;�j+1]

kU (�)� U (� j)k :

This inequality reduces the overall variation into within- and between-interval variations.Consider the �rst term on the right hand side. Theorem 12.2 in Billingsley (1968) can be used

to derive a bound for it. Speci�cally, the theorem states that if there exists � � 0; � > 1 andul � 0 (l = 1; :::; bn) such that E(kU (� j)� U (� i)k�) � (

Pi<l�j ul)

� for any 0 � i � j � bn, thenP�sup1�j�bn kU (� j)� U (�1)k > "

�� "��C�;� (u1 + :::+ ubn)

�. Setting � = 2� > 2, then LemmaB.4 in the supplementary appendix implies E kU (� j)� U (� i)k� � �C(� j � � i)� for 0 � i � j � bn,where �C is a �nite constant. Therefore,

P

sup

1�j�bnkU (� j)� U (�1)k >

"

5

!� C�;��

"5

�� �C(� bn � �1)� = �

C�;��"5

�� �C���1!:

For any " and �, there exists a �� such that��

"5

���C�;� �C��

��1�= �. For any � � ��, the preceding

display implies

P

sup

1�j�bnkU (� j)� U (�1)k >

"

5

!� ��: (A.13)

Now consider the second term on the right hand side of (A.12). Because we need an upperbound for it, we now take the two supremums over the �bn intervals covering T rather than the bnintervals covering just [�1; � + �1]. That is, we consider

sup1�j��bn

sup�2[�j ;�j+1]

kU (�)� U (� j)k ;

where � j now stands for the lower limit of the j-th interval with j = 1; ::::;�bn. The precedingdisplay is independent of �. Further,

U (�)� U (� j) = n�1=2nXi=1

(Ti(�)� Ti(� j))�� � 1

�u0i (�) � 0

�(a)

+n�1=2nXi=1

Ti(� j)�� � 1

�u0i (�) � 0

���� j � 1

�u0i (� j) � 0

��(b).

Below, the two terms (a) and (b) are analyzed separately. The supremum of the term (a)is bounded by n�1=2 sup1�j��bn sup�2[�j ;�j+1]

Pni=1 kTi(�)� Ti(� j)k. In Ti(�), only the conditional

density function depends on � . Because the function is Lipschitz continuous with respect to �and the eigenvalues of Mn(x; �) are strictly positive, it follows that this supremum is of order

A-5

Page 42: Inference on Conditional Quantile Processes in Partially ...people.bu.edu/qu/partial_linear/Partial_Linear1.pdfconditional quantile processes in partially linear models. Speci–cally,

Op�n1=2n�1=2��

�= Op (n

��). Because this supremum is independent of �, it implies that for any� > 0, " > 0 and � > 0, the following holds for su¢ ciently large n:

P

sup

1�j��bnsup

�2[�j ;�j+1]jj(a)jj > "

5

!� ��: (A.14)

Now consider the term (b). Let �q denote the dimension of Ti(� j) and let Ti;k(� j) be its k-thcomponent. Then

Ti(� j) =

�qXk=1

T+i (� j ; k)��qX

k=1

T�i (� j ; k);

where T+i (� j ; k)=(0; :::Ti;k(� j); :::; 0) 1(Ti;k(� j) � 0) and T�i (� j ; k)=(0; :::� Ti;k(� j); :::; 0) 1(Ti;k(� j) <

0). The term (b) can then be represented as

n�1=2�qX

k=1

nXi=1

T+i (� j ; k)�� � 1

�u0i (�) � 0

���� j � 1

�u0i (� j) � 0

���n�1=2

�qXk=1

nXi=1

T�i (� j ; k)�� � 1

�u0i (�) � 0

���� j � 1

�u0i (� j) � 0

��:

This decomposition follows Bai (1996, p. 612). All the weights multiplying the curly brackets arenon-negative. This permits applying a monotonicity argument. Because the 2�q summations can bestudied in the same way, we consider only the one with weights T+i (� j ; k). When � 2 [� j ; � j+1],

n�1=2nXi=1

T+i (� j ; k)�� � 1

�u0i (�) � 0

���� j � 1

�u0i (� j) � 0

��� n�1=2

nXi=1

T+i (� j ; k)�� j+1 � 1

�u0i (� j) � 0

���� j � 1

�u0i (� j) � 0

��� cn�1��

nXi=1

T+i (� j ; k)

and

n�1=2nXi=1

T+i (� j ; k)�� � 1

�u0i (�) � 0

���� j � 1

�u0i (� j) � 0

��� n�1=2

nXi=1

T+i (� j ; k)�� j � 1

�u0i (� j+1) � 0

���� j � 1

�u0i (� j) � 0

��= n�1=2

nXi=1

T+i (� j ; k)�i (� j ; � j+1)

+n�1=2nXi=1

T+i (� j ; k)�P�u0i (� j) � 0jxj ; zj

�� P

�u0i (� j+1) � 0jxj ; zj

�;

A-6

Page 43: Inference on Conditional Quantile Processes in Partially ...people.bu.edu/qu/partial_linear/Partial_Linear1.pdfconditional quantile processes in partially linear models. Speci–cally,

where

�i (� j ; � j+1) = 1�u0i (� j) � 0

�� 1

�u0i (� j+1) � 0

�� P

�u0i (� j) � 0jxj ; zj

�+ P

�u0i (� j+1) � 0jxj ; zj

�:

Combining the two set of inequalities, we obtain

n�1=2 sup1�j��bn

sup�2[�j ;�j+1]

nXi=1

T+i (� j ; k)�� � 1

�u0i (�) � 0

���� j � 1

�u0i (� j) � 0

�� � n�1=2 sup

1�j��bn

nXi=1

T+i (� j ; k)�i (� j ; � j+1)

(b1)

+n�1=2 sup1�j��bn

nXi=1

T+i (� j ; k)�P�u0i (� j+1) � 0jxj ; zj

�� P

�u0i (� j) � 0jxj ; zj

� (b2)

+cn�1�� sup1�j��bn

nXi=1

T+i (� j ; k): (b3)

The terms (b1)-(b3) only depend on the boundaries of intervals in the partition. Further, the term(b1) satis�es, for any " > 0,

P

n�1=2 sup

1�j��bn

nXi=1

T+i (� j ; k)�i (� j ; � j+1)

> "

5�q

!� �bn max

1�j��bnP

n�1=2nXi=1

T+i (� j ; k)�i (� j ; � j+1)

> "

5�q

!:

Because only the k-th element of T+i (� j ; k) is non-zero, we can treat T+i (� j ; k) as if it was a scalar.

Then, the preceding display is bounded from the above by, for any > 1;

�bn max1�j��bn

�"

5�q

��2 E

0@ n�1=2nXi=1

T+i (� j ; k)�i (� j ; � j+1)

2 1A :

Apply Rosenthal�s inequality, the above display is further bounded by

C�bnn� �"

5�q

��2 max1�j��bn

( nXi=1

E T+i (� j ; k)�i (� j ; � j+1) 2

! +

nXi=1

E T+i (� j ; k)�i (� j ; � j+1) 2

):

Because E(�i (� j ; � j+1)2 jxi; zi) � E(�i (� j ; � j+1)

2 jxi; zi) � C (� j+1 � � j) and that E T+i (� j ; k) 2

is �nite, the above display is of order

C�bnn� �"

5�q

��2 nn(1=2��) + n(1=2��)

o;

which converges to zero by choosing a large . The term (b2) is op (1) by the mean value theorem,while the term (b3) is op (1) by a uniform law of large numbers. The above results for (b1)-(b3)are independent of �. They imply that for any " > 0, � > 0 and � > 0, the following inequalityholds for su¢ ciently large n:

P

sup

1�j��bnsup

�2[�j ;�j+1]jj(b)jj > 3"

5

!� ��: (A.15)

A-7

Page 44: Inference on Conditional Quantile Processes in Partially ...people.bu.edu/qu/partial_linear/Partial_Linear1.pdfconditional quantile processes in partially linear models. Speci–cally,

The inequality (A.11) follows by combining (A.13), (A.14) and (A.15). This completes the proof.Proof of Theorem 1. The proof is similar to that of Theorem 2 of Qu and Yoon (2015). The �rststep shows that (nbdn;� )

1=2��0(x; �)� g(x; �)�B(x; �)b2n;�

�, where �0(x; �) is obtained by solving

the minimization problem in Step 2 for all � 2 T , converges weakly to the desired limit. The secondstep shows that the linearly interpolated estimator using m points has the same limit.

Consider minus the subgradient evaluated at x and z, normalized by (nbdn;� )�1=2:

(nbdn;� )�1=2

nXj=1

n� � 1(u0j (�) � ej(x; �) + (nb

dn;� )

�1=2Wj(bn;� ; x)0� (x; �)

o(A.16)

� �Wj(x; bn;� )K

�xj � xbn;�

�;

where � (x; �) equals e� (x; �) except that e�(x; �); e�0(x; �) and e�1(x; �) are replaced by �(�); �0(x; �)and �1(x; �), respectively. Also,

Wj(bn;� ; x) =

266641

xj�xbn;�

zj

37775 and �Wj(x; bn;� ) =

24 1

xj�xbn;�

35 .Similar to the proof of Lemma 1, (A.16) can be rewritten asn

�S(x; � ; � (x; �))� �S0 (x; �)o+ �S0 (x; �) (A.17)

+(nbdn;� )�1=2

nXj=1

n� � P

�u0j (�) � ej(x; �) + (nb

dn;� )

�1=2Wj(bn;� ; x)0� (x; �)

���xj ; zj�o� �Wj(bn;� ; x)K

�xj � xbn;�

�;

where

�S (x; � ; �) = (nbdn;� )�1=2

nXj=1

nP�u0j (�) � (nbdn;� )�1=2Wj(bn;� ; x)

0�+ ej(x; �)���xj ; zj�

� 1�u0j (�) � (nbdn;� )�1=2Wj(bn;� ; x)

0�+ ej(x; �)�o

�Wj(bn;� ; x)K

�xj � xbn;�

�;

and �S0 (x; �) equals �S (x; � ; �) after setting � = 0 and ej(x; �) = 0. In �S (x; � ; �), the bandwidthdi¤ers across quantiles, so Lemma B.1 is not directly applicable. However, here we only need toprove a result that is pointwise with respect to x for �xed �(�). Therefore, we can apply LemmaB5 in Qu and Yoon (2015, p.18), which implies �S(x; � ; � (x; �))� �S0 (x; �) = op (1), where the orderis uniform over T . As in Lemma 1, the second term in (A.17) can be analyzed using a second order

A-8

Page 45: Inference on Conditional Quantile Processes in Partially ...people.bu.edu/qu/partial_linear/Partial_Linear1.pdfconditional quantile processes in partially linear models. Speci–cally,

Taylor expansion, leading to the following expression:

�(nbdn;� )�1=2nXj=1

f (� jxj ; zj) ej (x; �) �Wj(bn;� ; x)K

�xj � xbn;�

0@(nbdn;� )�1 nXj=1

f (� jxj ; zj) �Wj(bn;� ; x)Wj(bn;� ; x)0K

�xj � xbn;�

�1A � (x; �)

�12(nbdn;� )

�1=2nXj=1

f 0 (eyj jxj ; zj) ej (x; �)2 �Wj(bn;� ; x)K

�xj � xbn;�

�12(nbdn;� )

�1=2

0@(nbdn;� )�1 nXj=1

f 0 (eyj jxj ; zj) [Wj(bn;� ; x)0� (x; �)]2 �Wj(bn;� ; x)K

�xj � xbn;�

�1A ;

where eyi lies between Q(� jxj ; zj) and Q(� jxj ; zj) + ej(x; �) + (nbdn;� )

�1=2Wj(bn;� ; x)0� (x; �). The

third and the fourth term in the display are op (1) uniformly over T . The second term can berewritten as

0@(nbdn;� )�1 nXj=1

f (� jxj ; zj) �Wj(bn;� ; x)z0jK

�xj � xbn;�

�1A (nbdn;� )1=2 �� (�)� � (�)�

0@(nbdn;� )�1 nXj=1

f (� jxj ; zj) �Wj(bn;� ; x) �Wj(bn;� ; x)0K

�xj � xbn;�

�1A� (nbdn;� )1=2

0@ �0(x; �)� �0(x; �)

bn;� (�1(x; �)� �1(x; �))

1A :

Because (nbdn;� )1=2(� (�) � � (�)) = op (1), the �rst line in the preceding expression converges in

probability to 0. Collecting the remaining terms and noticing that (A.16) is op (1), we obtain

�S0 (x; �)� (nbdn;� )�1=2nXj=1

f (� jxj ; zj) ej (x; �) �Wj(bn;� ; x)K

�xj � xbn;�

�(A.18)

0@(nbdn;� )�1 nXj=1

f (� jxj ; zj) �Wj(bn;� ; x) �Wj(bn;� ; x)0K

�xj � xbn;�

�1A� (nbdn;� )1=2

0@ �0(x; �)� �0(x; �)

bn;� (�1(x; �)� �1(x; �))

1A= op (1) :

A-9

Page 46: Inference on Conditional Quantile Processes in Partially ...people.bu.edu/qu/partial_linear/Partial_Linear1.pdfconditional quantile processes in partially linear models. Speci–cally,

Note that

(nbdn;� )�1

nXj=1

f (� jxj ; zj) �Wj(bn;� ; x) �Wj(bn;� ; x)0K

�xj � xbn;�

! pf(x)E (f (� jX;Z)jX = x)

24 1 0

0Ruu0K(u)du

35and

(nbdn;� )�1=2

nXj=1

f (� jxj ; zj) ej (x; �) �Wj(bn;� ; x)K

�xj � xbn;�

=1

2(nbd+4n;� )

1=2f(x)E (f (� jX;Z)jX = x)

Z 8<:u0@2g(x; �)@x@x0u

24 1

u

35K(u)du9=;+ op (1) ;

Applying these two results to the display (A.18) leads to

qnbdn;�

��0(x; �)� g(x; �)�B(x; �)b2n;�

�=

�nbdn;�

��1=2Pni=1

�� � 1(u0i (�) � 0

�)K�xi�xbn;�

�f (x)E (f (� jX;Z)jX = x)

+op (1) ;

where the order holds uniformly over T .The leading term on the right hand side of the above display does not depend on �(�). That is,

the situation is as if we were studying a purely nonparametric model. Lemma B3 in Qu and Yoon(2015) implies that this term is stochastically equicontinuous. It follows that it converges to theGaussian process as stated in the Theorem. The e¤ect of the linear interpolation can be analyzedin exactly the same way as on page p.15-16 of Qu and Yoon (2015). This completes the proof.Proof of Corollary 1. The proof is fairly standard. It is included for completeness. The MSE atan interior point x is

1

4tr

�@2g(x; �)

@x@x0

Zuu0K(u)du

�2b4n;� +

� (1� �)RK (u)2 du

nbdn;�f (x) [E (f(� jX;Z)jX = x)]2+ op(nb

dn;� ):

Computing the derivatives of the �rst two terms leads to the desired formula. The Lipschitz continu-ity requirement is satis�ed because E [f (� jX;Z)jx)]�2=(4+d) ; tr

�Ruu0K(u)du@2g(x; �)=@x@x0

��2=(4+d)and (� (1� �))1=(4+d) all have bounded �rst derivatives over T . This completes the proof.Proof of Corollary 2. The proof is fairly standard. Denote the band in the corollary by Bp: Byconstruction, it satis�es for any Cp > 0,

P (Q(� jx; z) =2 Bp for some � 2 T )

= P

�1

�n;�

���Q(� jx; z)�B(x; �)b2n;� �Q(� jx; z)��� > Cp for some � 2 T�

= P

�sup�2T

1

�n;�

���Q(� jx; z)�B(x; �)b2n;� �Q(� jx; z)��� > Cp

�: (A.19)

A-10

Page 47: Inference on Conditional Quantile Processes in Partially ...people.bu.edu/qu/partial_linear/Partial_Linear1.pdfconditional quantile processes in partially linear models. Speci–cally,

Theorem 1 implies

1

�n;�

�Q(� jx; z)�Q(� jx; z)� b2n;�B(x; �)

�) G1 (x; �) =

qEG1 (x; �)

2:

Therefore, setting Cp to the p-th quantile of sup�2T jjG1 (x; �) =qEG1 (x; �)

2jj leads to the desiredcoverage probability asymptotically. This completes the proof.Proof of Lemma 3. It su¢ ces to study (nbd+4n;� )

1=2(B(x; �)�B(x; �)). The proof is similar to thatof Theorem 1 with the main di¤erence being that a local quadratic regression is used. Considerminus the subgradient evaluated at x and z, normalized by (nrdn;� )

�1=2:

(nrdn;� )�1=2

nXj=1

n� � 1(u0j (�) � ej(x; �) + (nr

dn;� )

�1=2Wj(rn;� ; x)0� (x; �)

ofWj(rn;� ; x)K

�xj � xrn;�

�;

where � (x; �) equals e� (x; �) except that e�(x; �); e�0(x; �); e�1(x; �) and e�2(x; �) are replaced by�(�); 0(x; �); 1(x; �) and 2(x; �). By adding and subtracting terms, the above display can berewritten asneS(x; � ; � (x; �))� eS0 (x; �)o+ eS0 (x; �) (A.20)

+(nbdn;� )�1=2

nXj=1

n� � P

�u0j (�) � ej(x; �) + (nr

dn;� )

�1=2Wj(rn;� ; x)0� (x; �)

���xj ; zj�o�fWj(rn;� ; x)K

�xj � xrn;�

�;

where

eS (x; � ; �) = (nrdn;� )�1=2

nXj=1

nP�u0j (�) � ej(x; �) + (nr

dn;� )

�1=2Wj(rn;� ; x)0����xj ; zj�

� 1�u0j (�) � ej(x; �) + (nr

dn;� )

�1=2Wj(rn;� ; x)0��ofWj(rn;� ; x)K

�xj � xbn;�

�and eS0 (x; �) equals eS (x; � ; �) after setting � = 0 and ej(x; �) = 0. Using the same argument as inTheorem 1, the display (A.20), whose order is op(1); equals

�(nrdn;� )�1=2nXj=1

f (� jxj ; zj) ej (x; �)fWj(rn;� ; x)K

�xj � xrn;�

�(A.21)

0@(nrdn;� )�1 nXj=1

f (� jxj ; zj)K�xj � xrn;�

�fWj(rn;� ; x)fWj(rn;� ; x)0

1A

�(nrdn;� )1=2

0BBB@ 0(x; �)� 0(x; �)

rn;� ( 1(x; �)� 1(x; �))

r2n;� ( 2(x; �)� 2(x; �))

1CCCA+eS0 (x; �) + op (1) +Op �(nrdn;� )1=2h3n� :

A-11

Page 48: Inference on Conditional Quantile Processes in Partially ...people.bu.edu/qu/partial_linear/Partial_Linear1.pdfconditional quantile processes in partially linear models. Speci–cally,

Because

(nrdn;� )�1

nXj=1

f (� jxj ; zj)K�xj � xrn;�

�fWj(rn;� ; x)fWj(rn;� ; x)0

p! f(x)E (f (� jX;Z)jX = x)

Z�u�u0K(u)du

and

(nrdn;� )�1=2

nXj=1

f (� jxj ; zj) ej (x; �)fWj(rn;� ; x)K

�xj � xrn;�

�= Op

�(nrdn;� )

1=2r3n;�

�;

we obtainqnbd+4n;� ( 2(x; �)� 2(x; �))

=

qnbd+4n;�qnrd+4n;�

e03

�Z�u�u0K(u)du

��1 (nrdn;� )�1=2Pni=1

�� � 1

�u0i (�) � 0

�fWi(rn;� ; x)K(xi�xrn;�

)

f(x)E[f(� jX;Z)jX = x]

+

qnbd+4n;�qnrd+4n;�

e03

�Z�u�u0K(u)du

��1 nop (1) +Op

�(nrdn;� )

1=2�r3n;� + h

3n

��o:

Because c1bn � rn � c2hn, the second term on the right hand side is of order (qnbd+4n;� =

qnrd+4n;� )(op (1)+

Op(nrdn;� )

1=2h3n), which is of lower order thanqnbd+4n;� =

qnrd+4n;� unless (nrdn;� )

1=2h3n is nonzero inthe limit (that is, when rn;� has the same rate as the MSE-optimal bandwidth for a local quadraticregression). But in the latter case bn;�=rn;� ! 0 so the order of the whole term is still op (1). Thiscompletes the proof.Proof of Theorem 2. We only consider the situation with rn;�=bn = �(�) with 0 < �(�) < 1over T because the other situation follows immediately from Lemma 3. First, for �xed � , D1(x; �)and D2(x; �) both converge to normal random variables with mean zero. Second, for any t 6= s, itis simple to verify that that the covariance of D1(x; t) and D2(x; s) and also that of D2(x; t) andD2(x; s) satisfy the expressions given in the theorem. Third, the stochastic equicontinuity of theprocess D2(x; �) follows from Lemma B3 in Qu and Yoon (2015). Therefore, D1(x; �) � D2(x; �)converges weakly to the Gaussian process with the speci�ed covariance kernel. Finally, the e¤ectof the linear interpolation can be analyzed in exactly the same way as on page p.15-16 in Qu andYoon (2015). This completes the proof.Proof of Corollary 4. This can be proved by showing

(nbdn;� )1=2� 0(x; �) + (B(x; �)�B (x; �))b2n;� � g(x; �)

�and

(nbdn;� )1=2�Q(� jx; z)� b2n;�B(x; �)�Q(� jx; z)

�are asymptotically equivalent.

A-12

Page 49: Inference on Conditional Quantile Processes in Partially ...people.bu.edu/qu/partial_linear/Partial_Linear1.pdfconditional quantile processes in partially linear models. Speci–cally,

Because of (A.21), we have

�Z�u�u0K(u)du

�qnbdn;�

0BBB@ 0(x; �)� g(x; �)

bn;� ( 1(x; �)� 1(x; �))

b2n;� ( 2(x; �)� 2(x; �))

1CCCA =eS0 (x; �)

f(x)E[f (� jX;Z)jX = x]+ op (1) :

Note that the �rst row of fR�u�u0K(u)dug equals

R[ 1 u q(u)0 ]K(u)du. Therefore, the �rst

element of the left hand side vector of the above display equals

Z h1 u q(u)0

iK(u)du�

qnbdn;�

0BBB@ 0(x; �)� g(x; �)

bn;� ( 1(x; �)� 1(x; �))

b2n;� ( 2(x; �)� 2(x; �))

1CCCA=

qnbdn;�

� 0(x; �)� g(x; �) + b2n;�

�Zq(u)0K(u)du

�( 2(x; �)� 2(x; �))

�;

where we have usedRuK(u)du = 0. By the de�nition of B(x; �), the right hand side of the

preceding display is asymptotically equivalent toqnbdn;� ( 0(x; �)�g(x; �)+b2n;� (B(x; �)�B (x; �))).

Meanwhile, the �rst element of the vector eS0 (x; �) =(f(x)E[f (� jX;Z)jX = x]) is asymptotically

equivalent toqnbdn;� (Q(� jx; z)� b2n;�B(x; �)�Q(� jx; z)) by Theorem 1. This completes the proof.

Proof of Theorem 3. The argument used is similar to Koenker (2005, p.109). Let a0(x; �) anda1(x; �) be any value such that the norm of

qnbdn;�

0BBB@a0(x; �)� �0(x; �)

bn;� (a1(x; �)� �1(x; �))

�(�)� �(�)

1CCCAdoes not exceed log n. Such values form a compact set. ��0(x; �) and �

�1(x; �) are in this set with

probability approaching one; so are �0(x; �) and �1(x; �). By Lemma B5 in Qu and Yoon (2015,p.18), the following quantity is op (1) uniformly over this set and T :

(nbdn;� )�1=2

nXj=1

nP�yj � z0j �(�)� a0(x; �)� (xj � x)

0 a1(x; �) � 0jxj ; zj�o

(A.22)

� �Wj(bn;� ; x)K

�xj � xbn;�

��(nbdn;� )�1=2

nXj=1

n1�yj � z0j �(�)� a0(x; �)� (xj � x)

0 a1(x; �) � 0�o

� �Wj(bn;� ; x)K

�xj � xbn;�

��(nbdn;� )�1=2

nXj=1

�� � 1

�u0j (�) � 0

��Wj(bn;� ; x)K

�xj � xbn;�

�:

A-13

Page 50: Inference on Conditional Quantile Processes in Partially ...people.bu.edu/qu/partial_linear/Partial_Linear1.pdfconditional quantile processes in partially linear models. Speci–cally,

Evaluating the above display at (��0(x; �), ��1(x; �)) and (�0(x; �), �1(x; �)) and then take the

di¤erence, we obtain

(nbdn;� )�1=2

nXj=1

nP�yj � z0j �(�)� ��0(x; �)� (xj � x)

0 ��1(x; �) � 0jxj ; zj�o

� �Wj(bn;� ; x)K

�xj � xbn;�

��(nbdn;� )�1=2

nXj=1

n1�yj � z0j �(�)� ��0(x; �)� (xj � x)

0 ��1(x; �) � 0�o

� �Wj(bn;� ; x)K

�xj � xbn;�

��(nbdn;� )�1=2

nXj=1

nP�yj � z0j �(�)� �0(x; �)� (xj � x)

0 �1(x; �) � 0jxj ; zj�o

� �Wj(bn;� ; x)K

�xj � xbn;�

�+(nbdn;� )

�1=2nXj=1

n1�yj � z0j �(�)� �0(x; �)� (xj � x)

0 �1(x; �) � 0�o

� �Wj(bn;� ; x)K

�xj � xbn;�

�:

Because (��0(x; �), ��1(x; �)) satis�es (26) and (�0(x; �), �1(x; �)) solves (7), the display equals

(nbdn;� )�1=2

nXj=1

nP�yj � z0j �(�)� ��0(x; �)� (xj � x)

0 ��1(x; �) � 0jxj ; zj�o

� �Wj(bn;� ; x)K

�xj � xbn;�

��(nbdn;� )�1=2

nXj=1

f� � 1(ui � � � 0)g �Wj(bn;� ; x)K

�xj � xbn;�

�(nbdn;� )�1=2nXj=1

nP�yj � z0j �(�)� �0(x; �)� (xj � x)

0 �1(x; �) � 0jxj ; zj�o

� �Wj(bn;� ; x)K

�xj � xbn;�

�+ op (1) :

Expanding both the �rst and the third term around (�(�); �0(x; �); �1(x; �)) using �rst order Taylorexpansions, the preceding display equals:8<:�nbdn;���1

nXj=1

f(� jxj ; zj) �Wj(bn;� ; x) �Wj(bn;� ; x)0K

�xj � xbn;�

�9=;qnbdn;�0@ ��0(x; �)� �0(x; �)

bn;� (��1(x; �)� �1(x; �))

1A��nbdn;�

��1=2 nXj=1

f� � 1(ui � � � 0)g �Wj(bn;� ; x)K

�xj � xbn;�

�+ op (1) :

A-14

Therefore,
$$\sqrt{nb_{n,\tau}^d}\begin{pmatrix}\hat\alpha_0^*(x,\tau)-\hat\alpha_0(x,\tau)\\ b_{n,\tau}\left(\hat\alpha_1^*(x,\tau)-\hat\alpha_1(x,\tau)\right)\end{pmatrix}=\left\{\left(nb_{n,\tau}^d\right)^{-1}\sum_{j=1}^nf(\tau|x_j,z_j)\bar W_j(b_{n,\tau},x)\bar W_j(b_{n,\tau},x)'K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)\right\}^{-1}$$
$$\times\left(nb_{n,\tau}^d\right)^{-1/2}\sum_{j=1}^n\left\{\tau-1\left(u_j-\tau\le0\right)\right\}\bar W_j(b_{n,\tau},x)K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)+o_p(1).$$
After taking the limit of the term in curly brackets, we obtain $D_1(x,\tau)$ with $u_j^0(\tau)$ replaced by $u_j-\tau$. Its weak convergence also follows immediately. This completes the proof.
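Theorem 3 is what licenses the resampling scheme: conditional on the data, a score process built from fresh draws $u_j\sim\text{i.i.d. }U(0,1)$, entering through $\{\tau-1(u_j-\tau\le0)\}$, mimics the original score built from $\{\tau-1(u_j^0(\tau)\le0)\}$. The following is a minimal numpy sketch of one such simulation draw, specialized to a scalar covariate and the local linear weight $\bar W_j=(1,(x_j-x)/b)'$; the function name, the kernel choice, and the grid are illustrative assumptions, not the paper's code, and the plug-in inverse Hessian step is omitted.

```python
import numpy as np

def simulated_score_draw(x_obs, x0, h, taus, rng):
    """One resampling draw of the standardized local linear score process
    (n h)^(-1/2) * sum_j {tau - 1(u_j <= tau)} * Wbar_j * K((x_j - x0)/h),
    evaluated on a grid of quantile levels (scalar covariate, d = 1)."""
    n = x_obs.shape[0]
    u = rng.uniform(size=n)                  # one uniform draw per observation
    t = (x_obs - x0) / h
    k = np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)   # Epanechnikov
    W = np.column_stack([np.ones(n), t])     # Wbar_j = (1, (x_j - x0)/h)'
    ind = taus[:, None] - (u[None, :] <= taus[:, None])        # {tau - 1(u_j <= tau)}
    return (ind * k[None, :]) @ W / np.sqrt(n * h)             # shape (n_tau, 2)

# Repeating such draws and taking suprema over the tau grid (after premultiplying
# by an estimated inverse Hessian) is how resampling-based critical values for
# uniform bands are obtained.
rng = np.random.default_rng(0)
draw = simulated_score_draw(rng.uniform(size=500), x0=0.5, h=0.3,
                            taus=np.linspace(0.2, 0.8, 61), rng=rng)
```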

Proof of Corollary 5. By Theorem 3, $\sqrt{nb_{n,\tau}^d}\,(\hat\alpha_0^*(x,\tau)-\hat\alpha_0(x,\tau))=D_1^*(x,\tau)+o_p(1)$. Therefore, it is sufficient to show that $\sqrt{nb_{n,\tau}^d}\,b_{n,\tau}^2\left(\hat B^*(x,\tau)-\hat B(x,\tau)\right)=D_2^*(x,\tau)+o_p(1)$. Applying the same argument as that of Theorem 3, except to a local quadratic regression, we obtain
$$\left\{\left(nr_{n,\tau}^d\right)^{-1}\sum_{j=1}^nf(\tau|x_j,z_j)\widetilde W_j(r_{n,\tau},x)\widetilde W_j(r_{n,\tau},x)'K\!\left(\frac{x_j-x}{r_{n,\tau}}\right)\right\}\sqrt{nr_{n,\tau}^d}\begin{pmatrix}\hat\gamma_0^*(x,\tau)-\hat\gamma_0(x,\tau)\\ r_{n,\tau}\left(\hat\gamma_1^*(x,\tau)-\hat\gamma_1(x,\tau)\right)\\ r_{n,\tau}^2\left(\hat\gamma_2^*(x,\tau)-\hat\gamma_2(x,\tau)\right)\end{pmatrix}$$
$$-\left(nr_{n,\tau}^d\right)^{-1/2}\sum_{j=1}^n\left\{\tau-1\left(u_j-\tau\le0\right)\right\}\widetilde W_j(r_{n,\tau},x)K\!\left(\frac{x_j-x}{r_{n,\tau}}\right)=o_p(1).$$
Therefore,
$$\sqrt{nb_{n,\tau}^d}\begin{pmatrix}\hat\gamma_0^*(x,\tau)-\hat\gamma_0(x,\tau)\\ r_{n,\tau}\left(\hat\gamma_1^*(x,\tau)-\hat\gamma_1(x,\tau)\right)\\ r_{n,\tau}^2\left(\hat\gamma_2^*(x,\tau)-\hat\gamma_2(x,\tau)\right)\end{pmatrix}=\frac{\sqrt{nb_{n,\tau}^d}}{\sqrt{nr_{n,\tau}^d}}\left\{\left(nr_{n,\tau}^d\right)^{-1}\sum_{j=1}^nf(\tau|x_j,z_j)\widetilde W_j(r_{n,\tau},x)\widetilde W_j(r_{n,\tau},x)'K\!\left(\frac{x_j-x}{r_{n,\tau}}\right)\right\}^{-1}$$
$$\times\left(nr_{n,\tau}^d\right)^{-1/2}\sum_{j=1}^n\left\{\tau-1\left(u_j-\tau\le0\right)\right\}\widetilde W_j(r_{n,\tau},x)K\!\left(\frac{x_j-x}{r_{n,\tau}}\right)+o_p(1).$$
After taking the limit of the term in curly brackets, we obtain $D_2(x,\tau)$ with $u_j^0(\tau)$ replaced by $u_j-\tau$. This completes the proof.

Proof of Corollary 7. Define $W_{l,j}(x,b_{n,\tau})=\left[\,1\ \ z_j'\ \ (x_j-x)'/b_{n,\tau}\,\right]'$. We have
$$\sqrt{nb_{n,\tau}^d}\begin{pmatrix}\hat\alpha_{0,l}(x,\tau)-\alpha_{0,l}(x,\tau)\\ \hat\beta_l(x,\tau)-\beta(\tau)\\ b_{n,\tau}\left(\hat\alpha_{1,l}(x,\tau)-\alpha_{1,l}(x,\tau)\right)\end{pmatrix}=\left\{\left(nb_{n,\tau}^d\right)^{-1}\sum_{j=1}^nf(\tau|x,z_j)W_{l,j}(x,b_{n,\tau})W_{l,j}(x,b_{n,\tau})'K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)\right\}^{-1}$$
$$\times\left\{\left(nb_{n,\tau}^d\right)^{-1/2}\sum_{j=1}^n\left\{\tau-1\left(u_j^0(\tau)\le0\right)\right\}W_{l,j}(x,b_{n,\tau})K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)+\left(nb_{n,\tau}^d\right)^{-1}\sum_{j=1}^nf(\tau|x,z_j)e_j(x,\tau)W_{l,j}(x,b_{n,\tau})K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)\right\}+o_p(1).$$
The term in the first set of curly brackets converges to
$$E\left(f(X)f(\tau|X,Z)\begin{bmatrix}1&Z'&0\\ Z&ZZ'&0\\ 0&0&\int uu'K(u)\,du\end{bmatrix}\,\middle|\,X=x\right),$$
whose inverse is block diagonal. Applying the block diagonality, we have
$$\sqrt{nb_{n,\tau}^d}\left[\,1\ \ z'\ \ 0\,\right]\begin{pmatrix}\hat\alpha_{0,l}(x,\tau)-\alpha_{0,l}(x,\tau)\\ \hat\beta_l(x,\tau)-\beta(\tau)\\ b_{n,\tau}\left(\hat\alpha_{1,l}(x,\tau)-\alpha_{1,l}(x,\tau)\right)\end{pmatrix}=\left[\,1\ \ z'\,\right]M_l(x,\tau)^{-1}\left\{\left(nb_{n,\tau}^d\right)^{-1/2}\sum_{j=1}^n\left\{\tau-1\left(u_j^0(\tau)\le0\right)\right\}\begin{bmatrix}1\\ z_j\end{bmatrix}K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)\right\}$$
$$+\left[\,1\ \ z'\,\right]M_l(x,\tau)^{-1}\left\{\left(nb_{n,\tau}^d\right)^{-1}\sum_{j=1}^nf(\tau|x,z_j)e_j(x,\tau)\begin{bmatrix}1\\ z_j\end{bmatrix}K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)\right\}+o_p(1).$$
The first term on the right hand side equals $D_{1,l}(x,z,\tau)$ and the second converges to $B_l(x,z,\tau)$.
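The block-diagonality step uses only the elementary identity, recorded here for completeness:
$$\begin{pmatrix}A&0\\0&B\end{pmatrix}^{-1}=\begin{pmatrix}A^{-1}&0\\0&B^{-1}\end{pmatrix},$$
so that premultiplying by $[\,1\ \ z'\ \ 0\,]$ isolates the upper-left block, which corresponds to $M_l(x,\tau)$, and annihilates the $\int uu'K(u)\,du$ block associated with the local slope coefficients.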

Proof of Corollary 9. The proof is similar to that of Theorem 1, so we only give an outline to avoid repetition. The following result holds uniformly over $\mathcal T$:
$$\left(nb_{n,\tau}^d\right)^{-1/2}\sum_{j=1}^n\left\{\tau-1\left(u_j^0(\tau)\le0\right)\right\}\bar W_j(x,b_{n,\tau})K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)+\left(nb_{n,\tau}^d\right)^{-1/2}\sum_{j=1}^nf(\tau|x,z_j)e_j(x,\tau)\bar W_j(x,b_{n,\tau})K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)$$
$$-\left(\left(nb_{n,\tau}^d\right)^{-1}\sum_{j=1}^nf(\tau|x_j,z_j)\bar W_j(x,b_{n,\tau})\bar W_j(x,b_{n,\tau})'K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)\right)\left(nb_{n,\tau}^d\right)^{1/2}\begin{pmatrix}\hat\alpha_0(x,\tau)-\alpha_0(x,\tau)\\ b_{n,\tau}\left(\hat\alpha_1(x,\tau)-\alpha_1(x,\tau)\right)\end{pmatrix}=o_p(1).$$
We have
$$\left(nb_{n,\tau}^d\right)^{-1}\sum_{j=1}^nf(\tau|x,z_j)\bar W_j(x,b_{n,\tau})\bar W_j(x,b_{n,\tau})'K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)\xrightarrow{\ p\ }f(x)E\left(f(\tau|X,Z)\mid X=x\right)N_x(\tau)$$
and
$$\left(nb_{n,\tau}^d\right)^{-1/2}\sum_{j=1}^nf(\tau|x,z_j)e_j(x,\tau)\bar W_j(x,b_{n,\tau})K\!\left(\frac{x_j-x}{b_{n,\tau}}\right)=\frac12\left(nb_{n,\tau}^{d+4}\right)^{1/2}f(x)E\left(f(\tau|X,Z)\mid X=x\right)\int_{D_{x,b_{n,\tau}}}u'\frac{\partial^2Q(\tau|x)}{\partial x\partial x'}u\begin{bmatrix}1\\ u\end{bmatrix}K(u)\,du+o_p(1).$$
Combining the above two expressions leads to the desired result.

Proof of Corollary 10. The proof is again similar to that of Theorem 1, so we only give an outline to avoid repetition. Recall
$$\widetilde W_j(x,r_{n,\tau})=\left[\,1\ \ \frac{(x_j-x)'}{r_{n,\tau}}\ \ \frac{q(x_j-x)'}{r_{n,\tau}^2}\,\right]'.$$
The following result holds uniformly over $\mathcal T$:
$$\left(\frac{\sqrt{nb_{n,\tau}^{d+4}}}{\sqrt{nr_{n,\tau}^{d+4}}}\right)\left(nr_{n,\tau}^d\right)^{-1/2}\sum_{j=1}^n\left\{\tau-1\left(u_j^0(\tau)\le0\right)\right\}\widetilde W_j(x,r_{n,\tau})K\!\left(\frac{x_j-x}{r_{n,\tau}}\right)$$
$$-\left(\frac{\sqrt{nb_{n,\tau}^{d+4}}}{\sqrt{nr_{n,\tau}^{d+4}}}\right)\left(\left(nr_{n,\tau}^d\right)^{-1}\sum_{j=1}^nf(\tau|x_j,z_j)K\!\left(\frac{x_j-x}{r_{n,\tau}}\right)\widetilde W_j(x,r_{n,\tau})\widetilde W_j(x,r_{n,\tau})'\right)\left(nr_{n,\tau}^d\right)^{1/2}\begin{pmatrix}\hat\gamma_0(x,\tau)-\gamma_0(x,\tau)\\ r_{n,\tau}\left(\hat\gamma_1(x,\tau)-\gamma_1(x,\tau)\right)\\ r_{n,\tau}^2\left(\hat\gamma_2(x,\tau)-\gamma_2(x,\tau)\right)\end{pmatrix}=o_p(1).$$

Because
$$\left(nr_{n,\tau}^d\right)^{-1}\sum_{j=1}^nf(\tau|x_j,z_j)\widetilde W_j(x)\widetilde W_j(x)'K\!\left(\frac{x_j-x}{r_{n,\tau}}\right)\xrightarrow{\ p\ }f(x)E\left(f(\tau|X,Z)\mid X=x\right)\int_{D_{x,r_{n,\tau}}}\tilde u\tilde u'K(u)\,du,$$
we have
$$\sqrt{nb_{n,\tau}^{d+4}}\left(\hat\gamma_2(x,\tau)-\gamma_2(x,\tau)\right)=\left(\frac{\sqrt{nb_{n,\tau}^{d+4}}}{\sqrt{nr_{n,\tau}^{d+4}}}\right)e_3'\left[f(x)E\left(f(\tau|X,Z)\mid X=x\right)\int_{D_{x,r_{n,\tau}}}\tilde u\tilde u'K(u)\,du\right]^{-1}$$
$$\times\left(nr_{n,\tau}^d\right)^{-1/2}\sum_{j=1}^n\left\{\tau-1\left(u_j^0(\tau)\le0\right)\right\}\widetilde W_j(x,r_{n,\tau})K\!\left(\frac{x_j-x}{r_{n,\tau}}\right)+o_p(1).$$
The result follows after applying expression (23). This completes the proof.

Appendix B. Auxiliary Lemmas

The same notation as in the main appendix is used here. The next result relates $S(x,\tau,\delta)$ to $S_0(x,\tau)$. It is needed for establishing the convergence rate of the estimator in the first step of the estimation procedure and the Bahadur representation.

Lemma B.1 Under the same Assumptions as in Lemma 1, we have
$$\sup_{x\in S_x}\sup_{\tau\in\mathcal T}\sup_{\|\delta\|\le\log n}\left\|S(x,\tau,\delta)-S_0(x,\tau)\right\|=O_p\!\left(\left(nh_n^d\right)^{-1/4}\log n\right). \tag{B.1}$$

Proof. The proof is fairly long. It is structured into three steps as follows. Step one applies a chaining argument to bound the left hand side of (B.1) with three terms. Step two exploits the structure of $S(x,\tau,\delta)$ to derive further upper and lower bounds for it. Step three applies Bernstein's inequality. We proceed under the local quadratic specification, and comment on the differences, if any, for the local linear case. Let $C$ denote a finite constant that can change value between occurrences.

Step 1. Apply a chaining argument. Because the support of $x$, $S_x$, is compact, it can be partitioned into
$$L_x=C\left(\left(nh_n^d\right)^{3/4}h_n^{-2}\right)^{d}$$
cubes such that the side length of each cube is at most $(nh_n^d)^{-3/4}h_n^2$. Similarly, the set $\Delta=\{\delta:\|\delta\|\le\log n\}$ can be partitioned into
$$L_\delta=C\left(\left(nh_n^d\right)^{1/4}\log n\right)^{\dim(\delta)}$$
cubes whose side length is at most $(nh_n^d)^{-1/4}$. Also, $\mathcal T$ can be partitioned into $L_\tau=C(nh_n^d)^{3/4}$ intervals whose length does not exceed $(nh_n^d)^{-3/4}$. Let
$$N=L_\tau L_\delta L_x=C^3h_n^{-2d}(\log n)^{\dim(\delta)}\left(nh_n^d\right)^{(\dim(\delta)+3d+3)/4}.$$
Define $\phi=(x',\tau,\delta')'$ and $\Phi=S_x\times\mathcal T\times\Delta$. Write $\phi\in I_s$ if $\phi$ falls into the $s$-th cube, where $s\in\{1,\dots,N\}$. Let $\phi_s$ be the smallest value in the $s$-th cube, including the values on the boundaries.
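As a sanity check on the cube count, multiplying the three cardinalities reproduces the stated $N$:
$$L_\tau L_\delta L_x=C\left(nh_n^d\right)^{3/4}\cdot C\left(\left(nh_n^d\right)^{1/4}\log n\right)^{\dim(\delta)}\cdot C\left(\left(nh_n^d\right)^{3/4}h_n^{-2}\right)^{d}=C^3h_n^{-2d}(\log n)^{\dim(\delta)}\left(nh_n^d\right)^{(3+\dim(\delta)+3d)/4}.$$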

Apply the above partition to the left hand side of (B.1):
$$\sup_{x\in S_x}\sup_{\tau\in\mathcal T}\sup_{\|\delta\|\le\log n}\left\|S(x,\tau,\delta)-S_0(x,\tau)\right\|$$
$$\le\max_{1\le s\le N}\sup_{\phi\in\Phi\cap I_s}\left\|S(x,\tau,\delta)-S_0(x,\tau)-S(x_s,\tau_s,\delta_s)+S_0(x_s,\tau_s)\right\|+\max_{1\le s\le N}\left\|S(x_s,\tau_s,\delta_s)-S_0(x_s,\tau_s)\right\|$$
$$\le\underbrace{\max_{1\le s\le N}\sup_{\phi\in\Phi\cap I_s}\left\|S(x,\tau,\delta)-S(x_s,\tau_s,\delta_s)\right\|}_{\text{(I)}}+\underbrace{\max_{1\le s\le N}\sup_{\phi\in\Phi\cap I_s}\left\|S_0(x,\tau)-S_0(x_s,\tau_s)\right\|}_{\text{(II)}}+\underbrace{\max_{1\le s\le N}\left\|S(x_s,\tau_s,\delta_s)-S_0(x_s,\tau_s)\right\|}_{\text{(III)}}.$$

Step 2. Derive upper and lower bounds. This step focuses on Term (I). The goal is to derive bounds for $S(x,\tau,\delta)-S(x_s,\tau_s,\delta_s)$ that depend on $\phi_s$ but not on $\phi$. This will be done in a few small steps. Because (II) equals Term (I) with $\delta=0$ and $e(x,\tau)=0$, a separate analysis of (II) is unnecessary. This step does not further study Term (III). We have
$$S(x,\tau,\delta)=(nh_n^d)^{-1/2}\sum_{j=1}^n\left\{P\left(u_j^0(\tau)\le(nh_n^d)^{-1/2}W_j(h_n,x)'\delta+e_j(x,\tau)\mid x_j,z_j\right)-1\left(u_j^0(\tau)\le(nh_n^d)^{-1/2}W_j(h_n,x)'\delta+e_j(x,\tau)\right)\right\} \tag{B.2}$$
$$\times\left(W_j(h_n,x)K\!\left(\frac{x_j-x}{h_n}\right)-W_j(h_n,x_s)K\!\left(\frac{x_j-x_s}{h_n}\right)\right)$$
$$+(nh_n^d)^{-1/2}\sum_{j=1}^n\left\{P\left(u_j^0(\tau)\le(nh_n^d)^{-1/2}W_j(h_n,x)'\delta+e_j(x,\tau)\mid x_j,z_j\right)-1\left(u_j^0(\tau)\le(nh_n^d)^{-1/2}W_j(h_n,x)'\delta+e_j(x,\tau)\right)\right\}W_j(h_n,x_s)K\!\left(\frac{x_j-x_s}{h_n}\right).$$

The norm of the first summation on the right hand side is bounded from above by
$$2(nh_n^d)^{-1/2}\sum_{j=1}^n\left\|W_j(h_n,x)K\!\left(\frac{x_j-x}{h_n}\right)-W_j(h_n,x_s)K\!\left(\frac{x_j-x_s}{h_n}\right)\right\|$$
$$\le\underbrace{2(nh_n^d)^{-1/2}\sum_{j=1}^n\left\|W_j(h_n,x)-W_j(h_n,x_s)\right\|K\!\left(\frac{x_j-x}{h_n}\right)}_{\text{(A)}}+\underbrace{2(nh_n^d)^{-1/2}\sum_{j=1}^n\left\|W_j(h_n,x_s)\right\|\left|K\!\left(\frac{x_j-x}{h_n}\right)-K\!\left(\frac{x_j-x_s}{h_n}\right)\right|}_{\text{(B)}}.$$

Suppose $\phi\in\Phi\cap I_s$. Apply the definition of $W_j(h_n,x)$ in (11):
$$\text{(A)}\le2C(nh_n^d)^{1/2}\frac{\|x_s-x\|}{h_n}\left\{(nh_n^d)^{-1}\sum_{j=1}^nK\!\left(\frac{x_j-x}{h_n}\right)\right\}.$$
The term in curly brackets is $O_p(1)$ uniformly in $x$ (cf. Theorem 2 in Masry, 1996). Because $\|x_s-x\|\le(nh_n^d)^{-3/4}h_n^2$ as implied by the size of the cubes, we have
$$2C(nh_n^d)^{1/2}h_n^{-1}\|x_s-x\|=O\!\left((nh_n^d)^{1/2}(nh_n^d)^{-3/4}h_n\right)=o\!\left((nh_n^d)^{-1/4}\right).$$
Therefore,
$$\text{(A)}=o_p\!\left((nh_n^d)^{-1/4}\right).$$
Because $W_j(h_n,x)$ is bounded for all $x$,
$$\text{(B)}\le2C(nh_n^d)^{-1/2}\sum_{j=1}^n\left|K\!\left(\frac{x_j-x}{h_n}\right)-K\!\left(\frac{x_j-x_s}{h_n}\right)\right|.$$

Because $K(\cdot)$ has a compact support, there exists $1<\nu<\infty$ such that $K(u)=0$ whenever $\|u\|>\nu$. This implies
$$\text{(B)}\le2C(nh_n^d)^{-1/2}\sum_{j=1}^n\left|K\!\left(\frac{x_j-x}{h_n}\right)-K\!\left(\frac{x_j-x_s}{h_n}\right)\right|1\left(\min\{\|x_j-x\|,\|x_j-x_s\|\}\le\nu h_n\right) \tag{B.3}$$
$$\le2C^2(nh_n^d)^{1/2}\left\|\frac{x-x_s}{h_n}\right\|\left\{(nh_n^d)^{-1}\sum_{j=1}^n1\left(\|x_j-x_s\|\le2\nu h_n\right)\right\},$$
where the second inequality follows because $\|x_s-x\|\le(nh_n^d)^{-3/4}h_n^2<h_n$ and
$$\left|K\!\left(\frac{x_j-x}{h_n}\right)-K\!\left(\frac{x_j-x_s}{h_n}\right)\right|\le C\left\|\frac{x-x_s}{h_n}\right\|.$$
Therefore,
$$\text{(B)}=o_p\!\left((nh_n^d)^{-1/4}\right).$$

Combining the results for (A) and (B), we have that whenever $\phi\in\Phi\cap I_s$,
$$S(x,\tau,\delta)-S(x_s,\tau_s,\delta_s) \tag{B.4}$$
$$=(nh_n^d)^{-1/2}\sum_{j=1}^n\Big\{P\left(u_j^0(\tau)\le(nh_n^d)^{-1/2}W_j(h_n,x)'\delta+e_j(x,\tau)\mid x_j,z_j\right)-1\left(u_j^0(\tau)\le(nh_n^d)^{-1/2}W_j(h_n,x)'\delta+e_j(x,\tau)\right)$$
$$-P\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)\mid x_j,z_j\right)+1\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)\right)\Big\}W_j(h_n,x_s)K\!\left(\frac{x_j-x_s}{h_n}\right)$$
$$+o_p\!\left((nh_n^d)^{-1/4}\right).$$

Below, we continue to analyze the leading term in (B.4). Let $W_{j,k}(h_n,x)$ denote the $k$-th element of $W_j(h_n,x)$. Define
$$W_j^+(h_n,x,k)=(0,\dots,W_{j,k}(h_n,x),\dots,0)\,1\left(W_{j,k}(h_n,x)\ge0\right),$$
$$W_j^-(h_n,x,k)=(0,\dots,-W_{j,k}(h_n,x),\dots,0)\,1\left(W_{j,k}(h_n,x)<0\right).$$
Then $W_j(h_n,x)$ can be expressed using $2\dim(W_j(h_n,x))$ non-negative terms:
$$W_j(h_n,x)=\sum_{k=1}^{\dim(W_j(h_n,x))}W_j^+(h_n,x,k)-\sum_{k=1}^{\dim(W_j(h_n,x))}W_j^-(h_n,x,k).$$
This decomposition follows Bai (1996). Using it, the summation in (B.4) can be represented using $2\dim(W_j(h_n,x))$ terms. These terms can be studied in the same way. It is therefore sufficient to consider just one term:

$$(nh_n^d)^{-1/2}\sum_{j=1}^n\Big\{P\left(u_j^0(\tau)\le(nh_n^d)^{-1/2}W_j(h_n,x)'\delta+e_j(x,\tau)\mid x_j,z_j\right)-1\left(u_j^0(\tau)\le(nh_n^d)^{-1/2}W_j(h_n,x)'\delta+e_j(x,\tau)\right) \tag{B.5}$$
$$-P\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)\mid x_j,z_j\right)+1\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)\right)\Big\}W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right).$$

Because $\tau_s\le\tau\le\tau_{s+1}$ and by monotonicity, the first two components in (B.5) satisfy
$$P\left(u_j^0(\tau)\le(nh_n^d)^{-1/2}W_j(h_n,x)'\delta+e_j(x,\tau)\mid x_j,z_j\right)-1\left(u_j^0(\tau)\le(nh_n^d)^{-1/2}W_j(h_n,x)'\delta+e_j(x,\tau)\right) \tag{B.6}$$
$$\le P\left(u_j^0(\tau_{s+1})\le(nh_n^d)^{-1/2}W_j(h_n,x)'\delta+e_j(x,\tau)\mid x_j,z_j\right)-1\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x)'\delta+e_j(x,\tau)\right).$$
Because
$$\left|W_j(h_n,x)'\delta-W_j(h_n,x)'\delta_s\right|\le\|\delta-\delta_s\|\left\|W_j(h_n,x)\right\|\le(nh_n^d)^{-1/4}\left\|W_j(h_n,x)\right\|\le C(nh_n^d)^{-1/4},$$
we have
$$W_j(h_n,x)'\delta_s-C(nh_n^d)^{-1/4}\le W_j(h_n,x)'\delta\le W_j(h_n,x)'\delta_s+C(nh_n^d)^{-1/4}.$$

Consequently, (B.6) is further bounded from the above by

P�u0j (� s+1) � (nhdn)�1=2Wj(hn; x)�s + C(nh

dn)�3=4 + ej(x; �)jxj;zj

�(B.7)

�1�u0j (� s) � (nhdn)�1=2Wj(hn; x)�s � C(nhdn)�3=4 + ej(x; �)

�:

Because kWj(hn; x)�Wj(hn; xs)k � C kx� xsk =hn, we have

kWj(hn; x)�s �Wj(hn; xs)�sk = k(Wj(hn; x)�W (hn; xs))�sk

� Ckx� xsk

hnk�sk

� C1

hn

�nhdn

��3=4h2n log n

� C(nhdn)�3=4:

As a result, (B.7) is further bounded from the above by

P�u0j (� s+1) � (nhdn)�1=2Wj(hn; xs)�s + 2C(nh

dn)�3=4 + ej(x; �)jxj;zj

�(B.8)

� 1�u0j (� s) � (nhdn)�1=2Wj(hn; xs)�s � 2C(nhdn)�3=4 + ej(x; �)

�:

B-4

It remains to relate $e_j(x,\tau)$ to $e_j(x_s,\tau_s)$. Recall that $e_j(x,\tau)$ equals
$$e_j(x,\tau)=g(x,\tau)+\frac{\partial g(x,\tau)}{\partial x'}(x_j-x)+\frac12(x_j-x)'\frac{\partial^2g(x,\tau)}{\partial x\partial x'}(x_j-x)-g(x_j,\tau).$$
Apply this definition:
$$e_j(x,\tau)-e_j(x_s,\tau_s)=g(x,\tau)-g(x_s,\tau_s)+\frac{\partial g(x,\tau)}{\partial x'}(x_j-x)-\frac{\partial g(x_s,\tau_s)}{\partial x'}(x_j-x_s)$$
$$+\frac12(x_j-x)'\frac{\partial^2g(x,\tau)}{\partial x\partial x'}(x_j-x)-\frac12(x_j-x_s)'\frac{\partial^2g(x_s,\tau_s)}{\partial x\partial x'}(x_j-x_s).$$

Because of the Lipschitz continuity in Assumption 4, the three differences on the right hand side are all bounded by $C(nh_n^d)^{-3/4}/3$. Therefore, (B.8), and consequently (B.6), has an upper bound
$$P\left(u_j^0(\tau_{s+1})\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)+3C(nh_n^d)^{-3/4}\mid x_j,z_j\right) \tag{B.9}$$
$$-1\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)-3C(nh_n^d)^{-3/4}\right).$$
By applying the same argument, we can find a lower bound for (B.6), given by
$$P\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)-3C(nh_n^d)^{-3/4}\mid x_j,z_j\right) \tag{B.10}$$
$$-1\left(u_j^0(\tau_{s+1})\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)+3C(nh_n^d)^{-3/4}\right).$$

Combining (B.9) and (B.10) with the non-negativity of $W_j^+(h_n,x_s,k)$, we obtain an upper bound for (B.5) given by
$$UB(x_s,\tau_s,\tau_{s+1},\delta_s)=(nh_n^d)^{-1/2}\sum_{j=1}^n\Big\{P\left(u_j^0(\tau_{s+1})\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)+3C(nh_n^d)^{-3/4}\mid x_j,z_j\right)$$
$$-1\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)-3C(nh_n^d)^{-3/4}\right)$$
$$-P\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)\mid x_j,z_j\right)$$
$$+1\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)\right)\Big\}W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right),$$

and a lower bound for (B.5) given by
$$LB(x_s,\tau_s,\tau_{s+1},\delta_s)=(nh_n^d)^{-1/2}\sum_{j=1}^n\Big\{P\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)-3C(nh_n^d)^{-3/4}\mid x_j,z_j\right)$$
$$-1\left(u_j^0(\tau_{s+1})\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)+3C(nh_n^d)^{-3/4}\right)$$
$$-P\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)\mid x_j,z_j\right)$$
$$+1\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)\right)\Big\}W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right).$$
It then follows that
$$\text{Term (I)}\le\max_{1\le s\le N}\left\|UB(x_s,\tau_s,\tau_{s+1},\delta_s)\right\|+\max_{1\le s\le N}\left\|LB(x_s,\tau_s,\tau_{s+1},\delta_s)\right\|+o_p\!\left((nh_n^d)^{-1/4}\right). \tag{B.11}$$
Setting $\delta=0$ and $e(x,\tau)=0$ in $UB(x_s,\tau_s,\tau_{s+1},\delta_s)$ and $LB(x_s,\tau_s,\tau_{s+1},\delta_s)$ yields bounds for Term (II). This implies that the order of Term (II) does not exceed that of Term (I).

Step 3. Apply Bernstein's inequality. We further analyze $UB(x_s,\tau_s,\tau_{s+1},\delta_s)$ and $LB(x_s,\tau_s,\tau_{s+1},\delta_s)$ in (B.11), together with Term (III). Adding and subtracting terms,
$$UB(x_s,\tau_s,\tau_{s+1},\delta_s)$$
$$=\underbrace{(nh_n^d)^{-1/2}\sum_{j=1}^n\Big\{P\left(u_j^0(\tau_{s+1})\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)+3C(nh_n^d)^{-3/4}\mid x_j,z_j\right)-P\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)-3C(nh_n^d)^{-3/4}\mid x_j,z_j\right)\Big\}W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right)}_{\text{(D)}}$$
$$+\underbrace{(nh_n^d)^{-1/2}\sum_{j=1}^n\Big\{P\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)-3C(nh_n^d)^{-3/4}\mid x_j,z_j\right)-1\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)-3C(nh_n^d)^{-3/4}\right)\Big\}W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right)}_{\text{(E)}}$$
$$-\underbrace{(nh_n^d)^{-1/2}\sum_{j=1}^n\Big\{P\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)\mid x_j,z_j\right)-1\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)\right)\Big\}W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right)}_{\text{(F)}}$$
$$+o_p\!\left((nh_n^d)^{-1/4}\right),$$

where the three summations are denoted by (D), (E) and (F), respectively. For (D),
$$\|(D)\|=\Bigg\|(nh_n^d)^{-1/2}\sum_{j=1}^n\Big\{P\left(u_j^0(\tau_{s+1})\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)+3C(nh_n^d)^{-3/4}\mid x_j,z_j\right)$$
$$-P\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)+3C(nh_n^d)^{-3/4}\mid x_j,z_j\right)$$
$$+P\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)+3C(nh_n^d)^{-3/4}\mid x_j,z_j\right)$$
$$-P\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)-3C(nh_n^d)^{-3/4}\mid x_j,z_j\right)\Big\}W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right)\Bigg\|.$$
By the Lipschitz continuity of $Q(\tau|x,z)$ with respect to $\tau$, the preceding display is bounded from above by
$$C(nh_n^d)^{-1/4}\left\{(nh_n^d)^{-1}\sum_{j=1}^nW_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right)\right\}.$$
Applying the same argument as in (B.3), the preceding display is of order $O_p\!\left((nh_n^d)^{-1/4}\right)$, which holds uniformly over $s\in\{1,\dots,N\}$ because the values $x_s$ are not stochastic.

Terms (E) and (F) need to be analyzed jointly. Define
$$\eta_j(x_s,\tau_s)=P\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)-3C(nh_n^d)^{-3/4}\mid x_j,z_j\right)$$
$$-1\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)-3C(nh_n^d)^{-3/4}\right)$$
$$-P\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)\mid x_j,z_j\right)$$
$$+1\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)\right).$$
Then, for any finite constant $M>0$,
$$P\left(\max_{1\le s\le N}\|(E)+(F)\|\ge M(nh_n^d)^{-1/4}\log n\right)=P\left(\max_{1\le s\le N}\left\|(nh_n^d)^{-1/2}\sum_{j=1}^n\eta_j(x_s,\tau_s)W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right)\right\|\ge M(nh_n^d)^{-1/4}\log n\right)$$
$$\le N\max_{1\le s\le N}P\left(\left\|(nh_n^d)^{-1/2}\sum_{j=1}^n\eta_j(x_s,\tau_s)W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right)\right\|\ge M(nh_n^d)^{-1/4}\log n\right).$$

Because the summands are mean zero, bounded and mutually independent, Bernstein's inequality is applicable:
$$P\left(\left\|(nh_n^d)^{-1/2}\sum_{j=1}^n\eta_j(x_s,\tau_s)W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right)\right\|\ge M(nh_n^d)^{-1/4}\log n\right) \tag{B.12}$$
$$\le2\exp\left(-\frac{n^{-1}\left\{M(nh_n^d)^{1/4}\log n\right\}^2}{2n^{-1}\sum_{j=1}^nE\left(\eta_j(x_s,\tau_s)W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right)\right)^2+\frac23Cn^{-1}M(nh_n^d)^{1/4}\log n}\right)$$
$$=2\exp\left(-\frac{(M\log n)^2}{2(nh_n^d)^{-1/2}\sum_{j=1}^nE\left(\eta_j(x_s,\tau_s)W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right)\right)^2+\frac23CM(nh_n^d)^{-1/4}\log n}\right).$$
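For reference, the form of Bernstein's inequality applied here, for independent mean-zero random variables $X_j$ with $|X_j|\le b$, is
$$P\left(\left|\sum_{j=1}^nX_j\right|\ge t\right)\le2\exp\left(-\frac{t^2}{2\sum_{j=1}^nEX_j^2+\frac23bt}\right),$$
applied with $X_j=\eta_j(x_s,\tau_s)W_j^+(h_n,x_s,k)K((x_j-x_s)/h_n)$ and $t=M(nh_n^d)^{1/4}\log n$; dividing the numerator and denominator of the exponent first by $n$ and then by $n^{-1}(nh_n^d)^{1/2}$ gives the two displayed expressions.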

The second term in the denominator converges to 0. Next, note that the two thresholds appearing in the definition of $\eta_j(x_s,\tau_s)$ differ by $3C(nh_n^d)^{-3/4}$, so by the boundedness of the conditional density,
$$E\left\{E\left(\left\|\eta_j(x_s,\tau_s)\right\|^2\mid x_j,z_j\right)\right\}\le C(nh_n^d)^{-3/4}.$$

Therefore, the first term in the denominator satisfies
$$2(nh_n^d)^{-1/2}\sum_{j=1}^nE\left(\eta_j(x_s,\tau_s)W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right)\right)^2$$
$$=2(nh_n^d)^{-1/2}\sum_{j=1}^nE\left\{E\left(\left(\eta_j(x_s,\tau_s)W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right)\right)^2\,\Bigm|\,x_j,z_j\right)\right\}$$
$$\le2C(nh_n^d)^{-5/4}\sum_{j=1}^nE\left(W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right)\right)^2$$
$$=2C(nh_n^d)^{-5/4}\sum_{j=1}^nE\left(W_j^+(h_n,x_s,k)K\!\left(\frac{x_j-x_s}{h_n}\right)1\left(\|x_j-x_s\|\le\nu h_n\right)\right)^2$$
$$\le2C(nh_n^d)^{-5/4}\sum_{j=1}^nE\left(1\left(\|x_j-x_s\|\le\nu h_n\right)\right)=O\!\left((nh_n^d)^{-1/4}\right).$$

Applying this result, (B.12) is bounded, in large samples, by $2\exp\left(-(M\log n)^2\right)$. Because
$$2\exp\left(-(M\log n)^2\right)N\to0$$
for any finite $M$, we have
$$P\left(\max_{1\le s\le N}\|(E)+(F)\|\ge M(nh_n^d)^{-1/4}\log n\right)\to0,$$
which further implies
$$\max_{1\le s\le N}\left\|UB(x_s,\tau_s,\tau_{s+1},\delta_s)\right\|=O_p\!\left((nh_n^d)^{-1/4}\log n\right).$$

Similarly,
$$\max_{1\le s\le N}\left\|LB(x_s,\tau_s,\tau_{s+1},\delta_s)\right\|=O_p\!\left((nh_n^d)^{-1/4}\log n\right).$$
Therefore,
$$\text{Term (I)}=O_p\!\left((nh_n^d)^{-1/4}\log n\right).$$
Because the order of Term (II) does not exceed that of Term (I), it follows that
$$\text{Term (II)}=O_p\!\left((nh_n^d)^{-1/4}\log n\right).$$

Finally, Term (III) can also be bounded using Bernstein's inequality. Note that
$$S(x_s,\tau_s,\delta_s)-S_0(x_s,\tau_s)=(nh_n^d)^{-1/2}\sum_{j=1}^n\Big\{P\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)\mid x_j,z_j\right)$$
$$-1\left(u_j^0(\tau_s)\le(nh_n^d)^{-1/2}W_j(h_n,x_s)'\delta_s+e_j(x_s,\tau_s)\right)-P\left(u_j^0(\tau_s)\le0\mid x_j,z_j\right)+1\left(u_j^0(\tau_s)\le0\right)\Big\}W_j(h_n,x_s)K\!\left(\frac{x_j-x_s}{h_n}\right).$$
Denote the four terms in curly brackets by $\zeta_j(x_s,\tau_s)$. The approximation error satisfies $\|e_j(x_s,\tau_s)\|=O(h_n^3)=O((nh_n^d)^{-1/2})$. (If a local linear regression is considered, then $\|e_j(x_s,\tau_s)\|=O(h_n^2)=O((nh_n^d)^{-1/2})$.) Then, apply Bernstein's inequality as before, with $\zeta_j(x_s,\tau_s)$ replacing $\eta_j(x_s,\tau_s)$:

$$P\left(\left\|S(x_s,\tau_s,\delta_s)-S_0(x_s,\tau_s)\right\|\ge M(nh_n^d)^{-1/4}\log n\right)$$
$$\le2\exp\left(-\frac{n^{-1}\left\{M(nh_n^d)^{1/4}\log n\right\}^2}{2n^{-1}\sum_{j=1}^nE\left(\zeta_j(x_s,\tau_s)W_j(h_n,x_s)K\!\left(\frac{x_j-x_s}{h_n}\right)\right)^2+\frac23Cn^{-1}M(nh_n^d)^{1/4}\log n}\right)$$
$$=2\exp\left(-\frac{(M\log n)^2}{2(nh_n^d)^{-1/2}\sum_{j=1}^nE\left(\zeta_j(x_s,\tau_s)W_j(h_n,x_s)K\!\left(\frac{x_j-x_s}{h_n}\right)\right)^2+\frac23CM(nh_n^d)^{-1/4}\log n}\right).$$
Noticing that $E\left(\|\zeta_j(x_s,\tau_s)\|^2\right)\le C(nh_n^d)^{-1/2}\log n$, the first term in the denominator is bounded from above by $C\log n$. Therefore, in large samples, the preceding display is bounded by
$$2\exp\left(-\frac{M^2\log n}{2C+\frac23CM(nh_n^d)^{-1/4}\log n}\right)\le2\exp\left(-M\log n\right)$$

by choosing a sufficiently large $M$. Because
$$2\exp\left(-M\log n\right)N\to0$$
for sufficiently large $M$, we have
$$\text{Term (III)}=O_p\!\left((nh_n^d)^{-1/4}\log n\right).$$
This completes the proof.

Lemma B.2 The following results hold under the same Assumptions as in Lemma 1:
$$\sup_{\tau\in\mathcal T}\sup_{x\in S_x}\left\|(nh_n^d)^{-1/2}\sum_{j=1}^n\left\{\psi_\tau\!\left(u_j^0(\tau)-e_j(x,\tau)\right)-\psi_\tau\!\left(u_j^0(\tau)\right)\right\}W_j(h_n,x)K\!\left(\frac{x_j-x}{h_n}\right)\right\|=O_p\!\left(\sqrt{\log n}\right)$$
and
$$\sup_{\tau\in\mathcal T}\sup_{x\in S_x}\left\|S_0(x,\tau)\right\|=O_p\!\left(\sqrt{\log n}\right),$$
where $\psi_\tau(u)=\tau-1(u<0)$.

Proof. Consider the first result:
$$(nh_n^d)^{-1/2}\sum_{j=1}^n\left\{\psi_\tau\!\left(u_j^0(\tau)-e_j(x,\tau)\right)-\psi_\tau\!\left(u_j^0(\tau)\right)\right\}W_j(h_n,x)K\!\left(\frac{x_j-x}{h_n}\right)$$
$$=S(x,\tau,0)-S_0(x,\tau)+(nh_n^d)^{-1/2}\sum_{j=1}^n\left\{P\left(u_j^0(\tau)\le0\mid x_j,z_j\right)-P\left(u_j^0(\tau)\le e_j(x,\tau)\mid x_j,z_j\right)\right\}W_j(h_n,x)K\!\left(\frac{x_j-x}{h_n}\right).$$

Because of Lemma B.1,
$$\sup_{\tau\in\mathcal T}\sup_{x\in S_x}\left\|S(x,\tau,0)-S_0(x,\tau)\right\|=O_p\!\left((nh_n^d)^{-1/4}\log n\right)=O_p\!\left(\sqrt{\log n}\right).$$
Meanwhile,
$$\left\|(nh_n^d)^{-1/2}\sum_{j=1}^n\left\{P\left(u_j^0(\tau)\le0\mid x_j,z_j\right)-P\left(u_j^0(\tau)\le e_j(x,\tau)\mid x_j,z_j\right)\right\}W_j(h_n,x)K\!\left(\frac{x_j-x}{h_n}\right)\right\|$$
$$=\left\|(nh_n^d)^{-1/2}\sum_{j=1}^nf\left(\tilde y_j\mid x_j,z_j\right)e_j(x,\tau)W_j(h_n,x)K\!\left(\frac{x_j-x}{h_n}\right)\right\|,$$
where $\tilde y_j$ lies between $Q(\tau|x_j,z_j)$ and $Q(\tau|x_j,z_j)+e_j(x,\tau)$. Because $K((x_j-x)/h_n)$ equals 0 unless $x_j$ is in a vanishing neighborhood of $x$, it suffices to consider such values. At these values, $e_j(x,\tau)=O((nh_n^d)^{-1/2})$ and $\tilde y_j$ approaches $Q(\tau|x_j,z_j)$ as $n\to\infty$. This implies there exists $C<\infty$ such that, in large samples, $\left\|f(\tilde y_j\mid x_j,z_j)e_j(x,\tau)W_j(h_n,x)\right\|\le C(nh_n^d)^{-1/2}$. Therefore, with probability 1 the preceding display is bounded by
$$C(nh_n^d)^{-1}\sum_{j=1}^nK\!\left(\frac{x_j-x}{h_n}\right)=O_p(1).$$

The second result can be proved using the same arguments as in Lemma B.1, so we only sketch the main steps. Apply the same partition of $\mathcal T$ and $S_x$ as in Lemma B.1. Let $\bar N=L_\tau L_x$ and $\bar\phi=(x',\tau)'$. Write $\bar\phi\in I_s$ if $\bar\phi$ falls into the $s$-th cube, where $s\in\{1,\dots,\bar N\}$. Let $\bar\phi_s$ be the smallest value in the $s$-th cube, including the values on the boundaries. Then,
$$\sup_{\tau\in\mathcal T}\sup_{x\in S_x}\left\|S_0(x,\tau)\right\|\le\max_{1\le s\le\bar N}\sup_{\bar\phi\in\bar\Phi\cap I_s}\left\|S_0(x,\tau)-S_0(x_s,\tau_s)\right\|+\max_{1\le s\le\bar N}\left\|S_0(x_s,\tau_s)\right\|.$$
By Lemma B.1, the first term on the right hand side is $O_p\!\left((nh_n^d)^{-1/4}\log n\right)=O_p\!\left(\log^{1/2}n\right)$. The summands in the second term are bounded, so we can again apply Bernstein's inequality:
$$P\left(\left\|S_0(x_s,\tau_s)\right\|\ge M\sqrt{\log n}\right)\le2\exp\left(-\frac{M^2\log n}{2(nh_n^d)^{-1}\sum_{j=1}^nE\left(\left\{\tau-1(u_j^0(\tau)\le0)\right\}W_j(h_n,x_s)K\!\left(\frac{x_j-x_s}{h_n}\right)\right)^2+\frac23CM(nh_n^d)^{-1/2}\sqrt{\log n}}\right).$$
The first term in the denominator is finite. The second term converges to zero for any finite $M$. Therefore, by choosing a sufficiently large $M$, the right hand side can be bounded from above by $2\exp(-M\log n)$, which satisfies $2\exp(-M\log n)\bar N\to0$ for a sufficiently large $M$ because $\log\bar N=O(\log n)$ by construction.

The next result is needed for proving Lemma 1. Its proof is similar to Step 1 in the proof of Theorem 1 in Qu and Yoon (2015). The result in Qu and Yoon (2015) is pointwise with respect to $x$ for a purely nonparametric model; here, the result is uniform with respect to $x$ for a semiparametric model.

Lemma B.3 Under the Assumptions of Lemma 1, (10) satisfies
$$\Pr\left(\sup_{\tau\in\mathcal T}\sup_{x\in S_x}\left\|\tilde\delta(x,\tau)\right\|\le\log n\right)\to1.$$

Proof. By construction, $\tilde\delta(x,\tau)$ is the minimizer of (A.1). Because $V(x,\tau,0)=0$, we always have $V(x,\tau,\tilde\delta(x,\tau))\le0$ for each $\tau$ and every $n$. Therefore, to prove the result, it is sufficient to show that for any $\epsilon>0$, there exist some finite $N_0$ and $\lambda>0$ independent of $\tau$ and $x$, such that
$$P\left(\inf_{\tau\in\mathcal T}\inf_{x\in S_x}\inf_{\|\delta\|\ge\log n}V(x,\tau,\delta)>\lambda\log^2n\right)>1-\epsilon\quad\text{for all }n\ge N_0. \tag{B.13}$$
Further, because $V(x,\tau,\delta)$ is convex in $\delta$, the inequality
$$V(x,\tau,c\delta)-V(x,\tau,0)\ge c\left(V(x,\tau,\delta)-V(x,\tau,0)\right)$$
holds for any $c\ge1$. Therefore, a further sufficient condition for (B.13) is
$$P\left(\inf_{\tau\in\mathcal T}\inf_{x\in S_x}\inf_{\|\delta\|=\log n}V(x,\tau,\delta)>\lambda\log^2n\right)>1-\epsilon\quad\text{for all }n\ge N_0. \tag{B.14}$$
Below we establish (B.14).
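The reduction from (B.13) to (B.14) is standard but worth spelling out: any $\delta$ with $\|\delta\|>\log n$ can be written as $\delta=c\tilde\delta$ with $\|\tilde\delta\|=\log n$ and $c>1$, so that, by the convexity inequality above and $V(x,\tau,0)=0$,
$$V(x,\tau,\delta)=V(x,\tau,c\tilde\delta)\ge c\,V(x,\tau,\tilde\delta)>\lambda\log^2n\quad\text{on the event in (B.14)}.$$
Hence the infimum over $\|\delta\|\ge\log n$ exceeds $\lambda\log^2n$ whenever the infimum over $\|\delta\|=\log n$ does.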

Consider the following decomposition of (A.1) due to Knight (1998):
$$V(x,\tau,\delta)=\mathcal W(x,\tau,\delta)+\mathcal Z(x,\tau,\delta), \tag{B.15}$$
where, with $\psi_\tau(u)=\tau-1(u<0)$,
$$\mathcal W(x,\tau,\delta)=-(nh_n^d)^{-1/2}\sum_{j=1}^n\psi_\tau\!\left(u_j^0(\tau)-e_j(x,\tau)\right)K\!\left(\frac{x_j-x}{h_n}\right)W_j(h_n,x)'\delta,$$
$$\mathcal Z(x,\tau,\delta)=\sum_{j=1}^nK\!\left(\frac{x_j-x}{h_n}\right)\int_0^{(nh_n^d)^{-1/2}W_j(h_n,x)'\delta}\left\{1\left(u_j^0(\tau)-e_j(x,\tau)\le s\right)-1\left(u_j^0(\tau)-e_j(x,\tau)\le0\right)\right\}ds.$$
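This is an application of Knight's (1998) identity, which for the check function $\rho_\tau(u)=u\,\psi_\tau(u)$ reads
$$\rho_\tau(u-v)-\rho_\tau(u)=-v\,\psi_\tau(u)+\int_0^v\left\{1(u\le s)-1(u\le0)\right\}ds;$$
summing over $j$ with $u=u_j^0(\tau)-e_j(x,\tau)$, $v=(nh_n^d)^{-1/2}W_j(h_n,x)'\delta$ and kernel weights $K((x_j-x)/h_n)$ yields (B.15).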

Apply this decomposition:
$$\inf_{\tau\in\mathcal T}\inf_{x\in S_x}\inf_{\|\delta\|=\log n}\frac{V(x,\tau,\delta)}{\log^2n}\ge\inf_{\tau\in\mathcal T}\inf_{x\in S_x}\inf_{\|\delta\|=\log n}\frac{\mathcal Z(x,\tau,\delta)}{\log^2n}-\sup_{\tau\in\mathcal T}\sup_{x\in S_x}\sup_{\|\delta\|=\log n}\frac{|\mathcal W(x,\tau,\delta)|}{\log^2n}. \tag{B.16}$$

Below we bound the two terms on the right hand side of (B.16) separately. For the second term in (B.16):
$$\sup_{\tau\in\mathcal T}\sup_{x\in S_x}\sup_{\|\delta\|=\log n}\frac{|\mathcal W(x,\tau,\delta)|}{\log^2n}\le\frac1{\log n}\sup_{\tau\in\mathcal T}\sup_{x\in S_x}\left\|(nh_n^d)^{-1/2}\sum_{j=1}^n\left\{\psi_\tau\!\left(u_j^0(\tau)-e_j(x,\tau)\right)-\psi_\tau\!\left(u_j^0(\tau)\right)\right\}W_j(h_n,x)K\!\left(\frac{x_j-x}{h_n}\right)\right\|$$
$$+\frac1{\log n}\sup_{\tau\in\mathcal T}\sup_{x\in S_x}\left\|(nh_n^d)^{-1/2}\sum_{j=1}^n\psi_\tau\!\left(u_j^0(\tau)\right)W_j(h_n,x)K\!\left(\frac{x_j-x}{h_n}\right)\right\|.$$
The two terms on the right hand side are both $O_p\!\left((\log n)^{-1/2}\right)=o_p(1)$ by Lemma B.2. Therefore,
$$\sup_{\tau\in\mathcal T}\sup_{x\in S_x}\sup_{\|\delta\|=\log n}\frac{|\mathcal W(x,\tau,\delta)|}{\log^2n}=o_p(1). \tag{B.17}$$

We now show that the first term in (B.16) is strictly positive with probability tending to 1. First notice that the integral appearing in $\mathcal Z(x,\tau,\delta)$ is always nonnegative and satisfies (see Lemma A.1 in Oka and Qu, 2011)
$$\int_0^{(nh_n^d)^{-1/2}W_j(h_n,x)'\delta}\left\{1\left(u_j^0(\tau)-e_j(x,\tau)\le s\right)-1\left(u_j^0(\tau)-e_j(x,\tau)\le0\right)\right\}ds$$
$$\ge\frac{(nh_n^d)^{-1/2}W_j(h_n,x)'\delta}{2}\left\{1\!\left(u_j^0(\tau)-e_j(x,\tau)\le(nh_n^d)^{-1/2}\frac{W_j(h_n,x)'\delta}{2}\right)-1\left(u_j^0(\tau)-e_j(x,\tau)\le0\right)\right\}.$$

Applying this inequality to $\mathcal Z(x,\tau,\delta)$:
$$\frac{\mathcal Z(x,\tau,\delta)}{\log^2n}\ge\frac1{(nh_n^d)^{1/2}\log^2n}\left(\frac{\delta}{2}\right)'\sum_{j=1}^n\left\{1\!\left(u_j^0(\tau)-e_j(x,\tau)\le(nh_n^d)^{-1/2}\frac{W_j(h_n,x)'\delta}{2}\right)-1\left(u_j^0(\tau)-e_j(x,\tau)\le0\right)\right\}W_j(h_n,x)K\!\left(\frac{x_j-x}{h_n}\right)$$
$$=\frac1{\log^2n}\left(\frac{\delta}{2}\right)'\left\{S(x,\tau,0)-S\!\left(x,\tau,\frac{\delta}{2}\right)\right\}$$
$$+\frac1{\log^2n}(nh_n^d)^{-1/2}\left(\frac{\delta}{2}\right)'\sum_{j=1}^n\left\{P\!\left(u_j^0(\tau)-e_j(x,\tau)\le(nh_n^d)^{-1/2}\frac{W_j(h_n,x)'\delta}{2}\,\Bigm|\,x_j,z_j\right)-P\left(u_j^0(\tau)-e_j(x,\tau)\le0\mid x_j,z_j\right)\right\}W_j(h_n,x)K\!\left(\frac{x_j-x}{h_n}\right)$$
$$=\text{(G)}+\text{(H)}.$$

�= (G)+ (H)

Because of Lemma B.1 and k�k = log n,

(G) = Op

�(nhdn)

�1=4�= op (1) : (B.18)

Apply a mean value theorem:

(H) = (1=4)1

log2 n�0

0@(nhdn)�1 nXj=1

f (eyj jxj ; zj)K �xj � xhn

�Wj(hn; x)Wj(hn; x)

0

1A�;

where eyj lies between Q(� jxj ; zj) + ej(x; �) and Q(� jxj ; zj) + ej(x; �) + (nhdn)�1=2Wj(hn; x)

0�=2.Because K((xj�x)=hn) equals 0 unless xj is in a vanishing neighborhood of x, it su¢ ces to considerthose xj satisfying kxj � xk � �hn with � being some �nite constant. At such values, ej(x; �) and(nhdn)

�1=2Wj(hn; x)0�=2 both approach 0 because k�k = log n. Therefore, eyj approaches Q(� jxj ; zj)

as n ! 1. This implies, for any " > 0, f (eyj jxj ; zj) � f (� jxj ; zj) � " holds for all xj and zj withprobability arbitrarily close to one in large samples. This implies

(H) � (1=4)1

log2 n�0

8<:(nhdn)�1nXj=1

f (� jxj ; zj)K�xj � xhn

�Wj(hn; x)Wj(hn; x)

0

9=;�

�"(1=4) 1

log2 n�0

8<:(nhdn)�1nXj=1

K

�xj � xhn

�Wj(hn; x)Wj(hn; x)

0

9=;�

with probability arbitrarily close to one in large samples. The term in the �rst set of curly bracketshas eigenvalues bounded away from 0. Denote its smallest eigenvalue by �min. The term in thesecond set of parentheses is �nite (say it is less than C) in probability. Therefore, uniformly in �and x,

(H) � 1

4�min �

1

4"C � 1

8�min (B.19)

B-13

with probability arbitrarily close to one in large samples, where the last inequality holds because $\varepsilon$ can be chosen to be small.

Combining (B.17), (B.18) and (B.19), we see that (B.19) is strictly positive and dominates (B.17) and (B.18) with probability tending to 1. This completes the proof.

Lemma B.4 There exist $\gamma>1$ and $\bar C<\infty$ such that, for any $\tau_1,\tau_2\in\mathcal T$ satisfying $|\tau_2-\tau_1|\ge n^{-1/2-\epsilon}$, we have $E\left(\|U(\tau_2)-U(\tau_1)\|^{2\gamma}\right)\le\bar C\,|\tau_2-\tau_1|^{\gamma}$.

Proof. It suffices to show that $\left(E\|U(\tau_2)-U(\tau_1)\|^{2\gamma}\right)^{1/\gamma}\le\bar C^{1/\gamma}(\tau_2-\tau_1)$ for $\tau_2\ge\tau_1$. Let
$$A_{1i}=\left\{\left(\tau_2-1(u_i^0(\tau_2)\le0)\right)-\left(\tau_1-1(u_i^0(\tau_1)\le0)\right)\right\}T_i(\tau_2),$$
$$A_{2i}=\left(\tau_1-1(u_i^0(\tau_1)\le0)\right)\left(T_i(\tau_2)-T_i(\tau_1)\right).$$
Then we can write
$$U(\tau_2)-U(\tau_1)=n^{-1/2}\sum_{i=1}^n\left(A_{1i}+A_{2i}\right).$$
Letting $\bar q$ denote the dimension of $U(\tau_1)$, we have
$$\left(E\|U(\tau_2)-U(\tau_1)\|^{2\gamma}\right)^{1/\gamma}\le\sum_{k=1}^{\bar q}\left\{n^{-\gamma}E\left|\sum_{i=1}^n\left(A_{1i,k}+A_{2i,k}\right)\right|^{2\gamma}\right\}^{1/\gamma}, \tag{B.20}$$
where $A_{1i,k}$ and $A_{2i,k}$ denote the $k$-th elements of $A_{1i}$ and $A_{2i}$, respectively, and the inequality follows from Minkowski's inequality.

We bound the term inside the curly brackets using arguments similar to Bai (1996, Lemma A1):
$$n^{-\gamma}E\left|\sum_{i=1}^n\left(A_{1i,k}+A_{2i,k}\right)\right|^{2\gamma}\le Cn^{-\gamma}\left(\sum_{i=1}^nE\left|A_{1i,k}+A_{2i,k}\right|^2\right)^{\gamma}+Cn^{-\gamma}\sum_{i=1}^nE\left|A_{1i,k}+A_{2i,k}\right|^{2\gamma}$$
$$\le2^{\gamma}Cn^{-\gamma}\left(\sum_{i=1}^n\left(EA_{1i,k}^2+EA_{2i,k}^2\right)\right)^{\gamma}+2^{2\gamma}Cn^{-\gamma}\sum_{i=1}^nE\left(A_{1i,k}^{2\gamma}+A_{2i,k}^{2\gamma}\right)$$
$$\le\underbrace{2^{\gamma}Cn^{-\gamma}\left(\sum_{i=1}^n\left(E\|A_{1i}\|^2+E\|A_{2i}\|^2\right)\right)^{\gamma}}_{\text{(I)}}+\underbrace{2^{2\gamma}Cn^{-\gamma}\sum_{i=1}^n\left(\left(E\|A_{1i}\|^{2\gamma}\right)^{1/\gamma}+\left(E\|A_{2i}\|^{2\gamma}\right)^{1/\gamma}\right)^{\gamma}}_{\text{(J)}},$$
where the first inequality is because of the Rosenthal inequality for independent random variables (Hall and Heyde, 1980, p. 23), with the constant $C$ depending only on $\gamma$; the second is because of the triangle inequality; the third is because $A_{1i,k}^2\le\|A_{1i}\|^2$ and $A_{2i,k}^2\le\|A_{2i}\|^2$; and the last step is due to Minkowski's inequality.
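For reference, Rosenthal's inequality in the form used above (Hall and Heyde, 1980, p. 23) states that for independent mean-zero random variables $X_1,\dots,X_n$ and any $p\ge2$ there is a constant $C$ depending only on $p$ such that
$$E\left|\sum_{i=1}^nX_i\right|^{p}\le C\left\{\left(\sum_{i=1}^nEX_i^2\right)^{p/2}+\sum_{i=1}^nE|X_i|^{p}\right\};$$
it is applied here with $p=2\gamma$ and $X_i=A_{1i,k}+A_{2i,k}$.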

Further,
$$E\|A_{1i}\|^2=E\left\{E\left(\|A_{1i}\|^2\mid x_i,z_i\right)\right\}=E\left[E\left\{\left(\tau_2-1(u_i^0(\tau_2)\le0)-\tau_1+1(u_i^0(\tau_1)\le0)\right)^2\Bigm|x_i,z_i\right\}\|T_i(\tau_2)\|^2\right]\le C(\tau_2-\tau_1),$$
where the inequality follows because $\left|\tau_2-1(u_i^0(\tau_2)\le0)-\tau_1+1(u_i^0(\tau_1)\le0)\right|\le1$. Meanwhile,
$$E\|A_{2i}\|^2=E\left\|\left(\tau_1-1(u_i^0(\tau_1)\le0)\right)\left(T_i(\tau_1)-T_i(\tau_2)\right)\right\|^2\le E\left\|T_i(\tau_1)-T_i(\tau_2)\right\|^2\le C(\tau_2-\tau_1)^2,$$
where the second inequality follows from the Lipschitz continuity of $T_i(\tau)$ with respect to $\tau$. The terms $E\|A_{1i}\|^{2\gamma}$ and $E\|A_{2i}\|^{2\gamma}$ in (J) can be bounded in the same way, leading to $E\|A_{1i}\|^{2\gamma}\le C(\tau_2-\tau_1)$ and $E\|A_{2i}\|^{2\gamma}\le C(\tau_2-\tau_1)^{2\gamma}$. These bounds imply
$$\text{(I)}\le2^{\gamma}C\left(n^{-1}\sum_{i=1}^n\left(C(\tau_2-\tau_1)+C(\tau_2-\tau_1)^2\right)\right)^{\gamma}\le M(\tau_2-\tau_1)^{\gamma}\quad\text{for some constant }M,$$
and
$$\text{(J)}\le2^{2\gamma}Cn^{-\gamma}\sum_{i=1}^n\left(\left(C(\tau_2-\tau_1)\right)^{1/\gamma}+\left(C(\tau_2-\tau_1)^{2\gamma}\right)^{1/\gamma}\right)^{\gamma}\le Mn^{1-\gamma}(\tau_2-\tau_1)=M\left(n(\tau_2-\tau_1)\right)^{1-\gamma}(\tau_2-\tau_1)^{\gamma}\quad\text{for some constant }M.$$
By the definition of the size of the interval in the Lemma, we have $\tau_2-\tau_1\ge n^{-1/2-\epsilon}>n^{-1}$, which implies $n(\tau_2-\tau_1)>1$. Consequently, $M(n(\tau_2-\tau_1))^{1-\gamma}<M$ because $\gamma>1$. Therefore,
$$\text{(J)}\le M(\tau_2-\tau_1)^{\gamma}.$$
It therefore follows that each term inside the curly brackets in (B.20) is bounded by $2M(\tau_2-\tau_1)^{\gamma}$. Consequently, (B.20) is bounded by $(\bar q+1)(2M)^{1/\gamma}(\tau_2-\tau_1)$. Let $\bar C=(\bar q+1)^{\gamma}(2M)$. This completes the proof.


Table 1: Summary Statistics of Bandwidths

                          n = 500              n = 1000
Models                  hcv      hopt        hcv      hopt
Model 1
  x = (0.50, 0.50)     0.417    0.286       0.386    0.261
                      (0.072)  (0.069)     (0.059)  (0.041)
  x = (0.50, 0.75)     0.428    0.334       0.397    0.287
                      (0.075)  (0.104)     (0.062)  (0.064)
  x = (0.75, 0.75)     0.438    0.335       0.406    0.273
                      (0.076)  (0.109)     (0.064)  (0.047)
Model 2
  x = (0.50, 0.50)     0.429    0.314       0.346    0.301
                      (0.114)  (0.108)     (0.078)  (0.106)
  x = (0.50, 0.75)     0.475    0.396       0.378    0.369
                      (0.140)  (0.141)     (0.088)  (0.148)
  x = (0.75, 0.75)     0.544    0.456       0.434    0.433
                      (0.178)  (0.164)     (0.113)  (0.161)

Averages and standard deviations (in parentheses) of bandwidths, based on 1000 simulation runs. hcv is the cross-validation bandwidth at the median and hopt is the MSE-optimal bandwidth at the median.
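For concreteness, the cross-validation criterion behind hcv can be illustrated as follows. This is a minimal sketch, assuming a scalar covariate and a local-constant (kernel-weighted) median fit for the leave-one-out prediction; the paper's estimator is local linear, so the function below, the Epanechnikov kernel, and the bandwidth grid are illustrative rather than the procedure used to produce Table 1.

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check function rho_tau(u) = u * (tau - 1(u < 0))."""
    return u * (tau - (u < 0))

def cv_bandwidth_median(x, y, h_grid, tau=0.5):
    """Leave-one-out cross-validation for the bandwidth at the median,
    using a local-constant weighted-quantile fit for simplicity."""
    n = len(y)
    best_h, best_loss = None, np.inf
    for h in h_grid:
        total = 0.0
        for i in range(n):
            xi, yi = np.delete(x, i), np.delete(y, i)
            t = (xi - x[i]) / h
            w = np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)  # Epanechnikov
            if w.sum() <= 0.0:          # no effective neighbors: reject this h
                total = np.inf
                break
            order = np.argsort(yi)
            cumw = np.cumsum(w[order]) / w.sum()
            idx = min(np.searchsorted(cumw, tau), len(cumw) - 1)
            q_loo = yi[order][idx]      # weighted tau-quantile, leaving i out
            total += check_loss(y[i] - q_loo, tau)
        if total < best_loss:
            best_h, best_loss = h, total
    return best_h
```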


Table 2: Root Mean Squared Error and Bias of Conditional Quantile Estimates, n = 500

Models Without bias correction With bias correction

RMSE Bias RMSE Bias

Q(0.2) Q(0.5) Q(0.8) Q(0.2) Q(0.5) Q(0.8) Q(0.2) Q(0.5) Q(0.8) Q(0.2) Q(0.5) Q(0.8)

I. Bandwidth Option 1

Model 1

x = (0.50, 0.50) 0.191 0.172 0.173 -0.145 -0.126 -0.137 0.179 0.179 0.175 0.014 0.009 -0.011

x = (0.50, 0.75) 0.251 0.220 0.224 -0.186 -0.157 -0.174 0.223 0.211 0.218 0.007 0.010 -0.020

x = (0.75, 0.75) 0.296 0.291 0.307 0.251 0.246 0.268 0.272 0.253 0.245 0.051 0.032 0.033

Model 2

x = (0.50, 0.50) 0.179 0.164 0.165 -0.087 -0.078 -0.103 0.211 0.209 0.192 0.058 0.046 0.015

x = (0.50, 0.75) 0.179 0.167 0.173 -0.088 -0.076 -0.106 0.242 0.227 0.220 0.065 0.055 0.012

x = (0.75, 0.75) 0.161 0.159 0.149 -0.021 -0.027 -0.050 0.262 0.251 0.216 0.069 0.049 0.015

II. Bandwidth Option 2

Model 1

x = (0.50, 0.50) 0.191 0.172 0.172 -0.145 -0.126 -0.136 0.180 0.179 0.174 0.014 0.009 -0.009

x = (0.50, 0.75) 0.251 0.220 0.224 -0.186 -0.157 -0.173 0.224 0.212 0.219 0.008 0.011 -0.019

x = (0.75, 0.75) 0.296 0.291 0.307 0.251 0.246 0.268 0.273 0.252 0.245 0.051 0.031 0.032

Model 2

x = (0.50, 0.50) 0.181 0.164 0.165 -0.090 -0.079 -0.103 0.213 0.209 0.194 0.056 0.045 0.015

x = (0.50, 0.75) 0.178 0.167 0.174 -0.087 -0.075 -0.106 0.242 0.229 0.223 0.065 0.057 0.013

x = (0.75, 0.75) 0.162 0.159 0.150 -0.020 -0.028 -0.050 0.262 0.251 0.217 0.068 0.048 0.015

Root Mean Squared Errors (RMSE) and Biases (Bias) of conditional quantile estimates, based on 1000 simulation runs. z = (0.5, 0.5). Under 'Without bias correction', Q(τ) stands for Q(τ|x, z). Under 'With bias correction', Q(τ) is Q(τ|x, z) − b²_{n,τ}B̂(x, τ).


Table 3: Coverage Ratios of Uniform Confidence Bands, n = 500, Bandwidth Option 1

Models Asy Asy 2 Asy M Asy R Res Res 2 Res M Res R

I. p = 0.90

Model 1

x = (0.50, 0.50) 0.468 0.550 0.748 0.889 0.603 0.637 0.794 0.954

x = (0.50, 0.75) 0.492 0.549 0.755 0.907 0.623 0.596 0.791 0.950

x = (0.75, 0.75) 0.442 0.362 0.646 0.819 0.549 0.428 0.693 0.876

Model 2

x = (0.50, 0.50) 0.449 0.700 0.844 0.902 0.599 0.762 0.883 0.943

x = (0.50, 0.75) 0.525 0.752 0.898 0.903 0.655 0.789 0.916 0.936

x = (0.75, 0.75) 0.545 0.875 0.908 0.866 0.654 0.909 0.938 0.894

II. p = 0.95

Model 1

x = (0.50, 0.50) 0.560 0.633 0.796 0.939 0.710 0.735 0.850 0.989

x = (0.50, 0.75) 0.625 0.631 0.820 0.951 0.741 0.702 0.866 0.982

x = (0.75, 0.75) 0.545 0.454 0.714 0.878 0.665 0.556 0.763 0.923

Model 2

x = (0.50, 0.50) 0.562 0.766 0.891 0.936 0.695 0.815 0.924 0.981

x = (0.50, 0.75) 0.631 0.826 0.939 0.945 0.766 0.856 0.955 0.965

x = (0.75, 0.75) 0.649 0.928 0.948 0.916 0.776 0.956 0.967 0.945

All values are based on 1000 simulation runs. The sample size is 500 and the quantile range is T = [0.2, 0.8]. In all specifications, z = (0.5, 0.5). 'Asy' is a conventional band using the asymptotic approximation with a plug-in bias estimator, 'Asy 2' is the same band but ignores the bias, 'Asy M' is the modified band proposed in Qu and Yoon (2015), and 'Asy R' is the robust band with bias estimation. 'Res' stands for bands based on resampling. The same naming convention is applied.
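The coverage ratios in Tables 3, 4 and 6 are uniform coverage frequencies: a simulation run counts as covered only if the band contains the true conditional quantile simultaneously at every τ on the grid. A minimal sketch of that computation (the array names are illustrative assumptions):

```python
import numpy as np

def uniform_coverage(lower, upper, truth):
    """Fraction of simulation runs whose band covers the true conditional
    quantile simultaneously at every tau on the grid.
    lower, upper: (n_runs, n_tau) band limits; truth: (n_tau,) true values."""
    covered = np.all((lower <= truth) & (truth <= upper), axis=1)
    return covered.mean()
```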


Table 4: Coverage Ratios of Uniform Confidence Bands, n = 500, Bandwidth Option 2

Models Asy Asy 2 Asy M Asy R Res Res 2 Res M Res R

I. p = 0.90

Model 1

x = (0.50, 0.50) 0.484 0.559 0.752 0.890 0.601 0.638 0.794 0.953

x = (0.50, 0.75) 0.487 0.555 0.756 0.910 0.628 0.598 0.796 0.952

x = (0.75, 0.75) 0.440 0.360 0.645 0.816 0.548 0.432 0.693 0.876

Model 2

x = (0.50, 0.50) 0.451 0.706 0.847 0.891 0.585 0.751 0.879 0.944

x = (0.50, 0.75) 0.529 0.760 0.899 0.906 0.641 0.793 0.917 0.930

x = (0.75, 0.75) 0.549 0.867 0.904 0.863 0.655 0.905 0.938 0.897

II. p = 0.95

Model 1

x = (0.50, 0.50) 0.564 0.642 0.798 0.934 0.702 0.733 0.847 0.986

x = (0.50, 0.75) 0.610 0.634 0.816 0.950 0.744 0.703 0.869 0.981

x = (0.75, 0.75) 0.541 0.460 0.710 0.886 0.671 0.553 0.772 0.924

Model 2

x = (0.50, 0.50) 0.556 0.765 0.892 0.937 0.700 0.810 0.918 0.977

x = (0.50, 0.75) 0.642 0.818 0.935 0.949 0.768 0.855 0.955 0.967

x = (0.75, 0.75) 0.650 0.927 0.945 0.918 0.774 0.961 0.972 0.941

All values are based on 1000 simulation runs. The sample size is 500 and the quantile range is T = [0.2, 0.8]. In all specifications, z = (0.5, 0.5). 'Asy' is a conventional band using the asymptotic approximation with a plug-in bias estimator, 'Asy 2' is the same band but ignores the bias, 'Asy M' is the modified band proposed in Qu and Yoon (2015), and 'Asy R' is the robust band with bias estimation. 'Res' stands for bands based on resampling. The same naming convention is applied.


Table 5: Lengths of 90% Uniform Confidence Bands, n = 500, Bandwidth Option 1

Models Asy Asy M Asy R Res Res M Res R

τ = 0.5 τ = 0.8 τ = 0.5 τ = 0.8 τ = 0.5 τ = 0.8 τ = 0.5 τ = 0.8 τ = 0.5 τ = 0.8 τ = 0.5 τ = 0.8

Model 1

(0.50, 0.50) 0.475 0.481 0.620 0.623 0.820 0.803 0.647 0.649 0.792 0.791 1.144 1.156

(0.081) (0.097) (0.137) (0.134) (0.147) (0.171) (0.195) (0.228) (0.233) (0.260) (0.358) (0.402)

(0.50, 0.75) 0.589 0.618 0.766 0.792 0.989 1.006 0.782 0.802 0.959 0.975 1.330 1.350

(0.104) (0.125) (0.172) (0.171) (0.208) (0.232) (0.245) (0.274) (0.295) (0.314) (0.446) (0.514)

(0.75, 0.75) 0.631 0.737 0.853 0.977 1.040 1.199 0.855 0.886 1.076 1.127 1.403 1.478

(0.105) (0.150) (0.199) (0.243) (0.233) (0.314) (0.258) (0.280) (0.327) (0.339) (0.479) (0.561)

Model 2

(0.50, 0.50) 0.536 0.522 0.686 0.671 0.930 0.874 0.712 0.686 0.862 0.835 1.279 1.245

(0.111) (0.121) (0.160) (0.151) (0.200) (0.210) (0.243) (0.250) (0.271) (0.268) (0.469) (0.464)

(0.50, 0.75) 0.625 0.637 0.788 0.797 1.015 0.999 0.812 0.795 0.975 0.955 1.303 1.285

(0.114) (0.130) (0.171) (0.171) (0.243) (0.259) (0.271) (0.263) (0.311) (0.298) (0.477) (0.515)

(0.75, 0.75) 0.684 0.702 0.817 0.825 1.006 0.986 0.907 0.885 1.040 1.008 1.281 1.250

(0.108) (0.134) (0.166) (0.173) (0.260) (0.277) (0.268) (0.296) (0.305) (0.328) (0.471) (0.516)

Averages and standard deviations (in parentheses) of the length of the 90% uniform bands. All values are based on 1000 simulation runs. 'Asy 2' has the same length as 'Asy', so it is omitted. The same applies to 'Res 2'.


Table 6: Coverage Ratios of Uniform Confidence Bands, n = 1000, Bandwidth Option 1

Models Asy Asy 2 Asy M Asy R Res Res 2 Res M Res R

I. p = 0.90

Model 1

x = (0.50, 0.50) 0.418 0.474 0.717 0.901 0.556 0.544 0.757 0.937

x = (0.50, 0.75) 0.474 0.568 0.748 0.882 0.558 0.588 0.761 0.921

x = (0.75, 0.75) 0.482 0.246 0.664 0.908 0.524 0.292 0.670 0.915

Model 2

x = (0.50, 0.50) 0.410 0.684 0.842 0.850 0.533 0.714 0.860 0.889

x = (0.50, 0.75) 0.446 0.696 0.856 0.846 0.530 0.714 0.894 0.884

x = (0.75, 0.75) 0.552 0.844 0.888 0.846 0.592 0.870 0.908 0.859

II. p = 0.95

Model 1

x = (0.50, 0.50) 0.539 0.576 0.770 0.953 0.679 0.668 0.827 0.970

x = (0.50, 0.75) 0.592 0.642 0.802 0.942 0.680 0.692 0.836 0.964

x = (0.75, 0.75) 0.590 0.356 0.732 0.946 0.631 0.408 0.744 0.961

Model 2

x = (0.50, 0.50) 0.516 0.750 0.888 0.902 0.652 0.775 0.905 0.932

x = (0.50, 0.75) 0.564 0.770 0.902 0.902 0.649 0.792 0.937 0.926

x = (0.75, 0.75) 0.644 0.926 0.954 0.892 0.701 0.935 0.954 0.911

All values are based on 1000 simulation runs. The sample size is 1000. See Table 3 for the definitions of the bands.


Table 7: Lengths of 90% Uniform Confidence Bands, n = 1000, Bandwidth Option 1

Models Asy Asy M Asy R Res Res M Res R

τ = 0.5 τ = 0.8 τ = 0.5 τ = 0.8 τ = 0.5 τ = 0.8 τ = 0.5 τ = 0.8 τ = 0.5 τ = 0.8 τ = 0.5 τ = 0.8

Model 1

(0.50, 0.50) 0.359 0.365 0.481 0.488 0.626 0.625 0.457 0.451 0.577 0.572 0.791 0.776

(0.048) (0.054) (0.097) (0.091) (0.087) (0.094) (0.113) (0.119) (0.146) (0.149) (0.197) (0.203)

(0.50, 0.75) 0.447 0.475 0.586 0.615 0.775 0.811 0.565 0.571 0.705 0.713 0.977 0.980

(0.064) (0.075) (0.120) (0.113) (0.131) (0.145) (0.152) (0.155) (0.194) (0.190) (0.271) (0.274)

(0.75, 0.75) 0.476 0.559 0.670 0.778 0.829 0.971 0.606 0.622 0.802 0.841 1.026 1.061

(0.060) (0.077) (0.128) (0.151) (0.118) (0.153) (0.148) (0.163) (0.200) (0.218) (0.256) (0.284)

Model 2

(0.50, 0.50) 0.396 0.392 0.528 0.525 0.693 0.673 0.491 0.472 0.620 0.600 0.863 0.834

(0.081) (0.083) (0.126) (0.114) (0.145) (0.147) (0.149) (0.153) (0.176) (0.168) (0.270) (0.267)

(0.50, 0.75) 0.462 0.479 0.603 0.616 0.770 0.781 0.572 0.562 0.717 0.701 0.945 0.927

(0.079) (0.091) (0.131) (0.124) (0.182) (0.200) (0.165) (0.165) (0.200) (0.188) (0.322) (0.320)

(0.75, 0.75) 0.489 0.507 0.595 0.610 0.732 0.733 0.612 0.602 0.723 0.706 0.887 0.877

(0.067) (0.077) (0.108) (0.106) (0.190) (0.196) (0.156) (0.166) (0.191) (0.195) (0.304) (0.314)

Averages and standard deviations (in parentheses) of the length of the 90% uniform bands. All values are based on 1000 simulation runs. 'Asy 2' has the same length as 'Asy', so it is omitted. The same applies to 'Res 2'.