
Threshold Regression with Nonparametric Sample Splitting∗

Yoonseok Lee†

Syracuse University

Yulong Wang‡

Syracuse University

January 2021

Abstract

This paper develops a threshold regression model where an unknown relationship between

two variables nonparametrically determines the threshold. We allow the observations to

be cross-sectionally dependent so that the model can be applied to determine an unknown

spatial border for sample splitting over a random field. We derive the uniform rate of

convergence and the nonstandard limiting distribution of the nonparametric threshold

estimator. We also obtain the root-n consistency and the asymptotic normality of the

regression coefficient estimator. Our model has broad empirical relevance as illustrated

by estimating the tipping point in social segregation problems as a function of demo-

graphic characteristics; and determining metropolitan area boundaries using nighttime

light intensity collected from satellite imagery. We find that the new empirical results are

substantially different from those in the existing studies.

Keywords: threshold regression, sample splitting, nonparametric, random field, tipping

point, metropolitan area boundary.

JEL Classifications: C14, C21, C24, R1

∗We are grateful to Xiaohong Chen, Jonathan Dingel, Bo Honoré, Sokbae Lee, Yuan Liao, Francesca Molinari, Ingmar Prucha, Myung Seo, Ping Yu, and participants at numerous seminar/conference presentations for very helpful comments. Financial support from the Appleby-Mosher grant and the CUSE grant is highly appreciated.

†Address: Department of Economics and Center for Policy Research, Syracuse University, 426 Eggers Hall, Syracuse, NY 13244. E-mail: [email protected]

‡Address: Department of Economics and Center for Policy Research, Syracuse University, 127 Eggers Hall, Syracuse, NY 13244. E-mail: [email protected]


1 Introduction

Sample splitting and threshold regression models have spawned a vast literature in economet-

rics and statistics. Existing studies typically specify the sample splitting criteria in a para-

metric way as whether a single random variable or a linear combination of variables crosses

some unknown threshold. See, for example, Hansen (2000), Caner and Hansen (2004), Seo

and Linton (2007), Lee, Seo, and Shin (2011), Li and Ling (2012), Yu (2012), Lee, Liao, Seo,

and Shin (2020), Hidalgo, Lee, and Seo (2019), and Yu and Fan (2020). In this paper, we

study a novel extension to consider a nonparametric sample splitting model. Such an exten-

sion leads to new theoretical results and substantially generalizes the empirical applicability

of threshold models.

Specifically, we consider a model given by

$$y_i = x_i^\top \beta_0 + x_i^\top \delta_0\, 1[q_i \le \gamma_0(s_i)] + u_i \qquad (1)$$

for $i = 1, \ldots, n$, where $1[\cdot]$ is the binary indicator. In this model, the marginal effect of $x_i$ on $y_i$ can differ across $i$, being $(\beta_0 + \delta_0)$ or $\beta_0$ depending on whether $q_i \le \gamma_0(s_i)$ or not.

The threshold function γ0(·) is unknown, and the main parameters of interest are β0, δ0,

and $\gamma_0(\cdot)$. The novel feature of this model is that the sample splitting is determined by an unknown relationship between two variables $q_i$ and $s_i$, which is characterized by the nonparametric threshold function $\gamma_0(\cdot)$. In contrast, the classical threshold regression models assume $\gamma_0(\cdot)$ to be a constant or a linear index. Our new specification can cover interesting cases that have not been studied. For example, we can consider the threshold to be heterogeneous and specific to each observation $i$ if we set $\gamma_0(s_i) = \gamma_{0i}$; or the threshold to be determined by the direction of some moment condition, $\gamma_0(s_i) = E[q_i \mid s_i]$. Apparently, when $\gamma_0(s) = \gamma_0$ or $\gamma_0(s) = \gamma_0 s$ for some parameter $\gamma_0$ and $s \neq 0$, it reduces to the standard

threshold regression model.

The new model is motivated by the following two applications: estimating potentially

heterogeneous thresholds in public economics and determining spatial sample splitting in

urban economics. The first one is about the tipping point model proposed by Schelling

(1971), who analyzes the phenomenon that a neighborhood’s white population substantially

decreases once the minority share exceeds a certain threshold, called the tipping point. Card,

Mas, and Rothstein (2008) empirically estimate the tipping point model by considering the

constant threshold regression, $y_i = \beta_{10} + \delta_{10}\, 1[q_i \le \gamma_0] + x_{2i}^\top \beta_{20} + u_i$, where $y_i$ is the white

population change in a decade and qi is the initial minority share in the ith tract. The

parameters δ10 and γ0 denote the change size and the threshold, respectively. In Section

VII of Card, Mas, and Rothstein (2008), however, they find that the tipping point γ0 varies

depending on the attitudes of white residents toward the minority. This finding raises the


concern about the constant threshold model and motivates us to study the more general model

(1) by specifying the tipping point γ0 as a nonparametric function of local demographic

characteristics. We estimate such a tipping function in Section 6.1.

For the second application, we use the model (1) to define metropolitan area boundaries,

which is a fundamental problem in urban economics. Recently, many studies propose to

use nighttime light intensity collected from satellite imagery to define the metropolitan area.

They set an ad hoc level of light intensity as a threshold and categorize a pixel in the satellite

imagery as a part of the metropolitan area if the light intensity of that pixel is higher than

the threshold. See, for example, Rozenfeld, Rybski, Gabaix, and Makse (2011), Henderson,

Storeygard, and Weil (2012), Dingel, Miscio, and Davis (2019), and Vogel, Goldblatt, Hanson,

and Khandelwal (2019). In contrast, the model (1) can provide a data-driven guidance of

choosing the intensity threshold from the econometric perspective, if we let yi as the light

intensity in the ith pixel and (qi, si) as the location information of that pixel (more precisely,

the coordinate of a point on a rotated map as described in Section 4). In Section 6.2, we

estimate the metropolitan area of Dallas, Texas, especially its development from 1995 to

2010, and find substantially different results from the conventional approaches. To the best

of our knowledge, this is the first study to nonparametrically determine the metropolitan

area using a threshold model.

We develop a two-step estimation procedure for (1), where we estimate $\gamma_0(\cdot)$ by local constant least squares. Under the shrinking threshold asymptotics as in Bai (1997), Bai and Perron (1998), and Hansen (2000), we show that the nonparametric estimator $\hat\gamma(\cdot)$ is uniformly consistent and has a highly nonstandard limiting distribution. Based on such

distribution, we develop a pointwise specification test of γ0(s) for any given s, which enables

us to construct a confidence interval by inverting the test. Besides, the parametric part $(\hat\beta^\top, \hat\delta^\top)^\top$ is shown to satisfy root-$n$ asymptotic normality.

We highlight some novel technical features of the new estimator as follows. First, since the nonparametric function $\gamma_0(\cdot)$ is inside the indicator function, the technical proofs of the asymptotic results are nonstandard. In particular, we establish the uniform rate of convergence of $\hat\gamma(\cdot)$, which involves substantially more complicated derivations than in the standard (constant) threshold regression model. Second, we find that, unlike the standard kernel estimator, $\hat\gamma(\cdot)$ is asymptotically unbiased even if the optimal bandwidth is used. Also, when the change size $\delta_0$ shrinks very slowly, the optimal rate of convergence of $\hat\gamma(\cdot)$ becomes close to the root-$n$ rate. In the standard kernel regression, such a fast rate of convergence can be obtained only when the unknown function is infinitely differentiable, while we require only second-order differentiability of $\gamma_0(\cdot)$. Third, to limit the effect of estimating $\gamma_0(\cdot)$ on $(\hat\beta^\top, \hat\delta^\top)^\top$, we propose to use the observations that are sufficiently far away from the estimated threshold in the second-step parametric estimation. The choice of this distance is determined by the uniform convergence rate of $\hat\gamma(\cdot)$. Fourth, we let the variables be cross-sectionally


dependent by considering the strong-mixing random field as in Conley (1999) and Conley

and Molinari (2007). This generalization allows us to study nonparametric sample splitting

of spatial observations. For instance, if we let (qi, si) correspond to the geographical loca-

tion (i.e., latitude and longitude on the map), then the threshold 1 [qi ≤ γ0 (si)] identifies

the unknown border yielding a two-dimensional sample splitting. In more general contexts,

the model can be applied to identify social or economic segregation over interacting agents.

Finally, noting that 1 [qi ≤ γ0 (si)] can be considered as the special case of 1 [g0 (qi, si) ≤ 0]

when g0 is monotonically increasing in qi, we discuss how to extend the proposed method to

such a more general case that leads to a threshold contour model.

The rest of the paper is organized as follows. Section 2 sets up the model, establishes the

identification, and defines the estimator. Section 3 derives the asymptotic properties of the

estimators and develops a likelihood ratio test of the threshold function. Section 4 describes

how to extend the main model to estimate a threshold contour. Section 5 studies small

sample properties of the proposed statistics by Monte Carlo simulations. Section 6 applies

the new method to estimate the tipping point function and to determine metropolitan areas.

Section 7 concludes this paper with some remarks. The main proofs are in the Appendix,

and all the omitted proofs are collected in the supplementary material.

We use the following notation. Let $\to_p$ denote convergence in probability, $\to_d$ convergence in distribution, and $\Rightarrow$ weak convergence of the underlying probability measure as $n \to \infty$. Let $\lfloor r \rfloor$ denote the biggest integer smaller than or equal to $r$, $1[E]$ the indicator function of a generic event $E$, and $\|A\|$ the Euclidean norm of a vector or matrix $A$. For any set $B$, let $|B|$ denote the cardinality of $B$.

2 Model Setup

We assume spatial processes located on an evenly spaced lattice Λ ⊂ R2, following Conley

(1999), Conley and Molinari (2007), and Carbon, Francq, and Tran (2007).1 We consider the

threshold regression model given by (1), which is

$$y_i = x_i^\top \beta_0 + x_i^\top \delta_0\, 1[q_i \le \gamma_0(s_i)] + u_i,$$

where the observations $\{(y_i, x_i^\top, q_i, s_i)^\top \in \mathbb{R}^{1+\dim(x)+1+1} : i \in \Lambda_n\}$ are a triangular array of real random variables defined on some probability space, with $\Lambda_n$ being a fixed sequence of finite subsets of $\Lambda$. In this setup, the cardinality of $\Lambda_n$, $n = |\Lambda_n|$, is the sample size, and $\sum_{i\in\Lambda_n}$ denotes summation over all observations. For readability, we postpone the regularity

¹It can be extended to an unevenly spaced lattice as in Bolthausen (1982) and Jenish and Prucha (2009) with substantially more complicated notations (cf. footnote 9 in Conley (1999)).


conditions on $\Lambda_n$ in Assumption A later. The threshold function $\gamma_0 : \mathbb{R} \to \mathbb{R}$ as well as the regression coefficients $\theta_0 = (\beta_0^\top, \delta_0^\top)^\top \in \mathbb{R}^{2\dim(x)}$ are unknown, and they are the parameters of interest.² Since we consider a shrinking threshold effect, the parameter $\delta_0$ depends on the sample size $n$ as in Assumption A-(ii) below; hence $\delta_0$ and $\theta_0$ should be written as $\delta_{n0}$ and $\theta_{n0}$, respectively. However, we write $\delta_0$ and $\theta_0$ for simplicity. We let $Q \subset \mathbb{R}$ and $S \subset \mathbb{R}$ denote the supports of $q_i$ and $s_i$, respectively. Suppose the space of $\gamma_0(s)$ for any $s$ is a compact set $\Gamma \subset \mathbb{R}$.

First, we establish identification, which requires the following conditions.

Assumption ID

(i) E [ui|xi, qi, si] = 0 almost surely.

(ii) $E[x_i x_i^\top] > E[x_i x_i^\top 1[q_i \le \gamma]] > 0$ for any $\gamma \in \Gamma$.

(iii) For any $s \in S$, there exists $\varepsilon(s) > 0$ such that $\varepsilon(s) < P(q_i \le \gamma_0(s_i) \mid s_i = s) < 1 - \varepsilon(s)$, and $\delta_0^\top E[x_i x_i^\top \mid q_i = q, s_i = s]\, \delta_0 > 0$ for all $(q, s) \in Q \times S$.

(iv) $q_i$ is continuously distributed with conditional density $f(q|s)$ satisfying $0 < C_1 < f(q|s) < C_2 < \infty$ for all $(q, s) \in \Gamma \times S$ and some constants $C_1$ and $C_2$.

Assumption ID is mild. The condition (i) excludes endogeneity, and (ii) is the full rank

condition to identify the global parameters β0 and δ0. The conditions (ii) and (iii) require

that the location of the threshold is not on the boundary of the support of $q_i$ for any $s \in S$, which is inevitable for identification and has been commonly assumed in the existing threshold literature (e.g., Hansen (2000)). If $\gamma_0(s)$ reaches the boundary of the support of $q_i$ for some $s \in S$, then no observation can be generated on one side of the threshold function at this $s$, and identification fails. The second condition in (iii) assumes that the coefficient change exists (i.e., $\delta_0 \neq 0$). Note that it does not require $E[x_i x_i^\top \mid q_i = q, s_i = s]$ to be of full rank, and hence

qi or si can be one of the elements of xi (e.g., the threshold autoregressive model by Tong

(1983)) or a linear combination of $x_i$. The condition (iv) requires that the conditional density of $q_i$ given any $s_i$ be positive and bounded on $\Gamma$.

Under Assumption ID, the following theorem establishes the identification of the semi-

parametric threshold regression model (1).

Theorem 1 Under Assumption ID, the parameters $(\beta_0^\top, \delta_0^\top)^\top$ are the unique minimizer of $E[(y_i - x_i^\top\beta - x_i^\top\delta\, 1[q_i \le \gamma])^2]$ for any $\gamma \in \Gamma$, and the threshold function $\gamma_0(s)$ is the unique minimizer of $E[(y_i - x_i^\top\beta_0 - x_i^\top\delta_0\, 1[q_i \le \gamma(s_i)])^2 \mid s_i = s]$ for each given $s \in S$.

²The main results of this paper can be extended to multi-dimensional $s_i$ using multivariate kernels. However, we only consider the scalar case for expositional simplicity. Furthermore, the results are readily generalized to the case where only a subset of parameters differ between regimes.


Given identification, we proceed to estimate this semiparametric model in two steps.

First, for given $s \in S$, we fix $\gamma_0(s) = \gamma$ and obtain $\hat\beta(\gamma; s)$ and $\hat\delta(\gamma; s)$ by local constant least squares conditional on $\gamma$:

$$(\hat\beta(\gamma; s)^\top, \hat\delta(\gamma; s)^\top)^\top = \arg\min_{\beta,\delta} Q_n(\beta, \delta, \gamma; s), \qquad (2)$$

where

$$Q_n(\beta, \delta, \gamma; s) = \sum_{i\in\Lambda_n} K\!\left(\frac{s_i - s}{b_n}\right)\left(y_i - x_i^\top\beta - x_i^\top\delta\, 1[q_i \le \gamma]\right)^2 \qquad (3)$$

for some kernel function K (·) and a bandwidth parameter bn. Then γ0(s) is estimated by

$$\hat\gamma(s) = \arg\min_{\gamma\in\Gamma_n} Q_n(\gamma; s) \qquad (4)$$

for given $s$, where $\Gamma_n = \Gamma \cap \{q_1, \ldots, q_n\}$ and $Q_n(\gamma; s)$ is the concentrated sum of squares defined as

$$Q_n(\gamma; s) = Q_n(\hat\beta(\gamma; s), \hat\delta(\gamma; s), \gamma; s). \qquad (5)$$

To avoid additional technical complexity, we focus on estimation of $\gamma_0(s)$ at $s \in S_0 \subset S$ for some compact interior subset $S_0$ of the support, say the middle 70% quantiles. Note that, given $s$, the nonparametric estimator $\hat\gamma(s)$ can be seen as a local version of the standard (constant) threshold regression estimator. Therefore, the computation of (4) requires a one-dimensional grid search of the threshold only $n$ times, as in the standard threshold regression estimation.
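For concreteness, this first step can be sketched in a few lines of code. This is a minimal illustration only: the Gaussian kernel, the function name gamma_hat, and the array interface are our choices, not the paper's implementation.

```python
import numpy as np

def gauss_kernel(v):
    return np.exp(-v**2 / 2) / np.sqrt(2 * np.pi)

def gamma_hat(s, y, X, q, si, bn, K=gauss_kernel):
    """First-step estimator (4): grid search over the candidate thresholds
    Gamma_n = {q_1, ..., q_n}, minimizing the kernel-weighted concentrated
    sum of squares Q_n(gamma; s) from (3) and (5)."""
    w = K((si - s) / bn)                       # local weights around s
    best_gamma, best_ssr = np.nan, np.inf
    for gamma in np.sort(q):
        d = (q <= gamma).astype(float)
        Z = np.hstack([X, X * d[:, None]])     # (x_i', x_i' 1[q_i <= gamma])
        A = Z.T @ (w[:, None] * Z)             # weighted normal equations
        b = Z.T @ (w * y)
        try:
            coef = np.linalg.solve(A, b)       # (beta(gamma; s), delta(gamma; s))
        except np.linalg.LinAlgError:
            continue                           # skip degenerate splits
        ssr = np.sum(w * (y - Z @ coef)**2)
        if ssr < best_ssr:
            best_gamma, best_ssr = gamma, ssr
    return best_gamma
```

Splits that leave one regime empty make the design matrix singular and are skipped, which mirrors the identification requirement in Assumption ID-(iii) that the threshold stay in the interior of the support of $q_i$.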

In the second step, we estimate the parametric components β0 and δ0. To minimize any

potential effects from the first step estimation, we estimate β0 and δ∗0 = β0 + δ0 using the

observations that are sufficiently far away from the estimated threshold. This is implemented by

considering

$$\hat\beta = \arg\min_{\beta} \sum_{i\in\Lambda_n} (y_i - x_i^\top\beta)^2\, 1[q_i > \hat\gamma(s_i) + \pi_n]\, 1[s_i \in S_0], \qquad (6)$$

$$\hat\delta^* = \arg\min_{\delta^*} \sum_{i\in\Lambda_n} (y_i - x_i^\top\delta^*)^2\, 1[q_i < \hat\gamma(s_i) - \pi_n]\, 1[s_i \in S_0] \qquad (7)$$

for some constant $\pi_n > 0$ satisfying $\pi_n \to 0$ as $n \to \infty$, which is defined later. The change size $\delta$ can then be estimated as $\hat\delta = \hat\delta^* - \hat\beta$.

For the asymptotic behavior of the threshold estimator, the existing literature typically

assumes martingale difference arrays (e.g., Hansen (2000) and Lee, Liao, Seo, and Shin (2020))

or random samples (e.g., Yu (2012) and Yu and Fan (2020)). In this paper, we allow for cross-


sectional dependence by considering spatial α-mixing processes as in Bolthausen (1982) and

Conley (1999). More precisely, for any indices (or locations) i, j ∈ Λ, we define the metric

$\lambda(i, j) = \max_{1\le\ell\le\dim(\Lambda)} |i_\ell - j_\ell|$ and the corresponding norm $\max_{1\le\ell\le\dim(\Lambda)} |i_\ell|$, where $i_\ell$ denotes the $\ell$th component of $i$. The distance between any two subsets $\Lambda_1, \Lambda_2 \subset \Lambda$ is defined as $\lambda(\Lambda_1, \Lambda_2) = \inf\{\lambda(i, j) : i \in \Lambda_1, j \in \Lambda_2\}$. We let $\mathcal{F}_\Lambda$ be the $\sigma$-algebra generated by the random sequence $(x_i^\top, q_i, s_i, u_i)^\top$ for $i \in \Lambda$ and define the spatial $\alpha$-mixing coefficient as

$$\alpha_{k,l}(m) = \sup\left\{|P(A \cap B) - P(A)P(B)| : A \in \mathcal{F}_{\Lambda_1}, B \in \mathcal{F}_{\Lambda_2}, \lambda(\Lambda_1, \Lambda_2) \ge m\right\}, \qquad (8)$$

where |Λ1| ≤ k and |Λ2| ≤ l. Without loss of generality, we assume αk,l(0) = 1 and αk,l(m)

is monotonically decreasing in m for all k and l.

The following conditions are imposed for deriving the asymptotic properties of our two-

step estimator. Let f (q, s) be the joint density function of (qi, si) and

$$D(q, s) = E[x_i x_i^\top \mid (q_i, s_i) = (q, s)], \qquad (9)$$

$$V(q, s) = E[x_i x_i^\top u_i^2 \mid (q_i, s_i) = (q, s)]. \qquad (10)$$

Assumption A

(i) The lattice $\Lambda_n \subset \mathbb{R}^2$ is countably infinite; all the elements in $\Lambda_n$ are located at distances at least $\lambda_0 > 1$ from each other (i.e., for any $i, j \in \Lambda_n$, $\lambda(i, j) \ge \lambda_0$); and $\lim_{n\to\infty} |\partial\Lambda_n| / n = 0$, where $\partial\Lambda_n = \{i \in \Lambda_n : \exists\, j \notin \Lambda_n \text{ with } \lambda(i, j) = 1\}$.

(ii) $\delta_0 = c_0 n^{-\varepsilon}$ for some $c_0 \neq 0$ and $\varepsilon \in (0, 1/2)$; $(c_0^\top, \beta_0^\top)^\top$ belongs to some compact subset of $\mathbb{R}^{2\dim(x)}$.

(iii) $(x_i^\top, q_i, s_i, u_i)^\top$ is strictly stationary and $\alpha$-mixing with the mixing coefficient $\alpha_{k,l}(m)$ defined in (8), which satisfies, for all $k$ and $l$, $\alpha_{k,l}(m) \le C_1 \exp(-C_2 m)$ for some positive constants $C_1$ and $C_2$.

(iv) $0 < E[u_i^2 \mid x_i, q_i, s_i] < \infty$ almost surely.

(v) Uniformly in (q, s), there exist some finite constants ϕ > 0 and C > 0 such that

$E[\|x_i x_i^\top\|^{4+\varphi} \mid (q_i, s_i) = (q, s)] < C$ and $E[\|x_i u_i\|^{4+\varphi} \mid (q_i, s_i) = (q, s)] < C$.

(vi) γ0 : S 7→ Γ is a twice continuously differentiable function with bounded derivatives.

(vii) D (q, s), V (q, s), and f (q, s) are uniformly bounded in (q, s), continuous in q, and

twice continuously differentiable in s with bounded derivatives. For any i, j ∈ Λn, the

joint density of (qi, qj , si, sj)ᵀ is uniformly bounded above by some constant C < ∞

and continuously differentiable in all components.


(viii) $c_0^\top D(\gamma_0(s), s)\, c_0 > 0$, $c_0^\top V(\gamma_0(s), s)\, c_0 > 0$, and $f(\gamma_0(s), s) > 0$ for all $s \in S$.

(ix) As $n \to \infty$, $b_n \to 0$, $n^{1-2\varepsilon} b_n / \log n \to \infty$, $\log n / (n b_n^2) \to 0$, and $n b_n^{(2+2\varphi)/(2+\varphi)} \to \infty$ for some $\varphi > 0$ given in (v).

(x) $K(\cdot)$ is a positive second-order kernel, which is Lipschitz, symmetric around zero, and nonincreasing on $\mathbb{R}_+$, and satisfies $\int K(v)\, dv = 1$ and $\int v^2 K(v)\, dv < \infty$.

We provide some discussions about Assumption A. First, we assume that qi and si are

continuous random variables as in the example in Section 6.1. This setup also covers the

two-dimensional “spatial structural break” model as a special case as in the example in

Section 6.2. For the latter case, we denote n1 and n2 as the numbers of rows (latitudes)

and columns (longitudes) in the grid of pixels, and normalize $q$ and $s$ so that $q \in \{1/n_1, 2/n_1, \ldots, 1\}$ and $s \in \{1/n_2, 2/n_2, \ldots, 1\}$. Under regularity conditions similar to (and even weaker than) Assumption A, we can show that the asymptotic results in the following sections extend to this case once we treat $(q_i, s_i)^\top$ as independently and uniformly distributed random

variables over [0, 1]2. Such similarity is also found in the standard structural break and the

threshold regression models (e.g., Proposition 5 in Bai and Perron (1998) and Theorem 1 in

Hansen (2000)).

Second, Assumption A is mild and common in the existing literature. In particular, the

condition (i) is the same as in Bolthausen (1982) to define the latent random field. Note

that λ0 can be any strictly positive value, and hence we can impose λ0 > 1 without loss of

generality. The condition (ii) adopts the widely used shrinking change size setup as in Bai

(1997), Bai and Perron (1998), and Hansen (2000) to obtain a simple limiting distribution. In

contrast, a constant change size (when ε = 0) leads to a complicated asymptotic distribution

of the threshold estimator, which depends on nuisance parameters (e.g., Chan (1993)). The

condition (iii) is required to establish the maximal inequality and uniform convergence in a

spatially dependent random field. We impose a stronger condition than Jenish and Prucha

(2009) to obtain the maximal inequality uniformly over γ and s. We could weaken this

condition such that αk,l(m) decays at a polynomial rate (e.g., αk,l(m) ≤ Cm−r for some

r > 8 and a constant C as in Carbon, Francq, and Tran (2007)) if we impose higher moment

restrictions in the condition (v). However, this exponential decay rate simplifies the technical

proofs. The conditions (iv) to (viii) are similar to Assumption 1 of Hansen (2000). The

condition (ix) imposes restrictions on the bandwidth bn, which now depends on ε and ϕ. The

condition (x) holds for many commonly used kernel functions including the Gaussian and the

uniform kernels.

Third, we assume γ0 (·) to be a function from S to Γ in Assumption A-(vi), which is

not necessarily one-to-one. For this reason, sample splitting based on 1 [qi ≤ γ0 (si)] can be

different from that based on 1 [si ≥ γ0 (qi)] for some function γ0 (·). Instead of restricting


$\gamma_0(\cdot)$ to be one-to-one in this paper, we presume that one knows which variables should be respectively assigned as $q_i$ and $s_i$ from the context. Alternatively, we can consider a function

g0 (q, s) such that g0 is monotonically increasing in q for any s. Then, 1 [qi ≤ γ0 (si)] is

viewed as a special case of $1[g_0(q_i, s_i) \le 0]$ by inverting $g_0(\cdot, s)$ at $q^*$, where $g_0(q^*, s) = 0$.

We discuss such extension to identify a threshold contour in Section 4.

3 Asymptotic Results

We first obtain the asymptotic properties of $\hat\gamma(s)$. The following theorem derives the pointwise consistency and the pointwise rate of convergence of $\hat\gamma(s)$ at interior points of $S$.

Theorem 2 For a given $s \in S_0$, under Assumptions ID and A, $\hat\gamma(s) \to_p \gamma_0(s)$ as $n \to \infty$. Furthermore,

$$\hat\gamma(s) - \gamma_0(s) = O_p\left(\frac{1}{n^{1-2\varepsilon} b_n}\right) \qquad (11)$$

provided that $n^{1-2\varepsilon} b_n^2$ does not diverge.

The pointwise rate of convergence of $\hat\gamma(s)$ depends on two parameters, $\varepsilon$ and $b_n$. It is decreasing in $\varepsilon$, as in the parametric (constant) threshold case: a larger $\varepsilon$ reduces the threshold effect $\delta_0 = c_0 n^{-\varepsilon}$ and hence decreases the effective sampling information on the threshold. Since we estimate $\gamma_0(\cdot)$ by the kernel method, the rate of convergence depends on the bandwidth $b_n$ as well. As in the standard kernel estimator case, a smaller bandwidth decreases the effective local sample size, which reduces the precision of the estimator $\hat\gamma(s)$. Therefore, in order to have a sufficiently fast rate of convergence, we need to choose $b_n$ large enough when the threshold effect $\delta_0$ is expected to be small (i.e., when $\varepsilon$ is close to $1/2$).

Unlike the standard kernel estimator, there seems to be no bias-variance trade-off in $\hat\gamma(s)$ in (11), implying that we could improve the rate of convergence by choosing a larger bandwidth $b_n$. However, as we can see in Theorem 3 below, $b_n$ cannot be chosen so large that $n^{1-2\varepsilon} b_n^2 \to \infty$, under which $n^{1-2\varepsilon} b_n (\hat\gamma(s) - \gamma_0(s))$ is no longer $O_p(1)$. Therefore, we can obtain the optimal bandwidth using the restriction that $n^{1-2\varepsilon} b_n^2$ does not diverge.

Under this restriction, we find the optimal bandwidth to be $b_n^* = c^* n^{-(1-2\varepsilon)/2}$ for some constant $0 < c^* < \infty$, which yields the optimal pointwise rate of convergence of $\hat\gamma(s)$ as $n^{-(1-2\varepsilon)/2}$. However, such a bandwidth choice is not feasible because the constant $c^*$ and the nuisance parameter $\varepsilon$ are unknown and not estimable. In practice, we suggest cross validation, as we implement in Section 6, although its statistical properties need to be studied further. Note that, when the change size $\delta_0$ shrinks very slowly with $n$ (i.e., $\varepsilon$ is close to 0), the optimal rate of convergence of $\hat\gamma(\cdot)$ is close to $n^{-1/2}$. This $\sqrt{n}$-rate is obtained in the


standard kernel regression if the unknown function is infinitely differentiable, while we only

require the second-order differentiability of $\gamma_0(\cdot)$.

The next theorem derives the limiting distribution of $\hat\gamma(s)$. We let $W(\cdot)$ be a two-sided

Brownian motion defined as in Hansen (2000):

$$W(r) = W_1(-r)\, 1[r < 0] + W_2(r)\, 1[r > 0], \qquad (12)$$

where W1(·) and W2(·) are independent standard Brownian motions on [0,∞).

Theorem 3 Under Assumptions ID and A, for a given $s \in S_0$, if $n^{1-2\varepsilon} b_n^2 \to \varrho \in (0, \infty)$,

$$n^{1-2\varepsilon} b_n\, (\hat\gamma(s) - \gamma_0(s)) \to_d \xi(s) \arg\max_{r\in\mathbb{R}}\, (W(r) + \mu(r, \varrho; s)) \qquad (13)$$

as $n \to \infty$, where

$$\mu(r, \varrho; s) = -|r|\, \psi_0(r, \varrho; s) + \frac{\varrho\, |\dot\gamma_0(s)|}{\xi(s)}\, \psi_1(r, \varrho; s),$$

$$\psi_j(r, \varrho; s) = \int_0^{|r|\xi(s)/(\varrho|\dot\gamma_0(s)|)} t^j K(t)\, dt \quad \text{for } j = 0, 1,$$

$$\xi(s) = \frac{\kappa_2\, c_0^\top V(\gamma_0(s), s)\, c_0}{\left(c_0^\top D(\gamma_0(s), s)\, c_0\right)^2 f(\gamma_0(s), s)}$$

with $\kappa_2 = \int K(v)^2 dv$, and $\dot\gamma_0(s)$ is the first derivative of $\gamma_0$ at $s$. Furthermore,

$$E\left[\arg\max_{r\in\mathbb{R}}\, (W(r) + \mu(r, \varrho; s))\right] = 0.$$

The drift term $\mu(r, \varrho; s)$ in (13) depends on the constant $0 < \varrho < \infty$, which is the limit of $n^{1-2\varepsilon} b_n^2 = (n^{1-2\varepsilon} b_n) b_n$, and on $|\dot\gamma_0(s)|$, the steepness of $\gamma_0(\cdot)$ at $s$. Interestingly, it resembles the typical $O(b_n)$ boundary bias of the standard local constant estimator. However, this non-zero drift term arises not from the typical boundary effect but from the inequality restriction inside the indicator function, $1[q_i \le \gamma_0(s_i)]$, which characterizes the sample splitting.

It is important to note that having this non-zero drift term in the limiting expression does not mean that the limiting distribution of $\hat\gamma(s)$ has a non-zero mean, even when we use the optimal bandwidth $b_n^* = O(n^{-(1-2\varepsilon)/2})$ satisfying $n^{1-2\varepsilon} b_n^{*2} \to \varrho \in (0, \infty)$. This is mainly because the drift function $\mu(r, \varrho; s)$ is symmetric about zero, and hence the limiting random variable $\arg\max_{r\in\mathbb{R}} (W(r) + \mu(r, \varrho; s))$ has mean zero. In general, we can show that the random variable $\arg\max_{r\in\mathbb{R}} (W(r) + \mu(r, \varrho; s))$ always has mean zero if $\mu(r, \varrho; s)$ is a non-random function that is symmetric about zero and monotonically decreasing fast enough.


Figure 1: Drift function $\mu(r, \varrho; s)$ for different kernels (color online)

This result might be of independent research interest and is summarized in Lemma A.11

in the Appendix. Figure 1 depicts the drift function µ (r, %; s) for various kernels when

$\xi(s)/(\varrho\,|\dot\gamma_0(s)|) = 1$.
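As a numerical illustration, the drift can be evaluated directly under the Figure 1 normalization $\xi(s)/(\varrho|\dot\gamma_0(s)|) = 1$; the kernel choices and the scipy-based quadrature below are our assumptions:

```python
import numpy as np
from scipy.integrate import quad

kernels = {
    "gaussian": lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi),
    "uniform": lambda t: 0.5 if abs(t) <= 1 else 0.0,
}

def drift(r, K, scale=1.0):
    """mu(r, rho; s) of Theorem 3 with scale = xi(s) / (rho * |dgamma0(s)|):
    mu = -|r| * psi_0 + psi_1 / scale, psi_j = int_0^{|r|*scale} t^j K(t) dt."""
    upper = abs(r) * scale
    psi0 = quad(K, 0.0, upper)[0]
    psi1 = quad(lambda t: t * K(t), 0.0, upper)[0]
    return -abs(r) * psi0 + psi1 / scale

# mu is symmetric in r and behaves like -|r|/2 for large |r|, which matches
# the undersmoothed limit (14) below, where the drift is exactly -|r|/2.
for r in (0.5, 2.0, 8.0):
    print(r, drift(r, kernels["gaussian"]), drift(-r, kernels["gaussian"]))
```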

Since the limiting distribution in (13) depends on unknown components, like $\varrho$ and $\dot\gamma_0(s)$, it is hard to use this result for further inference. We instead suggest undersmoothing for practical use. More precisely, if we suppose $n^{1-2\varepsilon} b_n^2 \to 0$ as $n \to \infty$, then the limiting distribution in (13) simplifies to³

$$n^{1-2\varepsilon} b_n\, (\hat\gamma(s) - \gamma_0(s)) \to_d \xi(s) \arg\max_{r\in\mathbb{R}} \left(W(r) - \frac{|r|}{2}\right) \qquad (14)$$

as $n \to \infty$, which appears the same as in the parametric case in Hansen (2000) except for the scaling factor $n^{1-2\varepsilon} b_n$. The distribution of $\arg\max_{r\in\mathbb{R}} (W(r) - |r|/2)$ is known (e.g.,

Bhattacharya and Brockwell (1976) and Bai (1997)), which is also described in Hansen (2000,

p.581). The ξ (s) term determines the scale of the distribution at given s in the way that it

increases in the conditional variance E[u2i |xi, qi, si] and decreases in the size of the threshold

constant c0 and the density of (qi, si) near the threshold.

Even when $n^{1-2\varepsilon} b_n^2 \to 0$ as $n \to \infty$, the asymptotic distribution in (14) still depends on the unknown parameter $\varepsilon$ (or equivalently $c_0$) in $\xi(s)$, which is not estimable. Thus, this

result cannot be directly used for inference of γ0 (s). Alternatively, given any s ∈ S0, we can

³We let $\psi_0(r, 0; s) = \int_0^\infty K(t)\, dt = 1/2$ and $\psi_1(r, 0; s) = \int_0^\infty t K(t)\, dt < \infty$.


consider a pointwise likelihood ratio test statistic for

$$H_0 : \gamma_0(s) = \gamma^*(s) \quad \text{against} \quad H_1 : \gamma_0(s) \neq \gamma^*(s), \qquad (15)$$

which is given as

$$LR_n(s) = \sum_{i\in\Lambda_n} K\left(\frac{s_i - s}{b_n}\right) \frac{Q_n(\gamma^*(s), s) - Q_n(\hat\gamma(s), s)}{Q_n(\hat\gamma(s), s)}. \qquad (16)$$

The following corollary obtains the limiting null distribution of this test statistic that is free

of nuisance parameters. By inverting the likelihood ratio statistic, we can form a pointwise

confidence interval for γ0 (s).

Corollary 1 Suppose $n^{1-2\varepsilon} b_n^2 \to 0$ as $n \to \infty$. Under the same conditions as in Theorem 3, for any fixed $s \in S_0$, the test statistic in (16) satisfies

$$LR_n(s) \to_d \xi_{LR}(s) \max_{r\in\mathbb{R}}\, (2W(r) - |r|) \qquad (17)$$

as $n \to \infty$ under the null hypothesis (15), where

$$\xi_{LR}(s) = \frac{\kappa_2\, c_0^\top V(\gamma_0(s), s)\, c_0}{\sigma^2(s)\, c_0^\top D(\gamma_0(s), s)\, c_0}$$

with $\sigma^2(s) = E[u_i^2 \mid s_i = s]$ and $\kappa_2 = \int K(v)^2 dv$.

When $E[u_i^2 \mid x_i, q_i, s_i] = E[u_i^2 \mid s_i]$, which is the case of local conditional homoskedasticity, the scale parameter $\xi_{LR}(s)$ simplifies to $\kappa_2$, and hence the limiting null distribution of $LR_n(s)$ becomes free of nuisance parameters and is the same for all $s \in S_0$. Though this limiting distribution is still nonstandard, the critical values in this case can be simulated using the same method as Hansen (2000, p.582) with the scale adjusted by $\kappa_2$. More precisely, since the distribution function of $\zeta = \max_{r\in\mathbb{R}} (2W(r) - |r|)$ is given by $P(\zeta \le z) = (1 - \exp(-z/2))^2\, 1[z \ge 0]$, the distribution function of $\zeta^* = \kappa_2 \zeta$ is $P(\zeta^* \le z) = (1 - \exp(-z/(2\kappa_2)))^2\, 1[z \ge 0]$, where $\zeta^*$ is the limiting random variable of $LR_n(s)$ given in (17) under local conditional homoskedasticity. By inverting it, we can obtain the critical values for a given choice of $K(\cdot)$. For instance, the critical values for the Gaussian kernel are reported in Table 1, where $\kappa_2 = (2\sqrt{\pi})^{-1} \simeq 0.2821$ in this case.


Table 1: Simulated Critical Values of the LR Test (Gaussian Kernel)

P(ζ∗ ≤ cv)   0.800   0.850   0.900   0.925   0.950   0.975   0.990
cv           1.268   1.439   1.675   1.842   2.074   2.469   2.988

Note: ζ∗ is the limiting distribution of LRn(s) under the local conditional homoskedasticity. The Gaussian

kernel is used.
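Since the inversion has the closed form $z = -2\kappa_2 \log(1 - \sqrt{p})$, the Table 1 values can be reproduced directly; a short sketch (function name ours):

```python
import numpy as np

kappa2 = 1 / (2 * np.sqrt(np.pi))   # kappa_2 = int K(v)^2 dv for the Gaussian kernel

def lr_critical_value(p, kappa2=kappa2):
    """Solve P(zeta* <= z) = (1 - exp(-z/(2*kappa2)))^2 = p for z, where
    zeta* = kappa2 * max_r (2W(r) - |r|) is the limit of LR_n(s) in (17)
    under local conditional homoskedasticity."""
    return -2 * kappa2 * np.log(1 - np.sqrt(p))

for p in (0.800, 0.850, 0.900, 0.925, 0.950, 0.975, 0.990):
    print(f"P = {p:.3f}: cv = {lr_critical_value(p):.3f}")   # matches Table 1
```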

In general, we can estimate $\xi_{LR}(s)$ by

$$\hat\xi_{LR}(s) = \frac{\kappa_2\, \hat\delta^\top \hat{V}(\hat\gamma(s), s)\, \hat\delta}{\hat\sigma^2(s)\, \hat\delta^\top \hat{D}(\hat\gamma(s), s)\, \hat\delta},$$

where $\hat\delta$ is from (6) and (7), and $\hat\sigma^2(s)$, $\hat{D}(\hat\gamma(s), s)$, and $\hat{V}(\hat\gamma(s), s)$ are the standard Nadaraya-Watson estimators. In particular, we let $\hat\sigma^2(s) = \sum_{i\in\Lambda_n} \omega_{1i}(s)\, \hat{u}_i^2$ with $\hat{u}_i = y_i - x_i^\top\hat\beta - x_i^\top\hat\delta\, 1[q_i \le \hat\gamma(s_i)]$,

$$\hat{D}(\hat\gamma(s), s) = \sum_{i\in\Lambda_n} \omega_{2i}(s)\, x_i x_i^\top, \quad \text{and} \quad \hat{V}(\hat\gamma(s), s) = \sum_{i\in\Lambda_n} \omega_{2i}(s)\, x_i x_i^\top \hat{u}_i^2,$$

where

$$\omega_{1i}(s) = \frac{K((s_i - s)/b_n)}{\sum_{j\in\Lambda_n} K((s_j - s)/b_n)} \quad \text{and} \quad \omega_{2i}(s) = \frac{K((q_i - \hat\gamma(s))/b_n',\, (s_i - s)/b_n'')}{\sum_{j\in\Lambda_n} K((q_j - \hat\gamma(s))/b_n',\, (s_j - s)/b_n'')}$$

for some bivariate kernel function $K(\cdot, \cdot)$ and bandwidth parameters $(b_n', b_n'')$.

Finally, we show the $\sqrt{n}$-consistency of the semiparametric estimators $\hat\beta$ and $\hat\delta^*$ in (6) and (7). For this purpose, we first obtain the uniform rate of convergence of $\hat\gamma(s)$.

Theorem 4 Under Assumptions ID and A,

$$\sup_{s\in S_0} |\hat\gamma(s) - \gamma_0(s)| = O_p\left(\frac{\log n}{n^{1-2\varepsilon} b_n}\right)$$

provided that $n^{1-2\varepsilon} b_n^2$ does not diverge.

Apparently, the uniform consistency of $\hat\gamma(s)$ follows when $\log n / (n^{1-2\varepsilon} b_n) \to 0$ as $n \to \infty$. Based on this uniform convergence, the following theorem derives the joint limiting distribution of $\hat\beta$ and $\hat\delta^*$. We let $\hat\theta^* = (\hat\beta^\top, \hat\delta^{*\top})^\top$ and $\theta_0^* = (\beta_0^\top, \delta_0^{*\top})^\top$.


Theorem 5 Suppose the conditions in Theorem 4 hold. If we let $\pi_n > 0$ be such that $\pi_n \to 0$ and $\log n / (n^{1-2\varepsilon} b_n \pi_n) \to 0$ as $n \to \infty$, we have

$$\sqrt{n}\, (\hat\theta^* - \theta_0^*) \to_d N\left(0, \Sigma_X^{*-1} \Omega^* \Sigma_X^{*-1}\right) \qquad (18)$$

as $n \to \infty$, where

$$\Sigma_X^* = \begin{pmatrix} E[x_i x_i^\top 1_i^+] & 0 \\ 0 & E[x_i x_i^\top 1_i^-] \end{pmatrix} \quad \text{and} \quad \Omega^* = \lim_{n\to\infty} \frac{1}{n}\, Var\begin{pmatrix} \sum_{i\in\Lambda_n} x_i u_i 1_i^+ \\ \sum_{i\in\Lambda_n} x_i u_i 1_i^- \end{pmatrix}$$

with $1_i^+ = 1[q_i > \gamma_0(s_i)]\, 1[s_i \in S_0]$ and $1_i^- = 1[q_i < \gamma_0(s_i)]\, 1[s_i \in S_0]$.

For the second-step estimator $\hat\theta^*$, we use (6) and (7) instead of the conventional plug-in estimation, say $\arg\min_{\beta,\delta} \sum_{i\in\Lambda_n} (y_i - x_i^\top\beta - x_i^\top\delta\, 1[q_i \le \hat\gamma(s_i)])^2\, 1[s_i \in S_0]$. The reason is that the first-step nonparametric estimator $\hat\gamma(\cdot)$ may not be asymptotically orthogonal to the second step. Unlike in the standard semiparametric literature (e.g., Assumption N(c) in Andrews (1994)), the asymptotic effect of $\hat\gamma(s)$ on the second-step estimation is not easily derived due to the discontinuity. The new estimation idea above, however, uses only the observations that are little affected by the estimation error in the first step, which achieves asymptotic orthogonality. As we verify in Lemma A.17 in the Appendix, this is done by choosing a large enough $\pi_n$ in (6) and (7) such that the observations included in the second step are outside the uniform convergence bound of $|\hat\gamma(s) - \gamma_0(s)|$. Thanks to the threshold regression structure, we can estimate the parameters on each side of the threshold even using these subsamples. Meanwhile, we also want $\pi_n \to 0$ fast enough to include more observations. By doing so, though we lose some efficiency in finite samples, we can derive the asymptotic normality of $\hat\theta = (\hat\beta^\top, \hat\delta^\top)^\top$, which has zero mean and achieves the same asymptotic variance as if $\gamma_0(\cdot)$ were known.
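To make the trimming concrete, here is a minimal sketch of (6) and (7); gamma_hat_fn stands for the first-step estimate $\hat\gamma(\cdot)$ (e.g., the gamma_hat sketch in Section 2), pi_n for the truncation constant, and S0_mask for the indicator $1[s_i \in S_0]$ — the names and interface are ours:

```python
import numpy as np

def second_step(y, X, q, si, gamma_hat_fn, pi_n, S0_mask):
    """Second-step estimators (6)-(7): separate OLS fits on the observations
    lying at least pi_n away from the estimated threshold, within S_0."""
    g = np.array([gamma_hat_fn(s) for s in si])    # gamma_hat(s_i) for each i
    above = (q > g + pi_n) & S0_mask               # regime q_i > gamma_0(s_i)
    below = (q < g - pi_n) & S0_mask               # regime q_i <= gamma_0(s_i)
    beta_hat = np.linalg.lstsq(X[above], y[above], rcond=None)[0]
    delta_star = np.linalg.lstsq(X[below], y[below], rcond=None)[0]
    return beta_hat, delta_star, delta_star - beta_hat   # last entry: delta_hat
```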

By the delta method, Theorem 5 readily yields the limiting distribution of $\hat\theta = (\hat\beta^\top, \hat\delta^\top)^\top$ as

$$\sqrt{n}\, (\hat\theta - \theta_0) \to_d N\left(0, \Sigma_X^{-1} \Omega \Sigma_X^{-1}\right) \quad \text{as } n \to \infty, \qquad (19)$$

where

$$\Sigma_X = E\left[z_i z_i^\top 1[s_i \in S_0]\right] \quad \text{and} \quad \Omega = \lim_{n\to\infty} \frac{1}{n}\, Var\left[\sum_{i\in\Lambda_n} z_i u_i 1[s_i \in S_0]\right]$$

with $z_i = [x_i^\top, x_i^\top 1[q_i \le \gamma_0(s_i)]]^\top$. The asymptotic variance expressions in (18) and (19) allow for cross-sectional dependence as they take the long-run variance (LRV) forms $\Omega^*$ and $\Omega$. They can be consistently estimated by the robust estimator developed by Conley and


Molinari (2007) using $\hat{u}_i = (y_i - x_i^\top\hat\beta - x_i^\top\hat\delta\, 1[q_i \le \hat\gamma(s_i)])\, 1[s_i \in S_0]$. The terms $\Sigma_X^*$ and $\Sigma_X$ can be estimated by their sample analogues.
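Although the paper relies on the estimator of Conley and Molinari (2007), a minimal sketch of a Conley-type spatial HAC estimate of such an LRV, assuming a Bartlett taper over the lattice metric $\lambda(i, j)$ and a cutoff of 5 (both our choices, not the paper's), might read:

```python
import numpy as np

def spatial_lrv(scores, coords, cutoff=5.0):
    """Sketch of a Conley-type spatial HAC estimate of an LRV such as Omega:
    Omega_hat = (1/n) * sum_i sum_j w(d_ij) * g_i g_j', where g_i = z_i u_i
    (rows of the n x k array 'scores'), d_ij is the Chebyshev distance
    lambda(i, j) between locations, and w(d) = max(0, 1 - d/cutoff)."""
    n, k = scores.shape
    omega = np.zeros((k, k))
    for i in range(n):
        d = np.max(np.abs(coords - coords[i]), axis=1)   # lambda(i, j)
        w = np.clip(1.0 - d / cutoff, 0.0, None)          # Bartlett weights
        omega += np.outer(scores[i], scores.T @ w)
    return omega / n
```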

4 Threshold Contour

When we consider sample splitting over a two-dimensional space (i.e., qi and si respectively

correspond to the latitude and longitude on the map), the threshold model (1) can be gener-

alized to estimate a nonparametric contour threshold model:

$$y_i = x_i^\top \beta_0 + x_i^\top \delta_0\, 1[g_0(q_i, s_i) \le 0] + u_i, \qquad (20)$$

where the unknown function $g_0 : Q \times S \mapsto \mathbb{R}$ determines the threshold contour on a random field that yields the sample splitting. Interesting examples include identifying an unknown closed boundary on the map, such as a city boundary, or an area of a disease outbreak or airborne pollution. In social science, it can identify a group boundary or a region in which

the agents share common demographic, political, or economic characteristics.

To relate this generalized form to the original threshold model (1), we suppose there exists a known center at $(q_i^*, s_i^*)$ such that $g_0(q_i^*, s_i^*) < 0$. Without loss of generality, we can normalize $(q_i^*, s_i^*)$ to be $(0, 0)$ and re-center the original location variables $(q_i, s_i)$ accordingly.

In addition, we define the radius distance $l_i$ and angle $a_i$ of the $i$th observation relative to the origin as

$$l_i = (q_i^2 + s_i^2)^{1/2},$$

$$a_i = a_i^*\, I_i + (180^\circ - a_i^*)\, II_i + (180^\circ + a_i^*)\, III_i + (360^\circ - a_i^*)\, IV_i,$$

where $a_i^* = \arctan(|q_i / s_i|)$, and each of $(I_i, II_i, III_i, IV_i)$ respectively denotes the indicator that the $i$th observation is located in the first, second, third, and fourth quadrant.

We suppose that there is only one threshold at any angle and the threshold contour

is star-shaped. For each chosen angle $a \in [0, 360)$, we rotate the original coordinates counterclockwise and implement the least squares estimation (5) using only the observations in the first two quadrants after rotation. This ensures that the threshold mapping after rotation is a well-defined function.

In particular, the angle relative to the origin is $a_i - a$ after rotating the coordinates by $a$ degrees counterclockwise, and the new location (after the rotation) is given as $(q_i(a), s_i(a))$, where

$$\begin{pmatrix} q_i(a) \\ s_i(a) \end{pmatrix} = \begin{pmatrix} q_i \cos(a) - s_i \sin(a) \\ s_i \cos(a) + q_i \sin(a) \end{pmatrix}.$$


Figure 2: Illustration of rotation (color online)

After this rotation, we estimate the following nonparametric threshold model:

$$y_i = x_i^\top \beta_0 + x_i^\top \delta_0\, 1[q_i(a) \le \gamma_a(s_i(a))] + u_i \qquad (21)$$

using only the observations $i$ satisfying $q_i(a) \ge 0$ and in the neighborhood of $s_i(a) = 0$, where $\gamma_a(\cdot)$ is the unknown threshold curve as in the original model (1) on the $a$-degree-rotated coordinate plane. Such a reparametrization guarantees that $\gamma_a(\cdot)$ is always positive and that it is estimated at the origin. Figure 2 illustrates the idea of such rotation and pointwise estimation over a bounded support, so that only the red cross points are included for estimation at different angles. Thus, the estimation and inference procedures developed in the previous sections are directly applicable, though we expect some efficiency loss as we only use the subsample with $q_i(a) \ge 0$ at each $a$.
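A compact sketch of this rotate-and-estimate loop follows; rotate implements the displayed coordinate change, and gamma_hat stands for the (hypothetical) first-step estimator sketched in Section 2, evaluated at $s_i(a) = 0$:

```python
import numpy as np

def rotate(q, s, a_deg):
    """Rotate the re-centered coordinates (q, s) by a degrees counterclockwise:
    q(a) = q cos(a) - s sin(a), s(a) = s cos(a) + q sin(a)."""
    a = np.deg2rad(a_deg)
    return q * np.cos(a) - s * np.sin(a), s * np.cos(a) + q * np.sin(a)

def contour(y, X, q, s, bn, gamma_hat, n_angles=500):
    """Estimate the threshold contour: for each angle a, fit model (21) at
    s_i(a) = 0 using only observations with q_i(a) >= 0 (star-shaped contour)."""
    radii = []
    for a in np.linspace(0.0, 360.0, n_angles, endpoint=False):
        qa, sa = rotate(q, s, a)
        keep = qa >= 0.0                  # first two quadrants after rotation
        radii.append(gamma_hat(0.0, y[keep], X[keep], qa[keep], sa[keep], bn))
    return np.array(radii)                # estimated boundary radius per angle
```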

5 Monte Carlo Experiments

We examine the small sample performance of the semiparametric threshold regression esti-

mator by Monte Carlo simulations. We generate n draws from

$$y_i = x_i^\top \beta_0 + x_i^\top \delta_0\, 1[q_i \le \gamma_0(s_i)] + u_i, \qquad (22)$$

where $x_i = (1, x_{2i})^\top$ and $x_{2i} \in \mathbb{R}$. We let $\beta_0 = (\beta_{10}, \beta_{20})^\top = 0 \cdot \iota_2$ and consider different values of $\delta_0 = (\delta_{10}, \delta_{20})^\top = \delta \iota_2$ with $\delta = 1, 2, 3, 4$, where $\iota_2 = (1, 1)^\top$. For the threshold function, we let $\gamma_0(s) = \sin(s)/2$. We consider the cross-sectional dependence structure in


Table 2: Rej. Prob. of the LR Test with i.i.d. Data

              s = 0.0                     s = 0.5                     s = 1.0
 n     δ = 1     2     3     4     1     2     3     4     1     2     3     4
100     0.16  0.09  0.06  0.08  0.19  0.10  0.09  0.08  0.25  0.18  0.16  0.12
200     0.09  0.06  0.06  0.07  0.12  0.06  0.04  0.06  0.18  0.09  0.08  0.06
500     0.08  0.05  0.05  0.06  0.08  0.04  0.04  0.05  0.09  0.04  0.04  0.03

Note: Entries are rejection probabilities of the LR test (16) when data are generated from (22) with

γ0 (s) = sin(s)/2. The dependence structure is given in (23) with ρ = 0. The significance level is 5% and the

results are based on 1000 simulations.

Table 3: Rej. Prob. of the LR Test with Cross-sectionally Correlated Data

              s = 0.0                     s = 0.5                     s = 1.0
 n     δ = 1     2     3     4     1     2     3     4     1     2     3     4
100     0.18  0.09  0.07  0.08  0.21  0.11  0.10  0.06  0.28  0.20  0.17  0.13
200     0.12  0.06  0.06  0.06  0.13  0.08  0.06  0.05  0.20  0.37  0.09  0.06
500     0.08  0.04  0.06  0.06  0.06  0.06  0.04  0.05  0.13  0.08  0.05  0.02

Note: Entries are rejection probabilities of the LR test (16) when data are generated from (22) with

γ0 (s) = sin(s)/2. The dependence structure is given in (23) with ρ = 1 and m = 10. The significance level is

5% and the results are based on 1000 simulations.

$(x_{2i}, q_i, s_i, u_i)^\top$ as follows:

$$(q_i, s_i)^\top \sim \text{iid } N(0, I_2);$$
$$x_{2i} \mid (q_i, s_i) \sim \text{iid } N\left(0, \left(1 + \rho\, (s_i^2 + q_i^2)\right)^{-1}\right);$$
$$u \mid \{(x_i, q_i, s_i)\}_{i=1}^n \sim N(0, \Sigma), \qquad (23)$$

where $u = (u_1, \ldots, u_n)^\top$. The $(i, j)$th element of $\Sigma$ is $\Sigma_{ij} = \rho^{\lfloor \ell_{ij} n \rfloor}\, 1[\ell_{ij} < m/n]$, where $\ell_{ij} = \{(s_i - s_j)^2 + (q_i - q_j)^2\}^{1/2}$ is the $L_2$-distance between the $i$th and $j$th observations. The diagonal elements of $\Sigma$ are normalized as $\Sigma_{ii} = 1$. This $m$-dependent setup follows the Monte Carlo experiment in Conley and Molinari (2007) in the sense that each unit can be cross-sectionally correlated with at most $2m^2$ observations. Within the distance $m$, the dependence decays at a polynomial rate as indicated by $\rho^{\lfloor \ell_{ij} n \rfloor}$. The parameter $\rho$ describes the

strength of cross-sectional dependence in the way that a larger ρ leads to stronger dependence

relative to the unit standard deviation. In particular, we consider the cases with ρ = 0 (i.e.,

i.i.d. observations), 0.5, and 1. We consider the sample size n = 100, 200, and 500, and set

S0 to include the middle 70% observations of si.
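A sketch of this design is below; the function name, RNG seeding, and the small diagonal jitter that keeps the Cholesky factorization numerically stable are ours:

```python
import numpy as np

def simulate(n, delta, rho, m, seed=0):
    """Draws from design (22)-(23) with gamma_0(s) = sin(s)/2."""
    rng = np.random.default_rng(seed)
    q, s = rng.standard_normal(n), rng.standard_normal(n)
    x2 = rng.standard_normal(n) / np.sqrt(1 + rho * (s**2 + q**2))
    X = np.column_stack([np.ones(n), x2])
    # covariance (23): Sigma_ij = rho^floor(l_ij * n) * 1[l_ij < m/n], Sigma_ii = 1
    l = np.hypot(s[:, None] - s[None, :], q[:, None] - q[None, :])
    Sigma = np.where(l < m / n, rho ** np.floor(l * n), 0.0)
    np.fill_diagonal(Sigma, 1.0)
    u = np.linalg.cholesky(Sigma + 1e-10 * np.eye(n)) @ rng.standard_normal(n)
    beta0, delta0 = np.zeros(2), delta * np.ones(2)
    y = X @ beta0 + (X @ delta0) * (q <= np.sin(s) / 2) + u
    return y, X, q, s
```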

First, Tables 2 and 3 report the small sample rejection probabilities of the LR test in


Table 4: Coverage Prob. of the Plug-in Confidence Interval

              β20                        β20 + δ20                   δ20
 n     δ = 1     2     3     4     1     2     3     4     1     2     3     4
100     0.85  0.87  0.90  0.89  0.82  0.89  0.88  0.88  0.83  0.88  0.89  0.90
200     0.87  0.91  0.91  0.91  0.87  0.90  0.93  0.93  0.86  0.90  0.92  0.94
500     0.89  0.92  0.95  0.94  0.87  0.93  0.93  0.94  0.85  0.92  0.95  0.92

Note: Entries are coverage probabilities of 95% confidence intervals for β20, β20 + δ20, and δ20. Data are

generated from (22) with γ0 (s) = sin(s)/2, where the dependence structure is given in (23) with ρ = 0.5 and

m = 3. The results are based on 1000 simulations.

Table 5: Coverage Prob. of the Plug-in Confidence Interval (w/ LRV adj.)

              β20                        β20 + δ20                   δ20
 n     δ = 1     2     3     4     1     2     3     4     1     2     3     4
100     0.94  0.94  0.94  0.94  0.91  0.95  0.95  0.95  0.92  0.94  0.96  0.96
200     0.94  0.95  0.96  0.96  0.93  0.94  0.96  0.96  0.92  0.96  0.97  0.97
500     0.94  0.95  0.98  0.97  0.92  0.96  0.97  0.96  0.91  0.97  0.97  0.96

Note: Entries are coverage probabilities of 95% confidence intervals for β20, β20+δ20, and δ20 with a small

sample adjustment of the LRV estimator. Data are generated from (22) with γ0 (s) = sin(s)/2, where the

dependence structure is given in (23) with ρ = 0.5 and m = 3. The results are based on 1000 simulations.

(16) for $H_0 : \gamma_0(s) = \sin(s)/2$ against $H_1 : \gamma_0(s) \neq \sin(s)/2$ at the 5% nominal level at

three different locations s = 0, 0.5, and 1. In particular, Table 2 examines the case with

no cross-sectional dependence (ρ = 0), while Table 3 examines the case with cross-sectional

dependence whose dependence decays slowly with ρ = 1 and m = 10. For the bandwidth

parameter, we normalize si and qi to have zero mean and unit standard deviation, and

choose bn = 0.5n−1/2 in the main regression. This choice is for undersmoothing so that

n1−2εb2n = n−2ε → 0. To estimate D (γ0 (s) , s) and V (γ0 (s) , s), we use the rule-of-thumb

bandwidths from the standard kernel regression satisfying b′n = O(n−1/5) and b′′n = O(n−1/6).

All the results are based on 1000 simulations. In general, the test for γ0 performs better

as (i) the sample size gets larger; (ii) the coefficient change gets more significant; (iii) the

cross-sectional dependence gets weaker; and (iv) the target gets closer to the mid-support of

s. When δ0 and n are large, the LR test is conservative, which is also found in the classical

threshold regression (e.g., Hansen (2000)).

Second, Table 4 shows the finite sample coverage properties of the 95% confidence intervals

for the parametric components β20, δ∗20 = β20 + δ20, and δ20. The results are based on the

same simulation design as above with ρ = 0.5 and m = 3. Regarding the tuning parameters,

we use the same bandwidth choice bn = 0.5n−1/2 as before and set the truncation parameter


$\pi_n = (n b_n)^{-1/2}$. Unreported results suggest that the choice of the constant in the bandwidth matters particularly with small samples like $n = 100$, but this effect quickly decays as the

sample size gets larger. For the estimator of the LRV, we use the spatial lag order of 5

following Conley and Molinari (2007). Results with other lag choices are similar and hence

omitted. The result suggests that the asymptotic normality is better approximated with

larger samples and larger change sizes. Table 5 shows the same results with a small sample

adjustment of the LRV estimator for Ω∗ by dividing it by the sample truncation fraction∑i∈Λn

(1[qi > γ(si) + πn] + 1[qi < γ(si) − πn])1[si ∈ S0]/∑

i∈Λn1[si ∈ S0]. This ratio

enlarges the LRV estimator and hence the coverage probabilities, especially when the change

size is small. It only affects the finite sample performance as it approaches one in probability

as n→∞.

6 Applications

6.1 Tipping point and social segregation

The first application concerns the tipping point problem in social segregation, which has stimulated

a vast literature in labor, public, and political economics. Schelling (1971) initially proposes

the tipping point model to study the fact that the white population decreases substantially

once the minority share exceeds a certain tipping point. Card, Mas, and Rothstein (2008)

empirically estimate this model and find strong evidence for such a tipping point phenomenon.

In particular, they specify the threshold regression model as

$$y_i = \beta_{10} + \delta_{10}\, 1[q_i \le \gamma_0] + x_{2i}^\top \beta_{20} + u_i,$$

where for tract i in a certain city, qi is the minority share in percentage at the beginning

of a certain decade, yi is the normalized white population change in percentage within this

decade, and x2i is a vector of control variables. They apply the least squares method to

estimate the tipping point γ0. For most cities and for the periods 1970-80, 1980-90, and

1990-2000, they find that white population flows exhibit the tipping-like behavior, with the

estimated tipping points ranging approximately from 5% to 20% across cities.

In Section VII of Card, Mas, and Rothstein (2008), they also find that the location of the

tipping point substantially depends on white residents' attitudes toward the minority. Specif-

ically, they first construct a city-level index that measures white attitudes and regress the

estimated tipping point from each city on this index. The regression coefficient is significantly

different from zero, suggesting that the tipping point is heterogeneous across cities.


We go one step further by considering a more flexible model at the tract level given as

$$y_i = \beta_{10} + \delta_{10}\, 1[q_i \le \gamma_0(s_i)] + x_{2i}^\top \beta_{20} + u_i,$$

where $\gamma_0(\cdot)$ denotes an unknown tipping point function and $s_i$ denotes the attitude index. The nonparametric function $\gamma_0(\cdot)$ here allows for heterogeneous tipping points across tracts, depending on the level of the attitude index $s_i$ in tract $i$. Unfortunately, the attitude index by

Card, Mas, and Rothstein (2008) is only available at the aggregated city-level, and hence we

cannot use it to analyze the census tract-level observations. For this reason, we instead use

the tract-level unemployment rate as si to illustrate the nonparametric threshold function,

which is readily available in the original dataset. Such a compromise is far from being perfect

but can be partially justified since race discrimination has been widely documented to be

correlated with employment (e.g., Darity and Mason (1998)).

We use the data provided by Card, Mas, and Rothstein (2008) and estimate the tipping

point function $\gamma_0(\cdot)$ over census tracts by the method introduced in Section 2. As in their work, we drop the tracts where the minority shares are above 60 percentage points and use

five control variables as x2i, including the logarithm of mean family income, the fractions of

single-unit, vacant, and renter-occupied housing units, and the fraction of workers who use

public transport to travel to work. The bandwidth is set as $b_n = c n^{-1/2}$ for some $c > 0$, so that it satisfies the technical conditions in the previous sections, where the constant $c$ is chosen by leave-one-out cross validation. In particular, we first construct the leave-one-out estimate, $\hat\gamma_{-i}(s_i)$, of $\gamma_0(s_i)$ as in (4) without using the $i$th observation. Then, leaving the $i$th observation out, we construct $\hat\beta_{-i}$ and $\hat\delta_{-i}$ as in (6) and (7) with $\pi_n = (n b_n)^{-1/2}$ using the bandwidth $b_n$ chosen in the previous step. We choose the bandwidth that minimizes $\sum_{i\in\Lambda_n} (y_i - \hat\beta_{1,-i} - \hat\delta_{1,-i}\, 1[q_i \le \hat\gamma_{-i}(s_i)] - x_{2i}^\top \hat\beta_{2,-i})^2\, 1[s_i \in S_0]$, where $S_0$ again includes the middle 70% quantiles of $s_i$.
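A schematic of this cross validation, reusing the hypothetical gamma_hat and second_step sketches from earlier sections (and, for brevity, holding the leave-one-out threshold fixed at $\hat\gamma_{-i}(s_i)$ when refitting the slopes, rather than re-estimating it at every $s$):

```python
import numpy as np

def loo_cv(y, X, q, si, S0_mask, c_grid, gamma_hat, second_step):
    """Pick the constant c in b_n = c * n**(-1/2) by leave-one-out CV."""
    n = len(y)
    sse = np.zeros(len(c_grid))
    for k, c in enumerate(c_grid):
        bn = c / np.sqrt(n)
        pi_n = 1 / np.sqrt(n * bn)              # pi_n = (n b_n)^{-1/2}
        for i in np.flatnonzero(S0_mask):
            keep = np.arange(n) != i            # leave observation i out
            g_i = gamma_hat(si[i], y[keep], X[keep], q[keep], si[keep], bn)
            beta, dstar, _ = second_step(y[keep], X[keep], q[keep], si[keep],
                                         lambda s: g_i, pi_n, S0_mask[keep])
            pred = X[i] @ (dstar if q[i] <= g_i else beta)
            sse[k] += (y[i] - pred) ** 2
    return c_grid[int(np.argmin(sse))]
```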

Figure 3 depicts the estimated tipping points and the 95% pointwise confidence intervals

by inverting the likelihood ratio test statistic (16) in the years 1980-90 in Chicago, Los

Angeles, and New York City, whose sample sizes are relatively large. For each city, the

constant c of the bandwidth bn = cn−1/2 chosen by the aforementioned cross validation is 3.20,

4.87, and 3.42, respectively. We make the following comments. First, the estimates of the

tipping points vary substantially in the unemployment rate within all three cities. Therefore,

the standard constant tipping point model is insufficient to fully characterize the segregation.

Second, the tipping points as functions of the unemployment rate do not exhibit the same

pattern across cities, reinforcing the heterogeneous tipping points in the city-level as found

in Card, Mas, and Rothstein (2008). Finally, the estimated tipping point $\hat\gamma(s)$ as a function of $s$ can be discontinuous, which does not contradict Assumption A-(vi), namely that the true function $\gamma_0(\cdot)$ is smooth. The discontinuity comes from the fact that $\hat\gamma(s)$ is obtained by


Figure 3: Estimate of the tipping point as a function of the unemployment rate

Panel A: Estimated tipping point function in Chicago 1980-90

Panel B: Estimated tipping point function in Los Angeles 1980-90

Panel C: Estimated tipping point function in New York City 1980-90

Note: The figure depicts the point estimates (solid) and the 95% pointwise confidence intervals (dash) of the

tipping points as a function of the unemployment rate, using the data in Chicago, Los Angeles, and New York

City in 1980-1990. The vertical axis is the estimated tipping point in percentage, and the horizontal axis is

the tract-level unemployment rate normalized to its quantile level. Data are available from Card, Mas, and Rothstein

(2008).


Figure 4: Nighttime light intensity in Dallas, Texas, in 2010

Note: The figure depicts the intensity of the stable nighttime light in Dallas, TX 2010. Data are available

from https://www.ncei.noaa.gov/.

grid search and can only take values among the discrete points $\{q_1, \ldots, q_n\}$ in finite samples.

6.2 Metropolitan area determination

The second application is about determining the boundary of a metropolitan area, which is

a fundamental question in urban economics. Recently, researchers have proposed using nighttime

light intensity obtained by satellite imagery to define metropolitan areas. The intuition is

straightforward: metropolitan areas are bright at night while rural areas are dark.

Specifically, the National Oceanic and Atmospheric Administration (NOAA) has collected satellite imagery of nighttime lights at approximately 1-kilometer resolution since 1992.

NOAA further constructs several indices measuring the annual light intensity. Following

the literature (e.g., Dingel, Miscio, and Davis (2019)), we choose the “average visible, stable

lights” index that ranges from 0 (dark) to 63 (bright). For illustration, we focus on Dallas,

Texas and use the data from the years 1995, 2000, 2005, and 2010. In each year, the data are

recorded as a 240×360 grid that covers the latitudes from 32N to 34N and the longitudes

from 98.5W to 95.5W. The total sample size is 240×360=86400 each year. These data areavailable at NOAA’s website and also provided on the authors’website. Figure 4 depicts the

intensity of the stable nighttime light of the Dallas area in 2010 as an example.

Let $y_i$ be the level of nighttime light intensity and $(q_i, s_i)$ be the latitude and longitude of the $i$th pixel, which are normalized onto the equally spaced grid on $[0, 1]^2$. To define the metropolitan area, the existing literature in urban economics first chooses an ad hoc intensity


Figure 5: Kernel density estimate of nighttime light intensity, Dallas 2010

Note: The figure depicts the kernel density estimate of the strength of the stable nighttime light in Dallas,

TX 2010. Data are available from https://www.ncei.noaa.gov/.

threshold, say the 95% quantile of $y_i$, and categorizes the $i$th pixel as a part of the metropolitan

area if yi is larger than the threshold. See Dingel, Miscio, and Davis (2019), Vogel, Goldblatt,

Hanson, and Khandelwal (2019), and references therein. In particular, on p.3 in Dingel,

Miscio, and Davis (2019), they note that “[...] the choice of the light-intensity threshold,

which governs the definitions of the resulting metropolitan areas, is not pinned down by

economic theory or prior empirical research.” Our new approach can provide data-driven guidance for choosing the intensity threshold from an econometric perspective.

To this end, we first examine whether the light intensity data exhibits a clear threshold

pattern. We plot the kernel density estimates of yi in the year 2010 in Figure 5. The

bandwidth is the standard rule-of-thumb one. The estimated density exhibits three peaks at

around the intensity levels 0, 8, and 63. They respectively correspond to the rural area, small

towns, and the central metropolitan area. It shows that the threshold model is appropriate

in characterizing such a mean-shift pattern.

Now we implement the rotation and estimation method described in Section 4. In par-

ticular, we pick the center point in the bright middle area as the Dallas metropolitan center,

which corresponds to the pixel point in the 181st column from the left and the 100th row

from the bottom. Then for each a over the 500 equally-spaced grid on [0, 360], we rotate

the data by a degrees counterclockwise and estimate the model (21) with xi = 1. The

bandwidth is chosen as cn−1/2 with c = 1. Other choices of c lead to almost identical results,

given the large sample size. Figure 6 presents the estimated metropolitan areas using our

nonparametric approach (red) and the area determined by the ad hoc threshold of the 95%

22

Page 24: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

Figure 6: Metropolitan area determination in Dallas (color online)

Note: The figure depicts the city boundary determined by either the new method or by taking the 0.95 quantile

of nighttime light strength as the threshold, using the satellite imagery data for Dallas, TX in the years 1995,

2000, 2005, and 2010. Data are available from https://www.ncei.noaa.gov/.

23

Page 25: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

quantile of yi (black) in the years 1995, 2000, 2005, and 2010. It clearly shows the expansion

of the Dallas metropolitan area over the 15 years of the sample period.

Several interesting findings are summarized as follows. First, the estimated boundary

is highly nonlinear as a function of the angle. Therefore, any parametric threshold model

could lead to a substantially misleading result. Second, our estimated area is larger than that

determined by the ad hoc threshold, by 80.31%, 81.56%, 106.46%, and 102.09% in the years

1995, 2000, 2005, and 2010, respectively. In particular, our nonparametric estimates tend to

include some suburban areas that exhibit strong light intensity and that are geographically

close to the city center. For example, the very left stretch-out area in the estimated boundary

corresponds to Fort Worth, which is 30 miles from downtown Dallas. Residents can easily

commute by train or driving on the interstate highway 30. It is then reasonable to include Fort

Worth as a part of the metropolitan Dallas area for economic analysis. Third, given the large

sample size, the 95% confidence intervals of the boundary are too narrow to be distinguished

from the estimates and therefore omitted from the figure. Such narrow intervals apparently

exclude the boundary determined by the ad hoc method. Finally, the estimated value of

β0 + δ0 is approximately 53 in these sample periods, which corresponds to the 89% quantile

of yi in the sample. This suggests that a more proper choice of the level of light intensity

threshold is the 89% quantile of yi, instead of the 95% quantile, if one needs to choose the

light-intensity threshold to determine the Dallas metropolitan area.

7 Concluding Remarks

This paper proposes a novel approach to conduct sample splitting. In particular, we develop

a nonparametric threshold regression model where two variables can jointly determine the

unknown threshold boundary. Our approach can be easily generalized so that the sample

splitting depends on more numbers of variables, though such an extension is subject to the

curse of dimensionality, as usually observed in the kernel regression literature. The main

interest is in identifying the threshold function that determines how to split the sample.

Thus our model should be distinguished from the smoothed threshold regression model or

the random coeffi cient regression model. It instead could be seen as an unsupervised learning

tool for clustering.

This new approach is empirically relevant in broad areas studying sample splitting (e.g.,

segregation and group-formation) and heterogeneous effects over different subsamples. We

illustrate some of them with the tipping point problem in social segregation and metropol-

itan area determination using satellite imagery datasets. Though we omit in this paper,

we also estimate the economic border between Brooklyn and Queens boroughs in New York

24

Page 26: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

City using housing prices.4 The estimated border is substantially different from the existing

administrative border, which was determined in 1931 and cannot reflect the dramatic city de-

velopment. Interestingly, the estimated border coincides with the Jackson Robinson Parkway

and the Long Island Railroad. This finding provides new evidence that local transportation

corridors could increase community segregation (cf. Ananat (2011) and Heilmann (2018)).

We list some related works, which could motivate potential theoretical extensions. First,

while we focus on the local constant estimation in this paper, one could consider the local

linear estimation using the threshold indicator 1 [qi ≤ γ1 + γ2(si − s)] in (3). Although gridsearch is very diffi cult in determining the two threshold parameters (γ1 and γ2), we could use

the MCMC algorithm developed by Yu and Fan (2020) and the mixed integer optimization

(MIO) algorithms developed by Lee, Liao, Seo, and Shin (2020). Besides the computational

challenge, however, the asymptotic derivation is more involved since we need to consider

higher-order expansions of the objective function. Second, while our nonparametric setup is

on the threshold function γ0(·), some recent literature studies the nonparametric regressionmodel with a parametric threshold, such as yi = m1(xi)+m2(xi)1[qi ≤ γ0]+ui, where m1 (·)and m2 (·) are different nonparametric functions. See, for example, Henderson, Parmeter,and Su (2017), Chiou, Chen, and Chen (2018), Yu and Phillips (2018), Yu, Liao, and Phillips

(2019), and Delgado and Hidalgo (2000).

4The result is available upon request.

25

Page 27: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

A Appendix

Throughout the proof, we denote Ki (s) = K ((si − s)/bn) and 1i (γ) = 1 [qi ≤ γ]. We let Cand its variants such as C1 and C ′1 stand for generic positive finite constants that may varyacross lines. We also let an = n1−2εbn. All the additional lemmas in the proof assume theconditions in Assumptions ID and A hold. Omitted proofs for some lemmas are all collectedin the supplementary material.

A.1 Proof of Theorem 1 (Identification)

Proof of Theorem 1 First, for any γ(s) ∈ Γ with given s ∈ S, we define an L2-loss as

R(β, δ, γ; s) = E[(yi − x>i β − x>i δ1i (γ(si))

)2∣∣∣ si = s

]−E

[(yi − x>i β0 − x>i δ01i (γ0(si))

)2∣∣∣ si = s

].

For any γ ∈ Γ ≡ [γ, γ], we have E[R(β, δ, γ; si)] ≥ 0 from Assumption ID-(i). Since

R(β, δ, γ; s) =

E[(x>i ((β + δ)− (β0 + δ0)

)2∣∣∣ si = s

]for qi ≤ minγ(s), γ0(s);

E[(x>i (β − β0)

)2∣∣∣ si = s

]for qi > maxγ(s), γ0(s),

we have

E [R(β, δ, γ; si)] ≥ ‖(β − β0) + (δ − δ0)‖2 E[||xix>i ||1

[qi ≤ γ

]]+ ‖β − β0‖

2 E[||xix>i ||1 [qi > γ]

]> 0

when (β>, δ>)> 6= (β>0 , δ>0 )> from Assumption ID-(ii). Therefore, we have E [R(β, δ, γ; si)] =

0 only when (β>, δ>)> = (β>0 , δ>0 )>, which gives that (β>0 , δ

>0 )> are identified as the unique

minimizer of E[(yi − x>i β − x>i δ1i (γ)

)2] for any γ ∈ Γ.

Second, note that we have R(β0, δ0, γ; s) ≥ 0 for any γ(s) ∈ Γ with given s ∈ S fromAssumption ID-(i). For any γ(s) 6= γ0(s) at si = s and (β>, δ>)> = (β>0 , δ

>0 )>, however, we

have

R(β0, δ0, γ; s)

= δ>0 E[xix>i (1i (γ(si))− 1i (γ0(si)))

2∣∣∣ si = s

]δ0

= δ>0 E[xix>i 1 [minγ(si), γ0(si) < qi ≤ maxγ(si), γ0(si)]

∣∣ si = s]δ0

=

∫ maxγ(s),γ0(s)

minγ(s),γ0(s)δ>0 E

[xix>i

∣∣ qi = q, si = s]δ0f(q|s)dq

≥ C(s)P (minγ(si), γ0(si) < qi ≤ maxγ(si), γ0(si)| si = s)

> 0

from Assumptions ID-(i), (iii), and (iv), where C(s) = infq∈Q δ>0 E[xix

>i |qi = q, si = s]δ0 > 0.

Note that the last probability is strictly positive because we assume f(q|s) > 0 for any (q, s) ∈Γ×S and γ0(s) is not located on the boundary of Q as ε(s) < P(qi ≤ γ0(si)|si = s) < 1−ε(s)

26

Page 28: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

for some ε(s) > 0. Therefore, R(β0, δ0, γ; s) = 0 only when γ(s) = γ0(s) since R(β0, δ0, γ; s) iscontinuous at γ = γ0(s) and γ is in a compact support from Assumptions ID-(ii) and (iv). Itgives that γ0(s) is identified as the unique minimizer of E[

(yi − x>i β0 − x>i δ01i (γ(si))

)2 |si =

s] for each s ∈ S.

A.2 Proof of Theorem 2 (Pointwise Convergence)

We first present a covariance inequality for strong mixing random field. Suppose Λ1 andΛ2 are finite subsets in Λn with |Λ1| = kx, |Λ2| = lx, and let X1 and X2 be random vari-ables respectively measurable with respect to the σ-algebra’s generated by Λ1 and Λ2. IfE [|X1|px ] <∞ and E [|X2|qx ] <∞ with 1/px + 1/qx + 1/rx = 1 for some constants px, qx > 1

and rx > 0, then

|Cov [X1, X2]| < 8αkx,lx(λ(Λ1,Λ2))1/rxE [|X1|px ]1/px E [|X2|qx ]

1/qx (A.1)

under Assumptions A-(i) and A-(iii). This covariance inequality is presented as Lemma 1 inthe working paper version of Jenish and Prucha (2009). The proof is also available in Halland Heyde (1980), p.277.

For a given s ∈ S0, we define

Mn(γ; s) =1

nbn

∑i∈Λn

xix>i 1i(γ)Ki(s)

Jn(γ; s) =1√nbn

∑i∈Λn

xiui1i(γ)Ki(s).

The following four lemmas give the asymptotic behavior of Mn(γ; s) and Jn(γ; s).

Lemma A.1 (Maximal inequality) For any given s ∈ S0, any η,$ > 0, and any γ1 ∈ Γ,there exists a constant C∗ such that

P

(sup

γ∈[γ1,γ1+$]

‖Jn (γ; s)− Jn (γ1; s)‖ > η

)≤ C∗$2

η4

if n is suffi ciently large.

Proof of Lemma A.1 For expositional simplicity, we only present the case of scalar xi.Let $ be such that $ ≥ (nbn)−1 and g be an integer satisfying nbn$/2 ≤ g ≤ nbn$,which always exists since nbn$ ≥ 1. Consider a fine enough grid over [γ1, γ1 +$] such thatγg = γ1 + (g − 1)$/g for g = 1, . . . , g + 1, where max1≤g≤g

(γg − γg−1

)≤ $/g. We define

hig(s) = xiuiKi (s)1[γg < qi ≤ γg+1

]and Hng(s) = (nbn)−1

∑i∈Λn|hig(s)| for 1 ≤ g ≤ g.

Then for any γ ∈[γg, γg+1

],∣∣Jn (γ; s)− Jn

(γg; s

)∣∣ ≤ √nbnHng(s)

≤√nbn |Hng(s)− E [Hng(s)]|+

√nbnE [Hng(s)]

27

Page 29: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

and hence

supγ∈[γ1,γ1+$]

|Jn (γ; s)− Jn (γ1; s)|

≤ max2≤g≤g+1

∣∣Jn (γg; s)− Jn (γ1; s)∣∣

+ max1≤g≤g

√nbn |Hng(s)− E [Hng(s)]|+ max

1≤g≤g

√nbnE [Hng(s)]

≡ Ψ1(s) + Ψ2(s) + Ψ3(s).

We let hi(s) = xiuiKi (s)1[γg < qi ≤ γk

]for any given 1 ≤ g < k ≤ g and for fixed s. First,

for Ψ1(s), we have

E[∣∣Jn (γg; s)− Jn (γk; s)

∣∣4]=

1

n2b2n

∑i∈Λn

E[h4i (s)

]+

1

n2b2n

∑i,j∈Λn

i6=j

E[h2i (s)h

2j(s)

]+

1

n2b2n

∑i,j∈Λn

i 6=j

E[h3i (s)hj(s)

]+

1

n2b2n

∑i,j,k,l∈Λn

i 6=j 6=k 6=l

E [hi(s)hj(s)hk(s)hl(s)] +1

n2b2n

∑i,j,k∈Λn

i 6=j 6=k

E[h2i (s)hj(s)hk(s)

]≡ Ψ11(s) + Ψ12(s) + Ψ13(s) + Ψ14(s) + Ψ15(s),

where each term’s bound is obtained as follows.For Ψ11(s), since E[|xiui|4|qi, si]1[γg < qi ≤ γk] < C1 < ∞ from Assumption A-(v), we

have

1

bnE[h4i (s)

]=

1

bn

∫∫E[|xiui|4|q, v

]1[γg < q ≤ γk

]K4

(v − sbn

)f (q, v) dqdv (A.2)

≤ C1

bn

∫∫K4

(v − sbn

)f (q, v) dqdv

= C1

∫∫K4 (t) f (q, s+ bnt) dqdt

= C1(s) +O(b2n

)for some C1(s) < ∞ for a given s from Assumption A-(x), where we apply the change ofvariables t = (v − s)/bn. Hence, Ψ11(s) = O(n−1b−1

n ) = o(1). For Ψ12(s), by the covarianceinequality (A.1) with px = 2 + ϕ, qx = 2 + ϕ, rx = (2 + ϕ)/ϕ, and kx = lx = 1,

1

n2b2n

∑i,j∈Λn

i 6=j

∣∣Cov [h2i (s), h

2j(s)

]∣∣ (A.3)

≤ C2

n2b2n

∑i,j∈Λn

i 6=j

α1,1(λ(i, j))ϕ/(2+ϕ)E[∣∣h2

i (s)∣∣2+ϕ

]2/(2+ϕ)

≤ C2

n2b2n

E[∣∣h2

i (s)∣∣2+ϕ

]2/(2+ϕ) ∑i∈Λn

n−1∑m=1

∑j∈Λn

λ(i,j)∈[m,m+1)

α1,1 (m)ϕ/(2+ϕ)

28

Page 30: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

≤ C ′2n2b2

n

E[∣∣h2

i (s)∣∣2+ϕ

]2/(2+ϕ) ∑i∈Λn

∞∑m=1

mα1,1 (m)ϕ/(2+ϕ)

≤ C ′′2

n2b(2+2ϕ)/(2+ϕ)n

(1

bnE[∣∣h2

i (s)∣∣2+ϕ

])2/(2+ϕ)

n∞∑m=1

m exp (−mϕ/(2 + ϕ))

for some ϕ > 0 and C2, C′2, C

′′2 < ∞, where b−1

n E[|h2i (s)|

2+ϕ] < ∞ as in (A.2). Note that

α1,1(λ(i, j)) ≤ α1,1(m) for λ(i, j) ∈ [m,m + 1) and |j ∈ Λn : λ(i, j) ∈ [m,m + 1)| = O(m)

for any given i ∈ Λn as in Lemma A.1.(iii) of Jenish and Prucha (2009). Furthermore, byAssumptions A-(vii) and (x),

1

bnE[h2i (s)

]=

∫∫V (q, t) 1

[γg < q ≤ γk

]K2 (t) f (q, s+ tbn) dqdt (A.4)

=

∫∫V (q, t) 1

[γg < q ≤ γk

]K2 (t) f (q, s) dqdt+O

(b2n

)≤ C3

∫1[γg < q ≤ γk

]f (q, s) dq +O

(b2n

)≤ C ′3

∣∣γk − γg∣∣+O(b2n

)for some C3, C

′3 < ∞, where the inequality is from the fact that V (q, s) and f (q, s) are

uniformed bounded and∫K2 (t) dt <∞. Thus, Assumptions A-(iii), (v), and (x) yield that

Ψ12(s) ≤ 1

n2b2n

∑i,j∈Λn

i 6=j

(E[h2i (s)

]E[h2j(s)

]+∣∣Cov [h2

i (s), h2j(s)

]∣∣) (A.5)

≤(

1

bnE[h2i (s)

])2

+C ′2

nb(2+2ϕ)/(2+ϕ)n

(1

bnE[∣∣h2

i (s)∣∣2+ϕ

])2/(2+ϕ) ∞∑m=1

m exp (−mϕ/(2 + ϕ))

≤ C ′′3(γk − γg

)2+O(n−1b−(2+2ϕ)/(2+ϕ)

n ) +O(b2n

)= C ′′3

(γk − γg

)2+ o (1)

since∑∞

m=1m exp (−mϕ/(2 + ϕ)) < ∞ for ϕ > 0 and n−1b−(2+2ϕ)/(2+ϕ)

n → 0 from As-

sumption A-(ix). Since E [hi(s)] = 0, using the same argument as (A.3) and (A.5), and theinequality (A.1) with px = 2 (2 + ϕ) /3, qx = 2 (2 + ϕ), rx = (2 + ϕ)/ϕ, and kx = lx = 1, wecan also show that

Ψ13(s) ≤ 1

n2b2n

∑i,j∈Λn

i 6=j

∣∣Cov [h3i (s), hj(s)

]∣∣≤ C4

n2b2n

∑i,j∈Λn

i 6=j

α1,1(λ(i, j))ϕ/(2+ϕ)E[∣∣h3

i (s)∣∣2(2+ϕ)/3

]3/(4+2ϕ)

E[|hj(s)|2(2+ϕ)

]1/(4+2ϕ)

≤ C ′4

nb(2+2ϕ)/(2+ϕ)n

(1

bnE[∣∣h2

i (s)∣∣(2+ϕ)

])2/(2+ϕ) ∞∑m=1

m exp (−mϕ/(2 + ϕ))

29

Page 31: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

= O(n−1b−(2+2ϕ)/(2+ϕ)n ) = o (1) .

For Ψ14(s), let E = (i, j, k, l) : i 6= j 6= k 6= l, 0 < λ(i, j) ≤ λ(i, k) ≤ λ(i, l), and λ(j, k) ≤λ(j, l).5 Then by stationarity,

Ψ14(s) ≤ 4!

n2b2n

∑i,j,k,l∈Λn∩E

|E [hi(s)hj(s)hk(s)hl(s)]|

=2 · 4!

n2b2n

∑i,j,k,l∈Λn∩E

λ(i,j)≥maxλ(j,k),λ(k,l)

|Cov [hi(s), hj(s)hk(s)hl(s)]|

+4!

n2b2n

∑i,j,k,l∈Λn∩E

λ(j,k)≥maxλ(i,j),λ(k,l)

|Cov [hi(s)hj(s) , hk(s)hl(s)]|

+4!

n2b2n

∑i,j,k,l∈Λn∩E

λ(j,k)≥maxλ(i,j),λ(k,l)

|E [hi(s)hj(s)]E [hk(s)hl(s)]|

≡ Ψ14,1(s) + Ψ14,2(s) + Ψ14,3(s).

For Ψ14,1(s), the largest distance among all the pairs is λ (i, j). Then by the covarianceinequality (A.1) with px = 2 (2 + ϕ) /3, qx = 2 (2 + ϕ), rx = (2 + ϕ)/ϕ, kx = 1, and lx = 3,

Ψ14,1(s)

≤ C5

n2b(2+2ϕ)/(2+ϕ)n

∑i,j,k,l∈Λn∩E

λ(i,j)≥maxλ(j,k),λ(k,l)

α1,3(λ(i, j))ϕ/(2+ϕ)

(1

bnE[|hi(s)|4+2ϕ

])1/(4+2ϕ)

×(

1

bnE[|hj(s)hk(s)hl(s)|2(2+ϕ)/3

])3/(4+2ϕ)

≤ C ′5(s)

n2b(2+2ϕ)/(2+ϕ)n

∑i,j,k,l∈Λn∩E

λ(i,j)≥maxλ(j,k),λ(k,l)

α1,3(λ(i, j))ϕ/(2+ϕ)

≤ C ′5(s)

n2b(2+2ϕ)/(2+ϕ)n

∑i∈Λn

n−1∑m=1

∑j∈Λn

λ(i,j)∈[m,m+1)

∑k∈Λn

λ(j,k)≤m

∑l∈Λn

λ(k,l)≤m

α1,3(m)ϕ/(2+ϕ)

≤ C ′′5 (s)

nb(2+2ϕ)/(2+ϕ)n

∞∑m=1

m5 exp (−mϕ/(2 + ϕ))

= o(1)

since |k ∈ Λn : λ (j, k) ≤ m| = O(m2) for any given j ∈ Λn. For Ψ14,2(s), the largestdistance among all the pairs is λ (j, k). Similarly as above,

Ψ14,2(s)

≤ C6

n2b(2+2ϕ)/(2+ϕ)n

∑i,j,k,l∈Λn∩E

λ(j,k)≥maxλ(i,j),λ(k,l)

α2,2(λ(j, k))ϕ/(2+ϕ)

(1

bnE[|hi(s)hj(s)|2+ϕ

]1/(2+ϕ))

5 In the (one-dimensional) time series case, this set of indices reduces to (i, j, k, l) : 1 ≤ i < j < k < l ≤ n.

30

Page 32: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

×(

1

bnE[|hk(s)hl(s)|2+ϕ

]1/(2+ϕ))

≤ C ′6(s)

n2b(2+2ϕ)/(2+ϕ)n

∑i,j,k,l∈Λn∩E

λ(j,k)≥maxλ(i,j),λ(k,l)

α2,2(λ(j, k))ϕ/(2+ϕ)

≤ C ′6(s)

n2b(2+2ϕ)/(2+ϕ)n

∑j∈Λn

n∑m=1

∑k∈Λn

λ(j,k)∈[m,m+1)

∑i∈Λn

λ(i,j)≤m

∑l∈Λn

λ(k,l)≤m

α2,2 (m)ϕ/(2+ϕ)

≤ C ′′6 (s)

nb(2+2ϕ)/(2+ϕ)n

∞∑m=1

m5 exp (−mϕ/(2 + ϕ))

= o(1).

For Ψ14,3(s), the largest distance among all the pairs is still λ (j, k). We let a sequence ofintegers κn = O

(b−`n)for some ` > 0 such that κn → ∞ and κ2

nbn → 0 as n → ∞. Wedecompose Ψ14,3(s) such that

Ψ14,3(s) =C7

n2b2n

∑i,j,k,l∈Λn∩E

λ(j,k)≥maxλ(i,j),λ(k,l)λ(i,j)≤κn,λ(k,l)≤κn

|E [hi(s)hj(s)]E [hk(s)hl(s)]|

+C7

n2b2n

∑i,j,k,l∈Λn∩E

λ(j,k)≥maxλ(i,j),λ(k,l)λ(i,j)>κn,λ(k,l)>κn

|E [hi(s)hj(s)]E [hk(s)hl(s)]|

+2C7

n2b2n

∑i,j,k,l∈Λn∩E

λ(j,k)≥maxλ(i,j),λ(k,l)λ(i,j)≤κn,λ(k,l)>κn

|E [hi(s)hj(s)]E [hk(s)hl(s)]|

≡ Ψ′14,3(s) + Ψ′′14,3(s) + Ψ′′′14,3(s).

For Ψ′14,3(s), note that

1

b2n

E [hi(s)hj(s)]

=

∫ γk

γg

∫ γk

γg

E [(xiui) (xjuj) |qi, qj, s, s] f (qi, qj, s, s) dqidqj +O(b2n

)(A.6)

= O (1)

from Assumptions A-(v) and (vii). Hence, from the fact that |j ∈ Λn : λ (i, j) ≤ κn| =

O(κ2n) for any fixed i ∈ Λn, we obtain

Ψ′14,3(s) ≤ C7

n2b2n

∑i,j∈Λn

0<λ(i,j)≤κn

|E [hi(s)hj(s)]|∑k,l∈Λn

0<λ(k,l)≤κn

|E [hk(s)hl(s)]|

≤ C ′7(s)b2nκ

4n = o(1),

where the last line follows from the construction of κn.

31

Page 33: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

ForΨ′′14,3(s), since E [hi(s)hj(s)] = Cov [hi(s), hj(s)], the covariance inequality (A.1) yields

Ψ′′14,3(s)

≤ C7

n2b2n

∑i,j∈Λn

λ(i,j)>κn

|E [hi(s)hj(s)]|∑k,l∈Λn

λ(k,l)>κn

|E [hk(s)hl(s)]|

≤ C ′7n2b2

n

∑i,j∈Λn

λ(i,j)>κn

α1,1 (λ (i, j))ϕ/(2+ϕ) E

[|hi(s)|2+ϕ

]1/(2+ϕ)

E[|hj(s)|2+ϕ

]1/(2+ϕ)

2

=C ′7

b2ϕ/(2+ϕ)n

E[

1

bn|hi(s)|2+ϕ

]2/(2+ϕ)1

n

∑i,j∈Λn

λ(i,j)>κn

α1,1 (λ (i, j))ϕ/(2+ϕ)

2

≤ C ′′7 (s)

b2ϕ/(2+ϕ)n

1

n

∑i∈Λn

n−1∑m=κn+1

∑j∈Λn

λ(i,j)∈[m,m+1)

α1,1 (m)ϕ/(2+ϕ)

2

≤ C ′′7 (s)

b2ϕ/(2+ϕ)n

∞∑

m=κn+1

m exp (−mϕ/(2 + ϕ))

2

.

However, for some c > 0, we have∞∑

m=κn+1

m exp (−cm) ≤∫ ∞κn

t exp (−ct) dt =1

c

(κn +

1

c

)exp (−cκn) .

As we set κn = O(b−`n)for some ` > 0, it follows that

Ψ′′14,3(s) = O

(κ2n exp (−(2ϕ/(2 + ϕ))κn)

b2ϕ/(2+ϕ)n

)= O

(exp

(− 2ϕ

2 + ϕ× 1

b`n

)× 1

b2(ϕ+`)/(2+ϕ)n

)= o (1)

since the exponential term decays faster for any `, ϕ > 0. For Ψ′′′14,3(s), by combining thearguments for bounding Ψ′14,3(s) and Ψ′′14,3(s), we obtain that

Ψ′′′14,3(s) ≤ C ′7(s)bnκ

2n

b−ϕ/(2+ϕ)n κn exp(−κnϕ/(2 + ϕ))

= o(1).

For Ψ15(s), we let E ′ = (i, j, k) : i 6= j 6= k and 0 < λ(i, j) ≤ λ(i, k) and decompose itinto

Ψ15(s) =2

n2b2n

∑i,j,k∈Λn∩E′λ(i,j)<λ(j,k)

E[h2i (s)hj(s)hk(s)

]+

2

n2b2n

∑i,j,k∈Λn∩E′λ(i,j)≥λ(j,k)

E[h2i (s)hj(s)hk(s)

]

≤ 2

n2b2n

∑i,j,k∈Λn∩E′λ(i,j)<λ(j,k)

∣∣Cov [h2i (s)hj(s), hk(s)

]∣∣

32

Page 34: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

+2

n2b2n

∑i,j,k∈Λn∩E′λ(i,j)≥λ(j,k)

∣∣Cov [h2i (s), hj(s)hk(s)

]∣∣+

2

n2b2n

∑i,j,k∈Λn∩E′λ(i,j)≥λ(j,k)

∣∣E [h2i (s)

]E [hj(s)hk(s)]

∣∣≡ Ψ15,1(s) + Ψ15,2(s) + Ψ15,3(s).

Similarly as Ψ14,1(s),

Ψ15,1(s)

≤ C8

n2b2n

∑i,j,k∈Λn∩E′λ(i,j)<λ(j,k)

α2,1 (λ(j, k))ϕ/(2+ϕ) E[∣∣h2

i (s)hj(s)∣∣2(2+ϕ)/3

]3/(4+2ϕ)

E[|hk(s)|2(2+ϕ)

]1/(4+2ϕ)

≤ C ′8

n2b(2+2ϕ)/(2+ϕ)n

∑j∈Λn

n−1∑m=1

∑k∈Λn

λ(j,k)∈[m,m+1)

∑i∈Λn

λ(i,j)≤m

α2,1 (m)ϕ/(2+ϕ)

≤ C ′′8

nb(2+2ϕ)/(2+ϕ)n

∞∑m=1

m3 exp (−mϕ/(2 + ϕ))

= o (1) ,

and the same argument implies that Ψ15,2(s) = o(1) as well. For Ψ15,3(s), we let a sequenceof integers κ′n = O

(b−`

n

)for some `′ > 0 such that κ′n → ∞ and (κ′n)2bn → 0 as n → ∞.

Then, similarly as Ψ14,3(s), we decompose

Ψ15,3(s)

=2

n2b2n

∑i,j,k∈Λn∩E′

λ(i,j)≥λ(j,k),0<λ(j,k)≤κ′n

∣∣E [h2i (s)

]E [hj(s)hk(s)]

∣∣+

2

n2b2n

∑i,j,k∈Λn∩E′

λ(i,j)≥λ(j,k),λ(j,k)>κ′n

∣∣E [h2i (s)

]E [hj(s)hk(s)]

∣∣

≤ 2

nbn

∑i∈Λn

∣∣E [h2i (s)

]∣∣

1

nbn

∑j,k∈Λn

0<λ(j,k)≤κ′n

|E [hj(s)hk(s)]|+1

nbn

∑j,k∈Λn

λ(j,k)>κ′n

|E [hj(s)hk(s)]|

= O (1)

O((κ′n)2bn

)+O

(κ′n exp (−(ϕ/(2 + ϕ))κ′n)

bϕ/(2+ϕ)n

)= o (1) .

By combining all the results for Ψ11 to Ψ15, we thus have

E[∣∣Jn (γg; s)− Jn (γk; s)

∣∣4] ≤ C (γk − γg)2+ o (1) ≤ C (|k − g|$/g)

2+ o (1)

33

Page 35: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

for some C <∞, and Theorem 12.2 of Billingsley (1968) yields that

P (Ψ1(s) > η) = P(

max2≤g≤g+1

∣∣Jn (γg; s)− Jn (γ1; s)∣∣ > η

)≤ C$2

η4(A.7)

for η > 0.Next, for Ψ2(s), a similar argument yields that

E[(√

nbn |Hng(s)− E [Hng(s)]|)4]≤ C ′$2/g2

for some C ′ <∞, and hence by Markov’s inequality we have

P(

max1≤g≤g

√nbn |Hng(s)− E [Hng(s)]| > η

)≤ gC

′$2/g2

η4≤ C ′$2

η4(A.8)

since g = O(nbn) by construction. Finally, to bound Ψ3(s), note that√nbnE [Hng(s)] ≤

√nbnC$/g ≤ C ′/

√nbn (A.9)

for some C,C ′ <∞, since $/g = O((nbn)−1). So the proof is complete by combining (A.7),(A.8), and (A.9).

Lemma A.2 For any fixed s ∈ S0,

Jn (γ; s)⇒ J (γ; s) ,

where J (γ; s) is a mean-zero Gaussian process indexed by γ as n→∞.

Proof of Lemma A.2 For a fixed γ, the Theorem of Bolthausen (1982) implies thatJn (γ; s) →d J (γ; s) as n → ∞ under Assumption A-(iii). Because γ is in the indicatorfunction, such pointwise convergence in γ can be generalized into any finite collection of γ toyield the finite dimensional convergence in distribution. Then the weak convergence followsfrom Lemma A.1 above and Theorem 15.5 of Billingsley (1968).

Lemma A.3

sup(γ,s)∈Γ×S0

‖Mn (γ; s)−M (γ; s)‖ →p 0,

sup(γ,s)∈Γ×S0

(nbn)−1/2 ‖Jn (γ; s)‖ →p 0

as n→∞, whereM (γ; s) =

∫ γ

−∞D(q, s)f (q, s) dq. (A.10)

Proof of Lemma A.3 For expositional simplicity, we only present the case of scalar xi.We prove the convergence of Mn (γ; s). For Jn (γ; s), since E [uixi|qi, si] = 0, the proof isidentical as Mn (γ; s) and hence omitted.

By stationarity, Assumptions A-(vii), (x), and Taylor expansion, we have

E [Mn (γ; s)] =1

bn

∫∫E[x2

i |q, v]1[q ≤ γ]K

(v − sbn

)f (q, v) dqdv (A.11)

34

Page 36: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

=

∫∫D(q, s+ bnt)1[q ≤ γ]K (t) f (q, s+ bnt) dqdt

= M (γ; s) + b2n

∫M (q; s)1[q ≤ γ]dq

∫t2K (t) dt,

where M (q; s) = D(q, s)f (q, s) + (D(q, s) + f (q, s))/2. We let D and f denote the partialderivatives, and D and f denote the second-order partial derivatives with respect to s. Sincesups∈S0 ||M (q; s) || < ∞ for any q from Assumption A-(vii), and K (·) is a second-orderkernel, we have

sup(γ,s)∈Γ×S0

‖E [Mn (γ; s)]−M (γ; s)‖ = Op(b2n

)= op(1). (A.12)

Next, we let τn = (n log n)1/(4+ϕ) and ϕ is given in Assumption A-(v). By Markov’s andHölder’s inequalities, Assumption A-(v) gives P (x2

n > τn) ≤ Cτ−(4+ϕ)n E[|x2

n|4+ϕ] ≤ C ′ (n log n)−1

for some C,C ′ <∞. Thus∑n∈Z2

P(x2n > τn

)≤ C ′

∑n∈Z2

(n log n)−1<∞,

which yields that x2n ≤ τn almost surely for suffi ciently large n by the Borel-Cantelli lemma.

Since τn →∞ as n→∞, we have x2i ≤ τn for any i ∈ Λn and hence

sup(γ,s)∈Γ×S0

‖Mn(γ; s)−M τn(γ; s)‖ = 0 and sup

(γ,s)∈Γ×S0‖E [Mn(γ; s)]− E [M τ

n(γ; s)]‖ = 0

almost surely for suffi ciently large n, where

M τn(γ; s) =

1

nbn

∑i∈Λn

x2i1i(γ)Ki(s)1

x2i ≤ τn

. (A.13)

It follows that

sup(γ,s)∈Γ×S0

‖Mn(γ; s)− E [Mn(γ; s)]‖ ≤ sup(γ,s)∈Γ×S0

‖Mn(γ; s)−M τn(γ; s)‖ (A.14)

+ sup(γ,s)∈Γ×S0

‖M τn(γ; s)− E [M τ

n(γ; s)]‖

+ sup(γ,s)∈Γ×S0

‖E [Mn(γ; s)]− E [M τn(γ; s)]‖

and we establish sup(γ,s)∈Γ×S0 ‖Mn(γ; s)− E [Mn(γ; s)]‖ = op(1) if the second term in (A.14)is op(1). Then we conclude sup(γ,s)∈Γ×S0 ‖Mn (γ; s)−M (γ; s)‖ →p 0 as desired by combining(A.12) and (A.14).

To this end, we let mn be an integer such that mn = O(τn(n/(b3n log n))1/2) and we

cover the compact Γ × S0 by small m2n squares centered at

(γk1 , sk2

), which are defined as

Ik = (γ′, s′) :∣∣γ′ − γk1∣∣ ≤ C/mn and |s′ − sk2 | ≤ C/mn for some C < ∞. Note that

τn (n1−2εbn/ log n)1/2

(n2ε/b4n)

1/2 → ∞ as n → ∞ from Assumption A-(xi), hence mn → ∞.We then have

sup(γ,s)∈Γ×S0

‖M τn(γ; s)− E [M τ

n(γ; s)]‖ ≤ max1≤k1≤mn1≤k2≤mn

sup(γ,s)∈Ik

‖M τn(γ; s)− E [M τ

n(γ; s)]‖

35

Page 37: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

≤ max1≤k1≤mn1≤k2≤mn

sup(γ,s)∈Ik

∥∥M τn(γ; s)−M τ

n(γk1 ; sk2)∥∥

+ max1≤k1≤mn1≤k2≤mn

sup(γ,s)∈Ik

∥∥E [M τn(γ; s)]− E

[M τ

n(γk1 ; sk2)]∥∥

+ max1≤k1≤mn1≤k2≤mn

∥∥M τn(γk1 ; sk2)− E

[M τ

n(γk1 ; sk2)]∥∥

≡ ΨM1 + ΨM2 + ΨM3.

We first decompose M τn(γ; s)−M τ

n(γk1 ; sk2) ≤M τ1n(γ, γk1 ; s, sk2) +M τ

2n(γ, γk1 ; s, sk2), where

M τ1n(γ, γk1 ; s) =

1

nbn

∑i∈Λn

x2i

∣∣1 [qi ≤ γ]− 1[qi ≤ γk1

]∣∣Ki(sk2)1[x2i ≤ τn

],

M τ2n(γ; s, sk2) =

1

nbn

∑i∈Λn

x2i1 [qi ≤ γ] |Ki(s)−Ki(sk2)|1

[x2i ≤ τn

].

Without loss of generality, we can suppose γ > γk1 in Mτ1n(γ, γk1 ; s). Since Ki(·) is bounded

from Assumption A-(x) and we only consider x2i ≤ τn, for any γ such that

∣∣γ − γk1∣∣ ≤ C/mn,∥∥M τ1n(γ, γk1 ; s)

∥∥ ≤ C1

τnnbn

∑i∈Λn

1[γk1 < qi ≤ γ

](A.15)

= C1τnb−1n P

(γk1 < qi ≤ γ

)almost surely

≤ C ′1τnb−1n m−1

n

= C ′′1

(bn log n

n

)1/2

= Oa.s.

((log n

nbn

)1/2)

for some C1, C′1, C

′′1 < ∞, where the second equality is by the uniform almost sure law of

large numbers for random fields (e.g., Jenish and Prucha (2009), Theorem 2). This boundholds uniformly in (γ, s) ∈ Ik and k1, k2 ∈ 1, . . . ,mn. Similarly, since K(·) is Lipschitzfrom Assumption A-(x),

‖M τ2n(γ; s, sk2)‖ ≤

τnnbn

∑i∈Λn

|Ki(s)−Ki(sk2)| (A.16)

≤ C2

τnb2n

|s− sk2 | ≤C ′2τnb2nmn

= Oa.s.

((log n

nbn

)1/2)

for some C2, C′2 <∞, uniformly in γ, s, k1 and k2. It follows that∥∥M τ

n(γ; s)−M τn(γk1 ; sk2)

∥∥ = Oa.s.((log n/(nbn))1/2

)

uniformly in γ, s, k1 and k2, and hence we can readily verify that both ΨM1 and ΨM2 areOa.s.((log n/(nbn))1/2). For ΨM3, we follow the same argument for bounding the Q∗3n termon pp.794-796 of Carbon, Francq, and Tran (2007). In particular, for any k1 ∈ 1, . . . ,mn,max1≤k2≤mn

||M τn(γk1 ; sk2) − E

[M τ

n(γk1 ; sk2)]|| ≤ C3 (log n/(nbn))

1/2 almost surely for someC3 < ∞. Note that γk1 shows up in the indicator function 1

[qi ≤ γk1

]only, which is

uniformly bounded by 1. The bound is hence uniform over all k1 ∈ 1, . . . ,mn andΨM3 = Oa.s.((log n/(nbn))1/2) as well. Combining the bounds for ΨM1,ΨM2, and ΨM3, we

36

Page 38: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

have sup(γ,s)∈Γ×S0 ‖M τn(γ; s)− E [M τ

n(γ; s)]‖ = oa.s.(1) and hence complete the proof becauselog n/(nbn)→ 0 from Assumption A-(ix).

Lemma A.4 Uniformly over s ∈ S0,

∆Mn (s) ≡ 1

nbn

∑i∈Λn

xix>i 1i (γ0 (si))− 1i (γ0 (s))Ki (s) = Oa.s. (bn) . (A.17)

Proof of Lemma A.4 See the supplementary material.

Lemma A.5 For a given s ∈ S0, γ(s)→p γ0(s) as n→∞.

Proof of Lemma A.5 For given s ∈ S0, we let yi(s) = Ki(s)1/2yi, xi(s) = Ki(s)

1/2xi,ui(s) = Ki(s)

1/2ui, xi(γ; s) = Ki(s)1/2xi1i (γ), and xi(γ0(si); s) = Ki(s)

1/2xi1i (γ0(si)). Wedenote y(s), X(s), u(s), X(γ; s), and X(γ0(si); s) as their corresponding matrices of n-stacks.Then θ(γ; s) = (β(γ; s)>, δ(γ; s)>)> in (2) is given as

θ(γ; s) = (Z(γ; s)>Z(γ; s))−1Z(γ; s)>y(s), (A.18)

where Z(γ; s) = [X(s), X(γ; s)]. Therefore, since y(s) = X(s)β0 + X(γ0(si); s)δ0 + u(s) andX(s) lies in the space spanned by Z(γ; s), we have

Qn (γ; s)− u(s)>u(s) = y(s)> (In − PZ(γ; s)) y(s)− u(s)>u(s)

= −u(s)>PZ(γ; s)u(s) + 2δ>0 X(γ0(si); s)> (In − PZ(γ; s)) u(s)

+δ>0 X(γ0(si); s)> (In − PZ(γ; s)) X(γ0(si); s)δ0,

where PZ(γ; s) = Z(γ; s)(Z(γ; s)>Z(γ; s))−1Z(γ; s)> and In is the identity matrix of rankn. Note that PZ(γ; s) is the same as the projection onto [X(s) − X(γ; s), X(γ; s)], whereX(γ; s)>(X(s)−X(γ; s)) = 0. Furthermore, for γ ≥ γ0(si), xi(γ0(si); s)

>(xi(s)−xi(γ; s)) = 0

and hence X(γ0(si); s)>X(γ; s) = X(γ0(si); s)

>X(γ0(si); s). Since we can rewrite

Mn(γ; s) =1

nbn

∑i∈Λn

xi(γ; s)xi(γ; s)> and

Jn(γ; s) =1√nbn

∑i∈Λn

xi(γ; s)ui(s),

Lemma A.3 yields that

Z(γ; s)>u(s) = [X(s)>u(s), X(γ; s)>u(s)] = Op((nbn)1/2

)Z(γ; s)>X(γ0(si); s) = [X(s)>X(γ0(si); s), X(γ; s)>X(γ0(si); s)]

= [X(s)>X(γ0(si); s), X(γ0(si); s)>X(γ0(si); s)] = Op (nbn)

for given s. It follows that

Υn (γ; s) ≡ 1

an

(Qn (γ; s)− u(s)>u(s)

)(A.19)

= Op

(1

an

)+Op

(1

a1/2n

)+

1

nbnc>0 X(γ0(si); s)

> (In − PZ(γ; s)) X(γ0(si); s)c0

37

Page 39: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

=1

nbnc>0 X(γ0(si); s)

> (I − PZ(γ; s)) X(γ0(si); s)c0 + op(1)

for an = n1−2εbn →∞ as n→∞. Moreover, we have

Mn (γ0(si); s) =1

nbn

∑i∈Λn

xi(γ0(si); s)xi(γ0(si); s)> (A.20)

= Mn (γ0(s); s) + ∆Mn (s)

= Mn (γ0(s); s) + oa.s. (1)

from Lemma A.4, where ∆Mn (s) is defined in (A.17). It follows that

1

nbnc>0 X(γ0(si); s)

> (In − PZ(γ; s)) X(γ0(si); s)c0 (A.21)

→p c>0 M(γ0(s); s)c0 − c>0 M(γ0(s); s)>M(γ; s)−1M(γ0(s); s)c0 ≡ Υ0(γ; s) <∞

uniformly over γ ∈ Γ ∩ [γ0(s),∞) as n→∞, from Lemma A.3 and Assumptions ID-(ii) andA-(viii). However,

∂Υ0(γ; s)/∂γ = c>0 M(γ0(s); s)>M(γ; s)−1D(γ, s)f(γ, s)M(γ; s)−1M(γ0(s); s)c0 ≥ 0

and∂Υ0(γ0(s); s)/∂γ = c>0 D(γ0(s), s)f(γ0(s), s)c0 > 0 (A.22)

from Assumption A-(viii), which implies that Υ0(γ; s) is continuous, non-decreasing, anduniquely minimized at γ0(s) given s ∈ S0.

We can symmetrically show that, uniformly over γ ∈ Γ∩(−∞, γ0(s)], Υ0(γ; s) in (A.21) iscontinuous, non-increasing, and uniquely minimized at γ0(s) as well. Therefore, given s ∈ S0,supγ∈Γ |Υn(γ; s)−Υ0(γ; s)| = op(1); Υ0(γ; s) is continuous and uniquely minimized at γ0(s).Since Γ is compact and γ(s) is the minimizer of Υn(γ; s), the pointwise consistency followsas Theorem 2.1 of Newey and McFadden (1994).

We let φ1n = a−1n , where an = n1−2εbn and ε is given in Assumption A-(ii). For a given

s ∈ S0 and any γ : S0 7→ Γ, we define

Tn (γ; s) =1

nbn

∑i∈Λn

(c>0 xi

)2 |1i (γ (s))− 1i (γ0 (s))|Ki (s) , (A.23)

T n(γ, s) =1

nbn

∑i∈Λn

‖xi‖2 |1i (γ (s))− 1i (γ0 (s))|Ki (s) , (A.24)

Lnj (γ; s) =1√nbn

∑i∈Λn

xijui 1i (γ (s))− 1i (γ0 (s))Ki (s) (A.25)

for j = 1, . . . ,dim(x), where xij denotes the jth element of xi.

Lemma A.6 For a given s ∈ S0, for any γ (·) : S0 7→ Γ, η(s) > 0, and ε(s) > 0, thereexist constants 0 < CT (s), CT (s), C(s), r(s) < ∞ such that if n is suffi ciently large andn1−2εb2

n → % <∞,

P(

infr(s)φ1n<|γ(s)−γ0(s)|<C(s)

Tn (γ; s)

|γ (s)− γ0 (s)| < CT (1− η(s))

)≤ ε(s), (A.26)

38

Page 40: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

P

(sup

r(s)φ1n<|γ(s)−γ0(s)|<C(s)

T n (γ; s)

|γ (s)− γ0 (s)| > CT (1 + η(s))

)≤ ε(s), (A.27)

P

(sup

r(s)φ1n<|γ(s)−γ0(s)|<C(s)

|Lnj (γ; s)|√an |γ (s)− γ0 (s)| > η(s)

)≤ ε(s) (A.28)

for j = 1, . . . ,dim(x).

Proof of Lemma A.6 See the supplementary material.

For a given s ∈ S0, we let θ(γ(s)) = (β(γ(s))>, δ(γ(s))>)> and θ0 = (β>0 , δ>0 )>.

Lemma A.7 For a given s ∈ S0, nε(θ(γ(s))− θ0) = op(1).

Proof of Lemma A.7 See the supplementary material.

Proof of Theorem 2 The consistency is proved in Lemma A.5 above. For given s ∈ S0,we let

Q∗n(γ(s); s) = Qn(β (γ (s)) , δ (γ (s)) , γ(s); s) (A.29)

=∑i∈Λn

yi − x>i β (γ (s))− x>i δ (γ (s))1i(γ(s))

2

Ki (s)

for any γ(·), where Qn(β, δ, γ; s) is the sum of squared errors function in (3). Consider γ(s)

such that γ (s) ∈[γ0 (s) + r(s)φ1n, γ0 (s) + C(s)

]for some 0 < r(s), C(s) <∞ that are chosen

in Lemma A.6. We let ∆i(γ; s) = 1i (γ (s))−1i (γ0 (s)); cj(γ (s)) and c0j be the jth element ofc(γ (s)) ∈ Rdim(x) and c0 ∈ Rdim(x), respectively. Then, since yi = β>0 xi + δ>0 xi1i (γ0 (si)) +ui,

Q∗n(γ(s); s)−Q∗n(γ0(s); s)

=∑i∈Λn

(δ (γ (s))

>xi

)2

∆i(γ; s)Ki (s)

−2∑i∈Λn

(yi − β (γ (s))

>xi − δ (γ (s))

>xi1i (γ0 (s))

)(δ (γ (s))

>xi

)∆i(γ; s)Ki (s)

=∑i∈Λn

(δ>0 xi

)2∆i(γ; s)Ki (s) +

∑i∈Λn

(δ (γ (s))

>xi

)2

−(δ>0 xi

)2

∆i(γ; s)Ki (s)

−2∑i∈Λn

δ>0 xiui∆i(γ; s)Ki (s)− 2∑i∈Λn

(δ (γ (s))− δ0

)>xiui∆i(γ; s)Ki (s)

−2∑i∈Λn

(β (γ (s))− β0

)>xix>i δ (γ (s)) ∆i(γ; s)Ki (s)

−2∑i∈Λn

δ>0 xix>i δ0 1i (γ0 (si))− 1i (γ0 (s))∆i(γ; s)Ki (s) (A.30)

−2∑i∈Λn

δ>0 xix>i

(δ (γ (s))− δ0

)1i (γ0 (si))− 1i (γ0 (s))∆i(γ; s)Ki (s) (A.31)

39

Page 41: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

−2∑i∈Λn

(δ (γ (s))− δ0

)>xix>i δ (γ (s))1i (γ0 (s)) ∆i(γ; s)Ki (s) , (A.32)

where the absolute values of the last two terms in lines (A.31) and (A.32) are bounded by

2

∣∣∣∣∣∑i∈Λn

δ>0 xix>i

(δ (γ (s))− δ0

)|∆i(γ; s)|Ki (s)

∣∣∣∣∣ and

2

∣∣∣∣∣∑i∈Λn

(δ (γ (s))− δ0

)>xix>i δ (γ (s)) |∆i(γ; s)|Ki (s)

∣∣∣∣∣ ,respectively, since |1i (γ0 (s))| ≤ 1 and |1i (γ0 (si))− 1i (γ0 (s))| ≤ 1. Moreover, for the termin line (A.30), we have

1

an

∑i∈Λn

δ>0 xix>i δ0 1i (γ0 (si))− 1i (γ0 (s))∆i(γ; s)Ki (s)

≤ 1

an

∑i∈Λn

δ>0 xix>i δ0 |1i (γ0 (si))− 1i (γ0 (s))|Ki (s) = C∗n(s)bn (A.33)

for some C∗n(s) = Oa.s.(1) as in (A.20).For any vector v = (v1, . . . , vdim(v))

>, we let ‖v‖∞ = max1≤j≤dim(x) |vj|. From Lemma A.7,we also let a suffi ciently small κn(s) such that nε||θ(γ(s))−θ0|| ≤ κn(s) and κn(s)→ 0 as n→∞ for any s. Then, ‖c (γ (s))− c0‖ ≤ κn(s), ‖c(γ (s))‖ ≤ ‖c0‖+ κn(s), and ‖c (γ (s)) + c0‖ ≤2 ‖c0‖+κn(s). In addition, given Lemma A.6, there exist 0 < C(s), C(s), r(s), η(s), ε(s) <∞such that

P(

infr(s)φ1n<|γ(s)−γ0(s)|<C(s)

Tn (γ; s)

|γ (s)− γ0 (s)| < C(s) (1− η(s))

)≤ ε(s)

3,

P

(sup

r(s)φ1n<|γ(s)−γ0(s)|<C(s)

T n (γ; s)

|γ (s)− γ0 (s)| > CT (1 + η(s))

)≤ ε(s)

3,

P

(sup

r(s)φ1n<|γ(s)−γ0(s)|<C(s)

2 dim(x) ‖c0‖∞ ‖Ln (γ; s)‖∞√an |γ (s)− γ0 (s)| > η(s)

)≤ ε(s)

3

for ‖c0‖∞ <∞. For γ (s) ∈[γ0 (s) + r(s)φ1n, γ0 (s) + C(s)

], we also have

P

(sup

r(s)φ1n<|γ(s)−γ0(s)|<C(s)

2C∗n(s)bn|γ (s)− γ0(s)| > η(s)

)≤ ε(s)

3

by choosing r(s) large enough, since

supr(s)φ1n<|γ(s)−γ0(s)|<C(s)

C∗n(s)bnγ (s)− γ0(s)

≤ C∗n(s)bnr(s)φ1n

= anbnC∗n(s)

r(s)<∞

almost surely provided n1−2εb2n → % <∞.

It follows that, with probability approaching to one,

Q∗n(γ(s); s)−Q∗n(γ0(s); s)

an(γ(s)− γ0(s))(A.34)

40

Page 42: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

≥ Tn (γ; s)

γ(s)− γ0(s)− ‖c (γ (s))− c0‖ ‖c (γ (s)) + c0‖

T n(γ, s)

γ(s)− γ0(s)

−2 dim(x) ‖c0‖∞‖Ln (γ; s)‖∞√an(γ(s)− γ0(s))

− 2 dim(x) ‖c(γ (s))− c0‖∞‖Ln (γ; s)‖∞√an (γ(s)− γ0(s))

−2∥∥∥nε(β (γ (s))− β0)

∥∥∥ ‖c(γ (s))‖ T n(γ, s)

γ(s)− γ0(s)

−2C∗n(s)bn

γ(s)− γ0(s)

−2 ‖c0‖ ‖c (γ (s))− c0‖T n(γ, s)

γ(s)− γ0(s)

−2∥∥∥nε(δ (γ (s))− δ0)

∥∥∥ ‖c(γ (s))‖ T n(γ, s)

γ(s)− γ0(s)

≥ CT (s) (1− η (s))− κn(s) 2||c0||+ κn(s)CT (s) (1 + η (s))

−2 dim(x) ‖c0‖∞ η (s)− 2 dim(x)κn(s)η (s)

−2κn(s) ||c0||+ κn(s)CT (s) (1 + η (s))

−2η (s)

−2 ‖c0‖κn(s)CT (s) (1 + η (s))

−2κn(s) ||c0||+ κn(s)CT (s) (1 + η (s))

> 0

by choosing suffi ciently small κn(s) and η(s).Since we suppose an(γ(s)− γ0(s)) > 0, it implies that, for any ε(s) ∈ (0, 1) and η(s) > 0,

P(

infr(s)φ1n<|γ(s)−γ0(s)|<C(s)

Q∗n(γ(s); s)−Q∗n(γ0(s); s) > η(s)

)≥ 1− ε(s),

which yields P (Q∗n(γ(s); s)−Q∗n(γ0(s); s) > 0) → 1 as n → ∞ for given s ∈ S0. We sim-ilarly show the same result when γ (s) ∈

[γ0 (s)− C(s), γ0 (s)− r(s)φ1n

]. Therefore, be-

cause Q∗n(γ(s); s) − Q∗n(γ0(s); s) ≤ 0 for any s ∈ S0 by construction, it should hold that|γ (s)−γ0 (s) | ≤ r(s)φ1n with probability approaching to one; or for any ε(s) > 0 and s ∈ S0,there exists r(s) > 0 such that

P (an|γ (s)− γ0 (s) | > r(s)) < ε(s)

for suffi ciently large n, since φ1n = a−1n .

A.3 Proof of Theorem 3 and Corollary 1 (Asymptotic Distribution)

For a given s ∈ S0, we let γn (s) = γ0 (s) + r/an with some |r| <∞, where an = n1−2εbn andε is given in Assumption A-(ii). We define

A∗n (r, s) =∑i∈Λn

(δ>0 xi

)2 |1i (γn (s))− 1i (γ0 (s))|Ki (s) ,

B∗n (r, s) =∑i∈Λn

δ>0 xiui 1i (γn (s))− 1i (γ0 (s))Ki (s) .

41

Page 43: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

Lemma A.8 Suppose n1−2εb2n → % < ∞. Then, for fixed s ∈ S0, uniformly over r in any

compact set,A∗n (r, s)→p |r| c>0 D (γ0 (s) , s) c0f (γ0 (s) , s)

andB∗n (r, s)⇒W (r)

√c>0 V (γ0 (s) , s) c0f (γ0 (s) , s)κ2

as n → ∞, where κ2 =∫K2(v)dv and W (r) is the two-sided Brownian Motion defined in

(12).

Proof of Lemma A.8 First, for A∗n (r, s), we consider the case with r > 0. We let∆i(γn; s) = 1i (γn (s)) − 1i (γ0 (s)), hi(r, s) =

(c>0 xi

)2∆i(γn; s)Ki (s), and recall that δ0 =

c0n−ε = c0(an/ (nbn))1/2. By change of variables and Taylor expansion, Assumptions A-(v),

(viii), and (x) imply that

E [A∗n (r, s)] =annbn

∑i∈Λn

E [hi(r, s)] (A.35)

= an

∫∫ γ0(s)+r/an

γ0(s)

E[(c>0 xi

)2 |q, s+ bnt]K (t) f (q, s+ bnt) dqdt

= rc>0 D (γ0 (s) , s) c0f (γ0 (s) , s) +O

(1

an+ b2

n

),

where the third equality holds under Assumption A-(vi). Next, we have

V ar [A∗n (r, s)] =a2n

n2b2n

V ar

[∑i∈Λn

hi(r, s)

](A.36)

=a2n

nb2n

V ar [hi(r, s)] +a2n

n2b2n

∑i,j∈Λn

i 6=j

Cov [hi(r, s), hj(r, s)]

≡ ΨA1(r, s) + ΨA2(r, s).

Taylor expansion and Assumptions A-(vii), (viii), and (x) lead to

ΨA1(r, s) =annbn

(anbnE[(c>0 xi

)4∆i(γn; s)K2

i (s)])− 1

n

(anbnE[(c>0 xi

)2∆i(γn; s)Ki (s)

])2

= O

(annbn

+1

n

)= O

(n−2ε +

1

n

)since ∆i(γn; s)2 = ∆i(γn; s) for r > 0, where each moment term is bounded as in (A.35). ForΨA2, we define a sequence of integers κn = O

(n`)for some ` > 0 such that κn → ∞ and

κ2n/n→ 0, and decompose

ΨA2(r, s) =a2n

n2b2n

∑i,j∈Λn

0<λ(i,j)≤κn

Cov [hi(r, s), hj(r, s)] +a2n

n2b2n

∑i,j∈Λn

λ(i,j)>κn

Cov [hi(r, s), hj(r, s)]

= Ψ′A2(r, s) + Ψ′′A2(r, s).

42

Page 44: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

Then, since

Cov

[anbnhi(r, s),

anbnhj(r, s)

]≤ r2E

[(c>0 xi

)2 (c>0 xj

)2 |γ0 (s) , γ0 (s) , s, s]f (γ0 (s) , γ0 (s) , s, s) + o (1)

using a similar argument as in (A.6) and (A.35), similarly as the proof of Ψ′14,3(s) in LemmaA.1, we have

Ψ′A2(r, s) ≤ Cr2κ2n/n = o (1)

for some C < ∞. Furthermore, by the covariance inequality (A.1) and Assumption A-(iii),we have

|Ψ′′A2(r, s)| ≤ C ′

n2

(anbn

) 2+2ϕ2+ϕ ∑

i,j∈Λn

λ(i,j)>κn

α1,1 (λ (i, j))ϕ/(2+ϕ) E

[anbn|hi(r, s)|2+ϕ

]2/(2+ϕ)

≤ C ′′

n

(anbn

) 2+2ϕ2+ϕ ∑

i∈Λn

n−1∑m=κn+1

∑j∈Λn

λ(i,j)∈[m,m+1)

α1,1 (m)ϕ/(2+ϕ)

≤ C ′′

n

(anbn

) 2+2ϕ2+ϕ

∞∑m=κn+1

m exp (−mϕ/(2 + ϕ))

= O(n((1−2ε)(2+2ϕ)/(2+ϕ))−1κn exp(−κnϕ/(2 + ϕ))

)= o (1) ,

similarly as the proof of Ψ′′14,3(s) in Lemma A.1, because E[(an/bn) |hi(r, s)|2+ϕ] is boundedas in (A.35) and we set κn such that κn = O(n`) for ` > 0. Hence, the pointwise con-vergence of A∗n (r, s) is obtained. Furthermore, since A∗n(r, s) is monotonically increasingin r and the limit function rc>0 D (γ0 (s) , s) c0f (γ0 (s) , s) is continuous in r, the conver-gence holds uniformly on any compact set. Symmetrically, we can show that E [A∗n (r, s)] =

−rc>0 D (γ0 (s) , s) c0f (γ0 (s) , s) + O (a−1n + b2

n) when r < 0. The uniform convergence alsoholds in this case using the same argument as above, which completes the proof for A∗n (r, s).

Next, for B∗n (r, s), Assumption ID-(i) leads to E [B∗n (r, s)] = 0. We let hi(r, s) =

c>0 xiui∆i(γn; s)Ki (s) and write

V ar[B∗n (r, s)] =anbnV ar[hi(r, s)] +

annbn

∑i,j∈Λn

i 6=j

Cov[hi(r, s), hj(r, s)]

≡ ΨB1(r, s) + ΨB2(r, s).

As in (A.35), we have

ΨB1(r, s) = |r| c>0 V (γ0 (s) , s) c0f (γ0(s), s)

∫K2(v)dv +O

(1

an+ b2

n

),

which is nonsingular for |r| > 0 from Assumption A-(viii). For ΨB2(r, s), we define a sequenceof integers κ′n = O

(n`′)for some `′ > 0 such that κ′n → ∞ and (κ′n)2/n1−2ε → 0, and

43

Page 45: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

decompose

ΨB2(r, s) =annbn

∑i,j∈Λn

0<λ(i,j)≤κn

Cov[hi(r, s), hj(r, s)] +annbn

∑i,j∈Λn

λ(i,j)>κn

Cov[hi(r, s), hj(r, s)]

≡ Ψ′B2(r, s) + Ψ′′B2(r, s).

Then similarly as Ψ′A2 and Ψ′′A2 above, we have

|Ψ′B2(r, s)| ≤ Cr2(κ′n)2 × bnan

= O

((κ′n)2

n1−2ε

)= o(1),

|Ψ′′B2(r, s)| ≤ C ′(anbn

)ϕ/(2+ϕ) ∞∑m=κn+1

m exp (−mϕ/(2 + ϕ))

= C ′n(1−2ε)ϕ/(2+ϕ)κ′n exp(−κ′nϕ/(2 + ϕ)) = o(1)

for some C,C ′ <∞. By combining these results, we have

V ar[B∗n (r, s)] = |r| c>0 V (γ0 (s) , s) c0f (γ0(s), s)κ2 + o (1)

with κ2 =∫K2(v)dv, and by the CLT for stationary and mixing random field (e.g., Bolthausen

(1982) and Jenish and Prucha (2009)), we have

B∗n (r, s)⇒W (r)√c>0 V (γ0 (s) , s) c0f (γ0 (s) , s)κ2

as n → ∞, where W (r) is the two-sided Brownian Motion defined in (12). This pointwiseconvergence in r can be extended to any finite-dimensional convergence in r by the factthat for any r1 < r2, Cov [B∗n (r1, s) , B

∗n (r2, s)] = V ar [B∗n (r1, s)] + o (1), which is because

(1i (γ0 + r2/an)− 1i (γ0 + r1/an))1i (γ0 + r1/an) = 0. The tightness follows from a simi-lar argument as Jn(γ; s) in Lemma A.1 and the desired result follows by Theorem 15.5 inBillingsley (1968).

For a given s ∈ S0, we let θ (γ0 (s)) = (β (γ0 (s))>, δ (γ0 (s))

>)>. Recall that θ0 =

(β>0 , δ>0 )> and θ (γ (s)) = (β (γ (s))

>, δ (γ (s))

>)>.

Lemma A.9 For a given s ∈ S0,√nbn(θ (γ (s))−θ0) = Op(1) and

√nbn(θ (γ (s))−θ (γ0 (s))) =

op(1), if n1−2εb2n → % <∞ as n→∞.

Proof of Lemma A.9 See the supplementary material.

Proof of Theorem 3 From Theorem 2, we define a random variable r∗(s) such that

r∗(s) = an(γ (s)− γ0 (s)) = arg maxr∈R

Q∗n(γ0(s); s)−Q∗n

(γ0(s) +

r

an; s

),

where Q∗n(γ(s); s) is defined in (A.29). We let ∆i(s) = 1i (γ0 (s) + r/an) − 1i (γ0 (s)). Wethen have

∆Q∗n(r; s) (A.37)

44

Page 46: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

= Q∗n(γ0(s); s)−Q∗n(γ0(s) +

r

an; s

)= −

∑i∈Λn

(δ (γ (s))

>xi

)2

|∆i(s)|Ki (s)

+2∑i∈Λn

(yi − β (γ (s))

>xi − δ (γ (s))

>xi1i (γ0 (s))

)(δ (γ (s))

>xi

)∆i(s)Ki (s)

≡ −An(r; s) + 2Bn(r; s).

For An(r; s), Lemmas A.8 and A.9 yield

An(r; s) = A∗n (r, s) + op (1) (A.38)

since δ (γ (s))−δ0 = Op((nbn)−1/2). Similarly, for Bn(r; s), since yi = β>0 xi+δ>0 xi1i (γ0(si))+

ui, δ (γ (s))− δ0 = Op((nbn)−1/2), and β (γ (s))− β0 = Op((nbn)−1/2), we have

Bn(r; s) (A.39)

=∑i∈Λn

(ui + δ>0 xi 1i (γ0 (si))− 1i (γ0 (s)) −

(β (γ (s))− β0

)>xi

−(δ (γ (s))− δ0

)>xi1i (γ0 (s))

)δ (γ (s))

>xi∆i(s)Ki (s)

=∑i∈Λn

uiδ (γ (s))>xi∆i(s)Ki (s)

+∑i∈Λn

δ>0 xi 1i (γ0 (si))− 1i (γ0 (s)) δ (γ (s))>xi∆i(s)Ki (s)

−∑i∈Λn

(β (γ (s))− β0

)>xi +

(δ (γ (s))− δ0

)>xi1i (γ0 (s))

δ (γ (s))

>xi∆i(s)Ki (s)

=∑i∈Λn

uiδ>0 xi∆i(s)Ki (s) +

∑i∈Λn

δ>0 xi 1i (γ0 (si))− 1i (γ0 (s)) δ>0 xi∆i(s)Ki (s) + op(1)

= B∗n (r, s) +B∗∗n (r, s) + op (1) ,

where we let

B∗∗n (r, s) ≡∑i∈Λn

δ>0 xi 1i (γ0 (si))− 1i (γ0 (s)) δ>0 xi∆i(s)Ki (s) .

In Lemma A.10 below, we show that, if n1−2εb2n → % ∈ (0,∞),

B∗∗n (r, s)→p |r| c>0 D (γ0 (s) , s) c0f (γ0 (s) , s)

1

2−K0 (r, %; s)

+ %c>0 D (γ0 (s) , s) c0f (γ0 (s) , s) |γ0(s)| K1 (r, %; s)

as n→∞, where γ0 (·) is the first derivative of γ0(·) and Kj (r, %; s) =∫ |r|/(%|γ0(s)|)

0tjK (t) dt

for j = 0, 1.From Lemma A.8, it follows that

∆Q∗n(r; s) = −A∗n (r, s) + 2B∗∗n (r, s) + 2B∗n (r, s)

45

Page 47: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

⇒ −|r| c>0 D (γ0 (s) , s) c0f (γ0 (s) , s)

+ |r| c>0 D (γ0 (s) , s) c0f (γ0 (s) , s) 1− 2K0 (r, %; s)+2%c>0 D (γ0 (s) , s) c0f (γ0 (s) , s) |γ0(s)| K1 (r, %; s)

+2W (r)√c>0 V (γ0 (s) , s) c0f (γ0 (s) , s)κ2

= −2 |r| `D(s)K0 (r, %; s) + 2%`D(s) |γ0(s)| K1 (r, %; s) (A.40)

+2W (r)√`V (s),

where

`D(s) = c>0 D (γ0 (s) , s) c0f (γ0 (s) , s) ,

`V (s) = c>0 V (γ0 (s) , s) c0f (γ0 (s) , s)κ2.

However, if we let ξ(s) = `V (s)/`2D(s) > 0 and r = ξ(s)ν, we have

arg maxr∈R

(2W (r)

√`V (s)− 2 |r| `D(s)K0 (r, %; s) + 2%`D(s) |γ0(s)| K1 (r, %; s)

)= ξ(s) arg max

ν∈R

(W (ξ(s)ν)

√`V (s)− |ξ(s)ν| `D(s)K0 (ξ(s)ν, %; s) + %`D(s) |γ0(s)| K1 (ξ(s)ν, %; s)

)= ξ(s) arg max

ν∈R

(W (ν)

`V (s)

`D(s)− |ν| `V (s)

`D(s)K0 (ξ(s)ν, %; s) + %

`V (s)

`D(s)· |γ0(s)|ξ(s)

K1 (ξ(s)ν, %; s)

)= ξ(s) arg max

ν∈R

(W (ν)− |ν| K0 (ξ(s)ν, %; s) + %

|γ0(s)|ξ(s)

K1 (ξ(s)ν, %; s)

)similar to the proof of Theorem 1 in Hansen (2000). By Theorem 2.7 of Kim and Pollard(1990), it follows that (rewriting ν as r)

n1−2εbn (γ (s)− γ0 (s))→d ξ (s) arg maxr∈R

(W (r)− |r|ψ0 (r, %; s) + %

|γ0(s)|ξ(s)

ψ1 (r, %; s)

)as n→∞, where

ψj (r, %; s) =

∫ |r|ξ(s)/(%|γ0(s)|)

0

tjK (t) dt

for j = 0, 1. Finally, letting

µ (r, %; s) = − |r|ψ0 (r, %; s) + %|γ0(s)|ξ(s)

ψ1 (r, %; s) , (A.41)

E [arg maxr∈R (W (r) + µ (r, %; s))] = 0 follows from Lemmas A.11 and A.12 below.

Lemma A.10 For a given s ∈ S0, let r be the same term used in Lemma A.8. If n1−2εb2n →

% ∈ (0,∞), uniformly over r in any compact set,

B∗∗n (r, s) ≡∑i∈Λn

δ>0 xi 1i (γ0 (si))− 1i (γ0 (s)) δ>0 xi∆i(s)Ki (s)

→p |r| c>0 D (γ0 (s) , s) c0f (γ0 (s) , s)

1

2−K0 (r, %; s)

+ %c>0 D (γ0 (s) , s) c0f (γ0 (s) , s) |γ0(s)| K1 (r, %; s)

46

Page 48: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

as n→∞, where γ0 (·) is the first derivatives of γ0(·) and

Kj (r, %; s) =

∫ |r|/(%|γ0(s)|)

0

tjK (t) dt

for j = 0, 1.

Proof of Lemma A.10 See the supplementary material.

Lemma A.11 Let τ = arg maxr∈R (W (r) + µ(r)), where W (r) is a two-sided Brownianmotion in (12) and µ(r) is a continuous and symmetric function satisfying: µ(0) = 0,µ(−r) = µ(r), µ(r)/r1/2+ε is monotonically decreasing to −∞ on [r,∞) for some r > 0

and ε > 0. Then, E[τ ] = 0.

Proof of Lemma A.11 See the supplementary material.

Lemma A.12 For any given % < ∞ and s ∈ S0, µ (r, %; s) in (A.41) satisfies conditions inLemma A.11.

Proof of Lemma A.12 See the supplementary material.

Proof of Corollary 1 Under H0 : γ0 (s) = γ∗ (s), we write

LRn(s) =1

nbn

∑i∈Λn

K

(si − sbn

)× Qn (γ∗ (s) , s)−Qn (γ (s) , s)

(nbn)−1Qn (γ (s) , s).

From (A.19) and (A.21), we have

1

nbnQn (γ (s) , s) =

1

nbn

∑i∈Λn

u2iKi (s) + op(1)→p E

[u2i |si = s

]fs (s)

as n → ∞, where fs (s) is the marginal density of si. In addition, from Theorem 3 andLemmas A.3 and A.9, we have

Qn (γ0 (s) , s)−Qn (γ (s) , s)

= Q∗n (γ0 (s) , s)−Q∗n (γ (s) , s)

+(θ (γ (s))− θ (γ0 (s))

)>Z(γ0(s); s)Z(γ0(s); s)>

(θ (γ (s))− θ (γ0 (s))

)= Q∗n (γ0 (s) , s)−Q∗n (γ (s) , s) + op(1),

where Z(γ; s) is defined in Lemma A.5. Similar to Theorem 2 of Hansen (2000), the rest ofthe proof follows from the change of variables and the continuous mapping theorem becausethe limiting expression in (A.40) and (nbn)−1

∑i∈Λn

Ki (s) →p fs (s) by the standard resultof the kernel density estimator.

A.4 Proof of Theorem 4 (Uniform Convergence)

We let φ2n = log n/an, where an = n1−2εbn and ε is given in Assumption A-(ii). We alsodefine Gn(S0; Γ) as a class of cadlag and piecewise constant functions S0 7→ Γ with at most

47

Page 49: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

n discontinuity points. Recall that Tn (γ; s), T n (γ; s), and Lnj (γ; s) are defined in (A.23),(A.24), and (A.25), respectively; sups∈S0 |γ (s)− γ0 (s)| is bounded since γ (s) ∈ Γ, a compactset, for any s ∈ S0.

Lemma A.13 There exist constants C∗ and C∗ such that for any γ (·) ∈ Gn(S0; Γ)

sups∈S0|Tn (γ; s)− E [Tn (γ; s)]| ≤ C∗

(sups∈S0|γ (s)− γ0 (s)| log n

nbn

)1/2

sups∈S0

∣∣T n (γ; s)− E[T n (γ; s)

]∣∣ ≤ C∗(

sups∈S0|γ (s)− γ0 (s)| log n

nbn

)1/2

almost surely if n1−2εb2n → % <∞.

Proof of Lemma A.13 See the supplementary material.

Lemma A.14 There exists some constant CL such that for any γ (·) ∈ Gn(S0; Γ) and anyj = 1, . . . ,dim(x)

sups∈S0|Lnj (γ; s)| = CL

(sups∈S0|γ (s)− γ0 (s)| log n

)1/2

almost surely if n1−2εb2n → % <∞.

Proof of Lemma A.14 See the supplementary material.

Lemma A.15 For any γ (·) ∈ Gn(S0; Γ), η > 0, and ε > 0, there exist constants C, r, CT ,and CT such that if n

1−2εb2n → % <∞ and n is suffi ciently large,

P

(inf

rφ2n<sups∈S0 |γ(s)−γ0(s)|<C

sups∈S0 Tn (γ; s)

sups∈S0 |γ (s)− γ0 (s)| < CT (1− η)

)≤ ε, (A.42)

P

(sup

rφ2n<sups∈S0 |γ(s)−γ0(s)|<C

sups∈S0 T n (γ; s)

sups∈S0 |γ (s)− γ0 (s)| > CT (1 + η)

)≤ ε, (A.43)

and for j = 1, . . . ,dim(x)

P

(sup

rφ2n<sups∈S0 |γ(s)−γ0(s)|<C

sups∈S0 |Lnj (γ; s)|√an sups∈S0 |γ (s)− γ0 (s)| > η

)≤ ε. (A.44)

Proof of Lemma A.15 See the supplementary material.

Lemma A.16 nε sups∈S0 ||θ(γ(s))− θ0|| = op(1).

Proof of Lemma A.16 See the supplementary material.

48

Page 50: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

Proof of Theorem 4 Note that γ(·) belongs to Gn(S0; Γ). For Q∗n(·; ·) defined in (A.29),since sups∈S0 (Q∗n(γ(s); s)−Q∗n(γ0(s); s)) ≤ 0 by construction, it suffi ces to show that asn→∞,

P(

sups∈S0Q∗n(γ(s); s)−Q∗n(γ0(s); s) > 0

)→ 1

for any γ (·) ∈ Gn(S0; Γ) such that sups∈S0 |γ (s)− γ0 (s)| > rφ2n where r is chosen in LemmaA.15.

To this end, consider γ (·) such that rφ2n ≤ sups∈S0 |γ (s)− γ0 (s)| ≤ C for some 0 <

r,C <∞. Then, similarly as (A.34) and using Lemmas A.15 and A.16, we have

Q∗n(γ(s); s)−Q∗n(γ0(s); s)

an sups∈S0 |γ(s)− γ0(s)|

≥ Tn (γ; s)

sups∈S0 |γ(s)− γ0(s)| −2 dim(x) ‖c0‖∞ ‖Ln (γ; s)‖∞√an sups∈S0 |γ(s)− γ0(s)| −

2C∗n(s)bnsups∈S0 |γ(s)− γ0(s)| + op(1)

> 0

for suffi ciently large n and small η(s), where all the notations are the same as in (A.34). Notethat the C∗n(s) term in (A.33) satisfies sups∈S0 C

∗n (s) = Oa.s.(1) from A.4, and

suprφ2n<|γ(s)−γ0(s)|<C

sups∈S0 C∗n (s) bn

sups∈S0 |γ(s)− γ0(s)| <sups∈S0 C

∗n(s)bn

rφ2n

=sups∈S0 C

∗n(s)

r

(anbnlog n

)= oa.s.(1)

given anbn → % <∞. Thus, we have

P

(sup

rφ2n<|γ(s)−γ0(s)|<C

2 sups∈S0 C∗(s)bn

sups∈S0 |γ (s)− γ0 (s)| > η

)≤ ε

3

when n is suffi ciently large. Therefore, for any ε ∈ (0, 1) and η > 0,

P

(inf

rφ2n<sups∈S0 |γ(s)−γ0(s)|<Csups∈S0Q∗n(γ(s); s)−Q∗n(γ0(s); s) > η

)≥ 1− ε,

which completes the proof by the same argument as Theorem 2.

A.5 Proof of Theorem 5 (Asymptotic Normality of θ)

Proof of Theorem 5 We let 1S0 = 1[si ∈ S0] and consider a sequence of positive constantsπn → 0 as n→∞. Then,

√n(β − β0

)=

(1

n

∑i∈Λn

xix>i 1 [qi > γ (si) + πn]1S0

)−1

×

1√n

∑i∈Λn

xiui1 [qi > γ0 (si) + πn]1S0

49

Page 51: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

+1√n

∑i∈Λn

xiui 1 [qi > γ (si) + πn]− 1 [qi > γ0 (si) + πn]1S0

+1√n

∑i∈Λn

xix>i δ01 [qi ≤ γ0 (si)]1 [qi > γ (si) + πn]1S0

≡ Ξ−1

β0 Ξβ1 + Ξβ2 + Ξβ3 (A.45)

and

√n(δ∗− δ∗0

)=

(1

n

∑i∈Λn

xix>i 1 [qi < γ (si)− πn]1S0

)−1

×

1√n

∑i∈Λn

xiui1 [qi < γ0 (si)− πn]1S0

+1√n

∑i∈Λn

xiui 1 [qi < γ (si)− πn]− 1 [qi < γ0 (si)− πn]1S0

+1√n

∑i∈Λn

xix>i δ∗01 [qi > γ0 (si)]1 [qi < γ (si)− πn]1S0

≡ Ξ−1

δ0 Ξδ1 + Ξδ2 + Ξδ3 , (A.46)

where Ξβ2, Ξβ3, Ξδ2, and Ξδ3 are all op(1) from Lemma A.17 below, provided φ2n/πn → 0 asn→∞. Therefore,

√n(θ∗− θ∗0

)=

(Ξβ0 00 Ξδ0

)−1(Ξβ1

Ξδ1

)+ op (1)

and the desired result follows once we establish that

Ξβ0 →p E[xix>i 1 [qi > γ0 (si)]1S0

], (A.47)

Ξδ0 →p E[xix>i 1 [qi < γ0 (si)]1S0

], (A.48)

and (Ξβ1

Ξδ1

)→d N

(0, limn→∞

1

nV ar

[( ∑i∈Λn

xiui1 [qi > γ0 (si)]1S0∑i∈Λn

xiui1 [qi < γ0 (si)]1S0

)])(A.49)

as n→∞.First, by Assumptions A-(v) and (ix), (A.47) can be readily verified since we have

1

n

∑i∈Λn

xix>i 1 [qi > γ (si) + πn]1S0

=1

n

∑i∈Λn

xix>i 1 [qi > γ0 (si) + πn]1S0

+1

n

∑i∈Λn

xix>i 1 [qi > γ (si) + πn]− 1 [qi > γ0 (si) + πn]1S0

=1

n

∑i∈Λn

xix>i 1 [qi > γ0 (si) + πn]1S0 +Op (φ2n)

50

Page 52: Threshold Regression with Nonparametric Sample Splitting · conditions on n in Assumption A later. The threshold function 0: R !R as well as the regression coe¢ cients 0 = ( > 0;

with πn → 0 as n→∞. More precisely, given Theorem 4, we consider γ (s) in a neighborhoodof γ0 (s) with uniform distance at most rφ2n for some large enough constant r. We define anon-random function γ (s) = γ0 (s)+rφ2n. Then, on the event E

∗n = sups∈S0 |γ (s)− γ0 (s)| ≤

rφ2n,

E[xix>i 1 [qi > γ (si) + πn]− 1 [qi > γ0 (si) + πn]1S0

](A.50)

≤ E[xix>i 1 [qi > γ (si) + πn]− 1 [qi > γ0 (si) + πn]1S0

]=

∫S0

∫ γ(v)+πn

γ0(v)+πn

D (q, v) f (q, v) dqdv

=

∫S0D (γ0 (v) , v) f (γ0 (v) , v) (γ (v)− γ0 (v)) + op (φ2n) dv

≤ rφ2n

∫D (γ0 (v) , v) f (γ0 (v) , v) dv

= Op (φ2n) = op (1)

from Theorem 4, Assumptions A-(v), (vii), and (ix). (A.48) can be verified symmetrically. Us-ing a similar argument, since E [xiui1 [qi > γ0 (si)]1S0 ] = E [xiui1 [qi < γ0 (si)]1S0 ] = 0 fromAssumption ID-(i), the asymptotic normality in (A.49) follows by the Theorem of Bolthausen(1982) under Assumption A-(iii), which completes the proof.

Lemma A.17 When φ2n → 0 as n→∞, if we let πn > 0 such that πn → 0 and φ2n/πn → 0

as n→∞, then it holds that Ξβ2, Ξβ3, Ξδ2, and Ξδ3 in (A.45) and (A.46) are all op(1).

Proof of Lemma A.17 See the supplementary material.

References

Ananat, E. O. (2011): "The Wrong Side(s) of the Tracks: The Causal Effects of Racial Segregation on Urban Poverty and Inequality," American Economic Journal: Applied Economics, 3(2), 34–66.

Andrews, D. W. K. (1994): "Asymptotics for Semiparametric Econometric Models via Stochastic Equicontinuity," Econometrica, 62(1), 43–72.

Bai, J. (1997): "Estimation of a Change Point in Multiple Regressions," Review of Economics and Statistics, 79, 551–563.

Bai, J., and P. Perron (1998): "Estimating and Testing Linear Models with Multiple Structural Changes," Econometrica, 66, 47–78.

Bhattacharya, P. K., and P. J. Brockwell (1976): "The Minimum of an Additive Process with Applications to Signal Estimation and Storage Theory," Z. Wahrsch. Verw. Gebiete, 37, 51–75.

Billingsley, P. (1968): Convergence of Probability Measures. Wiley, New York.

Bolthausen, E. (1982): "On the Central Limit Theorem for Stationary Mixing Random Fields," The Annals of Probability, 10(4), 1047–1050.

Caner, M., and B. E. Hansen (2004): "Instrumental Variable Estimation of a Threshold Model," Econometric Theory, 20, 813–843.

Carbon, M., C. Francq, and L. T. Tran (2007): "Kernel Regression Estimation for Random Fields," Journal of Statistical Planning and Inference, 137(3), 778–798.

Card, D., A. Mas, and J. Rothstein (2008): "Tipping and the Dynamics of Segregation," Quarterly Journal of Economics, 123(1), 177–218.

Chan, K. S. (1993): "Consistency and Limiting Distribution of the Least Squares Estimator of a Threshold Autoregressive Model," Annals of Statistics, 21, 520–533.

Chiou, Y., M. Chen, and J. Chen (2018): "Nonparametric Regression with Multiple Thresholds: Estimation and Inference," Journal of Econometrics, 206, 472–514.

Conley, T. G. (1999): "GMM Estimation with Cross Sectional Dependence," Journal of Econometrics, 92, 1–45.

Conley, T. G., and F. Molinari (2007): "Spatial Correlation Robust Inference with Errors in Location or Distance," Journal of Econometrics, 140(1), 76–96.

Darity, W. A., and P. L. Mason (1998): "Evidence on Discrimination in Employment: Codes of Color, Codes of Gender," Journal of Economic Perspectives, 12(2), 63–90.

Delgado, M. A., and J. Hidalgo (2000): "Nonparametric Inference on Structural Breaks," Journal of Econometrics, 96(1), 113–144.

Dingel, J. I., A. Miscio, and D. R. Davis (2019): "Cities, Lights, and Skills in Developing Economies," Journal of Urban Economics, forthcoming.

Hall, P., and C. C. Heyde (1980): Martingale Limit Theory and Its Application. Academic Press, New York.

Hansen, B. E. (2000): "Sample Splitting and Threshold Estimation," Econometrica, 68, 575–603.

Heilmann, K. (2018): "Transit Access and Neighborhood Segregation: Evidence from the Dallas Light Rail System," Regional Science and Urban Economics, 73, 237–250.

Henderson, D. J., C. F. Parmeter, and L. Su (2017): "Nonparametric Threshold Regression: Estimation and Inference," Working Paper.

Henderson, J. V., A. Storeygard, and D. N. Weil (2012): "Measuring Economic Growth from Outer Space," American Economic Review, 102(2), 994–1028.

Hidalgo, J., J. Lee, and M. H. Seo (2019): "Robust Inference for Threshold Regression Models," Journal of Econometrics, 210, 291–309.

Jenish, N., and I. R. Prucha (2009): "Central Limit Theorems and Uniform Laws of Large Numbers for Arrays of Random Fields," Journal of Econometrics, 150, 86–98.

Kim, J., and D. Pollard (1990): "Cube Root Asymptotics," Annals of Statistics, 18, 191–219.

Lee, S., Y. Liao, M. H. Seo, and Y. Shin (2020): "Factor-driven Two-regime Regression," Annals of Statistics, forthcoming.

Lee, S., M. H. Seo, and Y. Shin (2011): "Testing for Threshold Effects in Regression Models," Journal of the American Statistical Association, 106(493), 220–231.

Li, D., and S. Ling (2012): "On the Least Squares Estimation of Multiple-Regime Threshold Autoregressive Models," Journal of Econometrics, 167, 240–253.

Newey, W. K., and D. McFadden (1994): "Large Sample Estimation and Hypothesis Testing," in Handbook of Econometrics, vol. 4, chap. 36, pp. 2111–2245. Elsevier.

Rozenfeld, H. D., D. Rybski, X. Gabaix, and H. A. Makse (2011): "The Area and Population of Cities: New Insights from a Different Perspective on Cities," American Economic Review, 101(5), 2205–2225.

Schelling, T. C. (1971): "Dynamic Models of Segregation," Journal of Mathematical Sociology, 1(2), 143–186.

Seo, M. H., and O. Linton (2007): "A Smooth Least Squares Estimator for Threshold Regression Models," Journal of Econometrics, 141(2), 704–735.

Tong, H. (1983): Threshold Models in Nonlinear Time Series Analysis (Lecture Notes in Statistics No. 21). Springer-Verlag, New York.

Vogel, K. B., R. Goldblatt, G. Hanson, and A. K. Khandelwal (2019): "Detecting Urban Markets with Satellite Imagery: An Application to India," Journal of Urban Economics, forthcoming.

Yu, P. (2012): "Likelihood Estimation and Inference in Threshold Regression," Journal of Econometrics, 167, 274–294.

Yu, P., and X. Fan (2020): "Threshold Regression with a Threshold Boundary," Journal of Business & Economic Statistics, forthcoming.

Yu, P., Q. Liao, and P. Phillips (2019): "Inferences and Specification Testing in Threshold Regression with Endogeneity," Working Paper.

Yu, P., and P. Phillips (2018): "Threshold Regression with Endogeneity," Journal of Econometrics, 203, 50–68.

Supplementary Material for "Threshold Regression with Nonparametric Sample Splitting"

By Yoonseok Lee and Yulong Wang

January 2021

This supplementary material contains omitted proofs of some technical lemmas.

Proof of Lemma A.4 For expositional simplicity, we only present the case of scalar $x_i$. Similarly as (A.11), we have
\begin{align}
E[\Delta M_n(s)] &= \int\!\!\int \bar D(q, s+b_n t)\left\{\mathbf{1}[q < \gamma_0(s+b_n t)] - \mathbf{1}[q < \gamma_0(s)]\right\} K(t)\, dq\, dt \tag{B.1}\\
&= \int_{\mathcal{T}^+(s)} \int_{\gamma_0(s)}^{\gamma_0(s+b_n t)} \bar D(q, s+b_n t) K(t)\, dq\, dt + \int_{\mathcal{T}^-(s)} \int_{\gamma_0(s+b_n t)}^{\gamma_0(s)} \bar D(q, s+b_n t) K(t)\, dq\, dt \notag\\
&\equiv \Psi_M^+(s) + \Psi_M^-(s), \notag
\end{align}
where $\bar D(q, s+b_n t) = D(q, s+b_n t) f(q, s+b_n t)$, and we denote $\mathcal{T}^+(s) = \{t : \gamma_0(s) \le \gamma_0(s+b_n t)\}$ and $\mathcal{T}^-(s) = \{t : \gamma_0(s) > \gamma_0(s+b_n t)\}$. We consider the three cases of $\dot\gamma_0(s) = \partial\gamma_0(s)/\partial s > 0$, $\dot\gamma_0(s) < 0$, and $\dot\gamma_0(s) = 0$ separately, which are well-defined from Assumption A-(vi).

First, we suppose $\dot\gamma_0(s) > 0$. We choose a positive sequence $t_n \to \infty$ such that $t_n b_n \to 0$ as $n \to \infty$. It follows that for any fixed $\epsilon > 0$, $t_n b_n \le \epsilon$ if $n$ is sufficiently large, and hence $\mathcal{T}^+(s) \cap \{t : |t| \le t_n\}$ becomes $[0, t_n]$ since $\gamma_0(\cdot)$ is continuous. The mean value theorem gives
\begin{align}
\Psi_M^+(s) &= \int_0^{t_n} \int_{\gamma_0(s)}^{\gamma_0(s+b_n t)} \bar D(q, s+b_n t) K(t)\, dq\, dt + \int_{|t| > t_n} \int_{\gamma_0(s)}^{\gamma_0(s+b_n t)} \bar D(q, s+b_n t) K(t)\, dq\, dt \tag{B.2}\\
&\le \left\{ b_n \bar D(\gamma_0(s), s) \dot\gamma_0(s) \int_0^{t_n} t K(t)\, dt + O(b_n^2) \right\} + O(b_n) \int_{t_n}^\infty t K(t)\, dt \notag\\
&= b_n \bar D(\gamma_0(s), s) \dot\gamma_0(s) \int_0^\infty t K(t)\, dt + O(b_n^2), \notag
\end{align}
where $\bar D(q, s+b_n t) < \infty$ and $\dot\gamma_0(s) < \infty$ from Assumptions A-(vi) and (vii). Note that Assumption A-(x) implies $K(t) = o(t^{-(2+\eta)})$ for some $\eta > 0$ as $t \to \infty$, and hence $\int_{t_n}^\infty t K(t)\, dt \to 0$ as $t_n \to \infty$. Similarly,
\[
\Psi_M^-(s) = \int_{-t_n}^0 \int_{\gamma_0(s+b_n t)}^{\gamma_0(s)} \bar D(q, s+b_n t) K(t)\, dq\, dt + o(b_n) = -\bar D(\gamma_0(s), s)\dot\gamma_0(s)\, b_n \int_{-\infty}^0 t K(t)\, dt + O(b_n^2),
\]
which yields $E[\Delta M_n(s)] = \bar D(\gamma_0(s), s)\dot\gamma_0(s) b_n + O(b_n^2)$. When $\dot\gamma_0(s) < 0$, we can symmetrically show that $E[\Delta M_n(s)] = -\bar D(\gamma_0(s), s)\dot\gamma_0(s) b_n + O(b_n^2)$.

Second, we suppose $\dot\gamma_0(s) = 0$ and $s$ is a local maximizer. Then $\mathcal{T}^+(s) \cap \{t : |t| \le t_n\}$ becomes empty and hence
\[
\Psi_M^+(s) = 0 + \int_{|t| > t_n}\int_{\gamma_0(s)}^{\gamma_0(s+b_nt)} \bar D(q, s+b_nt) K(t)\, dq\, dt = o(b_n).
\]
However, $\mathcal{T}^-(s) \cap \{t : |t| \le t_n\}$ becomes $\{t : |t| \le t_n\}$ in this case and hence
\[
\Psi_M^-(s) = \int_{-\infty}^{\infty}\int_{\gamma_0(s+b_nt)}^{\gamma_0(s)} \bar D(q, s+b_nt) K(t)\, dq\, dt = -\bar D(\gamma_0(s), s)\dot\gamma_0(s) b_n\int_{-\infty}^{\infty}tK(t)\,dt + O(b_n^2) = O(b_n^2)
\]
since $\int_{-\infty}^{\infty}tK(t)\,dt = 0$. When $\dot\gamma_0(s) = 0$ and $s$ is a local minimizer, we can similarly show that
\begin{equation}
\Psi_M^+(s) = \int_{-\infty}^{\infty}\int_{\gamma_0(s)}^{\gamma_0(s+b_nt)} \bar D(q, s+b_nt) K(t)\, dq\, dt = \bar D(\gamma_0(s), s)\dot\gamma_0(s) b_n\int_{-\infty}^{\infty}tK(t)\,dt + O(b_n^2) = O(b_n^2) \tag{B.3}
\end{equation}
but $\Psi_M^-(s) = o(b_n)$. By combining these results, we have $E[\Delta M_n(s)] = \bar D(\gamma_0(s), s)|\dot\gamma_0(s)|b_n + O(b_n^2)$ for a given $s \in S_0$, and hence
\[
\sup_{s\in S_0} E[\Delta M_n(s)] = O(b_n)
\]
since $\sup_{s\in S_0}\bar D(\gamma_0(s), s)|\dot\gamma_0(s)| < \infty$ from Assumptions A-(vi) and (vii).

The desired uniform convergence result then follows if $\sup_{s\in S_0}\|\Delta M_n(s) - E[\Delta M_n(s)]\| = o(b_n)$ almost surely, which can be shown as in Theorem 2.2 of Carbon, Francq, and Tran (2007) (see also Section 3 in Tran (1990) and Section 5 in Carbon, Tran, and Wu (1997)). Similarly as in the proof of (A.14) in Lemma A.3, we let $\tau_n = (n\log n)^{1/(4+\varphi)}$ and define
\[
\Delta\widetilde M_n^{\tau}(s) = \frac{1}{nb_n}\sum_{i\in\Lambda_n} x_i^2 \Delta_i(s_i,s) K_i(s)\mathbf{1}_{\tau_n}
\]
as in (A.13), where $\Delta_i(s_i,s) = \mathbf{1}_i(\gamma_0(s_i)) - \mathbf{1}_i(\gamma_0(s))$ and $\mathbf{1}_{\tau_n} = \mathbf{1}\{x_i^2 \le \tau_n\}$. We also let $m_n$ be an integer such that $m_n = O(\tau_n n^{1-2\epsilon}/b_n^2)$, and we cover the compact $S_0$ by $m_n$ small intervals centered at $s_k$, defined as $I_k = \{s' : |s'-s_k| \le C/m_n\}$ for some $C < \infty$. Then,
\begin{align}
\sup_{s\in S_0}\|\Delta\widetilde M_n^{\tau}(s) - E[\Delta\widetilde M_n^{\tau}(s)]\| &\le \max_{1\le k\le m_n}\sup_{s\in I_k}\|\Delta\widetilde M_n^{\tau}(s) - \Delta\widetilde M_n^{\tau}(s_k)\| \notag\\
&\quad + \max_{1\le k\le m_n}\sup_{s\in I_k}\|E[\Delta\widetilde M_n^{\tau}(s)] - E[\Delta\widetilde M_n^{\tau}(s_k)]\| \notag\\
&\quad + \max_{1\le k\le m_n}\|\Delta\widetilde M_n^{\tau}(s_k) - E[\Delta\widetilde M_n^{\tau}(s_k)]\| \notag\\
&\equiv \Psi_{\Delta M1} + \Psi_{\Delta M2} + \Psi_{\Delta M3}. \tag{B.4}
\end{align}
However,
\begin{align*}
\|\Delta\widetilde M_n^{\tau}(s) - \Delta\widetilde M_n^{\tau}(s_k)\| &\le \frac{1}{nb_n}\sum_{i\in\Lambda_n} x_i^2 |\Delta_i(s_i,s) - \Delta_i(s_i,s_k)| K_i(s_k)\mathbf{1}_{\tau_n} + \frac{1}{nb_n}\sum_{i\in\Lambda_n} x_i^2 |\Delta_i(s_i,s)||K_i(s) - K_i(s_k)|\mathbf{1}_{\tau_n}\\
&= \frac{1}{nb_n}\sum_{i\in\Lambda_n} x_i^2 |\mathbf{1}_i(\gamma_0(s_k)) - \mathbf{1}_i(\gamma_0(s))| K_i(s_k)\mathbf{1}_{\tau_n} + \frac{1}{nb_n}\sum_{i\in\Lambda_n} x_i^2 |\mathbf{1}_i(\gamma_0(s_i)) - \mathbf{1}_i(\gamma_0(s))||K_i(s) - K_i(s_k)|\mathbf{1}_{\tau_n}\\
&= O_{a.s.}\!\left(\frac{\tau_n}{b_n^2 m_n}\right) = O_{a.s.}\!\left(\frac{1}{n^{1-2\epsilon}}\right)
\end{align*}
as in (A.15) and (A.16), and hence $\Psi_{\Delta M1} = \Psi_{\Delta M2} = o_{a.s.}(b_n)$ as $n^{1-2\epsilon}b_n \to \infty$. We also have $\Psi_{\Delta M3} = o_{a.s.}(b_n)$ as proved below, which completes the proof. [Footnote 6: Unlike in Lemma A.3, we cannot directly use the results for $Q_{3n}^*$ in Carbon, Francq, and Tran (2007) here, because $O((\log n/(nb_n))^{1/2})$ is not necessarily $o(b_n)$ without further restrictions.]
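Remark (numerical illustration). Before the variance part, the bias calculation in (B.1)-(B.3) can be seen numerically. The following is a minimal sketch and not part of the formal argument: it assumes a standard Gaussian kernel, a hypothetical threshold function gamma_0(s) = s^2, and D-bar(q, v) = phi(q), the standard normal density, all placeholder choices, and evaluates the double integral in (B.1) by quadrature. The printed ratios show an O(b_n) bias at s = 1, where the derivative of gamma_0 is 2 and nonzero, and an O(b_n^2) bias at the stationary point s = 0.

from scipy import integrate
from scipy.stats import norm

def gamma0(s):                        # hypothetical threshold function (not from the paper)
    return s ** 2

def bias(s, bn):
    # E[Delta M_n(s)] = int int Dbar(q, s + bn t) |1[q < gamma0(s + bn t)] - 1[q < gamma0(s)]| K(t) dq dt,
    # with Dbar(q, v) = phi(q) and K the standard normal density (placeholder choices).
    def inner(t):
        lo, hi = sorted((gamma0(s), gamma0(s + bn * t)))
        return (norm.cdf(hi) - norm.cdf(lo)) * norm.pdf(t)
    val, _ = integrate.quad(inner, -40.0, 40.0, limit=200)
    return val

for s in (1.0, 0.0):
    for bn in (0.1, 0.05, 0.025):
        b = bias(s, bn)
        print(f"s={s}, b_n={bn}: bias={b:.2e}, bias/b_n={b / bn:.4f}")
# At s = 1 the ratio bias/b_n stabilizes at a positive constant (O(b_n) bias, with the kernel
# moment absorbed into the constant); at s = 0 it vanishes with b_n (O(b_n^2) bias).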

Proof of $\Psi_{\Delta M3} = o_{a.s.}(b_n)$: We let
\[
Z_i^{\tau}(s) = (nb_n)^{-1}\left\{ (c_0^\top x_i)^2 \Delta_i(s_i,s) K_i(s)\mathbf{1}_{\tau_n} - E[(c_0^\top x_i)^2 \Delta_i(s_i,s) K_i(s)\mathbf{1}_{\tau_n}] \right\}
\]
and apply the blocking technique as in Carbon, Francq, and Tran (2007), p. 788. For $i = (i_1,i_2) \in \Lambda_n \subset \mathbb{R}^2$, let $n_1$ and $n_2$ be the numbers of grid points in the two dimensions, so that $|\Lambda_n| = n = n_1n_2$. Without loss of generality, we assume $n_\ell = 2wr_\ell$ for $\ell = 1,2$, where $w$ and $r_\ell$ are constants to be specified later. For $j = (j_1,j_2)$, define
\begin{align}
U^{[1]}(j;s) &= \sum_{i_1=2j_1w+1}^{(2j_1+1)w}\ \sum_{i_2=2j_2w+1}^{(2j_2+1)w} Z_i^{\tau}(s), \tag{B.5}\\
U^{[2]}(j;s) &= \sum_{i_1=2j_1w+1}^{(2j_1+1)w}\ \sum_{i_2=(2j_2+1)w+1}^{2(j_2+1)w} Z_i^{\tau}(s), \notag\\
U^{[3]}(j;s) &= \sum_{i_1=(2j_1+1)w+1}^{2(j_1+1)w}\ \sum_{i_2=2j_2w+1}^{(2j_2+1)w} Z_i^{\tau}(s), \notag\\
U^{[4]}(j;s) &= \sum_{i_1=(2j_1+1)w+1}^{2(j_1+1)w}\ \sum_{i_2=(2j_2+1)w+1}^{2(j_2+1)w} Z_i^{\tau}(s), \notag
\end{align}
and define four blocks as
\[
B^{[h]}(s) = \sum_{j_1=0}^{r_1-1}\sum_{j_2=0}^{r_2-1} U^{[h]}(j;s) \quad \text{for } h = 1,2,3,4,
\]
so that $\sum_{i\in\Lambda_n} Z_i^{\tau}(s) = \sum_{h=1}^4 B^{[h]}(s)$ and $\Psi_{\Delta M3} = \max_{1\le k\le m_n}|\sum_{h=1}^4 B^{[h]}(s_k)|$. Since these four blocks have the same number of summands, it suffices to show $\max_{1\le k\le m_n}|B^{[1]}(s_k)| = o_{a.s.}(b_n)$. To this end, we show that for some $\epsilon_n = o(b_n)$,
\begin{equation}
P\left(\max_{1\le k\le m_n}|B^{[1]}(s_k)| > \epsilon_n\right) \le \sum_{k=1}^{m_n} P(|B^{[1]}(s_k)| > \epsilon_n) \le m_n\sup_{s\in S_0} P(|B^{[1]}(s)| > \epsilon_n) = O(n^{-c}) \tag{B.6}
\end{equation}
for some $c > 1$, and hence $\sum_{n=1}^{\infty} P(\max_{1\le k\le m_n}|B^{[1]}(s_k)| > \epsilon_n) < \infty$. The almost sure convergence is then obtained by the Borel-Cantelli lemma.

For any $s \in S_0$, $B^{[1]}(s)$ is the sum of $r = r_1r_2 = n/(4w^2)$ terms $U^{[1]}(j;s)$. In addition, $U^{[1]}(j;s)$ is measurable with respect to the $\sigma$-field generated by the $Z_i^{\tau}(s)$ with $i$ belonging to the set
\[
\{i = (i_1,i_2) : 2j_\ell w + 1 \le i_\ell \le (2j_\ell+1)w \text{ for } \ell = 1,2\}.
\]
These sets are separated by a distance of at least $w$. We enumerate the random variables $U^{[1]}(j;s)$ and the corresponding $\sigma$-fields with which they are measurable in an arbitrary manner, and refer to those $U^{[1]}(j;s)$'s as $U_1(s), U_2(s), \ldots, U_r(s)$. By the uniform almost sure law of large numbers in random fields (e.g., Theorem 2 in Jenish and Prucha (2009)) and the fact that $E[K_i(s)b_n^{-1}] \le C$, we have that for any $t = 1,\ldots,r$ and $s \in S_0$,
\begin{align}
|U_t(s)| &\le \frac{Cw^2\tau_n}{n}\left( \frac{1}{w^2 b_n}\sum_{i_1=2j_1w+1}^{(2j_1+1)w}\sum_{i_2=2j_2w+1}^{(2j_2+1)w} |\Delta_i(\gamma;s)| K_i(s) \right) \tag{B.7}\\
&\le \frac{Cw^2\tau_n}{n}\left( \frac{1}{w^2}\sum_{i_1=2j_1w+1}^{(2j_1+1)w}\sum_{i_2=2j_2w+1}^{(2j_2+1)w} K_i(s) b_n^{-1} \right) = \frac{C'w^2\tau_n}{n} \notag
\end{align}
almost surely from (B.5), for some $C, C' < \infty$, where the last equality is obtained similarly as (B.2). From Lemma 3.6 in Carbon, Francq, and Tran (2007), we can approximate $\{U_t(s)\}_{t=1}^r$ by another sequence of random variables $\{U_t^*(s)\}_{t=1}^r$ that satisfies (i) the elements of $\{U_t^*(s)\}_{t=1}^r$ are independent, (ii) $U_t^*(s)$ has the same distribution as $U_t(s)$ for all $t = 1,\ldots,r$, and (iii)
\begin{equation}
\sum_{t=1}^r E[|U_t^*(s) - U_t(s)|] \le rC''n^{-1}w^2\tau_n\alpha_{w^2,w^2}(w) \tag{B.8}
\end{equation}
for some $C'' < \infty$. [Footnote 7: This approximation is reminiscent of Berbee's lemma (Berbee (1987)) and is based on Rio (1995), who studies the time series case. It can also be found as Lemma 4.5 in Carbon, Tran, and Wu (1997).] Recall that $\alpha_{w^2,w^2}(w)$ is the $\alpha$-mixing coefficient defined in (8). Then, it follows that
\begin{equation}
P(B^{[1]}(s) > \epsilon_n) \le P\left(\sum_{t=1}^r |U_t^*(s) - U_t(s)| > \epsilon_n\right) + P\left(\left|\sum_{t=1}^r U_t^*(s)\right| > \epsilon_n\right) \tag{B.9}
\end{equation}

for any given $s \in S_0$, and hence, in view of (B.6) and (B.9),
\begin{equation}
P\left(\max_{1\le k\le m_n}|B^{[1]}(s_k)| > \epsilon_n\right) \le m_n\sup_{s\in S_0} P\left(\sum_{t=1}^r |U_t^*(s) - U_t(s)| > \epsilon_n\right) + m_n\sup_{s\in S_0} P\left(\left|\sum_{t=1}^r U_t^*(s)\right| > \epsilon_n\right) \equiv P_{U1} + P_{U2}. \tag{B.10}
\end{equation}
First, we let $\epsilon_n = O((\log n/n)^{1/2})$. By Markov's inequality, (B.8), and Assumption A-(iii), we have
\[
P_{U1} \le m_n\frac{rC''n^{-1}w^2\tau_n\alpha_{w^2,w^2}(w)}{\epsilon_n} \le \frac{C_1 n^{\kappa_1}(\log n)^{\kappa_2}\exp(-C_1'n^{\kappa_3})}{(n^{1-2\epsilon}b_n)^2}
\]
for some $\kappa_1,\kappa_2,\kappa_3 > 0$ and $C_1, C_1' < \infty$. Recall that we chose $m_n = O(\tau_n n^{1-2\epsilon}/b_n^2)$, $n = 4w^2r$, and $\tau_n = (n\log n)^{1/(4+\varphi)}$. Hence $P_{U1} = O(\exp(-n^{\kappa_3})) \to 0$ as $n \to \infty$, since the exponential term diminishes faster than any polynomial order.

Second, we now choose an integer $w$ such that
\[
w = (n/(C_w\tau_n\lambda_n))^{1/2}, \qquad \lambda_n = (n\log n)^{1/2}
\]
for some large positive constant $C_w$. Note that substituting $\lambda_n$ and $\tau_n$ into $w$ gives
\[
w = O\left(\left[\frac{n^{\frac12-\frac{1}{4+\varphi}}}{(\log n)^{\frac{1}{4+\varphi}+\frac12}}\right]^{1/2}\right),
\]
which diverges as $n \to \infty$ for $\varphi > 0$. Since $U_t^*(s)$ has the same distribution as $U_t(s)$, $|U_t^*(s)|$ is also uniformly bounded by $C'n^{-1}\tau_nw^2$ almost surely for all $t = 1,\ldots,r$ from (B.7). Therefore, $|\lambda_nU_t^*(s)| \le 1/2$ for all $t$ if $C_w$ is chosen to be large enough. Using the inequality $\exp(v) \le 1 + v + v^2$ for $|v| \le 1/2$, we have $\exp(\lambda_nU_t^*(s)) \le 1 + \lambda_nU_t^*(s) + \lambda_n^2U_t^*(s)^2$. Hence
\begin{equation}
E[\exp(\lambda_nU_t^*(s))] \le 1 + \lambda_n^2E[U_t^*(s)^2] \le \exp(\lambda_n^2E[U_t^*(s)^2]) \tag{B.11}
\end{equation}
since $E[U_t^*(s)] = 0$ and $1 + v \le \exp(v)$ for $v \ge 0$. Using the fact that $P(X > c) \le E[\exp(Xa)]/\exp(ac)$ for any random variable $X$ and nonrandom constants $a$ and $c$, and that $\{U_t^*(s)\}_{t=1}^r$ are independent, we have
\begin{align}
P\left(\left|\sum_{t=1}^r U_t^*(s)\right| > \epsilon_n\right) &= P\left(\sum_{t=1}^r \lambda_nU_t^*(s) > \lambda_n\epsilon_n\right) + P\left(-\sum_{t=1}^r \lambda_nU_t^*(s) > \lambda_n\epsilon_n\right) \notag\\
&\le \frac{E\left[\exp\left(\lambda_n\sum_{t=1}^r U_t^*(s)\right)\right] + E\left[\exp\left(-\lambda_n\sum_{t=1}^r U_t^*(s)\right)\right]}{\exp(\lambda_n\epsilon_n)} \notag\\
&\le 2\exp(-\lambda_n\epsilon_n)\exp\left(\lambda_n^2\sum_{t=1}^r E[U_t^*(s)^2]\right) \tag{B.12}
\end{align}
by (B.11). However, using the same argument as in (B.1) above, we can show that
\[
E[U_t^*(s)^2] \le \sum_{\substack{1\le i_1\le w\\ 1\le i_2\le w}} E[Z_i^{\tau}(s)^2] + \sum_{\substack{i\ne j\\ 1\le i_1,i_2\le w\\ 1\le j_1,j_2\le w}} \mathrm{Cov}[Z_i^{\tau}(s), Z_j^{\tau}(s)] \le \frac{C_2w^2}{n^2}
\]
for some $C_2 < \infty$, which does not depend on $s$ given Assumptions A-(v) and (x). It follows that (B.12) satisfies
\begin{equation}
\sup_{s\in S_0} P\left(\left|\sum_{t=1}^r U_t^*(s)\right| > \epsilon_n\right) \le 2\exp\left(-\lambda_n\epsilon_n + \frac{C_2\lambda_n^2rw^2}{n^2}\right) = 2\exp(-\lambda_n\epsilon_n + C_2\lambda_n^2n^{-1}). \tag{B.13}
\end{equation}
We choose $\epsilon_n = C^*\lambda_n^{-1}\log n = C^*(\log n/n)^{1/2}$ for some $C^* > 0$ and have
\[
-\lambda_n\epsilon_n + C_2\lambda_n^2n^{-1} = -C^*\log n + C_2\log n = -(C^*-C_2)\log n.
\]
Therefore, in view of (B.13), we have
\[
m_n\sup_{s\in S_0} P\left(\left|\sum_{t=1}^r U_t^*(s)\right| > C^*\sqrt{\frac{\log n}{n}}\right) \le \frac{2m_n}{n^{C^*-C_2}} = \frac{C_3(\log n)^{\kappa_4}}{n^{\kappa_5}(n^{1-2\epsilon}b_n)^2}
\]
for some $C_3 < \infty$, $\kappa_4 > 0$, and $\kappa_5 > 1$ by choosing $C^*$ sufficiently large. Since $n^{1-2\epsilon}b_n \to \infty$, we have $P_{U2} = O(n^{-\kappa_5}) \to 0$ as $n \to \infty$. Therefore, the desired result follows since $\epsilon_n = O((\log n/n)^{1/2}) = o(b_n)$ from Assumption A-(ix) and $P_{U1} + P_{U2} = O(n^{-c})$ for some $c > 1$.
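Remark (numerical illustration). The interleaved blocking in (B.5) is easy to visualize in code. The sketch below is purely illustrative, with toy values of w, r_1, r_2 of our own choosing: it enumerates the four block families U^[1]-U^[4] on an n_1 x n_2 lattice with n_l = 2 w r_l, checks that they partition the lattice, and verifies that any two blocks within the same family are separated by at least w, which is exactly what lets the mixing coefficient alpha_{w^2,w^2}(w) control the dependence between the summands of B^[1](s).

w, r1, r2 = 3, 4, 5                      # toy block size and counts; n_l = 2*w*r_l
n1, n2 = 2 * w * r1, 2 * w * r2

def block(j1, j2, h):
    """Index set of U^[h]((j1, j2); s): a w x w square inside the 2w x 2w cell at (j1, j2)."""
    o1 = 0 if h in (1, 2) else w         # families 3 and 4 shift the i1-range by w
    o2 = 0 if h in (1, 3) else w         # families 2 and 4 shift the i2-range by w
    return {(i1, i2)
            for i1 in range(2 * j1 * w + 1 + o1, (2 * j1 + 1) * w + 1 + o1)
            for i2 in range(2 * j2 * w + 1 + o2, (2 * j2 + 1) * w + 1 + o2)}

# The four families together partition the lattice {1,...,n1} x {1,...,n2}.
all_pts = set().union(*(block(j1, j2, h) for h in (1, 2, 3, 4)
                        for j1 in range(r1) for j2 in range(r2)))
assert all_pts == {(i1, i2) for i1 in range(1, n1 + 1) for i2 in range(1, n2 + 1)}

# Within one family, distinct blocks are at least w apart (Chebyshev distance).
def gap(A, B):
    return min(max(abs(a1 - b1), abs(a2 - b2)) for (a1, a2) in A for (b1, b2) in B)

fam1 = [block(j1, j2, 1) for j1 in range(r1) for j2 in range(r2)]
assert all(gap(A, B) >= w for k, A in enumerate(fam1) for B in fam1[k + 1:])
print("partition and separation checks passed:", len(fam1), "blocks per family")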

Proof of Lemma A.6 For a given $s \in S_0$, we first show (A.26). We consider the case with $\gamma(s) > \gamma_0(s)$; the other direction can be shown symmetrically. Since $c_0^\top D(\cdot,s)c_0 f(\cdot,s)$ is continuous at $\gamma_0(s)$ and $c_0^\top D(\gamma_0(s),s)c_0 f(\gamma_0(s),s) > 0$ from Assumptions A-(vii) and (viii), there exists a sufficiently small $C(s) > 0$ such that
\begin{equation}
\ell_D(s) = \inf_{|\gamma(s)-\gamma_0(s)| < C(s)} c_0^\top D(\gamma(s),s)c_0 f(\gamma(s),s) > 0. \tag{B.14}
\end{equation}
By the mean value expansion and the fact that $T_n(\gamma;s) = c_0^\top(M_n(\gamma(s);s) - M_n(\gamma_0(s);s))c_0$, we have
\[
E[T_n(\gamma;s)] = \int\!\!\int_{\gamma_0(s)}^{\gamma(s)} E[(c_0^\top x_i)^2\,|\,q, s+b_nt]f(q,s+b_nt)K(t)\,dq\,dt = (\gamma(s)-\gamma_0(s))\,c_0^\top D(\tilde\gamma(s),s)c_0 f(\tilde\gamma(s),s)
\]
for some $\tilde\gamma(s) \in (\gamma_0(s), \gamma(s))$, which yields
\begin{equation}
E[T_n(\gamma;s)] \ge (\gamma(s)-\gamma_0(s))\ell_D(s). \tag{B.15}
\end{equation}
Furthermore, if we let $\Delta_i(\gamma;s) = \mathbf{1}_i(\gamma(s)) - \mathbf{1}_i(\gamma_0(s))$ and $Z_{n,i}(s) = (c_0^\top x_i)^2\Delta_i(\gamma;s)K_i(s) - E[(c_0^\top x_i)^2\Delta_i(\gamma;s)K_i(s)]$, then, using a similar argument as (A.3), we have
\begin{equation}
E[(T_n(\gamma;s) - E[T_n(\gamma;s)])^2] = \frac{1}{n^2b_n^2}\sum_{i\in\Lambda_n}E[Z_{n,i}^2(s)] + \frac{1}{n^2b_n^2}\sum_{i,j\in\Lambda_n,\,i\ne j}\mathrm{Cov}[Z_{n,i}(s), Z_{n,j}(s)] \le \frac{C_1(s)}{nb_n}(\gamma(s)-\gamma_0(s)) \tag{B.16}
\end{equation}
for some $C_1(s) < \infty$.

We suppose $n$ is large enough so that $r(s)\phi_{1n} \le C(s)$. Similarly as Lemma A.7 in Hansen (2000), we set $\gamma_g$ for $g = 1,2,\ldots,\bar g+1$ such that, for any $s \in S_0$, $\gamma_g(s) = \gamma_0(s) + 2^{g-1}r(s)\phi_{1n}$, where $\bar g$ is the integer satisfying $\gamma_{\bar g}(s) - \gamma_0(s) = 2^{\bar g-1}r(s)\phi_{1n} \le C(s)$ and $\gamma_{\bar g+1}(s) - \gamma_0(s) > C(s)$. Then Markov's inequality and (B.16) yield that for any fixed $\eta(s) > 0$,
\begin{align}
P\left(\max_{1\le g\le\bar g}\left|\frac{T_n(\gamma_g;s)}{E[T_n(\gamma_g;s)]} - 1\right| > \eta(s)\right) &\le \frac{1}{\eta^2(s)}\sum_{g=1}^{\bar g}\frac{E[(T_n(\gamma_g;s) - E[T_n(\gamma_g;s)])^2]}{|E[T_n(\gamma_g;s)]|^2} \tag{B.17}\\
&\le \frac{1}{\eta^2(s)}\sum_{g=1}^{\bar g}\frac{C_1(s)(nb_n)^{-1}(\gamma_g(s)-\gamma_0(s))}{|(\gamma_g(s)-\gamma_0(s))\ell_D(s)|^2} \le \frac{1}{\eta^2(s)}\sum_{g=1}^{\bar g}\frac{C_1(s)(nb_n)^{-1}}{2^{g-1}r(s)\phi_{1n}\ell_D^2(s)} \notag\\
&\le \frac{C_1(s)}{\eta^2(s)r(s)\ell_D^2(s)}\sum_{g=1}^{\infty}\frac{1}{2^{g-1}}\times\frac{1}{n^{2\epsilon}} \le \epsilon(s) \notag
\end{align}
for any $\epsilon(s) > 0$ when $n$ is sufficiently large. From eq. (33) of Hansen (2000), for any $\gamma(s)$ such that $r(s)\phi_{1n} \le \gamma(s) - \gamma_0(s) \le C(s)$, there exists some $g^*$ satisfying $\gamma_{g^*}(s) - \gamma_0(s) < \gamma(s) - \gamma_0(s) < \gamma_{g^*+1}(s) - \gamma_0(s)$, and then
\begin{align}
\frac{T_n(\gamma;s)}{|\gamma(s)-\gamma_0(s)|} &\ge \frac{T_n(\gamma_{g^*};s)}{E[T_n(\gamma_{g^*};s)]}\times\frac{E[T_n(\gamma_{g^*};s)]}{|\gamma_{g^*+1}(s)-\gamma_0(s)|} \tag{B.18}\\
&\ge \left\{1 - \max_{1\le g\le\bar g}\left|\frac{T_n(\gamma_g;s)}{E[T_n(\gamma_g;s)]} - 1\right|\right\}\frac{E[T_n(\gamma_{g^*};s)]}{|\gamma_{g^*+1}(s)-\gamma_0(s)|} \ge (1-\eta(s))\frac{|\gamma_{g^*}(s)-\gamma_0(s)|\,\ell_D(s)}{|\gamma_{g^*+1}(s)-\gamma_0(s)|} \notag
\end{align}
from (B.15), where $\{|\gamma_{g^*}(s)-\gamma_0(s)|/|\gamma_{g^*+1}(s)-\gamma_0(s)|\}\ell_D(s)$ is some finite non-zero constant by construction. Hence, in view of (B.18), we can find $C_T(s) \in (0,\infty)$ such that
\[
P\left(\inf_{r(s)\phi_{1n} < |\gamma(s)-\gamma_0(s)| < C(s)}\frac{T_n(\gamma;s)}{|\gamma(s)-\gamma_0(s)|} < C_T(s)(1-\eta(s))\right) \le \epsilon(s).
\]

The proof for (A.27) is similar to that for (A.26) and hence is omitted.

We next show (A.28). Without loss of generality we assume $x_i$ is a scalar, and so is $L_n(\gamma;s)$. Similarly as (B.16), we have
\begin{equation}
E[|L_n(\gamma;s)|^2] \le C_2(s)|\gamma(s)-\gamma_0(s)| \tag{B.19}
\end{equation}
for some $C_2(s) < \infty$. Defining $\gamma_g$ in the same way as above, Markov's inequality and (B.19) yield that for any fixed $\eta(s) > 0$,
\begin{align}
P\left(\max_{1\le g\le\bar g}\frac{|L_n(\gamma_g;s)|}{\sqrt{a_n}(\gamma_g(s)-\gamma_0(s))} > \frac{\eta(s)}{4}\right) &\le \frac{16}{\eta^2(s)}\sum_{g=1}^{\infty}\frac{E[L_n(\gamma_g,s)^2]}{a_n|\gamma_g(s)-\gamma_0(s)|^2} \tag{B.20}\\
&\le \frac{16}{\eta^2(s)}\sum_{g=1}^{\infty}\frac{C_2(s)}{a_n|\gamma_g(s)-\gamma_0(s)|} \le \frac{16C_2(s)}{\eta^2(s)r(s)}\sum_{g=1}^{\infty}\frac{1}{2^{g-1}} \notag
\end{align}
since $a_n = \phi_{1n}^{-1}$. This probability is arbitrarily close to zero if $r(s)$ is chosen large enough. It is worth noting that (B.20) provides the maximal (or sharp) rate of $\phi_{1n}$ as $a_n^{-1}$, because we need $a_n|\gamma_g(s)-\gamma_0(s)| = O(\phi_{1n}a_n) = O(1)$ as $n \to \infty$; this $\phi_{1n}a_n = O(1)$ condition is also compatible with (B.17).

Similarly, from Lemma A.1, we have
\begin{align}
P\left(\max_{1\le g\le\bar g}\sup_{\gamma_g(s)\le\gamma(s)\le\gamma_{g+1}(s)}\frac{|L_n(\gamma;s)-L_n(\gamma_g;s)|}{\sqrt{a_n}(\gamma_g(s)-\gamma_0(s))} > \frac{\eta(s)}{4}\right) &\le \sum_{g=1}^{\bar g}P\left(\sup_{\gamma_g(s)\le\gamma(s)\le\gamma_{g+1}(s)}|L_n(\gamma;s)-L_n(\gamma_g;s)| > \frac{\sqrt{a_n}(\gamma_g(s)-\gamma_0(s))\eta(s)}{4}\right) \tag{B.21}\\
&\le \sum_{g=1}^{\infty}\frac{C_3(s)|\gamma_{g+1}(s)-\gamma_g(s)|^2}{\eta^4(s)a_n^2|\gamma_g(s)-\gamma_0(s)|^4} \le \frac{C_3'(s)}{\eta^4(s)r(s)^2} \notag
\end{align}
for some $C_3(s), C_3'(s) < \infty$, where $\gamma_g(s) = \gamma_0(s) + 2^{g-1}r(s)\phi_{1n}$. This probability is also arbitrarily close to zero if $r(s)$ is chosen large enough. Since
\begin{align}
\sup_{r(s)\phi_{1n} < |\gamma(s)-\gamma_0(s)| < C(s)}\frac{|L_n(\gamma;s)|}{\sqrt{a_n}(\gamma(s)-\gamma_0(s))} &\le 2\max_{1\le g\le\bar g}\frac{|L_n(\gamma_g;s)|}{\sqrt{a_n}(\gamma_g(s)-\gamma_0(s))} \tag{B.22}\\
&\quad + 2\max_{1\le g\le\bar g}\sup_{\gamma_g(s)\le\gamma(s)\le\gamma_{g+1}(s)}\frac{|L_n(\gamma;s)-L_n(\gamma_g;s)|}{\sqrt{a_n}(\gamma_g(s)-\gamma_0(s))}, \notag
\end{align}
(B.20) and (B.21) yield
\begin{align*}
&P\left(\sup_{r(s)\phi_{1n} < |\gamma(s)-\gamma_0(s)| < C(s)}\frac{|L_n(\gamma;s)|}{\sqrt{a_n}(\gamma(s)-\gamma_0(s))} > \eta(s)\right)\\
&\le P\left(2\max_{1\le g\le\bar g}\frac{|L_n(\gamma_g;s)|}{\sqrt{a_n}(\gamma_g(s)-\gamma_0(s))} > \frac{\eta(s)}{2}\right) + P\left(2\max_{1\le g\le\bar g}\sup_{\gamma_g(s)\le\gamma(s)\le\gamma_{g+1}(s)}\frac{|L_n(\gamma;s)-L_n(\gamma_g;s)|}{\sqrt{a_n}(\gamma_g(s)-\gamma_0(s))} > \frac{\eta(s)}{2}\right) \le \epsilon(s)
\end{align*}
for any $\epsilon(s) > 0$ if we pick $r(s)$ sufficiently large.
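Remark (numerical illustration). The variance bound (B.16) that drives the shell argument above can be checked by a small Monte Carlo. The sketch below is illustrative only and simplifies the setting in ways that are our assumptions, not the paper's: the data are i.i.d. draws rather than a mixing random field, gamma_0(s) = 0 is constant, x_i, q_i, s_i are mutually independent standard draws, and the kernel is Gaussian. Under (B.16), var[T_n(gamma; s)] * n * b_n / (gamma - gamma_0) should stay roughly constant as gamma - gamma_0 varies.

import numpy as np

rng = np.random.default_rng(0)
n, bn, s0, R = 5000, 0.1, 0.5, 400       # toy sample size, bandwidth, location, replications

def T_n(gamma, gamma0, x, q, si):
    # T_n(gamma; s) = (n b_n)^{-1} sum_i x_i^2 {1[q_i <= gamma] - 1[q_i <= gamma0]} K((s_i - s)/b_n)
    K = np.exp(-0.5 * ((si - s0) / bn) ** 2) / np.sqrt(2 * np.pi)
    ind = (q <= gamma).astype(float) - (q <= gamma0).astype(float)
    return np.sum(x ** 2 * ind * K) / (n * bn)

for dgam in (0.1, 0.2, 0.4):
    draws = [T_n(dgam, 0.0, rng.standard_normal(n), rng.standard_normal(n),
                 rng.uniform(0.0, 1.0, n)) for _ in range(R)]
    print(f"gamma-gamma0={dgam}: var * n * b_n / (gamma-gamma0) = "
          f"{np.var(draws) * n * bn / dgam:.3f}")
# The normalized variance is approximately flat in (gamma - gamma0), consistent with (B.16).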

Proof of Lemma A.7 Using the same notation as in Lemma A.5, (A.18) yields
\begin{align}
n^{\epsilon}(\tilde\theta(\hat\gamma(s)) - \theta_0) &= \left\{\frac{1}{nb_n}Z(\hat\gamma(s);s)^\top Z(\hat\gamma(s);s)\right\}^{-1} \tag{B.23}\\
&\quad\times\left\{\frac{n^{\epsilon}}{nb_n}Z(\hat\gamma(s);s)^\top u(s) - \frac{n^{\epsilon}}{nb_n}Z(\hat\gamma(s);s)^\top\left(Z(\hat\gamma(s);s) - Z(\gamma_0(s_i);s)\right)\theta_0\right\} \notag\\
&\equiv \Theta_{A1}^{-1}(s)\{\Theta_{A2}(s) - \Theta_{A3}(s)\}. \notag
\end{align}
We let $M(s) \equiv \int_{-\infty}^{\infty}D(q,s)f(q,s)\,dq < \infty$. For the denominator $\Theta_{A1}(s)$, we have
\begin{equation}
\Theta_{A1}(s) = \begin{pmatrix} (nb_n)^{-1}\sum_{i\in\Lambda_n}x_ix_i^\top K_i(s) & M_n(\hat\gamma(s);s)\\ M_n(\hat\gamma(s);s) & M_n(\hat\gamma(s);s) \end{pmatrix} \to_p \begin{pmatrix} M(s) & M(\gamma_0(s);s)\\ M(\gamma_0(s);s) & M(\gamma_0(s);s) \end{pmatrix}, \tag{B.24}
\end{equation}
where $M(\gamma;s) < \infty$ is defined in (A.10) and is continuously differentiable in $\gamma$. Note that $\|M_n(\hat\gamma(s);s) - M(\gamma_0(s);s)\| \le \|M_n(\hat\gamma(s);s) - M(\hat\gamma(s);s)\| + \|M(\hat\gamma(s);s) - M(\gamma_0(s);s)\| = o_p(1)$ from Lemma A.3 and the pointwise consistency of $\hat\gamma(s)$ in Lemma A.5. In addition, $(nb_n)^{-1}\sum_{i\in\Lambda_n}x_ix_i^\top K_i(s) \to_p M(s)$ from the standard kernel estimation result. Note that the probability limit of $\Theta_{A1}(s)$ is positive definite since both $M(s)$ and $M(\gamma_0(s);s)$ are positive definite and
\[
M(s) - M(\gamma_0(s);s) = \int_{\gamma_0(s)}^{\infty}D(q,s)f(q,s)\,dq > 0
\]
for any $\gamma_0(s) \in \Gamma$ from Assumption A-(viii).

For the numerator part $\Theta_{A2}(s)$, we have $\Theta_{A2}(s) = O_p(a_n^{-1/2}) = o_p(1)$ because
\begin{equation}
\frac{1}{\sqrt{nb_n}}Z(\hat\gamma(s);s)^\top u(s) = \begin{pmatrix} (nb_n)^{-1/2}\sum_{i\in\Lambda_n}x_iu_iK_i(s)\\ J_n(\hat\gamma(s);s) \end{pmatrix} = O_p(1) \tag{B.25}
\end{equation}
from Lemma A.3 and the pointwise consistency of $\hat\gamma(s)$ in Lemma A.5. Note that the standard kernel estimation result gives $(nb_n)^{-1/2}\sum_{i\in\Lambda_n}x_iu_iK_i(s) = O_p(1)$. Moreover, we have
\begin{equation}
\Theta_{A3}(s) = \begin{pmatrix} (nb_n)^{-1}\sum_{i\in\Lambda_n}x_ix_i^\top c_0\{\mathbf{1}_i(\hat\gamma(s)) - \mathbf{1}_i(\gamma_0(s_i))\}K_i(s)\\ (nb_n)^{-1}\sum_{i\in\Lambda_n}x_ix_i^\top c_0\mathbf{1}_i(\hat\gamma(s))\{\mathbf{1}_i(\hat\gamma(s)) - \mathbf{1}_i(\gamma_0(s_i))\}K_i(s) \end{pmatrix} \tag{B.26}
\end{equation}
and
\begin{align}
\left\|\frac{1}{nb_n}\sum_{i\in\Lambda_n}c_0^\top x_ix_i^\top\{\mathbf{1}_i(\hat\gamma(s)) - \mathbf{1}_i(\gamma_0(s_i))\}K_i(s)\right\| &\le \|c_0\|\,\|M_n(\hat\gamma(s);s) - M_n(\gamma_0(s_i);s)\| \tag{B.27}\\
&\le \|c_0\|\,\|M_n(\hat\gamma(s);s) - M_n(\gamma_0(s);s)\| + O_p(b_n) = o_p(1), \notag
\end{align}
where the second inequality is from (A.20), and the last equality holds because $M_n(\gamma;s) \to_p M(\gamma;s)$ from Lemma A.3, which is continuous in $\gamma$, and $\hat\gamma(s) \to_p \gamma_0(s)$ from Lemma A.5. Since
\begin{equation}
\left\|\frac{1}{nb_n}\sum_{i\in\Lambda_n}x_ix_i^\top c_0\mathbf{1}_i(\hat\gamma(s))\{\mathbf{1}_i(\hat\gamma(s)) - \mathbf{1}_i(\gamma_0(s_i))\}K_i(s)\right\| \le \|c_0\|\,\|M_n(\hat\gamma(s);s) - M_n(\gamma_0(s_i);s)\| = o_p(1) \tag{B.28}
\end{equation}
from (B.27), we have $\Theta_{A3}(s) = o_p(1)$ as well, which completes the proof.
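Remark (numerical illustration). Behind the matrices $\Theta_{A1}(s)$-$\Theta_{A3}(s)$ is an ordinary kernel-weighted two-regime least squares fit at the location $s$. The sketch below is a minimal illustration under assumptions of our own (i.i.d. data, scalar x_i, a Gaussian kernel, a hypothetical linear threshold function gamma_0(s) = s, and the regression y_i = x_i beta_0 + x_i delta_0 1[q_i <= gamma_0(s_i)] + u_i); it is not the paper's exact estimator, but its Gram matrix divided by n b_n has precisely the block structure of $\Theta_{A1}(s)$ in (B.24).

import numpy as np

rng = np.random.default_rng(1)
n, bn, s0 = 2000, 0.1, 0.5
beta0, delta0 = 1.0, 0.5
gamma0 = lambda v: v                               # hypothetical threshold function

x, q = rng.standard_normal(n), rng.standard_normal(n)
si = rng.uniform(0.0, 1.0, n)
y = x * beta0 + x * delta0 * (q <= gamma0(si)) + 0.1 * rng.standard_normal(n)

def theta_hat(gamma_s):
    # rows of Z are (x_i, x_i 1[q_i <= gamma(s)]) scaled by the square root of K_i(s)
    w = np.exp(-0.25 * ((si - s0) / bn) ** 2)      # sqrt of a Gaussian kernel (constant dropped)
    Z = np.column_stack([x * w, x * (q <= gamma_s) * w])
    return np.linalg.solve(Z.T @ Z, Z.T @ (y * w))

print("theta_hat at the true split value:", theta_hat(gamma0(s0)))   # roughly (beta0, delta0)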

Proof of Lemma A.9 For the first result, using the same notation as in Lemma A.5, we write
\begin{align*}
\sqrt{nb_n}(\hat\theta(\hat\gamma(s)) - \theta_0) &= \left\{\frac{1}{nb_n}Z(\hat\gamma(s);s)^\top Z(\hat\gamma(s);s)\right\}^{-1}\\
&\quad\times\left\{\frac{1}{\sqrt{nb_n}}Z(\hat\gamma(s);s)^\top u(s) - \frac{1}{\sqrt{nb_n}}Z(\hat\gamma(s);s)^\top\left(Z(\hat\gamma(s);s) - Z(\gamma_0(s_i);s)\right)\theta_0\right\}\\
&\equiv \Theta_{B1}^{-1}(s)\{\Theta_{B2}(s) - \Theta_{B3}(s)\}
\end{align*}
similarly as (B.23). For the denominator, since $\Theta_{B1}(s) = \Theta_{A1}(s)$ in (B.23), we have $\Theta_{B1}^{-1}(s) = O_p(1)$ from (B.24). For the numerator, we first have $\Theta_{B2}(s) = O_p(1)$ from (B.25). For $\Theta_{B3}(s)$, similarly as (B.26),
\[
\Theta_{B3}(s) = \begin{pmatrix} a_n^{-1/2}\sum_{i\in\Lambda_n}n^{-\epsilon}x_ix_i^\top\delta_0\{\mathbf{1}_i(\hat\gamma(s)) - \mathbf{1}_i(\gamma_0(s_i))\}K_i(s)\\ a_n^{-1/2}\sum_{i\in\Lambda_n}n^{-\epsilon}x_ix_i^\top\delta_0\mathbf{1}_i(\hat\gamma(s))\{\mathbf{1}_i(\hat\gamma(s)) - \mathbf{1}_i(\gamma_0(s_i))\}K_i(s) \end{pmatrix}.
\]
However, since $\hat\gamma(s) = \gamma_0(s) + \hat r(s)\phi_{1n}$ for some $\hat r(s)$ bounded in probability from Theorem 2, similarly as (A.35), we have
\begin{align*}
&E\left[\sum_{i\in\Lambda_n}n^{-\epsilon}\delta_0^\top x_ix_i^\top\{\mathbf{1}_i(\hat\gamma(s)) - \mathbf{1}_i(\gamma_0(s_i))\}K_i(s)\right]\\
&\le a_n\left|\int\!\!\int_{\min\{\gamma_0(s+b_nt),\,\gamma_0(s)+r(s)\phi_{1n}\}}^{\max\{\gamma_0(s+b_nt),\,\gamma_0(s)+r(s)\phi_{1n}\}}E[x_ix_i^\top c_0\,|\,q, s+b_nt]K(t)f(q,s+b_nt)\,dq\,dt\right|\\
&\le a_n\left|\int\!\!\int_{\min\{\gamma_0(s)+r(s)\phi_{1n},\,\gamma_0(s)\}}^{\max\{\gamma_0(s)+r(s)\phi_{1n},\,\gamma_0(s)\}}E[x_ix_i^\top c_0\,|\,q, s+b_nt]K(t)f(q,s+b_nt)\,dq\,dt\right|\\
&\quad + a_n\left|\int\!\!\int_{\min\{\gamma_0(s+b_nt),\,\gamma_0(s)\}}^{\max\{\gamma_0(s+b_nt),\,\gamma_0(s)\}}E[x_ix_i^\top c_0\,|\,q, s+b_nt]K(t)f(q,s+b_nt)\,dq\,dt\right|\\
&= a_n\phi_{1n}|r(s)|\,|D(\gamma_0(s),s)c_0|\,f(\gamma_0(s),s) + O(a_nb_n) = O(1)
\end{align*}
as $a_n\phi_{1n} = 1$ and $a_nb_n = n^{1-2\epsilon}b_n^2 \to \varrho < \infty$. We also have
\[
\mathrm{Var}\left[\sum_{i\in\Lambda_n}n^{-\epsilon}x_ix_i^\top\delta_0\{\mathbf{1}_i(\hat\gamma(s)) - \mathbf{1}_i(\gamma_0(s_i))\}K_i(s)\right] = O(n^{-2\epsilon}) = o(1),
\]
similarly as (A.36). Therefore, for the same reason as (B.28), we have $\Theta_{B3}(s) = O_p(a_n^{-1/2}) = o_p(1)$, which completes the proof.

For the second result, given the same derivations for $\Theta_{B1}^{-1}(s)$ and $\Theta_{B3}(s)$ above, it suffices to show that
\[
\frac{1}{\sqrt{nb_n}}Z(\hat\gamma(s);s)^\top u(s) - \frac{1}{\sqrt{nb_n}}Z(\gamma_0(s);s)^\top u(s) = o_p(1),
\]
which is implied by Lemma A.1.

Proof of Lemma A.10 First, we consider the case with $r > 0$. For a fixed $s \in S_0$, we have
\begin{align*}
&\{\mathbf{1}[q \le \gamma_0(s) + r/a_n] - \mathbf{1}[q \le \gamma_0(s)]\}\{\mathbf{1}[q \le \gamma_0(s+b_nt)] - \mathbf{1}[q \le \gamma_0(s)]\}\\
&= \begin{cases} \mathbf{1}[\gamma_0(s) < q \le \gamma_0(s+b_nt)] & \text{if } \gamma_0(s+b_nt) \le \gamma_0(s) + r/a_n,\\ \mathbf{1}[\gamma_0(s) < q \le \gamma_0(s) + r/a_n] & \text{otherwise.} \end{cases}
\end{align*}
Denote $\bar D_{c_0}(q,s) = c_0^\top D(q,s)c_0 f(q,s)$. Then
\begin{align*}
E[B_n^{**}(r,s)] &= a_n\int\!\!\int c_0^\top D(q,s+b_nt)c_0\{\mathbf{1}[q \le \gamma_0(s)+r/a_n] - \mathbf{1}[q \le \gamma_0(s)]\}\\
&\qquad\qquad\times\{\mathbf{1}[q \le \gamma_0(s+b_nt)] - \mathbf{1}[q \le \gamma_0(s)]\}K(t)f(q,s+b_nt)\,dq\,dt\\
&= a_n\int_{\mathcal{T}_1^*(r;s)}\int_{\gamma_0(s)}^{\gamma_0(s+b_nt)}\bar D_{c_0}(q,s+b_nt)K(t)\,dq\,dt + a_n\int_{\mathcal{T}_2^*(r;s)}\int_{\gamma_0(s)}^{\gamma_0(s)+r/a_n}\bar D_{c_0}(q,s+b_nt)K(t)\,dq\,dt\\
&\equiv B_{n1}^{**}(r,s) + B_{n2}^{**}(r,s),
\end{align*}
where
\begin{align*}
\mathcal{T}_1^*(r;s) &= \{t : \gamma_0(s) < \gamma_0(s+b_nt)\}\cap\{t : \gamma_0(s+b_nt) \le \gamma_0(s)+r/a_n\},\\
\mathcal{T}_2^*(r;s) &= \{t : \gamma_0(s) < \gamma_0(s+b_nt)\}\cap\{t : \gamma_0(s)+r/a_n < \gamma_0(s+b_nt)\}.
\end{align*}
Note that $\gamma_0(s) < \gamma_0(s)+r/a_n$ always holds for $r > 0$. Similarly as in the proof of Lemma A.4, we let a positive sequence $t_n \to \infty$ be such that $t_nb_n \to 0$ as $n \to \infty$. Since $\int_{t_n}^{\infty}tK(t)\,dt \to 0$ by Assumption A-(x) as $t_n \to \infty$, both $\mathcal{T}_1^*(r;s)\cap\{t : |t| > t_n\}$ and $\mathcal{T}_2^*(r;s)\cap\{t : |t| > t_n\}$ become negligible as $t_n \to \infty$ by the same argument as in (B.2). It follows that
\begin{align*}
B_{n1}^{**}(r,s) &= a_n\int_{\mathcal{T}_1(r;s)}\int_{\gamma_0(s)}^{\gamma_0(s+b_nt)}\bar D_{c_0}(q,s+b_nt)K(t)\,dq\,dt + o(a_nb_n),\\
B_{n2}^{**}(r,s) &= a_n\int_{\mathcal{T}_2(r;s)}\int_{\gamma_0(s)}^{\gamma_0(s)+r/a_n}\bar D_{c_0}(q,s+b_nt)K(t)\,dq\,dt + o(a_nb_n),
\end{align*}
where $\mathcal{T}_1(r;s) = \mathcal{T}_1^*(r;s)\cap\{t : |t| \le t_n\}$ and $\mathcal{T}_2(r;s) = \mathcal{T}_2^*(r;s)\cap\{t : |t| \le t_n\}$. Recall that $a_nb_n = n^{1-2\epsilon}b_n^2 \to \varrho < \infty$ and hence $o(a_nb_n) = o(1)$. We consider the three cases of $\dot\gamma_0(s) > 0$, $\dot\gamma_0(s) < 0$, and $\dot\gamma_0(s) = 0$ separately.

First, we suppose $\dot\gamma_0(s) > 0$. For any fixed $\epsilon > 0$, $t_nb_n \le \epsilon$ holds if $n$ is sufficiently large. Therefore, for both $\mathcal{T}_1(r;s)$ and $\mathcal{T}_2(r;s)$, $\gamma_0(s) < \gamma_0(s+b_nt)$ requires that $t > 0$ for sufficiently large $n$. Furthermore, $\gamma_0(s+b_nt) < \gamma_0(s)+r/a_n$ implies that $t < r/(a_nb_n\dot\gamma_0(\tilde s))$ for some $\tilde s \in [s, s+b_nt]$, where $0 < r/(a_nb_n\dot\gamma_0(\tilde s)) < \infty$. Therefore, $\mathcal{T}_1(r;s) = \{t : t > 0 \text{ and } t < r/(a_nb_n\dot\gamma_0(\tilde s))\}$ for sufficiently large $n$. It follows that, by Taylor expansion,
\begin{align*}
B_{n1}^{**}(r,s) &= a_n\int_0^{r/(a_nb_n\dot\gamma_0(\tilde s))}\int_{\gamma_0(s)}^{\gamma_0(s+b_nt)}\bar D_{c_0}(q,s+b_nt)K(t)\,dq\,dt\\
&= a_nb_n\bar D_{c_0}(\gamma_0(s),s)\dot\gamma_0(s)\int_0^{r/(a_nb_n\dot\gamma_0(\tilde s))}tK(t)\,dt + a_nb_nO(b_n)\\
&= \varrho\,\bar D_{c_0}(\gamma_0(s),s)\dot\gamma_0(s)\mathcal{K}_1(r,\varrho;s) + o(1)
\end{align*}
for sufficiently large $n$, since $a_nb_n = n^{1-2\epsilon}b_n^2 \to \varrho < \infty$ and $\tilde s \to s$ as $n \to \infty$. Similarly, since $\gamma_0(s)+r/a_n < \gamma_0(s+b_nt)$ implies $t > r/(a_nb_n\dot\gamma_0(\tilde s))$ for some $\tilde s \in [s, s+b_nt]$, we have $\mathcal{T}_2(r;s) = \{t : t > 0 \text{ and } t > r/(a_nb_n\dot\gamma_0(\tilde s))\}$. Hence,
\begin{align*}
B_{n2}^{**}(r,s) &= a_n\int_{r/(a_nb_n\dot\gamma_0(\tilde s))}^{t_n}\int_{\gamma_0(s)}^{\gamma_0(s)+r/a_n}\bar D_{c_0}(q,s+b_nt)K(t)\,dq\,dt\\
&= r\bar D_{c_0}(\gamma_0(s),s)\int_{r/(a_nb_n\dot\gamma_0(\tilde s))}^{t_n}K(t)\,dt + O(b_n) = r\bar D_{c_0}(\gamma_0(s),s)\left\{\frac12 - \mathcal{K}_0(r,\varrho;s)\right\} + o(1)
\end{align*}
for sufficiently large $n$. Recall that $|\mathcal{K}_0(r,\varrho;s)| \le 1/2$ and $|\mathcal{K}_1(r,\varrho;s)| \le 1/2$.

When $\dot\gamma_0(s) < 0$, we have $-\infty < r/(a_nb_n\dot\gamma_0(\tilde s)) < 0$ and we can similarly derive
\begin{align*}
B_{n1}^{**}(r,s) &= a_n\int_{r/(a_nb_n\dot\gamma_0(\tilde s))}^0\int_{\gamma_0(s)}^{\gamma_0(s+b_nt)}\bar D_{c_0}(q,s+b_nt)K(t)\,dq\,dt = -\varrho\,\bar D_{c_0}(\gamma_0(s),s)\dot\gamma_0(s)\mathcal{K}_1(r,\varrho;s) + o(1),\\
B_{n2}^{**}(r,s) &= a_n\int_{-t_n}^{r/(a_nb_n\dot\gamma_0(\tilde s))}\int_{\gamma_0(s)}^{\gamma_0(s)+r/a_n}\bar D_{c_0}(q,s+b_nt)K(t)\,dq\,dt = r\bar D_{c_0}(\gamma_0(s),s)\left\{\frac12 - \mathcal{K}_0(r,\varrho;s)\right\} + o(1).
\end{align*}
When $\dot\gamma_0(s) = 0$, it suffices to consider the case where $\gamma_0(s)$ is a local minimum, so that $\dot\gamma_0(t) \le 0$ for $t \in [s-\epsilon, s]$ and $\dot\gamma_0(t) \ge 0$ for $t \in [s, s+\epsilon]$ for some small $\epsilon$. In this case, based on the same argument as (B.3),
\begin{align*}
\mathcal{T}_1(r;s) &= \{t : \gamma_0(s+b_nt) \le \gamma_0(s)+r/a_n\}\cap\{t : |t| \le t_n\},\\
\mathcal{T}_2(r;s) &= \{t : \gamma_0(s)+r/a_n < \gamma_0(s+b_nt)\}\cap\{t : |t| \le t_n\}.
\end{align*}
Therefore, for sufficiently large $n$,
\begin{align*}
B_{n1}^{**}(r,s) &= a_n\int_0^{t_n}\int_{\gamma_0(s)}^{\gamma_0(s+b_nt)}\bar D_{c_0}(q,s+b_nt)K(t)\,dq\,dt = -\varrho\,\bar D_{c_0}(\gamma_0(s),s)\dot\gamma_0(s)\int_0^{\infty}tK(t)\,dt + o(1) = o(1),\\
B_{n2}^{**}(r,s) &= a_n\int_{-t_n}^{r/(a_nb_n\dot\gamma_0(\tilde s))}\int_{\gamma_0(s)}^{\gamma_0(s)+r/a_n}\bar D_{c_0}(q,s+b_nt)K(t)\,dq\,dt = r\bar D_{c_0}(\gamma_0(s),s)\left\{\frac12 - \mathcal{K}_0(r,\varrho;s)\right\} + o(1) = o(1)
\end{align*}
since $\mathcal{K}_0(r,\varrho;s) = 1/2$ when $\dot\gamma_0(s) = 0$.

By combining all three cases and the symmetric argument for $r < 0$, we have
\[
E[B_n^{**}(r,s)] = |r|\,\bar D_{c_0}(\gamma_0(s),s)\left\{\frac12 - \mathcal{K}_0(r,\varrho;s)\right\} + \varrho\,\bar D_{c_0}(\gamma_0(s),s)|\dot\gamma_0(s)|\mathcal{K}_1(r,\varrho;s) + o(1).
\]
Furthermore, since $|B_n^{**}(r,s)| \le \sum_{i\in\Lambda_n}(\delta_0^\top x_i)^2|\mathbf{1}_i(\gamma_0(s)+r/a_n) - \mathbf{1}_i(\gamma_0(s))|K_i(s)$, we have $\mathrm{Var}[B_n^{**}(r,s)] = O(n^{-2\epsilon}) = o(1)$ from (A.36), which establishes the pointwise convergence for each $r$. The tightness follows from a similar argument as in Lemma A.1, and the desired result follows by Theorem 15.5 in Billingsley (1968).

Proof of Lemma A.11 Define $W_\mu(r) = W(r) + \mu(r)$, $\tau^+ = \arg\max_{r\in\mathbb{R}^+}W_\mu(r)$, and $\tau^- = \arg\max_{r\in\mathbb{R}^-}W_\mu(r)$. The process $W_\mu(\cdot)$ is a Gaussian process, and hence Lemma 2.6 of Kim and Pollard (1990) implies that $\tau^+$ and $\tau^-$ are unique almost surely. Recall that we define $W(r) = W_1(-r)\mathbf{1}[r<0] + W_2(r)\mathbf{1}[r>0]$, where $W_1(\cdot)$ and $W_2(\cdot)$ are two independent standard Wiener processes defined on $\mathbb{R}^+$. We claim that
\begin{equation}
E[\tau^+] = -E[\tau^-] < \infty, \tag{B.29}
\end{equation}
which gives the desired result.

The equality in (B.29) follows directly from the symmetry (i.e., $P(\tau^+ \le t) = P(\tau^- \ge -t)$ for any $t > 0$) and the fact that $W_1$ is independent of $W_2$. Now, we focus on $r > 0$ and show that $E[\tau^+] < \infty$. First, for any $r > 0$,
\[
P(W_\mu(r) \ge 0) = P(W_2(r) \ge -\mu(r)) = P\left(\frac{W_2(r)}{\sqrt r} \ge \frac{-\mu(r)}{\sqrt r}\right) = 1 - \Phi\left(\frac{-\mu(r)}{\sqrt r}\right),
\]
where $\Phi(\cdot)$ denotes the standard normal distribution function. Since the sample path of $W_\mu(\cdot)$ is continuous, for some $\bar r > 0$, we then have
\begin{align}
E[\tau^+] &= \int_0^{\infty}\{1 - P(\tau^+ \le r)\}\,dr = \int_0^{\bar r}P(\tau^+ > r)\,dr + \int_{\bar r}^{\infty}P(\tau^+ > r)\,dr \notag\\
&\le C_1 + \int_{\bar r}^{\infty}P(W_\mu(\tau^+) \ge 0 \text{ and } \tau^+ > r)\,dr \notag\\
&\le C_1 + \int_{\bar r}^{\infty}P(W_\mu(r) \ge 0)\,dr = C_1 + \int_{\bar r}^{\infty}\left(1 - \Phi\left(\frac{-\mu(r)}{\sqrt r}\right)\right)dr \tag{B.30}
\end{align}
for some $C_1 < \infty$, where the first inequality holds because $W_\mu(\tau^+) = \max_{r\in\mathbb{R}^+}W_\mu(r) \ge 0$ given $W_\mu(0) = 0$, and the second inequality holds because $P(W_\mu(r) \ge 0)$ is monotonically decreasing to zero on $[\bar r, \infty)$ by assumption. The second term in (B.30) can be bounded as follows. Using the change of variables $t = r^{\varepsilon}$, integration by parts, and the condition that $r^{-(1/2+\varepsilon)}\mu(r)$ monotonically decreases to $-\infty$ on $[\bar r, \infty)$ for some $\varepsilon > 0$, we have
\[
\int_{\bar r}^{\infty}\left(1 - \Phi\left(\frac{-\mu(r)}{\sqrt r}\right)\right)dr \le C_2\int_{\bar r}^{\infty}(1 - \Phi(r^{\varepsilon}))\,dr = C_3\int_{\bar r^{1/\varepsilon}}^{\infty}(1 - \Phi(t))\,dt^{1/\varepsilon} = C_4 + C_5\int_{\bar r^{1/\varepsilon}}^{\infty}t^{1/\varepsilon}\phi(t)\,dt < \infty
\]
for some $C_j < \infty$, $j = 2,3,4,5$, where $\phi(\cdot)$ denotes the standard normal density function and we use $\lim_{t\to\infty}t^{1/\varepsilon}(1 - \Phi(t)) = 0$. The same result can be obtained for $r < 0$ symmetrically, which completes the proof.
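Remark (numerical illustration). The finiteness of E[tau+] in (B.29) can be eyeballed by simulation. The sketch below is illustrative only: the drift mu(r) = -|r|/2 is a placeholder satisfying the conditions of the lemma (mu(0) = 0 and r^(-(1/2+eps)) mu(r) decreasing to minus infinity), the Wiener process is approximated by a Gaussian random walk on a grid, and the truncation at r_max and the step dr are numerical conveniences; the truncation is essentially never binding because the negative drift dominates.

import numpy as np

rng = np.random.default_rng(2)

def tau_plus(mu, r_max=200.0, dr=0.01):
    """argmax over r >= 0 of W(r) + mu(r), with W approximated on a grid."""
    r = np.arange(0.0, r_max, dr)
    W = np.concatenate([[0.0], np.cumsum(np.sqrt(dr) * rng.standard_normal(r.size - 1))])
    return r[np.argmax(W + mu(r))]

mu = lambda r: -0.5 * np.abs(r)                    # placeholder drift satisfying the lemma
draws = np.array([tau_plus(mu) for _ in range(2000)])
print("Monte Carlo estimate of E[tau+]:", draws.mean())
# The average is a stable finite number; doubling r_max leaves it essentially
# unchanged, in line with E[tau+] < infinity in (B.29).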

Proof of Lemma A.12 For given $(\varrho, s)$, we simply denote $\mu(r) = \mu(r,\varrho;s)$. For kernel functions satisfying Assumption A-(x), it is readily verified that $\mu(0) = 0$, $\mu(r)$ is continuous in $r$, and $\mu(r)$ is symmetric about zero. To check the monotone decreasing condition, for $r > 0$, we write
\[
\mu(r) = -r\int_0^{rC_1}K(t)\,dt + C_2\int_0^{rC_1}tK(t)\,dt,
\]
where $C_1$ and $C_2$ are some positive constants depending on $(\varrho, |\dot\gamma_0(s)|, \xi(s))$ from (A.41). We consider the two possible cases.

First, if $K(\cdot)$ has a bounded support, say $[-\bar r, \bar r]$ for some $0 < \bar r < \infty$, then $\mu(r) = -rC_3 + C_4$ for $r > \bar r$ and some $0 < C_3, C_4 < \infty$. Thus, $\mu(r)r^{-(1/2+\varepsilon)}$ is monotonically decreasing to $-\infty$ on $[\bar r, \infty)$ for any $\varepsilon \in (0, 1/2)$.

Second, if $K(\cdot)$ has an unbounded support,
\[
\mu(r)r^{-(1/2+\varepsilon)} = -r^{1/2-\varepsilon}\int_0^{rC_1}K(t)\,dt + r^{-(1/2+\varepsilon)}C_2\int_0^{rC_1}tK(t)\,dt,
\]
which goes to $-\infty$ as $r \to \infty$ since $\int_0^{rC_1}tK(t)\,dt \le \int_0^{\infty}tK(t)\,dt < \infty$ and $\int_0^{rC_1}K(t)\,dt > 0$. We can verify the monotonicity since
\begin{align*}
\frac{\partial}{\partial r}\left\{\mu(r)r^{-(1/2+\varepsilon)}\right\} &= -\left(\frac12 - \varepsilon\right)r^{-(1/2+\varepsilon)}\int_0^{rC_1}K(t)\,dt - r^{1/2-\varepsilon}C_1K(C_1r)\\
&\quad - \left(\frac12 + \varepsilon\right)r^{-(3/2+\varepsilon)}C_2\int_0^{rC_1}tK(t)\,dt + r^{1/2-\varepsilon}C_1^2C_2K(C_1r)\\
&= -r^{-(1/2+\varepsilon)}\left\{\left(\frac12 - \varepsilon\right)\int_0^{rC_1}K(t)\,dt + rK(C_1r)(C_1 - C_1^2C_2)\right\}\\
&\quad - \left(\frac12 + \varepsilon\right)r^{-(3/2+\varepsilon)}C_2\int_0^{rC_1}tK(t)\,dt
\end{align*}
by the Leibniz integral rule. For $r > \bar r$ with $\bar r$ large enough and $\varepsilon \in (0, 1/2)$, this derivative is strictly negative because $(1/2-\varepsilon)\int_0^{\bar rC_1}K(t)\,dt > 0$ and $\lim_{r\to\infty}rK(r) = 0$, which proves that $\mu(r)r^{-(1/2+\varepsilon)}$ is monotonically decreasing on $[\bar r, \infty)$. The case with $r < 0$ follows symmetrically.
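Remark (numerical illustration). For the unbounded-support case, the monotone decrease of mu(r) r^(-(1/2+eps)) can be verified directly for a concrete kernel. The sketch below uses the formula for mu(r) displayed in the proof with placeholder constants C1 = C2 = 1 and eps = 1/4, and a standard Gaussian kernel, for which both integrals have closed forms.

import numpy as np
from scipy.stats import norm

C1, C2, eps = 1.0, 1.0, 0.25                       # placeholder constants; eps in (0, 1/2)

def mu(r):
    # mu(r) = -r int_0^{r C1} K(t) dt + C2 int_0^{r C1} t K(t) dt with K the N(0,1) density:
    # int_0^a K = Phi(a) - 1/2 and int_0^a t K = phi(0) - phi(a) in closed form.
    a = r * C1
    return -r * (norm.cdf(a) - 0.5) + C2 * (norm.pdf(0.0) - norm.pdf(a))

r = np.linspace(1.0, 50.0, 2000)
g = mu(r) * r ** (-(0.5 + eps))
assert np.all(np.diff(g) < 0)                      # strictly decreasing on [1, 50]
print(f"g(1) = {g[0]:.4f}, g(50) = {g[-1]:.4f}")   # drifts to -infinity like -r^(1/2-eps)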

Proof of Lemma A.13 We only prove the first result, for $T_n(\gamma;s)$, because the proof for $\bar T_n(\gamma;s)$ is identical. We define
\[
\phi_{3n} = \|\gamma - \gamma_0\|_{\infty}\frac{\log n}{nb_n},
\]
where $\|\gamma - \gamma_0\|_{\infty} = \sup_{s\in S_0}|\gamma(s) - \gamma_0(s)|$, which is bounded since $\gamma(s) \in \Gamma$, a compact set, for any $s$. In addition, when $\|\gamma - \gamma_0\|_{\infty} = 0$, $T_n(\gamma;s) = 0$ and hence the result trivially holds, so we suppose $\|\gamma - \gamma_0\|_{\infty} > 0$ without loss of generality. Similar to the proof of Lemma A.3, we let $\tau_n = (n\log n)^{1/(4+\varphi)}$ with $\varphi$ given in Assumption A-(v) and
\begin{equation}
\widetilde T_n^{\tau}(\gamma,s) = \frac{1}{nb_n}\sum_{i\in\Lambda_n}(c_0^\top x_i)^2|\Delta_i(\gamma;s)|K_i(s)\mathbf{1}_{\tau_n}, \tag{B.31}
\end{equation}
where $\Delta_i(\gamma;s) = \mathbf{1}_i(\gamma(s)) - \mathbf{1}_i(\gamma_0(s))$ and $\mathbf{1}_{\tau_n} = \mathbf{1}[(c_0^\top x_i)^2 \le \tau_n]$. The triangle inequality gives
\begin{align}
\sup_{s\in S_0}|T_n(\gamma;s) - E[T_n(\gamma;s)]| &\le \sup_{s\in S_0}|\widetilde T_n^{\tau}(\gamma;s) - T_n(\gamma;s)| + \sup_{s\in S_0}|E[\widetilde T_n^{\tau}(\gamma;s)] - E[T_n(\gamma;s)]| \notag\\
&\quad + \sup_{s\in S_0}|\widetilde T_n^{\tau}(\gamma;s) - E[\widetilde T_n^{\tau}(\gamma;s)]| \notag\\
&\equiv P_{T1} + P_{T2} + P_{T3}, \tag{B.32}
\end{align}
and we bound each of the three terms as follows.

First, we show $P_{T1} = 0$ almost surely if $n$ is sufficiently large. By Markov's and Hölder's inequalities,
\[
P\left((c_0^\top x_i)^2|\Delta_i(\gamma;s)| > \tau_n\right) \le C\tau_n^{-(4+\varphi)}E[\|x_i^2\|^{4+\varphi}] \le C'(n\log n)^{-1}
\]
for some $C, C' < \infty$ from Assumption A-(v) and the fact that $|\Delta_i(\gamma;s)| \le 1$. Then, as in the proof of Lemma A.3, the Borel-Cantelli lemma implies that $(c_0^\top x_n)^2|\Delta_n(\gamma;s)| \le \tau_n$ almost surely for sufficiently large $n$. Since $\tau_n \to \infty$, we have $(c_0^\top x_i)^2|\Delta_i(\gamma;s)| \le \tau_n$ almost surely for all $i \in \Lambda_n$ with sufficiently large $n$. The desired result hence follows.

Second, we show $P_{T2} \le C^*\phi_{3n}^{1/2}$ almost surely for some $C^* < \infty$ if $n$ is sufficiently large. For any $s \in S_0$,
\begin{align}
|E[\widetilde T_n^{\tau}(\gamma;s)] - E[T_n(\gamma;s)]| &\le b_n^{-1}E\left[\left|(c_0^\top x_i)^2\mathbf{1}[\min\{\gamma_0(s),\gamma(s)\} < q_i \le \max\{\gamma_0(s),\gamma(s)\}]K_i(s)(1-\mathbf{1}_{\tau_n})\right|\right] \tag{B.33}\\
&\le \int\!\!\int_{\min\{\gamma_0(s),\gamma(s)\}}^{\max\{\gamma_0(s),\gamma(s)\}}E[(c_0^\top x_i)^2(1-\mathbf{1}_{\tau_n})\,|\,q, s+b_nt]f(q,s+b_nt)K(t)\,dq\,dt \notag\\
&\le \tau_n^{-(3+\varphi)}\int\!\!\int_{\min\{\gamma_0(s),\gamma(s)\}}^{\max\{\gamma_0(s),\gamma(s)\}}E[(c_0^\top x_i)^{2(4+\varphi)}\,|\,q, s+b_nt]f(q,s+b_nt)K(t)\,dq\,dt \notag\\
&\le C\tau_n^{-(3+\varphi)}\|\gamma - \gamma_0\|_{\infty} \notag
\end{align}
for some $C < \infty$, where $E[(c_0^\top x_i)^{2(4+\varphi)}\,|\,q,s]f(q,s)$ is uniformly bounded over $(q,s)$ by Assumptions A-(v) and (vii), and we use the inequality
\[
\int_{|a|>\tau_n}af_A(a)\,da \le \tau_n^{-(3+\varphi)}\int_{|a|>\tau_n}|a|^{4+\varphi}f_A(a)\,da \le \tau_n^{-(3+\varphi)}E[A^{4+\varphi}]
\]
for a generic random variable $A$. Hence, the desired result follows since
\[
\frac{\tau_n^{-(3+\varphi)}\|\gamma - \gamma_0\|_{\infty}}{\phi_{3n}^{1/2}} = \frac{\|\gamma - \gamma_0\|_{\infty}^{1/2}b_n^{1/2}}{n^{\frac{3+\varphi}{4+\varphi}-\frac12}(\log n)^{\frac{3+\varphi}{4+\varphi}+\frac12}} = o(1),
\]
where $\|\gamma - \gamma_0\|_{\infty}$ is bounded.

Finally, we show $P_{T3} \le C^*\phi_{3n}^{1/2}$ almost surely for some $C^* < \infty$ if $n$ is sufficiently large, which follows similarly as the proof of Lemma A.4. To this end, we partition the compact $S_0$ into $m_n$ intervals $I_k = [s_k, s_{k+1})$ for $k = 1,\ldots,m_n$. We choose the integer $m_n > n$ such that $|s_{k+1} - s_k| \le C/m_n = C'b_n^2\tau_n^{-1}\phi_{3n}^{1/2}$ for all $k$ and for some $C, C' < \infty$. In addition, since we let $\gamma(\cdot)$ be a cadlag and piecewise constant function with at most $n$ discontinuity points, which is less than $m_n$, Theorem 28.2 in Davidson (1994) entails that we can choose these finite partitions such that
\begin{equation}
\sup_{s\in I_k}|\gamma(s) - \gamma(s_k)| = 0 \tag{B.34}
\end{equation}
for each $k$. Then we have
\begin{align}
\sup_{s\in S_0}|\widetilde T_n^{\tau}(\gamma;s) - E[\widetilde T_n^{\tau}(\gamma;s)]| &\le \max_{1\le k\le m_n}\sup_{s\in I_k}|\widetilde T_n^{\tau}(\gamma;s) - \widetilde T_n^{\tau}(\gamma;s_k)| \notag\\
&\quad + \max_{1\le k\le m_n}\sup_{s\in I_k}|E[\widetilde T_n^{\tau}(\gamma;s)] - E[\widetilde T_n^{\tau}(\gamma;s_k)]| \notag\\
&\quad + \max_{1\le k\le m_n}|\widetilde T_n^{\tau}(\gamma;s_k) - E[\widetilde T_n^{\tau}(\gamma;s_k)]| \notag\\
&\equiv \Psi_{T1} + \Psi_{T2} + \Psi_{T3}. \tag{B.35}
\end{align}
Below we show that $\Psi_{T1}$, $\Psi_{T2}$, and $\Psi_{T3}$ are all $O_{a.s.}(\phi_{3n}^{1/2})$.

Part 1: $\Psi_{T1}$ and $\Psi_{T2}$ are both $O_{a.s.}(\phi_{3n}^{1/2})$. Similarly as for the $\Psi_{M1}$ term in Lemma A.3, we first decompose $\widetilde T_n^{\tau}(\gamma;s) - \widetilde T_n^{\tau}(\gamma;s_k) \le \widetilde T_{1n}^{\tau}(\gamma;s,s_k) + \widetilde T_{2n}^{\tau}(\gamma;s,s_k)$, where
\begin{align*}
\widetilde T_{1n}^{\tau}(\gamma;s,s_k) &= \frac{1}{nb_n}\sum_{i\in\Lambda_n}(c_0^\top x_i)^2|\Delta_i(\gamma;s) - \Delta_i(\gamma;s_k)|K_i(s_k)\mathbf{1}_{\tau_n},\\
\widetilde T_{2n}^{\tau}(\gamma;s,s_k) &= \frac{1}{nb_n}\sum_{i\in\Lambda_n}(c_0^\top x_i)^2|\Delta_i(\gamma;s)||K_i(s) - K_i(s_k)|\mathbf{1}_{\tau_n}.
\end{align*}
Without loss of generality, we can suppose $\gamma(s) > \gamma(s_k)$ and $\gamma_0(s) > \gamma_0(s_k)$ in $\widetilde T_{1n}^{\tau}(\gamma;s,s_k)$. Since $K_i(\cdot)$ is bounded from Assumption A-(x) and we only consider $x_i^2 \le \tau_n$,
\begin{align*}
|\widetilde T_{1n}^{\tau}(\gamma;s,s_k)| &\le \frac{1}{nb_n}\sum_{i\in\Lambda_n}(c_0^\top x_i)^2\mathbf{1}[\gamma_0(s_k) < q_i \le \gamma_0(s)]K_i(s_k)\mathbf{1}_{\tau_n} + \frac{1}{nb_n}\sum_{i\in\Lambda_n}(c_0^\top x_i)^2\mathbf{1}[\gamma(s_k) < q_i \le \gamma(s)]K_i(s_k)\mathbf{1}_{\tau_n}\\
&\le \frac{C_1\tau_n}{b_n}P(\gamma_0(s_k) < q_i \le \gamma_0(s)) + \frac{C_1\tau_n}{b_n}P(\gamma(s_k) < q_i \le \gamma(s)) \quad\text{almost surely}\\
&\le C_1'\tau_nb_n^{-1}\sup_{s\in I_k}|s - s_k| + 0 \le C_1'\tau_nb_n^{-1}m_n^{-1} = C_1''b_n\phi_{3n}^{1/2}
\end{align*}
for some $C_1, C_1', C_1'' < \infty$, where the second inequality is by the uniform almost sure law of large numbers for random fields (e.g., Jenish and Prucha (2009), Theorem 2), and the third inequality holds since $\gamma_0(\cdot)$ is continuously differentiable, $q_i$ is continuous, and the second probability is zero by (B.34). Hence, $\widetilde T_{1n}^{\tau}(\gamma;s,s_k) = O_{a.s.}(\phi_{3n}^{1/2})$, which holds uniformly in $s \in I_k$ and $k \in \{1,\ldots,m_n\}$. Similarly, since $K(\cdot)$ is Lipschitz from Assumption A-(x) and $|\Delta_i(\gamma;s)| \le 1$,
\[
|\widetilde T_{2n}^{\tau}(\gamma;s,s_k)| \le \frac{\tau_n}{nb_n}\sum_{i\in\Lambda_n}|K_i(s) - K_i(s_k)| \le C_2\frac{\tau_n}{b_n^2}|s - s_k| \le \frac{C_2'\tau_n}{b_n^2m_n} = O_{a.s.}(\phi_{3n}^{1/2})
\]
for some $C_2, C_2' < \infty$, uniformly in $s$ and $k$. Hence, $\Psi_{T1} = O_{a.s.}(\phi_{3n}^{1/2})$, and we can readily verify that $\Psi_{T2} = O_{a.s.}(\phi_{3n}^{1/2})$ similarly.

Part 2: $\Psi_{T3} = O_{a.s.}(\phi_{3n}^{1/2})$. We let
\[
Z_i^{\tau}(s) = (nb_n)^{-1}\left\{(c_0^\top x_i)^2\Delta_i(\gamma;s)K_i(s)\mathbf{1}_{\tau_n} - E[(c_0^\top x_i)^2\Delta_i(\gamma;s)K_i(s)\mathbf{1}_{\tau_n}]\right\}
\]
and apply a similar proof as for $\Psi_{\Delta M3}$ in Lemma A.4 above. In particular, we construct the block $B^{[1]}(s_k)$ in the same fashion as (B.6). Then, it suffices to show $\max_{1\le k\le m_n}|B^{[1]}(s_k)| = O_{a.s.}(\phi_{3n}^{1/2})$ as $n \to \infty$. Using the same notation as in Lemma A.4, by the uniform almost sure law of large numbers for random fields, we have that for any $t = 1,\ldots,r$ and $s \in S_0$,
\begin{equation}
|U_t(s)| \le \frac{C_3w^2\tau_n}{nb_n}\left(\frac{1}{w^2}\sum_{i_1=2j_1w+1}^{(2j_1+1)w}\sum_{i_2=2j_2w+1}^{(2j_2+1)w}|\Delta_i(\gamma;s)|\right) \le \frac{C_3w^2\tau_n\|\gamma - \gamma_0\|_{\infty}}{nb_n} \tag{B.36}
\end{equation}
almost surely from (B.5), for some $C_3 < \infty$. We also approximate $\{U_t(s)\}_{t=1}^r$ by a version of independent random variables $\{U_t^*(s)\}_{t=1}^r$ that satisfies
\begin{equation}
\sum_{t=1}^rE[|U_t^*(s) - U_t(s)|] \le rC_3(nb_n)^{-1}w^2\tau_n\|\gamma - \gamma_0\|_{\infty}\alpha_{w^2,w^2}(w). \tag{B.37}
\end{equation}
Then, similar to (B.10), for some positive $C^* < \infty$,
\[
P\left(\max_{1\le k\le m_n}|B^{[1]}(s_k)| > C^*\phi_{3n}^{1/2}\right) \le m_n\sup_{s\in S_0}P\left(\sum_{t=1}^r|U_t^*(s) - U_t(s)| > C^*\phi_{3n}^{1/2}\right) + m_n\sup_{s\in S_0}P\left(\left|\sum_{t=1}^rU_t^*(s)\right| > C^*\phi_{3n}^{1/2}\right) \equiv P_{U1} + P_{U2}.
\]
For $P_{U1}$,
\[
P_{U1} \le m_n\frac{rC_3(nb_n)^{-1}w^2\tau_n\|\gamma - \gamma_0\|_{\infty}\alpha_{w^2,w^2}(w)}{C^*\phi_{3n}^{1/2}} \le C_3'\frac{\tau_n^2}{b_n^3\phi_{3n}}\|\gamma - \gamma_0\|_{\infty}\exp(-C_3''w) \le \frac{C_3'''n^{\kappa_1}(\log n)^{\kappa_2}\exp(-C_3''''n^{\kappa_3})}{(n^{1-2\epsilon}b_n)^2}
\]
for some $\kappa_1,\kappa_2,\kappa_3 > 0$ and $C_3', C_3'', C_3''', C_3'''' < \infty$. Hence $P_{U1} = O(\exp(-n^{\kappa_3})) \to 0$ as $n \to \infty$. Recall that we chose $m_n = O(\tau_n/(b_n^2\phi_{3n}^{1/2}))$, $n = 4w^2r$, and $\tau_n = (n\log n)^{1/(4+\varphi)}$.

For $P_{U2}$, using the same argument as (A.5) in Lemma A.1, we can show that
\[
E[U_t^*(s)^2] = \sum_{\substack{1\le i_1\le w\\1\le i_2\le w}}E[Z_i^{\tau}(s)^2] + \sum_{\substack{i\ne j\\1\le i_1,i_2\le w\\1\le j_1,j_2\le w}}\mathrm{Cov}[Z_i^{\tau}(s), Z_j^{\tau}(s)] \le \frac{C_4w^2}{n^2b_n}\|\gamma - \gamma_0\|_{\infty}
\]
for some $C_4 < \infty$, which does not depend on $s$ given Assumptions A-(v) and (x). We now choose an integer $w$ such that
\[
w = (nb_n/(C_w\tau_n\lambda_n))^{1/2}, \qquad \lambda_n = (nb_n\log n)^{1/2}
\]
for some large positive constant $C_w$. Note that substituting $\lambda_n$ and $\tau_n$ into $w$ gives
\[
w = O\left(\left[\frac{n^{\frac14-\frac{1}{4+\varphi}}}{(\log n)^{\frac34+\frac{1}{4+\varphi}}}\times\left(\frac{nb_n^2}{\log n}\right)^{1/4}\right]^{1/2}\right),
\]
which diverges as $n \to \infty$ for $\varphi > 0$ and from Assumption A-(ix). From (B.36), we have $|\lambda_nU_t^*(s)/\|\gamma - \gamma_0\|_{\infty}^{1/2}| < 1/2$ by choosing $C_w$ large enough, and hence
\begin{align*}
\sup_{s\in S_0}P\left(\left|\sum_{t=1}^rU_t^*(s)\right| > C^*\phi_{3n}^{1/2}\right) &= \sup_{s\in S_0}P\left(\left|\sum_{t=1}^r\frac{U_t^*(s)}{\|\gamma - \gamma_0\|_{\infty}^{1/2}}\right| > C^*\left(\frac{\log n}{nb_n}\right)^{1/2}\right)\\
&\le 2\exp\left(-C^*\lambda_n\left(\frac{\log n}{nb_n}\right)^{1/2} + \frac{C_4\lambda_n^2rw^2}{n^2b_n}\right)\\
&= 2\exp\left(-C^*\lambda_n\left(\frac{\log n}{nb_n}\right)^{1/2} + C_4\lambda_n^2(nb_n)^{-1}\right) = 2\exp(-C^*\log n + C_4'\log n)
\end{align*}
for some $C_4, C_4' < \infty$, as in (B.12) and (B.13). It follows that
\[
m_n\sup_{s\in S_0}P\left(\left|\sum_{t=1}^rU_t^*(s)\right| > C^*\phi_{3n}^{1/2}\right) \le \frac{2m_n}{n^{C^*-C_4'}} = \frac{C_5(\log n)^{\kappa_4}}{n^{\kappa_5}(n^{1-2\epsilon}b_n)^{3/2}}
\]
for some $C_5 < \infty$, $\kappa_4 > 0$, and $\kappa_5 > 1$ by choosing $C^*$ sufficiently large. Therefore, $P_{U2} = O(n^{-\kappa_5}) \to 0$ as $n \to \infty$. Since $P_{U1} + P_{U2} = O(n^{-c})$ for some $c > 1$, we have $\sum_{n=1}^{\infty}P(\max_{1\le k\le m_n}|B^{[1]}(s_k)| > C^*\phi_{3n}^{1/2}) < \infty$, and hence we obtain the desired result by the Borel-Cantelli lemma.
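Remark (numerical illustration). The partition step (B.34) rests on the fact that a cadlag, piecewise-constant gamma is exactly constant on any cell that does not straddle one of its jumps. A minimal sketch of that construction, with toy jump locations of our own choosing:

import numpy as np

def partition(jumps, lo=0.0, hi=1.0, max_len=0.05):
    """Grid points covering [lo, hi) such that no cell [s_k, s_{k+1}) straddles a
    jump; cells longer than max_len are refined further."""
    cuts = np.unique(np.concatenate([[lo, hi], np.asarray(jumps, dtype=float)]))
    grid = [lo]
    for a, b in zip(cuts[:-1], cuts[1:]):
        m = int(np.ceil((b - a) / max_len))
        grid.extend(np.linspace(a, b, m + 1)[1:])
    return np.asarray(grid)

jumps = [0.21, 0.40, 0.77]                          # toy discontinuity points of gamma
gamma = lambda s: np.searchsorted(jumps, s, side="right").astype(float)  # cadlag step function
grid = partition(jumps)
for a, b in zip(grid[:-1], grid[1:]):
    ss = np.linspace(a, b - 1e-12, 25)              # points inside the cell [a, b)
    assert np.all(gamma(ss) == gamma(a))            # sup over I_k of |gamma(s) - gamma(s_k)| = 0
print(len(grid) - 1, "cells; gamma is exactly constant on every cell")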

Proof of Lemma A.14 Since the proof is similar to that of Lemma A.13, we only highlight the parts that differ. Without loss of generality we assume $x_i$ is a scalar, and so is $L_n(\gamma;s)$. As in (B.31), we let
\[
\widetilde L_n^{\tau}(\gamma;s) = \frac{1}{\sqrt{nb_n}}\sum_{i\in\Lambda_n}x_iu_i\Delta_i(\gamma;s)K_i(s)\mathbf{1}_{\tau_n},
\]
where $\Delta_i(\gamma;s) = \mathbf{1}_i(\gamma(s)) - \mathbf{1}_i(\gamma_0(s))$ and $\mathbf{1}_{\tau_n} = \mathbf{1}[|x_iu_i| \le \tau_n]$ with $\tau_n = (n\log n)^{1/(4+\varphi)}$. Since $E[\widetilde L_n^{\tau}(\gamma,s)] = 0$, we write
\[
\sup_{s\in S_0}|L_n(\gamma;s)| \le \sup_{s\in S_0}|\widetilde L_n^{\tau}(\gamma;s) - L_n(\gamma;s)| + \sup_{s\in S_0}|\widetilde L_n^{\tau}(\gamma;s)| \equiv P_{L1} + P_{L2}.
\]
Using the same argument as for $P_{T1}$ in the proof of Lemma A.13, we have
\[
P(|x_iu_i||\Delta_i(\gamma;s)| > \tau_n) \le C\tau_n^{-(4+\varphi)}E[\|x_iu_i\|^{4+\varphi}] \le C'(n\log n)^{-1}
\]
for some $C, C' < \infty$. Then the Borel-Cantelli lemma implies that $|x_nu_n||\Delta_n(\gamma;s)| \le \tau_n$ almost surely for sufficiently large $n$. Since $\tau_n \to \infty$, we have $|x_iu_i||\Delta_i(\gamma;s)| \le \tau_n$ almost surely for all $i \in \Lambda_n$ with sufficiently large $n$, which yields $P_{L1} = 0$ almost surely for sufficiently large $n$.

For $P_{L2}$, we let $\phi_{3n} = \|\gamma - \gamma_0\|_{\infty}\log n$ and write
\[
\sup_{s\in S_0}|\widetilde L_n^{\tau}(\gamma;s)| \le \max_{1\le k\le m_n}\sup_{s\in I_k}|\widetilde L_n^{\tau}(\gamma;s) - \widetilde L_n^{\tau}(\gamma;s_k)| + \max_{1\le k\le m_n}|\widetilde L_n^{\tau}(\gamma;s_k)| \equiv \Psi_{L1} + \Psi_{L2},
\]
for some integer $m_n = O(\tau_n/(b_n^2\phi_{3n}^{1/2}))$. We let $Z_i^{\tau}(s) = (nb_n)^{-1/2}x_iu_i\Delta_i(\gamma;s)K_i(s)\mathbf{1}_{\tau_n}$, and we choose $w = ((nb_n)/(C_w\tau_n\lambda_n))^{1/2}$ for some large positive constant $C_w$ and $\lambda_n = (\log n)^{1/2}$. Then, the rest of the proof follows identically as for $P_{T3}$ in the proof of Lemma A.13.

Proof of Lemma A.15 We first show (A.42). Similarly as in the proof of Lemma A.6, we consider the case with $\gamma(s) > \gamma_0(s)$; the other direction can be shown symmetrically. We suppose $n$ is large enough so that $r\phi_{2n} \le \bar C$ for some $r, \bar C \in (0,\infty)$ and $\sup_{s\in S_0}(\gamma(s) - \gamma_0(s)) \in [r\phi_{2n}, \bar C]$. We also let
\[
\ell = \inf_{s\in S_0}\ell_D(s) > 0,
\]
where $\ell_D(s)$ is defined in (B.14). Then, from (B.15), we have
\begin{equation}
\sup_{s\in S_0}E[T_n(\gamma;s)] \ge \ell\sup_{s\in S_0}(\gamma(s) - \gamma_0(s)). \tag{B.38}
\end{equation}
For any $\epsilon > 0$ and for any $\gamma(\cdot)$ such that $\sup_{s\in S_0}(\gamma(s) - \gamma_0(s)) \in [r\phi_{2n}, \bar C]$, Lemma A.13 and (B.38) imply that, with probability approaching one,
\begin{align*}
\frac{\sup_{s\in S_0}T_n(\gamma;s)}{\sup_{s\in S_0}|\gamma(s) - \gamma_0(s)|} &\ge \frac{\sup_{s\in S_0}E[T_n(\gamma;s)] - \sup_{s\in S_0}|T_n(\gamma;s) - E[T_n(\gamma;s)]|}{\sup_{s\in S_0}|\gamma(s) - \gamma_0(s)|}\\
&\ge \frac{\sup_{s\in S_0}E[T_n(\gamma;s)]}{\sup_{s\in S_0}|\gamma(s) - \gamma_0(s)|} - \frac{\left(\sup_{s\in S_0}|\gamma(s) - \gamma_0(s)|\,(\log n/(nb_n))\right)^{1/2}}{\sup_{s\in S_0}|\gamma(s) - \gamma_0(s)|}\\
&\ge \ell - \frac{(\log n/(nb_n))^{1/2}}{(r\phi_{2n})^{1/2}} \ge \ell - r^{-1/2}n^{-\epsilon}.
\end{align*}
Since $\ell > 0$ and $r^{-1/2}n^{-\epsilon} \to 0$ as $n \to \infty$, we thus can find $C_T < \infty$ such that
\[
P\left(\inf_{r\phi_{2n} < \sup_{s\in S_0}|\gamma(s)-\gamma_0(s)| < \bar C}\frac{\sup_{s\in S_0}T_n(\gamma;s)}{\sup_{s\in S_0}|\gamma(s) - \gamma_0(s)|} < C_T(1-\eta)\right) \le \epsilon
\]
for any $\epsilon, \eta > 0$. The proof for (A.43) is similar and hence is omitted.

For (A.44), without loss of generality we assume $x_i$ is a scalar, and so is $L_n(\gamma;s)$. We set $\gamma_g$ for $g = 1,2,\ldots,\bar g+1$ such that, for any $s \in S_0$, $\gamma_g(s) = \gamma_0(s) + 2^{g-1}r\phi_{2n}$, where $\bar g$ is the integer satisfying $\sup_{s\in S_0}(\gamma_{\bar g}(s) - \gamma_0(s)) = 2^{\bar g-1}r\phi_{2n} \le \bar C$ and $\sup_{s\in S_0}(\gamma_{\bar g+1}(s) - \gamma_0(s)) > \bar C$. Then Lemma A.14 yields that for any $\eta > 0$,
\begin{align}
P\left(\max_{1\le g\le\bar g}\frac{\sup_{s\in S_0}|L_n(\gamma_g;s)|}{\sqrt{a_n}\sup_{s\in S_0}(\gamma_g(s) - \gamma_0(s))} > \frac{\eta}{4}\right) &\le \sum_{g=1}^{\bar g}P\left(\frac{\sup_{s\in S_0}|L_n(\gamma_g;s)|}{\sqrt{a_n}\sup_{s\in S_0}(\gamma_g(s) - \gamma_0(s))} > \frac{\eta}{4}\right) \tag{B.39}\\
&\le \frac{4}{\eta}\sum_{g=1}^{\bar g}\frac{C_L(\phi_{2n}\log n)^{1/2}}{\sqrt{a_n}\,2^{g-1}r\phi_{2n}} \le \frac{C_L'}{\eta r}\sum_{g=1}^{\infty}\frac{1}{2^{g-1}} \notag
\end{align}
for some $C_L, C_L' < \infty$. This probability is arbitrarily close to zero if $r$ is chosen large enough. Following a similar discussion after (B.20), this result also provides the maximal (or sharp) rate of $\phi_{2n}$ as $\log n/a_n$, because we need $(\log n/a_n)/\phi_{2n} = O(1)$ while $\phi_{2n} \to 0$ as $\log n/a_n \to 0$ with $n \to \infty$. For a given $g$, we define $\Gamma_g$ as the collection of $\gamma(s)$ satisfying $r2^{g-1}\phi_{2n} < \sup_{s\in S_0}|\gamma(s) - \gamma_0(s)| < r2^g\phi_{2n}$. By a similar argument as (B.39), we have
\begin{equation}
P\left(\max_{1\le g\le\bar g}\sup_{\gamma\in\Gamma_g}\frac{\sup_{s\in S_0}|L_n(\gamma;s) - L_n(\gamma_g;s)|}{\sqrt{a_n}\sup_{s\in S_0}(\gamma_g(s) - \gamma_0(s))} > \frac{\eta}{4}\right) \le \frac{C_L''}{\eta r} \tag{B.40}
\end{equation}
for some $C_L'' < \infty$, which is arbitrarily close to zero if $r$ is chosen large enough. From (B.22), and by combining (B.39) and (B.40), we thus have
\begin{align*}
&P\left(\sup_{r\phi_{2n} < \sup_{s\in S_0}|\gamma(s)-\gamma_0(s)| < \bar C}\frac{\sup_{s\in S_0}|L_n(\gamma;s)|}{\sqrt{a_n}\sup_{s\in S_0}(\gamma(s) - \gamma_0(s))} > \eta\right)\\
&\le P\left(2\max_{1\le g\le\bar g}\frac{\sup_{s\in S_0}|L_n(\gamma_g;s)|}{\sqrt{a_n}\sup_{s\in S_0}(\gamma_g(s) - \gamma_0(s))} > \frac{\eta}{2}\right) + P\left(2\max_{1\le g\le\bar g}\sup_{\gamma\in\Gamma_g}\frac{\sup_{s\in S_0}|L_n(\gamma;s) - L_n(\gamma_g;s)|}{\sqrt{a_n}\sup_{s\in S_0}(\gamma_g(s) - \gamma_0(s))} > \frac{\eta}{2}\right) \le \epsilon
\end{align*}
for any $\epsilon, \eta > 0$ if $r$ is chosen sufficiently large.

Proof of Lemma A.16 Given Lemma A.3 and the uniform convergence of the standard kernel estimators, the desired result can be obtained similarly as in the proof of Lemma A.7, provided that $\sup_{s\in S_0}|\hat\gamma(s) - \gamma_0(s)| \to_p 0$ as $n \to \infty$. To this end, recall that $\hat\gamma(s)$ is the minimizer of $\Upsilon_n(\gamma;s)$ in (A.19) and $\gamma_0(s)$ is the minimizer of $\Upsilon_0(\gamma;s)$ in (A.21) for any given $s \in S_0$; see Lemma A.5 for the definitions of $\Upsilon_n(\gamma;s)$ and $\Upsilon_0(\gamma;s)$.

Suppose $\hat\gamma(s)$ is not uniformly consistent, implying that there exist $\eta > 0$ and $\epsilon > 0$ such that for any $N \in \mathbb{N}$, there exists $n > N$ satisfying $P(\sup_{s\in S_0}|\hat\gamma(s) - \gamma_0(s)| > \eta) > \epsilon$, or simply
\begin{equation}
P\left(\sup_{s\in S_0}(\hat\gamma(s) - \gamma_0(s)) > \eta\right) > \epsilon \tag{B.41}
\end{equation}
without loss of generality. From (A.22), we can define $C \in (0,\infty)$ such that
\[
\inf_{s\in S_0}\frac{\partial\Upsilon_0(\gamma_0(s);s)}{\partial\gamma} > C > 0,
\]
and hence the mean value theorem yields
\[
\Upsilon_0(\hat\gamma(s),s) - \Upsilon_0(\gamma_0(s),s) = \frac{\partial\Upsilon_0(\tilde\gamma(s),s)}{\partial\gamma}(\hat\gamma(s) - \gamma_0(s)) > C(\hat\gamma(s) - \gamma_0(s))
\]
for sufficiently large $n$, where $\tilde\gamma(s)$ lies between $\hat\gamma(s)$ and $\gamma_0(s)$. Therefore,
\begin{align}
P\left(\sup_{s\in S_0}\{\Upsilon_0(\hat\gamma(s),s) - \Upsilon_0(\gamma_0(s),s)\} > C\eta\right) &> P\left(\inf_{s\in S_0}\frac{\partial\Upsilon_0(\gamma_0(s);s)}{\partial\gamma}\sup_{s\in S_0}(\hat\gamma(s) - \gamma_0(s)) > C\eta\right) \tag{B.42}\\
&= P\left(\sup_{s\in S_0}(\hat\gamma(s) - \gamma_0(s)) > \eta\right) > \epsilon \notag
\end{align}
from (B.41).

However, by construction, $\Upsilon_n(\hat\gamma(s),s) - \Upsilon_n(\gamma_0(s),s) \le 0$ for every $s \in S_0$, which implies
\begin{equation}
\sup_{s\in S_0}\{\Upsilon_n(\hat\gamma(s),s) - \Upsilon_n(\gamma_0(s),s)\} \le 0 \quad\text{almost surely.} \tag{B.43}
\end{equation}
Furthermore, using the triangle inequality and the uniform convergence result in Lemma A.3, we can verify that
\begin{equation}
\sup_{(r,s)\in\Gamma\times S_0}|\Upsilon_n(r,s) - \Upsilon_0(r,s)| \to_p 0 \tag{B.44}
\end{equation}
as $n \to \infty$ from the proof of Lemma A.5. From (B.43) and (B.44), we thus have
\begin{align*}
P\left(\sup_{s\in S_0}\{\Upsilon_0(\hat\gamma(s),s) - \Upsilon_0(\gamma_0(s),s)\} > C\eta\right) &\le P\left(\sup_{s\in S_0}\{\Upsilon_0(\hat\gamma(s),s) - \Upsilon_n(\hat\gamma(s),s)\} > C\eta/3\right)\\
&\quad + P\left(\sup_{s\in S_0}\{\Upsilon_n(\hat\gamma(s),s) - \Upsilon_n(\gamma_0(s),s)\} > C\eta/3\right)\\
&\quad + P\left(\sup_{s\in S_0}\{\Upsilon_n(\gamma_0(s),s) - \Upsilon_0(\gamma_0(s),s)\} > C\eta/3\right)\\
&\le (\epsilon^*/3) + (\epsilon^*/3) + (\epsilon^*/3) = \epsilon^*
\end{align*}
for any $\epsilon^* > 0$ if $n$ is sufficiently large. This contradicts (B.42) by choosing $\epsilon^* \le \epsilon$; hence the uniform consistency must hold.

Proof of Lemma A.17 We prove $\Xi_{\beta2} = o_p(1)$ and $\Xi_{\beta3} = o_p(1)$; the results for $\Xi_{\delta2}$ and $\Xi_{\delta3}$ can be shown symmetrically. For expositional simplicity, we present the case of scalar $x_i$.

For $\Xi_{\beta2}$: Note that $\hat\gamma(\cdot)$ belongs to $\mathcal{G}_n(S_0;\Gamma)$. We define intervals $I_k$ for $k = 1,\ldots,n$, which are centered at the discontinuity points of $\hat\gamma(s)$ with length $\ell_n$ such that $\ell_n \to 0$ as $n \to \infty$. Without loss of generality, we choose $\ell_n = O(n^{-3})$. Then, we can interpolate on each $I_k$ and define $\tilde\gamma(s)$ as a smooth version of $\hat\gamma(s)$, which satisfies
\begin{equation}
P\left(\sup_{s\in S_0}|\tilde\gamma(s) - \hat\gamma(s)| > \epsilon\right) \le P\left(\max_{1\le k\le n}\sup_{s\in I_k}|\tilde\gamma(s) - \hat\gamma(s)| > \epsilon\right) \le \epsilon \tag{B.45}
\end{equation}
for any $\epsilon > 0$, if $n$ is sufficiently large. Since $\sup_{s\in S_0}|\hat\gamma(s) - \gamma_0(s)| = o_p(1)$ by the proof of Lemma A.16, we have
\begin{equation}
\sup_{s\in S_0}|\tilde\gamma(s) - \gamma_0(s)| \le \sup_{s\in S_0}|\tilde\gamma(s) - \hat\gamma(s)| + \sup_{s\in S_0}|\hat\gamma(s) - \gamma_0(s)| = o_p(1) \tag{B.46}
\end{equation}
from (B.45).

Now we define
\[
G_n(\gamma) = \frac{1}{\sqrt n}\sum_{i\in\Lambda_n}x_iu_i\mathbf{1}[q_i > \gamma(s_i) + \pi_n]\mathbf{1}_{S_0},
\]
and then
\[
\Xi_{\beta2} = G_n(\hat\gamma) - G_n(\gamma_0) = \{G_n(\hat\gamma) - G_n(\tilde\gamma)\} + \{G_n(\tilde\gamma) - G_n(\gamma_0)\} \equiv \Psi_{G1} + \Psi_{G2}.
\]
First, for $\Psi_{G1}$, let $\Delta_i^{\pi}(\hat\gamma,\tilde\gamma) = \mathbf{1}[q_i > \hat\gamma(s_i) + \pi_n] - \mathbf{1}[q_i > \tilde\gamma(s_i) + \pi_n]$. By construction, $|\Delta_i^{\pi}(\hat\gamma,\tilde\gamma)| \le \mathbf{1}[s_i \in I_k \text{ for some } k]$. Therefore, by the Cauchy-Schwarz inequality and Assumptions A-(v) and A-(viii),
\begin{align*}
E[|\Psi_{G1}|] &\le n^{1/2}E[|x_iu_i||\Delta_i^{\pi}(\hat\gamma,\tilde\gamma)|\mathbf{1}_{S_0}]\\
&\le n^{1/2}E[(x_iu_i)^2]^{1/2}E[(\mathbf{1}[s_i \in I_k \text{ for some } k]\mathbf{1}_{S_0})^2]^{1/2}\\
&\le C_1n^{1/2}(P[s_i \in I_k\cap S_0 \text{ for some } k])^{1/2} \le C_1'n^{1/2}(n\cdot n^{-3})^{1/2} = C_1'n^{-1/2} = o(1)
\end{align*}
for some $C_1, C_1' < \infty$. Hence, $\Psi_{G1} = o_p(1)$.

Second, for $\Psi_{G2}$, we let $\mathbf{1}_{\tau} = \mathbf{1}[|x_iu_i| \le \tau]$ for some $\tau < \infty$. Then, for any $\epsilon' > 0$ and $\gamma: S_0 \mapsto \Gamma$,
\begin{align*}
P\left(\frac{1}{\sqrt n}\sum_{i\in\Lambda_n}x_iu_i\mathbf{1}[q_i > \gamma(s_i) + \pi_n](1-\mathbf{1}_{\tau})\mathbf{1}_{S_0} > \epsilon'\right) &\le \epsilon'^{-2}\frac{1}{n}E\left[\left(\sum_{i\in\Lambda_n}x_iu_i\mathbf{1}[q_i > \gamma(s_i) + \pi_n](1-\mathbf{1}_{\tau})\mathbf{1}_{S_0}\right)^2\right]\\
&\le C\epsilon'^{-2}E[(x_iu_i)^2\mathbf{1}[|x_iu_i| > \tau]]\\
&\le C\epsilon'^{-2}E[(x_iu_i)^4]^{1/2}(P[|x_iu_i| > \tau])^{1/2} \le C\epsilon'^{-2}\tau^{-2}E[(x_iu_i)^4]
\end{align*}
for some $C < \infty$, where we apply Markov's and the Cauchy-Schwarz inequalities. From Assumption A-(v), by choosing $\tau$ sufficiently large, this probability can be made arbitrarily small. Hence,
\[
G_n(\gamma) = \frac{1}{\sqrt n}\sum_{i\in\Lambda_n}x_iu_i\mathbf{1}[q_i > \gamma(s_i) + \pi_n]\mathbf{1}_{\tau}\mathbf{1}_{S_0} + o_p(1)
\]
for sufficiently large $n$, and we simply consider $|x_iu_i| \le \tau$ almost surely in what follows.

We let $\mathcal{F}^*$ be the class of functions $xu\mathbf{1}[q > \gamma(s) + \pi_n]$ for $\gamma \in C^2[S_0]$, where $C^2[S_0]$ denotes the family of twice continuously differentiable functions defined on $S_0$. Using Theorem 2.5.6 in van der Vaart and Wellner (1996), we establish that $\mathcal{F}^*$ is P-Donsker, which requires three elements: an entropy bound, a maximal inequality, and the chaining argument. For the entropy bound, by Corollaries 2.7.2 and 2.7.3 in van der Vaart and Wellner (1996) (with their $r = d = 1$ and $\alpha = 2$), $\mathcal{F}^*$ has the same bracketing number (up to a constant) as that of the collection of subgraphs of $C^2[S_0]$, so that $\log N_{[\,]}(\epsilon, \mathcal{F}^*, \|\cdot\|_{\infty}) \le C\epsilon^{-1/2}$, where $\|\cdot\|_{\infty}$ denotes the uniform norm. For the maximal inequality, since we consider $|xu| \le \tau$, Corollary 3.3 in Valenzuela-Domínguez, Krebs, and Franke (2017) gives a Bernstein inequality for spatial lattice processes with exponentially decaying $\alpha$-mixing coefficients. This satisfies the conditions of Lemma 2.2.10 in van der Vaart and Wellner (1996), which implies that for any finite collection of functions $\gamma_1,\ldots,\gamma_m \in C^2[S_0]$,
\begin{equation}
E\left[\max_{1\le k\le m}G_n(\gamma_k)\right] \le C'\left(\log(1+m) + \sqrt{\log(1+m)}\right) \tag{B.47}
\end{equation}
for some $C' < \infty$. For the chaining argument, the same analysis as in van der Vaart and Wellner (1996), pp. 131-132, applies with the following two changes: their envelope function $F$ is $|xu|$, which satisfies $E[F^2] < \infty$; and their inequality (2.5.5) is implied by (B.47) with $m = \log N_{[\,]}(\epsilon, \mathcal{F}^*, \|\cdot\|_{\infty})$. Note that the spatial dependence only shows up in deriving the maximal inequality, but not in the entropy bound or the chaining argument.

Since Donsker implies stochastic equicontinuity, it follows that $G_n(\cdot)$ satisfies, for every positive $\eta_n \to 0$,
\[
\sup_{\sup_{s\in S_0}|\gamma(s)-\gamma'(s)| \le \eta_n}|G_n(\gamma) - G_n(\gamma')| \to_p 0
\]
as $n \to \infty$. Therefore, $\Psi_{G2} = o_p(1)$ since $\sup_{s\in S_0}|\tilde\gamma(s) - \gamma_0(s)| = o_p(1)$ from (B.46).

For $\Xi_{\beta3}$: On the event $E_n^*$ that $\sup_{s\in S_0}|\hat\gamma(s) - \gamma_0(s)| \le \phi_{2n}$, we have
\begin{align*}
E[|\Xi_{\beta3}|] &= \frac{1}{\sqrt n}\sum_{i\in\Lambda_n}E[|x_i^2\delta_0|\mathbf{1}[q_i \le \gamma_0(s_i)]\mathbf{1}[q_i > \hat\gamma(s_i) + \pi_n]\mathbf{1}_{S_0}]\\
&\le n^{1/2-\epsilon}CE[\mathbf{1}[q_i \le \gamma_0(s_i)]\mathbf{1}[q_i > \hat\gamma(s_i) + \pi_n]\mathbf{1}_{S_0}]\\
&\le n^{1/2-\epsilon}CE[\mathbf{1}[q_i \le \gamma_0(s_i)]\mathbf{1}[q_i > \gamma_0(s_i) - \phi_{2n} + \pi_n]\mathbf{1}_{S_0}]\\
&= n^{1/2-\epsilon}C\int_{S_0}\int_{\mathcal{I}(q;s)}f(q,s)\,dq\,ds
\end{align*}
for some $0 < C < \infty$, where $\mathcal{I}(q;s) = \{q : q \le \gamma_0(s) \text{ and } q > \gamma_0(s) - \phi_{2n} + \pi_n\}$. Since we define $\pi_n > 0$ such that $\phi_{2n}/\pi_n \to 0$, it holds that $\pi_n - \phi_{2n} > 0$ for sufficiently large $n$. Therefore, $\mathcal{I}(q;s)$ becomes empty for all $s$ when $n$ is sufficiently large. The desired result follows from Markov's inequality and the fact that $P(E_n^*) > 1 - \epsilon$ for any $\epsilon > 0$.

References

Berbee, H. (1987): "Convergence Rates in the Strong Law for Bounded Mixing Sequences," Probability Theory and Related Fields, 74, 255–270.

Carbon, M., L. T. Tran, and B. Wu (1997): "Kernel Density Estimation for Random Fields," Statistics and Probability Letters, 36, 115–125.

Carbon, M., C. Francq, and L. T. Tran (2007): "Kernel Regression Estimation for Random Fields," Journal of Statistical Planning and Inference, 137, 778–798.

Davidson, J. (1994): Stochastic Limit Theory. Oxford University Press, Oxford.

Hansen, B. E. (2000): "Sample Splitting and Threshold Estimation," Econometrica, 68, 575–603.

Jenish, N., and I. R. Prucha (2009): "Central Limit Theorems and Uniform Laws of Large Numbers for Arrays of Random Fields," Journal of Econometrics, 150, 86–98.

Rio, E. (1995): "The Functional Law of the Iterated Logarithm for Stationary Strongly Mixing Sequences," Annals of Probability, 23(3), 1188–1203.

Tran, L. T. (1990): "Kernel Density Estimation on Random Fields," Journal of Multivariate Analysis, 34, 37–53.

Valenzuela-Domínguez, E., J. T. N. Krebs, and J. E. Franke (2017): "A Bernstein Inequality for Spatial Lattice Processes," Working Paper.

van der Vaart, A. W., and J. A. Wellner (1996): Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York.