
Improving Covariate Balancing Propensity Score:

A Doubly Robust and Efficient Approach ∗

Jianqing Fan∗‡ Kosuke Imai†‡ Han Liu∗‡ Yang Ning∗ Xiaolin Yang∗†

Princeton University

∗ Department of Operations Research and Financial Engineering
† Department of Politics
‡ Center for Statistics and Machine Learning

June 14, 2016

Abstract

Inverse probability of treatment weighting (IPTW) is a popular method for estimating causal

effects in many disciplines. However, empirical studies show that the IPTW estimators can be

sensitive to misspecification of the propensity score model. To address this problem, several

researchers have proposed new methods that estimate the propensity score by directly optimizing the

balance of pre-treatment covariates. While these methods appear to empirically perform well,

little is known about their theoretical properties. This paper makes two main contributions.

First, we conduct a theoretical investigation of one such methodology, the Covariate Balancing

Propensity Score (CBPS) recently proposed by Imai and Ratkovic (2014). We characterize

the asymptotic bias and efficiency of the CBPS-based IPTW estimator under both arbitrary

and local model misspecification as well as correct specification for general balancing functions.

Based on this finding, we address an open problem in the literature on how to optimally choose

the covariate balancing function for the CBPS methodology. Second, motivated by the form

of the optimal covariate balancing function, we further propose a new IPTW estimator by

generalizing the CBPS method. We prove that the proposed estimator is consistent if either the

propensity score model or the outcome model is correct. In addition to this double robustness

property, we also establish that the proposed estimator is semiparametrically efficient when

both the propensity score and outcome models are correctly specified. Unlike the standard

doubly robust estimators, however, the proposed methodology does not require the estimation

of the outcome model. To relax the parametric assumptions on the propensity score model and the

outcome model, we further consider a sieve estimation approach to estimate the treatment effect.

A new “nonparametric double robustness” phenomenon is observed. Our simulations show that

the proposed estimator has better finite sample properties than the standard estimators.

Key words: Average treatment effect, causal inference, double robustness, model misspecification, semiparametric efficiency, sieve estimation

∗Supported by NSF grants DMS-1206464 and DMS-1406266 and NIH grants R01-GM072611-12 and R01-GM100474-04.


1 Introduction

Suppose that we have a random sample of n units from a population of interest. For each unit i,

we observe (Ti, Yi,Xi), where Xi ∈ Rd is a d-dimensional vector of pre-treatment covariates, Ti is a

binary treatment variable, and Yi is an outcome variable. In particular, Ti is equal to 1 if unit i receives the treatment and to 0 if unit i belongs to the control group. The observed outcome can

be written as Yi = Yi(1)Ti + Yi(0)(1− Ti), where Yi(1) and Yi(0) are the potential outcomes under

the treatment and control conditions, respectively. This notation implicitly requires the stable unit

treatment value assumption (Rubin, 1990). In addition, throughout this paper, we also assume the

strong ignorability of the treatment assignment (Rosenbaum and Rubin, 1983):

Yi(1), Yi(0) ⊥⊥ Ti |Xi and 0 < P(Ti = 1 |Xi) < 1. (1.1)

Furthermore, without loss of generality, we assume that the conditional expectation functions

of potential outcomes can be written as,

E(Yi(0) |Xi) = K(Xi) and E(Yi(1) |Xi) = K(Xi) + L(Xi) (1.2)

for some functions K(·) and L(·), which represent the conditional mean of the potential outcome

under the control condition and the conditional average treatment effect, respectively. Under this

setting, we are interested in estimating the average treatment effect (ATE),

µ = E(Yi(1)− Yi(0)) = E(L(Xi)). (1.3)

The propensity score is defined as the conditional probability of treatment assignment (Rosenbaum and Rubin, 1983),

π(Xi) = P(Ti = 1 |Xi). (1.4)

In practice, since Xi can be high dimensional, the propensity score is usually parameterized by a

model πβ(Xi) where β is a q-dimensional vector of parameters. A popular choice is the logistic

regression model, i.e., πβ(Xi) = exp(X⊤i β)/{1 + exp(X⊤i β)}. Once the parameter β is estimated (e.g., by the maximum likelihood estimator β̂), the Horvitz-Thompson estimator (Horvitz and

Thompson, 1952), which is based on the inverse probability of treatment weighting (IPTW), can

be used to obtain an estimate of the ATE (Robins et al., 1994),

\[
\hat\mu_{\hat\beta} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{T_i Y_i}{\pi_{\hat\beta}(X_i)} - \frac{(1-T_i)\,Y_i}{1-\pi_{\hat\beta}(X_i)}\right). \tag{1.5}
\]
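For concreteness, a minimal Python sketch of this estimator is given below (our illustration, not code from the paper; it assumes numpy and scikit-learn are available, and the function name iptw_ate is ours). The propensity score is fitted by logistic maximum likelihood and plugged into (1.5):

import numpy as np
from sklearn.linear_model import LogisticRegression

def iptw_ate(X, T, Y):
    # Fit pi_beta(X) by (essentially unpenalized) logistic MLE.
    ps = LogisticRegression(C=1e6, max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
    ps = np.clip(ps, 1e-3, 1 - 1e-3)  # guard against extreme weights
    # Horvitz-Thompson IPTW estimate of the ATE, as in (1.5).
    return np.mean(T * Y / ps - (1 - T) * Y / (1 - ps))

In the simulation study of Section 3.2, this maximum-likelihood fit corresponds to the GLM benchmark.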

Despite its popularity, researchers have found that the estimators based on IPTW are particularly sensitive to the misspecification of the propensity score model (e.g., Kang and Schafer, 2007).

To overcome this problem, several researchers have recently proposed to estimate the propensity

score by optimizing covariate balance rather than maximizing the accuracy of predicting treatment

assignment (e.g., Tan, 2010; Hainmueller, 2012; Graham et al., 2012; Imai and Ratkovic, 2014;

Chan et al., 2015). In this paper, we focus on the Covariate Balancing Propensity Score (CBPS)


methodology proposed by Imai and Ratkovic (2014). In spite of its simplicity, several scholars

independently found that the CBPS performs well in practice (e.g., Wyss et al., 2014; Frölich et al.,

2015). The method can also be extended for the analysis of longitudinal data (Imai and Ratkovic,

2015) and general treatment regimes (Fong et al., 2015). In this paper, we conduct a theoretical

investigation of the CBPS. Given the similarity between the CBPS and some other methods, our

theoretical analysis may also provide new insights for understanding these related methods.

The CBPS method estimates the parameters of the propensity score model, β, by solving the

following m-dimensional estimating equation,

\[
\bar g_\beta(T, X) = \frac{1}{n}\sum_{i=1}^{n} g_\beta(T_i, X_i) = 0, \qquad\text{where}\quad g_\beta(T_i, X_i) = \left(\frac{T_i}{\pi_\beta(X_i)} - \frac{1-T_i}{1-\pi_\beta(X_i)}\right) f(X_i), \tag{1.6}
\]

for some covariate balancing function f(·) : Rd → Rm, if the number of equations m is equal to the

number of parameters q. Imai and Ratkovic (2014) point out that the common practice of fitting

a logistic model is equivalent to balancing the score function with f(Xi) = π′β(Xi) = ∂πβ(Xi)/∂β.

They find that choosing f(Xi) = Xi, which balances the first moment between the treatment and

control groups, significantly reduces the bias of the estimated ATE. Some researchers also choose

$f(X_i) = (X_i^\top, (X_i^2)^\top)^\top$, which additionally balances the second moments, in their applications. This guarantees that the treatment and control groups have

an identical sample mean of f(Xi) after weighting by the estimated propensity score. If m > q,

then β can be estimated by optimizing the covariate balance via the generalized method of moments (GMM; Hansen, 1982):

\[
\hat\beta = \operatorname*{argmin}_{\beta\in\Theta}\ \bar g_\beta(T, X)^\top\, W\, \bar g_\beta(T, X), \tag{1.7}
\]

where Θ is the parameter space for β in Rq and W is an (m×m) positive definite weighting matrix,

which we assume in this paper does not depend on β. Alternatively, the empirical likelihood method

can be used (Owen, 2001; Fong et al., 2015). Once the estimate of β is obtained, we can estimate

the ATE using the IPTW estimator in (1.5).
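As an illustration, the following minimal Python sketch (ours, not the authors' CBPS software; the name cbps_fit is hypothetical) solves the just-identified balancing equations (1.6) with f(Xi) = (1, X⊤i)⊤ under a logistic model, so that m = q; for m > q one would minimize the GMM objective (1.7) instead, e.g., with scipy.optimize.minimize:

import numpy as np
from scipy.optimize import least_squares

def cbps_fit(X, T):
    # Balancing function f(X) = (1, X); the logistic model uses the same
    # columns, so the number of equations m equals the number of parameters q.
    F = np.column_stack([np.ones(len(T)), X])

    def balance(beta):
        ps = 1.0 / (1.0 + np.exp(-F @ beta))   # pi_beta(X_i)
        w = T / ps - (1 - T) / (1 - ps)        # weight appearing in (1.6)
        return F.T @ w / len(T)                # (1/n) sum_i w_i f(X_i)

    # Root-find the m-dimensional system of balancing equations.
    return least_squares(balance, x0=np.zeros(F.shape[1])).x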

The main idea of the CBPS and other related methods is to directly optimize the balance of

covariates between the treatment and control groups so that even when the propensity score model

is misspecified we still obtain a reasonable balance of the covariates between the treatment and

control groups. However, one open question remains in this literature: How shall we choose the

covariate balancing function f(Xi)? In particular, if the propensity score model is misspecified,

this problem becomes even more important. Although some researchers have proposed nonparametric propensity score estimators to alleviate this problem (e.g., Hirano et al., 2003; Chan

et al., 2015), these methods are potentially difficult to apply in practice even when the number of

pre-treatment covariates is moderate.

This paper makes two main contributions. First, we conduct a thorough theoretical study of

the CBPS-based IPTW estimator with the general balancing function f(·). We characterize the

asymptotic bias of the CBPS-based IPTW estimator under both arbitrary and local misspecification


of the propensity score model. We also study the efficiency of the IPTW estimator under correct

specification and local model misspecification. Based on these findings, we show how to optimally

choose the covariate balancing function f(Xi) for the CBPS methodology (Section 2). In particular,

we show that once the covariate balancing function is chosen in this way, the CBPS-based IPTW

estimator achieves a double robustness property: the estimator is consistent if either the propensity

score model or the outcome model is correct.

However, the optimal choice of f(Xi) requires the knowledge of the propensity score, which

is unknown. Thus, the application of the CBPS method with the optimal f(Xi) is limited in

practice. To address this issue, our second contribution is to propose a new IPTW estimator by

generalizing the CBPS method. We show that the IPTW estimator based on the improved CBPS

(iCBPS) method retains the double robustness property. We also show that the proposed estimator

is semiparametrically efficient when both the propensity score and outcome models are correctly

specified (Section 3). Different from the CBPS method with the optimal f(Xi), the proposed

iCBPS method does not require the knowledge of the propensity score and is easy to implement

in practice. In addition, unlike the standard doubly robust estimators (Robins et al., 1994), the

proposed iCBPS method does not require the estimation of the outcome model. Our simulation study

shows that the proposed estimator outperforms the standard estimators (Section 3.2).

To relax the parametric assumptions on the propensity score model and the outcome model, we

further extend the proposed iCBPS method to the nonparametric/semiparametric settings, by using

a sieve estimation approach (Newey, 1997; Chen, 2007). In Section 4, we establish a unified semiparametric efficiency result for the IPTW estimator under many nonparametric/semiparametric

settings, including the fully nonparametric model, additive model and partially linear model. Our

result provides a more comprehensive theoretical framework than the existing nonparametric literature (e.g., Hirano et al., 2003; Chan et al., 2015), which usually assumes the propensity score

model is fully nonparametric and therefore suffers from the curse of dimensionality. In addition, our

theoretical results require weaker technical assumptions. For instance, in the fully nonparametric

setting, the theory in Hirano et al. (2003) and Chan et al. (2015) requires s/d > 7 and s/d > 13,

respectively, where s is the smoothness parameter of some function class and d = dim(Xi). In comparison, we only require s/d > 3/4, which is significantly weaker than the existing conditions. To

prove this result, we exploit the matrix Bernstein’s concentration inequalities (Tropp, 2015) and a

Bernstein-type concentration inequality for U-statistics (Arcones, 1995). Similar tools from recent

random matrix theory have been used by Hansen (2014); Chen and Christensen (2015); Belloni

et al. (2015) to study the optimal rate of convergence for sieve-based least squares estimation. However, unlike the sieve-based least squares estimator, our sieve estimator of the propensity score does not have a closed form, and this leads to extra technical challenges in establishing its consistency and rate of convergence. Finally, in this nonparametric setting, we observe an interesting

phenomenon that our unified semiparametric efficiency result holds if either the propensity score

model or the outcome model is approximated reasonably well. This can be viewed as a nonparametric version of the double robustness property. To the best of our knowledge, this phenomenon

does not appear in the existing literature. The last section provides concluding remarks.


2 Consequences of Model Misspecification

Our theoretical investigation starts by examining the consequences of model misspecification for the

CBPS-based IPTW estimator. For this, we first derive the asymptotic bias of the IPTW estimator

under local misspecification of the propensity score model and show that we can eliminate the

bias by carefully choosing the covariate balancing function f(Xi) such that it spans K(Xi) + (1 − πβo(Xi))L(Xi), where βo is the limiting value of β̂, i.e., β̂ →p βo. In other words, we want

to choose the covariate balancing function f(Xi) such that it spans the weighted average of the

two conditional mean functions of potential outcomes, i.e., there exists an α ∈ Rm such that

α⊤f(Xi) = πβo(Xi)E(Yi(0) |Xi) + (1 − πβo(Xi))E(Yi(1) |Xi). This result is further extended to

arbitrary model misspecification.

The above result implies that when balancing covariates, for any given unit we should give a

greater weight to the determinants of the potential outcome that is less likely to be realized. For

example, if a unit is less likely to be treated, then it is more important to balance the covariates

that influence the mean potential outcome under the treatment condition. In contrast, if a unit is

more likely to be assigned to the treatment group, then the covariates that determine the potential

outcome under the control condition become more important. We also show that even when the

propensity score is correctly specified, this choice of covariate balancing function is optimal, enabling

the resulting estimator to attain the semiparametric efficiency bound.

2.1 Bias under Model Misspecification

While researchers can avoid gross model misspecification through careful model fitting, in practice

it is often difficult to nail down the exact specification. The prominent simulation study of Kang

and Schafer (2007), for example, is designed to illustrate this phenomenon. We therefore consider

the consequences of local misspecification of the propensity score model. In particular, we assume that

the true propensity score π(Xi) is related to the working model πβ(Xi) through the exponential

tilt for some β∗,

π(Xi) = πβ∗(Xi) exp(ξ u(Xi;β∗)) (2.1)

where u(Xi;β∗) is a function determining the direction of misspecification and ξ represents the

magnitude of misspecification. We assume ξ = o(1) and $\xi^{-1}n^{-1/2} = O(1)$ as n → ∞, so that the

true propensity score π(Xi) is in a local neighborhood of the working model πβ∗(Xi). Under this

local model misspecification setting, we derive the asymptotic bias and variance of the CBPS-based

IPTW estimator in (1.5). The next theorem gives the expression of the asymptotic bias.

Theorem 2.1 (Asymptotic Bias under Local Misspecification). If the propensity score model is locally misspecified as in (2.1), then under Assumption A.1 in Appendix A, the bias of the IPTW estimator defined in (1.5) is given by E(µ̂β̂) − µ = Bξ + o(ξ), where

\[
B = E\left[\frac{u(X_i;\beta^*)\,\{K(X_i) + L(X_i)(1-\pi_{\beta^*}(X_i))\}}{1-\pi_{\beta^*}(X_i)}\right] + H_y^{*\top}\,(H_f^{*\top}W^*H_f^*)^{-1}H_f^{*\top}W^*\,E\left(\frac{u(X_i;\beta^*)\,f(X_i)}{1-\pi_{\beta^*}(X_i)}\right), \tag{2.2}
\]

W∗ is the limiting value of W in (1.7), and

\[
H_y^* = -E\left(\frac{K(X_i) + (1-\pi_{\beta^*}(X_i))L(X_i)}{\pi_{\beta^*}(X_i)(1-\pi_{\beta^*}(X_i))}\cdot\frac{\partial\pi_{\beta^*}(X_i)}{\partial\beta}\right),\qquad
H_f^* = -E\left(\frac{f(X_i)}{\pi_{\beta^*}(X_i)(1-\pi_{\beta^*}(X_i))}\left(\frac{\partial\pi_{\beta^*}(X_i)}{\partial\beta}\right)^{\!\top}\right).
\]

This theorem shows that the estimator µ̂β̂ has a first-order bias term Bξ under local model

misspecification. Although the expression of B looks sophisticated, the next corollary shows how

to choose the covariate balancing function to eliminate the first order bias.

Corollary 2.1 (Optimal Choice of Covariate Balancing Function under Local Misspecification).

Suppose that we choose the covariate balancing function f(X) such that α⊤f(X) = K(X) + (1 − πβ∗(X))L(X) holds, where α ∈ Rm is a vector of arbitrary constants. In addition, assume that the number of parameters is equal to the dimension of the covariate balancing function f(Xi), i.e., m = q. Then, under the conditions in Theorem 2.1, the IPTW estimator µ̂β̂ given in equation (1.5) is first-order unbiased, i.e., B = 0.

In the following, we consider the asymptotic bias of the IPTW estimator when the propensity

score model is arbitrarily misspecified. We first describe our results in a heuristic way. Assume that

the propensity score model is misspecified, i.e., P(Ti = 1 |Xi) = π(Xi) ≠ πβ(Xi) for any β ∈ Θ.

Let βo represent the limiting value of the CBPS estimator β, i.e., assuming βp−→ βo. Then, the

asymptotic bias of µβ

is given by E(µβo)− µ where µ is the true ATE defined in (1.3). The CBPS

method ensures that βo satisfies

\[
E(\bar g_{\beta^o}(T, X)) = E\left[\left(\frac{T_i}{\pi_{\beta^o}(X_i)} - \frac{1-T_i}{1-\pi_{\beta^o}(X_i)}\right) f(X_i)\right] = E\left[\frac{\pi(X_i) - \pi_{\beta^o}(X_i)}{\pi_{\beta^o}(X_i)\,(1-\pi_{\beta^o}(X_i))}\, f(X_i)\right] = 0. \tag{2.3}
\]

Note that E(µ̂βo) can be written as

\[
E(\hat\mu_{\beta^o}) = E\left[\frac{T_i Y_i}{\pi_{\beta^o}(X_i)} - \frac{(1-T_i)\,Y_i}{1-\pi_{\beta^o}(X_i)}\right] = E\left[\frac{T_i\,(K(X_i) + L(X_i))}{\pi_{\beta^o}(X_i)} - \frac{(1-T_i)\,K(X_i)}{1-\pi_{\beta^o}(X_i)}\right],
\]

where the second equality follows from the strong ignorability of treatment assignment and the law

of iterated expectation. Therefore, the asymptotic bias of the IPTW estimator is

\[
E(\hat\mu_{\beta^o}) - \mu = E\left[\left(\frac{T_i}{\pi_{\beta^o}(X_i)} - \frac{1-T_i}{1-\pi_{\beta^o}(X_i)}\right)\left\{K(X_i) + (1-\pi_{\beta^o}(X_i))L(X_i)\right\}\right].
\]

Thus, by equation (2.3), we can eliminate the asymptotic bias of µ̂β̂ under arbitrary model misspecification by choosing the covariate balancing function f(Xi) such that α⊤f(Xi) = K(Xi) + (1 − πβo(Xi))L(Xi) holds for some α ∈ Rm. In other words, µ̂β̂ remains consistent for µ even if the propensity score model is misspecified. Under the regularity conditions in Appendix A, this result can be proved by applying a similar argument to that in Theorem 3.1 of Section 3. For simplicity, we


refer the reader to the proof of Theorem 3.1 in Appendix C for details. Finally, we note that the choice of the

covariate balancing function under arbitrary model misspecification is the same as that under local

misspecification; hence, the result in Corollary 2.1 under local model misspecification extends to the setting of arbitrary misspecification.

2.2 Efficiency Consideration

We next study how the choice of different covariate balancing functions affects the efficiency of the

IPTW estimator. We first consider the case where the propensity score model is correctly specified.

We further show that the efficiency result also applies to the case of local misspecification studied

above.

Let β∗ be the true value of the parameter β in the propensity score model πβ(Xi). We first

derive the asymptotic distribution of the CBPS-based IPTW estimator under correctly specified

propensity score model. In this case, the estimator is asymptotically unbiased and follows the nor-

mal distribution regardless of the choice of covariate balancing function. The asymptotic variance

of the estimator, however, depends on this choice.

Theorem 2.2 (Asymptotic Properties under Correct Specification). Suppose that the propensity score model is correctly specified and β̂ is obtained through equation (1.7). Let µ be the true treatment effect and W∗ be the limiting value of W in (1.7). Let

\[
\mu_{\beta^*}(T_i, Y_i, X_i) = \frac{T_i Y_i}{\pi_{\beta^*}(X_i)} - \frac{(1-T_i)\,Y_i}{1-\pi_{\beta^*}(X_i)}.
\]

Under Assumption A.1 in Appendix A, µ̂β̂ satisfies

\[
\sqrt{n}(\hat\mu_{\hat\beta} - \mu) \xrightarrow{d} N(0,\ H^{*\top}\Sigma H^*), \tag{2.4}
\]

where H∗ = (1, H∗y⊤)⊤ and

\[
\Sigma = \begin{pmatrix} \Sigma_\mu & \Sigma_{\mu\beta}^\top \\ \Sigma_{\mu\beta} & \Sigma_\beta \end{pmatrix},
\]

with

\[
\Sigma_\mu = \mathrm{Var}(\mu_{\beta^*}(T_i, Y_i, X_i)) = E\left(\frac{Y_i(1)^2}{\pi_{\beta^*}(X_i)} + \frac{Y_i(0)^2}{1-\pi_{\beta^*}(X_i)}\right) - \left(E(Y_i(1)) - E(Y_i(0))\right)^2,
\]
\[
\Sigma_\beta = (H_f^{*\top}W^*H_f^*)^{-1}H_f^{*\top}W^*\,\mathrm{Var}(g_{\beta^*}(T_i,X_i))\,W^*H_f^*\,(H_f^{*\top}W^*H_f^*)^{-1},
\]
\[
\Sigma_{\mu\beta} = -(H_f^{*\top}W^*H_f^*)^{-1}H_f^{*\top}W^*\,\mathrm{Cov}(\mu_{\beta^*}(T_i,Y_i,X_i),\ g_{\beta^*}(T_i,X_i)),
\]

where gβ∗(Ti,Xi) is defined in (1.6). Under the model in equation (1.2), we have

\[
H_y^* = -E\left(\frac{K(X_i) + (1-\pi_{\beta^*}(X_i))L(X_i)}{\pi_{\beta^*}(X_i)(1-\pi_{\beta^*}(X_i))}\cdot\frac{\partial\pi_{\beta^*}(X_i)}{\partial\beta}\right),\qquad
H_f^* = -E\left(\frac{f(X_i)}{\pi_{\beta^*}(X_i)(1-\pi_{\beta^*}(X_i))}\left(\frac{\partial\pi_{\beta^*}(X_i)}{\partial\beta}\right)^{\!\top}\right),
\]
\[
\mathrm{Var}(g_{\beta^*}(T_i,X_i)) = E\left(\frac{f(X_i)\,f(X_i)^\top}{\pi_{\beta^*}(X_i)(1-\pi_{\beta^*}(X_i))}\right),
\]
\[
\mathrm{Cov}(\mu_{\beta^*}(T_i,Y_i,X_i),\ g_{\beta^*}(T_i,X_i)) = E\left[\frac{\{K(X_i) + (1-\pi_{\beta^*}(X_i))L(X_i)\}\,f(X_i)}{\pi_{\beta^*}(X_i)(1-\pi_{\beta^*}(X_i))}\right].
\]

To better understand this result, consider a special case where the dimension of the covariate

balancing function is equal to the number of parameters to be estimated, i.e., m = q (Imai and

Ratkovic, 2014). In this case, we can solve the optimization problem in (1.7) by setting W to a

diagonal matrix. Then, the asymptotic variance of µ̂β̂ is equal to

\[
\mathrm{Var}(\hat\mu_{\hat\beta}) \approx \mathrm{Var}(\hat\mu_{\beta^*}) + H_y^{*\top}H_f^{*-1}\,\mathrm{Var}(\bar g_{\beta^*}(T,X))\,(H_f^{*-1})^{\top}H_y^* - 2\,H_y^{*\top}H_f^{*-1}\,\mathrm{Cov}(\hat\mu_{\beta^*},\ \bar g_{\beta^*}(T,X)). \tag{2.5}
\]

The expression (2.5) contains three parts. The first term Var(µβ∗) is the variance of the estimator

under the true value β∗. The second term is the variance of the balancing equation ḡβ(T,X) at β∗, scaled by the quadratic form involving $H_y^{*\top}H_f^{*-1}$. The third term is the covariance between µ̂β∗ and ḡβ∗(T,X), scaled by $H_y^{*\top}H_f^{*-1}$.

Based on the asymptotic variance in equation (2.5), the next corollary shows that the optimal

choice of covariate balancing function derived before also results in an IPTW estimator that is

semiparametrically efficient.

Corollary 2.2 (Optimal Choice of Covariate Balancing Function under Correct Specification). Choose any covariate balancing function f(X) such that α⊤f(X) = K(X) + (1 − πβ∗(X))L(X) holds for some constant α ∈ Rm. In addition, assume that the number of parameters is equal to the dimension of the covariate balancing function f(Xi), i.e., m = q. Then, under the conditions in Theorem 2.2, the IPTW estimator µ̂β̂ given in equation (1.5) attains the semiparametric asymptotic variance bound in Theorem 1 of Hahn (1998), i.e.,

\[
V_{\mathrm{opt}} = E\left[\frac{\mathrm{Var}(Y_i(1)\,|\,X_i)}{\pi(X_i)} + \frac{\mathrm{Var}(Y_i(0)\,|\,X_i)}{1-\pi(X_i)} + (L(X_i) - \mu)^2\right]. \tag{2.6}
\]

We note that there may exist many choices of f(X) satisfying α⊤f(X) = K(X) + (1 − πβ∗(X))L(X) for some α ∈ Rm. This corollary implies that, provided f(X) satisfies the above condition, the asymptotic variance of our IPTW estimator does not depend on the particular choice of f(X) and is the smallest among the class of regular estimators.

Finally, we comment that this efficiency result under the correctly specified model also carries

over to the locally misspecified case examined earlier. In Appendix B.5, we show that under the


locally misspecified propensity score model in (2.1), the IPTW estimator µ̂β̂ satisfies

\[
\sqrt{n}(\hat\mu_{\hat\beta} - \mu) \xrightarrow{d} N(B,\ H^{*\top}\Sigma H^*),
\]

where B is the first-order bias given in equation (2.2) of Theorem 2.1. Thus, together with Corollary 2.1, the aforementioned choice of covariate balancing function, i.e., α⊤f(Xi) = K(Xi) + (1 − πβo(Xi))L(Xi), yields an asymptotically unbiased and efficient estimator of the ATE under local misspecification.

In summary, the theoretical analysis presented in this section has shown that the optimal

covariate balancing function f(Xi) for the CBPS methodology needs to satisfy α⊤f(Xi) = K(Xi) + (1 − πβo(Xi))L(Xi), which leads to an asymptotically unbiased and efficient estimator of the ATE

under various scenarios. Recall that K(Xi) is the conditional mean of the potential outcome under

the control group and L(Xi) is the conditional average treatment effect. Both K(·) and L(·) can be

estimated by imposing additional parametric assumptions. However, the optimal choice of f(Xi)

also requires the knowledge of the propensity score model πβo(Xi), where the limiting value βo

of β̂ depends on the choice of f(Xi) itself. Thus, to construct the optimal covariate balancing

function f(Xi), one needs some prior knowledge of f(Xi) to estimate the propensity score model in

(1.6). This “chicken-and-egg” relationship between the optimal covariate balancing function f(Xi)

and propensity score model πβo(Xi) makes the implementation of the existing CBPS method with

the optimal f(Xi) difficult in practice. To address this issue, in the following section we propose a

new IPTW estimator by generalizing the CBPS method, such that the optimal covariate balancing

function does not depend on the knowledge of the propensity score model πβo(Xi).

3 The Improved CBPS Methodology

Recall that the optimal covariate balancing function f(Xi) for the CBPS methodology needs to

satisfy α⊤f(Xi) = K(Xi) + (1 − πβo(Xi))L(Xi). Notice that the asymptotic bias of the IPTW

estimator with the optimal covariate balancing function f(Xi) can be decomposed into two terms

in the following manner,

\[
E(\hat\mu_{\beta^o}) - \mu = E\left[\left(\frac{T_i}{\pi_{\beta^o}(X_i)} - \frac{1-T_i}{1-\pi_{\beta^o}(X_i)}\right)\left\{K(X_i) + (1-\pi_{\beta^o}(X_i))L(X_i)\right\}\right]
= E\left[\left(\frac{T_i}{\pi_{\beta^o}(X_i)} - \frac{1-T_i}{1-\pi_{\beta^o}(X_i)}\right)K(X_i) + \left(\frac{T_i}{\pi_{\beta^o}(X_i)} - 1\right)L(X_i)\right]. \tag{3.1}
\]

Our main idea is to minimize the magnitudes of both terms to eliminate the bias of the IPTW

estimator. Motivated by this observation, we propose to balance the first term and second term

separately. This leads to the following set of estimating functions:

\[
\bar g_\beta(T, X) = \begin{pmatrix} \bar g_{1\beta}(T, X) \\ \bar g_{2\beta}(T, X) \end{pmatrix}, \tag{3.2}
\]

where $\bar g_{1\beta}(T,X) = n^{-1}\sum_{i=1}^{n} g_{1\beta}(T_i,X_i)$ and $\bar g_{2\beta}(T,X) = n^{-1}\sum_{i=1}^{n} g_{2\beta}(T_i,X_i)$ with

\[
g_{1\beta}(T_i, X_i) = \left(\frac{T_i}{\pi_\beta(X_i)} - \frac{1-T_i}{1-\pi_\beta(X_i)}\right)h_1(X_i), \qquad
g_{2\beta}(T_i, X_i) = \left(\frac{T_i}{\pi_\beta(X_i)} - 1\right)h_2(X_i),
\]


for some functions h1(·) : Rd → Rm1 and h2(·) : Rd → Rm2 with m1 + m2 = m. Note that, as

seen in the asymptotic bias of the IPTW estimator in (3.1), h1(Xi) in g1β(T ,X) aims to recover

the conditional mean function of the potential outcome under the control condition, i.e., K(Xi),

whereas h2(Xi) in g2β(T ,X) aims to recover the conditional mean function of the treatment effect,

i.e., L(Xi). It is easily seen that g1β(T ,X) is the same as the existing covariate balancing moment

function in (1.6), which balances the covariates h1(Xi) between the treatment and control groups.

More importantly, unlike the existing CBPS method, we introduce a new set of functions g2β(T ,X)

which matches the weighted covariates h2(Xi) in the treatment group to the unweighted covariates

h2(Xi) in the control group, because ḡ2β(T,X) = 0 can be rewritten as

\[
\sum_{i:\,T_i=1}\frac{1-\pi_\beta(X_i)}{\pi_\beta(X_i)}\,h_2(X_i) \;=\; \sum_{i:\,T_i=0} h_2(X_i).
\]

As seen in the derivation of (3.1), the auxiliary “covariate-imbalance” estimating function g2β(T ,X)

is the key to removing the dependence of the optimal covariate balancing function f(Xi) on the propensity score model πβo(Xi). Thus, our method is an improved version of the CBPS method (iCBPS). Given the estimating functions in (3.2), we can estimate β by the GMM estimator β̂ in (1.7). Similarly, the ATE is estimated by the IPTW estimator µ̂β̂ in (1.5). The implementation of the proposed iCBPS method (e.g., the choice of h1(·) and h2(·)) will be discussed in later sections.
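To make the construction concrete, a minimal Python sketch of the iCBPS estimating equations is given below (our illustration with hypothetical names; it assumes a logistic propensity model and m1 + m2 = q, so the system is just-identified and solving ḡβ(T,X) = 0 is equivalent to minimizing (1.7)):

import numpy as np
from scipy.optimize import least_squares

def icbps_fit(F, T, h1, h2):
    # F: n x q design matrix of the logistic propensity model;
    # h1: n x m1 basis for K(.); h2: n x m2 basis for L(.), with m1 + m2 = q.
    def estimating_eqs(beta):
        ps = 1.0 / (1.0 + np.exp(-F @ beta))
        # g1 balances h1 between the treatment and control groups.
        g1 = h1.T @ (T / ps - (1 - T) / (1 - ps)) / len(T)
        # g2 matches the weighted h2 of the treated to h2 over all units.
        g2 = h2.T @ (T / ps - 1.0) / len(T)
        return np.concatenate([g1, g2])

    return least_squares(estimating_eqs, x0=np.zeros(F.shape[1])).x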

3.1 Theoretical Properties

We now derive the theoretical properties of the IPTW estimator given in (1.5) based on the proposed

iCBPS method. In particular, we will show that the proposed estimator is doubly robust and

semiparametrically efficient. The following set of assumptions is imposed for the establishment of

double robustness.

Assumption 3.1. The following regularity conditions are assumed.

1. There exists a positive definite matrix W∗ such that W →p W∗.

2. The minimizer βo = argminβ∈Θ E(ḡβ(T,X))⊤ W∗ E(ḡβ(T,X)) is unique.

3. βo ∈ int(Θ), where Θ is a compact set.

4. πβ(X) is continuous in β.

5. There exists a constant 0 < c0 < 1/2 such that, with probability tending to one, c0 ≤ πβ(X) ≤ 1 − c0 for any β ∈ int(Θ).

6. E|Y(1)|² < ∞ and E|Y(0)|² < ∞.

7. G∗ := E(∂g(βo)/∂β) exists, where g(β) = (ḡ1β(T,X)⊤, ḡ2β(T,X)⊤)⊤, and there is a q-dimensional function C(X) and a small constant r > 0 such that sup_{β∈Br(βo)} |∂πβ(X)/∂βk| ≤ Ck(X) for 1 ≤ k ≤ q, E(|h1j(X)|Ck(X)) < ∞ for 1 ≤ j ≤ m1, 1 ≤ k ≤ q, and E(|h2j(X)|Ck(X)) < ∞ for 1 ≤ j ≤ m2, 1 ≤ k ≤ q, where Br(βo) is a ball in Rq with radius r and center βo.


Conditions 1-4 of Assumption 3.1 are the standard conditions for consistency of the GMM

estimator (Newey and McFadden, 1994). Condition 5 is commonly used in the missing data and

causal inference literature; it essentially says that the propensity score cannot be too close to 0 and

1 (Robins et al., 1994, 1995). Conditions 6-7 are technical conditions that enable us to apply the

dominated convergence theorem. Note that supβ∈Br(βo) |∂πβ(X)/∂βk| ≤ Ck(X) in Condition 7 is a local condition in the sense that it only requires the existence of an envelope function C(X) in a small neighborhood of βo.

We now establish the double robustness of the proposed estimator under Assumption 3.1.

Theorem 3.1 (Double Robustness). Under Assumption 3.1, the proposed iCBPS-based IPTW estimator µ̂β̂ is doubly robust. That is, µ̂β̂ →p µ if at least one of the following two conditions holds:

1. The propensity score model is correctly specified, i.e., P(Ti = 1 |Xi) = πβ∗(Xi);

2. The functions K(·) and L(·) lie in the linear spaces spanned by the functions M1h1(·) and M2h2(·), respectively, where M1 ∈ Rq×m1 and M2 ∈ Rq×m2 are the partitions of G∗⊤W∗ = (M1, M2). That is, K(·) ∈ span{M1h1(·)} and L(·) ∈ span{M2h2(·)}.

Theorem 3.1 implies that the proposed estimator is consistent if either the propensity score

model or the outcome model is correctly specified. In particular, the second condition can be

rewritten as K(Xi) = α⊤1 M1h1(Xi) and L(Xi) = α⊤2 M2h2(Xi) for some vectors α1, α2 ∈ Rq. Hence, the functions h1(·) and h2(·) play very different roles in the proposed iCBPS methodology. Specifically, M1h1(·) serves as the basis functions for the conditional baseline effect K(·), while M2h2(·) represents the basis functions for the conditional treatment effect L(·).

Next, we establish the asymptotic normality of the proposed estimator if either the propensity

score model or the outcome model is correctly specified. For this result, we require an additional

set of regularity conditions.

Assumption 3.2. The following regularity conditions are assumed.

1. G∗⊤W∗G∗ and Ω = E(gβo(Ti,Xi) gβo(Ti,Xi)⊤) are nonsingular.

2. The function C(X) defined in Condition 7 of Assumption 3.1 satisfies E(|Y(0)|Ck(X)) < ∞ and E(|Y(1)|Ck(X)) < ∞ for 1 ≤ k ≤ q.

Condition 1 of Assumption 3.2 ensures the non-singularity of the asymptotic variance matrix

and Condition 2 is a mild technical condition required for the dominated convergence theorem.

Theorem 3.2 (Asymptotic Normality). Suppose that Assumptions 3.1 and 3.2 hold.

1. If Condition 1 of Theorem 3.1 holds, then the proposed iCBPS-based IPTW estimator µ̂β̂ has the following asymptotic distribution:

\[
\sqrt{n}(\hat\mu_{\hat\beta} - \mu) \xrightarrow{d} N(0,\ \bar H^{*\top}\Sigma\bar H^*), \tag{3.3}
\]

where H̄∗ = (1, H∗⊤)⊤, Σβ = (G∗⊤W∗G∗)⁻¹G∗⊤W∗ΩW∗G∗(G∗⊤W∗G∗)⁻¹, and

\[
H^* = -E\left(\frac{K(X_i) + (1-\pi_{\beta^*}(X_i))L(X_i)}{\pi_{\beta^*}(X_i)(1-\pi_{\beta^*}(X_i))}\cdot\frac{\partial\pi_{\beta^*}(X_i)}{\partial\beta}\right),
\]
\[
\Sigma = \begin{pmatrix} \Sigma_\mu & \Sigma_{\mu\beta}^\top \\ \Sigma_{\mu\beta} & \Sigma_\beta \end{pmatrix}, \quad\text{with } \Sigma_\mu = E\left(\frac{Y_i^2(1)}{\pi_{\beta^*}(X_i)} + \frac{Y_i^2(0)}{1-\pi_{\beta^*}(X_i)}\right) - \mu^2. \tag{3.4}
\]

In addition, Σµβ is given by

\[
\Sigma_{\mu\beta} = -(G^{*\top}W^*G^*)^{-1}G^{*\top}W^*\,E\left(\frac{K(X_i) + (1-\pi_{\beta^*}(X_i))L(X_i)}{(1-\pi_{\beta^*}(X_i))\,\pi_{\beta^*}(X_i)}\,h(X_i)\right),
\]

where h(Xi) = (h1(Xi)⊤, h2(Xi)⊤)⊤.

2. If Condition 2 of Theorem 3.1 holds, then the proposed iCBPS-based IPTW estimator µ̂β̂ has the following asymptotic distribution:

\[
\sqrt{n}(\hat\mu_{\hat\beta} - \mu) \xrightarrow{d} N(0,\ \bar H^{*\top}\Sigma\bar H^*), \tag{3.5}
\]

where H̄∗ = (1, H∗⊤)⊤, Σβ = (G∗⊤W∗G∗)⁻¹G∗⊤W∗ΩW∗G∗(G∗⊤W∗G∗)⁻¹,

\[
H^* = -E\left[\left\{\frac{\pi(X_i)\,(K(X_i)+L(X_i))}{\pi_{\beta^o}(X_i)^2} + \frac{(1-\pi(X_i))\,K(X_i)}{(1-\pi_{\beta^o}(X_i))^2}\right\}\frac{\partial\pi_{\beta^o}(X_i)}{\partial\beta}\right],
\]
\[
\Sigma = \begin{pmatrix} \Sigma_\mu & \Sigma_{\mu\beta}^\top \\ \Sigma_{\mu\beta} & \Sigma_\beta \end{pmatrix} \quad\text{with } \Sigma_\mu = E\left(\frac{\pi(X_i)\,Y_i^2(1)}{\pi_{\beta^o}(X_i)^2} + \frac{(1-\pi(X_i))\,Y_i^2(0)}{(1-\pi_{\beta^o}(X_i))^2}\right) - \mu^2.
\]

In addition, Σµβ is given by

\[
\Sigma_{\mu\beta} = -(G^{*\top}W^*G^*)^{-1}G^{*\top}W^*S,
\]

where S = (S1⊤, S2⊤)⊤ and

\[
S_1 = E\left[\left\{\frac{\pi(X_i)\,(K(X_i)+L(X_i)-\pi_{\beta^o}(X_i)\mu)}{\pi_{\beta^o}(X_i)^2} + \frac{(1-\pi(X_i))\,(K(X_i)+(1-\pi_{\beta^o}(X_i))\mu)}{(1-\pi_{\beta^o}(X_i))^2}\right\} h_1(X_i)\right],
\]
\[
S_2 = E\left[\left\{\frac{\pi(X_i)\,\{(K(X_i)+L(X_i))(1-\pi_{\beta^o}(X_i))-\pi_{\beta^o}(X_i)\mu\}}{\pi_{\beta^o}(X_i)^2} + \frac{(1-\pi(X_i))\,\{K(X_i)+(1-\pi_{\beta^o}(X_i))\mu\}}{1-\pi_{\beta^o}(X_i)}\right\} h_2(X_i)\right].
\]

3. If both Conditions 1 and 2 of Theorem 3.1 hold and W∗ = Ω⁻¹, then the proposed iCBPS-based IPTW estimator µ̂β̂ has the following asymptotic distribution:

\[
\sqrt{n}(\hat\mu_{\hat\beta} - \mu) \xrightarrow{d} N(0, V),
\]

where

\[
V = \Sigma_\mu - (\alpha_1^\top M_1,\ \alpha_2^\top M_2)\, G^*(G^{*\top}\Omega^{-1}G^*)^{-1}G^{*\top}\begin{pmatrix} M_1^\top\alpha_1 \\ M_2^\top\alpha_2 \end{pmatrix} \tag{3.6}
\]

and Σµ is defined in (3.4).

The asymptotic variance V in (3.6) contains two terms. The first term Σµ represents the

variance of each summand in the estimator defined in equation (1.5) with β replaced by its true

value β∗. The second term can be interpreted as the effect of estimating β∗ via covariate balancing

conditions. Since this second term is nonnegative, the proposed estimator is more efficient than

the standard IPTW estimator with the true propensity score model, i.e., V ≤ Σµ. In particular,

Henmi and Eguchi (2004) offer a theoretical analysis of such efficiency gain due to the estimation

of nuisance parameters under a general estimating equation framework.

Since the choice of h1(·) and h2(·) can be arbitrary, it might be tempting to incorporate additional covariate balancing conditions into h1(·) and h2(·). However, the following corollary shows that when both the propensity score and outcome models are correctly specified, one cannot improve the efficiency of the proposed estimator by increasing the number of functions in h1(·) and h2(·), or equivalently, the dimensionality of the covariate balancing conditions ḡ1β(T,X) and ḡ2β(T,X).

Corollary 3.1. Define h̃1(X) = (h1(X)⊤, a1(X)⊤)⊤ and h̃2(X) = (h2(X)⊤, a2(X)⊤)⊤, where a1(·) and a2(·) are some additional covariate balancing functions. Similarly, let g̃1(X) and g̃2(X) denote the corresponding estimating equations defined by h̃1(X) and h̃2(X). The resulting iCBPS-based IPTW estimator is denoted by µ̃β̃, where β̃ is as in (1.7), and its asymptotic variance is denoted by Ṽ. Under the conditions in part 3 of Theorem 3.2, we have V ≤ Ṽ, where V is defined in (3.6).

The above corollary shows a potential trade-off between robustness and efficiency when choosing

h1(·) and h2(·). Recall that if rank(M1) = m1 and rank(M2) = m2, Condition 2 of Theorem 3.1 can be written as K(·) ∈ span{h1(·)} and L(·) ∈ span{h2(·)}. Therefore, we can make the proposed

estimator more robust by incorporating more basis functions into h1(·) and h2(·), such that this

condition is more likely to hold. However, Corollary 3.1 shows that doing so may inflate the variance

of the proposed estimator.

In the following, we focus on the efficiency of the estimator. Using the notation in this section, we can rewrite the semiparametric asymptotic variance bound Vopt in (2.6) as

\[
V_{\mathrm{opt}} = \Sigma_\mu - (\alpha_1^\top M_1,\ \alpha_2^\top M_2)\,\Omega\begin{pmatrix} M_1^\top\alpha_1 \\ M_2^\top\alpha_2 \end{pmatrix}. \tag{3.7}
\]

Comparing this expression with (3.6), we see that the proposed estimator is semiparametrically

efficient if G∗ is a square matrix (i.e., m = q) and invertible. In this special case, the dimension

of β must be identical to that of the covariate balancing functions gβ(T ,X). We can accomplish

this by over-parameterizing the propensity score model (e.g., including higher-order terms and

interactions). This important result is summarized as the following corollary.

Corollary 3.2. Assume m = q and G∗ is invertible. Under Assumption 3.1, the proposed estimator µ̂β̂ in (1.5) is doubly robust in the sense that µ̂β̂ →p µ if either of the following conditions holds:

1. The propensity score model is correctly specified. That is, P(Ti = 1 |Xi) = πβ∗(Xi).

2. The functions K(·) and L(·) lie in the linear spaces spanned by the functions h1(·) and h2(·), respectively. That is, K(·) ∈ span{h1(·)} and L(·) ∈ span{h2(·)}.

In addition, under Assumption 3.2, if both conditions hold, then the proposed estimator is semiparametrically efficient with the asymptotic variance given in (3.7).

This corollary shows that the proposed estimator has two advantages over the existing CBPS

estimator in Imai and Ratkovic (2014), which balances the first (and second) moments of Xi and/or the score function of the propensity score model. First, the proposed estimator µ̂β̂ is doubly robust to

model misspecification, whereas the existing CBPS estimator does not have this property. Second,

the estimator can be more efficient than the existing CBPS estimator.

The result in Corollary 3.2 implies that the asymptotic variance of µ̂β̂ is identical to the semiparametric variance bound Vopt even if we incorporate additional covariate balancing functions into h1(·) and h2(·). Namely, under the conditions in Corollary 3.2, we have V = Ṽ = Vopt in

the context of Corollary 3.1. Thus, in this setting, we can improve the robustness of the estimator

without sacrificing the efficiency by increasing the number of functions h1(·) and h2(·). Meanwhile,

this also makes the propensity score model more flexible, since we need to increase the number of

parameters β to ensure m = q as required in Corollary 3.2. This observation further motivates us

to consider a nonparametric/semiparametric approach to improve the iCBPS method, which will

be shown in Section 4.

Remark 3.1. Robins et al. (1994) propose the following estimator of the ATE with the double

robustness and semiparametric efficiency properties

\[
\hat\mu_{\beta,\alpha,\gamma} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{T_iY_i}{\pi_\beta(X_i)} - \frac{(1-T_i)\,Y_i}{1-\pi_\beta(X_i)} - (T_i - \pi_\beta(X_i))\left(\frac{K(X_i,\alpha) + L(X_i,\gamma)}{\pi_\beta(X_i)} + \frac{K(X_i,\alpha)}{1-\pi_\beta(X_i)}\right)\right],
\]

where K(Xi,α) and L(Xi,γ) are some parametric models with finite dimensional parameters α

and γ. Unlike the projection approach behind this classical doubly robust estimator µ̂β,α,γ (Tsiatis, 2007), the proposed iCBPS-based IPTW estimator µ̂β̂ is constructed by a new covariate balancing method. From a practical perspective, one needs to plug consistent estimators of (α, γ) and β (e.g., usually the MLE) into µ̂β,α,γ to estimate the treatment effect. In contrast, the proposed method is easier to implement because it avoids the estimation of the parameters in the conditional mean models

K(Xi,α) and L(Xi,γ).
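For comparison, a minimal Python sketch of this classical doubly robust estimator is given below (our illustration; the linear and logistic fits stand in for the working models K(Xi, α̂), L(Xi, γ̂), and πβ̂(Xi), and the name aipw_ate is ours):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_ate(X, T, Y):
    # Plug-in fits: propensity score and the two conditional mean models.
    ps = LogisticRegression(C=1e6, max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
    m1 = LinearRegression().fit(X[T == 1], Y[T == 1]).predict(X)  # K + L
    m0 = LinearRegression().fit(X[T == 0], Y[T == 0]).predict(X)  # K
    # Doubly robust estimator of Robins et al. (1994) displayed above.
    return np.mean(T * Y / ps - (1 - T) * Y / (1 - ps)
                   - (T - ps) * (m1 / ps + m0 / (1 - ps)))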

Remark 3.2 (Implementation of iCBPS). Based on Corollary 3.2, h1(·) serves as the basis

functions for the baseline conditional mean function K(·), while h2(·) represents the basis functions

for the conditional average treatment effect function L(·). Thus, in practice, researchers can choose

a set of basis functions for the baseline conditional mean function and the conditional average treatment effect function when determining the specification of h1(·) and h2(·). Once these functions

are selected, they can over-parameterize the propensity score model such that m = q holds. The

resulting iCBPS-based IPTW estimator may reduce bias under model misspecification and attain

high efficiency.


Table 3.1: Correct Outcome Model with Correct Propensity Score Model.

                          n = 300                          n = 1000
       β1         0     0.33    0.67       1        0     0.33    0.67       1
Bias   True    0.19    2.48   −6.14   −1.97    −0.42     1.77   −1.71    3.39
       GLM     0.77    1.13   −4.80  −15.61     0.64     0.46    0.08   −2.61
       CBPS    0.70    0.59   −1.17   −5.28     0.68     0.65    0.03   −2.02
       iCBPS   0.61    0.63   −0.86   −4.50     0.70     0.73    0.23   −1.28
Std    True   27.17   39.61   84.09  213.13    15.53    21.58   45.50  228.89
Dev    GLM     5.52   17.69   65.45  163.59     2.44     8.06   34.77  108.03
       CBPS    3.36    3.66    4.54    7.40     1.79     1.96    2.36    3.84
       iCBPS   3.20    3.26    3.38    5.12     1.72     1.80    1.85    2.50
RMSE   True   27.17   39.69   84.31  213.14    15.54    21.65   45.53  228.91
       GLM     5.58   17.73   65.63  164.34     2.52     8.07   34.77  108.07
       CBPS    3.43    3.71    4.69    9.09     1.91     2.06    2.36    4.34
       iCBPS   3.26    3.32    3.49    6.82     1.86     1.94    1.87    2.80

3.2 Simulation Studies

In this section, we conduct a set of simulation studies to examine the performance of the proposed

methodology. We consider the following linear model for the potential outcomes,

\[
Y_i(1) = 200 + 27.4X_{i1} + 13.7X_{i2} + 13.7X_{i3} + 13.7X_{i4} + \varepsilon_i,
\qquad
Y_i(0) = 200 + 13.7X_{i2} + 13.7X_{i3} + 13.7X_{i4} + \varepsilon_i,
\]

where εi ∼ N(0, 1) is independent of Xi, and consider the following true propensity score model,

\[
P(T_i = 1\,|\,X_i = x_i) = \frac{\exp(-\beta_1 x_{i1} + 0.5x_{i2} - 0.25x_{i3} - 0.1x_{i4})}{1 + \exp(-\beta_1 x_{i1} + 0.5x_{i2} - 0.25x_{i3} - 0.1x_{i4})}, \tag{3.8}
\]

where β1 varies from 0 to 1. When implementing the proposed methodology, we set h1(xi) =

(1, xi2, xi3, xi4) and h2(xi) = xi1, so that the number of equations is equal to the number of parameters to be estimated. Each covariate is generated independently from N(3, 2). Each set of results

is based on 1,000 Monte Carlo simulations.
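The data-generating process above can be reproduced with a few lines of Python (a sketch under our reading of the design; in particular, we interpret N(3, 2) as mean 3 and variance 2, and all names are ours):

import numpy as np

def simulate(n, beta1, seed=0):
    rng = np.random.default_rng(seed)
    # Four covariates, each generated independently from N(3, 2).
    X = rng.normal(3.0, np.sqrt(2.0), size=(n, 4))
    eps = rng.normal(size=n)
    y1 = 200 + 27.4 * X[:, 0] + 13.7 * (X[:, 1] + X[:, 2] + X[:, 3]) + eps
    y0 = 200 + 13.7 * (X[:, 1] + X[:, 2] + X[:, 3]) + eps
    # True propensity score model (3.8).
    lin = -beta1 * X[:, 0] + 0.5 * X[:, 1] - 0.25 * X[:, 2] - 0.1 * X[:, 3]
    T = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin)))
    return X, T, np.where(T == 1, y1, y0)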

We examine the performance of the IPTW estimator when the propensity score model is fitted by maximum likelihood (GLM), by the standard CBPS balancing the first moment (CBPS), and by the proposed improvement of CBPS (iCBPS), as well as the case where the true propensity score

(True), i.e., β = β∗, is used for the IPTW estimator. In the first set of simulations, we use a correctly specified propensity score model. Table 3.1 shows the bias, standard deviation, and root mean square error (RMSE) of these estimators for sample sizes n = 300 and n = 1000. We find that

CBPS and iCBPS substantially outperform both True and GLM in terms of efficiency. In addition,

iCBPS is more efficient than CBPS in all cases though both have similar biases in most cases. These

findings are consistent with Corollary 3.2.


Table 3.2: Correct Outcome Model with Misspecified Propensity Score Model.

                          n = 300                          n = 1000
       β1         0     0.33    0.67       1        0     0.33    0.67       1
Bias   True    0.41   −0.38   −2.28   −2.57     0.46     0.02    1.03   −6.59
       GLM     1.76  −24.12  −57.51  −72.73     0.93   −25.34  −53.49  −74.57
       CBPS    1.77   −0.94   −4.73   −6.69     0.74    −2.51   −6.41   −8.32
       iCBPS   0.50    0.41   −0.63   −1.95     0.63     0.30   −0.60   −1.87
Std    True   47.60   41.63   70.40  310.96    25.80    21.12   54.64   90.80
Dev    GLM     9.72   26.53  161.18   48.66     2.67    13.65   23.74   30.76
       CBPS    4.14    4.15    4.84    5.36     1.79     2.08    2.76    3.45
       iCBPS   3.42    3.11    3.18    3.20     1.75     1.76    1.82    1.74
RMSE   True   47.60   41.64   70.44  310.97    25.80    21.12   54.65   91.04
       GLM     9.88   35.86  171.13   87.50     2.83    28.78   58.52   80.67
       CBPS    4.50    4.26    6.77    8.57     1.94     3.26    6.97    9.01
       iCBPS   3.45    3.14    3.25    3.75     1.86     1.78    1.92    2.55

We further evaluate our method by considering different cases of misspecification for the outcome

and propensity score models. We begin with the case where the outcome model is linear as before but the propensity score model is misspecified. While we use the model given in equation (3.8)

when estimating the propensity score, the actual treatment is generated according to the following

different model,

\[
P(T_i = 1\,|\,X_i = x_i) = \frac{\exp(-\beta_1 x^*_{i1} + 0.5x^*_{i2} - 0.25x^*_{i3} - 0.1x^*_{i4})}{1 + \exp(-\beta_1 x^*_{i1} + 0.5x^*_{i2} - 0.25x^*_{i3} - 0.1x^*_{i4})},
\]

with x∗i1 = exp(xi1/3), x∗i2 = xi2/{1 + exp(xi1)} + 10, x∗i3 = xi1xi3/25 + 0.6, and x∗i4 = xi1 + xi4 + 20,

where β1 again varies from 0 to 1. In other words, the model misspecification is introduced using

nonlinear transformations. Table 3.2 shows the results for this case. As expected from the double

robustness property shown in Theorem 3.1, we find that the bias of iCBPS is significantly smaller than that of CBPS and GLM. iCBPS also dominates the other estimators in terms of efficiency.

We next examine the cases where the outcome model is misspecified. We do this by generating

potential outcomes from the following quadratic model

\[
E(Y_i(1)\,|\,X_i = x_i) = 200 + 27.4x_{i1}^2 + 13.7x_{i2}^2 + 13.7x_{i3}^2 + 13.7x_{i4}^2,
\qquad
E(Y_i(0)\,|\,X_i = x_i) = 200 + 13.7x_{i2}^2 + 13.7x_{i3}^2 + 13.7x_{i4}^2,
\]

whereas the propensity score model is the same as the one in (3.8) with β1 varying from 0 to 0.4.

Table 3.3 shows the results when the outcome model is misspecified but the propensity score model

is correct. We find that the magnitude of bias is similar across all estimators though iCBPS has

the least bias. CBPS and iCBPS again dominate GLM in terms of efficiency with iCBPS having the

smallest standard deviation. Finally, we consider the case where both the outcome and propensity

score models are misspecified. Table 3.4 shows the results. We find that iCBPS outperforms all the

other estimators in both bias and efficiency.


Table 3.3: Misspecified Outcome Model with Correct Propensity Score Model.

                          n = 300                          n = 1000
       β1         0     0.13    0.27     0.4        0     0.13    0.27     0.4
Bias   True    1.11    4.65    1.19    0.75     2.33     3.46    2.36    1.91
       GLM     2.09    2.76    2.02   −0.17     2.72     2.35    2.69    2.21
       CBPS    2.18    2.46    1.27   −0.98     2.64     2.30    2.38    1.57
       iCBPS   2.07    2.22    1.08   −0.70     2.57     2.28    2.38    1.56
Std    True   38.57   42.90   47.86   57.09    21.35    22.57   27.24   31.66
Dev    GLM    14.36   17.72   23.99   32.35     7.41     8.67   12.53   17.82
       CBPS   12.12   13.68   15.11   15.76     6.52     6.94    8.32    9.44
       iCBPS  11.81   12.64   13.50   13.64     6.37     6.42    7.27    7.97
RMSE   True   38.58   43.15   47.88   57.10    21.48    22.84   27.34   31.72
       GLM    14.51   17.94   24.08   32.35     7.89     8.98   12.82   17.95
       CBPS   12.32   13.90   15.16   15.79     7.03     7.31    8.65    9.57
       iCBPS  11.99   12.83   13.54   13.66     6.87     6.82    7.65    8.12

Table 3.4: Misspecified Outcome Model with Misspecified Propensity Score Model.

                          n = 300                          n = 1000
       β1         0     0.13    0.27     0.4        0     0.13    0.27     0.4
Bias   True    1.23    1.63    1.19    0.13     0.19     1.51    0.47    1.63
       GLM     3.73    1.54   −1.58   −7.16     1.63    −0.01   −3.26   −8.46
       CBPS    8.32    6.35    4.23    1.36     1.89     0.75   −0.70   −3.17
       iCBPS   4.11    2.28    1.18   −0.32     1.45     0.50   −0.32   −1.53
Std    True   60.98   57.09   52.36   50.10    33.47    30.57   27.66   28.41
Dev    GLM    23.20   19.45   19.53   20.52     7.71     8.05    8.46    9.70
       CBPS   18.29   16.66   14.76   14.64     6.85     7.04    6.83    6.67
       iCBPS  14.05   12.27   11.95   11.64     6.69     6.77    6.49    6.14
RMSE   True   60.99   57.11   52.37   50.10    33.47    30.61   27.67   28.46
       GLM    23.50   19.51   19.59   21.73     7.88     8.05    9.07   12.87
       CBPS   20.10   17.83   15.35   14.70     7.11     7.08    6.87    7.38
       iCBPS  14.64   12.48   12.01   11.64     6.84     6.79    6.50    6.33

4 Nonparametric/Semiparametric iCBPS Method

As seen in Corollary 3.2, the proposed estimator is efficient if both the propensity score P(Ti =

1 | Xi) and the conditional mean functions K(·) and L(·) are correctly specified. To avoid model

misspecification, we can choose a large number of basis functions h1(·) and h2(·), such that the

conditional mean functions K(·) and L(·) can be well approximated by the linear spans of h1(·) and


h2(·), respectively. Therefore, the correct specification of K(·) and L(·) can be attained. However,

the parametric assumption on the propensity score model P(Ti = 1 | Xi) = πβ∗(Xi) imposed

in Corollary 3.2 can be restrictive and prone to violation. Once the propensity score model is misspecified, the proposed iCBPS-based IPTW estimator µ̂β̂ can be inefficient and even inconsistent.

To relax the strong parametric assumptions imposed in the previous sections, in what follows, we

propose a flexible nonparametric/semiparametric approach for modeling the propensity score and

the conditional mean functions. The main advantage of this nonparametric approach is that the resulting iCBPS-based IPTW estimator is semiparametrically efficient under a much broader class of propensity score models and conditional mean models than those covered by Corollary 3.2.

More specifically, we assume P(Ti = 1 |Xi) = J(m∗(Xi)), where J(·) is a known monotonic link

function (e.g., J(·) = exp(·)/(1 + exp(·))), and m∗(·) is an unknown smooth function. One practical way to estimate m∗(·) is to approximate it by a linear combination of K basis functions, where K is allowed to grow with n. This approach is known as sieve estimation (Andrews, 1991; Newey, 1997). In detail, let B(x) = (b1(x), ..., bK(x))⊤ denote a collection of K basis functions, whose mathematical requirements are given in Assumption 4.1. Intuitively, we would like to approximate m∗(x) by β∗⊤B(x) for some coefficient vector β∗ ∈ RK.

To estimate β∗, similarly to the parametric case, we define $\bar g_\beta(T,X) = n^{-1}\sum_{i=1}^{n} g_\beta(T_i,X_i)$, where gβ(Ti,Xi) = (g1β(Ti,Xi)⊤, g2β(Ti,Xi)⊤)⊤ with

\[
g_{1\beta}(T_i,X_i) = \left(\frac{T_i}{J(\beta^\top B(X_i))} - \frac{1-T_i}{1-J(\beta^\top B(X_i))}\right)h_1(X_i), \qquad
g_{2\beta}(T_i,X_i) = \left(\frac{T_i}{J(\beta^\top B(X_i))} - 1\right)h_2(X_i).
\]

Recall that h1(X) ∈ Rm1 and h2(X) ∈ Rm2 are interpreted as the basis functions for K(X) and

L(X). Let m1 + m2 = m and h(X) = (h1(X)⊤, h2(X)⊤)⊤. Here, we assume m = K, so that the number of equations in ḡβ(T,X) is identical to the dimension of the parameter β. Then define $\hat\beta = \operatorname*{argmin}_{\beta\in\Theta}\|\bar g_\beta(T,X)\|_2^2$, where Θ is the parameter space for β and ‖v‖₂ represents the L₂ norm of the vector v. The resulting IPTW estimator is

\[
\hat\mu_{\hat\beta} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{T_iY_i}{J(\hat\beta^\top B(X_i))} - \frac{(1-T_i)\,Y_i}{1-J(\hat\beta^\top B(X_i))}\right).
\]
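Computationally, the sieve version is identical to the parametric iCBPS solver, only with growing spline design matrices. Here is a Python sketch under our assumptions (B-spline bases via scikit-learn's SplineTransformer, a logistic link J, and the simplification of splitting one basis expansion into h1 and h2 so that m = K; the name sieve_icbps_ate is ours):

import numpy as np
from scipy.optimize import least_squares
from sklearn.preprocessing import SplineTransformer

def sieve_icbps_ate(X, T, Y, n_knots=5):
    # Sieve basis B(x): B-splines per covariate plus an intercept;
    # K grows with n_knots.
    B = SplineTransformer(n_knots=n_knots, degree=3).fit_transform(X)
    B = np.column_stack([np.ones(len(T)), B])
    # Split the same expansion into h1 (for K) and h2 (for L), so m = K.
    m1 = B.shape[1] // 2
    h1, h2 = B[:, :m1], B[:, m1:]

    def eqs(beta):
        ps = 1.0 / (1.0 + np.exp(-B @ beta))   # J(beta' B(x)) with logistic J
        g1 = h1.T @ (T / ps - (1 - T) / (1 - ps)) / len(T)
        g2 = h2.T @ (T / ps - 1.0) / len(T)
        return np.concatenate([g1, g2])

    beta = least_squares(eqs, x0=np.zeros(B.shape[1])).x
    ps = 1.0 / (1.0 + np.exp(-B @ beta))
    return np.mean(T * Y / ps - (1 - T) * Y / (1 - ps))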

The following assumptions are imposed to establish the large-sample properties of µ̂β̂.

Assumption 4.1. The following regularity conditions are assumed.

1. The minimizer βo = argminβ∈Θ ‖E(ḡβ(T,X))‖²₂ is unique.

2. βo ∈ int(Θ), where Θ is a compact set.

3. There exist constants 0 < c0 < 1/2, c1 > 0, and c2 > 0 such that c0 ≤ J(v) ≤ 1 − c0 and 0 < c1 ≤ ∂J(v)/∂v ≤ c2 for any v = β⊤B(x) with β ∈ int(Θ). There exists a small neighborhood B of v∗ = β∗⊤B(x) such that for any v ∈ B it holds that |∂²J(v)/∂v²| ≤ c3 for some constant c3 > 0.

4. E|Y(1)|² < ∞ and E|Y(0)|² < ∞.

5. Let G∗ := E[B(Xi)h(Xi)⊤∆i(m∗(Xi))], where ∆i(m(Xi)) = diag(ξi(m(Xi))1m1, φi(m(Xi))1m2) is a K × K diagonal matrix with

\[
\xi_i(m(X_i)) = -\left(\frac{T_i}{J^2(m(X_i))} + \frac{1-T_i}{(1-J(m(X_i)))^2}\right)\frac{\partial J(m(X_i))}{\partial m}, \qquad
\phi_i(m(X_i)) = -\frac{T_i}{J^2(m(X_i))}\cdot\frac{\partial J(m(X_i))}{\partial m}.
\]

Here, 1m1 is a vector of 1's of length m1. Assume that there exists a constant C1 > 0 such that λmin(G∗⊤G∗) ≥ C1, where λmin(·) denotes the smallest eigenvalue of a matrix.

6. For some constant C, it holds that ‖E[h(Xi)h(Xi)⊤]‖₂ ≤ C and ‖E[B(Xi)B(Xi)⊤]‖₂ ≤ C, where ‖A‖₂ denotes the spectral norm of the matrix A. In addition, supx∈X ‖h(x)‖₂ ≤ CK^{1/2} and supx∈X ‖B(x)‖₂ ≤ CK^{1/2}.

7. Let m∗(·) ∈ M and K(·), L(·) ∈ H, where M and H are two sets of smooth functions. Assume that log N[](ε, M, L2(P)) ≤ C(1/ε)^{1/k1} and log N[](ε, H, L2(P)) ≤ C(1/ε)^{1/k2}, where C is a positive constant and k1, k2 > 1/2. Here, N[](ε, M, L2(P)) denotes the minimum number of ε-brackets needed to cover M; see Definition 2.1.6 of van der Vaart and Wellner (1996).

Note that the first five conditions are similar to Assumptions 3.1 and 3.2. In particular, condition

5 is the natural extension of condition 1 in Assumption 3.2, when the dimension of the matrix G∗

grows with the sample size n. Condition 6 is a mild technical condition on the basis functions

h(x) and B(x), which is implied by Assumption 2 of Newey (1997). In particular, this condition

is satisfied by many bases such as regression splines, trigonometric polynomials, and wavelet bases;

see Newey (1997); Horowitz et al. (2004); Chen (2007); Belloni et al. (2015). Finally, condition 7 is

a technical condition on the complexity of the function classes M and H. Specifically, it requires

that the bracketing numbers N[](ε, ·, L2(P)) of M and H cannot increase too fast as ε approaches 0. This condition holds for many commonly used function classes. For instance, if M corresponds to the Hölder class with smoothness parameter s defined on a bounded convex subset of Rd, then log N[](ε, M, L2(P)) ≤ C(1/ε)^{d/s} by Corollary 2.6.2 of van der Vaart and Wellner (1996). Hence,

this condition simply requires s/d > 1/2. Given Assumption 4.1, the following theorem establishes

the asymptotic normality and semiparametric efficiency of the estimator µ̂β̂.

Theorem 4.1. Assume that Assumption 4.1 holds and that there exist rb, rh > 1/2, β∗, and α∗ = (α∗1⊤, α∗2⊤)⊤ ∈ RK such that the propensity score model satisfies

\[
\sup_{x\in\mathcal{X}}\,|m^*(x) - \beta^{*\top}B(x)| = O(K^{-r_b}), \tag{4.1}
\]

and the outcome models K(·) and L(·) satisfy

\[
\sup_{x\in\mathcal{X}}\,|K(x) - \alpha_1^{*\top}h_1(x)| = O(K^{-r_h}), \qquad \sup_{x\in\mathcal{X}}\,|L(x) - \alpha_2^{*\top}h_2(x)| = O(K^{-r_h}). \tag{4.2}
\]

Assume K = o(n^{1/3}) and $n^{1/\{2(r_b+r_h)\}} = o(K)$. Then

\[
n^{1/2}(\hat\mu_{\hat\beta} - \mu) \xrightarrow{d} N(0, V_{\mathrm{opt}}),
\]

where Vopt is the asymptotic variance bound in (2.6). Thus, µ̂β̂ is semiparametrically efficient.

This theorem can be viewed as a nonparametric extension of Corollary 3.2. It shows that one

can construct an efficient estimator of the treatment effect without imposing strong parametric

assumptions on the propensity score model and the outcome model.

In the following, we comment on the technical assumptions in this theorem. We assume m∗(x)

and K(x) (also L(x)) can be uniformly approximated by the basis functions B(x) and h1(x)

(also h2(x)), respectively. It is well known that the uniform rate of convergence is related to the

smoothness of the functions m∗(x) and K(x) (also L(x)) and the dimension of X. For instance,

if the function class M for m∗(x) and H for K(x) (also L(x)) correspond to the Hölder class with smoothness parameter s on the domain X = [0, 1]d, then the uniform convergence holds for the spline basis and wavelet basis with rb = rh = s/d; see Newey (1997); Chen (2007) for details. Thus, Theorem 4.1 is directly applicable to this setting, in which no additional structural assumptions

are imposed on m∗(x) and K(x) (also L(x)). Compared to the nonparametric result in Hirano

et al. (2003) which requires s/d > 7 and Chan et al. (2015) which requires s/d > 13, our theorem

needs a much weaker condition, i.e., s/d > 3/4. To achieve this improved result, our proof exploits matrix Bernstein concentration inequalities (Tropp, 2015) and a Bernstein-type concentration

inequality for U-statistics (Arcones, 1995).

When d is large, to handle the curse of dimensionality and meanwhile retain the flexibility

of the model, we may assume the following additive structure: $m^*(x) = \sum_{j=1}^{d} m_j^*(x_j)$; see Stone

(1985); Hastie and Tibshirani (1990); Horowitz et al. (2004); Fan and Jiang (2005), among others.

One major advantage of the sieve estimation approach is that it is convenient to incorporate such

structural assumptions, and the results in Theorem 4.1 are still applicable. In this case, the uniform convergence holds with rb = s, where s is a common smoothness parameter of the Hölder class for all {m∗_j}_{j=1,...,d}. In particular, if m∗(x) and K(x) (also L(x)) all have additive structures

and each unknown component lies in the Hölder class with s = 1, then Theorem 4.1 holds provided K = o(n^{1/3}) and n^{1/4} = o(K). Similar to the additive model, our method can be easily extended to

the partially linear model for m∗(x) and K(x) (also L(x)), by redefining the basis functions B(x)

and h(x). As discussed in Section 5 of Newey (1997), the large sample results (e.g., our Theorem

4.1) of the sieve estimator are directly applicable to the partially linear case. Thus, our Theorem 4.1

provides a unified semiparametric efficiency result under many nonparametric/semiparametric settings, which is more comprehensive than the existing fully nonparametric propensity score methods

in Hirano et al. (2003); Chan et al. (2015).

Interestingly, we find that the asymptotic bias of the estimator µ̂β̂ is proportional to $K^{-(r_b+r_h)}$, which is the product of the approximation errors for m∗(x) and K(x) (also L(x)). Thus, to make the bias of the estimator µ̂β̂ asymptotically ignorable, we may require either rb or rh to be sufficiently

large, due to the multiplicative form of the asymptotic bias. This phenomenon can be viewed

as a nonparametric version of the double robustness property, in the sense that the asymptotic


normality and semiparametric efficiency results hold if either the propensity score model or the

outcome model is approximated reasonably well by basis functions. To the best of our knowledge,

this property has not appeared in the existing literature. In fact, in this theorem it is not necessary to assume that both the propensity score model and the outcome model are approximated with very high accuracy (i.e., that both $r_b$ and $r_h$ are large).
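The following back-of-the-envelope check (ours, not from the paper) makes the multiplicative bias explicit: root-$n$ negligibility requires $n^{1/2}K^{-(r_b+r_h)} \to 0$, so a large $r_b$ can compensate for a small $r_h$ and vice versa. The numbers below are arbitrary.

```python
import numpy as np

# A quick check (ours) of the nonparametric double robustness condition:
# the bias is of order K^{-(r_b + r_h)}, so root-n negligibility needs
# sqrt(n) * K^{-(r_b + r_h)} -> 0. Either a smooth outcome-side or a smooth
# propensity-side approximation suffices, mirroring the parametric case.

def scaled_bias(n, K, rb, rh):
    return np.sqrt(n) * K ** (-(rb + rh))

n = 10_000
K = int(n ** 0.3)                        # K = o(n^{1/3}), as in the text
for rb, rh in [(2.0, 0.25), (0.25, 2.0), (0.25, 0.25)]:
    print(f"r_b={rb}, r_h={rh}: sqrt(n) * K^-(rb+rh) = "
          f"{scaled_bias(n, K, rb, rh):.3f}")
# Only the case where both rates are small leaves a non-negligible bias.
```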

5 Concluding Remarks

This paper presents a theoretical investigation of the covariate balancing propensity score method-

ology that others have found to work well in practice (e.g., Wyss et al., 2014; Frolich et al., 2015).

We derive the optimal choice of the covariate balancing function so that the resulting IPTW esti-

mator is first order unbiased under local misspecification of the propensity score model. We then

show that this choice is also optimal under arbitrary misspecification. Furthermore, it turns out

that this covariate balancing function is also optimal even under correct specification, attaining the

semiparametric efficiency bound.

Given these theoretical insights, we propose an improvement to the standard CBPS methodol-

ogy by carefully choosing the covariate balancing estimating functions. We prove that the proposed

iCBPS-based IPTW estimator is consistent if either the outcome or propensity score model is cor-

rect. In addition to this double robustness property, the proposed estimator is semiparametrically

efficient. The advantage of the proposed methodology, over the standard doubly robust estimators,

is that it does not require the separate estimation of an outcome model. Our simulation results

confirm the theoretical results, demonstrating the advantages of the proposed iCBPS methodology.

To relax the parametric assumptions and improve the double robustness property, we further ex-

tend the iCBPS method to the nonparametric/semiparametric settings. We show that the proposed

estimator can achieve the semiparametric efficiency bound without imposing parametric assump-

tions on the propensity score model and the outcome model. A new nonparametric version of the

double robustness property is discovered. Finally, we note that our theoretical results require much

weaker technical conditions and cover a broader class of semiparametric models than the existing

nonparametric propensity score methods.

References

Andrews, D. W. (1991). Asymptotic normality of series estimators for nonparametric and semi-

parametric regression models. Econometrica 307–345.

Arcones, M. A. (1995). A Bernstein-type inequality for U-statistics and U-processes. Statistics &
Probability Letters 22 239–247.

Belloni, A., Chernozhukov, V., Chetverikov, D. and Kato, K. (2015). Some new asymp-

totic theory for least squares series: Pointwise and uniform results. Journal of Econometrics 186

345–366.


Chan, K., Yam, S. and Zhang, Z. (2015). Globally efficient nonparametric inference of average
treatment effects by empirical balancing calibration weighting. Journal of the Royal Statistical
Society, Series B (Statistical Methodology), forthcoming.

Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of
Econometrics 6 5549–5632.

Chen, X. and Christensen, T. M. (2015). Optimal uniform convergence rates and asymp-

totic normality for series estimators under weak dependence and weak conditions. Journal of

Econometrics 188 447–465.

Fan, J. and Jiang, J. (2005). Nonparametric inferences for additive models. Journal of the

American Statistical Association 100 890–907.

Fong, C., Hazlett, C. and Imai, K. (2015). Covariate balancing propensity score for general

treatment regimes. Tech. rep., Department of Politics, Princeton University.

Frolich, M., Huber, M. and Wiesenfarth, M. (2015). The finite sample performance of

semi- and nonparametric estimators for treatment effects and policy evaluation. Tech. rep., IZA

Discussion Paper No. 8756.

Graham, B. S., Pinto, C. and Egel, D. (2012). Inverse probability tilting for moment condition

models with missing data. Review of Economic Studies 79 1053–1079.

Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of

average treatment effects. Econometrica 315–331.

Hainmueller, J. (2012). Entropy balancing for causal effects: Multivariate reweighting method

to produce balanced samples in observational studies. Political Analysis 20 25–46.

Hansen, B. E. (2014). A unified asymptotic distribution theory for parametric and non-parametric

least squares. Tech. rep., Working paper, University of Wisconsin.

Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators.

Econometrica 50 1029–1054.

Hastie, T. J. and Tibshirani, R. J. (1990). Generalized additive models, vol. 43. CRC Press.

Henmi, M. and Eguchi, S. (2004). A paradox concerning nuisance parameters and projected

estimating functions. Biometrika 91 929–941.

Hirano, K., Imbens, G. and Ridder, G. (2003). Efficient estimation of average treatment effects

using the estimated propensity score. Econometrica 71 1307–1338.

Horowitz, J. L. and Mammen, E. (2004). Nonparametric estimation of an additive model
with a link function. The Annals of Statistics 32 2412–2443.


Horvitz, D. and Thompson, D. (1952). A generalization of sampling without replacement from

a finite universe. Journal of the American Statistical Association 47 663–685.

Imai, K. and Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal

Statistical Society, Series B (Statistical Methodology) 76 243–263.

Imai, K. and Ratkovic, M. (2015). Robust estimation of inverse probability weights for marginal

structural models. Journal of the American Statistical Association 110 1013–1023.

Kang, J. D. Y. and Schafer, J. L. (2007). Demystifying double robustness: a comparison of

alternative strategies for estimating a population mean from incomplete data. Statist. Sci. 22

574–580.

Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal

of Econometrics 79 147–168.

Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing.

Handbook of econometrics 4 2111–2245.

Owen, A. B. (2001). Empirical Likelihood. Chapman & Hall/CRC, New York.

Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994). Estimation of regression coefficients

when some regressors are not always observed. Journal of the American Statistical Association

89 846–866.

Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1995). Analysis of semiparametric regression

models for repeated outcomes in the presence of missing data. Journal of the American Statistical

Association 90 106–121.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in obser-

vational studies for causal effects. Biometrika 70 41–55.

Rubin, D. B. (1990). Comments on “On the application of probability theory to agricultural

experiments. Essay on principles. Section 9” by J. Splawa-Neyman translated from the Polish

and edited by D. M. Dabrowska and T. P. Speed. Statistical Science 5 472–480.

Stone, C. J. (1985). Additive regression and other nonparametric models. The Annals of Statistics
689–705.

Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika

97 661–682.

Tropp, J. A. (2015). An introduction to matrix concentration inequalities. arXiv preprint
arXiv:1501.01571.

Tsiatis, A. (2007). Semiparametric theory and missing data. Springer Science & Business Media.


van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes: With

Applications to Statistics. Springer Science & Business Media.

Van der Vaart, A. W. (2000). Asymptotic statistics, vol. 3. Cambridge university press.

Wyss, R., Ellis, A. R., Brookhart, M. A., Girman, C. J., Funk, M. J., LoCasale, R.

and Sturmer, T. (2014). The role of prediction modeling in propensity score estimation: An

evaluation of logistic regression, bCART, and the covariate-balancing propensity score. American

Journal of Epidemiology 180 645–655.


A Preliminaries

To simplify the notation, we use $\pi_i^* = \pi_{\beta^*}(X_i)$ and $\pi_i^o = \pi_{\beta^o}(X_i)$. For any vector $C \in \mathbb{R}^K$, we denote $|C| = (|C_1|, \dots, |C_K|)^\top$ and write $C \le B$ if $C_k \le B_k$ for all $1 \le k \le K$.

Assumption A.1. (Regularity Conditions for CBPS in Section 2)

1. There exists a positive definite matrix W∗ such that Wp−→W∗.

2. The minimizer βo = argminβ E(gβ(T ,X))>W∗E(gβ(T ,X)) is unique.

3. βo ∈ int(Θ), where Θ is a compact set.

4. πβ(X) is continuous in β.

5. There exists a constant $0 < c_0 < 1/2$ such that, with probability tending to one, $c_0 \le \pi_\beta(X) \le 1 - c_0$ for any $\beta \in \mathrm{int}(\Theta)$.

6. E|fj(X)| <∞ for 1 ≤ j ≤ m and E|Y (1)|2 <∞, E|Y (0)|2 <∞.

7. G∗ := E(∂g(βo)/∂β) exists and there is a q-dimensional function C(X) and a small constant

r > 0 such that supβ∈Br(βo) |∂πβ(X)/∂β| ≤ C(X) and E(|fj(X)|C(X)) <∞ for 1 ≤ j ≤ m,

where Br(βo) is a ball in Rq with radius r and center βo. In addition, E(|Y |C(X)) <∞.

8. G∗>W∗G∗ and E(gβo(Ti,Xi)gβo(Ti,Xi)>) are nonsingular.

9. In the locally misspecified model (2.1), assume |u(X;β∗)| ≤ 1 almost surely.

Lemma A.1 (Lemma 2.4 in Newey and McFadden (1994)). Assume that the data $Z_i$ are i.i.d., $\Theta$ is compact, $a(Z,\theta)$ is continuous for $\theta \in \Theta$, and there is $D(Z)$ with $|a(Z,\theta)| \le D(Z)$ for all $\theta \in \Theta$ and $\mathbb{E}(D(Z)) < \infty$. Then $\mathbb{E}(a(Z,\theta))$ is continuous and $\sup_{\theta\in\Theta}\big|n^{-1}\sum_{i=1}^n a(Z_i,\theta) - \mathbb{E}(a(Z,\theta))\big| \xrightarrow{p} 0$.

Lemma A.2. Under Assumption A.1 (or Assumption 3.1), we have $\hat\beta \xrightarrow{p} \beta^o$. Moreover, if $\mathbb{P}(T_i = 1 \mid X_i) = \pi_{\beta^*}(X_i)$, then $\hat\beta \xrightarrow{p} \beta^*$.

Proof of Lemma A.2. The proof of $\hat\beta \xrightarrow{p} \beta^o$ follows from Theorem 2.6 in Newey and McFadden (1994). Note that their conditions (i)–(iii) follow directly from Assumption 3.1 (1)–(4). We only need to verify their condition (iv), i.e., $\mathbb{E}(\sup_{\beta\in\Theta}|g_{\beta j}(T_i,X_i)|) < \infty$, where
$$g_{\beta j}(T_i,X_i) = \Big(\frac{T_i}{\pi_\beta(X_i)} - \frac{1-T_i}{1-\pi_\beta(X_i)}\Big)f_j(X_i).$$
By Assumption A.1 (5), we have $|g_{\beta j}(T_i,X_i)| \le 2|f_j(X_i)|/c_0$, and thus $\mathbb{E}(\sup_{\beta\in\Theta}|g_{\beta j}(T_i,X_i)|) < \infty$ by Assumption A.1 (6). Moreover, if $\mathbb{P}(T_i = 1 \mid X_i) = \pi_{\beta^*}(X_i)$, we show in the following that $\beta^* = \beta^o$. Indeed, $\mathbb{P}(T_i = 1 \mid X_i) = \pi_{\beta^*}(X_i)$ implies $\mathbb{E}(g_{\beta^*}(T_i,X_i)) = 0$, and hence $\mathbb{E}(g_{\beta^*}^\top(T_i,X_i))W^*\mathbb{E}(g_{\beta^*}(T_i,X_i)) = 0$. Since $W^*$ is positive definite, $\mathbb{E}(g_\beta^\top(T_i,X_i))W^*\mathbb{E}(g_\beta(T_i,X_i)) \ge 0$ for any $\beta$. Hence $\beta^*$ is a minimizer of $\mathbb{E}(g_\beta^\top(T_i,X_i))W^*\mathbb{E}(g_\beta(T_i,X_i))$, and by the uniqueness of the minimizer we conclude that $\beta^* = \beta^o$. In addition, for the proof of Theorem 3.1, we verify the analogous conditions for the iCBPS estimator, i.e., $\mathbb{E}(\sup_{\beta\in\Theta}|g_{1\beta j}(T_i,X_i)|) < \infty$ and $\mathbb{E}(\sup_{\beta\in\Theta}|g_{2\beta j}(T_i,X_i)|) < \infty$, where
$$g_{1\beta j}(T_i,X_i) = \Big(\frac{T_i}{\pi_\beta(X_i)} - \frac{1-T_i}{1-\pi_\beta(X_i)}\Big)h_{1j}(X_i), \qquad g_{2\beta j}(T_i,X_i) = \Big(\frac{T_i}{\pi_\beta(X_i)} - 1\Big)h_{2j}(X_i).$$
We have $|g_{1\beta j}(T_i,X_i)| \le 2|h_{1j}(X_i)|/c_0$, and thus $\mathbb{E}(\sup_{\beta\in\Theta}|g_{1\beta j}(T_i,X_i)|) < \infty$. Similarly, we can prove $\mathbb{E}(\sup_{\beta\in\Theta}|g_{2\beta j}(T_i,X_i)|) < \infty$. This completes the proof.

Lemma A.3. Under Assumption A.1 (or Assumptions 3.1 and 3.2), we have
$$n^{1/2}(\hat\beta - \beta^o) = -(H_f^{*\top}W^*H_f^*)^{-1}n^{1/2}H_f^{*\top}W^*g_{\beta^o}(T,X) + o_p(1), \tag{A.1}$$
$$n^{1/2}(\hat\beta - \beta^o) \xrightarrow{d} N\big(0,\ (H_f^{*\top}W^*H_f^*)^{-1}H_f^{*\top}W^*\Omega W^*H_f^*(H_f^{*\top}W^*H_f^*)^{-1}\big), \tag{A.2}$$
where $\Omega = \mathrm{Var}(g_{\beta^*}(T_i,X_i))$. If the propensity score model is correctly specified with $\mathbb{P}(T_i = 1 \mid X_i) = \pi_{\beta^*}(X_i)$ and $W^* = \Omega^{-1}$ holds, then $n^{1/2}(\hat\beta - \beta^*) \xrightarrow{d} N\big(0,\ (H_f^{*\top}\Omega^{-1}H_f^*)^{-1}\big)$.

Proof. The proof of (A.1) and (A.2) follows from Theorem 3.4 in Newey and McFadden (1994). Note that their conditions (i), (ii), (iii) and (v) are directly implied by our Assumption A.1 (3), (4), (2) and Assumption A.1 (1), respectively. In addition, their condition (iv), namely $\mathbb{E}(\sup_{\beta\in\mathcal{N}}|\partial g_\beta(T_i,X_i)/\partial\beta_j|) < \infty$ for some small neighborhood $\mathcal{N}$ around $\beta^o$, is also implied by our Assumption A.1. To see this, by Assumption A.1 some simple calculations show that
$$\sup_{\beta\in\mathcal{N}}\Big|\frac{\partial g_\beta(T_i,X_i)}{\partial\beta_j}\Big| \le \Big(\frac{T_i|f(X_i)|}{c_0^2} + \frac{(1-T_i)|f(X_i)|}{c_0^2}\Big)\sup_{\beta\in\mathcal{N}}\Big|\frac{\partial\pi_\beta(X_i)}{\partial\beta_j}\Big| \le C_j(X)|f(X_i)|/c_0^2,$$
for $\mathcal{N} \subseteq B_r(\beta^o)$. Hence $\mathbb{E}(\sup_{\beta\in\mathcal{N}}|\partial g_\beta(T_i,X_i)/\partial\beta_j|) < \infty$ by Assumption A.1 (7), and condition (iv) in Theorem 3.4 of Newey and McFadden (1994) holds. In order to apply this lemma to the proofs in Section 3, we further verify this condition for $g_\beta(\cdot) = (g_{1\beta}^\top(\cdot), g_{2\beta}^\top(\cdot))^\top$, where
$$g_{1\beta}(T_i,X_i) = \Big(\frac{T_i}{\pi_\beta(X_i)} - \frac{1-T_i}{1-\pi_\beta(X_i)}\Big)h_1(X_i), \qquad g_{2\beta}(T_i,X_i) = \Big(\frac{T_i}{\pi_\beta(X_i)} - 1\Big)h_2(X_i).$$
To this end, by Assumption 3.1 some simple calculations show that
$$\sup_{\beta\in\mathcal{N}}\Big|\frac{\partial g_{1\beta}(T_i,X_i)}{\partial\beta_j}\Big| \le \Big(\frac{T_i|h_1(X_i)|}{c_0^2} + \frac{(1-T_i)|h_1(X_i)|}{c_0^2}\Big)\sup_{\beta\in\mathcal{N}}\Big|\frac{\partial\pi_\beta(X_i)}{\partial\beta_j}\Big| \le C_j(X)|h_1(X_i)|/c_0^2,$$
for $\mathcal{N} \subseteq B_r(\beta^o)$. Hence $\mathbb{E}(\sup_{\beta\in\mathcal{N}}|\partial g_{1\beta}(T_i,X_i)/\partial\beta_j|) < \infty$ by Assumption 3.1 (7). Following similar arguments, we can prove that $\mathbb{E}(\sup_{\beta\in\mathcal{N}}|\partial g_{2\beta}(T_i,X_i)/\partial\beta_j|) < \infty$ holds. This completes the proof of (A.2). As shown in Lemma A.2, if $\mathbb{P}(T_i = 1 \mid X_i) = \pi_{\beta^*}(X_i)$ holds, then $\beta^o = \beta^*$. Thus, the asymptotic normality of $n^{1/2}(\hat\beta - \beta^*)$ follows from (A.2). The proof is complete.
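The normality in Lemma A.3 can also be eyeballed in simulation. The sketch below (ours, not the authors' code, with an assumed logistic propensity model, balancing functions $f(X) = (1, X)$, and $W = I$) repeatedly solves the GMM-type balancing problem and standardizes the estimates; all data-generating choices are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Monte Carlo sketch (ours): with a correctly specified logistic propensity
# score, the estimator minimizing g_beta' W g_beta with W = I is consistent
# and sqrt(n)(beta_hat - beta*) looks approximately Gaussian, as in (A.2).

rng = np.random.default_rng(1)
beta_star = np.array([0.2, -0.5])

def pi_beta(beta, F):
    return 1.0 / (1.0 + np.exp(-F @ beta))

def g_bar(beta, T, F):
    # sample balancing moments: mean of (T/pi - (1-T)/(1-pi)) f(X)
    p = np.clip(pi_beta(beta, F), 1e-6, 1 - 1e-6)
    w = T / p - (1 - T) / (1 - p)
    return (w[:, None] * F).mean(axis=0)

def fit(n):
    X = rng.uniform(-1, 1, size=n)
    F = np.column_stack([np.ones(n), X])          # f(X) = (1, X)
    T = rng.binomial(1, pi_beta(beta_star, F))
    obj = lambda b: g_bar(b, T, F) @ g_bar(b, T, F)
    return minimize(obj, x0=np.zeros(2), method="BFGS").x

est = np.array([fit(2000) for _ in range(200)])
z = np.sqrt(2000) * (est - beta_star)
print("mean of sqrt(n)(beta_hat - beta*):", z.mean(axis=0).round(2))
print("std  of sqrt(n)(beta_hat - beta*):", z.std(axis=0).round(2))
```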


B Proof of Results in Section 2

B.1 Proof of Theorem 2.1

Proof. First, we derive the bias of $\hat\beta$. By the arguments in the proof of Lemma A.3, we can show that $\hat\beta = \beta^o + O_p(n^{-1/2})$, where $\beta^o$ satisfies $\beta^o = \arg\min_\beta \mathbb{E}(g_\beta(T,X))^\top W^*\mathbb{E}(g_\beta(T,X))$. Let $u_i^* = u(X_i;\beta^*)$. By the propensity score model and the fact that $|u(X_i;\beta^*)|$ is a bounded random variable and $\mathbb{E}|f_j(X_i)| < \infty$, we can show that
$$\mathbb{E}(g_{\beta^o}) = \mathbb{E}\Big[\frac{\pi_i^*(1+\xi u_i^*)f(X_i)}{\pi_i^o} - \frac{(1-\pi_i^*-\xi\pi_i^*u_i^*)f(X_i)}{1-\pi_i^o}\Big] + O(\xi^2).$$
In addition, a similar calculation gives $\mathbb{E}(g_{\beta^*}) = O(\xi)$. Therefore,
$$\lim_{n\to\infty}\mathbb{E}(g_{\beta^*}(T,X))^\top W^*\mathbb{E}(g_{\beta^*}(T,X)) = 0.$$
Clearly, the quadratic form $\mathbb{E}(g_\beta(T,X))^\top W^*\mathbb{E}(g_\beta(T,X))$ is nonnegative for any $\beta$. By the uniqueness of $\beta^o$, we have $\beta^o - \beta^* = o(1)$. Therefore, we can expand $\pi_i^o$ around $\pi_i^*$, which yields
$$\mathbb{E}(g_{\beta^o}) = \mathbb{E}\Big[\xi\Big(\frac{u_i^*}{1-\pi_i^*}\Big)f(X_i)\Big] + H_f^*(\beta^o-\beta^*) + O(\xi^2 + \|\beta^o-\beta^*\|_2^2).$$
This implies that the bias of $\beta^o$ is
$$\beta^o - \beta^* = -\xi(H_f^{*\top}W^*H_f^*)^{-1}H_f^{*\top}W^*\mathbb{E}\Big[\Big(\frac{u_i^*}{1-\pi_i^*}\Big)f(X_i)\Big] + O(\xi^2). \tag{B.1}$$
Our next step is to derive the bias of $\hat\mu_{\hat\beta}$. Similar to the proof of Theorem 3.2, we have
$$\hat\mu_{\hat\beta} - \mu = \frac{1}{n}\sum_{i=1}^n D_i + H_y^{*\top}(\hat\beta-\beta^o) + o_p(n^{-1/2}),$$
where
$$D_i = \frac{T_iY_i(1)}{\pi_i^o} - \frac{(1-T_i)Y_i(0)}{1-\pi_i^o} - \mu,$$
and
$$n^{1/2}(\hat\beta-\beta^o) = -(H_f^{*\top}W^*H_f^*)^{-1}n^{1/2}H_f^{*\top}W^*g_{\beta^o}(T,X) + o_p(1).$$
By the dominated convergence theorem, the bias of $\hat\mu_{\hat\beta}$ can be represented as
$$\mathbb{E}(\hat\mu_{\hat\beta}) - \mu = \mathbb{E}(D_i) + o(n^{-1/2}).$$
By ignorability of the treatment assignment and the definition of the true propensity score,
$$\mathbb{E}(\hat\mu_{\hat\beta}) - \mu = \mathbb{E}\Big[\frac{\pi_i^*(1+\xi u_i^*)Y_i(1)}{\pi_i^o} - \frac{(1-\pi_i^*-\xi\pi_i^*u_i^*)Y_i(0)}{1-\pi_i^o}\Big] - \mu + o(\xi).$$
Similarly, we expand $\pi_i^o$ around $\pi_i^*$:
$$\mathbb{E}(\hat\mu_{\hat\beta}) - \mu = \mathbb{E}\Big[Y_i(1)-Y_i(0) + \xi\Big(u_i^*Y_i(1) + \frac{u_i^*\pi_i^*}{1-\pi_i^*}Y_i(0)\Big)\Big] + H_y^{*\top}(\beta^o-\beta^*) - \mu + o(\xi), \tag{B.2}$$
where
$$H_y^* = -\mathbb{E}\Big[\frac{(1-\pi_i^*)Y_i(1) + \pi_i^*Y_i(0)}{\pi_i^*(1-\pi_i^*)}\frac{\partial\pi_i^*}{\partial\beta}\Big].$$
Plugging (B.1) into the above equation, the bias of $\hat\mu_{\hat\beta}$ is
$$\mathbb{E}(\hat\mu_{\hat\beta}) - \mu = \xi\,\mathbb{E}\Big[u_i^*Y_i(1) + \frac{u_i^*\pi_i^*}{1-\pi_i^*}Y_i(0)\Big] - \xi H_y^{*\top}(H_f^{*\top}W^*H_f^*)^{-1}H_f^{*\top}W^*\mathbb{E}\Big[\frac{u_i^*f(X_i)}{1-\pi_i^*}\Big] + o(\xi)$$
$$= \xi\,\mathbb{E}\Big[\frac{u_i^*\big(K(X_i)+L(X_i)(1-\pi_i^*)\big)}{1-\pi_i^*}\Big] - \xi H_y^{*\top}(H_f^{*\top}W^*H_f^*)^{-1}H_f^{*\top}W^*\mathbb{E}\Big[\frac{u_i^*f(X_i)}{1-\pi_i^*}\Big] + o(\xi).$$
This completes the proof.

B.2 Proof of Corollary 2.1

Proof. When $H_f^*$ is invertible, it is easy to show that, under the locally misspecified propensity score model, the bias term can be written as
$$B = \mathbb{E}\Big[\frac{u(X_i;\beta^*)\big(K(X_i) + (1-\pi_{\beta^*}(X_i))L(X_i)\big)}{1-\pi_{\beta^*}(X_i)}\Big] - H_y^{*\top}H_f^{*-1}\mathbb{E}\Big[\frac{u(X_i;\beta^*)f(X_i)}{1-\pi_{\beta^*}(X_i)}\Big].$$
If we choose the balancing function $f(X)$ such that $\alpha^\top f(X_i) = K(X_i) + (1-\pi_i^*)L(X_i)$ for some $\alpha \in \mathbb{R}^q$, we have
$$H_y^* = -\mathbb{E}\Big[\frac{K(X_i)+(1-\pi_i^*)L(X_i)}{\pi_i^*(1-\pi_i^*)}\cdot\frac{\partial\pi_i^*}{\partial\beta}\Big] = -\alpha^\top\mathbb{E}\Big[\frac{f(X_i)}{\pi_i^*(1-\pi_i^*)}\Big(\frac{\partial\pi_i^*}{\partial\beta}\Big)^\top\Big],$$
$$H_f^* = \mathbb{E}\Big(\frac{\partial g_{\beta^*}(T_i,X_i)}{\partial\beta}\Big) = -\mathbb{E}\Big[\frac{f(X_i)}{\pi_i^*(1-\pi_i^*)}\Big(\frac{\partial\pi_i^*}{\partial\beta}\Big)^\top\Big],$$
so that $H_y^{*\top} = \alpha^\top H_f^*$. The bias thus becomes
$$B = \alpha^\top\mathbb{E}\Big[\frac{u(X_i;\beta^*)f(X_i)}{1-\pi_{\beta^*}(X_i)}\Big] - \alpha^\top H_f^*(H_f^*)^{-1}\mathbb{E}\Big[\frac{u(X_i;\beta^*)f(X_i)}{1-\pi_{\beta^*}(X_i)}\Big] = 0.$$
This proves that $\hat\mu_{\hat\beta}$ is first order unbiased.
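The cancellation above can also be checked numerically. The following Monte Carlo sketch (ours, under the sign conventions used in this proof) takes a scalar logistic propensity score and an arbitrary bounded perturbation $u$, sets $f(X) = K(X) + (1-\pi^*(X))L(X)$ (so $\alpha = 1$), and verifies that the two terms of $B$ cancel; all the functional choices are illustrative.

```python
import numpy as np

# Monte Carlo check (ours) that B vanishes when the balancing function
# spans K(X) + (1 - pi*(X)) L(X). Scalar q = 1, logistic propensity
# pi*(x) = expit(beta* x); K, L and u are arbitrary bounded choices.

rng = np.random.default_rng(2)
X = rng.uniform(0.5, 1.5, size=1_000_000)
beta_star = 0.3
pi = 1.0 / (1.0 + np.exp(-beta_star * X))       # pi*(X)
dpi = pi * (1 - pi) * X                          # d pi / d beta

K = np.cos(X)                                    # outcome-model pieces
L = 1.0 + 0.5 * X
u = np.sin(3 * X)                                # bounded perturbation

f = K + (1 - pi) * L                             # optimal balancing function
Hy = -np.mean((K + (1 - pi) * L) / (pi * (1 - pi)) * dpi)
Hf = -np.mean(f / (pi * (1 - pi)) * dpi)

term1 = np.mean(u * f / (1 - pi))
term2 = (Hy / Hf) * np.mean(u * f / (1 - pi))
print("B =", term1 - term2)      # exactly 0 here, since alpha = 1 gives Hy = Hf
```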

B.3 Proof of Theorem 2.2

Proof. The proof is similar to that of Theorem 3.2. We omit the details.

B.4 Proof of Corollary 2.2

Proof. The asymptotic variance bound $V_{\mathrm{opt}}$ can be written as $V_{\mathrm{opt}} = \Sigma_\mu - \alpha^\top\Omega\alpha$, where
$$\Omega = \mathbb{E}\big(g_{\beta^o}(T_i,X_i)g_{\beta^o}(T_i,X_i)^\top\big) = \mathbb{E}\Big(\frac{f(X_i)f(X_i)^\top}{\pi_i^*(1-\pi_i^*)}\Big).$$
We can write the asymptotic variance of our estimator as
$$V = \Sigma_\mu + 2H_y^{*\top}\Sigma_{\mu\beta} + H_y^{*\top}\Sigma_\beta H_y^*,$$
where
$$H_y^* = \mathbb{E}\Big(\frac{\partial\mu_{\beta^*}(T_i,Y_i,X_i)}{\partial\beta}\Big) = -\mathbb{E}\Big(\frac{K(X_i)+(1-\pi_i^*)L(X_i)}{\pi_i^*(1-\pi_i^*)}\frac{\partial\pi_i^*}{\partial\beta}\Big),$$
$$\Sigma_{\mu\beta} = -(H_f^*)^{-1}\,\mathrm{Cov}\big(\mu_{\beta^*}(T_i,Y_i,X_i),\ g_{\beta^*}(T_i,X_i)\big),$$
$$H_f^* = \mathbb{E}\Big(\frac{\partial g_{\beta^*}(T_i,X_i)}{\partial\beta}\Big) = -\mathbb{E}\Big(\frac{f(X_i)}{\pi_i^*(1-\pi_i^*)}\Big(\frac{\partial\pi_i^*}{\partial\beta}\Big)^\top\Big),$$
$$\mathrm{Cov}\big(\mu_{\beta^*}(T_i,Y_i,X_i),\ g_{\beta^*}(T_i,X_i)\big) = \mathbb{E}\Big(\frac{K(X_i)+(1-\pi_i^*)L(X_i)}{\pi_i^*(1-\pi_i^*)}f(X_i)\Big),$$
$$\Sigma_\beta = (H_f^*)^{-1}\,\mathrm{Var}\big(g_{\beta^*}(T_i,X_i)\big)(H_f^{*\top})^{-1}, \qquad \mathrm{Var}\big(g_{\beta^*}(T_i,X_i)\big) = \mathbb{E}\Big(\frac{f(X_i)f(X_i)^\top}{\pi_i^*(1-\pi_i^*)}\Big).$$
If $K(X_i)+(1-\pi_i^*)L(X_i)$ lies in the linear space spanned by $f(X_i)$, that is, $K(X_i)+(1-\pi_i^*)L(X_i) = \alpha^\top f(X_i)$, we have
$$H_y^* = -\mathbb{E}\Big(\frac{\alpha^\top f(X_i)}{\pi_i^*(1-\pi_i^*)}\frac{\partial\pi_i^*}{\partial\beta}\Big) = (\alpha^\top H_f^*)^\top.$$
Hence
$$H_y^{*\top}\Sigma_{\mu\beta} = -\alpha^\top H_f^*(H_f^*)^{-1}\mathbb{E}\Big(\frac{\alpha^\top f(X_i)\,f(X_i)}{\pi_i^*(1-\pi_i^*)}\Big) = -\alpha^\top\mathbb{E}\Big(\frac{f(X_i)f(X_i)^\top}{\pi_i^*(1-\pi_i^*)}\Big)\alpha,$$
and
$$H_y^{*\top}\Sigma_\beta H_y^* = \alpha^\top H_f^*(H_f^*)^{-1}\mathbb{E}\Big(\frac{f(X_i)f(X_i)^\top}{\pi_i^*(1-\pi_i^*)}\Big)(H_f^{*\top})^{-1}(\alpha^\top H_f^*)^\top = \alpha^\top\mathbb{E}\Big(\frac{f(X_i)f(X_i)^\top}{\pi_i^*(1-\pi_i^*)}\Big)\alpha.$$
It is seen that $H_y^{*\top}\Sigma_{\mu\beta} = -H_y^{*\top}\Sigma_\beta H_y^*$. Then we have
$$V = \Sigma_\mu - \alpha^\top\Omega\alpha,$$
which equals the minimum asymptotic variance $V_{\mathrm{opt}}$.

B.5 Asymptotic Distribution under Local Misspecification

Theorem B.1. Under the locally misspecified propensity score model in (2.1) with $\xi = n^{-1/2}$ and Assumption A.1 in Appendix A, the estimator $\hat\mu_{\hat\beta}$ in (1.5), where $\hat\beta$ is obtained by (1.7), has the asymptotic distribution
$$\sqrt{n}(\hat\mu_{\hat\beta} - \mu) \xrightarrow{d} N\big(B,\ \tilde H^{*\top}\Sigma\tilde H^*\big), \tag{B.3}$$
where $B$ is the first order bias given in equation (2.2) of Theorem 2.1.


Proof. Recall that in the proof of Theorem 2.1 we obtained
$$\hat\mu_{\hat\beta} - \mu = \frac{1}{n}\sum_{i=1}^n D_i + H_y^{*\top}(\hat\beta-\beta^o) + o_p(n^{-1/2}),$$
where
$$D_i = \frac{T_iY_i(1)}{\pi_i^o} - \frac{(1-T_i)Y_i(0)}{1-\pi_i^o} - \mu,$$
and
$$n^{1/2}(\hat\beta-\beta^o) = -(H_f^{*\top}W^*H_f^*)^{-1}n^{1/2}H_f^{*\top}W^*g_{\beta^o}(T,X) + o_p(1).$$
In addition, we have shown that $\mathbb{E}(D_i) = Bn^{-1/2} + o(n^{-1/2})$. Thus,
$$\hat\mu_{\hat\beta} - \mu = \frac{1}{n}\sum_{i=1}^n\big\{D_i - \mathbb{E}(D_i)\big\} + H_y^{*\top}(\hat\beta-\beta^o) + Bn^{-1/2} + o_p(n^{-1/2}).$$
The asymptotic normality of $\sqrt{n}(\hat\mu_{\hat\beta} - \mu)$ then follows from the above asymptotic expansion and the central limit theorem. The detailed calculation of the asymptotic variance is similar to that in Theorem 2.2 and is omitted for simplicity.

C Proof of Results in Section 3

C.1 Proof of Theorem 3.1

Proof of Theorem 3.1. We first consider case (1); that is, the propensity score model is correctly specified. By Lemma A.2, we have $\hat\beta \xrightarrow{p} \beta^*$. Let
$$r_\beta(T,Y,X) = \frac{TY}{\pi_\beta(X)} - \frac{(1-T)Y}{1-\pi_\beta(X)}.$$
It is seen that $|r_\beta(T,Y,X)| \le 2|Y|/c_0$ and, by Assumption 3.1 (6), $\mathbb{E}|Y| < \infty$. Then Lemma A.1 yields $\sup_{\beta\in\Theta}\big|n^{-1}\sum_{i=1}^n r_\beta(T_i,Y_i,X_i) - \mathbb{E}(r_\beta(T_i,Y_i,X_i))\big| = o_p(1)$. In addition, by $\hat\beta \xrightarrow{p} \beta^*$ and the dominated convergence theorem, we obtain that
$$\hat\mu_{\hat\beta} = \mathbb{E}\Big(\frac{T_iY_i}{\pi_i^*} - \frac{(1-T_i)Y_i}{1-\pi_i^*}\Big) + o_p(1),$$
where $\pi_i^* = \pi_{\beta^*}(X_i)$. Since $Y_i = Y_i(1)T_i + Y_i(0)(1-T_i)$ and $Y_i(1), Y_i(0)$ are independent of $T_i$ given $X_i$, we can further simplify the above expression,
$$\hat\mu_{\hat\beta} = \mathbb{E}\Big(\frac{T_iY_i(1)}{\pi_i^*} - \frac{(1-T_i)Y_i(0)}{1-\pi_i^*}\Big) + o_p(1) = \mathbb{E}\Big(\frac{\mathbb{E}(T_i\mid X_i)\,\mathbb{E}(Y_i(1)\mid X_i)}{\pi_i^*} - \frac{(1-\mathbb{E}(T_i\mid X_i))\,\mathbb{E}(Y_i(0)\mid X_i)}{1-\pi_i^*}\Big) + o_p(1).$$
In addition, if the propensity score model is correctly specified, it further implies
$$\hat\mu_{\hat\beta} = \mathbb{E}\big(\mathbb{E}(Y_i(1)\mid X_i) - \mathbb{E}(Y_i(0)\mid X_i)\big) + o_p(1) = \mathbb{E}(Y_i(1)-Y_i(0)) + o_p(1) = \mu + o_p(1).$$
This completes the proof of consistency of $\hat\mu$ when the propensity score model is correctly specified.

In the following, we consider case (2); that is, $K(\cdot) \in \mathrm{span}\{M_1h_1(\cdot)\}$ and $L(\cdot) \in \mathrm{span}\{M_2h_2(\cdot)\}$. By Lemma A.2, we have $\hat\beta \xrightarrow{p} \beta^o$. The first order condition for $\beta^o$ yields $\partial Q(\beta^o)/\partial\beta = 0$, where $Q(\beta) = \mathbb{E}(g_\beta^\top)W^*\mathbb{E}(g_\beta)$. By Assumption 3.1 (7) and the dominated convergence theorem, we can interchange differentiation and integration, and thus $G^{*\top}W^*\mathbb{E}(g_{\beta^o}) = 0$. Under the assumption that $\mathbb{P}(T_i = 1 \mid X_i) = \pi(X_i) \ne \pi_i^o$, we have
$$\mathbb{E}(g_{1\beta^o}) = \mathbb{E}\Big[\Big(\frac{\pi(X_i)}{\pi_i^o} - \frac{1-\pi(X_i)}{1-\pi_i^o}\Big)h_1(X_i)\Big], \qquad \mathbb{E}(g_{2\beta^o}) = \mathbb{E}\Big[\Big(\frac{\pi(X_i)}{\pi_i^o} - 1\Big)h_2(X_i)\Big].$$
Rewrite $G^{*\top}W^* = (M_1, M_2)$, where $M_1 \in \mathbb{R}^{q\times m_1}$ and $M_2 \in \mathbb{R}^{q\times m_2}$. Then $\beta^o$ satisfies
$$\mathbb{E}\Big[\Big(\frac{\pi(X_i)}{\pi_i^o} - \frac{1-\pi(X_i)}{1-\pi_i^o}\Big)M_1h_1(X_i)\Big] + \mathbb{E}\Big[\Big(\frac{\pi(X_i)}{\pi_i^o} - 1\Big)M_2h_2(X_i)\Big] = 0. \tag{C.1}$$
Following arguments similar to those in case (1), we can prove that
$$\hat\mu_{\hat\beta} = \mathbb{E}\Big(\frac{T_iY_i}{\pi_i^o} - \frac{(1-T_i)Y_i}{1-\pi_i^o}\Big) + o_p(1) = \mathbb{E}\Big(\frac{\mathbb{E}(T_i\mid X_i)\,\mathbb{E}(Y_i(1)\mid X_i)}{\pi_i^o} - \frac{(1-\mathbb{E}(T_i\mid X_i))\,\mathbb{E}(Y_i(0)\mid X_i)}{1-\pi_i^o}\Big) + o_p(1).$$
By $\mathbb{E}(T_i\mid X_i) = \pi(X_i)$ and the outcome model, it further implies
$$\hat\mu_{\hat\beta} - \mu = \mathbb{E}\Big[\frac{\pi(X_i)(K(X_i)+L(X_i))}{\pi_i^o} - \frac{(1-\pi(X_i))K(X_i)}{1-\pi_i^o}\Big] - \mu + o_p(1)$$
$$= \mathbb{E}\Big[\Big(\frac{\pi(X_i)}{\pi_i^o} - \frac{1-\pi(X_i)}{1-\pi_i^o}\Big)K(X_i)\Big] + \mathbb{E}\Big[\frac{\pi(X_i)L(X_i)}{\pi_i^o}\Big] - \mu + o_p(1)$$
$$= \mathbb{E}\Big[\Big(\frac{\pi(X_i)}{\pi_i^o} - \frac{1-\pi(X_i)}{1-\pi_i^o}\Big)K(X_i)\Big] + \mathbb{E}\Big[\Big(\frac{\pi(X_i)}{\pi_i^o} - 1\Big)L(X_i)\Big] + o_p(1),$$
where in the last step we use $\mu = \mathbb{E}(L(X_i))$. By equation (C.1), we obtain $\hat\mu = \mu + o_p(1)$, provided $K(X_i) = \alpha_1^\top M_1h_1(X_i)$ and $L(X_i) = \alpha_2^\top M_2h_2(X_i)$, where $\alpha_1$ and $\alpha_2$ are $q$-dimensional vectors of constants. This completes the whole proof.

C.2 Proof of Theorem 3.2

Proof of Theorem 3.2. We first consider case (1); that is, the propensity score model is correctly specified. By the mean value theorem, we have $\hat\mu = \tilde\mu + H(\bar\beta)^\top(\hat\beta - \beta^*)$, where
$$\tilde\mu = \frac{1}{n}\sum_{i=1}^n\Big(\frac{T_iY_i}{\pi_i^*} - \frac{(1-T_i)Y_i}{1-\pi_i^*}\Big), \qquad H(\bar\beta) = -\frac{1}{n}\sum_{i=1}^n\Big(\frac{T_iY_i}{\bar\pi_i^2} + \frac{(1-T_i)Y_i}{(1-\bar\pi_i)^2}\Big)\frac{\partial\bar\pi_i}{\partial\beta},$$
where $\pi_i^* = \pi_{\beta^*}(X_i)$, $\bar\pi_i = \pi_{\bar\beta}(X_i)$, and $\bar\beta$ is an intermediate value between $\hat\beta$ and $\beta^*$. By Assumption 3.2 (2), the summand in $H(\bar\beta)$ has a bounded envelope function. By Lemma A.1, we have $\sup_{\beta\in B_r(\beta^*)}|H(\beta) - \mathbb{E}(H(\beta))| = o_p(1)$. Since $\hat\beta$ is consistent, the dominated convergence theorem gives $H(\bar\beta) = H^* + o_p(1)$, where
$$H^* = -\mathbb{E}\Big[\Big(\frac{T_iY_i}{\pi_i^{*2}} + \frac{(1-T_i)Y_i}{(1-\pi_i^*)^2}\Big)\frac{\partial\pi_i^*}{\partial\beta}\Big] = -\mathbb{E}\Big[\Big(\frac{Y_i(1)}{\pi_i^*} + \frac{Y_i(0)}{1-\pi_i^*}\Big)\frac{\partial\pi_i^*}{\partial\beta}\Big] = -\mathbb{E}\Big[\frac{K(X_i)+L(X_i)(1-\pi_i^*)}{\pi_i^*(1-\pi_i^*)}\frac{\partial\pi_i^*}{\partial\beta}\Big].$$
Finally, we invoke the central limit theorem and equation (A.1) to obtain
$$n^{1/2}(\hat\mu-\mu) \xrightarrow{d} N(0,\ \tilde H^{*\top}\Sigma\tilde H^*),$$
where $\tilde H^* = (1, H^{*\top})^\top$, $\Sigma_\beta = (G^{*\top}W^*G^*)^{-1}G^{*\top}W^*\Omega W^*G^*(G^{*\top}W^*G^*)^{-1}$ and
$$\Sigma = \begin{pmatrix}\Sigma_\mu & \Sigma_{\mu\beta}^\top\\ \Sigma_{\mu\beta} & \Sigma_\beta\end{pmatrix}.$$
Denote $b_i(T_i,X_i,Y_i(1),Y_i(0)) = T_iY_i(1)/\pi_i^* - (1-T_i)Y_i(0)/(1-\pi_i^*) - \mu$. Some simple calculations yield
$$\Sigma_\mu = \mathbb{E}[b_i^2(T_i,X_i,Y_i(1),Y_i(0))] = \mathbb{E}\Big(\frac{Y_i^2(1)}{\pi_i^*} + \frac{Y_i^2(0)}{1-\pi_i^*}\Big) - \mu^2.$$
In addition, the off-diagonal block can be written as
$$\Sigma_{\mu\beta} = -(G^{*\top}W^*G^*)^{-1}G^{*\top}W^*T,$$
where $T = \big(\mathbb{E}[g_{1\beta^*}^\top(T_i,X_i)b_i(T_i,X_i,Y_i(1),Y_i(0))],\ \mathbb{E}[g_{2\beta^*}^\top(T_i,X_i)b_i(T_i,X_i,Y_i(1),Y_i(0))]\big)^\top$ with
$$g_{1\beta}(T_i,X_i) = \Big(\frac{T_i}{\pi_\beta(X_i)} - \frac{1-T_i}{1-\pi_\beta(X_i)}\Big)h_1(X_i), \qquad g_{2\beta}(T_i,X_i) = \Big(\frac{T_i}{\pi_\beta(X_i)} - 1\Big)h_2(X_i).$$
After some algebra, we can show that
$$T = \Big(\mathbb{E}\Big[\frac{K(X_i)+(1-\pi_i^*)L(X_i)}{(1-\pi_i^*)\pi_i^*}h_1^\top(X_i)\Big],\ \mathbb{E}\Big[\frac{K(X_i)+(1-\pi_i^*)L(X_i)}{\pi_i^*}h_2^\top(X_i)\Big]\Big)^\top.$$
This completes the proof of equation (3.3).

Next, we consider case (2). Recall that $\mathbb{P}(T_i = 1 \mid X_i) = \pi(X_i) \ne \pi_{\beta^o}(X_i)$. Following similar arguments, we can show that
$$\hat\mu_{\hat\beta} - \mu = \frac{1}{n}\sum_{i=1}^n D_i + H^{*\top}(\hat\beta-\beta^o) + o_p(n^{-1/2}),$$
where
$$D_i = \frac{T_iY_i(1)}{\pi_i^o} - \frac{(1-T_i)Y_i(0)}{1-\pi_i^o} - \mu,$$
and
$$H^* = -\mathbb{E}\Big[\Big(\frac{\pi(X_i)(K(X_i)+L(X_i))}{\pi_i^{o2}} + \frac{(1-\pi(X_i))K(X_i)}{(1-\pi_i^o)^2}\Big)\frac{\partial\pi_i^o}{\partial\beta}\Big].$$
By equation (A.1) in Lemma A.3, we have
$$n^{1/2}(\hat\mu_{\hat\beta}-\mu) \xrightarrow{d} N(0,\ \tilde H^{*\top}\Sigma\tilde H^*),$$
where $\tilde H^* = (1, H^{*\top})^\top$, $\Sigma_\beta = (G^{*\top}W^*G^*)^{-1}G^{*\top}W^*\Omega W^*G^*(G^{*\top}W^*G^*)^{-1}$ and
$$\Sigma = \begin{pmatrix}\Sigma_\mu & \Sigma_{\mu\beta}^\top\\ \Sigma_{\mu\beta} & \Sigma_\beta\end{pmatrix}.$$
Denote $c_i(T_i,X_i,Y_i(1),Y_i(0)) = T_iY_i(1)/\pi_i^o - (1-T_i)Y_i(0)/(1-\pi_i^o) - \mu$. As shown in the proof of Theorem 3.1, $\mathbb{E}[c_i(T_i,X_i,Y_i(1),Y_i(0))] = 0$. Thus,
$$\Sigma_\mu = \mathbb{E}[c_i^2(T_i,X_i,Y_i(1),Y_i(0))] = \mathbb{E}\Big(\frac{T_iY_i^2(1)}{\pi_i^{o2}} + \frac{(1-T_i)Y_i^2(0)}{(1-\pi_i^o)^2}\Big) - \mu^2 = \mathbb{E}\Big(\frac{\pi(X_i)Y_i^2(1)}{\pi_i^{o2}} + \frac{(1-\pi(X_i))Y_i^2(0)}{(1-\pi_i^o)^2}\Big) - \mu^2.$$
Similarly, the off-diagonal block can be written as
$$\Sigma_{\mu\beta} = -(G^{*\top}W^*G^*)^{-1}G^{*\top}W^*S,$$
where $S = \big(\mathbb{E}[g_{1\beta^o}^\top(T_i,X_i)c_i(T_i,X_i,Y_i(1),Y_i(0))],\ \mathbb{E}[g_{2\beta^o}^\top(T_i,X_i)c_i(T_i,X_i,Y_i(1),Y_i(0))]\big)^\top$ with
$$g_{1\beta}(T_i,X_i) = \Big(\frac{T_i}{\pi_\beta(X_i)} - \frac{1-T_i}{1-\pi_\beta(X_i)}\Big)h_1(X_i), \qquad g_{2\beta}(T_i,X_i) = \Big(\frac{T_i}{\pi_\beta(X_i)} - 1\Big)h_2(X_i). \tag{C.2}$$
After some tedious algebra, we can show that $S = (S_1^\top, S_2^\top)^\top$, where
$$S_1 = \mathbb{E}\Big[\Big(\frac{\pi(X_i)\big(K(X_i)+L(X_i)-\pi_i^o\mu\big)}{\pi_i^{o2}} + \frac{(1-\pi(X_i))\big(K(X_i)+(1-\pi_i^o)\mu\big)}{(1-\pi_i^o)^2}\Big)h_1(X_i)\Big],$$
$$S_2 = \mathbb{E}\Big[\Big(\frac{\pi(X_i)\big[(K(X_i)+L(X_i))(1-\pi_i^o)-\pi_i^o\mu\big]}{\pi_i^{o2}} + \frac{(1-\pi(X_i))K(X_i)+(1-\pi_i^o)\mu}{1-\pi_i^o}\Big)h_2(X_i)\Big].$$
This completes the proof of equation (3.5).

Finally, we prove part (3). By (3.3), the asymptotic variance of $\hat\mu$, denoted by $V$, can be written as
$$V = \Sigma_\mu + 2H^{*\top}\Sigma_{\mu\beta} + H^{*\top}\Sigma_\beta H^*. \tag{C.3}$$
Note that by Lemma A.3 we have $\Sigma_\beta = (G^{*\top}\Omega^{-1}G^*)^{-1}$. Under the correctly specified propensity score model, some algebra yields
$$\Omega = \mathbb{E}[g_{\beta^*}(T_i,X_i)g_{\beta^*}^\top(T_i,X_i)] = \begin{pmatrix}\mathbb{E}\big(\frac{h_1h_1^\top}{\pi_i^*(1-\pi_i^*)}\big) & \mathbb{E}\big(\frac{h_1h_2^\top}{\pi_i^*}\big)\\[2pt] \mathbb{E}\big(\frac{h_2h_1^\top}{\pi_i^*}\big) & \mathbb{E}\big(\frac{h_2h_2^\top(1-\pi_i^*)}{\pi_i^*}\big)\end{pmatrix},$$
where $g_\beta(T_i,X_i) = (g_{1\beta}^\top(T_i,X_i), g_{2\beta}^\top(T_i,X_i))^\top$ and $g_{1\beta}(T_i,X_i)$ and $g_{2\beta}(T_i,X_i)$ are defined in (C.2). In addition, $G^* = (G_1^{*\top}, G_2^{*\top})^\top$, where
$$G_1^* = -\mathbb{E}\Big[\frac{h_1(X_i)}{\pi_i^*(1-\pi_i^*)}\Big(\frac{\partial\pi_i^*}{\partial\beta}\Big)^\top\Big], \qquad G_2^* = -\mathbb{E}\Big[\frac{h_2(X_i)}{\pi_i^*}\Big(\frac{\partial\pi_i^*}{\partial\beta}\Big)^\top\Big]. \tag{C.4}$$
The functions $K(\cdot)$ and $L(\cdot)$ lie in the linear spaces spanned by $M_1h_1(\cdot)$ and $M_2h_2(\cdot)$ respectively, where $M_1 \in \mathbb{R}^{q\times m_1}$ and $M_2 \in \mathbb{R}^{q\times m_2}$ are the partitions of $G^{*\top}W^* = (M_1, M_2)$. We thus have $K(X_i) = \alpha_1^\top M_1h_1(X_i)$ and $L(X_i) = \alpha_2^\top M_2h_2(X_i)$, where $\alpha_1$ and $\alpha_2$ are $q$-dimensional vectors of constants. Therefore,
$$H^* = -\mathbb{E}\Big[\frac{K(X_i)+L(X_i)(1-\pi_i^*)}{\pi_i^*(1-\pi_i^*)}\frac{\partial\pi_i^*}{\partial\beta}\Big] = -\mathbb{E}\Big[\frac{\alpha_1^\top M_1h_1(X_i)+\alpha_2^\top M_2h_2(X_i)(1-\pi_i^*)}{\pi_i^*(1-\pi_i^*)}\frac{\partial\pi_i^*}{\partial\beta}\Big].$$
Comparing this with the expression for $G^*$ in (C.4), we can rewrite $H^*$ as
$$H^* = G^{*\top}\begin{pmatrix}M_1^\top\alpha_1\\ M_2^\top\alpha_2\end{pmatrix}.$$
Following similar derivations, it is seen that
$$\Sigma_{\mu\beta} = -(G^{*\top}\Omega^{-1}G^*)^{-1}G^{*\top}\Omega^{-1}\begin{pmatrix}\mathbb{E}\big[\frac{\alpha_1^\top M_1h_1(X_i)+\alpha_2^\top M_2h_2(X_i)(1-\pi_i^*)}{\pi_i^*(1-\pi_i^*)}h_1(X_i)\big]\\[2pt] \mathbb{E}\big[\frac{\alpha_1^\top M_1h_1(X_i)+\alpha_2^\top M_2h_2(X_i)(1-\pi_i^*)}{\pi_i^*}h_2(X_i)\big]\end{pmatrix},$$
which is equivalent to
$$\Sigma_{\mu\beta} = -(G^{*\top}\Omega^{-1}G^*)^{-1}G^{*\top}\begin{pmatrix}M_1^\top\alpha_1\\ M_2^\top\alpha_2\end{pmatrix}.$$
Hence,
$$H^{*\top}\Sigma_{\mu\beta} = -(\alpha_1^\top M_1,\ \alpha_2^\top M_2)G^*(G^{*\top}\Omega^{-1}G^*)^{-1}G^{*\top}\begin{pmatrix}M_1^\top\alpha_1\\ M_2^\top\alpha_2\end{pmatrix} = -H^{*\top}\Sigma_\beta H^*.$$
Together with (C.3), we have
$$V = \Sigma_\mu - (\alpha_1^\top M_1,\ \alpha_2^\top M_2)G^*(G^{*\top}\Omega^{-1}G^*)^{-1}G^{*\top}\begin{pmatrix}M_1^\top\alpha_1\\ M_2^\top\alpha_2\end{pmatrix}.$$
This completes the proof.


C.3 Proof of Corollary 3.1

Proof of Corollary 3.1. By Theorem 3.2, it suffices to show that
$$(\bar\alpha_1^\top\bar M_1,\ \bar\alpha_2^\top\bar M_2)\,\bar G^*\bar C\bar G^{*\top}\begin{pmatrix}\bar M_1^\top\bar\alpha_1\\ \bar M_2^\top\bar\alpha_2\end{pmatrix} \le (\alpha_1^\top M_1,\ \alpha_2^\top M_2)\,G^*CG^{*\top}\begin{pmatrix}M_1^\top\alpha_1\\ M_2^\top\alpha_2\end{pmatrix}, \tag{C.5}$$
where $C = (G^{*\top}\Omega^{-1}G^*)^{-1}$ and $\bar\alpha_1$ and $\bar M_1$, among others, are the corresponding quantities computed with $\bar h_1(X)$ and $\bar h_2(X)$. Assume that $\bar h_1(X) \in \mathbb{R}^{m_1+a_1}$ and $\bar h_2(X) \in \mathbb{R}^{m_2+a_2}$. Since $K(X_i) = \alpha_1^\top M_1h_1(X_i)$ and $L(X_i) = \alpha_2^\top M_2h_2(X_i)$, we find that $(\bar\alpha_1^\top\bar M_1,\ \bar\alpha_2^\top\bar M_2) = (\alpha_1^\top M_1,\ 0,\ \alpha_2^\top M_2,\ 0)$, which is a vector in $\mathbb{R}^{m+a}$ with $a = a_1+a_2$. Because some components of $(\bar\alpha_1^\top\bar M_1,\ \bar\alpha_2^\top\bar M_2)$ are 0, by matrix algebra (C.5) holds if $C - \bar C$ is positive semidefinite. Without loss of generality, we rearrange the coordinates and write the $(m+a)\times q$ matrix $\bar G^*$ and the $(m+a)\times(m+a)$ matrix $\bar\Omega$ as
$$\bar G^* = \begin{pmatrix}G^*\\ A\end{pmatrix}, \qquad \bar\Omega = \begin{pmatrix}\Omega & \Omega_1\\ \Omega_1^\top & \Omega_2\end{pmatrix}.$$
For simplicity, we use the following notation: two symmetric matrices satisfy $O_1 \ge O_2$ if $O_1 - O_2$ is positive semidefinite. To show $C \ge \bar C$, we have the following derivation:
$$\bar G^{*\top}\bar\Omega^{-1}\bar G^* = (G^{*\top}, A^\top)\begin{pmatrix}\Omega & \Omega_1\\ \Omega_1^\top & \Omega_2\end{pmatrix}^{-1}\begin{pmatrix}G^*\\ A\end{pmatrix} \ge (G^{*\top}, A^\top)\begin{pmatrix}\Omega^{-1} & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix}G^*\\ A\end{pmatrix} = G^{*\top}\Omega^{-1}G^*,$$
so that $\bar C = (\bar G^{*\top}\bar\Omega^{-1}\bar G^*)^{-1} \le (G^{*\top}\Omega^{-1}G^*)^{-1} = C$. This completes the proof of (C.5), and therefore the corollary holds.

C.4 Proof of Corollary 3.2

Proof of Corollary 3.2. The double robustness property follows mainly from Theorem 3.1. In this case, we only need to verify that $\mathrm{span}\{h_1(\cdot)\} = \mathrm{span}\{M_1h_1(\cdot)\}$ and $\mathrm{span}\{h_2(\cdot)\} = \mathrm{span}\{M_2h_2(\cdot)\}$, where $M_1 \in \mathbb{R}^{q\times m_1}$ and $M_2 \in \mathbb{R}^{q\times m_2}$ are the partitions of $G^{*\top}W^* = (M_1, M_2)$. Clearly, $\mathrm{span}\{M_1h_1(\cdot)\} \subseteq \mathrm{span}\{h_1(\cdot)\}$, since the former can always be written as a linear combination of $h_1(\cdot)$. To show $\mathrm{span}\{h_1(\cdot)\} \subseteq \mathrm{span}\{M_1h_1(\cdot)\}$, note that the $m_1\times m_1$ principal submatrix $M_{11}$ of $M_1$ is invertible. Thus, $\mathrm{span}\{h_1(\cdot)\} = \mathrm{span}\{M_{11}h_1(\cdot)\} \subseteq \mathrm{span}\{M_1h_1(\cdot)\}$, because the $m_1$-dimensional functions $M_{11}h_1(\cdot)$ coincide with the first $m_1$ coordinates of $M_1h_1(\cdot)$. This completes the proof of the double robustness property. The efficiency property follows from Theorem 3.2; we do not replicate the details.

D Proof of Results in Section 4

For notational simplicity, we denote $\pi^*(x) = J(m^*(x))$, $J^*(x) = J(\beta^{*\top}B(x))$, and $\hat J(x) = J(\hat\beta^\top B(x))$. Define $Q_n(\beta) = \|g_\beta(T,X)\|_2^2$ and $Q(\beta) = \|\mathbb{E}g_\beta(T_i,X_i)\|_2^2$. In the following proofs, we use $C, C'$ and $C''$ to denote generic positive constants whose values may change from line to line.


Lemma D.1 (Bernstein's inequality for U-statistics (Arcones, 1995)). Given i.i.d. random variables $Z_1,\dots,Z_n$ taking values in a measurable space $(S,\mathcal{B})$ and a symmetric and measurable kernel function $h: S^m \to \mathbb{R}$, define the U-statistic with kernel $h$ as $U := \binom{n}{m}^{-1}\sum_{i_1<\dots<i_m}h(Z_{i_1},\dots,Z_{i_m})$. Suppose that $\mathbb{E}h(Z_{i_1},\dots,Z_{i_m}) = 0$, $\mathbb{E}\big\{\mathbb{E}[h(Z_{i_1},\dots,Z_{i_m})\mid Z_{i_1}]\big\}^2 = \sigma^2$ and $\|h\|_\infty \le b$. Then there exists a constant $K(m) > 0$ depending on $m$ such that
$$\mathbb{P}(|U| > t) \le 4\exp\Big\{-\frac{nt^2}{2m^2\sigma^2 + K(m)bt}\Big\}, \qquad \forall t > 0.$$

Lemma D.2. Under the conditions in Theorem 4.1, it holds that
$$\sup_{\beta\in\Theta}\big|Q_n(\beta) - Q(\beta)\big| = O_p\Big(\sqrt{\frac{K^2\log K}{n}}\Big).$$

Proof of Lemma D.2. Let $\xi(\beta) = (\xi_1(\beta),\dots,\xi_n(\beta))^\top$ and $\phi(\beta) = (\phi_1(\beta),\dots,\phi_n(\beta))^\top$, where
$$\xi_i(\beta) = \frac{T_i}{J(\beta^\top B(X_i))} - \frac{1-T_i}{1-J(\beta^\top B(X_i))}, \qquad \phi_i(\beta) = \frac{T_i}{J(\beta^\top B(X_i))} - 1.$$
Then we have
$$Q_n(\beta) = n^{-2}\sum_{i=1}^n\sum_{j=1}^n\big[\xi_i(\beta)\xi_j(\beta)h_1(X_i)^\top h_1(X_j) + \phi_i(\beta)\phi_j(\beta)h_2(X_i)^\top h_2(X_j)\big]$$
$$= n^{-2}\sum_{1\le i\ne j\le n}\big[\xi_i(\beta)\xi_j(\beta)h_1(X_i)^\top h_1(X_j) + \phi_i(\beta)\phi_j(\beta)h_2(X_i)^\top h_2(X_j)\big] + A_n(\beta),$$
where $A_n(\beta) = n^{-2}\sum_{i=1}^n\big[\xi_i(\beta)^2\|h_1(X_i)\|_2^2 + \phi_i(\beta)^2\|h_2(X_i)\|_2^2\big]$. Since there exists a constant $c_0 > 0$ such that $c_0 \le |J(\beta^\top B(x))| \le 1-c_0$ for any $\beta\in\Theta$, and $T_i \in \{0,1\}$, it follows that $\sup_{\beta\in\Theta}\max_{1\le i\le n}|\xi_i(\beta)| \le C$ and $\sup_{\beta\in\Theta}\max_{1\le i\le n}|\phi_i(\beta)| \le C$ for some constant $C>0$. Then we can show that
$$\mathbb{E}\Big(\sup_{\beta\in\Theta}|A_n(\beta)|\Big) \le \frac{C}{n}\mathbb{E}(\|h(X_i)\|_2^2) = O(K/n).$$
By the Markov inequality, we have $\sup_{\beta\in\Theta}|A_n(\beta)| = O_p(K/n) = o_p(1)$. Following similar arguments, it can easily be shown that $\sup_{\beta\in\Theta}|Q(\beta)|/n = O(K/n)$. Thus, it holds that
$$\sup_{\beta\in\Theta}|Q_n(\beta) - Q(\beta)| = \sup_{\beta\in\Theta}\Big|\frac{2}{n(n-1)}\sum_{1\le i<j\le n}u_{ij}(\beta)\Big| + O_p(K/n), \tag{D.1}$$
where $u_{ij}(\beta) = u_{1ij}(\beta) + u_{2ij}(\beta)$ is the kernel function of a U-statistic with
$$u_{1ij}(\beta) = \xi_i(\beta)\xi_j(\beta)h_1(X_i)^\top h_1(X_j) - \mathbb{E}[\xi_i(\beta)\xi_j(\beta)h_1(X_i)^\top h_1(X_j)],$$
$$u_{2ij}(\beta) = \phi_i(\beta)\phi_j(\beta)h_2(X_i)^\top h_2(X_j) - \mathbb{E}[\phi_i(\beta)\phi_j(\beta)h_2(X_i)^\top h_2(X_j)].$$
Since $\Theta$ is a compact set in $\mathbb{R}^K$, a standard covering number bound gives a constant $C$ such that $M = (C/r)^K$ balls of radius $r$ cover $\Theta$; namely, $\Theta \subseteq \cup_{1\le m\le M}\Theta_m$, where $\Theta_m = \{\beta\in\mathbb{R}^K: \|\beta-\beta_m\|_2 \le r\}$ for some $\beta_1,\dots,\beta_M$. Thus, for any given $\varepsilon>0$,
$$\mathbb{P}\Big(\sup_{\beta\in\Theta}\Big|\frac{2}{n(n-1)}\sum_{1\le i<j\le n}u_{1ij}(\beta)\Big| > \varepsilon\Big) \le \sum_{m=1}^M \mathbb{P}\Big(\sup_{\beta\in\Theta_m}\Big|\frac{2}{n(n-1)}\sum_{1\le i<j\le n}u_{1ij}(\beta)\Big| > \varepsilon\Big)$$
$$\le \sum_{m=1}^M\Big[\mathbb{P}\Big(\Big|\frac{2}{n(n-1)}\sum_{1\le i<j\le n}u_{1ij}(\beta_m)\Big| > \varepsilon/2\Big) + \mathbb{P}\Big(\sup_{\beta\in\Theta_m}\frac{2}{n(n-1)}\sum_{1\le i<j\le n}\big|u_{1ij}(\beta) - u_{1ij}(\beta_m)\big| > \varepsilon/2\Big)\Big]. \tag{D.2}$$
By the Cauchy-Schwarz inequality, $|h_1(X_i)^\top h_1(X_j)| \le \|h_1(X_i)\|_2\|h_1(X_j)\|_2 \le CK$, and thus $|u_{1ij}(\beta_m)| \le CK$. In addition, for any $\beta$,
$$\mathbb{E}\big\{\xi_i(\beta)h_1(X_i)^\top\mathbb{E}[\xi_j(\beta)h_1(X_j)] - \mathbb{E}[\xi_i(\beta)\xi_j(\beta)h_1(X_i)^\top h_1(X_j)]\big\}^2 \le \mathbb{E}\big\{\xi_i(\beta)h_1(X_i)^\top\mathbb{E}[\xi_j(\beta)h_1(X_j)]\big\}^2 \le \big\|\mathbb{E}[\xi_i^2(\beta)h_1(X_i)h_1(X_i)^\top]\big\|_2\cdot\big\|\mathbb{E}[\xi_j(\beta)h_1(X_j)]\big\|_2^2 \le CK,$$
for some constant $C>0$. Here, in the last step we use the fact that
$$\|\mathbb{E}[\xi_j(\beta)h_1(X_j)]\|_2^2 \le \mathbb{E}\|\xi_j(\beta)h_1(X_j)\|_2^2 \le C\,\mathbb{E}\|h_1(X_j)\|_2^2 \le CK,$$
and $\|\mathbb{E}[\xi_i^2(\beta)h_1(X_i)h_1(X_i)^\top]\|_2$ is bounded because $\|\mathbb{E}[h_1(X_j)h_1(X_j)^\top]\|_2$ is bounded by assumption. Thus, we can apply Bernstein's inequality in Lemma D.1 to the U-statistic with kernel $u_{1ij}(\beta_m)$:
$$\mathbb{P}\Big(\Big|\frac{2}{n(n-1)}\sum_{1\le i<j\le n}u_{1ij}(\beta_m)\Big| > \varepsilon/2\Big) \le 2\exp\big(-Cn\varepsilon^2/[K + K\varepsilon]\big), \tag{D.3}$$
for some constant $C>0$. Since $|\partial J(v)/\partial v|$ is bounded above by a constant for any $v = \beta^\top B(x)$, it is easily seen that for any $\beta\in\Theta_m$, $|\xi_i(\beta) - \xi_i(\beta_m)| \le C|(\beta-\beta_m)^\top B(X_i)| \le CrK^{1/2}$, where the last step follows from the Cauchy-Schwarz inequality. This further implies $|\xi_i(\beta)\xi_j(\beta) - \xi_i(\beta_m)\xi_j(\beta_m)| \le CrK^{1/2}$ for some constant $C>0$, by a standard perturbation analysis. Thus,
$$|u_{1ij}(\beta) - u_{1ij}(\beta_m)| \le CrK^{1/2}|h_1(X_i)^\top h_1(X_j)| \le CrK^{3/2},$$
and note that, with $r = K^{-2}$, we have $CrK^{1/2}\mathbb{E}|h_1(X_i)^\top h_1(X_j)| \le \varepsilon/4$ for $n$ large enough. Thus
$$\mathbb{P}\Big(\sup_{\beta\in\Theta_m}\frac{2}{n(n-1)}\sum_{1\le i<j\le n}\big|u_{1ij}(\beta)-u_{1ij}(\beta_m)\big| > \varepsilon/2\Big) \le \mathbb{P}\Big(\frac{2CrK^{1/2}}{n(n-1)}\sum_{1\le i<j\le n}|h_1(X_i)^\top h_1(X_j)| > \varepsilon/2\Big)$$
$$\le \mathbb{P}\Big(\frac{2CrK^{1/2}}{n(n-1)}\sum_{1\le i<j\le n}\big[|h_1(X_i)^\top h_1(X_j)| - \mathbb{E}|h_1(X_i)^\top h_1(X_j)|\big] > \varepsilon/4\Big) \le 2\exp(-CnK\varepsilon^2), \tag{D.4}$$
where the last step follows from the Hoeffding inequality for U-statistics. Thus, combining (D.2), (D.3) and (D.4), we have for some constants $C_1,C_2,C_3>0$, as $n\to\infty$,
$$\mathbb{P}\Big(\sup_{\beta\in\Theta}\Big|\frac{2}{n(n-1)}\sum_{1\le i<j\le n}u_{1ij}(\beta)\Big| > \varepsilon\Big) \le \exp\big(C_1K\log K - C_2n\varepsilon^2/[K+K\varepsilon]\big) + \exp\big(C_1K\log K - C_3nK\varepsilon^2\big) \to 0,$$
where we take $\varepsilon = C\sqrt{K^2\log K/n}$ for some sufficiently large constant $C$. This implies
$$\sup_{\beta\in\Theta}\Big|\frac{2}{n(n-1)}\sum_{1\le i<j\le n}u_{1ij}(\beta)\Big| = O_p\Big(\sqrt{\frac{K^2\log K}{n}}\Big).$$
Following the same arguments, we can show that, with the same choice of $\varepsilon$,
$$\sup_{\beta\in\Theta}\Big|\frac{2}{n(n-1)}\sum_{1\le i<j\le n}u_{2ij}(\beta)\Big| = O_p\Big(\sqrt{\frac{K^2\log K}{n}}\Big).$$
Plugging these results into (D.1), we complete the proof.

Lemma D.3 (Bernstein's inequality for random matrices (Tropp, 2015)). Let $\{Z_k\}$ be a sequence of independent random matrices with dimensions $d_1\times d_2$. Assume that $\mathbb{E}Z_k = 0$ and $\|Z_k\|_2 \le R_n$ almost surely. Define
$$\sigma_n^2 = \max\Big\{\Big\|\sum_{k=1}^n\mathbb{E}(Z_kZ_k^\top)\Big\|_2,\ \Big\|\sum_{k=1}^n\mathbb{E}(Z_k^\top Z_k)\Big\|_2\Big\}.$$
Then, for all $t \ge 0$,
$$\mathbb{P}\Big(\Big\|\sum_{k=1}^n Z_k\Big\|_2 \ge t\Big) \le (d_1+d_2)\exp\Big(-\frac{t^2/2}{\sigma_n^2 + R_nt/3}\Big).$$

Lemma D.4. Let $H = (h(X_1),\dots,h(X_n))^\top$ and $B = (B(X_1),\dots,B(X_n))^\top$ be two $n\times K$ matrices. Under the conditions in Theorem 4.1,
$$\|H^\top H/n - \mathbb{E}[h(X_i)h(X_i)^\top]\|_2 = O_p\big(\sqrt{K\log K/n}\big) \tag{D.5}$$
and
$$\|B^\top B/n - \mathbb{E}[B(X_i)B(X_i)^\top]\|_2 = O_p\big(\sqrt{K\log K/n}\big). \tag{D.6}$$

Proof of Lemma D.4. We prove this result by applying Lemma D.3. In particular, to prove (D.5), we take $Z_i = n^{-1}[h(X_i)h(X_i)^\top - \mathbb{E}(h(X_i)h(X_i)^\top)]$. It is easily seen that
$$\|Z_i\|_2 \le n^{-1}\big[\mathrm{tr}(h(X_i)h(X_i)^\top) + \|\mathbb{E}(h(X_i)h(X_i)^\top)\|_2\big] \le (CK + C)/n,$$
where $C$ is some positive constant. Moreover,
$$\Big\|\sum_{i=1}^n\mathbb{E}(Z_iZ_i^\top)\Big\|_2 \le n^{-1}\big(\|\mathbb{E}[h(X_i)h(X_i)^\top h(X_i)h(X_i)^\top]\|_2 + \|\mathbb{E}(h(X_i)h(X_i)^\top)\|_2^2\big) \le n^{-1}\big(CK\,\|\mathbb{E}(h(X_i)h(X_i)^\top)\|_2 + C^2\big) \le n^{-1}(C^2K + C^2).$$
Note that $\sqrt{K\log K/n} = o(1)$. Now, if we take $t = C\sqrt{K\log K/n}$ in Lemma D.3 for some sufficiently large constant $C$, then $\mathbb{P}(\|\sum_{k=1}^n Z_k\|_2 \ge t) \le 2K\exp(-C'\log K)$ for some $C' > 1$, and the right-hand side converges to 0 as $K\to\infty$. This completes the proof of (D.5). The proof of (D.6) follows from the same arguments and is omitted for simplicity.
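The concentration in (D.5) is easy to visualize empirically. The sketch below (ours, not from the paper) uses a cosine feature map with $\mathbb{E}[h(X)h(X)^\top] = I$ on uniform covariates and compares the operator-norm deviation of the sample second-moment matrix with the $\sqrt{K\log K/n}$ rate; the basis choice is an arbitrary test case.

```python
import numpy as np

# Empirical sketch (ours) of (D.5): for scaled cosine features on [0, 1],
# E[h(X) h(X)'] = I, and the operator norm ||H'H/n - I||_2 shrinks at
# roughly sqrt(K log K / n), as the matrix Bernstein inequality predicts.

rng = np.random.default_rng(4)

def h(x, K):
    # (n, K) design of orthonormal cosine features on [0, 1]
    j = np.arange(1, K + 1)
    return np.sqrt(2.0) * np.cos(np.pi * np.outer(x, j))

K = 50
Sigma = np.eye(K)            # population E[h h'] for this cosine system
for n in (500, 2000, 8000):
    X = rng.uniform(size=n)
    H = h(X, K)
    dev = np.linalg.norm(H.T @ H / n - Sigma, 2)
    print(f"n = {n:5d}  ||H'H/n - Sigma||_2 = {dev:.3f}  "
          f"sqrt(K log K / n) = {np.sqrt(K * np.log(K) / n):.3f}")
```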

Lemma D.5. Under the conditions in Theorem 4.1, the following results hold.

1. Let $\bar U = \frac{1}{n}\sum_{i=1}^n U_i$, $U_i = (U_{i1}^\top, U_{i2}^\top)^\top$, with
$$U_{i1} = \Big(\frac{T_i}{\pi_i^*} - \frac{1-T_i}{1-\pi_i^*}\Big)h_1(X_i), \qquad U_{i2} = \Big(\frac{T_i}{\pi_i^*} - 1\Big)h_2(X_i).$$
Then $\|\bar U\|_2 = O_p(K^{1/2}/n^{1/2})$.

2. Let $B(r) = \{\beta\in\mathbb{R}^K : \|\beta-\beta^*\|_2 \le r\}$, with $r = O(K^{1/2}/n^{1/2} + K^{-r_b})$. Then
$$\sup_{\beta\in B(r)}\Big\|\frac{\partial g_\beta(T,X)}{\partial\beta} - G^*\Big\|_2 = O_p\Big(K^{1/2}r + \sqrt{\frac{K\log K}{n}}\Big).$$

3. Let $J_i = J(\beta^\top B(X_i))$, $\dot J_i = \partial J(v)/\partial v|_{v=\beta^\top B(X_i)}$, and
$$T^* = \mathbb{E}\Big[\Big(\frac{\mathbb{E}(Y_i(1)\mid X_i)}{\pi_i^*} - \frac{\mathbb{E}(Y_i(0)\mid X_i)}{1-\pi_i^*}\Big)\dot J_i^*B(X_i)\Big].$$
Then
$$\sup_{\beta\in B(r)}\Big\|\frac{1}{n}\sum_{i=1}^n\Big[\frac{T_iY_i(1)}{J_i^2} + \frac{(1-T_i)Y_i(0)}{(1-J_i)^2}\Big]\dot J_iB(X_i) + G^{*\top}\alpha^*\Big\|_2 = O_p\big(K^{1/2}r + K^{-r_h}\big).$$

Proof of Lemma D.5. We start with the proof of the first result. Note that $\mathbb{E}(U_i) = 0$. Then $\mathbb{E}\|\bar U\|_2^2 = \mathbb{E}(U_i^\top U_i)/n$, and there exists some constant $C>0$ such that
$$\mathbb{E}\|\bar U\|_2^2 = n^{-1}\mathbb{E}\Big[\sum_{k=1}^K\Big(\frac{T_i}{\pi_i^*}-\frac{1-T_i}{1-\pi_i^*}\Big)^2h_k(X_i)^2I(k\le m_1) + \Big(\frac{T_i}{\pi_i^*}-1\Big)^2h_k(X_i)^2I(k>m_1)\Big] \le C\sum_{k=1}^K\mathbb{E}[h_k(X_i)^2]/n = O(K/n).$$
By the Markov inequality, this implies $\|\bar U\|_2 = O_p(K^{1/2}/n^{1/2})$, which completes the proof of the first result.

In the following, we prove the second result. Denote
$$\xi_i(m(X_i)) = -\Big(\frac{T_i}{J^2(m(X_i))} + \frac{1-T_i}{(1-J(m(X_i)))^2}\Big)\dot J(m(X_i)), \qquad \phi_i(m(X_i)) = -\frac{T_i}{J^2(m(X_i))}\dot J(m(X_i)),$$
and let $\Delta_i(m(X_i)) = \mathrm{diag}(\xi_i(m(X_i))\mathbf{1}_{m_1},\ \phi_i(m(X_i))\mathbf{1}_{m_2})$ be a $K\times K$ diagonal matrix, where $\mathbf{1}_{m_1}$ is a vector of ones of length $m_1$. Then note that
$$\frac{\partial g_\beta(T,X)}{\partial\beta} - G^* = \frac{1}{n}\sum_{i=1}^n B(X_i)h(X_i)^\top\Delta_i(\beta^\top B(X_i)) - \mathbb{E}[B(X_i)h(X_i)^\top\Delta_i(m^*(X_i))],$$
which can be decomposed into the two terms $I_\beta + II$, where
$$I_\beta = \frac{1}{n}\sum_{i=1}^n B(X_i)h(X_i)^\top\big[\Delta_i(\beta^\top B(X_i)) - \Delta_i(m^*(X_i))\big], \qquad II = \sum_{i=1}^n Z_i,$$
$$Z_i = n^{-1}\big\{B(X_i)h(X_i)^\top\Delta_i(m^*(X_i)) - \mathbb{E}[B(X_i)h(X_i)^\top\Delta_i(m^*(X_i))]\big\}.$$
We first consider the term $II$. It can easily be verified that $\|\Delta_i(m^*(X_i))\|_2 \le C$ for some constant $C>0$. In addition, $\|B(X_i)h(X_i)^\top\|_2 \le \|B(X_i)\|_2\cdot\|h(X_i)\|_2 \le CK$. Thus, $\|Z_i\|_2 \le CK/n$. Following the similar argument in the proof of Lemma D.4,
$$\Big\|\sum_{i=1}^n\mathbb{E}(Z_iZ_i^\top)\Big\|_2 \le n^{-1}\big\|\mathbb{E}[B(X_i)h(X_i)^\top\Delta_i(m^*(X_i))\Delta_i(m^*(X_i))h(X_i)B(X_i)^\top]\big\|_2 + n^{-1}\big\|\mathbb{E}[B(X_i)h(X_i)^\top\Delta_i(m^*(X_i))]\big\|_2^2.$$
We now consider the last two terms separately. Note that
$$\|\mathbb{E}[B(X_i)h(X_i)^\top\Delta_i(m^*(X_i))]\|_2^2 = \sup_{\|u\|_2=1,\|v\|_2=1}|\mathbb{E}[u^\top B(X_i)h(X_i)^\top\Delta_i(m^*(X_i))v]|^2 \le \sup_{\|u\|_2=1}|\mathbb{E}[u^\top B(X_i)B(X_i)^\top u]|\cdot\sup_{\|v\|_2=1}|\mathbb{E}[v^\top\Delta_i(m^*(X_i))h(X_i)h(X_i)^\top\Delta_i(m^*(X_i))v]| \le \|\mathbb{E}(B(X_i)B(X_i)^\top)\|_2\cdot C\|\mathbb{E}(h(X_i)h(X_i)^\top)\|_2 \le C', \tag{D.7}$$
where $C, C'$ are some positive constants. Following similar arguments to (D.7),
$$\|\mathbb{E}[B(X_i)h(X_i)^\top\Delta_i(m^*(X_i))\Delta_i(m^*(X_i))h(X_i)B(X_i)^\top]\|_2 \le CK\cdot\sup_{\|u\|_2=1}|\mathbb{E}[u^\top B(X_i)B(X_i)^\top u]| \le CK\cdot\|\mathbb{E}[B(X_i)B(X_i)^\top]\|_2 \le C'K,$$
for some constants $C, C'>0$. This implies $\|\sum_{i=1}^n\mathbb{E}(Z_iZ_i^\top)\|_2 \le CK/n$. Thus, Lemma D.3 implies $\|II\|_2 = O_p(\sqrt{K\log K/n})$. Next, we consider the term $I_\beta$. Following similar arguments to (D.7), we can show that
$$\sup_{\beta\in B(r)}\|I_\beta\|_2 = \sup_{\beta\in B(r)}\sup_{\|u\|_2=1,\|v\|_2=1}\Big|\frac{1}{n}\sum_{i=1}^n u^\top B(X_i)h(X_i)^\top[\Delta_i(\beta^\top B(X_i)) - \Delta_i(m^*(X_i))]v\Big|$$
$$\le \Big\|\frac{1}{n}\sum_{i=1}^n B(X_i)B(X_i)^\top\Big\|_2^{1/2}\cdot\Big\|\frac{1}{n}\sum_{i=1}^n h(X_i)h(X_i)^\top\Big\|_2^{1/2}\cdot\sup_{\beta\in B(r)}\max_{1\le i\le n}\|\Delta_i(\beta^\top B(X_i)) - \Delta_i(m^*(X_i))\|_2$$
$$\le C\sup_{\beta\in B(r)}\sup_{x\in\mathcal{X}}|(\beta^*-\beta)^\top B(x)| + C\sup_{x\in\mathcal{X}}|m^*(x)-\beta^{*\top}B(x)| \le C'(K^{1/2}r + K^{-r_b}) \le C''K^{1/2}r,$$
for some $C, C', C''>0$, where the second inequality follows from Lemma D.4 and the Lipschitz property of $\xi_i(\cdot)$ and $\phi_i(\cdot)$, and the third inequality is due to the Cauchy-Schwarz inequality and the approximation assumption of the sieve estimator. This completes the proof of the second result.

For the third result, let
$$\eta_i(m(X_i)) = \Big(\frac{T_iY_i(1)}{J^2(m(X_i))} + \frac{(1-T_i)Y_i(0)}{(1-J(m(X_i)))^2}\Big)\dot J(m(X_i)).$$
Thus, the following decomposition holds:
$$\frac{1}{n}\sum_{i=1}^n\eta_i(\beta^\top B(X_i))B(X_i) + G^{*\top}\alpha^* = T_{1\beta} + T_2 + T_3,$$
where
$$T_{1\beta} = \frac{1}{n}\sum_{i=1}^n\big[\eta_i(\beta^\top B(X_i)) - \eta_i(m^*(X_i))\big]B(X_i),$$
$$T_2 = \frac{1}{n}\sum_{i=1}^n\big\{\eta_i(m^*(X_i))B(X_i) - \mathbb{E}[\eta_i(m^*(X_i))B(X_i)]\big\},$$
$$T_3 = \mathbb{E}[\eta_i(m^*(X_i))B(X_i)] + G^{*\top}\alpha^*.$$
Similar to the bound on $\sup_{\beta\in B(r)}\|I_\beta\|_2$ above, we can easily show that $\sup_{\beta\in B(r)}\|T_{1\beta}\|_2 = O_p(K^{1/2}r)$; again, the key step is to use the results from Lemma D.4. For the second term $T_2$, we can use arguments similar to those in the proof of the first result to show that $\mathbb{E}\|T_2\|_2^2 \le CK\cdot\mathbb{E}[\eta_i(m^*(X_i))^2]/n = O(K/n)$. The Markov inequality implies $\|T_2\|_2 = O_p(K^{1/2}/n^{1/2})$. For the third term $T_3$, after some algebra, we can show that
$$\|T_3\|_2 \le C\Big(\sup_{x\in\mathcal{X}}|K(x)-\alpha_1^{*\top}h_1(x)| + \sup_{x\in\mathcal{X}}|L(x)-\alpha_2^{*\top}h_2(x)|\Big) = O(K^{-r_h}).$$
Combining the $L_2$ error bounds for $T_{1\beta}$, $T_2$ and $T_3$, we obtain the last result. This completes the whole proof.

Lemma D.6. Under the conditions in Theorem 4.1, it holds that $\|\hat\beta - \beta^*\|_2 = o_p(1)$.

Proof of Lemma D.6. Recall that $\beta^o$ is the minimizer of $Q(\beta)$. We decompose $Q(\hat\beta) - Q(\beta^o)$ as
$$Q(\hat\beta) - Q(\beta^o) = \underbrace{[Q(\hat\beta) - Q_n(\hat\beta)]}_{I} + \underbrace{[Q_n(\hat\beta) - Q_n(\beta^o)]}_{II} + \underbrace{[Q_n(\beta^o) - Q(\beta^o)]}_{III}. \tag{D.8}$$
In the following, we study the terms $I$, $II$ and $III$ one by one. For the term $I$, Lemma D.2 implies $|Q(\hat\beta) - Q_n(\hat\beta)| \le \sup_{\beta\in\Theta}|Q_n(\beta) - Q(\beta)| = o_p(1)$. This shows that $|I| = o_p(1)$, and the same argument yields $|III| = o_p(1)$. For the term $II$, by the definition of $\hat\beta$, it is easy to see that $II \le 0$. Thus, combining with (D.8), we have, for any constant $\eta > 0$ to be chosen later, $Q(\hat\beta) - Q(\beta^o) < \eta$ with probability tending to one. For any $\varepsilon > 0$, define $E_\varepsilon = \Theta\cap\{\|\beta-\beta^o\|_2 \ge \varepsilon\}$. By the uniqueness of $\beta^o$, for any $\beta\in E_\varepsilon$ we have $Q(\beta) > Q(\beta^o)$. Since $E_\varepsilon$ is a compact set, we have $\inf_{\beta\in E_\varepsilon}Q(\beta) > Q(\beta^o)$. This implies that for any $\varepsilon > 0$ there exists $\eta' > 0$ such that $Q(\beta) > Q(\beta^o) + \eta'$ for any $\beta\in E_\varepsilon$. If $\hat\beta\in E_\varepsilon$, then $Q(\beta^o) + \eta > Q(\hat\beta) > Q(\beta^o) + \eta'$ with probability tending to one. Clearly, this cannot hold if we take $\eta < \eta'$. Thus, we have proved that $\hat\beta\notin E_\varepsilon$, that is, $\|\hat\beta-\beta^o\|_2 \le \varepsilon$ for any $\varepsilon > 0$. Hence $\|\hat\beta-\beta^o\|_2 = o_p(1)$.

Next, we shall show that $\|\beta^o - \beta^*\|_2 = o(1)$. Together, these lead to the desired consistency result
$$\|\hat\beta - \beta^*\|_2 \le \|\beta^o - \beta^*\|_2 + \|\hat\beta - \beta^o\|_2 = o_p(1).$$
To show $\|\beta^o - \beta^*\|_2 = o(1)$, we use a similar strategy; that is, we want to show that for any constant $\eta > 0$, $Q(\beta^*) - Q(\beta^o) < \eta$. In the following, we prove that $Q(\beta^*) = O(K^{1-2r_b})$. Note that
$$Q(\beta^*) \le C^2K^{-2r_b}\sum_{j=1}^K\mathbb{E}|h_j(X)|^2 = O(K^{1-2r_b}),$$
where the first inequality follows from the Cauchy-Schwarz inequality and the last step uses the assumption that $\sup_{x\in\mathcal{X}}\|h(x)\|_2 = O(K^{1/2})$. In addition, it holds that $Q(\beta^o) \le Q(\beta^*) = O(K^{1-2r_b})$. As $K\to\infty$, this yields $Q(\beta^*) - Q(\beta^o) < \eta$ for any constant $\eta > 0$. The same arguments as above then yield $\|\beta^o - \beta^*\|_2 = o(1)$. This completes the proof of the consistency result.

Lemma D.7. Under the conditions in Theorem 4.1, there exists a global minimizer $\hat\beta$ of $Q_n(\beta)$ (if $Q_n(\beta)$ has multiple minimizers) such that
$$\|\hat\beta - \beta^*\|_2 = O_p(K^{1/2}/n^{1/2} + K^{-r_b}). \tag{D.9}$$

Proof of Lemma D.7. We first prove that there exists a local minimizer $\hat\Delta$ of $Q_n(\beta^*+\Delta)$ such that $\hat\Delta\in\mathcal{C}$, where $\mathcal{C} = \{\Delta\in\mathbb{R}^K : \|\Delta\|_2 \le r\}$ and $r = C(K^{1/2}/n^{1/2} + K^{-r_b})$ for some sufficiently large constant $C$. To this end, it suffices to show that
$$\mathbb{P}\Big(\inf_{\Delta\in\partial\mathcal{C}}Q_n(\beta^*+\Delta) - Q_n(\beta^*) > 0\Big) \to 1, \quad \text{as } n\to\infty, \tag{D.10}$$
where $\partial\mathcal{C} = \{\Delta\in\mathbb{R}^K : \|\Delta\|_2 = r\}$. Applying the mean value theorem to each component of $g_{\beta^*+\Delta}(T,X)$,
$$g_{\beta^*+\Delta}(T,X) = g_{\beta^*}(T,X) + \bar G\Delta,$$
where $\bar G = \partial g_{\bar\beta}(T,X)/\partial\beta$ and, for notational simplicity, we assume there exists a common $\bar\beta = v\beta^* + (1-v)\beta$ for some $0\le v\le 1$ lying between $\beta^*$ and $\beta^*+\Delta$ (rigorously speaking, we need a different $\bar\beta$ for each component of $g_{\beta^*+\Delta}(T,X)$). Thus, for any $\Delta\in\partial\mathcal{C}$,
$$Q_n(\beta^*+\Delta) - Q_n(\beta^*) = 2g_{\beta^*}(T,X)^\top\bar G\Delta + \Delta^\top(\bar G^\top\bar G)\Delta \ge -2\|g_{\beta^*}(T,X)\|_2\cdot\|\bar G\|_2\cdot\|\Delta\|_2 + \|\Delta\|_2^2\cdot\lambda_{\min}(\bar G^\top\bar G) \ge -C(K^{1/2}/n^{1/2} + K^{-r_b})\cdot r + C'\cdot r^2, \tag{D.11}$$
for some constants $C, C' > 0$. In the last step, we first use the result that $\|g_{\beta^*}(T,X)\|_2 = O_p(K^{1/2}/n^{1/2} + K^{-r_b})$, which is derived by combining Lemma D.5 with arguments similar to (D.14) in the proof of Lemma D.8. In addition, $\|\bar G\|_2 \le \|\bar G - G^*\|_2 + \|G^*\|_2 \le C$, since $\|G^*\|_2$ is bounded by a constant and $\|\bar G - G^*\|_2 = o_p(1)$ by Lemma D.5. By the Weyl inequality and Lemma D.5,
$$\lambda_{\min}(\bar G^\top\bar G) \ge \lambda_{\min}(G^{*\top}G^*) - \|\bar G^\top\bar G - G^{*\top}G^*\|_2 \ge C - \|\bar G - G^*\|_2\cdot\|\bar G\|_2 - \|\bar G - G^*\|_2\cdot\|G^*\|_2 \ge C/2,$$
for $n$ sufficiently large. By (D.11), if $r = C(K^{1/2}/n^{1/2} + K^{-r_b})$ for some sufficiently large constant $C$, the right-hand side is positive for $n$ large enough. This establishes (D.10). Next, we show that $\hat\beta = \beta^* + \hat\Delta$ is a global minimizer of $Q_n(\beta)$. This is true because the first order condition implies
$$\Big(\frac{\partial g_{\hat\beta}(T,X)}{\partial\beta}\Big)^\top g_{\hat\beta}(T,X) = 0 \implies g_{\hat\beta}(T,X) = 0,$$
provided $\partial g_{\hat\beta}(T,X)/\partial\beta$ is invertible. Following similar arguments based on the Weyl inequality, $\partial g_{\hat\beta}(T,X)/\partial\beta$ is invertible with probability tending to one. Since $g_{\hat\beta}(T,X) = 0$, it implies $Q_n(\hat\beta) = 0$. Noting that $Q_n(\beta) \ge 0$ for any $\beta$, we conclude that $\hat\beta$ is indeed a global minimizer of $Q_n(\beta)$.

Lemma D.8. Under the conditions in Theorem 4.1, $\hat\beta$ satisfies the asymptotic expansion
$$\hat\beta - \beta^* = -G^{*-1}\bar U + \Delta_n, \tag{D.12}$$
where $\bar U = \frac{1}{n}\sum_{i=1}^n U_i$, $U_i = (U_{i1}^\top, U_{i2}^\top)^\top$, with
$$U_{i1} = \Big(\frac{T_i}{\pi_i^*} - \frac{1-T_i}{1-\pi_i^*}\Big)h_1(X_i), \qquad U_{i2} = \Big(\frac{T_i}{\pi_i^*} - 1\Big)h_2(X_i),$$
and
$$\|\Delta_n\|_2 = O_p\Big(K^{1/2}\Big(\frac{K^{1/2}}{n^{1/2}} + \frac{1}{K^{r_b}}\Big)^2 + \sqrt{\frac{K\log K}{n}}\Big(\frac{K^{1/2}}{n^{1/2}} + \frac{1}{K^{r_b}}\Big)\Big).$$

Proof of Lemma D.8. Similar to the proof of Lemma D.7, we apply the mean value theorem to each component of $g_{\hat\beta}(T,X)$:
$$g_{\beta^*}(T,X) + \Big(\frac{\partial g_{\bar\beta}(T,X)}{\partial\beta}\Big)(\hat\beta - \beta^*) = 0,$$
where, for notational simplicity, we assume there exists a common $\bar\beta = v\beta^* + (1-v)\hat\beta$ for some $0\le v\le 1$ lying between $\beta^*$ and $\hat\beta$. After rearrangement, we derive
$$\hat\beta - \beta^* = -G^{*-1}g_{\beta^*}(T,X) + \Big[G^{*-1} - \Big(\frac{\partial g_{\bar\beta}(T,X)}{\partial\beta}\Big)^{-1}\Big]g_{\beta^*}(T,X) = -G^{*-1}\bar U + \Delta_{n1} + \Delta_{n2} + \Delta_{n3}, \tag{D.13}$$
where
$$\Delta_{n1} = G^{*-1}[\bar U - g_{\beta^*}(T,X)], \qquad \Delta_{n2} = \Big[G^{*-1} - \Big(\frac{\partial g_{\bar\beta}(T,X)}{\partial\beta}\Big)^{-1}\Big]\bar U,$$
and
$$\Delta_{n3} = \Big[G^{*-1} - \Big(\frac{\partial g_{\bar\beta}(T,X)}{\partial\beta}\Big)^{-1}\Big]\cdot[g_{\beta^*}(T,X) - \bar U].$$
We first consider $\Delta_{n1}$ in (D.13). Let $\xi = (\xi_1,\dots,\xi_n)^\top$, where, with slight abuse of notation (one such vector for each block of coordinates of $g$), for the coordinates corresponding to $h_1$,
$$\xi_i = T_i\Big(\frac{1}{\pi_i^*} - \frac{1}{J_i^*}\Big) - (1-T_i)\Big(\frac{1}{1-\pi_i^*} - \frac{1}{1-J_i^*}\Big),$$
and, for the coordinates corresponding to $h_2$,
$$\xi_i = T_i\Big(\frac{1}{\pi_i^*} - \frac{1}{J_i^*}\Big).$$
Let $H = (h(X_1),\dots,h(X_n))^\top$ be the $n\times K$ matrix. Then, for some constants $C, C'>0$,
$$\|\Delta_{n1}\|_2^2 = n^{-2}\xi^\top HG^{*-1}G^{*-1}H^\top\xi \le n^{-2}\|\xi\|_2^2\cdot\|HG^{*-1}G^{*-1}H^\top\|_2 \le Cn^{-1}\|\xi\|_2^2\cdot\|H^\top H/n\|_2 \le C'n^{-1}\|\xi\|_2^2, \tag{D.14}$$
where the third step follows from the fact that $\|G^{*-1}\|_2$ is bounded, and the last step follows from Lemma D.4 together with the boundedness of the maximum eigenvalue of $\mathbb{E}[h(X_i)h(X_i)^\top]$. Since $|\partial J(v)/\partial v|$ is bounded above by a constant for any $v \le \sup_{x\in\mathcal{X}}|m^*(x)|$, there exist constants $C, C'>0$ such that, for the $h_2$ coordinates,
$$|\xi_i| \le C|\pi_i^* - J_i^*| \le C'\sup_{x\in\mathcal{X}}|m^*(x) - \beta^{*\top}B(x)| \le C'K^{-r_b}.$$
Similarly, $|\xi_i| \le 2C'K^{-r_b}$ for the $h_1$ coordinates. Thus, $n^{-1}\|\xi\|_2^2 = O_p(K^{-2r_b})$. Combining with (D.14), we conclude that $\|\Delta_{n1}\|_2 = O_p(K^{-r_b})$.

Next, we consider $\Delta_{n2}$. Since $\|G^{*-1}\|_2$ is bounded, we have
$$\|\Delta_{n2}\|_2 \le \|G^{*-1}\|_2\cdot\Big\|\Big(\frac{\partial g_{\bar\beta}(T,X)}{\partial\beta}\Big)^{-1}\Big\|_2\cdot\Big\|G^* - \frac{\partial g_{\bar\beta}(T,X)}{\partial\beta}\Big\|_2\cdot\|\bar U\|_2 \le C\Big(\|\hat\beta-\beta^*\|_2K^{1/2} + \sqrt{\frac{K\log K}{n}}\Big)\cdot\sqrt{\frac{K}{n}},$$
where the last step follows from Lemma D.5.

Finally, we consider $\Delta_{n3}$. By the same arguments used to control $\Delta_{n1}$ and $\Delta_{n2}$, we can prove that
$$\|\Delta_{n3}\|_2 \le \Big\|G^{*-1} - \Big(\frac{\partial g_{\bar\beta}(T,X)}{\partial\beta}\Big)^{-1}\Big\|_2\cdot\|g_{\beta^*}(T,X) - \bar U\|_2 \le C\Big(\|\hat\beta-\beta^*\|_2K^{1/2} + \sqrt{\frac{K\log K}{n}}\Big)\cdot K^{-r_b}.$$
Combining the rates of $\|\Delta_{n1}\|_2$, $\|\Delta_{n2}\|_2$ and $\|\Delta_{n3}\|_2$ with (D.13), by Lemma D.5 we obtain
$$\|\hat\beta - \beta^*\|_2 \le \|G^{*-1}g_{\beta^*}(T,X)\|_2 + \|\Delta_{n1}\|_2 + \|\Delta_{n2}\|_2 + \|\Delta_{n3}\|_2 \le C\Big(\frac{K^{1/2}}{n^{1/2}} + \frac{1}{K^{r_b}}\Big) + C'\Big(\|\hat\beta-\beta^*\|_2K^{1/2} + \sqrt{\frac{K\log K}{n}}\Big)\cdot\Big(\frac{K^{1/2}}{n^{1/2}} + \frac{1}{K^{r_b}}\Big),$$
for some constants $C, C'>0$. Therefore, (D.12) holds with $\Delta_n = \Delta_{n1} + \Delta_{n2} + \Delta_{n3}$, where
$$\|\Delta_n\|_2 = O_p\Big(K^{1/2}\Big(\frac{K^{1/2}}{n^{1/2}} + \frac{1}{K^{r_b}}\Big)^2 + \sqrt{\frac{K\log K}{n}}\Big(\frac{K^{1/2}}{n^{1/2}} + \frac{1}{K^{r_b}}\Big)\Big).$$
This completes the proof.

Proof of Theorem 4.1. We consider the following decomposition of $\hat\mu_{\hat\beta} - \mu$:
$$\hat\mu_{\hat\beta} - \mu = \frac{1}{n}\sum_{i=1}^n\Big[\frac{T_i(Y_i(1)-K(X_i)-L(X_i))}{\hat J_i} - \frac{(1-T_i)(Y_i(0)-K(X_i))}{1-\hat J_i}\Big] + \frac{1}{n}\sum_{i=1}^n\Big(\frac{T_i}{\hat J_i} - \frac{1-T_i}{1-\hat J_i}\Big)K(X_i) + \frac{1}{n}\sum_{i=1}^n\Big(\frac{T_i}{\hat J_i} - 1\Big)L(X_i) + \frac{1}{n}\sum_{i=1}^n L(X_i) - \mu$$
$$= \frac{1}{n}\sum_{i=1}^n\Big[\frac{T_i(Y_i(1)-K(X_i)-L(X_i))}{\hat J_i} - \frac{(1-T_i)(Y_i(0)-K(X_i))}{1-\hat J_i}\Big] + \frac{1}{n}\sum_{i=1}^n\Big(\frac{T_i}{\hat J_i} - \frac{1-T_i}{1-\hat J_i}\Big)\Delta_K(X_i) + \frac{1}{n}\sum_{i=1}^n\Big(\frac{T_i}{\hat J_i} - 1\Big)\Delta_L(X_i) + \frac{1}{n}\sum_{i=1}^n L(X_i) - \mu,$$
where $\hat J_i = J(\hat\beta^\top B(X_i))$, $\Delta_K(X_i) = K(X_i) - \alpha_1^{*\top}h_1(X_i)$ and $\Delta_L(X_i) = L(X_i) - \alpha_2^{*\top}h_2(X_i)$. Here, the second equality holds by the definition of $\hat\beta$: the empirical balancing equations $g_{\hat\beta}(T,X) = 0$ annihilate the components $\alpha_1^{*\top}h_1$ and $\alpha_2^{*\top}h_2$. Thus, we have
$$\hat\mu_{\hat\beta} - \mu = \frac{1}{n}\sum_{i=1}^n S_i + R_0 + R_1 + R_2 + R_3,$$
where
$$S_i = \frac{T_i}{\pi_i^*}\big[Y_i(1)-K(X_i)-L(X_i)\big] - \frac{1-T_i}{1-\pi_i^*}\big[Y_i(0)-K(X_i)\big] + L(X_i) - \mu,$$
$$R_0 = \frac{1}{n}\sum_{i=1}^n\frac{T_i(Y_i(1)-K(X_i)-L(X_i))}{\hat J_i\pi_i^*}(\pi_i^* - \hat J_i), \qquad R_1 = \frac{1}{n}\sum_{i=1}^n\frac{(1-T_i)(Y_i(0)-K(X_i))}{(1-\hat J_i)(1-\pi_i^*)}(\pi_i^* - \hat J_i),$$
$$R_2 = \frac{1}{n}\sum_{i=1}^n\Big(\frac{T_i}{\hat J_i} - \frac{1-T_i}{1-\hat J_i}\Big)\Delta_K(X_i), \qquad R_3 = \frac{1}{n}\sum_{i=1}^n\Big(\frac{T_i}{\hat J_i} - 1\Big)\Delta_L(X_i).$$
In the following, we show that $R_j = o_p(n^{-1/2})$ for $0\le j\le 3$. The asymptotic normality of $n^{1/2}(\hat\mu_{\hat\beta}-\mu)$ then follows from the above decomposition. In addition, $S_i$ agrees with the efficient score function for estimating $\mu$ (Hahn, 1998). Thus, the proposed estimator $\hat\mu_{\hat\beta}$ is also semiparametrically efficient.

We first focus on $R_0$. Consider the empirical process $\mathbb{G}_n(f_0) = n^{1/2}(\mathbb{P}_n - P)f_0(T, Y(1), X)$, where $\mathbb{P}_n$ stands for the empirical measure and $P$ for the expectation, and
$$f_0(T, Y(1), X) = \frac{T(Y(1)-K(X)-L(X))}{J(m(X))\pi^*(X)}\big[\pi^*(X) - J(m(X))\big].$$
By Lemma D.7, we can easily show that
$$\sup_{x\in\mathcal{X}}|J(\hat\beta^\top B(x)) - \pi^*(x)| \lesssim \sup_{x\in\mathcal{X}}|\hat\beta^\top B(x) - \beta^{*\top}B(x)| + \sup_{x\in\mathcal{X}}|m^*(x) - \beta^{*\top}B(x)| = O_p(K/n^{1/2} + K^{1/2-r_b}) = o_p(1).$$
For notational simplicity, we denote $\|f\|_\infty = \sup_{x\in\mathcal{X}}|f(x)|$. Define the set of functions $\mathcal{F} = \{f_0 : \|m - m^*\|_\infty \le \delta\}$, where $\delta = C(K/n^{1/2} + K^{1/2-r_b})$ for some constant $C>0$. By the strong ignorability of the treatment assignment, we have $Pf_0(T, Y(1), X) = 0$. By the Markov inequality and the maximal inequality in Corollary 19.35 of Van der Vaart (2000),
$$n^{1/2}R_0 \le \sup_{f_0\in\mathcal{F}}\mathbb{G}_n(f_0) \lesssim \mathbb{E}\sup_{f_0\in\mathcal{F}}\mathbb{G}_n(f_0) \lesssim J_{[\,]}(\|F_0\|_{P,2}, \mathcal{F}, L_2(P)),$$
where $J_{[\,]}(\|F_0\|_{P,2}, \mathcal{F}, L_2(P))$ is the bracketing integral and $F_0$ is the envelope function. Since $J$ is bounded away from 0, we have $|f_0(T, Y(1), X)| \lesssim \delta|Y(1)-K(X)-L(X)| := F_0$. Then $\|F_0\|_{P,2} \le \delta\{\mathbb{E}|Y(1)|^2\}^{1/2} \lesssim \delta$. Next, we consider $N_{[\,]}(\varepsilon, \mathcal{F}, L_2(P))$. Define $\mathcal{F}_0 = \{f_0 : \|m - m^*\|_\infty \le C\}$ for some constant $C>0$. It is easily seen that $\log N_{[\,]}(\varepsilon, \mathcal{F}, L_2(P)) \lesssim \log N_{[\,]}(\varepsilon, \mathcal{F}_0\delta, L_2(P)) = \log N_{[\,]}(\varepsilon/\delta, \mathcal{F}_0, L_2(P)) \lesssim \log N_{[\,]}(\varepsilon/\delta, \mathcal{M}, L_2(P)) \lesssim (\delta/\varepsilon)^{1/k_1}$, where we use the fact that $J$ is bounded away from 0 and is Lipschitz; the last step follows from the assumption on the bracketing number of $\mathcal{M}$. Then
$$J_{[\,]}(\|F_0\|_{P,2}, \mathcal{F}, L_2(P)) \lesssim \int_0^\delta\sqrt{\log N_{[\,]}(\varepsilon, \mathcal{F}, L_2(P))}\,d\varepsilon \lesssim \int_0^\delta(\delta/\varepsilon)^{1/(2k_1)}\,d\varepsilon,$$
which goes to 0 as $\delta\to 0$, because $2k_1 > 1$ by assumption and thus the integral converges. This shows that $n^{1/2}R_0 = o_p(1)$. By a similar argument, we can show that $n^{1/2}R_1 = o_p(1)$.

Next, we consider $R_2$. Define the empirical process $\mathbb{G}_n(f_2) = n^{1/2}(\mathbb{P}_n - P)f_2(T, X)$, where
$$f_2(T, X) = \frac{T - J(m(X))}{J(m(X))(1-J(m(X)))}\Delta(X),$$
and $R_2$ corresponds to $\Delta = \Delta_K$. By the assumption on the approximation property of the basis functions, we have $\|\Delta_K\|_\infty \lesssim K^{-r_h}$. In addition,
$$\|J(\hat\beta^\top B(X)) - \pi^*(X)\|_{P,2} \le \|J(\hat\beta^\top B(X)) - J(\beta^{*\top}B(X))\|_{P,2} + \|J(\beta^{*\top}B(X)) - \pi^*(X)\|_{P,2} \lesssim \|\hat\beta^\top B(X) - \beta^{*\top}B(X)\|_{P,2} + \sup_{x\in\mathcal{X}}|m^*(x) - \beta^{*\top}B(x)| = O_p(K^{1/2}/n^{1/2} + K^{-r_b}),$$
where the last step follows from Lemma D.7. Define the set of functions $\mathcal{F} = \{f_2 : \|m - m^*\|_{P,2} \le \delta_1,\ \|\Delta\|_\infty \le \delta_2\}$, where $\delta_1 = C(K^{1/2}/n^{1/2} + K^{-r_b})$ and $\delta_2 = CK^{-r_h}$ for some constant $C>0$. Thus,
$$n^{1/2}R_2 \le \sup_{f_2\in\mathcal{F}}\mathbb{G}_n(f_2) + n^{1/2}\sup_{f_2\in\mathcal{F}}Pf_2.$$
We first consider the second term $n^{1/2}\sup_{f_2\in\mathcal{F}}Pf_2$. Let $\mathcal{G}_1 = \{m\in\mathcal{M} : \|m - m^*\|_{P,2} \le \delta_1\}$ and $\mathcal{G}_2 = \{\Delta\in\mathcal{H}-\alpha_1^{*\top}h_1 : \|\Delta\|_\infty \le \delta_2\}$. By the definition of the propensity score and the Cauchy-Schwarz inequality,
$$n^{1/2}\sup_{f_2\in\mathcal{F}}Pf_2 = n^{1/2}\sup_{m\in\mathcal{G}_1,\Delta\in\mathcal{G}_2}\mathbb{E}\Big[\frac{\pi^*(X) - J(m(X))}{J(m(X))(1-J(m(X)))}\Delta(X)\Big] \lesssim n^{1/2}\sup_{m\in\mathcal{G}_1}\|\pi^* - J(m)\|_{P,2}\sup_{\Delta\in\mathcal{G}_2}\|\Delta\|_{P,2} \lesssim n^{1/2}\delta_1\delta_2 \lesssim n^{1/2}(K^{1/2}/n^{1/2} + K^{-r_b})K^{-r_h} = o(1),$$
where the last step follows from $r_h > 1/2$ and the scaling assumption $n^{1/2} \lesssim K^{r_b+r_h}$ in this theorem. Next, we control the maximum of the empirical process $\sup_{f_2\in\mathcal{F}}\mathbb{G}_n(f_2)$. Following the same argument as for $R_0$, we only need to upper bound the bracketing integral $J_{[\,]}(\|F_2\|_{P,2}, \mathcal{F}, L_2(P))$. Since $J$ is bounded away from 0 and 1, we can take the envelope function $F_2 := C\delta_2$ for some constant $C>0$, so that $\|F_2\|_{P,2} \lesssim \delta_2$. Define $\mathcal{F}_0 = \{f_2 : \|m - m^*\|_{P,2} \le C,\ \|\Delta\|_{P,2} \le 1\}$ for some constant $C>0$, $\mathcal{G}_{10} = \{m\in\mathcal{M}+m^* : \|m\|_{P,2} \le C\}$ and $\mathcal{G}_{20} = \{\Delta\in\mathcal{H}-\alpha_1^{*\top}h_1 : \|\Delta\|_{P,2} \le 1\}$. Similarly, we have
$$\log N_{[\,]}(\varepsilon, \mathcal{F}, L_2(P)) \lesssim \log N_{[\,]}(\varepsilon/\delta_2, \mathcal{F}_0, L_2(P)) \lesssim \log N_{[\,]}(\varepsilon/\delta_2, \mathcal{G}_{10}, L_2(P)) + \log N_{[\,]}(\varepsilon/\delta_2, \mathcal{G}_{20}, L_2(P)) \lesssim \log N_{[\,]}(\varepsilon/\delta_2, \mathcal{M}, L_2(P)) + \log N_{[\,]}(\varepsilon/\delta_2, \mathcal{H}, L_2(P)) \lesssim (\delta_2/\varepsilon)^{1/k_1} + (\delta_2/\varepsilon)^{1/k_2},$$
where the second step follows from the boundedness assumption on $J$ and its Lipschitz property, the third step is due to $\mathcal{G}_{10}-m^* \subset \mathcal{M}$ and $\mathcal{G}_{20}+\alpha_1^{*\top}h_1 \subset \mathcal{H}$, and the last step is by the bracketing number condition in our assumption. Since $2k_1 > 1$ and $2k_2 > 1$, it is easily seen that the bracketing integral $J_{[\,]}(\|F_2\|_{P,2}, \mathcal{F}, L_2(P)) = o(1)$. This shows that $\sup_{f_2\in\mathcal{F}}\mathbb{G}_n(f_2) = o_p(1)$. Thus, we conclude that $n^{1/2}R_2 = o_p(1)$. By a similar argument, we can show that $n^{1/2}R_3 = o_p(1)$. This completes the whole proof.
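Finally, to make the estimator analyzed in Theorem 4.1 concrete, here is a minimal end-to-end sketch (ours, not the authors' implementation): it takes $J$ to be the logistic link, uses a single polynomial sieve for both $B$ and $h_1 = h_2$, solves the empirical balancing equations by minimizing $Q_n(\beta)$, and plugs the fitted propensity scores into the weighting estimator. The data-generating process and all tuning choices are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# End-to-end sketch (ours) of the sieve estimator: drive the empirical
# balancing moments to zero by minimizing Q_n(beta) = ||g_beta||^2, then
# form the IPTW estimator with the fitted weights. h1 = h2 = B here.

rng = np.random.default_rng(5)
n, K = 5000, 4

X = rng.uniform(size=n)
pi_true = 1.0 / (1.0 + np.exp(-(0.5 - X)))               # true propensity
T = rng.binomial(1, pi_true)
Y = np.sin(2 * X) + T * (1.0 + X) + rng.normal(size=n)   # effect tau(x) = 1 + x
mu_true = 1.5                                             # E[1 + X], X ~ U(0,1)

B = np.column_stack([X ** j for j in range(K)])           # sieve basis B(x)

def Qn(beta):
    J = np.clip(1.0 / (1.0 + np.exp(-B @ beta)), 1e-6, 1 - 1e-6)
    g1 = ((T / J - (1 - T) / (1 - J))[:, None] * B).mean(axis=0)
    g2 = ((T / J - 1.0)[:, None] * B).mean(axis=0)
    return g1 @ g1 + g2 @ g2                              # Q_n(beta) = ||g||^2

beta = minimize(Qn, x0=np.zeros(K), method="BFGS").x
J = np.clip(1.0 / (1.0 + np.exp(-B @ beta)), 1e-6, 1 - 1e-6)
mu_hat = np.mean(T * Y / J - (1 - T) * Y / (1 - J))
print(f"mu_hat = {mu_hat:.3f}  (target {mu_true})")
```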
