Improving Covariate Balancing Propensity Score:
A Doubly Robust and Efficient Approach ∗
Jianqing Fan∗‡  Kosuke Imai†‡  Han Liu∗‡  Yang Ning∗  Xiaolin Yang∗†
Princeton University
∗Department of Operations Research and Financial Engineering
†Department of Politics
‡Center for Statistics and Machine Learning
June 14, 2016
Abstract
Inverse probability of treatment weighting (IPTW) is a popular method for estimating causal
effects in many disciplines. However, empirical studies show that the IPTW estimators can be
sensitive to the misspecification of the propensity score model. To address this problem, several
researchers have proposed new methods that estimate the propensity score by directly optimizing the
balance of pre-treatment covariates. While these methods appear to empirically perform well,
little is known about their theoretical properties. This paper makes two main contributions.
First, we conduct a theoretical investigation of one such methodology, the Covariate Balancing
Propensity Score (CBPS) recently proposed by Imai and Ratkovic (2014). We characterize
the asymptotic bias and efficiency of the CBPS-based IPTW estimator under both arbitrary
and local model misspecification as well as correct specification for general balancing functions.
Based on this finding, we address an open problem in the literature on how to optimally choose
the covariate balancing function for the CBPS methodology. Second, motivated by the form
of the optimal covariate balancing function, we further propose a new IPTW estimator by
generalizing the CBPS method. We prove that the proposed estimator is consistent if either the
propensity score model or the outcome model is correct. In addition to this double robustness
property, we also establish that the proposed estimator is semiparametrically efficient when
both the propensity score and outcome models are correctly specified. Unlike the standard
doubly robust estimators, however, the proposed methodology does not require the estimation
of the outcome model. To relax the parametric assumptions on the propensity score model and the
outcome model, we further consider a sieve estimation approach to estimate the treatment effect.
A new “nonparametric double robustness” phenomenon is observed. Our simulations show that
the proposed estimator has better finite sample properties than the standard estimators.
Key words: Average treatment effect, causal inference, double robustness, model misspecifi-
cation, semiparametric efficiency, sieve estimation
∗Supported by NSF grants DMS-1206464 and DMS-1406266 and NIH grants R01-GM072611-12 and R01-
GM100474-04.
1 Introduction
Suppose that we have a random sample of n units from a population of interest. For each unit i,
we observe (Ti, Yi,Xi), where Xi ∈ Rd is a d-dimensional vector of pre-treatment covariates, Ti is a
binary treatment variable, and Yi is an outcome variable. In particular, Ti equals 1 if unit i receives
the treatment and equals 0 if unit i belongs to the control group. The observed outcome can
be written as Yi = Yi(1)Ti + Yi(0)(1− Ti), where Yi(1) and Yi(0) are the potential outcomes under
the treatment and control conditions, respectively. This notation implicitly requires the stable unit
treatment value assumption (Rubin, 1990). In addition, throughout this paper, we also assume the
strong ignorability of the treatment assignment (Rosenbaum and Rubin, 1983):
Yi(1), Yi(0) ⊥⊥ Ti |Xi and 0 < P(Ti = 1 |Xi) < 1. (1.1)
Furthermore, without loss of generality, we assume that the conditional expectation functions
of potential outcomes can be written as,
E(Yi(0) |Xi) = K(Xi) and E(Yi(1) |Xi) = K(Xi) + L(Xi) (1.2)
for some functions K(·) and L(·), which represent the conditional mean of the potential outcome
under the control condition and the conditional average treatment effect, respectively. Under this
setting, we are interested in estimating the average treatment effect (ATE),
µ = E(Yi(1)− Yi(0)) = E(L(Xi)). (1.3)
The propensity score is defined as the conditional probability of treatment assignment (Rosen-
baum and Rubin, 1983),
π(Xi) = P(Ti = 1 |Xi). (1.4)
In practice, since Xi can be high dimensional, the propensity score is usually parameterized by a
model πβ(Xi) where β is a q-dimensional vector of parameters. A popular choice is the logistic
regression model, i.e., πβ(Xi) = exp(X⊤i β)/{1 + exp(X⊤i β)}. Once the parameter β is estimated
(e.g., by the maximum likelihood estimator β̂), the Horvitz–Thompson estimator (Horvitz and
Thompson, 1952), which is based on the inverse probability of treatment weighting (IPTW), can
be used to obtain an estimate of the ATE (Robins et al., 1994),

$$\hat\mu_{\hat\beta} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{T_i Y_i}{\pi_{\hat\beta}(X_i)} - \frac{(1-T_i)Y_i}{1-\pi_{\hat\beta}(X_i)}\right). \tag{1.5}$$
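As a concrete illustration, the following sketch computes the IPTW estimator in (1.5) on simulated data; the data-generating process (a logistic propensity score and a linear outcome with a constant treatment effect of 2) is purely illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 2
X = rng.normal(size=(n, d))

# Illustrative logistic propensity score and potential outcomes.
beta_true = np.array([0.5, -0.5])
pi = 1.0 / (1.0 + np.exp(-X @ beta_true))
T = rng.binomial(1, pi)
Y = X @ np.array([1.0, 1.0]) + 2.0 * T + rng.normal(size=n)  # true ATE = 2

def iptw_ate(Y, T, pi_hat):
    """Horvitz-Thompson IPTW estimator of the ATE, as in equation (1.5)."""
    return np.mean(T * Y / pi_hat - (1 - T) * Y / (1 - pi_hat))

ate_hat = iptw_ate(Y, T, pi)  # here we plug in the true propensity score
```

With the true propensity score plugged in, ate_hat is close to 2; the results that follow concern what happens when πβ(Xi) must be estimated, possibly under misspecification.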
Despite its popularity, researchers have found that estimators based on IPTW are particularly sensitive to the misspecification of the propensity score model (e.g., Kang and Schafer, 2007).
To overcome this problem, several researchers have recently proposed to estimate the propensity
score by optimizing covariate balance rather than maximizing the accuracy of predicting treatment
assignment (e.g., Tan, 2010; Hainmueller, 2012; Graham et al., 2012; Imai and Ratkovic, 2014;
Chan et al., 2015). In this paper, we focus on the Covariate Balancing Propensity Score (CBPS)
methodology proposed by Imai and Ratkovic (2014). In spite of its simplicity, several scholars
independently found that the CBPS performs well in practice (e.g., Wyss et al., 2014; Frölich et al.,
2015). The method can also be extended for the analysis of longitudinal data (Imai and Ratkovic,
2015) and general treatment regimes (Fong et al., 2015). In this paper, we conduct a theoretical
investigation of the CBPS. Given the similarity between the CBPS and some other methods, our
theoretical analysis may also provide new insights for understanding these related methods.
The CBPS method estimates the parameters of the propensity score model, β, by solving the
following m-dimensional estimating equation,
$$g_\beta(T,X) = \frac{1}{n}\sum_{i=1}^{n} g_\beta(T_i,X_i) = 0, \quad \text{where } g_\beta(T_i,X_i) = \left(\frac{T_i}{\pi_\beta(X_i)} - \frac{1-T_i}{1-\pi_\beta(X_i)}\right) f(X_i), \tag{1.6}$$
for some covariate balancing function f(·) : Rd → Rm, if the number of equations m is equal to the
number of parameters q. Imai and Ratkovic (2014) point out that the common practice of fitting
a logistic model is equivalent to balancing the score function with f(Xi) = π′β(Xi) = ∂πβ(Xi)/∂β.
They find that choosing f(Xi) = Xi, which balances the first moment between the treatment and
control groups, significantly reduces the bias of the estimated ATE. Some researchers also choose
f(Xi) = (X⊤i , (X²i)⊤)⊤ in their applications, where X²i denotes the element-wise square of Xi. This guarantees that the treatment and control groups have
an identical sample mean of f(Xi) after weighting by the estimated propensity score. If m > q,
then β can be estimated by optimizing the covariate balance by the generalized method of moments
(GMM) method (Hansen, 1982):
$$\hat\beta = \operatorname*{argmin}_{\beta\in\Theta}\; g_\beta(T,X)^{\top}\, W\, g_\beta(T,X), \tag{1.7}$$
where Θ is the parameter space for β in Rq and W is an (m×m) positive definite weighting matrix,
which we assume in this paper does not depend on β. Alternatively, the empirical likelihood method
can be used (Owen, 2001; Fong et al., 2015). Once the estimate of β is obtained, we can estimate
the ATE using the IPTW estimator in (1.5).
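For intuition, a just-identified CBPS fit (m = q) can be computed by root-finding on the sample moment condition (1.6). The sketch below uses f(X) = X and a logistic model without an intercept; the simulated design and the use of scipy.optimize.root in place of a full GMM routine are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(1)
n, d = 4000, 2
X = rng.normal(size=(n, d))
beta_true = np.array([0.8, -0.4])
T = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def balance_moments(beta):
    """Sample covariate balancing moments from (1.6) with f(X) = X."""
    pi = 1.0 / (1.0 + np.exp(-X @ beta))
    w = T / pi - (1 - T) / (1 - pi)
    return (w[:, None] * X).mean(axis=0)

sol = root(balance_moments, x0=np.zeros(d))
beta_cbps = sol.x  # at the root, weighted first moments of X are balanced
```

Because m = q here, the balance condition can be solved exactly; when m > q one minimizes the quadratic form (1.7) instead.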
The main idea of the CBPS and other related methods is to directly optimize the balance of
covariates between the treatment and control groups so that even when the propensity score model
is misspecified we still obtain a reasonable balance of the covariates between the treatment and
control groups. However, one open question remains in this literature: How shall we choose the
covariate balancing function f(Xi)? In particular, if the propensity score model is misspecified,
this problem becomes even more important. Although some researchers have proposed nonparametric propensity score estimators to alleviate this problem (e.g., Hirano et al., 2003; Chan
et al., 2015), these methods are potentially difficult to apply in practice even when the number of
pre-treatment covariates is moderate.
This paper makes two main contributions. First, we conduct a thorough theoretical study of
the CBPS-based IPTW estimator with the general balancing function f(·). We characterize the
asymptotic bias of the CBPS-based IPTW estimator under both arbitrary and local misspecification
of the propensity score model. We also study the efficiency of the IPTW estimator under correct
specification and local model misspecification. Based on these findings, we show how to optimally
choose the covariate balancing function f(Xi) for the CBPS methodology (Section 2). In particular,
we show that once the covariate balancing function is chosen in this way, the CBPS-based IPTW
estimator achieves a double robustness property: the estimator is consistent if either the propensity
score model or the outcome model is correct.
However, the optimal choice of f(Xi) requires the knowledge of the propensity score, which
is unknown. Thus, the application of the CBPS method with the optimal f(Xi) is limited in
practice. To address this issue, our second contribution is to propose a new IPTW estimator by
generalizing the CBPS method. We show that the IPTW estimator based on the improved CBPS
(iCBPS) method retains the double robustness property. We also show that the proposed estimator
is semiparametrically efficient when both the propensity score and outcome models are correctly
specified (Section 3). Different from the CBPS method with the optimal f(Xi), the proposed
iCBPS method does not require the knowledge of the propensity score and is easy to implement
in practice. In addition, unlike the standard doubly robust estimators (Robins et al., 1994), the
proposed iCBPS method does not require the estimation of outcome model. Our simulation study
shows that the proposed estimator outperforms the standard estimators (Section 3.2).
To relax the parametric assumptions on the propensity score model and the outcome model, we
further extend the proposed iCBPS method to the nonparametric/semiparametric settings, by using
a sieve estimation approach (Newey, 1997; Chen, 2007). In Section 4, we establish a unified semi-
parametric efficiency result for the IPTW estimator under many nonparametric/semiparametric
settings, including the fully nonparametric model, additive model and partially linear model. Our
result provides a more comprehensive theoretical framework than the existing nonparametric lit-
erature (e.g., Hirano et al., 2003; Chan et al., 2015), which usually assumes the propensity score
model is fully nonparametric and therefore suffers from the curse of dimensionality. In addition, our
theoretical results require weaker technical assumptions. For instance, in the fully nonparametric
setting, the theory in Hirano et al. (2003) and Chan et al. (2015) requires s/d > 7 and s/d > 13,
respectively, where s is the smoothness parameter of some function class and d = dim(Xi). In com-
parison, we only require s/d > 3/4, which is significantly weaker than the existing conditions. To
prove this result, we exploit the matrix Bernstein’s concentration inequalities (Tropp, 2015) and a
Bernstein-type concentration inequality for U-statistics (Arcones, 1995). Similar tools from recent
random matrix theory have been used by Hansen (2014); Chen and Christensen (2015); Belloni
et al. (2015) to study the optimal rate of convergence for sieve-based least squares estimation. However, unlike the sieve-based least squares estimator, our sieve estimator of the propensity score does
not have a closed form and this leads to extra technical challenges for the establishment of the con-
sistency and rate of convergence. Finally, in this nonparametric setting, we observe an interesting
phenomenon that our unified semiparametric efficiency result holds if either the propensity score
model or the outcome model is approximated reasonably well. This can be viewed as a nonpara-
metric version of the double robustness property. To the best of our knowledge, this phenomenon
does not appear in the existing literature. The last section provides concluding remarks.
2 Consequences of Model Misspecification
Our theoretical investigation starts by examining the consequences of model misspecification for the
CBPS-based IPTW estimator. For this, we first derive the asymptotic bias of the IPTW estimator
under local misspecification of the propensity score model and show that we can eliminate the
bias by carefully choosing the covariate balancing function f(Xi) such that it spans K(Xi) + (1 − πβo(Xi))L(Xi), where βo is the limiting value of β̂, i.e., β̂ →p βo. In other words, we want
to choose the covariate balancing function f(Xi) such that it spans the weighted average of the
two conditional mean functions of the potential outcomes, i.e., there exists an α ∈ Rm such that α⊤f(Xi) = πβo(Xi)E(Yi(0) | Xi) + (1 − πβo(Xi))E(Yi(1) | Xi). This result is further extended to
arbitrary model misspecification.
The above result implies that when balancing covariates, for any given unit we should give a
greater weight to the determinants of the potential outcome that is less likely to be realized. For
example, if a unit is less likely to be treated, then it is more important to balance the covariates
that influence the mean potential outcome under the treatment condition. In contrast, if a unit is
more likely to be assigned to the treatment group, then the covariates that determine the potential
outcome under the control condition become more important. We also show that even when the
propensity score is correctly specified, this choice of covariate balancing function is optimal, enabling
the resulting estimator to attain the semiparametric efficiency bound.
2.1 Bias under Model Misspecification
While researchers can avoid gross model misspecification through careful model fitting, in practice
it is often difficult to nail down the exact specification. The prominent simulation study of Kang
and Schafer (2007), for example, is designed to illustrate this phenomenon. We therefore consider
the consequences of local misspecification of propensity score model. In particular, we assume that
the true propensity score π(Xi) is related to the working model πβ(Xi) through the exponential
tilt for some β∗,
π(Xi) = πβ∗(Xi) exp(ξ u(Xi;β∗)) (2.1)
where u(Xi;β∗) is a function determining the direction of misspecification and ξ represents the
magnitude of misspecification. We assume ξ = o(1) and ξ−1n−1/2 = O(1), as n → ∞ so that the
true propensity score π(Xi) is in a local neighborhood of the working model πβ∗(Xi). Under this
local model misspecification setting, we derive the asymptotic bias and variance of the CBPS-based
IPTW estimator in (1.5). The next theorem gives the expression of the asymptotic bias.
Theorem 2.1 (Asymptotic Bias under Local Misspecification). If the propensity score model is
locally misspecified as in (2.1), then under Assumption A.1 in Appendix A, the bias of the IPTW
estimator defined in (1.5) is given by E(μ̂β̂) − μ = Bξ + o(ξ), where

$$B = \mathbb{E}\left[\frac{u(X_i;\beta^*)\{K(X_i) + L(X_i)(1-\pi_{\beta^*}(X_i))\}}{1-\pi_{\beta^*}(X_i)}\right] + H_y^{*\top}(H_f^{*\top}W^*H_f^*)^{-1}H_f^{*\top}W^*\,\mathbb{E}\left[\frac{u(X_i;\beta^*)f(X_i)}{1-\pi_{\beta^*}(X_i)}\right], \tag{2.2}$$
and W∗ is the limiting value of W in (1.7),

$$H_y^* = -\mathbb{E}\left(\frac{K(X_i) + (1-\pi_{\beta^*}(X_i))L(X_i)}{\pi_{\beta^*}(X_i)(1-\pi_{\beta^*}(X_i))}\cdot\frac{\partial\pi_{\beta^*}(X_i)}{\partial\beta}\right), \qquad H_f^* = -\mathbb{E}\left(\frac{f(X_i)}{\pi_{\beta^*}(X_i)(1-\pi_{\beta^*}(X_i))}\left(\frac{\partial\pi_{\beta^*}(X_i)}{\partial\beta}\right)^{\top}\right).$$
This theorem shows that the estimator μ̂β̂ has a first-order bias term Bξ under local model misspecification. Although the expression of B looks complicated, the next corollary shows how to choose the covariate balancing function to eliminate this first-order bias.
Corollary 2.1 (Optimal Choice of Covariate Balancing Function under Local Misspecification).
Suppose that we choose the covariate balancing function f(X) such that α⊤f(X) = K(X) + (1 − πβ∗(X))L(X) holds, where α ∈ Rm is a vector of arbitrary constants. In addition, assume that the
number of parameters is equal to the dimension of the covariate balancing function f(Xi), i.e., m = q.
Then, under the conditions in Theorem 2.1, the IPTW estimator μ̂β̂ given in equation (1.5) is first-order unbiased, i.e., B = 0.
In the following, we consider the asymptotic bias of the IPTW estimator when the propensity
score model is arbitrarily misspecified. We first describe our results in a heuristic way. Assume that
the propensity score model is misspecified, i.e., P(Ti = 1 | Xi) = π(Xi) ≠ πβ(Xi) for any β ∈ Θ.
Let βo represent the limiting value of the CBPS estimator β̂, i.e., assume β̂ →p βo. Then, the
asymptotic bias of μ̂β̂ is given by E(μβo) − μ, where μ is the true ATE defined in (1.3). The CBPS
method ensures that βo satisfies

$$\mathbb{E}(g_{\beta^o}(T,X)) = \mathbb{E}\left[\left(\frac{T_i}{\pi_{\beta^o}(X_i)} - \frac{1-T_i}{1-\pi_{\beta^o}(X_i)}\right)f(X_i)\right] = \mathbb{E}\left[\frac{\pi(X_i)-\pi_{\beta^o}(X_i)}{\pi_{\beta^o}(X_i)(1-\pi_{\beta^o}(X_i))}\,f(X_i)\right] = 0. \tag{2.3}$$
Note that E(μβo) can be written as

$$\mathbb{E}(\mu_{\beta^o}) = \mathbb{E}\left[\frac{T_iY_i}{\pi_{\beta^o}(X_i)} - \frac{(1-T_i)Y_i}{1-\pi_{\beta^o}(X_i)}\right] = \mathbb{E}\left[\frac{T_i\{K(X_i)+L(X_i)\}}{\pi_{\beta^o}(X_i)} - \frac{(1-T_i)K(X_i)}{1-\pi_{\beta^o}(X_i)}\right],$$

where the second equality follows from the strong ignorability of treatment assignment and the law
of iterated expectations. Therefore, the asymptotic bias of the IPTW estimator is

$$\mathbb{E}(\mu_{\beta^o}) - \mu = \mathbb{E}\left[\left(\frac{T_i}{\pi_{\beta^o}(X_i)} - \frac{1-T_i}{1-\pi_{\beta^o}(X_i)}\right)\left\{K(X_i) + (1-\pi_{\beta^o}(X_i))L(X_i)\right\}\right].$$
Thus, by equation (2.3), we can eliminate the asymptotic bias of μ̂β̂ under arbitrary model misspecification by choosing the covariate balancing function f(Xi) such that α⊤f(Xi) = K(Xi) + (1 − πβo(Xi))L(Xi) holds for some α ∈ Rm. In other words, μ̂β̂ remains consistent for μ even if the propensity score model is misspecified. Under the regularity conditions in Appendix A, this result can be proved by an argument similar to that of Theorem 3.1 in Section 3. For simplicity, we
refer the reader to the proof of Theorem 3.1 in Appendix C for the details. Finally, we note that the
optimal choice of the covariate balancing function under arbitrary model misspecification is the same
as that under local misspecification; hence the result in Corollary 2.1 extends to the setting of
arbitrary misspecification.
2.2 Efficiency Consideration
We next study how the choice of different covariate balancing functions affects the efficiency of the
IPTW estimator. We first consider the case where the propensity score model is correctly specified.
We further show that the efficiency result also applies to the case of local misspecification studied
above.
Let β∗ be the true value of the parameter β in the propensity score model πβ(Xi). We first
derive the asymptotic distribution of the CBPS-based IPTW estimator under correctly specified
propensity score model. In this case, the estimator is asymptotically unbiased and follows the nor-
mal distribution regardless of the choice of covariate balancing function. The asymptotic variance
of the estimator, however, depends on this choice.
Theorem 2.2 (Asymptotic Properties under Correct Specification). Suppose that the propensity
score model is correctly specified and β̂ is obtained through equation (1.7). Let μ be the true
treatment effect and W∗ be the limiting value of W in (1.7). Let

$$\mu_{\beta^*}(T_i,Y_i,X_i) = \frac{T_iY_i}{\pi_{\beta^*}(X_i)} - \frac{(1-T_i)Y_i}{1-\pi_{\beta^*}(X_i)}.$$

Under Assumption A.1 in Appendix A, μ̂β̂ satisfies

$$\sqrt{n}(\hat\mu_{\hat\beta} - \mu) \xrightarrow{d} N\left(0,\; H^{*\top}\Sigma H^*\right), \tag{2.4}$$

where $H^* = (1, H_y^{*\top})^{\top}$ and

$$\Sigma = \begin{pmatrix}\Sigma_\mu & \Sigma_{\mu\beta}^{\top}\\ \Sigma_{\mu\beta} & \Sigma_\beta\end{pmatrix},$$

with

$$\Sigma_\mu = \mathrm{Var}\left(\mu_{\beta^*}(T_i,Y_i,X_i)\right) = \mathbb{E}\left(\frac{Y_i(1)^2}{\pi_{\beta^*}(X_i)} + \frac{Y_i(0)^2}{1-\pi_{\beta^*}(X_i)}\right) - \left\{\mathbb{E}(Y_i(1)) - \mathbb{E}(Y_i(0))\right\}^2,$$

$$\Sigma_\beta = (H_f^{*\top}W^*H_f^*)^{-1}H_f^{*\top}W^*\,\mathrm{Var}(g_{\beta^*}(T_i,X_i))\,W^*H_f^*(H_f^{*\top}W^*H_f^*)^{-1},$$

$$\Sigma_{\mu\beta} = -(H_f^{*\top}W^*H_f^*)^{-1}H_f^{*\top}W^*\,\mathrm{Cov}\left(\mu_{\beta^*}(T_i,Y_i,X_i),\, g_{\beta^*}(T_i,X_i)\right),$$

where $g_{\beta^*}(T_i,X_i)$ is defined in (1.6). Under the model in equation (1.2), we have

$$H_y^* = -\mathbb{E}\left(\frac{K(X_i) + (1-\pi_{\beta^*}(X_i))L(X_i)}{\pi_{\beta^*}(X_i)(1-\pi_{\beta^*}(X_i))}\cdot\frac{\partial\pi_{\beta^*}(X_i)}{\partial\beta}\right), \qquad H_f^* = -\mathbb{E}\left(\frac{f(X_i)}{\pi_{\beta^*}(X_i)(1-\pi_{\beta^*}(X_i))}\left(\frac{\partial\pi_{\beta^*}(X_i)}{\partial\beta}\right)^{\top}\right),$$

$$\mathrm{Var}(g_{\beta^*}(T_i,X_i)) = \mathbb{E}\left(\frac{f(X_i)f(X_i)^{\top}}{\pi_{\beta^*}(X_i)(1-\pi_{\beta^*}(X_i))}\right),$$

$$\mathrm{Cov}\left(\mu_{\beta^*}(T_i,Y_i,X_i),\, g_{\beta^*}(T_i,X_i)\right) = \mathbb{E}\left[\frac{\{K(X_i) + (1-\pi_{\beta^*}(X_i))L(X_i)\}\,f(X_i)}{\pi_{\beta^*}(X_i)(1-\pi_{\beta^*}(X_i))}\right].$$
To better understand this result, consider a special case where the dimension of the covariate
balancing function is equal to the number of parameters to be estimated, i.e., m = q (Imai and
Ratkovic, 2014). In this case, we can solve the optimization problem in (1.7) by setting W to a
diagonal matrix. Then, the asymptotic variance of μ̂β̂ is equal to

$$\mathrm{Var}(\hat\mu_{\hat\beta}) \approx \mathrm{Var}(\mu_{\beta^*}) + H_y^{*\top}H_f^{*-1}\,\mathrm{Var}(g_{\beta^*}(T,X))\,H_f^{*-\top}H_y^* - 2H_y^{*\top}H_f^{*-1}\,\mathrm{Cov}\left(\mu_{\beta^*},\, g_{\beta^*}(T,X)\right). \tag{2.5}$$

The expression (2.5) contains three parts. The first term Var(μβ∗) is the variance of the estimator
under the true value β∗. The second term is the variance of the balancing equation gβ(T,X) under
β∗, scaled by the quadratic form involving $H_y^{*\top}H_f^{*-1}$. The third term is the covariance between μβ∗ and
gβ∗(T,X), scaled by $H_y^{*\top}H_f^{*-1}$.
Based on the asymptotic variance in equation (2.5), the next corollary shows that the optimal
choice of covariate balancing function derived before also results in an IPTW estimator that is
semiparametrically efficient.
Corollary 2.2 (Optimal Choice of Covariate Balancing Function under Correct Specification).
Choose any covariate balancing function f(X) such that α⊤f(X) = K(X) + (1 − πβ∗(X))L(X)
holds for some constant α ∈ Rm. In addition, assume that the number of parameters is equal to
the dimension of the covariate balancing function f(Xi), i.e., m = q. Then, under the conditions in
Theorem 2.2, the IPTW estimator μ̂β̂ given in equation (1.5) attains the semiparametric asymptotic
variance bound in Theorem 1 of Hahn (1998), i.e.,

$$V_{\mathrm{opt}} = \mathbb{E}\left[\frac{\mathrm{Var}(Y_i(1)\,|\,X_i)}{\pi(X_i)} + \frac{\mathrm{Var}(Y_i(0)\,|\,X_i)}{1-\pi(X_i)} + \left(L(X_i)-\mu\right)^2\right]. \tag{2.6}$$
We note that there may exist many choices of f(X) that satisfy α⊤f(X) = K(X) + (1 − πβ∗(X))L(X) for some α ∈ Rm. This corollary implies that, provided f(X) satisfies the above
condition, the asymptotic variance of our IPTW estimator does not depend on the particular choice
of f(X) and is the smallest among the class of regular estimators.
Finally, we comment that this efficiency result under the correctly specified model also carries
over to the locally misspecified case examined earlier. In Appendix B.5, we show that under the
locally misspecified propensity score model in (2.1), the IPTW estimator μ̂β̂ satisfies
$\sqrt{n}(\hat\mu_{\hat\beta} - \mu) \xrightarrow{d} N(B,\, H^{*\top}\Sigma H^*)$, where B is the first-order bias given in equation (2.2) of Theorem 2.1.
Thus, together with Corollary 2.1, the aforementioned choice of covariate balancing function, i.e.,
α⊤f(Xi) = K(Xi) + (1 − πβo(Xi))L(Xi), yields an asymptotically unbiased and efficient estimator
of the ATE under local misspecification.
In summary, the theoretical analysis presented in this section has shown that the optimal
covariate balancing function f(Xi) for the CBPS methodology needs to satisfy α⊤f(Xi) = K(Xi) +
(1 − πβo(Xi))L(Xi), which leads to an asymptotically unbiased and efficient estimator of the ATE
under various scenarios. Recall that K(Xi) is the conditional mean of the potential outcome under
the control condition and L(Xi) is the conditional average treatment effect. Both K(·) and L(·) can be
estimated by imposing additional parametric assumptions. However, the optimal choice of f(Xi)
also requires knowledge of the propensity score model πβo(Xi), where the limiting value βo
of β̂ depends on the choice of f(Xi) itself. Thus, to construct the optimal covariate balancing
function f(Xi), one needs some prior knowledge of f(Xi) itself in order to estimate the propensity score model in
(1.6). This “chicken-and-egg” relationship between the optimal covariate balancing function f(Xi)
and the propensity score model πβo(Xi) makes the implementation of the existing CBPS method with
the optimal f(Xi) difficult in practice. To address this issue, in the following section we propose a
new IPTW estimator by generalizing the CBPS method, such that the optimal covariate balancing
function does not depend on knowledge of the propensity score model πβo(Xi).
3 The Improved CBPS Methodology
Recall that the optimal covariate balancing function f(Xi) for the CBPS methodology needs to
satisfy α⊤f(Xi) = K(Xi) + (1 − πβo(Xi))L(Xi). Notice that the asymptotic bias of the IPTW
estimator with the optimal covariate balancing function f(Xi) can be decomposed into two terms
in the following manner,

$$\mathbb{E}(\mu_{\beta^o}) - \mu = \mathbb{E}\left[\left(\frac{T_i}{\pi_{\beta^o}(X_i)} - \frac{1-T_i}{1-\pi_{\beta^o}(X_i)}\right)\left\{K(X_i) + (1-\pi_{\beta^o}(X_i))L(X_i)\right\}\right] = \mathbb{E}\left[\left(\frac{T_i}{\pi_{\beta^o}(X_i)} - \frac{1-T_i}{1-\pi_{\beta^o}(X_i)}\right)K(X_i) + \left(\frac{T_i}{\pi_{\beta^o}(X_i)} - 1\right)L(X_i)\right]. \tag{3.1}$$
Our main idea is to minimize the magnitudes of both terms in order to eliminate the bias of the IPTW
estimator. Motivated by this observation, we propose to balance the first term and the second term
separately. This leads to the following set of estimating functions:

$$g_\beta(T,X) = \begin{pmatrix}g_{1\beta}(T,X)\\ g_{2\beta}(T,X)\end{pmatrix}, \tag{3.2}$$

where $g_{1\beta}(T,X) = n^{-1}\sum_{i=1}^n g_{1\beta}(T_i,X_i)$ and $g_{2\beta}(T,X) = n^{-1}\sum_{i=1}^n g_{2\beta}(T_i,X_i)$ with

$$g_{1\beta}(T_i,X_i) = \left(\frac{T_i}{\pi_\beta(X_i)} - \frac{1-T_i}{1-\pi_\beta(X_i)}\right)h_1(X_i), \qquad g_{2\beta}(T_i,X_i) = \left(\frac{T_i}{\pi_\beta(X_i)} - 1\right)h_2(X_i),$$
for some functions h1(·) : Rd → Rm1 and h2(·) : Rd → Rm2 with m1 + m2 = m. Note that, as
seen in the asymptotic bias of the IPTW estimator in (3.1), h1(Xi) in g1β(T ,X) aims to recover
the conditional mean function of the potential outcome under the control condition, i.e., K(Xi),
whereas h2(Xi) in g2β(T ,X) aims to recover the conditional mean function of the treatment effect,
i.e., L(Xi). It is easily seen that g1β(T ,X) is the same as the existing covariate balancing moment
function in (1.6), which balances the covariates h1(Xi) between the treatment and control groups.
More importantly, unlike the existing CBPS method, we introduce a new set of functions g2β(T,X),
which matches the weighted covariates h2(Xi) in the treatment group to the unweighted covariates
h2(Xi) in the control group, because g2β(T,X) = 0 can be rewritten as

$$\sum_{i:\,T_i=1}\frac{1-\pi_\beta(X_i)}{\pi_\beta(X_i)}\,h_2(X_i) = \sum_{i:\,T_i=0} h_2(X_i).$$
As seen in the derivation of (3.1), the auxiliary “covariate-imbalance” estimating function g2β(T,X)
is the key to removing the dependence of the optimal covariate balancing function f(Xi) on the
propensity score model πβo(Xi). Thus, our method is an improved version of the CBPS method
(iCBPS). Given the estimating functions in (3.2), we can estimate β by the GMM estimator β̂ in
(1.7). Similarly, the ATE is estimated by the IPTW estimator μ̂β̂ in (1.5). The implementation of
the proposed iCBPS method (e.g., the choice of h1(·) and h2(·)) will be discussed in later sections.
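As a minimal illustration of the estimating functions in (3.2), the sketch below takes h1(X) = X (targeting K) and h2(X) = 1 (targeting a constant L), together with an intercept in the logistic model so that m1 + m2 = q and the system is just identified; the simulated design is an assumption for illustration, not a recommended default.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(2)
n, d = 4000, 2
X = rng.normal(size=(n, d))
Xc = np.column_stack([np.ones(n), X])  # intercept + covariates, q = d + 1

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

beta_true = np.array([0.2, 0.6, -0.6])
T = rng.binomial(1, expit(Xc @ beta_true))
Y = 1.0 + X @ np.array([1.0, -1.0]) + 2.0 * T + rng.normal(size=n)  # ATE = 2

def icbps_moments(beta):
    """Stacked iCBPS moments (3.2): g1 with h1(X) = X, g2 with h2(X) = 1."""
    pi = expit(Xc @ beta)
    g1 = ((T / pi - (1 - T) / (1 - pi))[:, None] * X).mean(axis=0)
    g2 = np.mean(T / pi - 1.0)  # the auxiliary covariate-imbalance condition
    return np.append(g1, g2)

sol = root(icbps_moments, x0=np.zeros(d + 1))
pi_hat = expit(Xc @ sol.x)
ate_hat = np.mean(T * Y / pi_hat - (1 - T) * Y / (1 - pi_hat))
```

Note that the fitted propensity score is then simply plugged into the IPTW estimator (1.5); no outcome regression is run, which is the practical appeal of the method.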
3.1 Theoretical Properties
We now derive the theoretical properties of the IPTW estimator given in (1.5) based on the proposed
iCBPS method. In particular, we will show that the proposed estimator is doubly robust and
semiparametrically efficient. The following set of assumptions is imposed for the establishment of double robustness.
Assumption 3.1. The following regularity conditions are assumed.

1. There exists a positive definite matrix W∗ such that W →p W∗.

2. The minimizer βo = argmin_{β∈Θ} E(gβ(T,X))⊤W∗E(gβ(T,X)) is unique.

3. βo ∈ int(Θ), where Θ is a compact set.

4. πβ(X) is continuous in β.

5. There exists a constant 0 < c0 < 1/2 such that, with probability tending to one, c0 ≤ πβ(X) ≤ 1 − c0 for any β ∈ int(Θ).

6. E|Y(1)|² < ∞ and E|Y(0)|² < ∞.

7. G∗ := E(∂g(βo)/∂β) exists, where g(β) = (g1β(T,X)⊤, g2β(T,X)⊤)⊤, and there exist a q-dimensional function C(X) and a small constant r > 0 such that sup_{β∈Br(βo)} |∂πβ(X)/∂βk| ≤ Ck(X) for 1 ≤ k ≤ q, E(|h1j(X)|Ck(X)) < ∞ for 1 ≤ j ≤ m1, 1 ≤ k ≤ q, and E(|h2j(X)|Ck(X)) < ∞ for 1 ≤ j ≤ m2, 1 ≤ k ≤ q, where Br(βo) is a ball in Rq with radius r and center βo.
Conditions 1-4 of Assumption 3.1 are the standard conditions for consistency of the GMM
estimator (Newey and McFadden, 1994). Condition 5 is commonly used in the missing data and
causal inference literature, which essentially says the propensity score cannot be too close to 0 and
1 (Robins et al., 1994, 1995). Conditions 6-7 are technical conditions that enable us to apply the
dominated convergence theorem. Note that sup_{β∈Br(βo)} |∂πβ(X)/∂βk| ≤ Ck(X) in Condition 7
is a local condition in the sense that it only requires the existence of an envelope function C(X)
in a small neighborhood of βo.
We now establish the double robustness of the proposed estimator under Assumption 3.1.
Theorem 3.1 (Double Robustness). Under Assumption 3.1, the proposed iCBPS-based IPTW
estimator μ̂β̂ is doubly robust. That is, μ̂β̂ →p μ if at least one of the following two conditions
holds:

1. The propensity score model is correctly specified, i.e., P(Ti = 1 | Xi) = πβ∗(Xi);

2. The functions K(·) and L(·) lie in the linear spaces spanned by the components of M1h1(·) and
M2h2(·), respectively, where M1 ∈ Rq×m1 and M2 ∈ Rq×m2 are the partitions of G∗⊤W∗ =
(M1, M2). That is, K(·) ∈ span{M1h1(·)} and L(·) ∈ span{M2h2(·)}.
Theorem 3.1 implies that the proposed estimator is consistent if either the propensity score
model or the outcome model is correctly specified. In particular, the second condition can be
rewritten as K(Xi) = α1⊤M1h1(Xi) and L(Xi) = α2⊤M2h2(Xi) for some vectors α1, α2 ∈ Rq.
Hence, the functions h1(·) and h2(·) play very different roles in the proposed iCBPS methodology.
Specifically, M1h1(·) serves as the basis functions for the conditional baseline effect K(·), while
M2h2(·) provides the basis functions for the conditional treatment effect L(·).

Next, we establish the asymptotic normality of the proposed estimator if either the propensity
score model or the outcome model is correctly specified. For this result, we require an additional
set of regularity conditions.
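Before stating these conditions, Theorem 3.1 can be illustrated numerically. In the sketch below, the fitted linear-logistic propensity model is deliberately misspecified (the true assignment probability depends on X1²), yet h1(X) = (1, X1) spans the true K(X) = 1 + 2X1 and h2(X) = 1 spans the constant L(X) = 2, so the IPTW estimate stays close to the true ATE; all model and design choices here are illustrative assumptions rather than the paper's simulation design.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(3)
n = 6000
X1, X2 = rng.normal(size=n), rng.normal(size=n)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

pi_true = expit(-0.2 + 0.4 * X1 + 0.4 * X1**2 - 0.4 * X2)  # nonlinear truth
T = rng.binomial(1, pi_true)
Y = 1.0 + 2.0 * X1 + 2.0 * T + rng.normal(size=n)  # K = 1 + 2*X1, L = 2

Xc = np.column_stack([np.ones(n), X1, X2])  # misspecified working model, q = 3
H1 = np.column_stack([np.ones(n), X1])      # h1 spans K; h2(X) = 1 spans L

def icbps_moments(beta):
    pi = expit(Xc @ beta)
    g1 = ((T / pi - (1 - T) / (1 - pi))[:, None] * H1).mean(axis=0)
    g2 = np.mean(T / pi - 1.0)
    return np.append(g1, g2)  # m = 3 = q: just identified

sol = root(icbps_moments, x0=np.zeros(3))
pi_hat = expit(Xc @ sol.x)
ate_hat = np.mean(T * Y / pi_hat - (1 - T) * Y / (1 - pi_hat))
# ate_hat stays near 2 despite the misspecified propensity model, because
# the balanced functions span the true outcome model (Condition 2).
```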
Assumption 3.2. The following regularity conditions are assumed.

1. G∗⊤W∗G∗ and Ω = E(gβo(Ti,Xi) gβo(Ti,Xi)⊤) are nonsingular.

2. The function C(X) defined in Condition 7 of Assumption 3.1 satisfies E(|Y(0)|Ck(X)) < ∞ and E(|Y(1)|Ck(X)) < ∞ for 1 ≤ k ≤ q.
Condition 1 of Assumption 3.2 ensures the non-singularity of the asymptotic variance matrix
and Condition 2 is a mild technical condition required for the dominated convergence theorem.
Theorem 3.2 (Asymptotic Normality). Suppose that Assumptions 3.1 and 3.2 hold.

1. If Condition 1 of Theorem 3.1 holds, then the proposed iCBPS-based IPTW estimator μ̂β̂
has the following asymptotic distribution:

$$\sqrt{n}(\hat\mu_{\hat\beta} - \mu) \xrightarrow{d} N\left(0,\; \bar H^{*\top}\Sigma\bar H^*\right), \tag{3.3}$$

where $\bar H^* = (1, H^{*\top})^{\top}$, $\Sigma_\beta = (G^{*\top}W^*G^*)^{-1}G^{*\top}W^*\Omega W^*G^*(G^{*\top}W^*G^*)^{-1}$, and

$$H^* = -\mathbb{E}\left(\frac{K(X_i) + (1-\pi_{\beta^*}(X_i))L(X_i)}{\pi_{\beta^*}(X_i)(1-\pi_{\beta^*}(X_i))}\cdot\frac{\partial\pi_{\beta^*}(X_i)}{\partial\beta}\right),$$

$$\Sigma = \begin{pmatrix}\Sigma_\mu & \Sigma_{\mu\beta}^{\top}\\ \Sigma_{\mu\beta} & \Sigma_\beta\end{pmatrix}, \quad \text{with } \Sigma_\mu = \mathbb{E}\left(\frac{Y_i^2(1)}{\pi_{\beta^*}(X_i)} + \frac{Y_i^2(0)}{1-\pi_{\beta^*}(X_i)}\right) - \mu^2. \tag{3.4}$$

In addition, $\Sigma_{\mu\beta}$ is given by

$$\Sigma_{\mu\beta} = -(G^{*\top}W^*G^*)^{-1}G^{*\top}W^*\,\mathbb{E}\left(\frac{K(X_i) + (1-\pi_{\beta^*}(X_i))L(X_i)}{(1-\pi_{\beta^*}(X_i))\pi_{\beta^*}(X_i)}\,h(X_i)\right),$$

where $h(X_i) = (h_1^{\top}(X_i), h_2^{\top}(X_i))^{\top}$.
2. If Condition 2 of Theorem 3.1 holds, then the proposed iCBPS-based IPTW estimator μ̂β̂
has the following asymptotic distribution:

$$\sqrt{n}(\hat\mu_{\hat\beta} - \mu) \xrightarrow{d} N\left(0,\; \bar H^{*\top}\Sigma\bar H^*\right), \tag{3.5}$$

where $\bar H^* = (1, H^{*\top})^{\top}$, $\Sigma_\beta = (G^{*\top}W^*G^*)^{-1}G^{*\top}W^*\Omega W^*G^*(G^{*\top}W^*G^*)^{-1}$,

$$H^* = -\mathbb{E}\left[\left(\frac{\pi(X_i)\{K(X_i)+L(X_i)\}}{\pi_{\beta^o}(X_i)^2} + \frac{(1-\pi(X_i))K(X_i)}{(1-\pi_{\beta^o}(X_i))^2}\right)\frac{\partial\pi_{\beta^o}(X_i)}{\partial\beta}\right],$$

$$\Sigma = \begin{pmatrix}\Sigma_\mu & \Sigma_{\mu\beta}^{\top}\\ \Sigma_{\mu\beta} & \Sigma_\beta\end{pmatrix} \quad \text{with } \Sigma_\mu = \mathbb{E}\left(\frac{\pi(X_i)Y_i^2(1)}{\pi_{\beta^o}(X_i)^2} + \frac{(1-\pi(X_i))Y_i^2(0)}{(1-\pi_{\beta^o}(X_i))^2}\right) - \mu^2.$$

In addition, $\Sigma_{\mu\beta}$ is given by

$$\Sigma_{\mu\beta} = -(G^{*\top}W^*G^*)^{-1}G^{*\top}W^*S,$$

where $S = (S_1^{\top}, S_2^{\top})^{\top}$ and

$$S_1 = \mathbb{E}\left[\left(\frac{\pi(X_i)\{K(X_i)+L(X_i)-\pi_{\beta^o}(X_i)\mu\}}{\pi_{\beta^o}(X_i)^2} + \frac{(1-\pi(X_i))\{K(X_i)+(1-\pi_{\beta^o}(X_i))\mu\}}{(1-\pi_{\beta^o}(X_i))^2}\right)h_1(X_i)\right],$$

$$S_2 = \mathbb{E}\left[\left(\frac{\pi(X_i)\{(K(X_i)+L(X_i))(1-\pi_{\beta^o}(X_i))-\pi_{\beta^o}(X_i)\mu\}}{\pi_{\beta^o}(X_i)^2} + \frac{(1-\pi(X_i))\{K(X_i)+(1-\pi_{\beta^o}(X_i))\mu\}}{1-\pi_{\beta^o}(X_i)}\right)h_2(X_i)\right].$$
3. If both Conditions 1 and 2 of Theorem 3.1 hold and W∗ = Ω⁻¹, then the proposed iCBPS-based
IPTW estimator μ̂β̂ has the following asymptotic distribution:

$$\sqrt{n}(\hat\mu_{\hat\beta} - \mu) \xrightarrow{d} N(0, V),$$

where

$$V = \Sigma_\mu - (\alpha_1^{\top}M_1,\ \alpha_2^{\top}M_2)\,G^*(G^{*\top}\Omega^{-1}G^*)^{-1}G^{*\top}\begin{pmatrix}M_1^{\top}\alpha_1\\ M_2^{\top}\alpha_2\end{pmatrix} \tag{3.6}$$

and Σμ is defined in (3.4).
The asymptotic variance V in (3.6) contains two terms. The first term Σμ represents the
variance of each summand in the estimator defined in equation (1.5) with β̂ replaced by its true
value β∗. The second term can be interpreted as the effect of estimating β∗ via the covariate balancing
conditions. Since this second term is nonnegative, the proposed estimator is more efficient than
the standard IPTW estimator with the true propensity score model, i.e., V ≤ Σµ. In particular,
Henmi and Eguchi (2004) offer a theoretical analysis of such efficiency gain due to the estimation
of nuisance parameters under a general estimating equation framework.
Since the choice of h1(·) and h2(·) can be arbitrary, it might be tempting to incorporate additional covariate balancing conditions into h1(·) and h2(·). However, the following corollary shows
that when both the propensity score and outcome models are correctly specified, one cannot improve the efficiency of the proposed estimator by increasing the number of functions in h1(·) and h2(·),
or equivalently, the dimensionality of the covariate balancing conditions g1β(T,X) and g2β(T,X).
Corollary 3.1. Define h̃1(X) = (h1⊤(X), a1⊤(X))⊤ and h̃2(X) = (h2⊤(X), a2⊤(X))⊤, where a1(·) and a2(·) are some additional covariate balancing functions. Similarly, let g̃1(X) and g̃2(X) denote the corresponding estimating equations defined by h̃1(X) and h̃2(X). The resulting iCBPS-based IPTW estimator is denoted by µ̂β̃, where β̃ is defined as in (1.7), and its asymptotic variance is denoted by Ṽ. Under Condition 3 of Theorem 3.1, we have V ≤ Ṽ, where V is defined in (3.6).
The above corollary shows a potential trade-off between robustness and efficiency when choosing h1(·) and h2(·). Recall that if rank(M1) = m1 and rank(M2) = m2, Condition 2 of Theorem 3.1 can be written as K(·) ∈ span{h1(·)} and L(·) ∈ span{h2(·)}. Therefore, we can make the proposed estimator more robust by incorporating more basis functions into h1(·) and h2(·), so that this condition is more likely to hold. However, Corollary 3.1 shows that doing so may inflate the variance of the proposed estimator.
In the following, we focus on the efficiency of the estimator. Using the notation of this section, we can rewrite the semiparametric asymptotic variance bound Vopt in (2.6) as
\[
V_{\mathrm{opt}} = \Sigma_\mu - (\alpha_1^\top M_1,\ \alpha_2^\top M_2)\,\Omega
\begin{pmatrix} M_1^\top\alpha_1 \\ M_2^\top\alpha_2 \end{pmatrix}. \tag{3.7}
\]
Comparing this expression with (3.6), we see that the proposed estimator is semiparametrically
efficient if G∗ is a square matrix (i.e., m = q) and invertible. In this special case, the dimension
of β must be identical to that of the covariate balancing functions gβ(T ,X). We can accomplish
this by over-parameterizing the propensity score model (e.g., including higher-order terms and
interactions). This important result is summarized as the following corollary.
Corollary 3.2. Assume m = q and G∗ is invertible. Under Assumption 3.1, the proposed estimator µ̂β̂ in (1.5) is doubly robust in the sense that µ̂β̂ →p µ if either of the following conditions holds:
1. The propensity score model is correctly specified, that is, P(Ti = 1 |Xi) = πβ∗(Xi).
2. The functions K(·) and L(·) lie in the linear spaces spanned by the functions h1(·) and h2(·), respectively; that is, K(·) ∈ span{h1(·)} and L(·) ∈ span{h2(·)}.
In addition, under Assumption 3.2, if both conditions hold, then the proposed estimator is semiparametrically efficient with the asymptotic variance given in (3.7).
This corollary shows that the proposed estimator has two advantages over the existing CBPS estimator of Imai and Ratkovic (2014), which balances the first (and second) moments of Xi and/or the score function of the propensity score model. First, the proposed estimator µ̂β̂ is doubly robust to model misspecification, whereas the existing CBPS estimator does not have this property. Second, the estimator can be more efficient than the existing CBPS estimator.
The result in Corollary 3.2 implies that the asymptotic variance of µ̂β̂ is identical to the semiparametric variance bound Vopt even if we incorporate additional covariate balancing functions into h1(·) and h2(·). Namely, under the conditions of Corollary 3.2, we have Ṽ = V = Vopt in the context of Corollary 3.1. Thus, in this setting, we can improve the robustness of the estimator without sacrificing efficiency by increasing the number of functions in h1(·) and h2(·). Meanwhile, this also makes the propensity score model more flexible, since we need to increase the number of parameters in β to ensure m = q, as required in Corollary 3.2. This observation further motivates us to consider a nonparametric/semiparametric approach to improve the iCBPS method, which is developed in Section 4.
Remark 3.1. Robins et al. (1994) propose the following estimator of the ATE with the double robustness and semiparametric efficiency properties:
\[
\hat\mu_{\hat\beta,\hat\alpha,\hat\gamma} = \frac{1}{n}\sum_{i=1}^n
\left\{\frac{T_iY_i}{\pi_{\hat\beta}(X_i)} - \frac{(1-T_i)Y_i}{1-\pi_{\hat\beta}(X_i)}
- (T_i - \pi_{\hat\beta}(X_i))\left(\frac{K(X_i,\hat\alpha)+L(X_i,\hat\gamma)}{\pi_{\hat\beta}(X_i)}
+ \frac{K(X_i,\hat\alpha)}{1-\pi_{\hat\beta}(X_i)}\right)\right\},
\]
where K(Xi, α) and L(Xi, γ) are parametric models with finite-dimensional parameters α and γ. Unlike the projection approach behind this classical doubly robust estimator (Tsiatis, 2007), the proposed iCBPS-based IPTW estimator µ̂β̂ is constructed by a new covariate balancing method. From a practical perspective, one needs to plug consistent estimators of (α, γ) and β (usually the MLE) into µ̂β̂,α̂,γ̂ to estimate the treatment effect. In contrast, the proposed method is easier to implement, since it avoids estimating the parameters of the conditional mean models K(Xi, α) and L(Xi, γ).
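As a concrete sketch of this classical estimator, the code below fits the propensity score by logistic MLE, fits the two conditional mean models (K + L on the treated, K on the controls) by OLS, and plugs them into the Robins et al. (1994) formula. The data-generating process and the linear/logistic working models are our own illustrative assumptions, not the paper's simulation design.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-0.8 * x))
t = rng.binomial(1, pi)
y = 2.0 * x + 1.0 * t + rng.normal(size=n)   # K(x) = 2x, L(x) = 1, true ATE = 1

Z = np.column_stack([np.ones(n), x])

def fit_logistic(Z, t, iters=25):
    """Logistic regression by Newton-Raphson; Z includes an intercept column."""
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        w = p * (1.0 - p)
        beta += np.linalg.solve(Z.T @ (Z * w[:, None]), Z.T @ (t - p))
    return beta

pihat = 1.0 / (1.0 + np.exp(-Z @ fit_logistic(Z, t)))

# OLS fits of the conditional means: treated group -> K + L, control group -> K
m1 = Z @ np.linalg.lstsq(Z[t == 1], y[t == 1], rcond=None)[0]   # estimate of K(x) + L(x)
m0 = Z @ np.linalg.lstsq(Z[t == 0], y[t == 0], rcond=None)[0]   # estimate of K(x)

aipw = np.mean(t * y / pihat - (1 - t) * y / (1 - pihat)
               - (t - pihat) * (m1 / pihat + m0 / (1 - pihat)))
naive = y[t == 1].mean() - y[t == 0].mean()   # confounded difference in means
```

The augmentation term kills the confounding bias that the naive difference in means suffers from; the extra estimation steps for (α, γ) are exactly what the iCBPS approach avoids.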
Remark 3.2 (Implementation of iCBPS). Based on Corollary 3.2, h1(·) serves as the basis functions for the baseline conditional mean function K(·), while h2(·) serves as the basis functions for the conditional average treatment effect function L(·). Thus, in practice, researchers can choose a set of basis functions for each of these two functions when determining the specification of h1(·) and h2(·). Once these functions are selected, they can over-parameterize the propensity score model so that m = q holds. The resulting iCBPS-based IPTW estimator may reduce bias under model misspecification and attain high efficiency.
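To make the recipe concrete, here is a minimal sketch of the iCBPS fitting step under an assumed logistic specification with two covariates, h1(x) = (1, x2) as the basis for K(·) and h2(x) = x1 as the basis for L(·), so that m = q = 3 and the sample balancing conditions can be solved exactly. The design, coefficients, and solver choice are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(2)
n = 2000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(0.2 + 0.5 * x1 - 0.3 * x2)))
t = rng.binomial(1, p_true)
y0 = 1.0 + x2 + rng.normal(size=n)   # K(x) = 1 + x2, in span{h1}
y = y0 + t * (2.0 * x1)              # L(x) = 2*x1, in span{h2}; true ATE = 2*E[x1] = 0

Z = np.column_stack([np.ones(n), x1, x2])   # propensity score covariates
h1 = np.column_stack([np.ones(n), x2])      # basis for K(.)
h2 = x1[:, None]                            # basis for L(.)

def gbar(beta):
    """Sample covariate balancing conditions (3 equations, 3 parameters)."""
    p = 1.0 / (1.0 + np.exp(-Z @ beta))
    w1 = t / p - (1 - t) / (1 - p)   # balances h1 between treated and controls
    w2 = t / p - 1                   # balances h2 between treated and full sample
    return np.concatenate([(w1[:, None] * h1).mean(0), (w2[:, None] * h2).mean(0)])

sol = root(gbar, np.zeros(3))        # exactly identified, so we solve gbar = 0
p_hat = 1.0 / (1.0 + np.exp(-Z @ sol.x))
ate = np.mean(t * y / p_hat - (1 - t) * y / (1 - p_hat))
```

Because m = q, minimizing ‖ḡβ‖² and root-finding coincide; no outcome regression is fitted at any point, which is the practical payoff stressed in Remark 3.1.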
Table 3.1: Correct Outcome Model with Correct Propensity Score Model.

                            n = 300                                n = 1000
  β1            0     0.33     0.67        1          0     0.33     0.67        1
  Bias
    True     0.19     2.48    −6.14    −1.97      −0.42     1.77    −1.71     3.39
    GLM      0.77     1.13    −4.80   −15.61       0.64     0.46     0.08    −2.61
    CBPS     0.70     0.59    −1.17    −5.28       0.68     0.65     0.03    −2.02
    iCBPS    0.61     0.63    −0.86    −4.50       0.70     0.73     0.23    −1.28
  Std Dev
    True    27.17    39.61    84.09   213.13      15.53    21.58    45.50   228.89
    GLM      5.52    17.69    65.45   163.59       2.44     8.06    34.77   108.03
    CBPS     3.36     3.66     4.54     7.40       1.79     1.96     2.36     3.84
    iCBPS    3.20     3.26     3.38     5.12       1.72     1.80     1.85     2.50
  RMSE
    True    27.17    39.69    84.31   213.14      15.54    21.65    45.53   228.91
    GLM      5.58    17.73    65.63   164.34       2.52     8.07    34.77   108.07
    CBPS     3.43     3.71     4.69     9.09       1.91     2.06     2.36     4.34
    iCBPS    3.26     3.32     3.49     6.82       1.86     1.94     1.87     2.80
3.2 Simulation Studies
In this section, we conduct a set of simulation studies to examine the performance of the proposed methodology. We consider the following linear model for the potential outcomes:
\[
Y_i(1) = 200 + 27.4X_{i1} + 13.7X_{i2} + 13.7X_{i3} + 13.7X_{i4} + \varepsilon_i,
\qquad
Y_i(0) = 200 + 13.7X_{i2} + 13.7X_{i3} + 13.7X_{i4} + \varepsilon_i,
\]
where εi ∼ N(0, 1) is independent of Xi, and the following true propensity score model:
\[
P(T_i = 1 \mid X_i = x_i) =
\frac{\exp(-\beta_1 x_{i1} + 0.5x_{i2} - 0.25x_{i3} - 0.1x_{i4})}
{1 + \exp(-\beta_1 x_{i1} + 0.5x_{i2} - 0.25x_{i3} - 0.1x_{i4})}, \tag{3.8}
\]
where β1 varies from 0 to 1. When implementing the proposed methodology, we set h1(xi) = (1, xi2, xi3, xi4) and h2(xi) = xi1, so that the number of equations is equal to the number of parameters to be estimated. Each covariate is generated independently from N(3, 2). Each set of results is based on 1,000 Monte Carlo simulations.
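This design can be reproduced with a short generator; the reading of N(3, 2) as mean 3 and variance 2 is our own assumption, as the paper does not spell out the second parameter.

```python
import numpy as np

def simulate(n, beta1, rng):
    """Generate one dataset from the simulation design of Section 3.2."""
    # Each covariate is N(3, 2); we take this to mean mean 3, variance 2 (an assumption).
    X = rng.normal(3.0, np.sqrt(2.0), size=(n, 4))
    eps = rng.normal(size=n)
    y1 = 200 + 27.4 * X[:, 0] + 13.7 * (X[:, 1] + X[:, 2] + X[:, 3]) + eps
    y0 = 200 + 13.7 * (X[:, 1] + X[:, 2] + X[:, 3]) + eps
    # True propensity score model (3.8)
    lin = -beta1 * X[:, 0] + 0.5 * X[:, 1] - 0.25 * X[:, 2] - 0.1 * X[:, 3]
    t = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin)))
    y = np.where(t == 1, y1, y0)
    return X, t, y, y1, y0

rng = np.random.default_rng(0)
X, t, y, y1, y0 = simulate(100_000, 0.33, rng)
# Since Yi(1) - Yi(0) = 27.4 * Xi1, the true ATE is 27.4 * E[Xi1] = 82.2.
```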
We examine the performance of the IPTW estimator when the propensity score model is fitted by maximum likelihood (GLM), by the standard CBPS balancing the first moment (CBPS), and by the proposed improvement of CBPS (iCBPS), as well as the case where the true propensity score (True), i.e., β = β∗, is used in the IPTW estimator. In the first set of simulations, we use a correctly specified propensity score model. Table 3.1 shows the bias, standard deviation, and root mean square error (RMSE) of these estimators when the sample size is n = 300 and n = 1000. We find that CBPS and iCBPS substantially outperform both True and GLM in terms of efficiency. In addition, iCBPS is more efficient than CBPS in all cases, though both have similar biases in most cases. These findings are consistent with Corollary 3.2.
Table 3.2: Correct Outcome Model with Misspecified Propensity Score Model.

                            n = 300                                n = 1000
  β1            0     0.33     0.67        1          0     0.33     0.67        1
  Bias
    True     0.41    −0.38    −2.28    −2.57       0.46     0.02     1.03    −6.59
    GLM      1.76   −24.12   −57.51   −72.73       0.93   −25.34   −53.49   −74.57
    CBPS     1.77    −0.94    −4.73    −6.69       0.74    −2.51    −6.41    −8.32
    iCBPS    0.50     0.41    −0.63    −1.95       0.63     0.30    −0.60    −1.87
  Std Dev
    True    47.60    41.63    70.40   310.96      25.80    21.12    54.64    90.80
    GLM      9.72    26.53   161.18    48.66       2.67    13.65    23.74    30.76
    CBPS     4.14     4.15     4.84     5.36       1.79     2.08     2.76     3.45
    iCBPS    3.42     3.11     3.18     3.20       1.75     1.76     1.82     1.74
  RMSE
    True    47.60    41.64    70.44   310.97      25.80    21.12    54.65    91.04
    GLM      9.88    35.86   171.13    87.50       2.83    28.78    58.52    80.67
    CBPS     4.50     4.26     6.77     8.57       1.94     3.26     6.97     9.01
    iCBPS    3.45     3.14     3.25     3.75       1.86     1.78     1.92     2.55
We further evaluate our method by considering different cases of misspecification for the outcome and propensity score models. We begin with the case where the outcome model is linear as before but the propensity score is misspecified. While we use the model given in equation (3.8) when estimating the propensity score, the actual treatment is generated according to the following different model,
\[
P(T_i = 1 \mid X_i = x_i) =
\frac{\exp(-\beta_1 x^*_{i1} + 0.5x^*_{i2} - 0.25x^*_{i3} - 0.1x^*_{i4})}
{1 + \exp(-\beta_1 x^*_{i1} + 0.5x^*_{i2} - 0.25x^*_{i3} - 0.1x^*_{i4})},
\]
with x∗i1 = exp(xi1/3), x∗i2 = xi2/(1 + exp(xi1)) + 10, x∗i3 = xi1xi3/25 + 0.6, and x∗i4 = xi1 + xi4 + 20, where β1 again varies from 0 to 1. In other words, the model misspecification is introduced through nonlinear transformations of the covariates. Table 3.2 shows the results for this case. As expected from the double robustness property shown in Theorem 3.1, we find that the bias of iCBPS is substantially smaller than that of CBPS and GLM. iCBPS also dominates the other estimators in terms of efficiency.
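The nonlinear transformations can be wrapped in a small helper; note that the parenthesization of x∗i2 as xi2/(1 + exp(xi1)) + 10 is our reading of the flattened formula.

```python
import numpy as np

def transform(X):
    """Nonlinear covariate transformations used to induce propensity score misspecification."""
    x1, x2, x3, x4 = X[:, 0], X[:, 1], X[:, 2], X[:, 3]
    return np.column_stack([
        np.exp(x1 / 3.0),                  # x1*
        x2 / (1.0 + np.exp(x1)) + 10.0,    # x2*  (parenthesization assumed)
        x1 * x3 / 25.0 + 0.6,              # x3*
        x1 + x4 + 20.0,                    # x4*
    ])

row = transform(np.array([[0.0, 1.0, 2.0, 3.0]]))[0]
# row = [exp(0), 1/(1+1)+10, 0+0.6, 0+3+20] = [1.0, 10.5, 0.6, 23.0]
```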
We next examine the cases where the outcome model is misspecified. We do this by generating potential outcomes from the following quadratic model:
\[
E(Y_i(1) \mid X_i = x_i) = 200 + 27.4x_{i1}^2 + 13.7x_{i2}^2 + 13.7x_{i3}^2 + 13.7x_{i4}^2,
\qquad
E(Y_i(0) \mid X_i = x_i) = 200 + 13.7x_{i2}^2 + 13.7x_{i3}^2 + 13.7x_{i4}^2,
\]
whereas the propensity score model is the same as in (3.8), with β1 varying from 0 to 0.4. Table 3.3 shows the results when the outcome model is misspecified but the propensity score model is correct. We find that the magnitude of the bias is similar across all estimators, though iCBPS has the smallest bias. CBPS and iCBPS again dominate GLM in terms of efficiency, with iCBPS having the smallest standard deviation. Finally, we consider the case where both the outcome and propensity score models are misspecified. Table 3.4 shows the results. We find that iCBPS outperforms all the other estimators in both bias and efficiency.
Table 3.3: Misspecified Outcome Model with Correct Propensity Score Model.

                            n = 300                                n = 1000
  β1            0     0.13     0.27      0.4          0     0.13     0.27      0.4
  Bias
    True     1.11     4.65     1.19     0.75       2.33     3.46     2.36     1.91
    GLM      2.09     2.76     2.02    −0.17       2.72     2.35     2.69     2.21
    CBPS     2.18     2.46     1.27    −0.98       2.64     2.30     2.38     1.57
    iCBPS    2.07     2.22     1.08    −0.70       2.57     2.28     2.38     1.56
  Std Dev
    True    38.57    42.90    47.86    57.09      21.35    22.57    27.24    31.66
    GLM     14.36    17.72    23.99    32.35       7.41     8.67    12.53    17.82
    CBPS    12.12    13.68    15.11    15.76       6.52     6.94     8.32     9.44
    iCBPS   11.81    12.64    13.50    13.64       6.37     6.42     7.27     7.97
  RMSE
    True    38.58    43.15    47.88    57.10      21.48    22.84    27.34    31.72
    GLM     14.51    17.94    24.08    32.35       7.89     8.98    12.82    17.95
    CBPS    12.32    13.90    15.16    15.79       7.03     7.31     8.65     9.57
    iCBPS   11.99    12.83    13.54    13.66       6.87     6.82     7.65     8.12
Table 3.4: Misspecified Outcome Model with Misspecified Propensity Score Model.

                            n = 300                                n = 1000
  β1            0     0.13     0.27      0.4          0     0.13     0.27      0.4
  Bias
    True     1.23     1.63     1.19     0.13       0.19     1.51     0.47     1.63
    GLM      3.73     1.54    −1.58    −7.16       1.63    −0.01    −3.26    −8.46
    CBPS     8.32     6.35     4.23     1.36       1.89     0.75    −0.70    −3.17
    iCBPS    4.11     2.28     1.18    −0.32       1.45     0.50    −0.32    −1.53
  Std Dev
    True    60.98    57.09    52.36    50.10      33.47    30.57    27.66    28.41
    GLM     23.20    19.45    19.53    20.52       7.71     8.05     8.46     9.70
    CBPS    18.29    16.66    14.76    14.64       6.85     7.04     6.83     6.67
    iCBPS   14.05    12.27    11.95    11.64       6.69     6.77     6.49     6.14
  RMSE
    True    60.99    57.11    52.37    50.10      33.47    30.61    27.67    28.46
    GLM     23.50    19.51    19.59    21.73       7.88     8.05     9.07    12.87
    CBPS    20.10    17.83    15.35    14.70       7.11     7.08     6.87     7.38
    iCBPS   14.64    12.48    12.01    11.64       6.84     6.79     6.50     6.33
4 Nonparametric/Semiparametric iCBPS Method
As seen in Corollary 3.2, the proposed estimator is efficient if both the propensity score P(Ti = 1 | Xi) and the conditional mean functions K(·) and L(·) are correctly specified. To avoid misspecifying the conditional mean functions, we can choose a large number of basis functions h1(·) and h2(·), so that K(·) and L(·) are well approximated by the linear spans of h1(·) and h2(·), respectively; the correct specification of K(·) and L(·) can then be attained. However, the parametric assumption on the propensity score model, P(Ti = 1 | Xi) = πβ∗(Xi), imposed in Corollary 3.2 can be restrictive and prone to violation. Once the propensity score model is misspecified, the proposed iCBPS-based IPTW estimator µ̂β̂ may be inefficient and even inconsistent. To relax the strong parametric assumptions imposed in the previous sections, in what follows we propose a flexible nonparametric/semiparametric approach for modeling the propensity score and the conditional mean functions. The main advantage of this nonparametric approach is that the resulting iCBPS-based IPTW estimator is semiparametrically efficient under a much broader class of propensity score and conditional mean models than those in Corollary 3.2.
More specifically, we assume P(Ti = 1 |Xi) = J(m∗(Xi)), where J(·) is a known monotonic link function (e.g., J(v) = exp(v)/(1 + exp(v))) and m∗(·) is an unknown smooth function. One practical way to estimate m∗(·) is to approximate it by a linear combination of K basis functions, where K is allowed to grow with n. This approach is known as sieve estimation (Andrews, 1991; Newey, 1997). In detail, let B(x) = (b1(x), ..., bK(x))⊤ denote a collection of K basis functions, whose mathematical requirements are given in Assumption 4.1. Intuitively, we would like to approximate m∗(x) by β∗⊤B(x) for some coefficient vector β∗ ∈ RK.
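The sieve idea can be illustrated in a few lines: project a smooth function onto a growing polynomial basis by least squares and watch the sup-norm approximation error fall as K increases. The toy target sin(2x) and the grid are illustrative assumptions.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 201)
m_star = np.sin(2.0 * x)   # toy stand-in for the unknown smooth m*(x)

def sup_error(K):
    """Sup-norm error of the least-squares fit with K polynomial basis functions."""
    B = np.vander(x, K, increasing=True)              # basis {1, x, ..., x^(K-1)}
    coef, *_ = np.linalg.lstsq(B, m_star, rcond=None)
    return np.abs(B @ coef - m_star).max()

errs = [sup_error(K) for K in (2, 4, 6)]
# The sup-norm error shrinks as the number of basis functions K grows,
# mirroring the O(K^{-r_b}) approximation rate assumed in Theorem 4.1.
```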
To estimate β∗, similarly to the parametric case, we define $\bar g_\beta(T,X) = n^{-1}\sum_{i=1}^n g_\beta(T_i,X_i)$, where $g_\beta(T_i,X_i) = (g_{1\beta}^\top(T_i,X_i), g_{2\beta}^\top(T_i,X_i))^\top$ with
\[
g_{1\beta}(T_i,X_i) = \left(\frac{T_i}{J(\beta^\top B(X_i))} - \frac{1-T_i}{1-J(\beta^\top B(X_i))}\right) h_1(X_i),
\qquad
g_{2\beta}(T_i,X_i) = \left(\frac{T_i}{J(\beta^\top B(X_i))} - 1\right) h_2(X_i).
\]
Recall that h1(X) ∈ Rm1 and h2(X) ∈ Rm2 are interpreted as the basis functions for K(X) and L(X). Let m1 + m2 = m and h(X) = (h1(X)⊤, h2(X)⊤)⊤. Here, we assume m = K, so that the number of equations in ḡβ(T,X) is identical to the dimension of the parameter β. Then define β̂ = arg minβ∈Θ ‖ḡβ(T,X)‖₂², where Θ is the parameter space for β and ‖v‖₂ denotes the L2 norm of the vector v. The resulting IPTW estimator is
\[
\hat\mu_{\hat\beta} = \frac{1}{n}\sum_{i=1}^n
\left(\frac{T_iY_i}{J(\hat\beta^\top B(X_i))} - \frac{(1-T_i)Y_i}{1-J(\hat\beta^\top B(X_i))}\right).
\]
The following assumptions are imposed to establish the large-sample properties of µ̂β̂.
Assumption 4.1. The following regularity conditions are assumed.
1. The minimizer βo = arg minβ∈Θ ‖E(gβ(T,X))‖₂² is unique.
2. βo ∈ int(Θ), where Θ is a compact set.
3. There exist constants 0 < c0 < 1/2, c1 > 0 and c2 > 0 such that c0 ≤ J(v) ≤ 1 − c0 and 0 < c1 ≤ ∂J(v)/∂v ≤ c2 for any v = β⊤B(x) with β ∈ int(Θ). There exists a small neighborhood B of v∗ = β∗⊤B(x) such that |∂²J(v)/∂v²| ≤ c3 for any v ∈ B and some constant c3 > 0.
4. E|Y(1)|² < ∞ and E|Y(0)|² < ∞.
5. Let G∗ := E[B(Xi)h(Xi)⊤∆i(m∗(Xi))], where ∆i(m(Xi)) = diag(ξi(m(Xi))1m1, φi(m(Xi))1m2) is a K × K diagonal matrix with
\[
\xi_i(m(X_i)) = -\left(\frac{T_i}{J^2(m(X_i))} + \frac{1-T_i}{(1-J(m(X_i)))^2}\right)\frac{\partial J(m(X_i))}{\partial m},
\qquad
\phi_i(m(X_i)) = -\frac{T_i}{J^2(m(X_i))}\,\frac{\partial J(m(X_i))}{\partial m}.
\]
Here, 1m1 is a vector of 1's of length m1. Assume that there exists a constant C1 > 0 such that λmin(G∗⊤G∗) ≥ C1, where λmin(·) denotes the smallest eigenvalue of a matrix.
6. For some constant C, it holds that ‖E[h(Xi)h(Xi)⊤]‖₂ ≤ C and ‖E[B(Xi)B(Xi)⊤]‖₂ ≤ C, where ‖A‖₂ denotes the spectral norm of the matrix A. In addition, supx∈X ‖h(x)‖₂ ≤ CK^{1/2} and supx∈X ‖B(x)‖₂ ≤ CK^{1/2}.
7. Let m∗(·) ∈ M and K(·), L(·) ∈ H, where M and H are two sets of smooth functions. Assume that log N[](ε, M, L2(P)) ≤ C(1/ε)^{1/k1} and log N[](ε, H, L2(P)) ≤ C(1/ε)^{1/k2}, where C is a positive constant and k1, k2 > 1/2. Here, N[](ε, M, L2(P)) denotes the minimum number of ε-brackets needed to cover M; see Definition 2.1.6 of van der Vaart and Wellner (1996).
Note that the first five conditions are similar to Assumptions 3.1 and 3.2. In particular, condition 5 is the natural extension of condition 1 in Assumption 3.2 when the dimension of the matrix G∗ grows with the sample size n. Condition 6 is a mild technical condition on the basis functions h(x) and B(x), which is implied by Assumption 2 of Newey (1997). In particular, this condition is satisfied by many bases, such as regression splines, trigonometric polynomials, and wavelet bases; see Newey (1997); Horowitz et al. (2004); Chen (2007); Belloni et al. (2015). Finally, condition 7 is a technical condition on the complexity of the function classes M and H. Specifically, it requires that the bracketing numbers N[](ε, ·, L2(P)) of M and H cannot increase too fast as ε approaches 0. This condition holds for many commonly used function classes. For instance, if M corresponds to the Hölder class with smoothness parameter s defined on a bounded convex subset of Rd, then log N[](ε, M, L2(P)) ≤ C(1/ε)^{d/s} by Corollary 2.6.2 of van der Vaart and Wellner (1996). Hence, this condition simply requires s/d > 1/2. Given Assumption 4.1, the following theorem establishes the asymptotic normality and semiparametric efficiency of the estimator µ̂β̂.
Theorem 4.1. Assume that Assumption 4.1 holds and that there exist rb, rh > 1/2, β∗ ∈ RK, and α∗ = (α∗1⊤, α∗2⊤)⊤ ∈ RK such that the propensity score model satisfies
\[
\sup_{x\in\mathcal X} |m^*(x) - \beta^{*\top}B(x)| = O(K^{-r_b}), \tag{4.1}
\]
and the outcome models K(·) and L(·) satisfy
\[
\sup_{x\in\mathcal X} |K(x) - \alpha_1^{*\top}h_1(x)| = O(K^{-r_h}),
\qquad
\sup_{x\in\mathcal X} |L(x) - \alpha_2^{*\top}h_2(x)| = O(K^{-r_h}). \tag{4.2}
\]
Assume K = o(n^{1/3}) and n^{1/(2(r_b+r_h))} = o(K). Then
\[
n^{1/2}(\hat\mu_{\hat\beta} - \mu) \stackrel{d}{\longrightarrow} N(0, V_{\mathrm{opt}}),
\]
where Vopt is the asymptotic variance bound in (2.6). Thus, µ̂β̂ is semiparametrically efficient.
This theorem can be viewed as a nonparametric extension of Corollary 3.2. It shows that one can construct an efficient estimator of the treatment effect without imposing strong parametric assumptions on the propensity score model or the outcome model.
In the following, we comment on the technical assumptions of this theorem. We assume that m∗(x) and K(x) (also L(x)) can be uniformly approximated by the basis functions B(x) and h1(x) (also h2(x)), respectively. It is well known that the uniform rate of approximation is related to the smoothness of the functions m∗(x) and K(x) (also L(x)) and to the dimension of X. For instance, if the function class M for m∗(x) and H for K(x) (also L(x)) correspond to the Hölder class with smoothness parameter s on the domain X = [0, 1]d, then the uniform approximation holds for the spline basis and the wavelet basis with rb = rh = s/d; see Newey (1997); Chen (2007) for details. Thus, Theorem 4.1 is directly applicable to this setting, in which no additional structural assumptions are imposed on m∗(x) and K(x) (also L(x)). Compared to the nonparametric results of Hirano et al. (2003), which require s/d > 7, and Chan et al. (2015), which require s/d > 13, our theorem needs a much weaker condition, i.e., s/d > 3/4. To achieve this improved result, our proof exploits matrix Bernstein concentration inequalities (Tropp, 2015) and a Bernstein-type concentration inequality for U-statistics (Arcones, 1995).
When d is large, to handle the curse of dimensionality while retaining the flexibility of the model, we may assume the additive structure m∗(x) = Σ_{j=1}^d m∗j(xj); see Stone (1985); Hastie and Tibshirani (1990); Horowitz et al. (2004); Fan and Jiang (2005), among others. One major advantage of the sieve estimation approach is that such structural assumptions are convenient to incorporate, and the results in Theorem 4.1 remain applicable. In this case, the uniform approximation holds with rb = s, where s is the common smoothness parameter of the Hölder class for all {m∗j(x)}_{j=1,...,d}. In particular, if m∗(x) and K(x) (also L(x)) all have additive structures and each unknown component lies in the Hölder class with s = 1, then Theorem 4.1 holds provided K = o(n^{1/3}) and n^{1/4} = o(K). Similarly to the additive model, our method can easily be extended to the partially linear model for m∗(x) and K(x) (also L(x)) by redefining the basis functions B(x) and h(x). As discussed in Section 5 of Newey (1997), the large-sample results (e.g., our Theorem 4.1) for sieve estimators are directly applicable to the partially linear case. Thus, our Theorem 4.1 provides a unified semiparametric efficiency result under many nonparametric/semiparametric settings, which is more comprehensive than the existing fully nonparametric propensity score methods of Hirano et al. (2003); Chan et al. (2015).
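A minimal sketch of an additive sieve design matrix: each coordinate gets its own one-dimensional basis and the pieces are concatenated, so the total number of basis functions grows linearly in d rather than exponentially. The polynomial basis is an illustrative choice; splines or wavelets would serve equally well.

```python
import numpy as np

def additive_basis(X, K_per_dim):
    """Design matrix for an additive sieve m(x) = sum_j m_j(x_j):
    a global intercept plus, for each coordinate j, the powers x_j, ..., x_j^K."""
    cols = [np.ones((X.shape[0], 1))]
    for j in range(X.shape[1]):
        cols.append(np.column_stack([X[:, j] ** p for p in range(1, K_per_dim + 1)]))
    return np.hstack(cols)

X = np.random.default_rng(3).normal(size=(50, 4))
B = additive_basis(X, 3)
# Total number of basis functions: 1 + d * K_per_dim = 1 + 4 * 3 = 13,
# linear in d, versus (K_per_dim + 1)^d for a full tensor-product basis.
```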
Interestingly, we find that the asymptotic bias of the estimator µ̂β̂ is proportional to K^{−(rb+rh)}, which is the product of the approximation errors for m∗(x) and K(x) (also L(x)). Thus, to make the bias of the estimator µ̂β̂ asymptotically negligible, it suffices that either rb or rh be sufficiently large, owing to the multiplicative form of the asymptotic bias. This phenomenon can be viewed as a nonparametric version of the double robustness property, in the sense that the asymptotic normality and semiparametric efficiency results hold if either the propensity score model or the outcome model is approximated reasonably well by the basis functions. To the best of our knowledge, this property has not appeared in the existing literature. In fact, in this theorem it is not necessary to assume that both the propensity score model and the outcome models are approximated with very high accuracy (i.e., that both rb and rh are large).
5 Concluding Remarks
This paper presents a theoretical investigation of the covariate balancing propensity score methodology, which others have found to work well in practice (e.g., Wyss et al., 2014; Frolich et al., 2015). We derive the optimal choice of the covariate balancing function so that the resulting IPTW estimator is first-order unbiased under local misspecification of the propensity score model. We then show that this choice is also optimal under arbitrary misspecification. Furthermore, it turns out that this covariate balancing function is optimal even under correct specification, attaining the semiparametric efficiency bound.
Given these theoretical insights, we propose an improvement to the standard CBPS methodology by carefully choosing the covariate balancing estimating functions. We prove that the proposed iCBPS-based IPTW estimator is consistent if either the outcome or the propensity score model is correct. In addition to this double robustness property, the proposed estimator is semiparametrically efficient. The advantage of the proposed methodology over the standard doubly robust estimators is that it does not require the separate estimation of an outcome model. Our simulation results confirm the theoretical results, demonstrating the advantages of the proposed iCBPS methodology.
To relax the parametric assumptions and improve the double robustness property, we further ex-
tend the iCBPS method to the nonparametric/semiparametric settings. We show that the proposed
estimator can achieve the semiparametric efficiency bound without imposing parametric assump-
tions on the propensity score model and the outcome model. A new nonparametric version of the
double robustness property is discovered. Finally, we note that our theoretical results require much
weaker technical conditions and cover a broader class of semiparametric models than the existing
nonparametric propensity score methods.
References
Andrews, D. W. (1991). Asymptotic normality of series estimators for nonparametric and semi-
parametric regression models. Econometrica 307–345.
Arcones, M. A. (1995). A Bernstein-type inequality for U-statistics and U-processes. Statistics & Probability Letters 22 239–247.
Belloni, A., Chernozhukov, V., Chetverikov, D. and Kato, K. (2015). Some new asymp-
totic theory for least squares series: Pointwise and uniform results. Journal of Econometrics 186
345–366.
Chan, K., Yam, S. and Zhang, Z. (2015). Globally efficient nonparametric inference of average treatment effects by empirical balancing calibration weighting. Journal of the Royal Statistical Society, Series B (Statistical Methodology), forthcoming.
Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of
econometrics 6 5549–5632.
Chen, X. and Christensen, T. M. (2015). Optimal uniform convergence rates and asymp-
totic normality for series estimators under weak dependence and weak conditions. Journal of
Econometrics 188 447–465.
Fan, J. and Jiang, J. (2005). Nonparametric inferences for additive models. Journal of the
American Statistical Association 100 890–907.
Fong, C., Hazlett, C. and Imai, K. (2015). Covariate balancing propensity score for general
treatment regimes. Tech. rep., Department of Politics, Princeton University.
Frolich, M., Huber, M. and Wiesenfarth, M. (2015). The finite sample performance of
semi- and nonparametric estimators for treatment effects and policy evaluation. Tech. rep., IZA
Discussion Paper No. 8756.
Graham, B. S., Pinto, C. and Egel, D. (2012). Inverse probability tilting for moment condition
models with missing data. Review of Economic Studies 79 1053–1079.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of
average treatment effects. Econometrica 315–331.
Hainmueller, J. (2012). Entropy balancing for causal effects: Multivariate reweighting method
to produce balanced samples in observational studies. Political Analysis 20 25–46.
Hansen, B. E. (2014). A unified asymptotic distribution theory for parametric and non-parametric
least squares. Tech. rep., Working paper, University of Wisconsin.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators.
Econometrica 50 1029–1054.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized additive models, vol. 43. CRC Press.
Henmi, M. and Eguchi, S. (2004). A paradox concerning nuisance parameters and projected
estimating functions. Biometrika 91 929–941.
Hirano, K., Imbens, G. and Ridder, G. (2003). Efficient estimation of average treatment effects
using the estimated propensity score. Econometrica 71 1307–1338.
Horowitz, J. L., Mammen, E. et al. (2004). Nonparametric estimation of an additive model
with a link function. The Annals of Statistics 32 2412–2443.
Horvitz, D. and Thompson, D. (1952). A generalization of sampling without replacement from
a finite universe. Journal of the American Statistical Association 47 663–685.
Imai, K. and Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal
Statistical Society, Series B (Statistical Methodology) 76 243–263.
Imai, K. and Ratkovic, M. (2015). Robust estimation of inverse probability weights for marginal
structural models. Journal of the American Statistical Association 110 1013–1023.
Kang, J. D. Y. and Schafer, J. L. (2007). Demystifying double robustness: a comparison of
alternative strategies for estimating a population mean from incomplete data. Statist. Sci. 22
574–580.
Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal
of Econometrics 79 147–168.
Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing.
Handbook of econometrics 4 2111–2245.
Owen, A. B. (2001). Empirical Likelihood. Chapman & Hall/CRC, New York.
Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994). Estimation of regression coefficients
when some regressors are not always observed. Journal of the American Statistical Association
89 846–866.
Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1995). Analysis of semiparametric regression
models for repeated outcomes in the presence of missing data. Journal of the American Statistical
Association 90 106–121.
Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in obser-
vational studies for causal effects. Biometrika 70 41–55.
Rubin, D. B. (1990). Comments on “On the application of probability theory to agricultural
experiments. Essay on principles. Section 9” by J. Splawa-Neyman translated from the Polish
and edited by D. M. Dabrowska and T. P. Speed. Statistical Science 5 472–480.
Stone, C. J. (1985). Additive regression and other nonparametric models. The Annals of Statistics 689–705.
Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika
97 661–682.
Tropp, J. A. (2015). An introduction to matrix concentration inequalities. arXiv preprint
arXiv:1501.01571 .
Tsiatis, A. (2007). Semiparametric theory and missing data. Springer Science & Business Media.
van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes: With
Applications to Statistics. Springer Science & Business Media.
Van der Vaart, A. W. (2000). Asymptotic statistics, vol. 3. Cambridge university press.
Wyss, R., Ellis, A. R., Brookhart, M. A., Girman, C. J., Funk, M. J., LoCasale, R.
and Sturmer, T. (2014). The role of prediction modeling in propensity score estimation: An
evaluation of logistic regression, bCART, and the covariate-balancing propensity score. American
Journal of Epidemiology 180 645–655.
A Preliminaries
To simplify the notation, we write π∗i = πβ∗(Xi) and πoi = πβo(Xi). For any vector C ∈ RK, we denote |C| = (|C1|, ..., |CK|)⊤ and write C ≤ B if Ck ≤ Bk for all 1 ≤ k ≤ K.
Assumption A.1. (Regularity Conditions for CBPS in Section 2)
1. There exists a positive definite matrix W∗ such that W →p W∗.
2. The minimizer βo = argminβ E(gβ(T ,X))>W∗E(gβ(T ,X)) is unique.
3. βo ∈ int(Θ), where Θ is a compact set.
4. πβ(X) is continuous in β.
5. There exists a constant 0 < c0 < 1/2 such that with probability tending to one, c0 ≤ πβ(X) ≤ 1 − c0, for any β ∈ int(Θ).
6. E|fj(X)| <∞ for 1 ≤ j ≤ m and E|Y (1)|2 <∞, E|Y (0)|2 <∞.
7. G∗ := E(∂g(βo)/∂β) exists and there is a q-dimensional function C(X) and a small constant
r > 0 such that supβ∈Br(βo) |∂πβ(X)/∂β| ≤ C(X) and E(|fj(X)|C(X)) <∞ for 1 ≤ j ≤ m,
where Br(βo) is a ball in Rq with radius r and center βo. In addition, E(|Y |C(X)) <∞.
8. G∗>W∗G∗ and E(gβo(Ti,Xi)gβo(Ti,Xi)>) are nonsingular.
9. In the locally misspecified model (2.1), assume |u(X;β∗)| ≤ 1 almost surely.
Lemma A.1 (Lemma 2.4 in Newey and McFadden (1994)). Assume that the data Zi are i.i.d., Θ is compact, a(Z, θ) is continuous for θ ∈ Θ, and there is D(Z) with |a(Z, θ)| ≤ D(Z) for all θ ∈ Θ and E(D(Z)) < ∞. Then E(a(Z, θ)) is continuous and
\[
\sup_{\theta\in\Theta}\Big|n^{-1}\sum_{i=1}^n a(Z_i,\theta) - E(a(Z,\theta))\Big| \stackrel{p}{\longrightarrow} 0.
\]
Lemma A.2. Under Assumption A.1 (or Assumption 3.1), we have β̂ →p βo. Moreover, if P(Ti = 1 |Xi) = πβ∗(Xi), then β̂ →p β∗.
Proof of Lemma A.2. The proof of β̂ →p βo follows from Theorem 2.6 in Newey and McFadden (1994). Note that their conditions (i)–(iii) follow directly from Assumption 3.1 (1)–(4). We only need to verify their condition (iv), i.e., E(supβ∈Θ |gβj(Ti,Xi)|) < ∞, where
\[
g_{\beta j}(T_i,X_i) = \left(\frac{T_i}{\pi_\beta(X_i)} - \frac{1-T_i}{1-\pi_\beta(X_i)}\right) f_j(X_i).
\]
By Assumption A.1 (5), we have |gβj(Ti,Xi)| ≤ 2|fj(Xi)|/c0, and thus E(supβ∈Θ |gβj(Ti,Xi)|) < ∞ by Assumption A.1 (6). Moreover, if P(Ti = 1 | Xi) = πβ∗(Xi), we show in the following that β∗ = βo. Since P(Ti = 1 | Xi) = πβ∗(Xi) implies E(gβ∗(Ti,Xi)) = 0, we have E(gβ∗⊤(Ti,Xi))W∗E(gβ∗(Ti,Xi)) = 0. Since W∗ is positive definite, E(gβ⊤(Ti,Xi))W∗E(gβ(Ti,Xi)) ≥ 0 for every β. Hence β∗ is a minimizer of E(gβ⊤(Ti,Xi))W∗E(gβ(Ti,Xi)), and by the uniqueness of the minimizer we conclude that β∗ = βo. In addition, for the proof of Theorem 3.1, we similarly verify the corresponding conditions for the iCBPS estimator, i.e., E(supβ∈Θ |g1βj(Ti,Xi)|) < ∞ and E(supβ∈Θ |g2βj(Ti,Xi)|) < ∞, where
\[
g_{1\beta j}(T_i,X_i) = \left(\frac{T_i}{\pi_\beta(X_i)} - \frac{1-T_i}{1-\pi_\beta(X_i)}\right) h_{1j}(X_i),
\qquad
g_{2\beta j}(T_i,X_i) = \left(\frac{T_i}{\pi_\beta(X_i)} - 1\right) h_{2j}(X_i).
\]
We have |g1βj(Ti,Xi)| ≤ 2|h1j(Xi)|/c0, and thus E(supβ∈Θ |g1βj(Ti,Xi)|) < ∞. Similarly, we can prove E(supβ∈Θ |g2βj(Ti,Xi)|) < ∞. This completes the proof.
Lemma A.3. Under Assumption A.1 (or Assumptions 3.1 and 3.2), we have
\[
n^{1/2}(\hat\beta - \beta^o) = -(H_f^{*\top}W^*H_f^*)^{-1} n^{1/2} H_f^{*\top}W^*\,\bar g_{\beta^o}(T,X) + o_p(1), \tag{A.1}
\]
\[
n^{1/2}(\hat\beta - \beta^o) \stackrel{d}{\longrightarrow}
N\big(0,\ (H_f^{*\top}W^*H_f^*)^{-1}H_f^{*\top}W^*\Omega W^*H_f^*(H_f^{*\top}W^*H_f^*)^{-1}\big), \tag{A.2}
\]
where Ω = Var(gβ∗(Ti,Xi)). If the propensity score model is correctly specified with P(Ti = 1 | Xi) = πβ∗(Xi) and W∗ = Ω−1 holds, then n^{1/2}(β̂ − β∗) →d N(0, (Hf∗⊤Ω−1Hf∗)−1).
Proof. The proof of (A.1) and (A.2) follows from Theorem 3.4 in Newey and McFadden (1994). Their conditions (i), (ii), (iii) and (v) are directly implied by our Assumption A.1 (3), (4), (2) and Assumption A.1 (1), respectively. In addition, their condition (iv), that is, E(supβ∈N |∂gβ(Ti,Xi)/∂βj|) < ∞ for some small neighborhood N around βo, is also implied by our Assumption A.1. To see this, some simple calculations show that

supβ∈N |∂gβ(Ti,Xi)/∂βj| ≤ (Ti|f(Xi)|/c20 + (1 − Ti)|f(Xi)|/c20) supβ∈N |∂πβ(Xi)/∂βj| ≤ Cj(X)|f(Xi)|/c20,

for N ⊆ Br(βo). Hence E(supβ∈N |∂gβ(Ti,Xi)/∂βj|) < ∞ by Assumption A.1 (7). Thus,
condition (iv) in Theorem 3.4 in Newey and McFadden (1994) holds. In order to apply this lemma
to the proofs in Section 3, we need to further verify this condition for gβ(·) = (g>1β(·), g>2β(·))>,
where
g1β(Ti,Xi) = (Ti/πβ(Xi) − (1 − Ti)/(1 − πβ(Xi))) h1(Xi), and g2β(Ti,Xi) = (Ti/πβ(Xi) − 1) h2(Xi).
To this end, by Assumption 3.1, some simple calculations show that

supβ∈N |∂g1β(Ti,Xi)/∂βj| ≤ (Ti|h1(Xi)|/c20 + (1 − Ti)|h1(Xi)|/c20) supβ∈N |∂πβ(Xi)/∂βj| ≤ Cj(X)|h1(Xi)|/c20,

for N ⊆ Br(βo). Hence E(supβ∈N |∂g1β(Ti,Xi)/∂βj|) < ∞ by Assumption 3.1 (7). Following similar arguments, we can prove that E(supβ∈N |∂g2β(Ti,Xi)/∂βj|) < ∞ holds. This completes the proof of (A.2). As shown in Lemma A.2, if P(Ti = 1 | Xi) = πβ∗(Xi) holds, then βo = β∗. Thus, the asymptotic normality of n1/2(β − β∗) follows from (A.2). The proof is complete.
B Proof of Results in Section 2
B.1 Proof of Theorem 2.1
Proof. First, we derive the bias of β. By the arguments in the proof of Lemma A.3, we can show that β = βo + Op(n−1/2), where βo satisfies βo = argminβ E(gβ(T,X))>W∗E(gβ(T,X)). Let u∗i = u(Xi;β∗). By the propensity score model and the facts that |u(Xi;β∗)| is a bounded random variable and E|fj(Xi)| < ∞, we can show that

E(gβo) = E[π∗i(1 + ξu∗i)f(Xi)/πoi − (1 − π∗i − ξπ∗iu∗i)f(Xi)/(1 − πoi)] + O(ξ2).
In addition, following a similar calculation, we have E(gβ∗) = O(ξ). Therefore,

limn→∞ E(gβ∗(T,X))>W∗E(gβ∗(T,X)) = 0.
Clearly, the quadratic form E(gβ(T,X))>W∗E(gβ(T,X)) is nonnegative for any β. By the uniqueness of βo, we have βo − β∗ = o(1). Therefore, we can expand πoi around π∗i, which yields

E(gβo) = E[ξ(u∗i/(1 − π∗i))f(Xi)] + H∗f(βo − β∗) + O(ξ2 + ‖βo − β∗‖22).

This implies that the bias of βo is

βo − β∗ = −ξ(H∗>f W∗H∗f)−1 H∗>f W∗ E[(u∗i/(1 − π∗i))f(Xi)] + O(ξ2).  (B.1)
Our next step is to derive the bias of µβ. Similar to the proof of Theorem 3.2, we have

µβ − µ = (1/n)∑ni=1 Di + H∗>y(β − βo) + op(n−1/2),

where

Di = TiYi(1)/πoi − (1 − Ti)Yi(0)/(1 − πoi) − µ,

and

n1/2(β − βo) = −(H∗>f W∗H∗f)−1 n1/2 H∗>f W∗ gβo(T,X) + op(1).
By the dominated convergence theorem, the bias of µβ can be represented as

E(µβ) − µ = E(Di) + o(n−1/2).

By ignorability of the treatment assignment and the definition of the true propensity score,

E(µβ) − µ = E[π∗i(1 + ξu∗i)Yi(1)/πoi − (1 − π∗i − ξπ∗iu∗i)Yi(0)/(1 − πoi)] − µ + o(ξ).
Similarly, we expand πoi around π∗i,

E(µβ) − µ = E[Yi(1) − Yi(0) + ξ(u∗iYi(1) + (u∗iπ∗i/(1 − π∗i))Yi(0))] + H∗>y(βo − β∗) − µ + o(ξ),  (B.2)

where

H∗y = −E[((1 − π∗i)Yi(1) + π∗iYi(0))f(Xi)].
Plugging (B.1) into the above equation, the bias of µβ is

E(µβ) − µ = ξE[u∗iYi(1) + (u∗iπ∗i/(1 − π∗i))Yi(0)] − ξH∗>y(H∗>f W∗H∗f)−1H∗>f W∗E[(u∗i/(1 − π∗i))f(Xi)] + o(ξ)
= ξE[u∗i(K(Xi) + L(Xi)(1 − π∗i))/(1 − π∗i)] − ξH∗>y(H∗>f W∗H∗f)−1H∗>f W∗E[u∗if(Xi)/(1 − π∗i)] + o(ξ).

This completes the proof.
B.2 Proof of Corollary 2.1
Proof. When H∗f is invertible, it is easy to show that the bias term can be written as

B = E[u(Xi;β∗)(K(Xi) + (1 − πβ∗(Xi))L(Xi))/(1 − πβ∗(Xi))] − H∗>y H∗−1f E[u(Xi;β∗)f(Xi)/(1 − πβ∗(Xi))],
when the propensity score model is locally misspecified. If we choose the balancing function f(X)
such that α>f(X) = K(Xi) + (1− π∗i )L(Xi) for some α ∈ Rq, we have
H∗>y = −E[(K(Xi) + (1 − π∗i)L(Xi))/(π∗i(1 − π∗i)) · (∂π∗i/∂β)>] = −α>E[(f(Xi)/(π∗i(1 − π∗i)))(∂π∗i/∂β)>],

H∗f = E(∂gβ∗(Ti,Xi)/∂β) = −E[(f(Xi)/(π∗i(1 − π∗i)))(∂π∗i/∂β)>].
So the bias becomes

B = α>E[u(Xi;β∗)f(Xi)/(1 − πβ∗(Xi))] − α>H∗f(H∗f)−1E[u(Xi;β∗)f(Xi)/(1 − πβ∗(Xi))] = 0.

This proves that µβ is first-order unbiased.
B.3 Proof of Theorem 2.2
Proof. The proof is similar to that of Theorem 3.2. We omit the details.
B.4 Proof of Corollary 2.2
Proof. The asymptotic variance bound Vopt can be written as Vopt = Σµ − α>Ωα, where

Ω = E(gβo(Ti,Xi)gβo(Ti,Xi)>) = E[f(Xi)f(Xi)>/(π∗i(1 − π∗i))].
We can write the asymptotic variance of our estimator as

V = Σµ + 2H∗>yΣµβ + H∗>yΣβH∗y,

where

H∗y = E(∂µβ∗(Ti,Yi,Xi)/∂β) = −E[(K(Xi) + (1 − π∗i)L(Xi))/(π∗i(1 − π∗i)) ∂π∗i/∂β],
Σµβ = −(H∗f)−1 Cov(µβ∗(Ti,Yi,Xi), gβ∗(Ti,Xi)),
H∗f = E(∂gβ∗(Ti,Xi)/∂β) = −E[(f(Xi)/(π∗i(1 − π∗i)))(∂π∗i/∂β)>],
Cov(µβ∗(Ti,Yi,Xi), gβ∗(Ti,Xi)) = E[(K(Xi) + (1 − π∗i)L(Xi))/(π∗i(1 − π∗i)) f(Xi)],
Σβ = (H∗f)−1 Var(gβ∗(Ti,Xi)) (H∗>f)−1,
Var(gβ∗(Ti,Xi)) = E[f(Xi)f(Xi)>/(π∗i(1 − π∗i))].
If K(Xi) + (1 − π∗i)L(Xi) lies in the linear space spanned by f(Xi), that is, K(Xi) + (1 − π∗i)L(Xi) = α>f(Xi), we have

H∗y = −E[(α>f(Xi)/(π∗i(1 − π∗i))) ∂π∗i/∂β] = (α>H∗f)>.
So

H∗>yΣµβ = −α>H∗f(H∗f)−1E[f(Xi)f(Xi)>α/(π∗i(1 − π∗i))] = −α>E[f(Xi)f(Xi)>/(π∗i(1 − π∗i))]α,
and

H∗>yΣβH∗y = α>H∗f(H∗f)−1E[f(Xi)f(Xi)>/(π∗i(1 − π∗i))](H∗>f)−1(α>H∗f)> = α>E[f(Xi)f(Xi)>/(π∗i(1 − π∗i))]α.
It is seen that H∗>yΣµβ = −H∗>yΣβH∗y. Then we have V = Σµ − α>Ωα, which coincides with the minimum asymptotic variance Vopt.
B.5 Asymptotic Distribution under Local Misspecification
Theorem B.1. Under the locally misspecified propensity score model in (2.1) with ξ = n−1/2 and Assumption A.1 in Appendix A, the estimator µβ in (1.5), where β is obtained by (1.7), has the following asymptotic distribution:

√n(µβ − µ) →d N(B, H∗>ΣH∗),  (B.3)

where B is the first-order bias given in equation (2.2) of Theorem 2.1.
Proof. Recall that in the proof of Theorem 2.1 we obtain

µβ − µ = (1/n)∑ni=1 Di + H∗>y(β − βo) + op(n−1/2),

where

Di = TiYi(1)/πoi − (1 − Ti)Yi(0)/(1 − πoi) − µ,

and

n1/2(β − βo) = −(H∗>f W∗H∗f)−1 n1/2 H∗>f W∗ gβo(T,X) + op(1).

In addition, we have shown that E(Di) = Bn−1/2 + o(n−1/2). Thus,

µβ − µ = (1/n)∑ni=1 [Di − E(Di)] + H∗>y(β − βo) + Bn−1/2 + op(n−1/2).

Then the asymptotic normality of √n(µβ − µ) follows from the above asymptotic expansion and the central limit theorem. The detailed calculation of the asymptotic variance is similar to Theorem 2.2 and is omitted for simplicity.
C Proof of Results in Section 3
C.1 Proof of Theorem 3.1
Proof of Theorem 3.1. We first consider case (1), that is, the propensity score model is correctly specified. By Lemma A.2, we have β →p β∗. Let

rβ(T,Y,X) = TY/πβ(X) − (1 − T)Y/(1 − πβ(X)).
It is seen that |rβ(T,Y,X)| ≤ 2|Y|/c0 and, by Assumption 3.1 (6), E|Y| < ∞. Then Lemma A.1 yields supβ∈Θ |n−1∑ni=1 rβ(Ti,Yi,Xi) − E(rβ(Ti,Yi,Xi))| = op(1). In addition, by β →p β∗ and the dominated convergence theorem, we obtain that

µβ = E[TiYi/π∗i − (1 − Ti)Yi/(1 − π∗i)] + op(1),
where π∗i = πβ∗(Xi). Since Yi = Yi(1)Ti + Yi(0)(1 − Ti) and Yi(1), Yi(0) are independent of Ti given Xi, we can further simplify the above expression,

µβ = E[TiYi/π∗i − (1 − Ti)Yi/(1 − π∗i)] + op(1) = E[TiYi(1)/π∗i − (1 − Ti)Yi(0)/(1 − π∗i)] + op(1)
= E[E(Ti | Xi)E(Yi(1) | Xi)/π∗i − (1 − E(Ti | Xi))E(Yi(0) | Xi)/(1 − π∗i)] + op(1).

In addition, if the propensity score model is correctly specified, it further implies

µβ = E(E(Yi(1) | Xi) − E(Yi(0) | Xi)) + op(1) = E(Yi(1) − Yi(0)) + op(1) = µ + op(1).
This completes the proof of consistency of µβ when the propensity score model is correctly specified. In the following, we consider case (2), that is, K(·) ∈ span{M1h1(·)} and L(·) ∈ span{M2h2(·)}. By Lemma A.2, we have β →p βo. The first-order condition for βo yields ∂Q(βo)/∂β = 0, where Q(β) = E(g>β)W∗E(gβ). By Assumption 3.1 (7) and the dominated convergence theorem, we can interchange differentiation and integration, and thus G∗>W∗E(gβo) = 0. Under the assumption that P(Ti = 1 | Xi) = π(Xi) ≠ πoi, we have
E(g1βo) = E[(π(Xi)/πoi − (1 − π(Xi))/(1 − πoi)) h1(Xi)],
E(g2βo) = E[(π(Xi)/πoi − 1) h2(Xi)].

Rewrite G∗>W∗ = (M1, M2), where M1 ∈ Rq×m1 and M2 ∈ Rq×m2. Then, βo satisfies

E[(π(Xi)/πoi − (1 − π(Xi))/(1 − πoi)) M1h1(Xi) + (π(Xi)/πoi − 1) M2h2(Xi)] = 0.  (C.1)
Following arguments similar to those in case (1), we can prove that

µβ = E[TiYi/πoi − (1 − Ti)Yi/(1 − πoi)] + op(1)
= E[E(Ti | Xi)E(Yi(1) | Xi)/πoi − (1 − E(Ti | Xi))E(Yi(0) | Xi)/(1 − πoi)] + op(1).
By E(Ti | Xi) = π(Xi) and the outcome model, it further implies

µβ − µ = E[π(Xi)(K(Xi) + L(Xi))/πoi − (1 − π(Xi))K(Xi)/(1 − πoi)] − µ + op(1)
= E[(π(Xi)/πoi − (1 − π(Xi))/(1 − πoi)) K(Xi)] + E[π(Xi)L(Xi)/πoi] − µ + op(1)
= E[(π(Xi)/πoi − (1 − π(Xi))/(1 − πoi)) K(Xi)] + E[(π(Xi)/πoi − 1) L(Xi)] + op(1),

where in the last step we use µ = E(L(Xi)). By equation (C.1), we obtain µβ = µ + op(1), provided that K(Xi) = α>1M1h1(Xi) and L(Xi) = α>2M2h2(Xi), where α1 and α2 are q-dimensional vectors of constants. This completes the whole proof.
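The population identity behind case (1) — inverse probability weighting with the true propensity score recovers µ = E(Yi(1) − Yi(0)) — can be confirmed in a quick simulation. The data-generating process below is hypothetical and chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.uniform(-1, 1, size=n)
pi = 1.0 / (1.0 + np.exp(-x))        # true propensity score P(T=1|X)
t = rng.binomial(1, pi)

K = lambda v: 1.0 + v                # E[Y(0)|X] = K(X)
L = lambda v: 2.0 + v ** 2           # treatment effect: E[Y(1) - Y(0)|X] = L(X)
y1 = K(x) + L(x) + rng.normal(size=n)
y0 = K(x) + rng.normal(size=n)
y = t * y1 + (1 - t) * y0            # observed outcome

# Horvitz-Thompson / IPTW estimator with the true propensity score
mu_hat = np.mean(t * y / pi - (1 - t) * y / (1 - pi))
mu_true = 2.0 + 1.0 / 3.0            # E[L(X)] = 2 + E[X^2] for X ~ Uniform(-1, 1)
```

With n = 200,000, mu_hat should sit within a few hundredths of mu_true, matching the op(1) statement in the proof.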
C.2 Proof of Theorem 3.2
Proof of Theorem 3.2. We first consider case (1), that is, the propensity score model is correctly specified. By the mean value theorem, we have µβ = µ̃ + H(β̃)>(β − β∗), where

µ̃ = (1/n)∑ni=1 [TiYi/π∗i − (1 − Ti)Yi/(1 − π∗i)], H(β̃) = −(1/n)∑ni=1 [TiYi/π̃2i + (1 − Ti)Yi/(1 − π̃i)2] ∂π̃i/∂β,
where π∗i = πβ∗(Xi), π̃i = πβ̃(Xi), and β̃ is an intermediate value between β and β∗. By Assumption 3.2 (2), the summand in H(β) has a bounded envelope function. By Lemma A.1, we have supβ∈Br(β∗) |H(β) − E(H(β))| = op(1). Since β is consistent, by the dominated convergence theorem we obtain H(β̃) = H∗ + op(1), where
H∗ = −E[(TiYi/π∗2i + (1 − Ti)Yi/(1 − π∗i)2) ∂π∗i/∂β] = −E[(Yi(1)/π∗i + Yi(0)/(1 − π∗i)) ∂π∗i/∂β] = −E[(K(Xi) + L(Xi)(1 − π∗i))/(π∗i(1 − π∗i)) ∂π∗i/∂β].
Finally, we invoke the central limit theorem and equation (A.1) to obtain that

n1/2(µβ − µ) →d N(0, H̄∗>ΣH̄∗),

where H̄∗ = (1, H∗>)>, Σβ = (G∗>W∗G∗)−1G∗>W∗ΩW∗G∗(G∗>W∗G∗)−1 and

Σ = (Σµ, Σ>µβ; Σµβ, Σβ).
Denote bi(Ti,Xi,Yi(1),Yi(0)) = TiYi(1)/π∗i − (1 − Ti)Yi(0)/(1 − π∗i) − µ. Here, some simple calculations yield

Σµ = E[b2i(Ti,Xi,Yi(1),Yi(0))] = E[Y2i(1)/π∗i + Y2i(0)/(1 − π∗i)] − µ2.
In addition, the off-diagonal block can be written as Σµβ = (Σ>1µβ, Σ>2µβ)>, where

Σµβ = −(G∗>W∗G∗)−1G∗>W∗T,

where T = (E[g>1β∗(Ti,Xi)bi(Ti,Xi,Yi(1),Yi(0))], E[g>2β∗(Ti,Xi)bi(Ti,Xi,Yi(1),Yi(0))])> with

g1β(Ti,Xi) = (Ti/πβ(Xi) − (1 − Ti)/(1 − πβ(Xi))) h1(Xi), and g2β(Ti,Xi) = (Ti/πβ(Xi) − 1) h2(Xi).

After some algebra, we can show that

T = (E[(K(Xi) + (1 − π∗i)L(Xi))/((1 − π∗i)π∗i) h>1(Xi)], E[(K(Xi) + (1 − π∗i)L(Xi))/π∗i h>2(Xi)])>.
This completes the proof of equation (3.3). Next, we consider case (2). Recall that P(Ti = 1 | Xi) = π(Xi) ≠ πβo(Xi). Following similar arguments, we can show that

µβ − µ = (1/n)∑ni=1 Di + H∗>(β − βo) + op(n−1/2),

where

Di = TiYi(1)/πoi − (1 − Ti)Yi(0)/(1 − πoi) − µ,
and

H∗ = −E[(π(Xi)(K(Xi) + L(Xi))/πo2i + (1 − π(Xi))K(Xi)/(1 − πoi)2) ∂πoi/∂β].
By equation (A.1) in Lemma A.3, we have that

n1/2(µβ − µ) →d N(0, H̄∗>ΣH̄∗),

where H̄∗ = (1, H∗>)>, Σβ = (G∗>W∗G∗)−1G∗>W∗ΩW∗G∗(G∗>W∗G∗)−1 and

Σ = (Σµ, Σ>µβ; Σµβ, Σβ).
Denote ci(Ti,Xi,Yi(1),Yi(0)) = TiYi(1)/πoi − (1 − Ti)Yi(0)/(1 − πoi) − µ. As shown in the proof of Theorem 3.1, E[ci(Ti,Xi,Yi(1),Yi(0))] = 0. Thus,

Σµ = E[c2i(Ti,Xi,Yi(1),Yi(0))] = E[TiY2i(1)/πo2i + (1 − Ti)Y2i(0)/(1 − πoi)2] − µ2 = E[π(Xi)Y2i(1)/πo2i + (1 − π(Xi))Y2i(0)/(1 − πoi)2] − µ2.
Similarly, the off-diagonal block can be written as Σµβ = (Σ>1µβ, Σ>2µβ)>, where

Σµβ = −(G∗>W∗G∗)−1G∗>W∗S,

where S = (E[g>1βo(Ti,Xi)ci(Ti,Xi,Yi(1),Yi(0))], E[g>2βo(Ti,Xi)ci(Ti,Xi,Yi(1),Yi(0))])> with

g1β(Ti,Xi) = (Ti/πβ(Xi) − (1 − Ti)/(1 − πβ(Xi))) h1(Xi), and g2β(Ti,Xi) = (Ti/πβ(Xi) − 1) h2(Xi).  (C.2)

After some tedious algebra, we can show that S = (S>1, S>2)>, where

S1 = E[(π(Xi)(K(Xi) + L(Xi) − πoiµ)/πo2i + (1 − π(Xi))(K(Xi) + (1 − πoi)µ)/(1 − πoi)2) h1(Xi)],
S2 = E[(π(Xi)[(K(Xi) + L(Xi))(1 − πoi) − πoiµ]/πo2i + [(1 − π(Xi))K(Xi) + (1 − πoi)µ]/(1 − πoi)) h2(Xi)].
This completes the proof of equation (3.5).
Finally, we prove part (3). By (3.3), the asymptotic variance of µ, denoted by V, can be written as

V = Σµ + 2H∗>Σµβ + H∗>ΣβH∗.  (C.3)

Note that by Lemma A.3, we have Σβ = (G∗>Ω−1G∗)−1. Under the correctly specified propensity score model, some algebra yields

Ω = E[gβ∗(Ti,Xi)g>β∗(Ti,Xi)] = (E[h1h>1/(π∗i(1 − π∗i))], E[h1h>2/π∗i]; E[h2h>1/π∗i], E[h2h>2(1 − π∗i)/π∗i]),
where gβ(Ti,Xi) = (g>1β(Ti,Xi), g>2β(Ti,Xi))> and g1β(Ti,Xi) and g2β(Ti,Xi) are defined in (C.2). In addition, G∗ = (G∗>1, G∗>2)>, where

G∗1 = −E[(h1(Xi)/(π∗i(1 − π∗i)))(∂π∗i/∂β)>], G∗2 = −E[(h2(Xi)/π∗i)(∂π∗i/∂β)>].  (C.4)
Since the functions K(·) and L(·) lie in the linear spaces spanned by the functions M1h1(·) and M2h2(·), respectively, where M1 ∈ Rq×m1 and M2 ∈ Rq×m2 are the partitions of G∗>W∗ = (M1, M2), we have K(Xi) = α>1M1h1(Xi) and L(Xi) = α>2M2h2(Xi), where α1 and α2 are q-dimensional vectors of constants. Thus

H∗ = −E[(K(Xi) + L(Xi)(1 − π∗i))/(π∗i(1 − π∗i)) ∂π∗i/∂β] = −E[(α>1M1h1(Xi) + α>2M2h2(Xi)(1 − π∗i))/(π∗i(1 − π∗i)) ∂π∗i/∂β].

Comparing with the expression for G∗ in (C.4), we can rewrite H∗ as

H∗ = G∗>(M>1α1; M>2α2).
Following similar derivations, it is seen that

Σµβ = −(G∗>Ω−1G∗)−1G∗>Ω−1 (E[(α>1M1h1(Xi) + α>2M2h2(Xi)(1 − π∗i))/(π∗i(1 − π∗i)) h1(Xi)]; E[(α>1M1h1(Xi) + α>2M2h2(Xi)(1 − π∗i))/π∗i h2(Xi)]),

which is equivalent to

Σµβ = −(G∗>Ω−1G∗)−1G∗>(M>1α1; M>2α2).
Hence,

H∗>Σµβ = −(α>1M1, α>2M2)G∗(G∗>Ω−1G∗)−1G∗>(M>1α1; M>2α2) = −H∗>ΣβH∗.

Together with (C.3), we have

V = Σµ − (α>1M1, α>2M2)G∗(G∗>Ω−1G∗)−1G∗>(M>1α1; M>2α2).

This completes the proof.
C.3 Proof of Corollary 3.1
Proof of Corollary 3.1. By Theorem 3.2, it suffices to show that

(α>1M1, α>2M2) G∗CG∗> (M>1α1; M>2α2) ≤ (ᾱ>1M̄1, ᾱ>2M̄2) Ḡ∗C̄Ḡ∗> (M̄>1ᾱ1; M̄>2ᾱ2),  (C.5)

where C = (G∗>Ω−1G∗)−1, and ᾱ1, M̄1, C̄, Ḡ∗ and Ω̄, among others, are the corresponding quantities computed with the lower-dimensional balancing functions h̄1(X) and h̄2(X). Assume that h1(X) ∈ Rm1+a1 and h2(X) ∈ Rm2+a2, while h̄1(X) ∈ Rm1 and h̄2(X) ∈ Rm2. Since K(Xi) = ᾱ>1M̄1h̄1(Xi) and L(Xi) = ᾱ>2M̄2h̄2(Xi), we find that (α>1M1, α>2M2) = (ᾱ>1M̄1, 0, ᾱ>2M̄2, 0), which is a vector in Rm+a with a = a1 + a2. Because some components of (α>1M1, α>2M2) are 0, by matrix algebra, (C.5) holds if C̄ − C is positive semidefinite. Without loss of generality, we rearrange orders and write the (m + a) × q matrix G∗ and the (m + a) × (m + a) matrix Ω as

G∗ = (Ḡ∗; A), and Ω = (Ω̄, Ω1; Ω>1, Ω2).
For simplicity, we use the following notation: two matrices satisfy O1 ≥ O2 if O1 − O2 is positive semidefinite. To show C̄ ≥ C, we have the following derivation:

G∗>Ω−1G∗ = (Ḡ∗>, A>) (Ω̄, Ω1; Ω>1, Ω2)−1 (Ḡ∗; A) ≥ (Ḡ∗>, A>) (Ω̄−1, 0; 0, 0) (Ḡ∗; A) = Ḡ∗>Ω̄−1Ḡ∗.

Hence C = (G∗>Ω−1G∗)−1 ≤ (Ḡ∗>Ω̄−1Ḡ∗)−1 = C̄. This completes the proof of (C.5), and therefore the corollary holds.
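The matrix fact used in the derivation above — that the inverse of a partitioned positive definite matrix dominates, in the positive semidefinite order, the block-diagonal matrix built from the inverse of its leading block — can be checked numerically. The following sketch uses hypothetical random matrices purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
worst = 0.0
for _ in range(20):
    Z = rng.normal(size=(9, 6))
    M = Z.T @ Z + 0.5 * np.eye(6)      # a random 6x6 symmetric positive definite matrix
    k = 4                              # dimension of the leading block
    D = np.zeros((6, 6))
    D[:k, :k] = np.linalg.inv(M[:k, :k])
    diff = np.linalg.inv(M) - D        # claim: this difference is positive semidefinite
    worst = min(worst, np.linalg.eigvalsh(diff).min())
```

Over all trials, the smallest eigenvalue of the difference stays at (numerically) zero or above, consistent with the Schur-complement identity behind the inequality.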
C.4 Proof of Corollary 3.2
Proof of Corollary 3.2. The proof of the double robustness property mainly follows from Theorem 3.1. In this case, we only need to verify that span{h1(·)} = span{M1h1(·)} and span{h2(·)} = span{M2h2(·)}, where M1 ∈ Rq×m1 and M2 ∈ Rq×m2 are the partitions of G∗>W∗ = (M1, M2). Apparently, we have span{M1h1(·)} ⊆ span{h1(·)}, since the former can always be written as a linear combination of h1(·). To show span{h1(·)} ⊆ span{M1h1(·)}, note that the m1 × m1 principal submatrix M11 of M1 is invertible. Thus, span{h1(·)} = span{M11h1(·)} ⊆ span{M1h1(·)}. This is because the m1-dimensional functions M11h1(·) are identical to the first m1 coordinates of M1h1(·). This completes the proof of the double robustness property. The efficiency property follows from Theorem 3.2. We do not replicate the details.
D Proof of Results in Section 4
For notational simplicity, we denote π∗(x) = J(m∗(x)), J∗(x) = J(β∗>B(x)), and Ĵ(x) = J(β>B(x)). Define Qn(β) = ‖gβ(T,X)‖22 and Q(β) = ‖E(gβ(Ti,Xi))‖22. In the following proofs, we use C, C′ and C′′ to denote generic positive constants, whose values may change from line to line.
Lemma D.1 (Bernstein’s inequality for U -statistics (Arcones, 1995)). Given i.i.d. random vari-
ables Z1, . . . Zn taking values in a measurable space (S,B) and a symmetric and measurable kernel
function h : Sm → R, we define the U -statistics with kernel h as U :=(nm
)−1∑i1<...<im
h(Zi1 , . . . , Zim).
Suppose that Eh(Zi1 , . . . , Zim) = 0, EE[h(Zi1 , . . . , Zim) | Zi1 ]
2= σ2 and ‖h‖∞ ≤ b. There exists
a constant K(m) > 0 depending on m such that
P(|U | > t) ≤ 4 exp− nt2/[2m2σ2 +K(m)bt]
, ∀t > 0.
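As a sanity check on the concentration phenomenon that Lemma D.1 quantifies, the snippet below (an illustrative toy example, not tied to the paper's data) evaluates a second-order U-statistic with kernel h(a, b) = ab on i.i.d. draws with mean one and verifies that it concentrates around its expectation:

```python
import numpy as np

rng = np.random.default_rng(3)

def u_stat(z):
    """Second-order U-statistic: (n choose 2)^{-1} * sum_{i<j} z_i * z_j."""
    return (z.sum() ** 2 - (z ** 2).sum()) / (z.size * (z.size - 1))

# For Z ~ N(1, 1), the kernel h(a, b) = a*b has expectation E[Z]^2 = 1,
# so the centered statistic U - 1 should be uniformly small across repetitions.
devs = [abs(u_stat(rng.normal(1.0, 1.0, size=5_000)) - 1.0) for _ in range(50)]
```

With n = 5,000 per repetition, the deviations are on the order of n^{-1/2}, in line with the exponential tail bound.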
Lemma D.2. Under the conditions in Theorem 4.1, it holds that

supβ∈Θ |Qn(β) − Q(β)| = Op(√(K2 logK/n)).
Proof of Lemma D.2. Let ξ(β) = (ξ1(β), ..., ξn(β))> and φ(β) = (φ1(β), ..., φn(β))>, where

ξi(β) = Ti/J(β>B(Xi)) − (1 − Ti)/(1 − J(β>B(Xi))), φi(β) = Ti/J(β>B(Xi)) − 1.

Then we have

Qn(β) = n−2 ∑ni=1∑nj=1 [ξi(β)ξj(β)h1(Xi)>h1(Xj) + φi(β)φj(β)h2(Xi)>h2(Xj)]
= n−2 ∑1≤i≠j≤n [ξi(β)ξj(β)h1(Xi)>h1(Xj) + φi(β)φj(β)h2(Xi)>h2(Xj)] + An(β),

where An(β) = n−2∑ni=1 [ξi(β)2‖h1(Xi)‖22 + φi(β)2‖h2(Xi)‖22]. Since there exists a constant c0 > 0 such that c0 ≤ |J(β>B(x))| ≤ 1 − c0 for any β ∈ Θ, and Ti ∈ {0, 1}, it follows that supβ∈Θ max1≤i≤n |ξi(β)| ≤ C and supβ∈Θ max1≤i≤n |φi(β)| ≤ C for some constant C > 0. Then we can show that

E(supβ∈Θ |An(β)|) ≤ (C/n) E(‖h(Xi)‖22) = O(K/n).
By the Markov inequality, we have supβ∈Θ |An(β)| = Op(K/n) = op(1). Following similar arguments, it can be easily shown that supβ∈Θ |Q(β)|/n = O(K/n). Thus, it holds that

supβ∈Θ |Qn(β) − Q(β)| = supβ∈Θ |(2/(n(n − 1))) ∑1≤i<j≤n uij(β)| + Op(K/n),  (D.1)

where uij(β) = u1ij(β) + u2ij(β) is a kernel function of a U-statistic with

u1ij(β) = ξi(β)ξj(β)h1(Xi)>h1(Xj) − E[ξi(β)ξj(β)h1(Xi)>h1(Xj)],
u2ij(β) = φi(β)φj(β)h2(Xi)>h2(Xj) − E[φi(β)φj(β)h2(Xi)>h2(Xj)].
Since Θ is a compact set in RK, by the covering number theory, there exists a constant C such that M = (C/r)K balls with radius r can cover Θ. Namely, Θ ⊆ ∪1≤m≤M Θm, where Θm = {β ∈ RK : ‖β − βm‖2 ≤ r} for some β1, ..., βM. Thus, for any given ε > 0,
P(supβ∈Θ |(2/(n(n − 1)))∑1≤i<j≤n u1ij(β)| > ε) ≤ ∑Mm=1 P(supβ∈Θm |(2/(n(n − 1)))∑1≤i<j≤n u1ij(β)| > ε)
≤ ∑Mm=1 [P(|(2/(n(n − 1)))∑1≤i<j≤n u1ij(βm)| > ε/2) + P(supβ∈Θm (2/(n(n − 1)))∑1≤i<j≤n |u1ij(β) − u1ij(βm)| > ε/2)].  (D.2)
By the Cauchy–Schwarz inequality, |h1(Xi)>h1(Xj)| ≤ ‖h1(Xi)‖2‖h1(Xj)‖2 ≤ CK, and thus |u1ij(βm)| ≤ CK. In addition, for any β,

E{ξi(β)h1(Xi)>E[ξj(β)h1(Xj)] − E[ξi(β)ξj(β)h1(Xi)>h1(Xj)]}2 ≤ E{ξi(β)h1(Xi)>E[ξj(β)h1(Xj)]}2 ≤ ‖E[ξ2i(β)h1(Xi)h1(Xi)>]‖2 · ‖E[ξj(β)h1(Xj)]‖22 ≤ CK,

for some constant C > 0. Here, in the last step we use the fact that

‖E[ξj(β)h1(Xj)]‖22 ≤ E‖ξj(β)h1(Xj)‖22 ≤ C · E‖h1(Xj)‖22 ≤ CK,

and ‖E[ξ2i(β)h1(Xi)h1(Xi)>]‖2 is bounded because ‖E[h1(Xj)h1(Xj)>]‖2 is bounded by assumption.
Thus, we can apply the Bernstein’s inequality in Lemma D.1 to the U-statistic with kernel function
u1ij(βm),
P(∣∣∣ 2
n(n− 1)
∑1≤i<j≤n
u1ij(βm)∣∣∣ > ε/2
)≤ 2 exp
(− Cnε2/[K +Kε]
), (D.3)
for some constant C > 0. Since |∂J(v)/∂v| is upper bounded by a constant for any v = β>B(x), it is easily seen that for any β ∈ Θm, |ξi(β) − ξi(βm)| ≤ C|(β − βm)>B(Xi)| ≤ CrK1/2, where the last step follows from the Cauchy–Schwarz inequality. By a standard perturbation analysis, this further implies |ξi(β)ξj(β) − ξi(βm)ξj(βm)| ≤ CrK1/2 for some constant C > 0. Thus,

|u1ij(β) − u1ij(βm)| ≤ CrK1/2|h1(Xi)>h1(Xj)| ≤ CrK3/2,

and note that, with r = K−2, we have CrK1/2E|h1(Xi)>h1(Xj)| ≤ ε/4 for n large enough. Thus
P(supβ∈Θm (2/(n(n − 1)))∑1≤i<j≤n |u1ij(β) − u1ij(βm)| > ε/2)
≤ P((2CrK1/2/(n(n − 1)))∑1≤i<j≤n |h1(Xi)>h1(Xj)| > ε/2)
≤ P((2CrK1/2/(n(n − 1)))∑1≤i<j≤n [|h1(Xi)>h1(Xj)| − E|h1(Xi)>h1(Xj)|] > ε/4)
≤ 2 exp(−CnKε2),  (D.4)
where the last step follows from the Hoeffding inequality for U-statistics. Thus, combining (D.2), (D.3) and (D.4), we have, for some constants C1, C2, C3 > 0, as n goes to infinity,

P(supβ∈Θ |(2/(n(n − 1)))∑1≤i<j≤n u1ij(β)| > ε) ≤ exp(C1K logK − C2nε2/[K + Kε]) + exp(C1K logK − C3nKε2) → 0,

where we take ε = C√(K2 logK/n) for some constant C sufficiently large. This implies

supβ∈Θ |(2/(n(n − 1)))∑1≤i<j≤n u1ij(β)| = Op(√(K2 logK/n)).
Following the same arguments, we can show that, with the same choice of ε,

supβ∈Θ |(2/(n(n − 1)))∑1≤i<j≤n u2ij(β)| = Op(√(K2 logK/n)).

Plugging these results into (D.1), we complete the proof.
Lemma D.3 (Bernstein’s inequality for random matrices (Tropp, 2015)). Let Zk be a sequence
of independent random matrices with dimensions d1 × d2. Assume that EZk = 0 and ‖Zk‖2 ≤ Rnalmost sure. Define
σ2n = max
∥∥∥ n∑k=1
E(ZkZ>k )∥∥∥
2,∥∥∥ n∑k=1
E(Z>k Zk)∥∥∥
2
.
Then, for all t ≥ 0,
P(∥∥∥ n∑
k=1
Zk
∥∥∥2≥ t)≤ (d1 + d2) exp
(− t2/2
σ2n +Rnt/3
).
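The matrix concentration described by Lemma D.3 can be observed empirically. The sketch below uses a hypothetical construction: it averages n bounded, mean-zero random symmetric matrices and checks that the spectral norm of the sum is small, consistent with the exponential tail bound:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 20, 2_000

def trial():
    """Spectral norm of a sum of n i.i.d. mean-zero matrices with ||Z_k||_2 <= 2/n."""
    idx = rng.integers(d, size=(n, 2))
    sgn = rng.choice([-1.0, 1.0], size=n)
    S = np.zeros((d, d))
    # Z_k = sgn * (e_i e_j^T + e_j e_i^T) / n : symmetric, mean zero, bounded
    np.add.at(S, (idx[:, 0], idx[:, 1]), sgn / n)
    np.add.at(S, (idx[:, 1], idx[:, 0]), sgn / n)
    return np.linalg.norm(S, 2)

norms = [trial() for _ in range(30)]
```

The observed norms are on the order of √(log d / n), far below the trivial bound of 2 obtained by the triangle inequality.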
Lemma D.4. Let H = (h(X1), ..., h(Xn))> and B = (B(X1), ..., B(Xn))> be two n × K matrices. Under the conditions in Theorem 4.1, we have

‖H>H/n − E[h(Xi)h(Xi)>]‖2 = Op(√(K logK/n))  (D.5)

and

‖B>B/n − E[B(Xi)B(Xi)>]‖2 = Op(√(K logK/n)).  (D.6)
Proof of Lemma D.4. We prove this result by applying Lemma D.3. In particular, to prove (D.5), we take Zi = n−1[h(Xi)h(Xi)> − E(h(Xi)h(Xi)>)]. It is easily seen that

‖Zi‖2 ≤ n−1[tr(h(Xi)h(Xi)>) + ‖E(h(Xi)h(Xi)>)‖2] ≤ (CK + C)/n,

where C is some positive constant. Moreover,

‖∑ni=1 E(ZiZ>i)‖2 ≤ n−1(‖E[h(Xi)h(Xi)>h(Xi)h(Xi)>]‖2 + ‖E(h(Xi)h(Xi)>)‖22) ≤ n−1(CK · ‖E(h(Xi)h(Xi)>)‖2 + C2) ≤ n−1(C2K + C2).

Note that √(K logK/n) = o(1). Now, if we take t = C√(K logK/n) in Lemma D.3 for some constant C sufficiently large, then we have P(‖∑nk=1 Zk‖2 ≥ t) ≤ 2K exp(−C′ logK) for some C′ > 1, and the right-hand side converges to 0 as K → ∞. This completes the proof of (D.5). The proof of (D.6) follows from the same arguments and is omitted for simplicity.
Lemma D.5. Under the conditions in Theorem 4.1, the following results hold.

1. Let U = (1/n)∑ni=1 Ui, Ui = (U>i1, U>i2)>, with

Ui1 = (Ti/π∗i − (1 − Ti)/(1 − π∗i)) h1(Xi), Ui2 = (Ti/π∗i − 1) h2(Xi).

Then ‖U‖2 = Op(K1/2/n1/2).

2. Let B(r) = {β ∈ RK : ‖β − β∗‖2 ≤ r}, and r = O(K1/2/n1/2 + K−rb). Then

supβ∈B(r) ‖∂gβ(T,X)/∂β − G∗‖2 = Op(K1/2r + √(K logK/n)).

3. Let Ji = J(β>B(Xi)), J̇i = ∂J(v)/∂v|v=β>B(Xi), and

T∗ = E{[E(Yi(1) | Xi)/π∗i − E(Yi(0) | Xi)/(1 − π∗i)] J̇∗iB(Xi)}.

Then

supβ∈B(r) ‖(1/n)∑ni=1 [TiYi(1)/J2i + (1 − Ti)Yi(0)/(1 − Ji)2] J̇iB(Xi) + G∗>α∗‖2 = Op(K1/2r + K−rh).
Proof of Lemma D.5. We start with the proof of the first result. Note that E(Ui) = 0. Then E‖U‖22 = E(U>iUi)/n, and for some constant C > 0,

E‖U‖22 = n−1 E[∑Kk=1 (Ti/π∗i − (1 − Ti)/(1 − π∗i))2 hk(Xi)2 I(k ≤ m1) + (Ti/π∗i − 1)2 hk(Xi)2 I(k > m1)] ≤ C∑Kk=1 E[hk(Xi)2]/n = O(K/n).

By the Markov inequality, this implies ‖U‖2 = Op(K1/2/n1/2), which completes the proof of the first result. In the following, we prove the second result. Denote
ξi(m(Xi)) = −(Ti/J2(m(Xi)) + (1 − Ti)/(1 − J(m(Xi)))2) J̇(m(Xi)), φi(m(Xi)) = −(Ti/J2(m(Xi))) J̇(m(Xi)),

and ∆i(m(Xi)) = diag(ξi(m(Xi))1m1, φi(m(Xi))1m2), a K × K diagonal matrix, where 1m1 is a vector of ones of length m1. Then, note that

∂gβ(T,X)/∂β − G∗ = (1/n)∑ni=1 B(Xi)h(Xi)>∆i(β>B(Xi)) − E[B(Xi)h(Xi)>∆i(m∗(Xi))],
which can be decomposed into two terms Iβ + II, where

Iβ = (1/n)∑ni=1 B(Xi)h(Xi)>[∆i(β>B(Xi)) − ∆i(m∗(Xi))], II = ∑ni=1 Zi,
Zi = n−1{B(Xi)h(Xi)>∆i(m∗(Xi)) − E[B(Xi)h(Xi)>∆i(m∗(Xi))]}.
We first consider the term II. It can be easily verified that ‖∆i(m∗(Xi))‖2 ≤ C for some constant C > 0. In addition, ‖B(Xi)h(Xi)>‖2 ≤ ‖B(Xi)‖2 · ‖h(Xi)‖2 ≤ CK. Thus, ‖Zi‖2 ≤ CK/n. Following the similar argument in the proof of Lemma D.4,

‖∑ni=1 E(ZiZ>i)‖2 ≤ n−1‖E[B(Xi)h(Xi)>∆i(m∗(Xi))∆i(m∗(Xi))h(Xi)B(Xi)>]‖2 + n−1‖E[B(Xi)h(Xi)>∆i(m∗(Xi))]‖22.
We now consider the last two terms separately. Note that

‖E[B(Xi)h(Xi)>∆i(m∗(Xi))]‖22 = sup‖u‖2=1,‖v‖2=1 |E[u>B(Xi)h(Xi)>∆i(m∗(Xi))v]|2
≤ sup‖u‖2=1 |E[u>B(Xi)B(Xi)>u]| · sup‖v‖2=1 |E[v>∆i(m∗(Xi))h(Xi)h(Xi)>∆i(m∗(Xi))v]|
≤ ‖E(B(Xi)B(Xi)>)‖2 · C‖E(h(Xi)h(Xi)>)‖2 ≤ C′,  (D.7)
where C, C′ are some positive constants. Following similar arguments to (D.7),

‖E[B(Xi)h(Xi)>∆i(m∗(Xi))∆i(m∗(Xi))h(Xi)B(Xi)>]‖2 ≤ CK · sup‖u‖2=1 |E[u>B(Xi)B(Xi)>u]| ≤ CK · ‖E[B(Xi)B(Xi)>]‖2 ≤ C′K,

for some constants C, C′ > 0. This implies ‖∑ni=1 E(ZiZ>i)‖2 ≤ CK/n. Thus, Lemma D.3 implies ‖II‖2 = Op(√(K logK/n)). Next, we consider the term Iβ. Following similar arguments to (D.7), we can show that
supβ∈B(r) ‖Iβ‖2 = supβ∈B(r) sup‖u‖2=1,‖v‖2=1 |(1/n)∑ni=1 u>B(Xi)h(Xi)>[∆i(β>B(Xi)) − ∆i(m∗(Xi))]v|
≤ ‖(1/n)∑ni=1 B(Xi)B(Xi)>‖2^{1/2} · ‖(1/n)∑ni=1 h(Xi)h(Xi)>‖2^{1/2} · supβ∈B(r) max1≤i≤n ‖∆i(β>B(Xi)) − ∆i(m∗(Xi))‖2
≤ C supβ∈B(r) supx∈X |(β∗ − β)>B(x)| + C supx∈X |m∗(x) − β∗>B(x)| ≤ C′(K1/2r + K−rb) ≤ C′′K1/2r,
for some C, C′, C′′ > 0, where the second inequality follows from Lemma D.4 and the Lipschitz property of ξi(·) and φi(·), and the third inequality is due to the Cauchy–Schwarz inequality and the approximation assumption of the sieve estimator. This completes the proof of the second result. For the third result, let

ηi(m(Xi)) = (TiYi(1)/J2(m(Xi)) + (1 − Ti)Yi(0)/(1 − J(m(Xi)))2) J̇(m(Xi)).
Thus, the following decomposition holds:

(1/n)∑ni=1 ηi(β>B(Xi))B(Xi) + G∗>α∗ = T1β + T2 + T3,

where

T1β = (1/n)∑ni=1 [ηi(β>B(Xi)) − ηi(m∗(Xi))]B(Xi),
T2 = (1/n)∑ni=1 {ηi(m∗(Xi))B(Xi) − E[ηi(m∗(Xi))B(Xi)]},
T3 = E[ηi(m∗(Xi))B(Xi)] + G∗>α∗.
Similar to the earlier bound for supβ∈B(r) ‖Iβ‖2, we can easily show that supβ∈B(r) ‖T1β‖2 = Op(K1/2r); again, the key step is to use the results from Lemma D.4. For the second term T2, we can use arguments similar to the proof of the first result to show that E‖T2‖22 ≤ CK · E[ηi(m∗(Xi))2]/n = O(K/n). The Markov inequality implies ‖T2‖2 = Op(K1/2/n1/2). For the third term T3, after some algebra, we can show that

‖T3‖2 ≤ C(supx∈X |K(x) − α∗>1h1(x)| + supx∈X |L(x) − α∗>2h2(x)|) = Op(K−rh).

Combining the L2 error bounds for T1β, T2 and T3, we obtain the last result. This completes the whole proof.
Lemma D.6. Under the conditions in Theorem 4.1, it holds that ‖β − β∗‖2 = op(1).
Proof of Lemma D.6. Recall that βo is the minimizer of Q(β). We now decompose Q(β) − Q(βo) as

Q(β) − Q(βo) = [Q(β) − Qn(β)] + [Qn(β) − Qn(βo)] + [Qn(βo) − Q(βo)] =: I + II + III.  (D.8)
In the following, we study the terms I, II and III one by one. For the term I, Lemma D.2 implies |Q(β) − Qn(β)| ≤ supβ∈Θ |Qn(β) − Q(β)| = op(1). This shows that |I| = op(1), and the same argument yields |III| = op(1). For the term II, by the definition of β, it is easy to see that II ≤ 0. Thus, combining with (D.8), we have, for any constant η > 0 to be chosen later, Q(β) − Q(βo) < η with probability tending to one. For any ε > 0, define Eε = Θ ∩ {‖β − βo‖2 ≥ ε}. By the uniqueness of βo, for any β ∈ Eε, we have Q(β) > Q(βo). Since Eε is a compact set, we have infβ∈Eε Q(β) > Q(βo). This implies that for any ε > 0, there exists η′ > 0 such that Q(β) > Q(βo) + η′ for any β ∈ Eε. If β ∈ Eε, then Q(βo) + η > Q(β) > Q(βo) + η′ with probability tending to one. Apparently, this does not hold if we take η < η′. Thus, we have proved that β ∉ Eε, that is, ‖β − βo‖2 ≤ ε for any ε > 0. Thus, we have ‖β − βo‖2 = op(1).
Next, we shall show that ‖βo − β∗‖2 = op(1). These two results together lead to the desired consistency:

‖β − β∗‖2 ≤ ‖βo − β∗‖2 + ‖β − βo‖2 = op(1).

To show ‖βo − β∗‖2 = op(1), we use a similar strategy; that is, we want to show that for any constant η > 0, Q(β∗) − Q(βo) < η. In the following, we prove that Q(β∗) = O(K1−2rb). Note that

Q(β∗) ≤ C2K−2rb ∑Kj=1 E|hj(X)|2 = O(K1−2rb),

where the first inequality follows from the Cauchy–Schwarz inequality and the last step uses the assumption that supx∈X ‖h(x)‖2 = O(K1/2). In addition, it holds that Q(βo) ≤ Q(β∗) = O(K1−2rb). As K → ∞, this yields Q(β∗) − Q(βo) < η for any constant η > 0. The same arguments as above then yield ‖βo − β∗‖2 = op(1). This completes the proof of the consistency result.
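The argmin-consistency argument above — uniform convergence of Qn to Q plus uniqueness of the population minimizer forces the empirical minimizer to be close — can be illustrated with a one-dimensional toy objective (a hypothetical setup, unrelated to the paper's model):

```python
import numpy as np

rng = np.random.default_rng(5)
beta_star = 1.5
grid = np.linspace(-3.0, 3.0, 1201)    # a compact parameter set Theta, discretized

errs = []
for n in (100, 1_000, 10_000):
    z = rng.normal(beta_star, 1.0, size=n)
    # Empirical GMM-type objective Qn(beta) = (sample moment)^2; the population
    # objective Q(beta) = (beta_star - beta)^2 has the unique minimizer beta_star.
    qn = (z.mean() - grid) ** 2
    beta_hat = grid[np.argmin(qn)]
    errs.append(abs(beta_hat - beta_star))
```

As n grows, sup |Qn − Q| shrinks and the minimizer of Qn is pulled toward beta_star, mirroring the η versus η′ argument in the proof.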
Lemma D.7. Under the conditions in Theorem 4.1, there exists a global minimizer β of Qn(β) (if Qn(β) has multiple minimizers) such that

‖β − β∗‖2 = Op(K1/2/n1/2 + K−rb).  (D.9)
Proof of Lemma D.7. We first prove that there exists a local minimizer ∆ of Qn(β∗ + ∆) such that ∆ ∈ C, where C = {∆ ∈ RK : ‖∆‖2 ≤ r} and r = C(K1/2/n1/2 + K−rb) for some constant C large enough. To this end, it suffices to show that

P(inf∆∈∂C [Qn(β∗ + ∆) − Qn(β∗)] > 0) → 1, as n → ∞,  (D.10)

where ∂C = {∆ ∈ RK : ‖∆‖2 = r}. Applying the mean value theorem to each component of gβ∗+∆(T,X),

gβ∗+∆(T,X) = gβ∗(T,X) + G̃∆,

where G̃ = ∂gβ̃(T,X)/∂β and, for notational simplicity, we assume there exists a common β̃ = vβ∗ + (1 − v)(β∗ + ∆) for some 0 ≤ v ≤ 1 lying between β∗ and β∗ + ∆ (rigorously speaking, we need a different β̃ for each component of gβ∗+∆(T,X)). Thus, for any ∆ ∈ ∂C,

Qn(β∗ + ∆) − Qn(β∗) = 2gβ∗(T,X)>G̃∆ + ∆>(G̃>G̃)∆ ≥ −2‖gβ∗(T,X)‖2 · ‖G̃‖2 · ‖∆‖2 + ‖∆‖22 · λmin(G̃>G̃) ≥ −C(K1/2/n1/2 + K−rb) · r + C · r2,  (D.11)
for some constant C > 0. In the last step, we use the fact that ‖gβ∗(T,X)‖2 = Op(K1/2/n1/2 + K−rb), which is derived by combining Lemma D.5 with arguments similar to (D.14) in the proof of Lemma D.8. In addition, ‖G̃‖2 ≤ ‖G̃ − G∗‖2 + ‖G∗‖2 ≤ C, since ‖G∗‖2 is bounded by a constant and ‖G̃ − G∗‖2 = op(1) by Lemma D.5. By the Weyl inequality and Lemma D.5,

λmin(G̃>G̃) ≥ λmin(G∗>G∗) − ‖G̃>G̃ − G∗>G∗‖2 ≥ C − ‖G̃ − G∗‖2 · ‖G̃‖2 − ‖G̃ − G∗‖2 · ‖G∗‖2 ≥ C/2,

for n sufficiently large. By (D.11), if r = C(K1/2/n1/2 + K−rb) for some constant C large enough, the right-hand side is positive for n large enough. This establishes (D.10). Next, we show that β = β∗ + ∆ is a global minimizer of Qn(β). This is true because the first-order condition implies

(∂gβ(T,X)/∂β)> gβ(T,X) = 0, which implies gβ(T,X) = 0,

provided ∂gβ(T,X)/∂β is invertible. By similar arguments based on the Weyl inequality, ∂gβ(T,X)/∂β is invertible with probability tending to one. Since gβ(T,X) = 0, it implies Qn(β) = 0. Noting that Qn(β) ≥ 0 for any β, we conclude that β is indeed a global minimizer of Qn(β).
Lemma D.8. Under the conditions in Theorem 4.1, β satisfies the following asymptotic expansion:

β − β∗ = −G∗−1U + ∆n,  (D.12)

where U = (1/n)∑ni=1 Ui, Ui = (U>i1, U>i2)>, with

Ui1 = (Ti/π∗i − (1 − Ti)/(1 − π∗i)) h1(Xi), Ui2 = (Ti/π∗i − 1) h2(Xi),

and

‖∆n‖2 = Op(K1/2 · (K1/2/n1/2 + K−rb)2 + √(K logK/n) · (K1/2/n1/2 + K−rb)).
Proof of Lemma D.8. Similar to the proof of Lemma D.7, we apply the mean value theorem to each component of gβ(T,X):

gβ∗(T,X) + (∂gβ̃(T,X)/∂β)(β − β∗) = 0,

where, for notational simplicity, we assume there exists a common β̃ = vβ∗ + (1 − v)β for some 0 ≤ v ≤ 1 lying between β∗ and β. After rearrangement, we derive

β − β∗ = −G∗−1gβ∗(T,X) + [G∗−1 − (∂gβ̃(T,X)/∂β)−1] gβ∗(T,X) = −G∗−1U + ∆n1 + ∆n2 + ∆n3,  (D.13)

where

∆n1 = G∗−1[U − gβ∗(T,X)], ∆n2 = [G∗−1 − (∂gβ̃(T,X)/∂β)−1]U,
∆n3 = [G∗−1 − (∂gβ̃(T,X)/∂β)−1] · [gβ∗(T,X) − U].
We first consider ∆n1 in (D.13). Let ξ = (ξ1, ..., ξn)>, where, for the moment conditions corresponding to h1,

ξi = Ti(1/π∗i − 1/J∗i) − (1 − Ti)(1/(1 − π∗i) − 1/(1 − J∗i)),

and, for the moment conditions corresponding to h2,

ξi = Ti(1/π∗i − 1/J∗i).

Let H = (h(X1), ..., h(Xn))> be an n × K matrix. Then, for some constants C, C′ > 0,

‖∆n1‖22 = n−2ξ>HG∗−1G∗−1H>ξ ≤ n−2‖ξ‖22 · ‖HG∗−1G∗−1H>‖2 ≤ Cn−1‖ξ‖22 · ‖H>H/n‖2 ≤ C′n−1‖ξ‖22,  (D.14)

where the third step follows from the fact that ‖G∗−1‖2 is bounded, and the last step follows from Lemma D.4 and the boundedness of the maximum eigenvalue of E[h(Xi)h(Xi)>]. Since |∂J(v)/∂v| is upper bounded by a constant for any v ≤ supx∈X |m∗(x)|, there exist constants C, C′ > 0 such that, for the h2-components,

|ξi| ≤ C|π∗i − J∗i| ≤ C′ supx∈X |m∗(x) − β∗>B(x)| ≤ C′K−rb.

Similarly, |ξi| ≤ 2C′K−rb for the h1-components. Thus, n−1‖ξ‖22 = Op(K−2rb). Combining with (D.14), we conclude that ‖∆n1‖2 = Op(K−rb).
Next, we consider ∆n2. Since ‖G∗−1‖2 is bounded, we have

‖∆n2‖2 ≤ ‖G∗−1‖2 · ‖(∂gβ̃(T,X)/∂β)−1‖2 · ‖G∗ − ∂gβ̃(T,X)/∂β‖2 · ‖U‖2 ≤ C(‖β − β∗‖2K1/2 + √(K logK/n)) · √(K/n),

where the last step follows from Lemma D.5.

Finally, we consider ∆n3. By the same arguments as in the control of ∆n1 and ∆n2, we can prove that

‖∆n3‖2 ≤ ‖G∗−1 − (∂gβ̃(T,X)/∂β)−1‖2 · ‖gβ∗(T,X) − U‖2 ≤ C(‖β − β∗‖2K1/2 + √(K logK/n)) · K−rb.
Combining the rates of ‖∆n1‖2, ‖∆n2‖2 and ‖∆n3‖2 with (D.13), by Lemma D.5 we obtain

‖β − β∗‖2 ≤ ‖G∗−1gβ∗(T,X)‖2 + ‖∆n1‖2 + ‖∆n2‖2 + ‖∆n3‖2 ≤ C(K1/2/n1/2 + K−rb) + C′(‖β − β∗‖2K1/2 + √(K logK/n)) · (K1/2/n1/2 + K−rb),

for some constants C, C′ > 0. Therefore, (D.12) holds with ∆n = ∆n1 + ∆n2 + ∆n3, where

‖∆n‖2 = Op(K1/2 · (K1/2/n1/2 + K−rb)2 + √(K logK/n) · (K1/2/n1/2 + K−rb)).

This completes the proof.
Proof of Theorem 4.1. We now consider the following decomposition of $\hat\mu_{\hat\beta} - \mu$:
\begin{align*}
\hat\mu_{\hat\beta} - \mu &= \frac{1}{n}\sum_{i=1}^n\Big[\frac{T_i(Y_i(1) - K(X_i) - L(X_i))}{J_i} - \frac{(1-T_i)(Y_i(0) - K(X_i))}{1-J_i}\Big]\\
&\quad + \frac{1}{n}\sum_{i=1}^n\Big(\frac{T_i}{J_i} - \frac{1-T_i}{1-J_i}\Big)K(X_i) + \frac{1}{n}\sum_{i=1}^n\Big(\frac{T_i}{J_i} - 1\Big)L(X_i) + \frac{1}{n}\sum_{i=1}^n L(X_i) - \mu\\
&= \frac{1}{n}\sum_{i=1}^n\Big[\frac{T_i(Y_i(1) - K(X_i) - L(X_i))}{J_i} - \frac{(1-T_i)(Y_i(0) - K(X_i))}{1-J_i}\Big]\\
&\quad + \frac{1}{n}\sum_{i=1}^n\Big(\frac{T_i}{J_i} - \frac{1-T_i}{1-J_i}\Big)\Delta_K(X_i) + \frac{1}{n}\sum_{i=1}^n\Big(\frac{T_i}{J_i} - 1\Big)\Delta_L(X_i) + \frac{1}{n}\sum_{i=1}^n L(X_i) - \mu,
\end{align*}
where $J_i = J(\hat\beta^\top B(X_i))$, $\Delta_K(X_i) = K(X_i) - \alpha_1^{*\top}h_1(X_i)$ and $\Delta_L(X_i) = L(X_i) - \alpha_2^{*\top}h_2(X_i)$. Here, the second equality holds by the definition of $\hat\beta$. Thus, we have
\[
\hat\mu_{\hat\beta} - \mu = \frac{1}{n}\sum_{i=1}^n S_i + R_0 + R_1 + R_2 + R_3,
\]
where
\begin{align*}
S_i &= \frac{T_i}{\pi_i^*}\big[Y_i(1) - K(X_i) - L(X_i)\big] - \frac{1-T_i}{1-\pi_i^*}\big[Y_i(0) - K(X_i)\big] + L(X_i) - \mu,\\
R_0 &= \frac{1}{n}\sum_{i=1}^n\frac{T_i(Y_i(1) - K(X_i) - L(X_i))}{J_i\pi_i^*}(\pi_i^* - J_i),\\
R_1 &= \frac{1}{n}\sum_{i=1}^n\frac{(1-T_i)(Y_i(0) - K(X_i))}{(1-J_i)(1-\pi_i^*)}(\pi_i^* - J_i),\\
R_2 &= \frac{1}{n}\sum_{i=1}^n\Big(\frac{T_i}{J_i} - \frac{1-T_i}{1-J_i}\Big)\Delta_K(X_i), \qquad
R_3 = \frac{1}{n}\sum_{i=1}^n\Big(\frac{T_i}{J_i} - 1\Big)\Delta_L(X_i).
\end{align*}
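To see why the second equality in the decomposition at the beginning of this proof holds ``by the definition of $\hat\beta$'', note that (assuming, as the block notation $h = (h_1, h_2)$ suggests, that $\hat\beta$ solves the sample covariate balancing equations for both blocks)
\[
\frac{1}{n}\sum_{i=1}^n\Big(\frac{T_i}{J_i} - \frac{1-T_i}{1-J_i}\Big)h_1(X_i) = 0
\quad\text{and}\quad
\frac{1}{n}\sum_{i=1}^n\Big(\frac{T_i}{J_i} - 1\Big)h_2(X_i) = 0,
\]
so replacing $K(X_i)$ by $\Delta_K(X_i) = K(X_i) - \alpha_1^{*\top}h_1(X_i)$ and $L(X_i)$ by $\Delta_L(X_i) = L(X_i) - \alpha_2^{*\top}h_2(X_i)$ leaves the corresponding sums unchanged.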
In the following, we will show that $R_j = o_p(n^{-1/2})$ for $0 \le j \le 3$. Thus, the asymptotic normality of $n^{1/2}(\hat\mu_{\hat\beta} - \mu)$ follows from the previous decomposition. In addition, $S_i$ agrees with the efficient score function for estimating $\mu$ (Hahn, 1998). Thus, the proposed estimator $\hat\mu_{\hat\beta}$ is also semiparametrically efficient.
Now, we first focus on $R_0$. Consider the following empirical process $\mathbb G_n(f_0) = n^{1/2}(\mathbb P_n - P)f_0(T, Y(1), X)$, where $\mathbb P_n$ stands for the empirical measure and $P$ stands for the expectation, and
\[
f_0(T, Y(1), X) = \frac{T(Y(1) - K(X) - L(X))}{J(m(X))\pi^*(X)}\big[\pi^*(X) - J(m(X))\big].
\]
By Lemma D.7, we can easily show that
\[
\sup_{x\in\mathcal X}|J(\hat\beta^\top B(x)) - \pi^*(x)| \lesssim \sup_{x\in\mathcal X}|\hat\beta^\top B(x) - \beta^{*\top}B(x)| + \sup_{x\in\mathcal X}|m^*(x) - \beta^{*\top}B(x)| = O_p(K/n^{1/2} + K^{1/2 - r_b}) = o_p(1).
\]
For notational simplicity, we denote $\|f\|_\infty = \sup_{x\in\mathcal X}|f(x)|$. Define the set of functions $\mathcal F = \{f_0 : \|m - m^*\|_\infty \le \delta\}$, where $\delta = C(K/n^{1/2} + K^{1/2 - r_b})$ for some constant $C > 0$. By the strong ignorability of the treatment assignment, we have that $Pf_0(T, Y(1), X) = 0$. By the Markov inequality and the maximal inequality in Corollary 19.35 of Van der Vaart (2000),
\[
n^{1/2}R_0 \le \sup_{f_0\in\mathcal F}\mathbb G_n(f_0) \lesssim \mathbb E\sup_{f_0\in\mathcal F}\mathbb G_n(f_0) \lesssim J_{[\,]}(\|F_0\|_{P,2}, \mathcal F, L_2(P)),
\]
where $J_{[\,]}(\|F_0\|_{P,2}, \mathcal F, L_2(P))$ is the bracketing integral and $F_0$ is the envelope function. Since $J$ is bounded away from 0, we have $|f_0(T, Y(1), X)| \lesssim \delta|Y(1) - K(X) - L(X)| := F_0$. Then $\|F_0\|_{P,2} \le \delta(\mathbb E|Y(1)|^2)^{1/2} \lesssim \delta$. Next, we consider $N_{[\,]}(\varepsilon, \mathcal F, L_2(P))$. Define $\mathcal F_0 = \{f_0 : \|m - m^*\|_\infty \le C\}$ for some constant $C > 0$. Thus, it is easily seen that
\[
\log N_{[\,]}(\varepsilon, \mathcal F, L_2(P)) \lesssim \log N_{[\,]}(\varepsilon, \delta\mathcal F_0, L_2(P)) = \log N_{[\,]}(\varepsilon/\delta, \mathcal F_0, L_2(P)) \lesssim \log N_{[\,]}(\varepsilon/\delta, \mathcal M, L_2(P)) \lesssim (\delta/\varepsilon)^{1/k_1},
\]
where we use the fact that $J$ is bounded away from 0 and $J$ is Lipschitz. The last step follows from the assumption on the bracketing number of $\mathcal M$. Then
\[
J_{[\,]}(\|F_0\|_{P,2}, \mathcal F, L_2(P)) \lesssim \int_0^\delta\sqrt{\log N_{[\,]}(\varepsilon, \mathcal F, L_2(P))}\,d\varepsilon \lesssim \int_0^\delta(\delta/\varepsilon)^{1/(2k_1)}\,d\varepsilon,
\]
which goes to 0 as $\delta \to 0$, because $2k_1 > 1$ by assumption and thus the integral converges. This shows that $n^{1/2}R_0 = o_p(1)$. By a similar argument, we can show that $n^{1/2}R_1 = o_p(1)$.
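For completeness, the last integral above admits a closed form; writing $a = 1/(2k_1) < 1$,
\[
\int_0^\delta(\delta/\varepsilon)^a\,d\varepsilon = \delta^a\cdot\frac{\delta^{1-a}}{1-a} = \frac{\delta}{1-a} = \frac{2k_1}{2k_1 - 1}\,\delta,
\]
which is $O(\delta)$ and hence vanishes as $\delta \to 0$.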
Next, we consider $R_2$. Define the following empirical process $\mathbb G_n(f_2) = n^{1/2}(\mathbb P_n - P)f_2(T, X)$, where
\[
f_2(T, X) = \frac{T - J(m(X))}{J(m(X))(1 - J(m(X)))}\Delta_K(X).
\]
By the assumption on the approximation property of the basis functions, we have $\|\Delta_K\|_\infty \lesssim K^{-r_h}$. In addition,
\begin{align*}
\|J(\hat\beta^\top B(X)) - \pi^*(X)\|_{P,2} &\le \|J(\hat\beta^\top B(X)) - J(\beta^{*\top}B(X))\|_{P,2} + \|J(\beta^{*\top}B(X)) - \pi^*(X)\|_{P,2}\\
&\lesssim \|\hat\beta^\top B(X) - \beta^{*\top}B(X)\|_{P,2} + \sup_{x\in\mathcal X}|m^*(x) - \beta^{*\top}B(x)|\\
&= O_p(K^{1/2}/n^{1/2} + K^{-r_b}),
\end{align*}
where the last step follows from Lemma D.7.
Define the set of functions $\mathcal F = \{f_2 : \|m - m^*\|_{P,2} \le \delta_1, \|\Delta\|_\infty \le \delta_2\}$, where $\delta_1 = C(K^{1/2}/n^{1/2} + K^{-r_b})$ and $\delta_2 = CK^{-r_h}$ for some constant $C > 0$. Thus,
\[
n^{1/2}R_2 \le \sup_{f_2\in\mathcal F}\mathbb G_n(f_2) + n^{1/2}\sup_{f_2\in\mathcal F}Pf_2.
\]
We first consider the second term $n^{1/2}\sup_{f_2\in\mathcal F}Pf_2$. Let $\mathcal G_1 = \{m\in\mathcal M : \|m - m^*\|_{P,2} \le \delta_1\}$ and $\mathcal G_2 = \{\Delta\in\mathcal H - \alpha_1^{*\top}h_1 : \|\Delta\|_\infty \le \delta_2\}$. By the definition of the propensity score and the Cauchy--Schwarz inequality,
\begin{align*}
n^{1/2}\sup_{f_2\in\mathcal F}Pf_2 &= n^{1/2}\sup_{m\in\mathcal G_1,\,\Delta\in\mathcal G_2}\mathbb E\Big[\frac{\pi^*(X) - J(m(X))}{J(m(X))(1 - J(m(X)))}\Delta(X)\Big]\\
&\lesssim n^{1/2}\sup_{m\in\mathcal G_1}\|\pi^* - J(m)\|_{P,2}\sup_{\Delta\in\mathcal G_2}\|\Delta\|_{P,2}\\
&\lesssim n^{1/2}\delta_1\delta_2 \lesssim n^{1/2}(K^{1/2}/n^{1/2} + K^{-r_b})K^{-r_h} = o(1),
\end{align*}
where the last step follows from $r_h > 1/2$ and the scaling assumption $n^{1/2} \lesssim K^{r_b + r_h}$ in this theorem. Next, we need to control the maximum of the empirical process $\sup_{f_2\in\mathcal F}\mathbb G_n(f_2)$. Following a similar argument to that for $R_0$, we only need to upper bound the bracketing integral $J_{[\,]}(\|F_2\|_{P,2}, \mathcal F, L_2(P))$. Since $J$ is bounded away from 0 and 1, we can set the envelope function to be $F_2 := C\delta_2$ for some constant $C > 0$, and thus $\|F_2\|_{P,2} \lesssim \delta_2$. Define $\mathcal F_0 = \{f_2 : \|m - m^*\|_{P,2} \le C, \|\Delta\|_{P,2} \le 1\}$ for some constant $C > 0$, $\mathcal G_{10} = \{m\in\mathcal M + m^* : \|m\|_{P,2} \le C\}$ and $\mathcal G_{20} = \{\Delta\in\mathcal H - \alpha_1^{*\top}h_1 : \|\Delta\|_{P,2} \le 1\}$. Similarly, we have
\begin{align*}
\log N_{[\,]}(\varepsilon, \mathcal F, L_2(P)) &\lesssim \log N_{[\,]}(\varepsilon/\delta_2, \mathcal F_0, L_2(P))\\
&\lesssim \log N_{[\,]}(\varepsilon/\delta_2, \mathcal G_{10}, L_2(P)) + \log N_{[\,]}(\varepsilon/\delta_2, \mathcal G_{20}, L_2(P))\\
&\lesssim \log N_{[\,]}(\varepsilon/\delta_2, \mathcal M, L_2(P)) + \log N_{[\,]}(\varepsilon/\delta_2, \mathcal H, L_2(P))\\
&\lesssim (\delta_2/\varepsilon)^{1/k_1} + (\delta_2/\varepsilon)^{1/k_2},
\end{align*}
where the second step follows from the boundedness assumption on $J$ and its Lipschitz property, the third step is due to $\mathcal G_{10} - m^* \subset \mathcal M$ and $\mathcal G_{20} + \alpha_1^{*\top}h_1 \subset \mathcal H$, and the last step is by the bracketing number condition in our assumption. Since $2k_1 > 1$ and $2k_2 > 1$, it is easily seen that the bracketing integral $J_{[\,]}(\|F_2\|_{P,2}, \mathcal F, L_2(P)) = o(1)$. This shows that $\sup_{f_2\in\mathcal F}\mathbb G_n(f_2) = o_p(1)$. Thus, we conclude that $n^{1/2}R_2 = o_p(1)$. By a similar argument, we can show that $n^{1/2}R_3 = o_p(1)$. This completes the whole proof.