Nonlinear and Nonparametric Regression and
Instrumental Variables
Raymond J. Carroll, David Ruppert, Ciprian M. Crainiceanu,
Tor D. Tosteson and Margaret R. Karagas
February 25, 2003
Abstract
We consider regression when the predictor is measured with error and an in-
strumental variable is available. The regression function can be modeled linearly,
nonlinearly, or nonparametrically. Our major new result shows that the regres-
sion function and all parameters in the measurement error model are identified
under relatively weak conditions, much weaker than previously known to imply
identifiability. In addition, we develop an apparently new characterization of the
instrumental variable estimator: it is in fact a classical “correction for attenuation” method based on a particular estimate of the variance of the measurement
error. This estimate of the measurement error variance allows us to construct
functional nonparametric regression estimates, by which we mean that no as-
sumptions are made about the distribution of the unobserved predictor. The
general identifiability results also allow us to construct structural methods of
estimation under parametric assumptions on the distribution of the unobserved
predictor. The functional method uses SIMEX and the structural method uses
Bayesian computing machinery. The Bayesian estimator is found to significantly
outperform the functional approach.
KEY WORDS: Bayesian methods; Errors in variables; Functional method; Generalized linear models; Identifiability; Instrumental variables; Measurement error; Nonparametric regression; P–splines; Regression splines; SIMEX; Smoothing splines; Structural modeling.
Short title. Regression with Instrumental Variables
Author Affiliations
Raymond J. Carroll (E-mail: [email protected]), Distinguished Professor, Department
of Statistics and Faculties of Nutrition and Toxicology, Texas A&M University, College
Station TX 77843–3143, USA.
David Ruppert (E-mail: [email protected]), Professor, School of Operations Research
& Industrial Engineering, Cornell University, Ithaca, NY 14853-3801, USA.
Ciprian M. Crainiceanu (E-mail: [email protected]), Graduate student, Department of Statistical Science, Cornell University, Malott Hall, Ithaca, NY 14853, USA.
Tor D. Tosteson (E-mail: [email protected]), Associate Professor, Com-
munity and Family Medicine, Dartmouth Medical School, Hanover, NH 03755, USA.
Margaret R. Karagas (E-mail: [email protected]), Associate Professor,
Community and Family Medicine, Dartmouth Medical School, Hanover, NH 03755, USA.
Acknowledgments
Carroll’s research was supported by a grant from the National Cancer Institute (CA57030),
and by the Texas A&M Center for Environmental and Rural Health via a grant from the
National Institute of Environmental Health Sciences (P30–ES09106). Ruppert, Tosteson
and Karagas were supported by the National Cancer Institute (CA50597), and Tosteson and Karagas were supported by the National Cancer Institute (CA50597, CA57494) and the National Institute of Environmental Health Sciences (ES07373). We thank Professor Tailen Hsing for showing us the Durrett (1996) counterexample used in our Appendix.
1 Introduction
1.1 Background and Problem Statement
The measurement error literature, already large, has continued to expand into complex modeling situations; see Reeves, et al. (1998), Dominici, et al. (2000), and Strauss, et al. (2001)
for recent environmental problems where measurement error plays a major role.
Motivated by a problem in environmental epidemiology (Section 5), we consider the
problem of measurement error in regression where the regression function could be modeled
linearly, nonlinearly, or even nonparametrically. In the case that the measurement error
variance is known or can be estimated by replication of the error–prone predictor, functional
(Carroll, et al., 1999) and structural/Bayesian (Berry, et al., 2002) methods have been
developed.
In our example, based on an epidemiologic study of skin cancer and arsenic exposure
(Karagas, et al., 2001), the error-prone predictor is not replicated, so another approach is
needed. In this study, information on the measurement error is provided in the form of a
second measure of exposure, which we use as an instrumental variable. Estimation using an
instrumental variable is a surprisingly difficult problem because even in polynomial regres-
sion, identifiability of the regression function is a major issue, as described by Hausman, et
al. (1991). In our approach, we use a slight modification of their model for the instrument.
We provide simple and explicit assumptions under which identifiability is assured.
There is some work on parametric but not necessarily linear regression with an instrumen-
tal variable (Hausman, et al., 1991; Amemiya, 1990; Carroll and Stefanski, 1994; Stefanski
and Buzas, 1995; Buzas and Stefanski, 1996). These methods are either only applicable for
special parametric models, or for general parametric models they rely on small–error ap-
proximations that are known to fail for some nonlinear and nonparametric models (Carroll,
et al., 1995). To the best of our knowledge, there are no techniques presently available for
nonparametrically specified regression functions in the instrumental variable context.
Our identifiability result is related to a simple yet apparently new characterization of
the instrumental variable estimator. Specifically, we show that in simple linear regression
with a scalar instrument, the usual instrumental variables estimator is in fact a version of the
classical “correction for attenuation” method based on a specific estimate of the measurement
error variance. Because we can thus estimate the measurement error variance, this means
that we can apply methods from the classical error literature, particularly functional methods
that make no assumptions about the distribution of the unobserved predictor.
1.2 Consistent Estimation in Nonparametric Regression
Some readers of this paper have asked us to show the consistency of our estimators in non-
parametric regression. We now discuss why even attempting to create consistent estimators
is not a useful idea in practice and should not be pursued. More complete discussions are
given by Carroll, et al. (1999) and Berry, et al. (2002).
Consider a regression function m(·) that is to be estimated consistently but nonparamet-
rically. If the true covariate were observable, then there are a host of competing methods for
estimating m(·) consistently but nonparametrically, e.g., kernels, splines, orthogonal series,
local methods, etc.
However, if the true covariate is not observable, and is instead measured with additive,
normal error, then globally consistent estimation of m(·) is effectively impossible. This prob-
lem has been addressed previously, most notably by Fan and Truong (1993). Suppose that
we allow m(·) to have up to k derivatives. They showed that, if the measurement error is
normally distributed, even with known error variance, then, based on a sample of size n,
no consistent nonparametric estimator of m(·) converges faster than the rate {log(n)}^{−k}.
Since, for example, log(10, 000, 000) ≈ 16, effectively this result suggests that globally con-
sistent nonparametric regression function estimation in the presence of measurement error
is impractical.
Given this fact about globally consistent estimation, it seems to us that the only practical
alternative is to construct estimators that are “approximately” consistent. By this we mean
that either (a) in large samples, as the error variance → 0, the estimator should have smaller
order bias than the naive estimator that ignores measurement error; or (b) the estimator
should be consistent in a smaller class of problems, in particular a flexible parametric class.
Carroll, et al. (1999) choose (a), while Berry, et al. (2002) choose (b). Our Bayesian esti-
mator assumes that the regression function is a spline with a moderate number of knots,
e.g., 20, and will be consistent not for the true regression function but rather for the best
spline approximation thereof. However, Ruppert (2002) has shown that the bias caused by
spline approximation is generally negligible compared to variability of the estimator or the
smoothing bias, even for sample sizes in the tens of thousands.
1.3 Outline
The outline of the paper is as follows. In Section 2, we define our model and the basic
characterization of identifiability.
The implication of the identifiability result is that we can construct methods with some assurance that they will reflect the main features of the data. In Section 3, we outline the methods used, some of which make no assumptions
about the distribution of the latent variable (functional case) while other methods assume
a specific form for this distribution (structural case). Section 4 presents a small simulation
study of nonparametric Gaussian and binary regression. In Section 5 we illustrate the meth-
ods on an example involving binary regression and environmental arsenic. In response to
referee concerns about the small sample properties of the estimators, Section 6 describes some asymptotic calculations in the polynomial regression case, which illustrate just how difficult the instrumental variables estimation problem can be for nonlinear models. Section 7 contains concluding remarks.
2 Model and Identifiability
2.1 Introduction
In contrast to the small–error approach to instrumental variables estimation, Hausman, et al. (1991) consider the most basic “nonlinear” model, polynomial regression. A polynomial is of
course linear in the parameters, but it is nonlinear in the independent variable and therefore
nonlinear in the measurement error. Let Y be the response, W the unbiased measure of X,
and S the instrument. They assume that the observed data are an iid sequence of vectors
(Y, W, S) that satisfy
Y = mpoly(X, β) + ε; (1)
W = X + U ; (2)
S = α0 + α1X + ν. (3)
In (1), the function mpoly(x, β) is a polynomial in X. In addition, ε, U , and ν have zero
means, and ν is independent of (X, ε, U). Model (2) is the classical error model. Hausman,
et al. show that the model (1)–(3) is identified essentially under the condition that α1 in (3)
is known. While this result is obviously impressive, in our experience α1 is rarely known in
instrumental variable applications.
2.2 More Details on the Result of Hausman, et al.
Previous readers of this paper have misinterpreted our claim that “essentially”, Hausman,
et al. require that α1 be known. We now clarify what we mean by this phrase.
Hausman, et al.’s main identifiability conditions are the following. In their equation
(2.4), they require that there be a known function a(·) such that a(α0, α1) = 0. In their
equation (2.9), they note that E(S) = α0 +α1E(W ) = α0 +α1E(X). They then require that
these two conditions identify (α0, α1). In an example after their (2.9), they illustrate that if
a(α0, α1) = κ0 +κ1α0 +κ2α1 with (κ0, κ1, κ2) known, then (α0, α1) would be identified under
some conditions, particularly if E(X) ≠ 0.
It stretches the imagination to think of any practical context such that (α0, α1) have a
known relationship. In addition, suppose that we force E(X) = E(W) = E(S) = 0 by the common device of centering W and S to have mean zero. This changes only the location of the data, not the model. Then (2.9) of Hausman, et al. is trivial, α0 = 0, and the condition that (α0, α1) have a known relationship reduces to α1 being known. This is what we mean by “essentially”.
We point out that Hausman, et al. consider the case of differential measurement error,
i.e., that ε and U are correlated. When α1 is known, or identified via a relationship with α0,
our methods are easily adapted to this case.
2.3 Main Identifiability Results
Our proposed methods are based on the simple but apparently new observation that, in fact, (1)–(3) is identified without prior knowledge of α1 even if the regression function is not a polynomial. Rather, α1 can be determined from moments of the observable variables
in (1)–(3). This is an important result since it means that m(·) can be estimated without
any prior knowledge of parameters provided only that the instrument S as well as the proxy
W are observed.
Suppose that
(X,U, ε, ν) are mutually uncorrelated. (4)
Replace (1) by
Y = m(X) + ε. (5)
Then for any function m(x) (not just polynomials), α0, α1, µx = E(X), σ2x = var(X), σ2u = var(U), and σ2ν = var(ν) are all identified if α1 ≠ 0 and if

cov(Y, W) = cov{X, m(X)} ≠ 0. (6)

Specifically, α1 = cov(Y, S)/cov(Y, W); µx = E(W); α0 = E(S − α1W); σ2x = cov(W, S)/α1; σ2u = var(W) − σ2x; and σ2ν = var(S) − α1^2 σ2x. Therefore, all parameters are functions of the moments of observables and so are identified. It is interesting to note that if we interchanged the roles of Y and S, so that S is the “response” and Y is the “instrument”, then identifiability of α1 under (6) follows from the usual instrumental variable calculations.
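To make the moment calculations concrete, the following sketch computes the method-of-moments estimates from data simulated under (2), (3) and (5). It is written in Python for illustration (our own programs are in MATLAB), and the function name and simulation settings are ours, not part of the method's definition.

    import numpy as np

    def iv_moment_estimates(Y, W, S):
        # Moment-based identification of Section 2.3 (illustrative sketch).
        cov = lambda a, b: np.cov(a, b)[0, 1]
        alpha1 = cov(Y, S) / cov(Y, W)            # alpha1 = cov(Y,S)/cov(Y,W)
        mu_x = np.mean(W)                         # mu_x = E(W)
        alpha0 = np.mean(S - alpha1 * W)          # alpha0 = E(S - alpha1*W)
        sigma2_x = cov(W, S) / alpha1             # sigma2_x = cov(W,S)/alpha1
        sigma2_u = np.var(W, ddof=1) - sigma2_x   # sigma2_u = var(W) - sigma2_x
        sigma2_nu = np.var(S, ddof=1) - alpha1 ** 2 * sigma2_x
        return alpha0, alpha1, mu_x, sigma2_x, sigma2_u, sigma2_nu

    # Check on data simulated from (2), (3) and (5):
    rng = np.random.default_rng(0)
    n = 100_000
    X = rng.normal(0.0, 1.0, n)
    Y = np.sin(np.pi * X / 2.0) + rng.normal(0.0, 0.3, n)   # Y = m(X) + eps
    W = X + rng.normal(0.0, np.sqrt(0.33), n)               # W = X + U
    S = 0.5 + 1.0 * X + rng.normal(0.0, 1.0, n)             # S = alpha0 + alpha1*X + nu
    print(iv_moment_estimates(Y, W, S))   # close to (0.5, 1.0, 0.0, 1.0, 0.33, 1.0)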
Of course, there are examples where (6) fails, e.g., m(X) ≡ constant or m(·) is an even
function and X symmetrically distributed about 0. However, we expect cov{X, m(X)} ≠ 0
in the vast majority of applications. Moreover, if we add to (4) the assumption that
X is independent of U and ν, (7)
then (6) can be weakened to

cov[{X − E(X)}^k, m(X)] exists and is non-zero for some positive integer k. (8)
Specifically, we have the following.
Theorem: Assume (2), (3), (4), (5), (7), (8), that α1 ≠ 0, and that σ2x > 0. Then α0, α1, µx = E(X), σ2x = var(X), σ2u = var(U), and σ2ν = var(ν) are all identified; that is, they are determined by moments of observable variables.
The proof of the theorem is given in Appendix A.1. In Appendix A.2 we show that under
weak assumptions, (8) will hold unless m(·) is constant. When m(·) is constant, m(·) is still
identified, but α1 appears not to be identified; see Section 3.4. Our theorem does not state
explicitly whether m is identified, because this follows from Fan and Truong (1993), who
exhibit conditions under which m(·) can be consistently estimated, and hence is identified,
if var(U) is identified. As we have argued in Section 1.2, (globally) consistent estimation of
m(·) is not feasible practically, and is thus not the main goal of our work.
Some comments on the assumptions and implications of the characterization are in order.
Assume now that (6) does hold.
1. In the linear case where m(x) = β0 + β1x, the usual IV slope estimate is the sample version of β1,iv = cov(Y, S)/cov(W, S). If σ2u were known, then the usual correction for attenuation estimate is the sample version of β1,ca = cov(Y, W)/{var(W) − σ2u}. In our IV model the estimate of σ2u is the sample version of σ2u,iv = var(W) − cov(W, S)cov(Y, W)/cov(Y, S). Substituting σ2u,iv into the formula for β1,ca yields β1,iv. Thus, the usual IV estimator of the slope in linear regression can be looked upon as the correction for attenuation estimator when the measurement error variance is estimated via our proposal; a numerical check appears in the sketch following this list.
2. Conceptually, the connection between correction for attenuation and instrumental vari-
ables estimation offers the hope of more stable estimation. In particular, the attenua-
tion is
λ = σ2x/(σ2x + σ2u) = (σ2w − σ2u)/σ2w. (9)
The correction for attenuation estimator is simply the least squares slope ignoring
measurement error divided by an estimate of the attenuation. Because of this division,
one can at least in principle improve the usual instrumental variables estimator by
bounding the attenuation away from zero.
3. The model (5) is more general than it looks, since the distribution of ε can depend on
X. For example, in our application Y is binary, so that we can write the model as
logit{pr(Y = 1|X)} = g(X), where g(X) = logit{m(X)}. Then ε is a Bernoulli variate
minus its mean.
4. Because (5) is an unstructured regression model, the assumption of additivity in (2)–
(3) is not as strong as it may seem. Instead, we are only assuming that there is a
common smooth transformation of the original data to X, W and S that satisfies these
equations. If (5) holds for the original data, then it will also hold for the transformed
data. For example, in our application to environmental epidemiology, we log transform
the data.
5. In practice, the methods are necessarily restricted to cases that Y and W are clearly
related, else α1 will be poorly estimated. Indeed, if Y and X are independent then the
parameters in the (W,S) model are unidentifiable if (W,S) are jointly normal.
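As a numerical check of comment 1, the following illustrative snippet verifies that the usual IV slope and the correction-for-attenuation slope computed with σ2u,iv coincide; the equality is an algebraic identity, so any discrepancy is floating-point error. All names and settings below are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200_000
    X = rng.normal(0.0, 1.0, n)
    Y = 1.0 + 2.0 * X + rng.normal(0.0, 0.5, n)     # linear m(x) = beta0 + beta1*x
    W = X + rng.normal(0.0, 0.6, n)                 # classical error model (2)
    S = 0.3 + 0.8 * X + rng.normal(0.0, 1.0, n)     # instrument model (3)

    cov = lambda a, b: np.cov(a, b)[0, 1]
    beta1_iv = cov(Y, S) / cov(W, S)                # usual IV slope
    sigma2_u_iv = np.var(W, ddof=1) - cov(W, S) * cov(Y, W) / cov(Y, S)
    beta1_ca = cov(Y, W) / (np.var(W, ddof=1) - sigma2_u_iv)
    print(beta1_iv, beta1_ca)                       # equal up to rounding; near 2.0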
3 Methods of Estimation
This section describes, in broad terms, three methods of estimation. Section 3.1 describes the
basic method we use to fit nonparametric regression, namely fixed–knot regression splines.
In Section 3.2, we propose a functional method of estimation that makes no assumptions
about the distributions of the random variables (X, U, ε, ν). We can do this because of our
new result that gives us an estimate of the measurement error variance. The method is
simple. We use the estimate of var(U) derived by moments calculations outlined above, and
then we apply the SIMEX method (Cook and Stefanski, 1994) to the (Y, W ) data, as in for
example Carroll, et al. (1999). In the nonparametric regression problem with error variance
known or estimated by direct replication of W , Berry, et al. (2002) showed that a Bayesian
approach using regression and smoothing splines could achieve significant gains in efficiency
when compared to the SIMEX method. In Section 3.3 we show how to extend their Bayesian
method to the IV problem and also to problems such as binary regression.
3.1 Fixed Knot Regression Splines
A general approach to spline fitting is to use penalized splines or simply P–splines, a term
we borrow from Eilers and Marx (1996). In this section, we introduce the idea. The full
specification of the spline estimators proposed in our context comes later in the paper; see
for example (10).
Let C(x) = {B1(x), . . . , BN(x)}T, N ≤ n be a spline basis, i.e., a set of linearly indepen-
dent piecewise polynomial functions; a specific example will be given shortly. The P-spline
model specifies that m(·) is in the vector space of splines generated by this basis, i.e., that
for some N -dimensional β, m(x) = m(x, β) = C(x)Tβ.
Classes of P-splines that are especially convenient for modeling are the penalized B-
splines of Eilers and Marx (1996) and the closely related truncated power series basis of
Ruppert (2002). B-splines are more stable numerically than the truncated power basis, but
the roughness penalty we use adds numerical stability and makes use of the truncated
power basis computationally feasible. See Ruppert (2002) for a discussion of computation
with the truncated power basis. The latter are pth degree polynomial splines with k fixed
knots, t1, . . . , tk. We choose the knots at the quantiles of the W ’s. These functions have p−1
continuous derivatives and their pth derivatives are piecewise constant and take jumps at the
knots. A convenient basis for these splines is the set of monomials plus the truncated power functions, so that C(x) = {1, x, x^2, ..., x^p, (x − t1)^p_+, . . . , (x − tk)^p_+}T, where a^p_+ = {max(0, a)}^p. Then N = 1 + p + k; β1, . . . , βp+1 are the monomial coefficients, and βp+2, . . . , βN are the sizes of the jumps in the pth derivative of g(x) = C(x)Tβ at the knots.
The choice of the number of knots k is discussed by Ruppert (2002) who finds that for P–
splines the exact value of the number of knots k is not important, provided that k is at least
a certain minimum value. Generally, k = 20 more than suffices for the types of regression
functions found in practice and that can be recovered when there is measurement error. Of
course, there will be exceptions where more knots are required, e.g., a long periodic time
series. However, measurement error often occurs in situations where m is not too complex
and 10–20 knots, or often even fewer, will suffice in such cases.
We add for completeness that there are a host of ways to fit spline functions. We have
found that for many functions, knot selection is not too important if the number of knots
is reasonably large. Berry, et al. (2002) found that P–splines and smoothing splines gen-
erally give very similar answers. Of course, at least in principle researchers interested in
knot selection can generalize our work to include either knot selection or smoothing splines.
Whether this is necessary or even practical in the context of measurement error remains an
open problem.
If measurement error is ignored, it is typical to fit the function m(x, β) by penalized
maximum likelihood; see for example Hastie and Tibshirani (1990). Consider the truncated
power series basis defined above. Let D∗ be the N × N diagonal matrix with p + 1 zeros
followed by k ones along the diagonal. Let γ be a smoothing parameter. The penalized
estimator β(γ) ignoring measurement error maximizes the loglikelihood in (Y, W) minus the penalty γβTD∗β. More formally, suppose that the loglikelihood in (Y, X) is L(Y, X, β). Then the penalized regression spline ignoring measurement error is the solution to

max_β [ {∑_{i=1}^n L(Yi, Wi, β)} − γβTD∗β ]. (10)
One can use cross validation (CV) or generalized cross validation (GCV) to choose γ.
See, for example, Hastie and Tibshirani (1990, page 159) for definitions of CV and GCV.
Other penalties such as on the integral of the squared second derivative can be imposed by
other choices of D∗.
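For concreteness, here is a minimal sketch of the truncated power basis and the penalized fit (10) in the Gaussian case, where maximizing the penalized loglikelihood reduces to penalized least squares. This illustrative Python is not our MATLAB implementation, and the smoothing parameter is held fixed rather than chosen by CV or GCV.

    import numpy as np

    def truncated_power_basis(x, p, knots):
        # C(x) = {1, x, ..., x^p, (x - t1)^p_+, ..., (x - tk)^p_+}
        cols = [x ** j for j in range(p + 1)]
        cols += [np.maximum(0.0, x - t) ** p for t in knots]
        return np.column_stack(cols)

    def penalized_spline_fit(y, w, p=2, k=20, gamma=1.0):
        # Gaussian version of (10): penalized least squares with a ridge
        # penalty on the k jump coefficients only (D* has p+1 zeros, then ones).
        knots = np.quantile(w, np.linspace(0.0, 1.0, k + 2)[1:-1])  # quantiles of the W's
        C = truncated_power_basis(w, p, knots)
        Dstar = np.diag([0.0] * (p + 1) + [1.0] * k)
        beta = np.linalg.solve(C.T @ C + gamma * Dstar, C.T @ y)
        return beta, knots

    # Naive fit of Y on W, ignoring measurement error:
    rng = np.random.default_rng(2)
    x = rng.normal(size=500)
    w = x + rng.normal(scale=0.5, size=500)
    y = np.sin(np.pi * x / 2.0) + rng.normal(scale=0.3, size=500)
    beta, knots = penalized_spline_fit(y, w)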
3.2 Functional Methods: SIMEX
As described in Section 2, as part of our work we have derived a new, simple nonparametric
estimator of the measurement error variance, σ2u. This estimate, however, is not guaranteed
to be positive, and it is entirely possible that it will be either negative or much too large.
We thus suggest the following simple modification. Set a user–specified lower bound on the
attenuation (9), say λL. Let λ̂ be the estimate of λ obtained by replacing σ2w in (9) by the sample variance of the W's and by replacing σ2u by its estimate. If λL ≤ λ̂, then use this estimate of σ2u. If λ̂ < λL, then form a new estimate of σ2u by solving λL = (σ2w − σ2u)/σ2w, so that the new estimate is σ2u = σ2w(1 − λL).
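In code, the bounding rule amounts to clipping the implied attenuation and solving (9) back for σ2u. The sketch below also clips at 1 so that a negative variance estimate is replaced by zero; that upper clip is our addition, in the spirit of the constraint λ ∈ [0.60, 1.00] used in the simulations of Section 4.

    import numpy as np

    def bounded_sigma2_u(sigma2_u_hat, W, lambda_L=0.60):
        # If the implied attenuation falls outside [lambda_L, 1], clip it and
        # solve lambda = (s2w - s2u)/s2w back for s2u; otherwise keep the estimate.
        s2w = np.var(W, ddof=1)
        lam = np.clip((s2w - sigma2_u_hat) / s2w, lambda_L, 1.0)
        return s2w * (1.0 - lam)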
SIMEX needs a base estimator, i.e., an estimator one would use if there were no mea-
surement error. Carroll, et al. (1999) describe two such methods: (a) local linear kernel
regression of Y on W with bandwidths estimated either by GCV or by EBBS; (b) regression
splines of Y on W as described previously with smoothing parameter estimated by GCV.
We now define our SIMEX–IV estimator: apply the SIMEX method of Cook and Stefan-
ski (1994), using σ2u as the estimate of error variance. Briefly, for any fixed ζ > 0, suppose
one repeatedly ‘adds on’ to W , via simulation, additional error with mean zero and variance
σ2uζ, forming what are called pseudovalues. Then using these pseudovalues as predictors, one
computes one’s favorite nonparametric regression estimator, e.g., kernel or regression spline
as described above. With this estimator in hand, one generates pseudovalues repeatedly and
averages the estimators, calling the average g(ζ, x): generally, 50–200 repetitions will suffice, and one uses ζ = 0.0, 0.5, 1.0, 1.5, 2.0. The idea is to plot g(ζ, x) against ζ ≥ 0, fit a model to
this plot, and then extrapolate back to ζ = −1. In our calculations, we used a quadratic
function to model the plot of g(ζ, x) against ζ.
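The following sketch shows the SIMEX–IV loop under the quadratic extrapolant; fit stands for any base estimator, such as the penalized spline of Section 3.1, evaluated on a grid of x values. The signature and defaults are illustrative only.

    import numpy as np

    def simex_iv(y, w, sigma2_u, fit, x_grid,
                 zetas=(0.0, 0.5, 1.0, 1.5, 2.0), B=100, seed=0):
        # For each zeta, add simulated error with variance sigma2_u*zeta to W,
        # refit B times, and average to get g(zeta, x) on x_grid.
        rng = np.random.default_rng(seed)
        g = []
        for z in zetas:
            fits = [fit(y, w + np.sqrt(z * sigma2_u) * rng.standard_normal(len(w)), x_grid)
                    for _ in range(B)]
            g.append(np.mean(fits, axis=0))
        # Quadratic extrapolation in zeta, back to zeta = -1, at each grid point.
        coef = np.polyfit(np.asarray(zetas), np.asarray(g), deg=2)
        return coef[0] - coef[1] + coef[2]   # value of the quadratic at zeta = -1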
3.2.1 Asymptotic Theory for SIMEX
Asymptotic theory for the SIMEX method is easy if one uses kernel methods. Since σ2u
converges at the rate Op(n^{−1/2}), and the kernel estimator converges at a slower rate, the
asymptotics are the same as if σ2u were known. This means that the SIMEX–kernel instru-
mental variables estimator has the same asymptotic distribution and expansion as described
in Carroll, et al. (1999). In the interest of space, we do not rewrite the details of this result.
There are no known limiting results for penalized regression splines with a fixed number
of knots and an estimated smoothing parameter. If the smoothing parameter γ in (10) is
held fixed, or is known and converges to zero at a specified rate, then the solution to (10) is
the solution to an estimating equation, i.e., an equation of the form

∑_{i=1}^n Ψn(Yi, Wi, β, γ) = 0. (11)
The limiting distribution of SIMEX for estimating equations such as (11) is already known;
see Stefanski and Cook (1995) and Carroll, et al. (1996).
3.3 Bayesian Methods
Our Bayesian methods are similar to those in Berry, et al. (2002). Partition C(x) = {C1(x)T, C2(x)T}T, where C1(x)T = (1, x, x^2, ..., x^p). Partition β = (β1T, β2T)T similarly. As is common with regression splines, we will assume that β2 = Normal(0, σ2I), where I is the identity matrix. Other formulations are possible and are described in the appendix. The parameters then become α0, α1, µx, σ2x, σ2u, σ2ν, β and σ2.
The formulae needed to implement the Gibbs sampler are detailed in Sections A.3 and A.4, where we exhibit them for the Gaussian and probit models. Sections 4.1.3 and A.5 describe implementation in BUGS and our experience with it.
A reader has asked that we comment on the asymptotic properties of the Bayesian meth-
ods. We know of no general results for Bayesian P–splines even without measurement error,
but can appeal to standard theory connecting Bayesian and frequentist methods under the
assumptions that the model that drives the Bayesian calculations actually holds.
3.4 The Case that cov[m(X), {X − E(X)}^k] = 0 for all k
In general, it would appear that when cov[m(X), {X − E(X)}^k] = 0 for all k, the function
m(·) may not be identifiable. However, in the most important subcase, namely that m(·) is
constant so that m(X) ≡ c, m(·) is generally identifiable, at least when a lower bound on
the attenuation is specified.
Detailed proof of the assertion above is highly technical, but the main idea can be seen for
the SIMEX estimator. We assume that the extrapolant function is parametric and includes
the constant function as a special case. To this end, recall that if the regression function is
constant, then E(Y |X) ≡ E(Y |W) ≡ c. Thus, the naive estimator that ignores measurement error consistently estimates m(·). Now consider what happens in the SIMEX algorithm. If the attenuation is bounded
below by λL, then for sufficiently large samples we must have that (1/2)λL ≤ (σ2w − σ2u)/σ2w. This means that for sufficiently large samples we can find an interval [a, b] for σ2u such that the attenuation on this interval of values always exceeds zero. Fix any σ2u∗ in this interval.
Consider the construction of pseudovalues W(pseudo, ζ, σ2u∗) = W + ζ^{1/2} σu∗ Z, where Z is a computer–generated standard normal random variable. Of course, the pseudovalues also satisfy E{Y |W(pseudo, ζ, σ2u∗)} ≡ c. We have thus shown that the naive estimator applied to the pseudovalues consistently estimates m(·) ≡ c. Since the extrapolant function includes the constant function as a special case, we have shown that for any σ2u∗ in [a, b], applying SIMEX leads to a consistent estimate of m(·). What is now required to complete the proof is to show that this argument holds uniformly
in σ2u∗ ∈ [a, b], and hence that SIMEX is consistent as long as one bounds the attenuation
away from zero. Providing precise technical conditions to make this argument rigorous is
likely to be tedious and quite possibly uninteresting.
4 A Small Simulation Study
In this section we describe simulation results for Gaussian nonparametric regression and
binary nonparametric regression. In our simulation, we took n = 100 for the Gaussian case
and n = 500 for the logistic case. These are small sample sizes given the difficulty of the
instrumental variables problem for nonparametric regression; see Section 6.
We took σ2x = 1, σ2u = 0.33, σ2ν = 1, α0 = 0 and α1 = 1. For the Gaussian case, the error variance in (5) was σ2ε = 0.09. In this simulation, the attenuation was λ = 0.75. In our
calculations, the attenuation λ was constrained to lie in [0.60, 1.00]. Mean squared biases
and mean squared errors were calculated for x ∈ [−2.0, 2.0].
Although we assumed that the X's were normally distributed, to test robustness in the Gaussian case we consider three distributions for the X's: normal, uniform on [−2, 2], and skew normal with index α = 5. The skew normal distribution has density f(x|α) = 2φ(x)Φ(αx), where φ and Φ represent the standard normal density and distribution function (Azzalini, 1985). This density is reasonably skewed for any value of α ≥ 5.
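Skew normal variates are easily generated from the standard representation; the sketch below is one way to do it and is not taken from our simulation code.

    import numpy as np

    def rskewnormal(n, alpha, rng):
        # X = delta*|Z1| + sqrt(1 - delta^2)*Z2 has density 2*phi(x)*Phi(alpha*x),
        # with delta = alpha/sqrt(1 + alpha^2).
        delta = alpha / np.sqrt(1.0 + alpha ** 2)
        z1 = rng.standard_normal(n)
        z2 = rng.standard_normal(n)
        return delta * np.abs(z1) + np.sqrt(1.0 - delta ** 2) * z2

    x = rskewnormal(10_000, alpha=5.0, rng=np.random.default_rng(3))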
4.1 Gaussian Nonparametric Regression
For the Gaussian case, we considered three regression models. In Case 1, the regression function is 1/{1 + exp(4x)}. In Case 2, the regression function is sin(πx/2)/(1 + [2x^2{1 + sin(πx/2)}]). In Case 3, the regression function is sin(πx/2)/(1 + [2x^2{1 + sign(x)}]).
4.1.1 Bias–Variance Tradeoffs in Regression Spline Estimation
Carroll, et al. (1999, Section 4.4) describe theoretical calculations in the classical measurement
error problem that show that if one uses the truncated power series basis for regression
splines and maximum likelihood estimation, then the variance of the fits “blows up” as the
smoothing parameter → 0.
What does this mean, and why is it important? The essential point of this theoretical calculation is that in a sample of size 100 for Gaussian cases, our methods must necessarily penalize the spline in order to make it reasonably stable. There is a cost for such smoothing,
however, and that is bias. Specifically, for such sample sizes in the Gaussian case, it is
hopeless to believe that we will be able to reproduce difficult functions with deep valleys
such as Cases 2 and 3.
4.1.2 Results
The results for the Gaussian case are given in Table 1, for a 25–knot quadratic regression
spline: similar results were obtained for the linear spline. In this table, mean squared bias
and mean squared error are averages over 101 grid points on the interval [-2,2] and over all
Monte Carlo samples.
We see that the Bayes estimator clearly dominates the SIMEX estimators and the naive
estimator that ignores measurement error, both in terms of bias and in terms of mean squared
error. The SIMEX estimator with a quadratic extrapolant is far less biased than the naive
estimator, but it has large variance.
Figure 1 corresponds to Table 1, normal distribution, Case 3. The top left, top right, and bottom left panels show 3 simulated data sets. The bottom right panel is the mean over all simulated data sets. The lines are: solid = true, dashed = naive, dash–dot = SIMEX, and dotted = Bayes.
This is a problem for which the naive estimator is only somewhat worse than the Bayes
estimator (from Table 1, naive squared bias = 2.99, Bayes squared bias = 1.29, naive mse
= 3.72, Bayes mse = 2.97). Careful inspection of the plot shows that the naive estimate
often misses or just barely finds the inflection points. The SIMEX estimator has excess
variability as shown in Table 1. Basically, this means that when the naive estimator is not
too bad relative to Bayes, the differences between the SIMEX and Bayes estimates are real
but subtle.
Figure 2 corresponds to Table 1, normal distribution, Case 2. In this case, the Bayes
estimator is a large improvement over the SIMEX estimator. This can be seen in the top
left panel, which is a data set where the naive estimate is poor, and the SIMEX is then
even worse. Table 1 shows the same thing: real dominance by Bayes. Notice that in
the bottom right the mean of SIMEX is close to that of the Bayes estimator, so that these
two estimators have similar bias. This implies that the substantial MSE improvement of
the Bayes estimator over SIMEX seen in Table 1 is due to the lower variability of the Bayes
estimator.
4.1.3 Implementation and Comparison with WinBUGS
We have implemented the methods in MATLAB; the programs are available at the web site (not given to preserve anonymity). In addition, at the same web site, we have constructed 20 simulated data sets for each of the cases in the simulation, along with Case 4, m(x) = x^2.
This case is interesting because (6) is violated, and one would expect difficulties or at least
small sample instabilities in the fits. We have provided the Naive and Bayes estimates of
the regression functions. Readers may wish to compare their favorite approaches to ours on these data sets.
It is also possible to implement the Bayesian method using software designed for MCMC
simulations, such as WinBUGS (Bayesian inference Using Gibbs Sampling for Windows).
In Appendix A.5 we provide our implementation of the Gaussian model. WinBUGS is
very intuitive and flexible, allowing quick changes in the model. For example, changing
the model from a Normal to a Bernoulli/Logit model can be done by simply replacing the
line Y[i] ~ dnorm(m[i], taueps) with the lines Y[i] ~ dbern(p[i]) and logit(p[i]) <- m[i].
Similar changes can be made for Binomial, Poisson and other distributions of interest. An
important advantage of WinBUGS is that it does not require one to write code for the
Metropolis–Hastings step of simulations.
The MATLAB and the WinBUGS implementations use the same set of priors for pa-
rameters but different proposal distributions. The Matlab program takes advantage of the
specific features of the model for which all but two complete conditionals are explicit. Care-
fully tailored Metropolis–Hastings steps are used for these two complete conditionals.
These features of the Matlab program have important effects on both simulation speed
(number of simulations per second) and, more importantly, on the MCMC mixing (the
property of the chain to move rapidly through the support of the target distribution). For
example, for a data set of sample size 100 for the normal Case 2, 1,000 MCMC simulations were obtained in 14 seconds with Matlab and in 48 seconds with WinBUGS (2.66 GHz CPU,
1GB RAM). For the Matlab program 30,000 simulations including 10,000 burn-in proved to
be enough to achieve convergence. Due to differences in the mixing quality of the simulated chains, the WinBUGS program required 1,000,000 simulations including 500,000 burn-in to achieve
the same results. In the end, WinBUGS needed approximately 13 hours to achieve the same
results obtained by the Matlab program in 7 minutes.
One should note that coding in WinBUGS requires only a low level of expertise, and coding times are far shorter than for expert programs (hours versus weeks or even months). In
our experience WinBUGS proves to be a valuable tool in the initial phase of research, when
many models are considered and compared. Moreover, WinBUGS programs can be used to
validate expert programs in the process of program refining and debugging.
4.2 Logistic Nonparametric Regression
We generated data according to the logistic model pr(Y = 1|X) = [1 + exp{−m(X)}]^{−1}, although the data were fit via probit regression, and the logits computed from the probit fit. Thirteen cases were considered, built from four basic families of monotone functions (see the sketch after this list):

m(x) = κx, κ = 1.0, 0.75 and 0.50 (Cases 1, 2, 3);
m(x) = κ([4/{1 + exp(−x)}] − 2), κ = 1.0, 0.75 and 0.50 (Cases 4, 5, 6);
m(x) = (x − κ)_+ − (−κ − x)_+, κ = 1.0 and 0.75 (Cases 7, 8);
m(x) = κx_+, κ = 1.0, 0.75, 0.50, 0.25 and 0.00 (Cases 9, 10, 11, 12, 13).
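The following sketch encodes the four families and generates binary responses from the logistic model; the names and settings are illustrative.

    import numpy as np

    def pos(t):
        # (t)_+ = max(t, 0)
        return np.maximum(t, 0.0)

    def m_case(x, family, kappa):
        # The four families of monotone test functions listed above.
        if family == 1:                                   # Cases 1-3
            return kappa * x
        if family == 2:                                   # Cases 4-6
            return kappa * (4.0 / (1.0 + np.exp(-x)) - 2.0)
        if family == 3:                                   # Cases 7-8
            return pos(x - kappa) - pos(-kappa - x)
        return kappa * pos(x)                             # Cases 9-13

    # Binary responses from the logistic model of this section:
    rng = np.random.default_rng(4)
    X = rng.normal(size=500)
    p = 1.0 / (1.0 + np.exp(-m_case(X, 2, 0.75)))
    Y = rng.binomial(1, p)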
The constant κ basically makes the function increasingly close to constant as κ decreases. The
last case, m(x) ≡ 0, is a null case, to which the discussion in Section 3.4 applies. For this
case, the naive estimate has no bias, and would be expected to have smaller mean squared
error, since the effect of trying to correct an already consistent estimator for non–existent
bias caused by measurement error is to increase variance.
The results are given in Table 2 for a 10–knot linear regression spline. Basically, we
see that when the functions are monotone and far from constant, the Bayes estimator has
smaller bias and mean squared error than the naive method, sometimes much smaller. Of
course, as the functions become increasingly close to constant, the naive method becomes
increasingly competitive.
5 The Arsenic Example
Arsenic exposure has been clearly linked with skin, bladder, and lung cancer occurrence
in populations highly exposed either occupationally, medicinally, or through contaminated
drinking water (National Research Council, 1999; IARC, 1987). An ongoing population-based study in New Hampshire (Karagas et al., 1998; Karagas et al., 2001) is examining the effects of arsenic on the incidence of skin and bladder cancer in response to low to moderate
exposures, primarily due to natural sources of arsenic contamination in well water. Because
of intense regulatory interest in the effects of abatement strategies, the shape of the exposure–response relationship at lower exposures is important, and strategies for nonlinear modeling
are being explored actively (Karagas and Tosteson, 2002).
Exposure assessment is accomplished through the measurement of arsenic concentrations
in both tap water from home water supplies and toenail samples for individuals newly di-
agnosed with skin or bladder cancer (cases) and individuals belonging to an age and gender
matched sample of other state residents (controls). For our example, we consider data for
215 controls and 233 basal cell skin cancer cases having both water and toenail samples.
Because we are interested in characterizing changes in cancer incidence due to changes in
arsenic water contamination, we specify the water measurement as the unbiased exposure,
taking X to be log(0.005 + level of arsenic in tap water sample) and W to be the measured
value of this quantity. The toenail arsenic measurements are interpreted as the instrumental
variable, so that S is specified as log(0.005 + level of arsenic in the toenail sample). Log
transformations were chosen to make W and S both reasonably close to normally distributed,
although some skewness remains.
Preliminary analysis ignoring measurement error showed a positive but not statistically
significant linear trend between arsenic in tap water and basal cell cancer incidence. For
the purposes of this analysis, the results were not adjusted for possible confounding factors
such as age and gender. The results for the regression spline analysis are given in Figure 3.
The naive fit ignoring measurement error shows a modest increase in the logit of basal cell
cancer incidence over the range of observed tap water arsenic levels, with some indication of
nonlinearity. The Bayes fit adjusting for measurement error shows a somewhat more uniform
increase, with the impression of less nonlinearity. The confidence bands indicate that the
overall increase is not statistically significant.
The posterior means were α0 = −2.0, α1 = 0.20, σ2x = 2.61, σ2u = 0.54, σ2ν = 0.28, µx = −1.11 and λ = 0.83, the latter indicating that the amount of attenuation is not as
great as might be supposed.
6 Some Asymptotic Calculations in Polynomial Regression
The tradeoff between bias and variance is familiar to all who work in nonparametric regres-
sion. Less well known is the bias–variance tradeoff in measurement error modeling, but the
effect is even more profound. Ignoring measurement error leads to bias, often in the form of
attenuation, namely the estimates tend to be shrunken towards zero. To correct this bias,
one typically must unshrink the estimator, an operation that causes an increase in variability.
Thus, in almost any practical context, the naive estimator is biased but much less variable
than any estimator that attempts to remove this bias.
To get some idea of the asymptotic behavior of the estimates, we performed some exact
calculations. For each of the 3 cases in Table 1, and with Case 4 being m(x) = x^2, we fit the polynomial that best captures the function on the interval µx ± 3(σ2x + σ2u)^{1/2}, i.e., we set up a grid on this interval, and fit polynomials to the function y = m(x) on the grid. The degrees of the polynomials chosen were 7, 7, 7 and 5 for Cases 1–4, respectively. The polynomial functions are given in Figure 4 on the interval µx ± 2σx. While not perfect representations of Cases 1–4, they are sufficiently close to yield some insight.
With X and U normally distributed, define σ2x|w = var(X|W) = λσ2u. Recall that if Z has a standard normal distribution, then E(Z^2r) = (2r)!/(2^r r!). Then, if the true polynomial is m(x, β) = ∑_{k=1}^d βk x^k, the observed data have the regression function E(Y|W) = ∑_{j=0}^d βj,naive W^j, where βj,naive = ∑_{k=j}^d βk k! λ^j (σx|w)^{k−j} E(Z^{k−j})/{j!(k − j)!}. In general, if the true regression function is m(x, β), then the naive estimator of β converges to βnaive, the minimizer of E[{m(X, β) − m(W, βnaive)}^2]. Once the expectation is computed as a function of βnaive, it can be minimized by any standard minimizer. The expectation itself is given as

(σ2x σ2u)^{−1/2} ∫ {m(x, β) − m(x + u, βnaive)}^2 φ{(x − µx)/σx} φ(u/σu) dx du
= (π σ2u)^{−1/2} ∫ {m(µx + √2 σx z, β) − m(µx + √2 σx z + u, βnaive)}^2 exp(−z^2) φ(u/σu) dz du
= π^{−1/2} E[∫ {m(µx + √2 σx z, β) − m(µx + √2 σx z + U, βnaive)}^2 exp(−z^2) dz],

where the expectation is over the distribution of U. The integral can be computed via Gaussian quadrature, and the expectation via simulation.
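As an illustration of these calculations, the sketch below computes βnaive entirely by Gauss–Hermite quadrature, taking both X and U normal so that the expectation over U is also a quadrature rather than a simulation, and including an intercept for simplicity. Because m(·) is linear in β, the minimizer solves the normal equations E{C(W)C(W)T} βnaive = E{C(W) m(X, β)}, where C(w) = (1, w, ..., w^d)T.

    import numpy as np

    def beta_naive(beta, mu_x, s2x, s2u, deg_quad=40):
        # Limit of the naive polynomial fit: solve the normal equations
        # E{C(W)C(W)^T} b = E{C(W) m(X, beta)} by Gauss-Hermite quadrature.
        z, wts = np.polynomial.hermite_e.hermegauss(deg_quad)
        wts = wts / np.sqrt(2.0 * np.pi)          # weights for E h(Z), Z ~ N(0,1)
        d = len(beta) - 1
        A = np.zeros((d + 1, d + 1))
        bvec = np.zeros(d + 1)
        for zi, wi in zip(z, wts):
            x = mu_x + np.sqrt(s2x) * zi
            mx = np.polyval(beta[::-1], x)        # m(x, beta); beta ordered low to high
            for zj, wj in zip(z, wts):
                Cw = (x + np.sqrt(s2u) * zj) ** np.arange(d + 1)   # (1, W, ..., W^d)
                A += wi * wj * np.outer(Cw, Cw)
                bvec += wi * wj * mx * Cw
        return np.linalg.solve(A, bvec)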
Let m(1)(·) be the derivative of m(·) with respect to β; note that because m(·) is linear in β, this derivative does not involve β. In a sample of size n, the naive estimator is the solution to the equation ∑_{i=1}^n m(1)(Wi){Yi − m(Wi, β)} = 0. Asymptotically, its variance is n^{−1} Bnaive^{−1} Anaive Bnaive^{−1}, where

Bnaive = E[m(1)(W){m(1)(W)}T];
Anaive = E[{Y − m(W, βnaive)}^2 m(1)(W){m(1)(W)}T]
= E([σ2ε + {m(X, β) − m(W, βnaive)}^2] m(1)(W){m(1)(W)}T).
Both Anaive and Bnaive can be computed either directly by simulation or by a combination
of simulation and Gaussian quadrature as described above.
Under a parametric model the Bayes estimator will be asymptotically equivalent to the maximum likelihood estimator, and hence it will be asymptotically consistent and its variance is n^{−1} I^{−1}, where I is the information matrix for β and the measurement error model parameters µx, α0, α1, σ2x, σ2u, σ2ε, σ2ν. Again, a combination of Gaussian quadrature and simulation is used. By simple calculations, the likelihood is

(σ2ε σ2u σ2ν π)^{−1/2} ∫ φ{(Y − m(µx + √2 σx z, β))/σε} φ{(W − µx − √2 σx z)/σu}
× φ{(S − α0 − α1(µx + √2 σx z))/σν} exp(−z^2) dz.

We computed the score L(Y, W, S, ·), the derivative of the loglikelihood, via numerical differentiation. The information I = E{L(Y, W, S, ·) LT(Y, W, S, ·)} can be computed by simulation.
On a grid of values xi, i = 1, ..., ngrid, the mean squared bias of the naive estimator is ngrid^{−1} ∑_{i=1}^{ngrid} {m(xi, β) − mnaive(xi, βnaive)}^2. Since m(·) and mnaive(·) are linear in β, the average variance based on a sample of size n is n^{−1} trace(Bnaive^{−1} Anaive Bnaive^{−1} C), where C = ngrid^{−1} ∑_{i=1}^{ngrid} m(1)(xi){m(1)(xi)}T. Ignoring any small–sample bias in the maximum likelihood estimator, its asymptotic variance is n^{−1} trace(I^{−1} C).
For a sample of size n = 100, ignoring any possible small–sample bias in the maximum
likelihood estimator, the results are given in Table 3. Basically, the message from this table
is the same one we have made previously: for such small sample sizes, the excess variance
of the (asymptotically) best parametric method for correcting bias due to measurement
error makes the naive approach at least comparable in terms of mean squared error. This
(asymptotic) fact of life shows up somewhat in our simulations, although we have noted
there that the Bayesian methods actually perform quite a bit better than our asymptotics
would suggest.
Careful readers will note that the numbers in Tables 1 and 3 are not identical. This is
because the latter uses asymptotics, different functions, and different estimation methods.
The qualitative message is, however, the same.
7 Summary and Further Discussion
Our main theoretical contribution is to show that all parameters including the regression
function are identified in the instrumental variables model without prior knowledge of the
slope of the regression of the instrument on the true X. This result extends the applicability
of IV estimation to many interesting examples including our case study of the risk of skin
cancer due to arsenic exposure.
Our second main result is the characterization of instrumental variables estimation as a
correction for attenuation, so that the measurement error variance can be estimated from
moments of the observed instruments. This allows us to use some of the methods from the
classical measurement error literature.
We have developed two IV estimates, a functional estimator that applies SIMEX and
a Bayesian structural estimator that uses MCMC. Simulation shows that the Bayesian
structural estimator outperforms the functional estimator. Moreover, the structural estima-
tor appears robust to misspecification of the distribution of the true covariate X, although
there are surely situations for which an X-distribution is so far from normal that the Bayes
estimator will be badly affected.
In our example, the designation of tap water as the unbiased exposure measure reflects
a certain interpretation of the fitted regression curve, that the curve is the probability of
skin cancer given a level of exposure in drinking water. However, in practice, total arsenic
exposure includes not only the amount consumed but exposure from other sources such as
food.
Another formulation would focus on the dose response for a biologically active arsenic
exposure, for which toenail concentrations could be taken as an unbiased measure. Concep-
tually, this would introduce an additional latent variable to represent the biologically active
exposure, D, which would depend on true tap water concentrations in a linear fashion. Re-
taining the designation of W for transformed value of measured tap water arsenic and S for
toenail, we could rewrite our model as
Y = mpoly(D, β) + ε
W = X + U
D = α0 + α1X + ν
S = D + ξ.
This model will be the focus of future research.
References
Albert, J. H. and Chib, S. (1993), “Bayesian analysis of binary and polychotomous response
data”, Journal of the American Statistical Association, 88, 669–679.
Amemiya, Y. (1990), “Two-stage instrumental variable estimators for the nonlinear errors-
in-variables model”, Journal of Econometrics, 44, 311–332.
Azzalini, A. (1985), “A class of distributions which includes the normal ones”, Scandinavian
Journal of Statistics 12, 171-178.
Berry, S. A., Carroll, R. J. and Ruppert, D. (2002), “Bayesian smoothing and regression
splines for measurement error problems”, Journal of the American Statistical Associa-
tion, 97, 160–169.
Box, G. E. P. and Tiao, G. (1973), Bayesian Inference in Statistical Analysis, Addison–
Wesley, London.
Buzas, J. S. and Stefanski, L. A. (1996), “Instrumental variable estimation in generalized
linear measurement error models”, Journal of the American Statistical Association, 91,
999–1006.
Carroll, R. J., Kuchenhoff, H., Lombard, F., and Stefanski, L. A. (1996), “Asymptotics for
the SIMEX estimator in structural measurement error models,” Journal of the American
Statistical Association, 91, 242–250.
Carroll, R. J., Maca, J. D. and Ruppert, D. (1999), “Nonparametric regression with errors
in covariates”, Biometrika, 86, 541–554.
Carroll, R. J., Ruppert, D., and Stefanski, L. A. (1995), Measurement Error in Nonlinear
Models, Chapman and Hall, New York.
Carroll, R. J. and Stefanski, L. A. (1994), “Meta–analysis, measurement error and corrections
for attenuation”, Statistics in Medicine, 13, 1265–1282.
Cook, J. R. and Stefanski, L. A. (1994), “Simulation–extrapolation estimation in parametric
measurement error models”, Journal of the American Statistical Association, 89, 1314–
1328.
Dominici, F., Zeger, S. L. and Samet, J. M. (2000), “Combining evidence on air pollution and
daily mortality from the largest 20 US cities: a hierarchical modeling strategy”, Journal
of the Royal Statistical Society, Series A, 163, 263–302.
Durrett, R. (1996), Probability: Theory and Examples, 2nd ed., Duxbury, Belmont, CA.
Eilers, P. H. C. and Marx, B. D. (1996), “Flexible smoothing with B–splines and penalties”
(with discussion), Statistical Science, 11, 89–102.
Fan, J. and Truong, Y. K. (1993), “Nonparametric regression with errors in variables”,
Annals of Statistics, 21, 1900–25.
Green, P. J. and Silverman, B. W. (1994), Nonparametric Regression and Generalized Linear
Models: A Roughness Penalty Approach, Chapman and Hall, London.
Hastie, T. and Tibshirani, R. (1990), Generalized Additive Models, Chapman and Hall, New York.
Hausman, J. A., Newey, W. K., Ichimura, H. and Powell, J. L. (1991), “Identification and
estimation of polynomial errors–in–variables models”, Journal of Econometrics, 50, 273–
295.
IARC, (1987), “Arsenic and Arsenic compounds (Group 1),” In: Monographs on the Evalu-
ation of Carcinogenic Risk of Chemicals to Humans, Supplement 7, pp. 100-106, Inter-
national Agency for Research on Cancer.
Karagas, M. R., Stukel, T. A., Morris, J. S., Tosteson, T. D., Weiss, J. A., Spencer, S. K. and
Greenberg, E. R., (2001), “Skin cancer risk in relation to toenail arsenic concentrations
in a US population-based case-control study”, American Journal of Epidemiology, 153,
559-65.
Karagas, M. R. and Tosteson, T. D. (2002), “Assessment of cancer risk and environmental lev-
els of arsenic in New Hampshire”, International Journal of Hygiene and Environmental
Health (in press).
Karagas, M. R., Tosteson, T. D., Blum, J., Morris, J. S., Baron, J. A. and Klaue, B.,
(1998), “Design of an epidemiologic study of drinking water arsenic exposure and skin
and bladder cancer risk in a US population”, Environmental Health Perspectives, 106,
1047-1050.
National Research Council, (1999), Arsenic in Drinking Water, National Academy Press,
Washington, DC.
Reeves, G. K., Cox, D. R., Darby, S. C. and Whitley, E. (1998), “Some aspects of mea-
surement error in explanatory variables for continuous and binary regression models”,
Statistics in Medicine, 17, 2157–2177.
Robert, C. P. (1995), “Simulation of truncated normal variables”, Statistics and Computing,
5, 121–125.
Ruppert, D. (2002), “Selecting the number of knots for penalized splines”, Journal of Com-
putational and Graphical Statistics, 11, 735–757.
Stefanski, L. A. and Buzas, J. S. (1995), “Instrumental variable estimation in binary regres-
sion measurement error variables”, Journal of the American Statistical Association, 90,
541–550.
Stefanski, L. A. and Cook, J. R. (1995), “Simulation–extrapolation: the measurement error
jackknife,” Journal of the American Statistical Association, 90, 1247–1256.
Strauss, W. J., Carroll, R. J., Bortnick, S. M., Menkedick, J. R. and Schulz, B. D. (2001),
“Combining datasets to predict the effects of regulation of environmental lead exposure
in housing stock”, Biometrics, 57, 203–210.
Appendix
A.1 Proof of Theorem 1
Let M be the smallest positive integer k such that cov[m(X), {X − E(X)}^k] is not zero. Then

cov[Y, {S − E(S)}^M] = cov[m(X), {α1(X − E(X)) + ν}^M]
= ∑_{j=0}^M (M choose j) α1^j cov[m(X), {X − E(X)}^j ν^{M−j}]
= α1^M cov[m(X), {X − E(X)}^M], (A.1)

since by (7), cov[m(X), {X − E(X)}^j ν^{M−j}] = cov[m(X), {X − E(X)}^j] E(ν^{M−j}), and by the definition of M we have cov[m(X), {X − E(X)}^j] = 0 for 1 ≤ j < M.

By an identical calculation,

cov[Y, {W − E(W)}^M] = cov[m(X), {X − E(X)}^M]. (A.2)

Then, by (A.1) and (A.2),

α1^M = cov[Y, {S − E(S)}^M] / cov[Y, {W − E(W)}^M]. (A.3)

If M is odd, then (A.3) determines α1 from moments of observables. If M is even, then α1 is only determined up to its sign by (A.3), but then its sign can be determined by the relation cov(W, S) = α1 σ2x and the assumption that σ2x > 0.
A.2 On Condition (8)
We now prove the following result showing that if X is compactly supported and m is
continuous, then (8) holds unless m(·) is constant. The condition that X is compactly
supported cannot be removed. A counterexample can be constructed using Counterexample
1 on page 107 of Durrett (1996). In that counterexample, it is shown that there are densities
distinct from the lognormal density but with the same moments as the lognormal. If fX is
the density of X and if m · fX is the difference between two distinct densities with the same
moments, then clearly E[m(X){X − E(X)}^k] = 0 for all k.
Theorem 2: Suppose that the support of X is contained in a compact interval [a, b] and that
m(·) is continuous on [a, b]. If
cov[m(X), {X − E(X)}^k] = 0 for all k, (A.4)
then var{m(X)} = 0 so that P [m(X) = E{m(X)}] = 1.
Proof: By the Weierstrass approximation theorem, for all δ > 0 there exists a polynomial mpoly(·) such that |m(x) − E{m(X)} − mpoly(x)| < δ for all x ∈ [a, b]. By (A.4), m(X) and mpoly(X) have zero covariance, so that

var{m(X)} = E[m(X) − E{m(X)} − mpoly(X)]^2 − E{mpoly(X)^2}
≤ δ^2(b − a) − E{mpoly(X)^2} ≤ δ^2(b − a). (A.5)

Since δ > 0 is arbitrary, the result follows.
A.3 MCMC Calculations in the Gaussian Case
In the Gaussian case, m(x) = C1(x)Tβ1 + C2(x)Tβ2, where β2 ∼ Normal(0, σ2D), and D is a k × k matrix. For the regression spline, D was chosen as the identity matrix. Priors for α0, α1, µx and β1 were independent normals with mean zero and (large) variances σ2α, σ2α, σ2µ and σ2βI, respectively. The prior for the attenuation λ was uniform on the interval [λL, λH]. Of course, by simple algebra, σ2x = λσ2u/(1 − λ). Priors for σ2ε, σ2u, σ2ν and σ2 were inverse Gamma with parameters (aε, bε), (au, bu), (aν, bν), (aσ, bσ), respectively, where the IG(A, B) density is given by {Γ(A) B^A x^{A+1}}^{−1} exp{−1/(Bx)}. Let C(x) = {C1(x)T, C2(x)T}T, and define

H = ∑_{i=1}^n C(Xi)Yi/σ2ε;
Q = {∑_{i=1}^n C(Xi)C(Xi)T/σ2ε + diag(I/σ2β, Ik/σ2)}^{−1};
D = {∑_{i=1}^n (1, Xi)T(1, Xi)/σ2ν + I2/σ2α}^{−1}; and A = ∑_{i=1}^n (1, Xi)T Si/σ2ν.

The joint density of the data and the parameters, i.e., the unnormalized posterior density, is proportional to
exp[−∑_{i=1}^n {Yi − C1(Xi)Tβ1 − C2(Xi)Tβ2}^2/(2σ2ε) − ∑_{i=1}^n (Wi − Xi)^2/(2σ2u) − ∑_{i=1}^n (Si − α0 − α1Xi)^2/(2σ2ν)
− ∑_{i=1}^n (1 − λ)(Xi − µx)^2/(2λσ2u) − µx^2/(2σ2µ) − α0^2/(2σ2α) − α1^2/(2σ2α) − β1Tβ1/(2σ2β) − β2TD^{−1}β2/(2σ2)
− 1/(bεσ2ε) − 1/(buσ2u) − 1/(bνσ2ν) − 1/(bσσ2)]
× (σ2ε)^{−(aε+n/2+1)} (σ2u)^{−(au+n+1)} (σ2ν)^{−(aν+n/2+1)} (σ2)^{−(aσ+k/2+1)} {(1 − λ)/λ}^{n/2}.
The complete conditionals are as follows:
µx = Normal{X̄ n(1 − λ)σ2µ / {n(1 − λ)σ2µ + λσ2u}, λσ2u σ2µ / {λσ2u + n(1 − λ)σ2µ}}, where X̄ is the sample mean of the Xi;
σ2u = IG(au + n, [1/bu + (1/2){(1 − λ)/λ} ∑_{i=1}^n (Xi − µx)^2 + (1/2) ∑_{i=1}^n (Wi − Xi)^2]^{−1});
σ2ε = IG(aε + n/2, [1/bε + (1/2) ∑_{i=1}^n {Yi − C1(Xi)Tβ1 − C2(Xi)Tβ2}^2]^{−1});
σ2ν = IG[aν + n/2, {1/bν + (1/2) ∑_{i=1}^n (Si − α0 − α1Xi)^2}^{−1}];
σ2 = IG[aσ + k/2, {1/bσ + (1/2) β2TD^{−1}β2}^{−1}];
(α0, α1) = Normal(DA, D);
(β1T, β2T)T = Normal(QH, Q);
Xi ∝ exp[−{Yi − C1(Xi)Tβ1 − C2(Xi)Tβ2}^2/(2σ2ε) − (Wi − Xi)^2/(2σ2u) − (Si − α0 − α1Xi)^2/(2σ2ν) − (1 − λ)(Xi − µx)^2/(2λσ2u)].
In addition,

λ ∝ I(λL ≤ λ ≤ 1) {(1 − λ)/λ}^{n/2} exp{−∑_{i=1}^n (1 − λ)(Xi − µx)^2/(2λσ2u)}. (A.6)
All the complete conditionals except for λ and the X’s are easily generated. For λ,
in our simulations, we discretized the set λ ∈ [λL, λH ] into 41 different values, computed
(A.6) for these values, turned the result into probabilities, and sampled λ according to
these probabilities. This gridded Gibbs estimator is not strictly correct, of course, but it
is convenient and provides good mixing. We also implemented a full Metropolis–Hastings
step: mixing was not quite as good, thus requiring somewhat more MCMC samples, but
in selected test cases we found that the final fits to the regression function were virtually
identical to our gridded method.
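A sketch of the gridded Gibbs step for λ: evaluate (A.6) on a grid over [λL, λH), normalize, and sample. Working on the log scale avoids underflow, and the endpoint λ = 1 is excluded because the density vanishes there. This is an illustration, not our MATLAB code.

    import numpy as np

    def sample_lambda(X, mu_x, sigma2_u, lam_lo=0.60, lam_hi=1.00, ngrid=41, rng=None):
        # Evaluate the complete conditional (A.6) on a grid, normalize, sample.
        rng = rng or np.random.default_rng()
        n = len(X)
        lam = np.linspace(lam_lo, lam_hi, ngrid, endpoint=False)  # density vanishes at 1
        ss = np.sum((X - mu_x) ** 2)
        log_p = 0.5 * n * (np.log1p(-lam) - np.log(lam)) \
                - (1.0 - lam) * ss / (2.0 * lam * sigma2_u)
        log_p -= log_p.max()                 # stabilize before exponentiating
        p = np.exp(log_p)
        return rng.choice(lam, p=p / p.sum())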
For the X's, the complete conditional is not explicit. We used Metropolis–Hastings steps
where the candidate density was normal with the current value of X as the mean and the
variance being 1/2 times the conditional variance for X given (W,S), the latter variance
evaluated at the current parameter values.
In our simulations, the prior distributions were as follows: σ2u = IG(1, 1), σ2ν = IG(1, 1), σ2ε = IG(1, 1), λ = U[0.60, 1.00], µx = Normal(0, 100), α0 = Normal(0, 100), α1 = Normal(0, 100), and σ2 = IG(1, 1000). We also used σ2 = IG(0.01, 100) without appreciable differences in some test cases.
The model can be extended to incorporate possible prior information on the parameters µx, α0, α1, β1, β2. Since we have no such prior information, we did not implement the following calculations. An additional enhancement of the model is to consider the covariance matrix D of β2 unknown and allow an inverse Wishart prior for its distribution. This is equivalent to assuming a multivariate t-distribution on the coefficients of the spline basis functions (e.g., Box and Tiao, 1973, Theorem 8.5.1) instead of the normal distribution. For these parameters, consider the following new set of priors:

µx = Normal(µ0, σ2µ0);
(α0, α1) = Normal(a, Σa);
(β1T, β2T)T = Normal{(β1,0T, 0T)T, diag(Σβ1, Σβ2)};
Σβ2 = Inverse Wishart(R0, q0).

Here 0 is a vector of zeros representing the mean of the vector β2, and a Wishart distribution with parameters (R0, q0) has pdf proportional to

(det Σ)^{q0/2−1} exp{−(1/2) trace(ΣR0)}.
With these new priors the posterior distributions for μ_x, (α_0, α_1), β_1, β_2 become

μ_x = Normal{ [X̄ n(1 − λ)σ²_{μ_0} + μ_0 λσ²_u] / [n(1 − λ)σ²_{μ_0} + λσ²_u], λσ²_u σ²_{μ_0} / [λσ²_u + n(1 − λ)σ²_{μ_0}] };

(α_0, α_1) = Normal(DA, D);

(β_1^T, β_2^T)^T = Normal(QH, Q),

where now

D = { Σ_{i=1}^n (1, X_i)^T (1, X_i)/σ²_ν + Σ_a^{−1} }^{−1},  A = Σ_{i=1}^n (1, X_i)^T S_i/σ²_ν + Σ_a^{−1} a,

H = Σ_{i=1}^n C(X_i) Y_i/σ²_ε + {(Σ_{β_1}^{−1} β_{1,0})^T, 0^T}^T,

and

Q = { Σ_{i=1}^n C(X_i) C^T(X_i)/σ²_ε + diag(Σ_{β_1}^{−1}, Σ_{β_2}^{−1}) }^{−1}.

Finally, the posterior distribution of Σ_{β_2} is

Σ_{β_2} = Inverse Wishart(R_0 + β_2 β_2^T, q_0),

and all the other posterior distributions remain unchanged.
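As an illustration, this conjugate draw can be made with scipy's inverse Wishart; note that scipy parametrizes invwishart by (df, scale) under conventions that need not coincide with the (R_0, q_0) convention above, so the mapping shown is an assumption to be checked against one's chosen parametrization:

    import numpy as np
    from scipy.stats import invwishart

    def update_Sigma_beta2(beta2, R0, q0, rng=None):
        # Posterior scale matrix R0 + beta2 beta2^T, as in the text; the
        # identification of q0 with scipy's degrees of freedom is assumed.
        scale = R0 + np.outer(beta2, beta2)
        return invwishart.rvs(df=q0, scale=scale, random_state=rng)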
A.4 MCMC Calculations in the Probit Model
We fit a probit regression model, turning it into a logistic fit by the usual device: if the
probability is p, then the logit function is log{p/(1−p)}. Note that we are not approximating
the logit model by a probit model. Rather, our method is exact since if the logit of P (Y =
1|X) is a smooth function of X, then the probit of P (Y = 1|X) is another smooth function
of X.
For the probit model, we modified the method of Albert and Chib (1993). Specifically, one defines latent variables Z_i that are normally distributed with mean C_1(X_i)β_1 + C_2(X_i)β_2 and variance 1.0, so that Y_i = I(Z_i > 0). Given the values of the Z_i, the MCMC steps of Section A.3 apply without change, with two exceptions: (a) σ²_ε = 1 is known a priori; and (b) Z_i replaces Y_i in that section. This means that the only thing necessary in the MCMC steps is to generate values of the Z_i from their complete conditional distribution. Write μ_i = C_1(X_i)β_1 + C_2(X_i)β_2. The density of Z_i given the rest is

f(Z_i | rest) ∝ {Y_i I(Z_i > 0) + (1 − Y_i) I(Z_i ≤ 0)} exp{ −(1/2)(Z_i − μ_i)² }.
This means that if Y_i = 1, then Z_i is a truncated normal random variable, i.e., a normal random variable with mean μ_i and variance 1.0, truncated on the left at 0.0. If Y_i = 0, then Z_i is a normal random variable with mean μ_i and variance 1.0, truncated on the right at 0.0. Define R_i = 1 − 2I(Y_i = 1), and let TN(a, b) denote a normal random variable with mean a and variance 1.0, truncated on the left at b. Then the complete conditional of Z_i is Z_i ∼ μ_i − R_i TN(0, R_i μ_i). To generate these truncated normals, we used the accept–reject
algorithm of Robert (1995), with the following modification: when generating a normal random variable truncated from the left (right) at 0 with a positive (negative) mean, we did not use Robert's algorithm but instead generated normals at random until one was positive (negative).
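Alternatively, the latent-variable draw can be made with a library truncated normal rather than an accept–reject scheme. A minimal Python sketch with illustrative names, vectorized over observations:

    import numpy as np
    from scipy.stats import truncnorm

    def sample_latent_Z(y, mu, rng=None):
        # Z_i ~ N(mu_i, 1) truncated to (0, inf) when y_i = 1 and to
        # (-inf, 0] when y_i = 0, as in the complete conditional above.
        lower = np.where(y == 1, 0.0, -np.inf)
        upper = np.where(y == 1, np.inf, 0.0)
        # scipy's truncnorm takes the truncation limits on the N(0,1) scale
        return truncnorm.rvs(lower - mu, upper - mu, loc=mu, scale=1.0,
                             random_state=rng)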
While the candidate density for X discussed in the Gaussian case (Section A.3) worked well enough there, we found that it was much less efficient in the probit model. The following gave better mixing and faster convergence of the sampler. Suppose that the current value of X_i is X_{curr,i}, and that the latent variables are Z_i. Let β_lin be the simple linear regression estimate from regressing {Z_i}_{i=1}^n on {X_{curr,i}}_{i=1}^n. Our candidate density was the density of X given (Z, W, S), assuming a linear model relating Z and X with coefficients β_lin and using the current values of μ_x, σ²_x, α_0, α_1, σ²_u and σ²_ν. In all cases investigated, the Metropolis–Hastings acceptance rate for X was over 95%.
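A sketch of this proposal, again in illustrative Python and assuming the residual variance of the regression of Z on X is fixed at 1 (the probit scale):

    import numpy as np

    def propose_X_probit(X_curr, Z, W, S, mu_x, sigma2_x, alpha0, alpha1,
                         sigma2_u, sigma2_nu, rng):
        # Simple linear regression of Z on the current X's.
        b_lin = np.cov(X_curr, Z, bias=True)[0, 1] / np.var(X_curr)
        a_lin = Z.mean() - b_lin * X_curr.mean()
        # Combine the X distribution with the W, S and (approximate) Z
        # likelihoods: the conditional of X given (Z, W, S) is normal with
        # the usual precision-weighted mean.
        prec = (1 / sigma2_x + 1 / sigma2_u + alpha1 ** 2 / sigma2_nu
                + b_lin ** 2)
        mean = (mu_x / sigma2_x + W / sigma2_u
                + alpha1 * (S - alpha0) / sigma2_nu
                + b_lin * (Z - a_lin)) / prec
        return rng.normal(mean, np.sqrt(1.0 / prec))

The draw is vectorized over i; because the proposal is not symmetric, a Metropolis–Hastings accept–reject step with the exact complete conditional, and with the proposal density appearing in the ratio, would follow.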
A.5 WinBUGS Code for the Gaussian Model

model
{
    # Likelihood description corresponding to (W, S).
    # Notation is the same as in the paper.
    for (i in 1:N) {
        W[i] ~ dnorm(X[i], tauu)
        S[i] ~ dnorm(mS[i], taunu)
        X[i] ~ dnorm(mux, taux)
        mS[i] <- alpha0 + alpha1 * X[i]
    }
    # In general, tau denotes a precision (1/variance): taux is the
    # precision of the distribution of X, tauu the precision of the
    # distribution of U; lambda is the attenuation.
    taux <- (1 - lambda) * tauu / lambda
    # Construct the matrix Z of truncated spline polynomials.
    # pow() is the power function (second argument is the exponent);
    # step() gives the truncation (plus) function: it equals 1 if its
    # argument is greater than 0 and 0 otherwise.
    for (i in 1:N) {
        for (k in 1:K) {
            Z[i, k] <- pow(X[i] - Knots[k], 2) * step(X[i] - Knots[k])
        }
    }
    # Likelihood description corresponding to the Y observations:
    # m1[] stores the quadratic polynomial in X, m2[] the truncated
    # spline part of the regression, m[] is the mean of Y[], and
    # taueps is the precision of Y[].
    for (i in 1:N) {
        m1[i] <- beta0 + beta1 * X[i] + beta2 * pow(X[i], 2)
        m2[i] <- inprod(b[], Z[i, ])
        m[i] <- m1[i] + m2[i]
        Y[i] ~ dnorm(m[i], taueps)
    }
    # Prior on b[], the coefficients of the truncated spline basis.
    for (k in 1:K) {
        b[k] ~ dnorm(0, tau)
    }
    # Priors on the parameters of the model. Gamma(a, b) has mean a/b
    # and variance a/b^2; in normal distributions the second parameter
    # is the precision.
    lambda ~ dunif(0.6, 1)
    taueps ~ dgamma(1, 1)
    tauu ~ dgamma(1, 1)
    taunu ~ dgamma(1, 1)
    tau ~ dgamma(0.01, 0.01)
    mux ~ dnorm(0, 0.01)
    alpha0 ~ dnorm(0, 0.01)
    alpha1 ~ dnorm(0, 0.01)
    beta0 ~ dnorm(0, 0.01)
    beta1 ~ dnorm(0, 0.01)
    beta2 ~ dnorm(0, 0.01)
    # The rest contains only deterministic transformations, used to
    # compute MASE. BA[] is the grid of X values where the regression
    # function is evaluated; ZBA[] is the analogue of Z[] for the grid.
    for (i in 1:NB) {
        for (k in 1:K) {
            ZBA[i, k] <- pow(BA[i] - Knots[k], 2) * step(BA[i] - Knots[k])
        }
    }
    # Regression function at the grid points and squared difference
    # from the true function func[].
    for (i in 1:NB) {
        meanBA[i] <- beta0 + beta1 * BA[i] + beta2 * pow(BA[i], 2) + inprod(b[], ZBA[i, ])
        distsquare[i] <- pow(meanBA[i] - func[i], 2)
    }
    # Compute MASE.
    MASE <- (NB - 1) * mean(distsquare[])
}
                                    Case 1            Case 2            Case 3
                                 Mean     Mean     Mean     Mean     Mean     Mean
Sample                           Squared  Squared  Squared  Squared  Squared  Squared
Size    Distribution  Method     Bias     Error    Bias     Error    Bias     Error
100     Normal        Naive       1.40     1.98     7.27     8.43     2.99     3.72
                      SIMEX(L)    0.82     1.61     6.56     8.19     2.72     3.77
                      SIMEX(Q)    0.52     3.31     4.60    11.25     1.92     5.90
                      Bayes       0.21     1.02     2.51     4.40     1.29     2.97
        Uniform       Naive       0.91     1.64     5.94     7.09     2.61     3.34
                      SIMEX(L)    0.57     1.40     5.32     6.59     2.31     3.14
                      SIMEX(Q)    0.43     3.33     2.86     7.34     1.29     4.20
                      Bayes       0.19     0.78     2.61     3.80     1.62     2.44
        Skew Normal   Naive       1.38     2.11     9.64    10.91     3.28     4.12
                      SIMEX(L)    0.84     1.68     9.87    11.26     3.36     4.26
                      SIMEX(Q)    0.58     3.57     8.36    13.17     2.59     5.34
                      Bayes       0.29     1.21     4.71     6.76     1.44     3.28

Table 1: 100 × mean squared bias and 100 × mean squared error for the simulation for the
spline Gaussian error model. In Case 1, the regression function is 1/{1 + exp(4x)}. In Case
2, the regression function is sin(πx/2)/(1 + [2x²{1 + sin(πx/2)}]). In Case 3, the regression
function is sin(πx/2)/(1 + [2x²{1 + sign(x)}]).
                 Naive                            Bayes
Case   Mean Squared  Mean Squared     Mean Squared  Mean Squared
       Bias          Error            Bias          Error
  1        9.91          12.72            0.53           5.95
  2        4.82           9.43            0.17           4.30
  3        1.76           4.43            0.12           2.95
  4        6.39           9.92            1.02           5.33
  5        3.45           7.03            0.43           3.78
  6        1.70           5.10            0.14           3.00
  7        7.16          11.08            3.68           6.66
  8        4.98           9.21            3.38           6.15
  9        7.14          11.19            3.70           8.63
 10        4.79           7.92            2.81           6.10
 11        1.72           4.60            1.56           4.33
 12        0.45           2.96            0.40           2.82
 13        0.02           2.21            0.01           2.34

Table 2: 100 × mean squared bias and 100 × mean squared error for the simulation for the
spline logistic error model. In this table, the 13 cases were as follows: Case 1 means
m(x) = x; Case 2 means m(x) = 0.75x; Case 3 means m(x) = 0.50x; Case 4 means
m(x) = [4/{1 + exp(−x)}] − 2; Case 5 means m(x) = 0.75([4/{1 + exp(−x)}] − 2); Case 6
means m(x) = 0.50([4/{1 + exp(−x)}] − 2); Case 7 means m(x) = (x − 0.75)+ − (−0.75 − x)+;
Case 8 means m(x) = (x − 1.0)+ − (−1.0 − x)+; Case 9 means m(x) = x+; Case 10 means
m(x) = 0.75x+; Case 11 means m(x) = 0.50x+; Case 12 means m(x) = 0.25x+; and Case 13
means m(x) ≡ 0. Here n = 500, there were 200 simulated data sets, there were 8,000 MCMC
steps of which the first 4,000 were burn-in, the degree of the polynomial was d = 1, the
number of knots was 10, the functions were evaluated on a grid from −2.0 to 2.0, σ²_x = 1,
σ²_u = 0.32, σ²_ν = 1, α_0 = 0, α_1 = 1, μ_x = 0, and the attenuation was confined to the
interval λ = σ²_x/(σ²_x + σ²_u) ∈ (0.60, 1.00). Biases and mean squared errors were
computed for x ∈ [−2.0, 2.0].
       Naive      Naive      MLE        Variance Ratio:
Case   100×RASB   100×RMSE   100×RMSE   MLE to Naive
  1     9.82       12.49      13.43         3.03
  2    14.09       16.43      14.70         3.03
  3     9.33       11.92      14.74         3.95
  4    63.42       68.65      35.44         1.82

Table 3: Asymptotic calculations for polynomial approximations to 4 functions in the
Gaussian case. Here "RASB" denotes the square root of the average squared bias, while
"RMSE" is the square root of the mean squared error. In these calculations, it was assumed
that the sample size was n = 100 and that the MLE had no small-sample bias. In Case 1, the
target regression function is 1/{1 + exp(4x)}. In Case 2, the target regression function is
sin(πx/2)/(1 + [2x²{1 + sin(πx/2)}]). In Case 3, the target regression function is
sin(πx/2)/(1 + [2x²{1 + sign(x)}]). In Case 4, the regression function is x².
[Figure 1 consists of four panels, each plotting the True, Naive, SIMEX, and Bayes fits over x ∈ [−2, 2].]
Figure 1: Results from the simulations corresponding to Table 1, Normal case, Case 3. The
top left, top right, and bottom left are 3 simulated data sets. The bottom right is the mean
over all simulated data sets.
[Figure 2 consists of four panels, each plotting the True, Naive, SIMEX, and Bayes fits over x ∈ [−2, 2].]
Figure 2: Results from the simulations corresponding to Table 1, Normal case, Case 2. The
top left, top right, and bottom left are 3 simulated data sets. The bottom right is the mean
over all simulated data sets.
[Figure 3, titled "Nail Arsenic = Instrument, Basal Cell Cancers, degree = 2", plots the Bayes fit with uniform Bayes confidence bands and the Naive fit over x ∈ [−4, 4].]
Figure 3: Logit of the probability of basal cell cancer as a function of X, the transformed
value of the arsenic concentration in the drinking water.
[Figure 4 consists of four panels, titled Case 1 through Case 4, each plotting the Actual and True curves over x ∈ [−2, 2].]
Figure 4: The "actual" functions (solid lines) from Cases 1–4 for the Gaussian simulation
and their "true" polynomial approximations (dashed lines) used in computing theoretical
asymptotic distributions.