Nonlinear and Nonparametric Regression and
Instrumental Variables
Raymond J. Carroll, David Ruppert, Ciprian M. Crainiceanu,
Tor D. Tosteson and Margaret R. Karagas
February 25, 2003
Abstract
We consider regression when the predictor is measured with error and an in-
strumental variable is available. The regression function can be modeled linearly,
nonlinearly, or nonparametrically. Our major new result shows that the regres-
sion function and all parameters in the measurement error model are identified
under relatively weak conditions, much weaker than previously known to imply
identifiability. In addition, we develop an apparently new characterization of the
instrumental variable estimator: it is in fact a classical “correction for attenuation” method based on a particular estimate of the variance of the measurement
error. This estimate of the measurement error variance allows us to construct
functional nonparametric regression estimates, by which we mean that no as-
sumptions are made about the distribution of the unobserved predictor. The
general identifiability results also allow us to construct structural methods of
estimation under parametric assumptions on the distribution of the unobserved
predictor. The functional method uses SIMEX and the structural method uses
Bayesian computing machinery. The Bayesian estimator is found to significantly
outperform the functional approach.
KEY WORDS: Bayesian methods; Errors in variables; Functional method; Generalized linear models; Identifiability; Instrumental variables; Measurement error; Nonparametric regression; P–splines; Regression splines; SIMEX; Smoothing splines; Structural modeling.
Short title. Regression with Instrumental Variables
Author Affiliations
Raymond J. Carroll (E-mail: [email protected]), Distinguished Professor, Department
of Statistics and Faculties of Nutrition and Toxicology, Texas A&M University, College
Station TX 77843–3143, USA.
David Ruppert (E-mail: [email protected]), Professor, School of Operations Research
& Industrial Engineering, Cornell University, Ithaca, NY 14853-3801, USA.
Ciprian M. Crainiceanu (E-mail: [email protected]), Graduate student, Department of Statistical Science, Cornell University, Malott Hall, Ithaca, NY 14853, USA.
Tor D. Tosteson (E-mail: [email protected]), Associate Professor, Com-
munity and Family Medicine, Dartmouth Medical School, Hanover, NH 03755, USA.
Margaret R. Karagas (E-mail: [email protected]), Associate Professor,
Community and Family Medicine, Dartmouth Medical School, Hanover, NH 03755, USA.
Acknowledgments
Carroll’s research was supported by a grant from the National Cancer Institute (CA57030),
and by the Texas A&M Center for Environmental and Rural Health via a grant from the
National Institute of Environmental Health Sciences (P30–ES09106). Ruppert, Tosteson
and Karagas were supported by the National Cancer Institute (CA50597), and Tosteson and Karagas were supported by the National Cancer Institute (CA50597, CA57494) and the National Institute of Environmental Health Sciences (ES07373). We thank Professor Tailen Hsing for showing us the Durrett (1996) counterexample used in our Appendix.
1 Introduction
1.1 Background and Problem Statement
The measurement error literature, already large, has continued to expand into complex modeling situations; see Reeves, et al. (1998), Dominici, et al. (2000), and Strauss, et al. (2001)
for recent environmental problems where measurement error plays a major role.
Motivated by a problem in environmental epidemiology (Section 5), we consider the
problem of measurement error in regression where the regression function could be modeled
linearly, nonlinearly, or even nonparametrically. In the case that the measurement error
variance is known or can be estimated by replication of the error–prone predictor, functional
(Carroll, et al., 1999) and structural/Bayesian (Berry, et al., 2002) methods have been
developed.
In our example, based on an epidemiologic study of skin cancer and arsenic exposure
(Karagas, et al., 2001), the error-prone predictor is not replicated, so another approach is
needed. In this study, information on the measurement error is provided in the form of a
second measure of exposure, which we use as an instrumental variable. Estimation using an
instrumental variable is a surprisingly difficult problem because even in polynomial regres-
sion, identifiability of the regression function is a major issue, as described by Hausman, et
al. (1991). In our approach, we use a slight modification of their model for the instrument.
We provide simple and explicit assumptions under which identifiability is assured.
There is some work on parametric but not necessarily linear regression with an instrumen-
tal variable (Hausman, et al., 1991; Amemiya, 1990; Carroll and Stefanski, 1994; Stefanski
and Buzas, 1995; Buzas and Stefanski, 1996). These methods are either only applicable for
special parametric models, or for general parametric models they rely on small–error ap-
proximations that are known to fail for some nonlinear and nonparametric models (Carroll,
et al., 1995). To the best of our knowledge, there are no techniques presently available for
nonparametrically specified regression functions in the instrumental variable context.
Our identifiability result is related to a simple yet apparently new characterization of
the instrumental variable estimator. Specifically, we show that in simple linear regression
with a scalar instrument, the usual instrumental variables estimator is in fact a version of the
classical “correction for attenuation” method based on a specific estimate of the measurement
error variance. Because we can thus estimate the measurement error variance, this means
that we can apply methods from the classical error literature, particularly functional methods
that make no assumptions about the distribution of the unobserved predictor.
1.2 Consistent Estimation in Nonparametric Regression
Some readers of this paper have asked us to show the consistency of our estimators in non-
parametric regression. We now discuss why even attempting to create consistent estimators
is not a useful idea in practice and should not be pursued. More complete discussions are
given by Carroll, et al. (1999) and Berry, et al. (2002).
Consider a regression function m(·) that is to be estimated consistently but nonparamet-
rically. If the true covariate were observable, then there are a host of competing methods for
estimating m(·) consistently but nonparametrically, e.g., kernels, splines, orthogonal series,
local methods, etc.
However, if the true covariate is not observable, and is instead measured with additive,
normal error, then globally consistent estimation of m(·) is effectively impossible. This prob-
lem has been addressed previously, most notably by Fan and Truong (1993). Suppose that
we allow m(·) to have up to k derivatives. They showed that, if the measurement error is
normally distributed, even with known error variance, then, based on a sample of size n,
no consistent nonparametric estimator of m(·) converges faster than the rate {log(n)}^{−k}.
Since, for example, log(10, 000, 000) ≈ 16, effectively this result suggests that globally con-
sistent nonparametric regression function estimation in the presence of measurement error
is impractical.
Given this fact about globally consistent estimation, it seems to us that the only practical
alternative is to construct estimators that are “approximately” consistent. By this we mean
that either (a) in large samples, as the error variance → 0, the estimator should have smaller
order bias than the naive estimator that ignores measurement error; or (b) the estimator
should be consistent in a smaller class of problems, in particular a flexible parametric class.
Carroll, et al. (1999) choose (a), while Berry, et al. (2002) choose (b). Our Bayesian esti-
mator assumes that the regression function is a spline with a moderate number of knots,
e.g., 20, and will be consistent not for the true regression function but rather for the best
spline approximation thereof. However, Ruppert (2002) has shown that the bias caused by
spline approximation is generally negligible compared to variability of the estimator or the
smoothing bias, even for sample sizes in the tens of thousands.
1.3 Outline
The outline of the paper is as follows. In Section 2, we define our model and the basic
characterization of identifiability.
The implication of the identifiability result is that we can construct methods with some assurance that they will reflect the main features of the data. In Section 3, we outline the methods used, some of which make no assumptions
about the distribution of the latent variable (functional case) while other methods assume
a specific form for this distribution (structural case). Section 4 presents a small simulation
study of nonparametric Gaussian and binary regression. In Section 5 we illustrate the meth-
ods on an example involving binary regression and environmental arsenic. In response to
referee concerns about the small sample properties of the estimators, Section 6 describes some asymptotic calculations in the polynomial regression case, which illustrate just how difficult the instrumental variables estimation problem can be for nonlinear models. Section 7 contains concluding remarks.
2 Model and Identifiability
2.1 Introduction
In contrast to the small–error approach to instrumental variables estimation, Hausman, et al. (1991) consider the most basic “nonlinear” model, polynomial regression. A polynomial is of
course linear in the parameters, but it is nonlinear in the independent variable and therefore
nonlinear in the measurement error. Let Y be the response, W the unbiased measure of X,
and S the instrument. They assume that the observed data are an iid sequence of vectors
(Y, W, S) that satisfy
Y = mpoly(X, β) + ε; (1)
W = X + U ; (2)
S = α0 + α1X + ν. (3)
In (1), the function mpoly(x, β) is a polynomial in X. In addition, ε, U , and ν have zero
means, and ν is independent of (X, ε, U). Model (2) is the classical error model. Hausman,
et al. show that the model (1)–(3) is identified essentially under the condition that α1 in (3)
is known. While this result is obviously impressive, in our experience α1 is rarely known in
instrumental variable applications.
2.2 More Details on the Result of Hausman, et al.
Previous readers of this paper have misinterpreted our claim that “essentially”, Hausman,
et al. require that α1 be known. We now clarify what we mean by this phrase.
Hausman, et al.’s main identifiability conditions are the following. In their equation
(2.4), they require that there be a known function a(·) such that a(α0, α1) = 0. In their
equation (2.9), they note that E(S) = α0 +α1E(W ) = α0 +α1E(X). They then require that
these two conditions identify (α0, α1). In an example after their (2.9), they illustrate that if
a(α0, α1) = κ0 +κ1α0 +κ2α1 with (κ0, κ1, κ2) known, then (α0, α1) would be identified under
some conditions, particularly if E(X) ≠ 0.
It stretches the imagination to think of any practical context such that (α0, α1) have a
known relationship. In addition, suppose that we force E(X) = E(W) = E(S) = 0 by the common device of centering W and S to have mean zero. This changes only the location of the data, not the model. Then (2.9) of Hausman, et al. is trivial, α0 = 0, and the condition that (α0, α1) have a known relationship reduces to α1 being known. This is what we mean by “essentially”.
We point out that Hausman, et al. consider the case of differential measurement error,
i.e., that ε and U are correlated. When α1 is known, or identified via a relationship with α0,
our methods are easily adapted to this case.
2.3 Main Identifiability Results
Our proposed methods are based on the simple but apparently new observation that, in fact, (1)–(3) is identified without prior knowledge of α1 even if the regression function is not a polynomial. Rather, α1 can be determined from moments of the observable variables
in (1)–(3). This is an important result since it means that m(·) can be estimated without
any prior knowledge of parameters provided only that the instrument S as well as the proxy
W are observed.
Suppose that
(X,U, ε, ν) are mutually uncorrelated. (4)
Replace (1) by
Y = m(X) + ε. (5)
Then for any function m(x) (not just polynomials), α0, α1, µx = E(X), σ2x = var(X), σ2u = var(U), and σ2ν = var(ν) are all identified if α1 ≠ 0 and if

cov(Y, W) = cov{X, m(X)} ≠ 0. (6)

Specifically, α1 = cov(Y, S)/cov(Y, W); µx = E(W); α0 = E(S − α1W); σ2x = cov(W, S)/α1; σ2u = var(W) − σ2x; and σ2ν = var(S) − α1^2 σ2x. Therefore, all parameters are functions of the moments of observables and so are identified. It is interesting to note that if we interchanged the roles of Y and S, so that S is the “response” and Y is the “instrument”, then identifiability of α1 under (6) follows from the usual instrumental variable calculations.
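To make the moment calculations concrete, the following sketch computes the method-of-moments estimates from data simulated under (2), (3) and (5). It is written in Python for illustration (our own programs are in MATLAB), and the function name and simulation settings are ours, not part of the method's definition.

    import numpy as np

    def iv_moment_estimates(Y, W, S):
        # Moment-based identification of Section 2.3 (illustrative sketch).
        cov = lambda a, b: np.cov(a, b)[0, 1]
        alpha1 = cov(Y, S) / cov(Y, W)            # alpha1 = cov(Y,S)/cov(Y,W)
        mu_x = np.mean(W)                         # mu_x = E(W)
        alpha0 = np.mean(S - alpha1 * W)          # alpha0 = E(S - alpha1*W)
        sigma2_x = cov(W, S) / alpha1             # sigma2_x = cov(W,S)/alpha1
        sigma2_u = np.var(W, ddof=1) - sigma2_x   # sigma2_u = var(W) - sigma2_x
        sigma2_nu = np.var(S, ddof=1) - alpha1 ** 2 * sigma2_x
        return alpha0, alpha1, mu_x, sigma2_x, sigma2_u, sigma2_nu

    # Check on data simulated from (2), (3) and (5):
    rng = np.random.default_rng(0)
    n = 100_000
    X = rng.normal(0.0, 1.0, n)
    Y = np.sin(np.pi * X / 2.0) + rng.normal(0.0, 0.3, n)   # Y = m(X) + eps
    W = X + rng.normal(0.0, np.sqrt(0.33), n)               # W = X + U
    S = 0.5 + 1.0 * X + rng.normal(0.0, 1.0, n)             # S = alpha0 + alpha1*X + nu
    print(iv_moment_estimates(Y, W, S))   # close to (0.5, 1.0, 0.0, 1.0, 0.33, 1.0)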
Of course, there are examples where (6) fails, e.g., m(X) ≡ constant or m(·) is an even
function and X symmetrically distributed about 0. However, we expect cov{X, m(X)} ≠ 0
in the vast majority of applications. Moreover, if we add to (4) the assumption that
X is independent of U and ν, (7)
then (6) can be weakened to

cov[{X − E(X)}^k, m(X)] exists and is non-zero for some positive integer k. (8)
Specifically, we have the following.
Theorem: Assume (2), (3), (4), (5), (7), (8), that α1 ≠ 0, and that σ2x > 0. Then α0, α1, µx = E(X), σ2x = var(X), σ2u = var(U), and σ2ν = var(ν) are all identified; that is, they are determined by moments of observable variables.
The proof of the theorem is given in Appendix A.1. In Appendix A.2 we show that under
weak assumptions, (8) will hold unless m(·) is constant. When m(·) is constant, m(·) is still
identified, but α1 appears not to be identified; see Section 3.4. Our theorem does not state
explicitly whether m is identified, because this follows from Fan and Truong (1993), who
exhibit conditions under which m(·) can be consistently estimated, and hence is identified,
if var(U) is identified. As we have argued in Section 1.2, (globally) consistent estimation of
m(·) is not feasible practically, and is thus not the main goal of our work.
Some comments on the assumptions and implications of the characterization are in order.
Assume now that (6) does hold.
1. In the linear case where m(x) = β0 + β1x, the usual IV slope estimate is the sample version of β1,iv = cov(Y, S)/cov(W, S). If σ2u were known, then the usual correction for attenuation estimate is the sample version of β1,ca = cov(Y, W)/{var(W) − σ2u}. In our IV model the estimate of σ2u is the sample version of σ2u,iv = var(W) − cov(W, S)cov(Y, W)/cov(Y, S). Substituting σ2u,iv into the formula for β1,ca yields β1,iv. Thus, the usual IV estimator of the slope in linear regression can be looked upon as the correction for attenuation estimator when the measurement error variance is estimated via our proposal; a numerical check appears in the sketch following this list.
2. Conceptually, the connection between correction for attenuation and instrumental vari-
ables estimation offers the hope of more stable estimation. In particular, the attenua-
tion is
λ = σ2x/(σ2x + σ2u) = (σ2w − σ2u)/σ2w. (9)
The correction for attenuation estimator is simply the least squares slope ignoring
measurement error divided by an estimate of the attenuation. Because of this division,
one can at least in principle improve the usual instrumental variables estimator by
bounding the attenuation away from zero.
3. The model (5) is more general than it looks, since the distribution of ε can depend on
X. For example, in our application Y is binary, so that we can write the model as
logit{pr(Y = 1|X)} = g(X), where g(X) = logit{m(X)}. Then ε is a Bernoulli variate
minus its mean.
4. Because (5) is an unstructured regression model, the assumption of additivity in (2)–
(3) is not as strong as it may seem. Instead, we are only assuming that there is a
common smooth transformation of the original data to X, W and S that satisfies these
equations. If (5) holds for the original data, then it will also hold for the transformed
data. For example, in our application to environmental epidemiology, we log transform
the data.
5. In practice, the methods are necessarily restricted to cases that Y and W are clearly
related, else α1 will be poorly estimated. Indeed, if Y and X are independent then the
parameters in the (W,S) model are unidentifiable if (W,S) are jointly normal.
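As a numerical check of comment 1, the following illustrative snippet verifies that the usual IV slope and the correction-for-attenuation slope computed with σ2u,iv coincide; the equality is an algebraic identity, so any discrepancy is floating-point error. All names and settings below are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200_000
    X = rng.normal(0.0, 1.0, n)
    Y = 1.0 + 2.0 * X + rng.normal(0.0, 0.5, n)     # linear m(x) = beta0 + beta1*x
    W = X + rng.normal(0.0, 0.6, n)                 # classical error model (2)
    S = 0.3 + 0.8 * X + rng.normal(0.0, 1.0, n)     # instrument model (3)

    cov = lambda a, b: np.cov(a, b)[0, 1]
    beta1_iv = cov(Y, S) / cov(W, S)                # usual IV slope
    sigma2_u_iv = np.var(W, ddof=1) - cov(W, S) * cov(Y, W) / cov(Y, S)
    beta1_ca = cov(Y, W) / (np.var(W, ddof=1) - sigma2_u_iv)
    print(beta1_iv, beta1_ca)                       # equal up to rounding; near 2.0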
3 Methods of Estimation
This section describes, in broad terms, three methods of estimation. Section 3.1 describes the
basic method we use to fit nonparametric regression, namely fixed–knot regression splines.
In Section 3.2, we propose a functional method of estimation that makes no assumptions
about the distributions of the random variables (X, U, ε, ν). We can do this because of our
new result that gives us an estimate of the measurement error variance. The method is
simple. We use the estimate of var(U) derived by moments calculations outlined above, and
then we apply the SIMEX method (Cook and Stefanski, 1994) to the (Y, W ) data, as in for
example Carroll, et al. (1999). In the nonparametric regression problem with error variance
known or estimated by direct replication of W , Berry, et al. (2002) showed that a Bayesian
approach using regression and smoothing splines could achieve significant gains in efficiency
when compared to the SIMEX method. In Section 3.3 we show how to extend their Bayesian
method to the IV problem and also to problems such as binary regression.
3.1 Fixed Knot Regression Splines
A general approach to spline fitting is to use penalized splines or simply P–splines, a term
we borrow from Eilers and Marx (1996). In this section, we introduce the idea. The full
specification of the spline estimators proposed in our context comes later in the paper; see
for example (10).
Let C(x) = {B1(x), . . . , BN(x)}T, N ≤ n be a spline basis, i.e., a set of linearly indepen-
dent piecewise polynomial functions; a specific example will be given shortly. The P-spline
model specifies that m(·) is in the vector space of splines generated by this basis, i.e., that
for some N -dimensional β, m(x) = m(x, β) = C(x)Tβ.
Classes of P-splines that are especially convenient for modeling are the penalized B-
splines of Eilers and Marx (1996) and the closely related truncated power series basis of
Ruppert (2002). B-splines are more stable numerically than the truncated power basis, but
the roughness penalty we use adds numerical stability and makes use of the truncated
power basis computationally feasible. See Ruppert (2002) for a discussion of computation
with the truncated power basis. The latter are pth degree polynomial splines with k fixed
knots, t1, . . . , tk. We choose the knots at the quantiles of the W ’s. These functions have p−1
continuous derivatives and their pth derivatives are piecewise constant and take jumps at the
knots. A convenient basis for these splines is the set of monomials plus the truncated power functions, so that C(x) = {1, x, x^2, ..., x^p, (x − t1)^p_+, . . . , (x − tk)^p_+}T, where a^p_+ = {max(0, a)}^p. Then N = 1 + p + k; β1, . . . , βp+1 are the monomial coefficients, and βp+2, . . . , βN are the sizes of the jumps in the pth derivative of g(x) = C(x)Tβ at the knots.
The choice of the number of knots k is discussed by Ruppert (2002) who finds that for P–
splines the exact value of the number of knots k is not important, provided that k is at least
a certain minimum value. Generally, k = 20 more than suffices for the types of regression
functions found in practice and that can be recovered when there is measurement error. Of
course, there will be exceptions where more knots are required, e.g., a long periodic time
series. However, measurement error often occurs in situations where m is not too complex
and 10–20 knots, or often even fewer, will suffice in such cases.
We add for completeness that there are a host of ways to fit spline functions. We have
found that for many functions, knot selection is not too important if the number of knots
is reasonably large. Berry, et al. (2002) found that P–splines and smoothing splines gen-
erally give very similar answers. Of course, at least in principle researchers interested in
knot selection can generalize our work to include either knot selection or smoothing splines.
Whether this is necessary or even practical in the context of measurement error remains an
open problem.
If measurement error is ignored, it is typical to fit the function m(x, β) by penalized
maximum likelihood; see for example Hastie and Tibshirani (1990). Consider the truncated
power series basis defined above. Let D∗ be the N × N diagonal matrix with p + 1 zeros
followed by k ones along the diagonal. Let γ be a smoothing parameter. The penalized
estimator β(γ) ignoring measurement error maximizes the loglikelihood in (Y, W) minus the penalty γβTD∗β. More formally, suppose that the loglikelihood in (Y, X) is L(Y, X, β). Then the penalized regression spline ignoring measurement error is the solution to

max_β [ {∑_{i=1}^n L(Yi, Wi, β)} − γβTD∗β ]. (10)
One can use cross validation (CV) or generalized cross validation (GCV) to choose γ.
See, for example, Hastie and Tibshirani (1990, page 159) for definitions of CV and GCV.
Other penalties such as on the integral of the squared second derivative can be imposed by
other choices of D∗.
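For concreteness, here is a minimal sketch of the truncated power basis and the penalized fit (10) in the Gaussian case, where maximizing the penalized loglikelihood reduces to penalized least squares. This illustrative Python is not our MATLAB implementation, and the smoothing parameter is held fixed rather than chosen by CV or GCV.

    import numpy as np

    def truncated_power_basis(x, p, knots):
        # C(x) = {1, x, ..., x^p, (x - t1)^p_+, ..., (x - tk)^p_+}
        cols = [x ** j for j in range(p + 1)]
        cols += [np.maximum(0.0, x - t) ** p for t in knots]
        return np.column_stack(cols)

    def penalized_spline_fit(y, w, p=2, k=20, gamma=1.0):
        # Gaussian version of (10): penalized least squares with a ridge
        # penalty on the k jump coefficients only (D* has p+1 zeros, then ones).
        knots = np.quantile(w, np.linspace(0.0, 1.0, k + 2)[1:-1])  # quantiles of the W's
        C = truncated_power_basis(w, p, knots)
        Dstar = np.diag([0.0] * (p + 1) + [1.0] * k)
        beta = np.linalg.solve(C.T @ C + gamma * Dstar, C.T @ y)
        return beta, knots

    # Naive fit of Y on W, ignoring measurement error:
    rng = np.random.default_rng(2)
    x = rng.normal(size=500)
    w = x + rng.normal(scale=0.5, size=500)
    y = np.sin(np.pi * x / 2.0) + rng.normal(scale=0.3, size=500)
    beta, knots = penalized_spline_fit(y, w)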
3.2 Functional Methods: SIMEX
As described in Section 2, as part of our work we have derived a new, simple nonparametric
estimator of the measurement error variance, σ2u. This estimate, however, is not guaranteed
to be positive, and it is entirely possible that it will be either negative or much too large.
We thus suggest the following simple modification. Set a user–specified lower bound on the
attenuation (9), say λL. Let λ̂ be the estimate of λ obtained by replacing σ2w in (9) by the sample variance of the W's and by replacing σ2u by its estimate. If λL ≤ λ̂, then use this estimate of σ2u. If λ̂ < λL, then form a new estimate of σ2u by solving λL = (σ2w − σ2u)/σ2w, so that the new estimate is σ2u = σ2w(1 − λL).
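In code, the bounding rule amounts to clipping the implied attenuation and solving (9) back for σ2u. The sketch below also clips at 1 so that a negative variance estimate is replaced by zero; that upper clip is our addition, in the spirit of the constraint λ ∈ [0.60, 1.00] used in the simulations of Section 4.

    import numpy as np

    def bounded_sigma2_u(sigma2_u_hat, W, lambda_L=0.60):
        # If the implied attenuation falls outside [lambda_L, 1], clip it and
        # solve lambda = (s2w - s2u)/s2w back for s2u; otherwise keep the estimate.
        s2w = np.var(W, ddof=1)
        lam = np.clip((s2w - sigma2_u_hat) / s2w, lambda_L, 1.0)
        return s2w * (1.0 - lam)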
SIMEX needs a base estimator, i.e., an estimator one would use if there were no mea-
surement error. Carroll, et al. (1999) describe two such methods: (a) local linear kernel
regression of Y on W with bandwidths estimated either by GCV or by EBBS; (b) regression
splines of Y on W as described previously with smoothing parameter estimated by GCV.
We now define our SIMEX–IV estimator: apply the SIMEX method of Cook and Stefan-
ski (1994), using σ2u as the estimate of error variance. Briefly, for any fixed ζ > 0, suppose
one repeatedly ‘adds on’ to W , via simulation, additional error with mean zero and variance
σ2uζ, forming what are called pseudovalues. Then using these pseudovalues as predictors, one
computes one’s favorite nonparametric regression estimator, e.g., kernel or regression spline
as described above. With this estimator in hand, one generates pseudovalues repeatedly and
averages the estimators, calling the average g(ζ, x): generally, 50–200 repetitions will suffice, and one uses ζ = 0.0, 0.5, 1.0, 1.5, 2.0. The idea is to plot g(ζ, x) against ζ ≥ 0, fit a model to
this plot, and then extrapolate back to ζ = −1. In our calculations, we used a quadratic
function to model the plot of g(ζ, x) against ζ.
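The following sketch shows the SIMEX–IV loop under the quadratic extrapolant; fit stands for any base estimator, such as the penalized spline of Section 3.1, evaluated on a grid of x values. The signature and defaults are illustrative only.

    import numpy as np

    def simex_iv(y, w, sigma2_u, fit, x_grid,
                 zetas=(0.0, 0.5, 1.0, 1.5, 2.0), B=100, seed=0):
        # For each zeta, add simulated error with variance sigma2_u*zeta to W,
        # refit B times, and average to get g(zeta, x) on x_grid.
        rng = np.random.default_rng(seed)
        g = []
        for z in zetas:
            fits = [fit(y, w + np.sqrt(z * sigma2_u) * rng.standard_normal(len(w)), x_grid)
                    for _ in range(B)]
            g.append(np.mean(fits, axis=0))
        # Quadratic extrapolation in zeta, back to zeta = -1, at each grid point.
        coef = np.polyfit(np.asarray(zetas), np.asarray(g), deg=2)
        return coef[0] - coef[1] + coef[2]   # value of the quadratic at zeta = -1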
3.2.1 Asymptotic Theory for SIMEX
Asymptotic theory for the SIMEX method is easy if one uses kernel methods. Since σ2u
converges at the rate Op(n^{−1/2}), and the kernel estimator converges at a slower rate, the
asymptotics are the same as if σ2u were known. This means that the SIMEX–kernel instru-
mental variables estimator has the same asymptotic distribution and expansion as described
in Carroll, et al. (1999). In the interest of space, we do not rewrite the details of this result.
There are no known limiting results for penalized regression splines with a fixed number
of knots and an estimated smoothing parameter. If the smoothing parameter γ in (10) is
held fixed, or is known and converges to zero at a specified rate, then the solution to (10) is
the solution to an estimating equation, i.e., an equation of the form

∑_{i=1}^n Ψn(Yi, Wi, β, γ) = 0. (11)
The limiting distribution of SIMEX for estimating equations such as (11) is already known;
see Stefanski and Cook (1995) and Carroll, et al. (1996).
3.3 Bayesian Methods
Our Bayesian methods are similar to those in Berry, et al. (2002). Partition C(x) = {C1(x)T, C2(x)T}T, where C1(x)T = (1, x, x^2, ..., x^p). Partition β = (β1T, β2T)T similarly. As is common with regression splines, we will assume that β2 = Normal(0, σ2I), where I is the identity matrix. Other formulations are possible and are described in the appendix. The parameters then become α0, α1, µx, σ2x, σ2u, σ2ν, β and σ2.
The formulae needed to implement the Gibbs sampler are detailed in Sections A.3 and A.4, where we exhibit them for the Gaussian and probit models. Sections 4.1.3 and A.5 describe implementation in BUGS and our experience with it.
A reader has asked that we comment on the asymptotic properties of the Bayesian meth-
ods. We know of no general results for Bayesian P–splines even without measurement error,
but can appeal to standard theory connecting Bayesian and frequentist methods under the
assumptions that the model that drives the Bayesian calculations actually holds.
3.4 The Case that cov[m(X), {X − E(X)}^k] = 0 for all k
In general, it would appear that when cov[m(X), {X − E(X)}^k] = 0 for all k, the function
m(·) may not be identifiable. However, in the most important subcase, namely that m(·) is
constant so that m(X) ≡ c, m(·) is generally identifiable, at least when a lower bound on
the attenuation is specified.
Detailed proof of the assertion above is highly technical, but the main idea can be seen for
the SIMEX estimator. We assume that the extrapolant function is parametric and includes
the constant function as a special case. To this end, recall that if the regression function is
constant, then E(Y |X) ≡ E(Y |W) ≡ c. Thus, the naive estimator that ignores measurement error consistently estimates m(·). Now consider what happens in the SIMEX algorithm. If the attenuation is bounded
below by λL, then for sufficiently large samples we must have that (1/2)λL ≤ (σ2w − σ2u)/σ2w. This means that for sufficiently large samples we can find an interval [a, b] for σ2u such that the attenuation on this interval of values always exceeds zero. Fix any σ2u∗ in this interval.
Consider the construction of pseudovalues W(pseudo, ζ, σ2u∗) = W + ζ^{1/2} σu∗ Z, where Z is a computer–generated standard normal random variable. Of course, the pseudovalues also satisfy E{Y |W(pseudo, ζ, σ2u∗)} ≡ c. We have thus shown that the naive estimator applied to the pseudovalues consistently estimates m(·) ≡ c. Since the extrapolant function includes the constant function as a special case, we have shown that for any σ2u∗ in [a, b], applying SIMEX leads to a consistent estimate of m(·). What is now required to complete the proof is to show that this argument holds uniformly
in σ2u∗ ∈ [a, b], and hence that SIMEX is consistent as long as one bounds the attenuation
away from zero. Providing precise technical conditions to make this argument rigorous is
likely to be tedious and quite possibly uninteresting.
4 A Small Simulation Study
In this section we describe simulation results for Gaussian nonparametric regression and
binary nonparametric regression. In our simulation, we took n = 100 for the Gaussian case
and n = 500 for the logistic case. These are small sample sizes given the difficulty of the
instrumental variables problem for nonparametric regression; see Section 6.
We took σ2x = 1, σ2u = 0.33, σ2ν = 1, α0 = 0 and α1 = 1. For the Gaussian case, the error variance in (5) was σ2ε = 0.09. In this simulation, the attenuation was λ = 0.75. In our
calculations, the attenuation λ was constrained to lie in [0.60, 1.00]. Mean squared biases
and mean squared errors were calculated for x ∈ [−2.0, 2.0].
Although we assumed that the X's were normally distributed, to test robustness in the Gaussian case we consider three distributions for the X's: normal, uniform on [−2, 2], and skew normal with index α = 5. The skew normal distribution has density f(x|α) = 2φ(x)Φ(αx), where φ and Φ represent the standard normal density and distribution function (Azzalini, 1985). This density is reasonably skewed for any value of α ≥ 5.
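Skew normal variates are easily generated from the standard representation; the sketch below is one way to do it and is not taken from our simulation code.

    import numpy as np

    def rskewnormal(n, alpha, rng):
        # X = delta*|Z1| + sqrt(1 - delta^2)*Z2 has density 2*phi(x)*Phi(alpha*x),
        # with delta = alpha/sqrt(1 + alpha^2).
        delta = alpha / np.sqrt(1.0 + alpha ** 2)
        z1 = rng.standard_normal(n)
        z2 = rng.standard_normal(n)
        return delta * np.abs(z1) + np.sqrt(1.0 - delta ** 2) * z2

    x = rskewnormal(10_000, alpha=5.0, rng=np.random.default_rng(3))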
4.1 Gaussian Nonparametric Regression
For the Gaussian case, we considered three regression models. In Case 1, the regression function is 1/{1 + exp(4x)}. In Case 2, the regression function is sin(πx/2)/(1 + [2x^2{1 + sin(πx/2)}]). In Case 3, the regression function is sin(πx/2)/(1 + [2x^2{1 + sign(x)}]).
4.1.1 Bias–Variance Tradeoffs in Regression Spline Estimation
Carroll, et al. (1999, Section 4.4) describe theoretical calculations in the classical measurement
error problem that show that if one uses the truncated power series basis for regression
splines and maximum likelihood estimation, then the variance of the fits “blows up” as the
smoothing parameter → 0.
What does this mean, and why is it important? The essential point of this theoretical calculation is that in a sample of size 100 for Gaussian cases, our methods must necessarily penalize the spline in order to make it reasonably stable. There is a cost for such smoothing,
however, and that is bias. Specifically, for such sample sizes in the Gaussian case, it is
hopeless to believe that we will be able to reproduce difficult functions with deep valleys
such as Cases 2 and 3.
4.1.2 Results
The results for the Gaussian case are given in Table 1, for a 25–knot quadratic regression
spline: similar results were obtained for the linear spline. In this table, mean squared bias
and mean squared error are averages over 101 grid points on the interval [-2,2] and over all
Monte Carlo samples.
We see that the Bayes estimator clearly dominates the SIMEX estimators and the naive
estimator that ignores measurement error, both in terms of bias and in terms of mean squared
error. The SIMEX estimator with a quadratic extrapolant is far less biased than the naive
estimator, but it has large variance.
Figure 1 corresponds to Table 1, normal distribution, Case 3. The top left, top right, and bottom left panels show 3 simulated data sets. The bottom right panel is the mean over all simulated data sets. The lines are: solid = true, dashed = naive, dash–dot = SIMEX, and dotted = Bayes.
This is a problem for which the naive estimator is only somewhat worse than the Bayes
estimator (from Table 1, naive squared bias = 2.99, Bayes squared bias = 1.29, naive mse
= 3.72, Bayes mse = 2.97). Careful inspection of the plot shows that the naive estimate
often misses or just barely finds the inflection points. The SIMEX estimator has excess
variability as shown in Table 1. Basically, this means that when the naive estimator is not
too bad relative to Bayes, the differences between the SIMEX and Bayes estimates are real
but subtle.
Figure 2 corresponds to Table 1, normal distribution, Case 2. In this case, the Bayes
estimator is a large improvement over the SIMEX estimator. This can be seen in the top
left panel, which is a data set where the naive estimate is poor, and the SIMEX is then
even worse. Table 1 shows the same thing: real dominance by Bayes. Notice that in
the bottom right the mean of SIMEX is close to that of the Bayes estimator, so that these
two estimators have similar bias. This implies that the substantial MSE improvement of
the Bayes estimator over SIMEX seen in Table 1 is due to the lower variability of the Bayes
estimator.
4.1.3 Implementation and Comparison with WinBUGS
We have implemented the methods in MATLAB; the programs are available at the web site (not given to preserve anonymity). In addition, at the same web site, we have constructed 20 simulated data sets for each of the cases in the simulation, along with Case 4, m(x) = x^2.
This case is interesting because (6) is violated, and one would expect difficulties or at least
small sample instabilities in the fits. We have provided the Naive and Bayes estimates of
the regression functions. Readers may wish to compare their favorite approaches to ours on these data sets.
It is also possible to implement the Bayesian method using software designed for MCMC
simulations, such as WinBUGS (Bayesian inference Using Gibbs Sampling for Windows).
In Appendix A.5 we provide our implementation of the Gaussian model. WinBUGS is
very intuitive and flexible, allowing quick changes in the model. For example, changing
the model from a Normal to a Bernoulli/Logit model can be done by simply replacing the
line Y[i] ~ dnorm(m[i], taueps) with the lines Y[i] ~ dbern(p[i]) and logit(p[i]) <- m[i].
Similar changes can be made for Binomial, Poisson and other distributions of interest. An
important advantage of WinBUGS is that it does not require one to write code for the
Metropolis–Hastings step of simulations.
The MATLAB and the WinBUGS implementations use the same set of priors for pa-
rameters but different proposal distributions. The Matlab program takes advantage of the
specific features of the model for which all but two complete conditionals are explicit. Care-
fully tailored Metropolis–Hastings steps are used for these two complete conditionals.
These features of the Matlab program have important effects on both simulation speed
(number of simulations per second) and, more importantly, on the MCMC mixing (the
property of the chain to move rapidly through the support of the target distribution). For
example, for a data set of sample size 100 for the normal Case 2, 1,000 MCMC simulations were obtained in 14 seconds with Matlab and in 48 seconds with WinBUGS (2.66 GHz CPU,
1GB RAM). For the Matlab program 30,000 simulations including 10,000 burn-in proved to
be enough to achieve convergence. Due to differences in the mixing quality of the simulated chains, the WinBUGS program required 1,000,000 simulations including 500,000 burn-in to achieve
the same results. In the end, WinBUGS needed approximately 13 hours to achieve the same
results obtained by the Matlab program in 7 minutes.
One should note that coding in WinBUGS requires only a low level of expertise, and coding times are far shorter than for expert programs (hours versus weeks or even months). In
our experience WinBUGS proves to be a valuable tool in the initial phase of research, when
many models are considered and compared. Moreover, WinBUGS programs can be used to
validate expert programs in the process of program refining and debugging.
4.2 Logistic Nonparametric Regression
We generated data according to the logistic model pr(Y = 1|X) = [1 + exp{−m(X)}]^{−1}, although the data were fit via probit regression, and the logits computed from the probit fit. Thirteen cases were considered, built from four basic families of monotone functions (see the sketch after this list):

m(x) = κx, κ = 1.0, 0.75 and 0.50 (Cases 1, 2, 3);
m(x) = κ([4/{1 + exp(−x)}] − 2), κ = 1.0, 0.75 and 0.50 (Cases 4, 5, 6);
m(x) = (x − κ)_+ − (−κ − x)_+, κ = 1.0 and 0.75 (Cases 7, 8);
m(x) = κx_+, κ = 1.0, 0.75, 0.50, 0.25 and 0.00 (Cases 9, 10, 11, 12, 13).
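The following sketch encodes the four families and generates binary responses from the logistic model; the names and settings are illustrative.

    import numpy as np

    def pos(t):
        # (t)_+ = max(t, 0)
        return np.maximum(t, 0.0)

    def m_case(x, family, kappa):
        # The four families of monotone test functions listed above.
        if family == 1:                                   # Cases 1-3
            return kappa * x
        if family == 2:                                   # Cases 4-6
            return kappa * (4.0 / (1.0 + np.exp(-x)) - 2.0)
        if family == 3:                                   # Cases 7-8
            return pos(x - kappa) - pos(-kappa - x)
        return kappa * pos(x)                             # Cases 9-13

    # Binary responses from the logistic model of this section:
    rng = np.random.default_rng(4)
    X = rng.normal(size=500)
    p = 1.0 / (1.0 + np.exp(-m_case(X, 2, 0.75)))
    Y = rng.binomial(1, p)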
The constant κ basically makes the function increasingly close to constant as κ decreases. The
last case, m(x) ≡ 0, is a null case, to which the discussion in Section 3.4 applies. For this
case, the naive estimate has no bias, and would be expected to have smaller mean squared
error, since the effect of trying to correct an already consistent estimator for non–existent
bias caused by measurement error is to increase variance.
The results are given in Table 2 for a 10–knot linear regression spline. Basically, we
see that when the functions are monotone and far from constant, the Bayes estimator has
smaller bias and mean squared error than the naive method, sometimes much smaller. Of
course, as the functions become increasingly close to constant, the naive method becomes
increasingly competitive.
5 The Arsenic Example
Arsenic exposure has been clearly linked with skin, bladder, and lung cancer occurrence
in populations highly exposed either occupationally, medicinally, or through contaminated
drinking water (National Research Council, 1999; IARC, 1987). An ongoing population-based study in New Hampshire (Karagas et al., 1998; Karagas et al., 2001) is examining the effects of arsenic on the incidence of skin and bladder cancer in response to low to moderate
exposures, primarily due to natural sources of arsenic contamination in well water. Because
of intense regulatory interest in the effects of abatement strategies, the shape of the exposure–response relationship at lower exposures is important, and strategies for nonlinear modeling
are being explored actively (Karagas and Tosteson, 2002).
Exposure assessment is accomplished through the measurement of arsenic concentrations
in both tap water from home water supplies and toenail samples for individuals newly di-
agnosed with skin or bladder cancer (cases) and individuals belonging to an age and gender
matched sample of other state residents (controls). For our example, we consider data for
215 controls and 233 basal cell skin cancer cases having both water and toenail samples.
Because we are interested in characterizing changes in cancer incidence due to changes in
arsenic water contamination, we specify the water measurement as the unbiased exposure,
taking X to be log(0.005 + level of arsenic in tap water sample) and W to be the measured
value of this quantity. The toenail arsenic measurements are interpreted as the instrumental
variable, so that S is specified as log(0.005 + level of arsenic in the toenail sample). Log
transformations were chosen to make W and S both reasonably close to normally distributed,
although some skewness remains.
Preliminary analysis ignoring measurement error showed a positive but not statistically
significant linear trend between arsenic in tap water and basal cell cancer incidence. For
the purposes of this analysis, the results were not adjusted for possible confounding factors
such as age and gender. The results for the regression spline analysis are given in Figure 3.
The naive fit ignoring measurement error shows a modest increase in the logit of basal cell
cancer incidence over the range of observed tap water arsenic levels, with some indication of
nonlinearity. The Bayes fit adjusting for measurement error shows a somewhat more uniform
increase, with the impression of less nonlinearity. The confidence bands indicate that the
overall increase is not statistically significant.
The posterior means were α0 = −2.0, α1 = 0.20, σ2x = 2.61, σ2u = 0.54, σ2ν = 0.28, µx = −1.11 and λ = 0.83, the latter indicating that the amount of attenuation is not as
great as might be supposed.
6 Some Asymptotic Calculations in Polynomial Regression
The tradeoff between bias and variance is familiar to all who work in nonparametric regres-
sion. Less well known is the bias–variance tradeoff in measurement error modeling, but the
effect is even more profound. Ignoring measurement error leads to bias, often in the form of
attenuation, namely the estimates tend to be shrunken towards zero. To correct this bias,
one typically must unshrink the estimator, an operation that causes an increase in variability.
Thus, in almost any practical context, the naive estimator is biased but much less variable
than any estimator that attempts to remove this bias.
To get some idea of the asymptotic behavior of the estimates, we performed some exact
calculations. For each of the 3 cases in Table 1, and with Case 4 being m(x) = x^2, we fit the polynomial that best captures the function on the interval µx ± 3(σ2x + σ2u)^{1/2}, i.e., we set up a grid on this interval, and fit polynomials to the function y = m(x) on the grid. The degrees of the polynomials chosen were 7, 7, 7 and 5 for Cases 1–4, respectively. The polynomial functions are given in Figure 4 on the interval µx ± 2σx. While not perfect representations of Cases 1–4, they are sufficiently close to yield some insight.
With X and U normally distributed, define σ2x|w = var(X|W) = λσ2u. Recall that if Z has a standard normal distribution, then E(Z^2r) = (2r)!/(2^r r!). Then, if the true polynomial is m(x, β) = ∑_{k=1}^d βk x^k, the observed data have the regression function E(Y|W) = ∑_{j=0}^d βj,naive W^j, where βj,naive = ∑_{k=j}^d βk k! λ^j (σx|w)^{k−j} E(Z^{k−j})/{j!(k − j)!}. In general, if the true regression function is m(x, β), then the naive estimator of β converges to βnaive, the minimizer of E[{m(X, β) − m(W, βnaive)}^2]. Once the expectation is computed as a function of βnaive, it can be minimized by any standard minimizer. The expectation itself is given as

(σ2x σ2u)^{−1/2} ∫ {m(x, β) − m(x + u, βnaive)}^2 φ{(x − µx)/σx} φ(u/σu) dx du
= (π σ2u)^{−1/2} ∫ {m(µx + √2 σx z, β) − m(µx + √2 σx z + u, βnaive)}^2 exp(−z^2) φ(u/σu) dz du
= π^{−1/2} E[∫ {m(µx + √2 σx z, β) − m(µx + √2 σx z + U, βnaive)}^2 exp(−z^2) dz],

where the expectation is over the distribution of U. The integral can be computed via Gaussian quadrature, and the expectation via simulation.
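As an illustration of these calculations, the sketch below computes βnaive entirely by Gauss–Hermite quadrature, taking both X and U normal so that the expectation over U is also a quadrature rather than a simulation, and including an intercept for simplicity. Because m(·) is linear in β, the minimizer solves the normal equations E{C(W)C(W)T} βnaive = E{C(W) m(X, β)}, where C(w) = (1, w, ..., w^d)T.

    import numpy as np

    def beta_naive(beta, mu_x, s2x, s2u, deg_quad=40):
        # Limit of the naive polynomial fit: solve the normal equations
        # E{C(W)C(W)^T} b = E{C(W) m(X, beta)} by Gauss-Hermite quadrature.
        z, wts = np.polynomial.hermite_e.hermegauss(deg_quad)
        wts = wts / np.sqrt(2.0 * np.pi)          # weights for E h(Z), Z ~ N(0,1)
        d = len(beta) - 1
        A = np.zeros((d + 1, d + 1))
        bvec = np.zeros(d + 1)
        for zi, wi in zip(z, wts):
            x = mu_x + np.sqrt(s2x) * zi
            mx = np.polyval(beta[::-1], x)        # m(x, beta); beta ordered low to high
            for zj, wj in zip(z, wts):
                Cw = (x + np.sqrt(s2u) * zj) ** np.arange(d + 1)   # (1, W, ..., W^d)
                A += wi * wj * np.outer(Cw, Cw)
                bvec += wi * wj * mx * Cw
        return np.linalg.solve(A, bvec)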
Let m(1)(·) be the derivative of m(·) with respect to β; note that because m(·) is linear in β, this derivative does not involve β. In a sample of size n, the naive estimator is the solution to the equation ∑_{i=1}^n m(1)(Wi){Yi − m(Wi, β)} = 0. Asymptotically, its variance is n^{−1} Bnaive^{−1} Anaive Bnaive^{−1}, where

Bnaive = E[m(1)(W){m(1)(W)}T];
Anaive = E[{Y − m(W, βnaive)}^2 m(1)(W){m(1)(W)}T]
= E([σ2ε + {m(X, β) − m(W, βnaive)}^2] m(1)(W){m(1)(W)}T).
Both Anaive and Bnaive can be computed either directly by simulation or by a combination
of simulation and Gaussian quadrature as described above.
Under a parametric model the Bayes estimator will be asymptotically equivalent to the maximum likelihood estimator, and hence it will be asymptotically consistent and its variance is n^{−1} I^{−1}, where I is the information matrix for β and the measurement error model parameters µx, α0, α1, σ2x, σ2u, σ2ε, σ2ν. Again, a combination of Gaussian quadrature and simulation is used. By simple calculations, the likelihood is

(σ2ε σ2u σ2ν π)^{−1/2} ∫ φ{(Y − m(µx + √2 σx z, β))/σε} φ{(W − µx − √2 σx z)/σu}
× φ{(S − α0 − α1(µx + √2 σx z))/σν} exp(−z^2) dz.

We computed the score L(Y, W, S, ·), the derivative of the loglikelihood, via numerical differentiation. The information I = E{L(Y, W, S, ·) LT(Y, W, S, ·)} can be computed by simulation.
On a grid of values xi, i = 1, ..., ngrid, the mean squared bias of the naive estimator is ngrid^{−1} ∑_{i=1}^{ngrid} {m(xi, β) − mnaive(xi, βnaive)}^2. Since m(·) and mnaive(·) are linear in β, the average variance based on a sample of size n is n^{−1} trace(Bnaive^{−1} Anaive Bnaive^{−1} C), where C = ngrid^{−1} ∑_{i=1}^{ngrid} m(1)(xi){m(1)(xi)}T. Ignoring any small–sample bias in the maximum likelihood estimator, its asymptotic variance is n^{−1} trace(I^{−1} C).
For a sample of size n = 100, ignoring any possible small–sample bias in the maximum
likelihood estimator, the results are given in Table 3. Basically, the message from this table
is the same one we have made previously: for such small sample sizes, the excess variance
of the (asymptotically) best parametric method for correcting bias due to measurement
error makes the naive approach at least comparable in terms of mean squared error. This
(asymptotic) fact of life shows up somewhat in our simulations, although we have noted
there that the Bayesian methods actually perform quite a bit better than our asymptotics
would suggest.
Careful readers will note that the numbers in Tables 1 and 3 are not identical. This is
because the latter uses asymptotics, different functions, and different estimation methods.
The qualitative message is, however, the same.
7 Summary and Further Discussion
Our main theoretical contribution is to show that all parameters including the regression
function are identified in the instrumental variables model without prior knowledge of the
slope of the regression of the instrument on the true X. This result extends the applicability
of IV estimation to many interesting examples including our case study of the risk of skin
cancer due to arsenic exposure.
Our second main result is the characterization of instrumental variables estimation as a
correction for attenuation, so that the measurement error variance can be estimated from
moments of the observed instruments. This allows us to use some of the methods from the
classical measurement error literature.
We have developed two IV estimates, a functional estimator that applies SIMEX and
a Bayesian structural estimator that uses MCMC. Simulation shows that the Bayesian
structural estimator outperforms the functional estimator. Moreover, the structural estima-
tor appears robust to misspecification of the distribution of the true covariate X, although
there are surely situations for which an X-distribution is so far from normal that the Bayes
estimator will be badly affected.
In our example, the designation of tap water as the unbiased exposure measure reflects
a certain interpretation of the fitted regression curve, that the curve is the probability of
skin cancer given a level of exposure in drinking water. However, in practice, total arsenic
exposure includes not only the amount consumed but exposure from other sources such as
food.
Another formulation would focus on the dose response for a biologically active arsenic
exposure, for which toenail concentrations could be taken as an unbiased measure. Concep-
tually, this would introduce an additional latent variable to represent the biologically active
exposure, D, which would depend on true tap water concentrations in a linear fashion. Re-
taining the designation of W for transformed value of measured tap water arsenic and S for
toenail, we could rewrite our model as
Y = mpoly(D, β) + ε
W = X + U
D = α0 + α1X + ν
S = D + ξ.
This model will be the focus of future research.
References
Albert, J. H. and Chib, S. (1993), “Bayesian analysis of binary and polychotomous response
data”, Journal of the American Statistical Association, 88, 669–679.
Amemiya, Y. (1990), “Two-stage instrumental variable estimators for the nonlinear errors-
in-variables model”, Journal of Econometrics, 44, 311–332.
Azzalini, A. (1985), “A class of distributions which includes the normal ones”, Scandinavian
Journal of Statistics 12, 171-178.
Berry, S. A., Carroll, R. J. and Ruppert, D. (2002), “Bayesian smoothing and regression
splines for measurement error problems”, Journal of the American Statistical Associa-
tion, 97, 160–169.
Box, G. E. P. and Tiao, G. (1973), Bayesian Inference in Statistical Analysis, Addison–
Wesley, London.
Buzas, J. S. and Stefanski, L. A. (1996), “Instrumental variable estimation in generalized
linear measurement error models”, Journal of the American Statistical Association, 91,
999–1006.
Carroll, R. J., Kuchenhoff, H., Lombard, F., and Stefanski, L. A. (1996), “Asymptotics for
the SIMEX estimator in structural measurement error models,” Journal of the American
Statistical Association, 91, 242–250.
Carroll, R. J., Maca, J. D. and Ruppert, D. (1999), “Nonparametric regression with errors
in covariates”, Biometrika, 86, 541–554.
Carroll, R. J., Ruppert, D., and Stefanski, L. A. (1995), Measurement Error in Nonlinear
Models, Chapman and Hall, New York.
Carroll, R. J. and Stefanski, L. A. (1994), “Meta–analysis, measurement error and corrections
for attenuation”, Statistics in Medicine, 13, 1265–1282.
Cook, J. R. and Stefanski, L. A. (1994), “Simulation–extrapolation estimation in parametric
measurement error models”, Journal of the American Statistical Association, 89, 1314–
1328.
Dominici, F., Zeger, S. L. and Samet, J. M. (2000), “Combining evidence on air pollution and
daily mortality from the largest 20 US cities: a hierarchical modeling strategy”, Journal
of the Royal Statistical Society, Series A, 163, 263–302.
Durrett, R. (1996), Probability: Theory and Examples, 2nd ed., Duxbury, Belmont, CA.
Eilers, P. H. C. and Marx, B. D. (1996), “Flexible smoothing with B–splines and penalties”
(with discussion), Statistical Science, 11, 89–102.
Fan, J. and Truong, Y. K. (1993), “Nonparametric regression with errors in variables”,
Annals of Statistics, 21, 1900–25.
Green, P. J. and Silverman, B. W. (1994), Nonparametric Regression and Generalized Linear
Models: A Roughness Penalty Approach, Chapman and Hall, London.
Hastie, T. and Tibshirani, R. (1990), Generalized Additive Models, Chapman and Hall, New York.
Hausman, J. A., Newey, W. K., Ichimura, H. and Powell, J. L. (1991), “Identification and
estimation of polynomial errors–in–variables models”, Journal of Econometrics, 50, 273–
295.
IARC, (1987), “Arsenic and Arsenic compounds (Group 1),” In: Monographs on the Evalu-
ation of Carcinogenic Risk of Chemicals to Humans, Supplement 7, pp. 100-106, Inter-
national Agency for Research on Cancer.
Karagas, M. R., Stukel, T. A., Morris, J. S., Tosteson, T. D., Weiss, J. A., Spencer, S. K. and
Greenberg, E. R., (2001), “Skin cancer risk in relation to toenail arsenic concentrations
in a US population-based case-control study”, American Journal of Epidemiology, 153,
559-65.
Karagas, M. R. and Tosteson, T. D. (2002), “Assessment of cancer risk and environmental lev-
els of arsenic in New Hampshire”, International Journal of Hygiene and Environmental
Health (in press).
Karagas, M. R., Tosteson, T. D., Blum, J., Morris, J. S., Baron, J. A. and Klaue, B.,
(1998), “Design of an epidemiologic study of drinking water arsenic exposure and skin
and bladder cancer risk in a US population”, Environmental Health Perspectives, 106,
1047-1050.
National Research Council, (1999), Arsenic in Drinking Water, National Academy Press,
Washington, DC.
Reeves, G. K., Cox, D. R., Darby, S. C. and Whitley, E. (1998), “Some aspects of mea-
surement error in explanatory variables for continuous and binary regression models”,
Statistics in Medicine, 17, 2157–2177.
Robert, C. P. (1995), “Simulation of truncated normal variables”, Statistics and Computing,
5, 121–125.
Ruppert, D. (2002), “Selecting the number of knots for penalized splines”, Journal of Com-
putational and Graphical Statistics, 11, 735–757.
Stefanski, L. A. and Buzas, J. S. (1995), “Instrumental variable estimation in binary regres-
sion measurement error variables”, Journal of the American Statistical Association, 90,
541–550.
Stefanski, L. A. and Cook, J. R. (1995), “Simulation–extrapolation: the measurement error
jackknife,” Journal of the American Statistical Association, 90, 1247–1256.
Strauss, W. J., Carroll, R. J., Bortnick, S. M., Menkedick, J. R. and Schulz, B. D. (2001),
“Combining datasets to predict the effects of regulation of environmental lead exposure
in housing stock”, Biometrics, 57, 203–210.
Appendix
A.1 Proof of Theorem 1
Let M be the smallest positive integer k such that cov[m(X), {X − E(X)}^k] is not zero. Then

cov[Y, {S − E(S)}^M] = cov[m(X), {α1(X − E(X)) + ν}^M]
= ∑_{j=0}^M (M choose j) α1^j cov[m(X), {X − E(X)}^j ν^{M−j}]
= α1^M cov[m(X), {X − E(X)}^M], (A.1)

since by (7), cov[m(X), {X − E(X)}^j ν^{M−j}] = cov[m(X), {X − E(X)}^j] E(ν^{M−j}), and by the definition of M we have cov[m(X), {X − E(X)}^j] = 0 for 1 ≤ j < M.

By an identical calculation,

cov[Y, {W − E(W)}^M] = cov[m(X), {X − E(X)}^M]. (A.2)

Then, by (A.1) and (A.2),

α1^M = cov[Y, {S − E(S)}^M] / cov[Y, {W − E(W)}^M]. (A.3)

If M is odd, then (A.3) determines α1 from moments of observables. If M is even, then α1 is only determined up to its sign by (A.3), but then its sign can be determined by the relation cov(W, S) = α1 σ2x and the assumption that σ2x > 0.
A.2 On Condition (8)
We now prove the following result showing that if X is compactly supported and m is
continuous, then (8) holds unless m(·) is constant. The condition that X is compactly
supported cannot be removed. A counterexample can be constructed using Counterexample
1 on page 107 of Durrett (1996). In that counterexample, it is shown that there are densities
distinct from the lognormal density but with the same moments as the lognormal. If fX is
the density of X and if m · fX is the difference between two distinct densities with the same
moments, then clearly E[m(X){X − E(X)}^k] = 0 for all k.
Theorem 2: Suppose that the support of X is contained in a compact interval [a, b] and that
m(·) is continuous on [a, b]. If
cov[m(X), {X − E(X)}^k] = 0 for all k, (A.4)
then var{m(X)} = 0 so that P [m(X) = E{m(X)}] = 1.
Proof: By the Weierstrass approximation theorem, for all δ > 0 there exists a polynomial mpoly(·) such that |m(x) − E{m(X)} − mpoly(x)| < δ for all x ∈ [a, b]. By (A.4), m(X) and mpoly(X) have zero covariance, so that

var{m(X)} = E[m(X) − E{m(X)} − mpoly(X)]^2 − E{mpoly(X)^2}
≤ δ^2(b − a) − E{mpoly(X)^2} ≤ δ^2(b − a). (A.5)

Since δ > 0 is arbitrary, the result follows.
A.3 MCMC Calculations in the Gaussian Case
In the Gaussian case, m(x) = C1(x)Tβ1 + C2(x)Tβ2, where β2 ∼ Normal(0, σ2D), and D is a k × k matrix. For the regression spline, D was chosen as the identity matrix. Priors for α0, α1, µx and β1 were independent normals with mean zero and (large) variances σ2α, σ2α, σ2µ and σ2βI, respectively. The prior for the attenuation λ was uniform on the interval [λL, λH]. Of course, by simple algebra, σ2x = λσ2u/(1 − λ). Priors for σ2ε, σ2u, σ2ν and σ2 were inverse Gamma with parameters (aε, bε), (au, bu), (aν, bν), (aσ, bσ), respectively, where the IG(A, B) density is given by {Γ(A) B^A x^{A+1}}^{−1} exp{−1/(Bx)}. Let C(x) = {C1(x)T, C2(x)T}T, and define

H = ∑_{i=1}^n C(Xi)Yi/σ2ε;
Q = {∑_{i=1}^n C(Xi)C(Xi)T/σ2ε + diag(I/σ2β, Ik/σ2)}^{−1};
D = {∑_{i=1}^n (1, Xi)T(1, Xi)/σ2ν + I2/σ2α}^{−1}; and A = ∑_{i=1}^n (1, Xi)T Si/σ2ν.

The joint density of the data and the parameters, i.e., the unnormalized posterior density, is proportional to
exp[−∑_{i=1}^n {Yi − C1(Xi)Tβ1 − C2(Xi)Tβ2}^2/(2σ2ε) − ∑_{i=1}^n (Wi − Xi)^2/(2σ2u) − ∑_{i=1}^n (Si − α0 − α1Xi)^2/(2σ2ν)
− ∑_{i=1}^n (1 − λ)(Xi − µx)^2/(2λσ2u) − µx^2/(2σ2µ) − α0^2/(2σ2α) − α1^2/(2σ2α) − β1Tβ1/(2σ2β) − β2TD^{−1}β2/(2σ2)
− 1/(bεσ2ε) − 1/(buσ2u) − 1/(bνσ2ν) − 1/(bσσ2)]
× (σ2ε)^{−(aε+n/2+1)} (σ2u)^{−(au+n+1)} (σ2ν)^{−(aν+n/2+1)} (σ2)^{−(aσ+k/2+1)} {(1 − λ)/λ}^{n/2}.
The complete conditionals are as follows:
µx = Normal{X̄ n(1 − λ)σ2µ / {n(1 − λ)σ2µ + λσ2u}, λσ2u σ2µ / {λσ2u + n(1 − λ)σ2µ}}, where X̄ is the sample mean of the Xi;
σ2u = IG(au + n, [1/bu + (1/2){(1 − λ)/λ} ∑_{i=1}^n (Xi − µx)^2 + (1/2) ∑_{i=1}^n (Wi − Xi)^2]^{−1});
σ2ε = IG(aε + n/2, [1/bε + (1/2) ∑_{i=1}^n {Yi − C1(Xi)Tβ1 − C2(Xi)Tβ2}^2]^{−1});
σ2ν = IG[aν + n/2, {1/bν + (1/2) ∑_{i=1}^n (Si − α0 − α1Xi)^2}^{−1}];
σ2 = IG[aσ + k/2, {1/bσ + (1/2) β2TD^{−1}β2}^{−1}];
(α0, α1) = Normal(DA, D);
(β1T, β2T)T = Normal(QH, Q);
Xi ∝ exp[−{Yi − C1(Xi)Tβ1 − C2(Xi)Tβ2}^2/(2σ2ε) − (Wi − Xi)^2/(2σ2u) − (Si − α0 − α1Xi)^2/(2σ2ν) − (1 − λ)(Xi − µx)^2/(2λσ2u)].
In addition,

λ ∝ I(λL ≤ λ ≤ 1) {(1 − λ)/λ}^{n/2} exp{−∑_{i=1}^n (1 − λ)(Xi − µx)^2/(2λσ2u)}. (A.6)
All the complete conditionals except for λ and the X’s are easily generated. For λ,
in our simulations, we discretized the set λ ∈ [λL, λH ] into 41 different values, computed
(A.6) for these values, turned the result into probabilities, and sampled λ according to
these probabilities. This gridded Gibbs estimator is not strictly correct, of course, but it
is convenient and provides good mixing. We also implemented a full Metropolis–Hastings
step: mixing was not quite as good, thus requiring somewhat more MCMC samples, but
in selected test cases we found that the final fits to the regression function were virtually
identical to our gridded method.
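A sketch of the gridded Gibbs step for λ: evaluate (A.6) on a grid over [λL, λH), normalize, and sample. Working on the log scale avoids underflow, and the endpoint λ = 1 is excluded because the density vanishes there. This is an illustration, not our MATLAB code.

    import numpy as np

    def sample_lambda(X, mu_x, sigma2_u, lam_lo=0.60, lam_hi=1.00, ngrid=41, rng=None):
        # Evaluate the complete conditional (A.6) on a grid, normalize, sample.
        rng = rng or np.random.default_rng()
        n = len(X)
        lam = np.linspace(lam_lo, lam_hi, ngrid, endpoint=False)  # density vanishes at 1
        ss = np.sum((X - mu_x) ** 2)
        log_p = 0.5 * n * (np.log1p(-lam) - np.log(lam)) \
                - (1.0 - lam) * ss / (2.0 * lam * sigma2_u)
        log_p -= log_p.max()                 # stabilize before exponentiating
        p = np.exp(log_p)
        return rng.choice(lam, p=p / p.sum())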
For the X's, the complete conditional is not explicit. We used Metropolis–Hastings steps
where the candidate density was normal with the current value of X as the mean and the
variance being 1/2 times the conditional variance for X given (W,S), the latter variance
evaluated at the current parameter values.
In our simulations, the prior distributions were as follows: σ2u = IG(1, 1), σ2ν = IG(1, 1), σ2ε = IG(1, 1), λ = U[0.60, 1.00], µx = Normal(0, 100), α0 = Normal(0, 100), α1 = Normal(0, 100), and σ2 = IG(1, 1000). We also used σ2 = IG(0.01, 100) without appreciable differences in some test cases.
The model can be extended to incorporate possible prior information on the parameters µx, α0, α1, β1, β2. Since we have no such prior information, we did not implement the following calculations. An additional enhancement of the model is to consider the covariance matrix D of β2 unknown and allow an inverse Wishart prior for its distribution. This is equivalent to assuming a multivariate t-distribution on the coefficients of the spline basis functions (e.g., Box and Tiao, 1973, Theorem 8.5.1) instead of the normal distribution. For these parameters, consider the following new set of priors:

µx = Normal(µ0, σ2µ0);
(α0, α1) = Normal(a, Σa);
(β1T, β2T)T = Normal{(β1,0T, 0T)T, diag(Σβ1, Σβ2)};
Σβ2 = Inverse Wishart(R0, q0).

Here 0 is a vector of zeros representing the mean of the vector β2, and a Wishart distribution with parameters (R0, q0) has pdf proportional to

(det Σ)^{q0/2−1} exp{−(1/2) trace(ΣR0)}.
With these new priors the posterior distributions for μ_x, (α_0, α_1), β_1, β_2 become

μ_x = Normal{ [X̄ n(1 − λ)σ²_{μ_0} + μ_0 λσ²_u] / [n(1 − λ)σ²_{μ_0} + λσ²_u], λσ²_u σ²_{μ_0} / [λσ²_u + n(1 − λ)σ²_{μ_0}] };

(α_0, α_1) = Normal(DA, D);

(β_1^T, β_2^T)^T = Normal(QH, Q),

where now

D = { Σ_{i=1}^n (1, X_i)^T (1, X_i)/σ²_ν + Σ_a^{−1} }^{−1},  A = Σ_{i=1}^n (1, X_i)^T S_i/σ²_ν + Σ_a^{−1} a,

H = Σ_{i=1}^n C(X_i) Y_i/σ²_ε + {(Σ_{β_1}^{−1} β_{1,0})^T, 0^T}^T,

and

Q = { Σ_{i=1}^n C(X_i) C^T(X_i)/σ²_ε + diag(Σ_{β_1}^{−1}, Σ_{β_2}^{−1}) }^{−1}.

Finally, the posterior distribution of Σ_{β_2} is

Σ_{β_2} = Inverse Wishart(R_0 + β_2 β_2^T, q_0),

and all the other posterior distributions remain unchanged.
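As an illustration, this conjugate draw can be made with scipy's inverse Wishart; note that scipy parametrizes invwishart by (df, scale) under conventions that need not coincide with the (R_0, q_0) convention above, so the mapping shown is an assumption to be checked against one's chosen parametrization:

    import numpy as np
    from scipy.stats import invwishart

    def update_Sigma_beta2(beta2, R0, q0, rng=None):
        # Posterior scale matrix R0 + beta2 beta2^T, as in the text; the
        # identification of q0 with scipy's degrees of freedom is assumed.
        scale = R0 + np.outer(beta2, beta2)
        return invwishart.rvs(df=q0, scale=scale, random_state=rng)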
A.4 MCMC Calculations in the Probit Model
We fit a probit regression model, turning it into a logistic fit by the usual device: if the
probability is p, then the logit function is log{p/(1−p)}. Note that we are not approximating
the logit model by a probit model. Rather, our method is exact since if the logit of P (Y =
1|X) is a smooth function of X, then the probit of P (Y = 1|X) is another smooth function
of X.
For the probit model, we modified the method of Albert and Chib (1993). Specifically, one defines latent variables Z_i that are normally distributed with mean C_1(X_i)β_1 + C_2(X_i)β_2 and variance 1.0, so that Y_i = I(Z_i > 0). Given the values of the Z_i, the MCMC steps of Section A.3 apply without change, with two exceptions: (a) σ²_ε = 1 is known a priori; and (b) Z_i replaces Y_i in that section. This means that the only thing necessary in the MCMC steps is to generate values of the Z_i from their complete conditional distribution. Write μ_i = C_1(X_i)β_1 + C_2(X_i)β_2. The density of Z_i given the rest is

f(Z_i | rest) ∝ {Y_i I(Z_i > 0) + (1 − Y_i) I(Z_i ≤ 0)} exp{ −(1/2)(Z_i − μ_i)² }.
This means that if Y_i = 1, then Z_i is a truncated normal random variable, i.e., a normal random variable with mean μ_i and variance 1.0, truncated on the left at 0.0. If Y_i = 0, then Z_i is a normal random variable with mean μ_i and variance 1.0, truncated on the right at 0.0. Define R_i = 1 − 2I(Y_i = 1), and let TN(a, b) denote a normal random variable with mean a and variance 1.0, truncated on the left at b. Then the complete conditional of Z_i is Z_i ∼ μ_i − R_i TN(0, R_i μ_i). To generate these truncated normals, we used the accept–reject
algorithm of Robert (1995), with the following modification: when generating a normal random variable truncated from the left (right) at 0 with a positive (negative) mean, we did not use Robert's algorithm but instead generated normals at random until one was positive (negative).
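Alternatively, the latent-variable draw can be made with a library truncated normal rather than an accept–reject scheme. A minimal Python sketch with illustrative names, vectorized over observations:

    import numpy as np
    from scipy.stats import truncnorm

    def sample_latent_Z(y, mu, rng=None):
        # Z_i ~ N(mu_i, 1) truncated to (0, inf) when y_i = 1 and to
        # (-inf, 0] when y_i = 0, as in the complete conditional above.
        lower = np.where(y == 1, 0.0, -np.inf)
        upper = np.where(y == 1, np.inf, 0.0)
        # scipy's truncnorm takes the truncation limits on the N(0,1) scale
        return truncnorm.rvs(lower - mu, upper - mu, loc=mu, scale=1.0,
                             random_state=rng)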
While the candidate density for X discussed in the Gaussian case (Section A.3) worked well enough there, we found that it was much less efficient in the probit model. The following gave better mixing and faster convergence of the sampler. Suppose that the current value of X_i is X_{curr,i}, and that the latent variables are Z_i. Let β_lin be the simple linear regression estimate from regressing {Z_i}_{i=1}^n on {X_{curr,i}}_{i=1}^n. Our candidate density was the density of X given (Z, W, S), assuming a linear model relating Z and X with coefficients β_lin and using the current values of μ_x, σ²_x, α_0, α_1, σ²_u and σ²_ν. In all cases investigated, the Metropolis–Hastings acceptance rate for X was over 95%.
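A sketch of this proposal, again in illustrative Python and assuming the residual variance of the regression of Z on X is fixed at 1 (the probit scale):

    import numpy as np

    def propose_X_probit(X_curr, Z, W, S, mu_x, sigma2_x, alpha0, alpha1,
                         sigma2_u, sigma2_nu, rng):
        # Simple linear regression of Z on the current X's.
        b_lin = np.cov(X_curr, Z, bias=True)[0, 1] / np.var(X_curr)
        a_lin = Z.mean() - b_lin * X_curr.mean()
        # Combine the X distribution with the W, S and (approximate) Z
        # likelihoods: the conditional of X given (Z, W, S) is normal with
        # the usual precision-weighted mean.
        prec = (1 / sigma2_x + 1 / sigma2_u + alpha1 ** 2 / sigma2_nu
                + b_lin ** 2)
        mean = (mu_x / sigma2_x + W / sigma2_u
                + alpha1 * (S - alpha0) / sigma2_nu
                + b_lin * (Z - a_lin)) / prec
        return rng.normal(mean, np.sqrt(1.0 / prec))

The draw is vectorized over i; because the proposal is not symmetric, a Metropolis–Hastings accept–reject step with the exact complete conditional, and with the proposal density appearing in the ratio, would follow.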
A.5 WinBUGS Code for the Gaussian Model

model
{
    # Likelihood description corresponding to (W, S).
    # Notation is the same as in the paper.
    for (i in 1:N) {
        W[i] ~ dnorm(X[i], tauu)
        S[i] ~ dnorm(mS[i], taunu)
        X[i] ~ dnorm(mux, taux)
        mS[i] <- alpha0 + alpha1 * X[i]
    }
    # In general, tau denotes a precision (1/variance): taux is the
    # precision of the distribution of X, tauu the precision of the
    # distribution of U; lambda is the attenuation.
    taux <- (1 - lambda) * tauu / lambda
    # Construct the matrix Z of truncated spline polynomials.
    # pow() is the power function (second argument is the exponent);
    # step() gives the truncation (plus) function: it equals 1 if its
    # argument is greater than 0 and 0 otherwise.
    for (i in 1:N) {
        for (k in 1:K) {
            Z[i, k] <- pow(X[i] - Knots[k], 2) * step(X[i] - Knots[k])
        }
    }
    # Likelihood description corresponding to the Y observations:
    # m1[] stores the quadratic polynomial in X, m2[] the truncated
    # spline part of the regression, m[] is the mean of Y[], and
    # taueps is the precision of Y[].
    for (i in 1:N) {
        m1[i] <- beta0 + beta1 * X[i] + beta2 * pow(X[i], 2)
        m2[i] <- inprod(b[], Z[i, ])
        m[i] <- m1[i] + m2[i]
        Y[i] ~ dnorm(m[i], taueps)
    }
    # Prior on b[], the coefficients of the truncated spline basis.
    for (k in 1:K) {
        b[k] ~ dnorm(0, tau)
    }
    # Priors on the parameters of the model. Gamma(a, b) has mean a/b
    # and variance a/b^2; in normal distributions the second parameter
    # is the precision.
    lambda ~ dunif(0.6, 1)
    taueps ~ dgamma(1, 1)
    tauu ~ dgamma(1, 1)
    taunu ~ dgamma(1, 1)
    tau ~ dgamma(0.01, 0.01)
    mux ~ dnorm(0, 0.01)
    alpha0 ~ dnorm(0, 0.01)
    alpha1 ~ dnorm(0, 0.01)
    beta0 ~ dnorm(0, 0.01)
    beta1 ~ dnorm(0, 0.01)
    beta2 ~ dnorm(0, 0.01)
    # The rest contains only deterministic transformations, used to
    # compute MASE. BA[] is the grid of X values where the regression
    # function is evaluated; ZBA[] is the analogue of Z[] for the grid.
    for (i in 1:NB) {
        for (k in 1:K) {
            ZBA[i, k] <- pow(BA[i] - Knots[k], 2) * step(BA[i] - Knots[k])
        }
    }
    # Regression function at the grid points and squared difference
    # from the true function func[].
    for (i in 1:NB) {
        meanBA[i] <- beta0 + beta1 * BA[i] + beta2 * pow(BA[i], 2) + inprod(b[], ZBA[i, ])
        distsquare[i] <- pow(meanBA[i] - func[i], 2)
    }
    # Compute MASE.
    MASE <- (NB - 1) * mean(distsquare[])
}
                                    Case 1            Case 2            Case 3
                                 Mean     Mean     Mean     Mean     Mean     Mean
Sample                           Squared  Squared  Squared  Squared  Squared  Squared
Size    Distribution  Method     Bias     Error    Bias     Error    Bias     Error
100     Normal        Naive       1.40     1.98     7.27     8.43     2.99     3.72
                      SIMEX(L)    0.82     1.61     6.56     8.19     2.72     3.77
                      SIMEX(Q)    0.52     3.31     4.60    11.25     1.92     5.90
                      Bayes       0.21     1.02     2.51     4.40     1.29     2.97
        Uniform       Naive       0.91     1.64     5.94     7.09     2.61     3.34
                      SIMEX(L)    0.57     1.40     5.32     6.59     2.31     3.14
                      SIMEX(Q)    0.43     3.33     2.86     7.34     1.29     4.20
                      Bayes       0.19     0.78     2.61     3.80     1.62     2.44
        Skew Normal   Naive       1.38     2.11     9.64    10.91     3.28     4.12
                      SIMEX(L)    0.84     1.68     9.87    11.26     3.36     4.26
                      SIMEX(Q)    0.58     3.57     8.36    13.17     2.59     5.34
                      Bayes       0.29     1.21     4.71     6.76     1.44     3.28

Table 1: 100 × mean squared bias and 100 × mean squared error for the simulation for the
spline Gaussian error model. In Case 1, the regression function is 1/{1 + exp(4x)}. In Case
2, the regression function is sin(πx/2)/(1 + [2x²{1 + sin(πx/2)}]). In Case 3, the regression
function is sin(πx/2)/(1 + [2x²{1 + sign(x)}]).
                 Naive                            Bayes
Case   Mean Squared  Mean Squared     Mean Squared  Mean Squared
       Bias          Error            Bias          Error
  1        9.91          12.72            0.53           5.95
  2        4.82           9.43            0.17           4.30
  3        1.76           4.43            0.12           2.95
  4        6.39           9.92            1.02           5.33
  5        3.45           7.03            0.43           3.78
  6        1.70           5.10            0.14           3.00
  7        7.16          11.08            3.68           6.66
  8        4.98           9.21            3.38           6.15
  9        7.14          11.19            3.70           8.63
 10        4.79           7.92            2.81           6.10
 11        1.72           4.60            1.56           4.33
 12        0.45           2.96            0.40           2.82
 13        0.02           2.21            0.01           2.34

Table 2: 100 × mean squared bias and 100 × mean squared error for the simulation for the
spline logistic error model. In this table, the 13 cases were as follows: Case 1 means
m(x) = x; Case 2 means m(x) = 0.75x; Case 3 means m(x) = 0.50x; Case 4 means
m(x) = [4/{1 + exp(−x)}] − 2; Case 5 means m(x) = 0.75([4/{1 + exp(−x)}] − 2); Case 6
means m(x) = 0.50([4/{1 + exp(−x)}] − 2); Case 7 means m(x) = (x − 0.75)+ − (−0.75 − x)+;
Case 8 means m(x) = (x − 1.0)+ − (−1.0 − x)+; Case 9 means m(x) = x+; Case 10 means
m(x) = 0.75x+; Case 11 means m(x) = 0.50x+; Case 12 means m(x) = 0.25x+; and Case 13
means m(x) ≡ 0. Here n = 500, there were 200 simulated data sets, there were 8,000 MCMC
steps of which the first 4,000 were burn-in, the degree of the polynomial was d = 1, the
number of knots was 10, the functions were evaluated on a grid from −2.0 to 2.0, σ²_x = 1,
σ²_u = 0.32, σ²_ν = 1, α_0 = 0, α_1 = 1, μ_x = 0, and the attenuation was confined to the
interval λ = σ²_x/(σ²_x + σ²_u) ∈ (0.60, 1.00). Biases and mean squared errors were
computed for x ∈ [−2.0, 2.0].
       Naive      Naive      MLE        Variance Ratio:
Case   100×RASB   100×RMSE   100×RMSE   MLE to Naive
  1     9.82       12.49      13.43         3.03
  2    14.09       16.43      14.70         3.03
  3     9.33       11.92      14.74         3.95
  4    63.42       68.65      35.44         1.82

Table 3: Asymptotic calculations for polynomial approximations to 4 functions in the
Gaussian case. Here "RASB" denotes the square root of the average squared bias, while
"RMSE" is the square root of the mean squared error. In these calculations, it was assumed
that the sample size was n = 100 and that the MLE had no small-sample bias. In Case 1, the
target regression function is 1/{1 + exp(4x)}. In Case 2, the target regression function is
sin(πx/2)/(1 + [2x²{1 + sin(πx/2)}]). In Case 3, the target regression function is
sin(πx/2)/(1 + [2x²{1 + sign(x)}]). In Case 4, the regression function is x².
[Figure 1 consists of four panels, each plotting the True, Naive, SIMEX, and Bayes fits over x ∈ [−2, 2].]
Figure 1: Results from the simulations corresponding to Table 1, Normal case, Case 3. The
top left, top right, and bottom left are 3 simulated data sets. The bottom right is the mean
over all simulated data sets.
[Figure 2 consists of four panels, each plotting the True, Naive, SIMEX, and Bayes fits over x ∈ [−2, 2].]
Figure 2: Results from the simulations corresponding to Table 1, Normal case, Case 2. The
top left, top right, and bottom left are 3 simulated data sets. The bottom right is the mean
over all simulated data sets.
[Figure 3, titled "Nail Arsenic = Instrument, Basal Cell Cancers, degree = 2", plots the Bayes fit with uniform Bayes confidence bands and the Naive fit over x ∈ [−4, 4].]
Figure 3: Logit of the probability of basal cell cancer as a function of X, the transformed
value of the arsenic concentration in the drinking water.
[Figure 4 consists of four panels, titled Case 1 through Case 4, each plotting the Actual and True curves over x ∈ [−2, 2].]
Figure 4: The "actual" functions (solid lines) from Cases 1–4 for the Gaussian simulation
and their "true" polynomial approximations (dashed lines) used in computing theoretical
asymptotic distributions.