
Statistical Science 2011, Vol. 26, No. 1, 130–149
DOI: 10.1214/11-STS354
© Institute of Mathematical Statistics, 2011

Variable Selection for Nonparametric Gaussian Process Priors: Models and Computational Strategies

Terrance Savitsky, Marina Vannucci and Naijun Sha

Terrance Savitsky is Associate Statistician, RAND Corporation, 1776 Main Street, Santa Monica, California 90401-3208, USA (e-mail: [email protected]). Marina Vannucci is Professor, Department of Statistics, Rice University, 6100 Main Street, Houston, Texas 77030, USA (e-mail: [email protected]). Naijun Sha is Associate Professor, Department of Mathematical Sciences, University of Texas at El Paso, 500 W University Ave, El Paso, Texas 79968, USA (e-mail: [email protected]).

Abstract. This paper presents a unified treatment of Gaussian process models that extends to data from the exponential dispersion family and to survival data. Our specific interest is in the analysis of data sets with predictors that have an a priori unknown form of possibly nonlinear associations to the response. The modeling approach we describe incorporates Gaussian processes in a generalized linear model framework to obtain a class of nonparametric regression models where the covariance matrix depends on the predictors. We consider, in particular, continuous, categorical and count responses. We also look into models that account for survival outcomes. We explore alternative covariance formulations for the Gaussian process prior and demonstrate the flexibility of the construction. Next, we focus on the important problem of selecting variables from the set of possible predictors and describe a general framework that employs mixture priors. We compare alternative MCMC strategies for posterior inference and achieve a computationally efficient and practical approach. We demonstrate performances on simulated and benchmark data sets.

Key words and phrases: Bayesian variable selection, generalized linear models, Gaussian processes, latent variables, MCMC, nonparametric regression, survival data.

1. INTRODUCTION

In this paper we present a unified modeling approach to Gaussian processes (GP) that extends to data from the exponential dispersion family and to survival data. With the advent of kernel-based methods, models utilizing Gaussian processes have become very common in machine learning approaches to regression and classification problems; see Rasmussen and Williams (2006). In the statistical literature GP regression models have been used as a nonparametric approach to model the nonlinear relationship between a response variable and a set of predictors; see, for example, O'Hagan (1978). Sacks, Schiller and Welch (1989) employed a stationary GP function of spatial locations in a regression model to account for residual spatial variation. Diggle, Tawn and Moyeed (1998) extended this construction to model the link function of the generalized linear model (GLM) construction of McCullagh and Nelder (1989). Neal (1999) considered linear regression and logit models.

We follow up on the literature cited above and introduce Gaussian process models as a class that broadens the generalized linear construction by incorporating fairly complex continuous response surfaces. The key idea of the construction is to introduce latent variables on which a Gaussian process prior is imposed. In the general case the GP construction replaces the linear relationship in the link function of a GLM. This results in a class of nonparametric regression models that can accommodate linear and nonlinear terms, as well as noise terms that account for unexplained sources of variation in the data. The approach extends to latent regression models used for continuous, categorical and count data. Here we also consider a class of models that account for survival outcomes. We explore alternative covariance formulations for the GP prior and demonstrate the flexibility of the construction. In addition, we address practical computational issues that arise in the application of Gaussian processes due to numerical instability in the calculation of the covariance matrix.

Next, we look at the important problem of selecting variables from a set of possible predictors and describe a general framework that employs mixture priors. Bayesian variable selection has been a topic of much attention among researchers over the last few years. When a large number of predictors is available the inclusion of noninformative variables in the analysis may degrade the prediction results. Bayesian variable selection methods that use mixture priors were investigated for the linear regression model by George and McCulloch (1993, 1997), with contributions by various other authors on special features of the selection priors and on computational aspects of the method; see Chipman, George and McCulloch (2001) for a nice review. Extensions to linear regression models with multivariate responses were put forward by Brown, Vannucci and Fearn (1998b) and to multinomial probit by Sha et al. (2004). Early approaches to Bayesian variable selection for generalized linear models can be found in Chen, Ibrahim and Yiannoutsos (1999) and Raftery, Madigan and Volinsky (1996). Survival models were considered by Volinsky et al. (1997) and, more recently, by Lee and Mallick (2004) and Sha, Tadesse and Vannucci (2006). As for Gaussian process models, Linkletter et al. (2006) investigated Bayesian variable selection methods in the linear regression framework by employing mixture priors with a spike at zero on the parameters of the covariance matrix of the Gaussian process prior.

Our unified treatment of Gaussian process models extends the line of work of Linkletter et al. (2006) to more complex data structures and models. We transform the covariance parameters and explore designs and MCMC strategies that aim at producing a minimally correlated parameter space and efficiently convergent sampling schemes. In particular, we find that Metropolis-within-Gibbs schemes achieve a substantial improvement in computational efficiency. Our results on simulated data and benchmark data sets show that GP models can lead to improved predictions without the requirement of pre-specifying higher order and nonlinear additive functions of the predictors. We show, in particular, that a Gaussian process covariance matrix with a single exponential term is able to map a mixture of linear and nonlinear associations with excellent prediction performance.

GP models can be considered part of the broad class of nonparametric regression models of the type y = f(x) + error, with y an observed (or latent) response, f an unknown function and x a p-dimensional vector of covariates, and where the objective is to estimate the function f for prediction of future responses. Among possible alternative choices to GP models, one famous class is that of kernel regression models, where the estimate of f is selected from the set of functions contained in the reproducing kernel Hilbert space (RKHS) induced by a chosen kernel. Kernel models have a long and successful history in statistics and machine learning [see Parzen (1963), Wahba (1990) and Shawe-Taylor and Cristianini (2004)] and include many of the most widely used statistical methods for nonparametric estimation, including spline models and methods that use regularized techniques. Gaussian processes can be constructed with kernel convolutions and, therefore, GP models can be seen as contained in the class of nonparametric kernel regression with exponential family observations. Rasmussen and Williams (2006), in particular, note that the GP construction is equivalent to a linear basis regression employing an infinite set of Gaussian basis functions and results in a response surface that lies within the space of all mathematically smooth, that is, infinitely mean square differentiable, functions spanning the RKHS. Constructions of Bayesian kernel methods in the context of GP models can be found in Bishop (2006) and Rasmussen and Williams (2006).

Another popular class of nonparametric spline regression models is the generalized additive models (GAM) of Ruppert, Wand and Carroll (2003), that employ linear projections of the unknown function f onto a set of basis functions, typically cubic splines or B-splines, and related extensions, such as the structured additive regression (STAR) models of Fahrmeir, Kneib and Lang (2004) that, in addition, include interaction surfaces, spatial effects and random effects. Generally speaking, these regression models impose additional structure on the predictors and are therefore better suited for the purpose of interpretability, while Gaussian process models are better suited for prediction. Extensions of STAR models also enable variable selection based on spike and slab type priors; see, for example, Panagiotelis and Smith (2008).

Ensemble learning models, such as bagging, boosting and random forest models, utilize decision trees as basis functions; see Hastie, Tibshirani and Friedman (2001). Trees readily model interactions and nonlinearity, subject to a maximum tree depth constraint to prevent overfitting. Generalized boosting models (GBMs), such as the AdaBoost of Freund and Schapire (1997), represent a nonlinear function of the covariates by simpler basis functions typically estimated in a stage-wise, iterative fashion that successively adds the basis functions to fit generalized or pseudo residuals obtained by minimizing a chosen loss function. GBMs accommodate dichotomous, continuous, event time and count responses. These models would be expected to produce prediction results similar to those of GP regression and classification models. We explore their behavior on one of the benchmark data sets in the application section of this paper. Notice that GBMs do not incorporate an explicit variable selection mechanism that allows us to exclude nuisance covariates, as we do with GP models, although they do provide a relative measure of variable importance, averaged over all trees.

Regression trees partition the predictor space and fit independent models in different parts of the input space, therefore facilitating nonstationarity and leading to smaller local covariance matrices. "Treed GP" models are constructed by Gramacy and Lee (2008) and extend the constant and linear construction of Chipman, George and McCulloch (2002). A prior is specified over the tree process, and posterior inference is performed on the joint tree and leaf models. The effect of this formulation is to allow the correlation structure to vary over the input space. Since each tree region is composed of a portion of the observations, there is a computational saving in generating the GP covariance matrix from m_r < n observations for region r. The authors note that treed GP models are best suited "…towards problems with a smaller number of distinct partitions…." So, while it is theoretically possible to perform variable selection in a forward selection manner, in applications these models are often used with single covariates.

The rest of the paper is organized as follows: In Section 2 we formally introduce the class of GP models by broadening the generalized linear construction. We also extend this class to include models for survival data. Possible constructions of the GP covariance matrix are enumerated in Section 3. Prior distributions for variable selection are discussed in Section 4 and posterior inference, including MCMC algorithms and prediction strategies, in Section 5. We include simulated data illustrations for continuous, count and survival data regression in Section 6, followed by benchmark applications in Section 7. Concluding remarks and suggestions for future research are in Section 8. Some details on computational issues and related pseudo-code are given in the Appendix.

2. GAUSSIAN PROCESS MODELS

We introduce Gaussian process models via a unified modeling approach that extends to data from the exponential dispersion family and to survival data.

2.1 Generalized Models

In a generalized linear model the monotone link function g(·) relates the linear predictors to the canonical parameter as g(ηi) = xi′β, with ηi the canonical parameter for the ith observation, xi = (x1, . . . , xp)′ a p × 1 column vector of predictors for the ith subject and β the coefficient vector β = (β1, . . . , βp)′. A broader class of models that incorporate fairly complex continuous response surfaces is obtained by introducing latent variables on which a Gaussian process prior is imposed. More specifically, the latent variables z(xi) define the values of the link function as

g(ηi) = z(xi),   i = 1, . . . , n,   (1)

and a Gaussian process (GP) prior on the n × 1 latent vector is specified as

z(X) = (z(x1), . . . , z(xn))′ ∼ N(0, C),   (2)

with the n × n covariance matrix C a fairly complex function of the predictors. This class of models can be cast within the model-based geostatistics framework of Diggle, Tawn and Moyeed (1998), with the dimension of the space being equal to the number of covariates.

The class of models introduced above extends to latent regression models used for continuous, categorical and count data. We provide some details on models for continuous and binary responses and for count data, since we will be using these cases in our simulation studies presented below. GP regression models are obtained by choosing the link function in (1) as the identity function, that is,

y = z(X) + ε,   (3)

with y the n × 1 observed response vector, z(X) an n-dimensional realization from a GP as in (2), and ε ∼ N(0, (1/r)In) with r a precision parameter. A Gamma prior can be imposed on r, that is, r ∼ G(ar, br). Linear models of type (3) were studied by Neal (1999) and Linkletter et al. (2006). One notices that, by integrating z(X) out, the marginalized likelihood is

y | C, r ∼ N(0, (1/r)In + C),   (4)

that is, a regression model with the covariance matrix of the response depending on the predictors. Nonlinear response surfaces can be generated as a function of those covariates for suitable choices of the covariance matrix. We discuss some of the most popular in Section 3.
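As a concrete illustration of models (2)–(4), the following sketch (Python with NumPy; the isotropic stand-in kernel, sample sizes and seed are placeholders of ours, not the authors' code) draws a latent GP vector, forms a continuous response under model (3), and evaluates the marginalized Gaussian log-likelihood (4) through a Cholesky factorization.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X = rng.uniform(size=(n, 2))                # predictors, scaled to the unit square

# Stand-in covariance: an isotropic Gaussian kernel; Section 3 discusses the
# specific forms (7)-(9) actually used in the paper.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
C = np.exp(-5.0 * sq_dists)

r = 20.0                                                          # residual precision in model (3)
z = rng.multivariate_normal(np.zeros(n), C + 1e-8 * np.eye(n))    # latent draw, eq. (2)
y = z + rng.normal(scale=r ** -0.5, size=n)                       # observed response, eq. (3)

def marginal_loglik(y, C, r):
    """Gaussian log-likelihood of eq. (4): y ~ N(0, (1/r) I + C), via Cholesky."""
    S = C + np.eye(len(y)) / r
    L = np.linalg.cholesky(S)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

print(marginal_loglik(y, C, r))
```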

In the case of a binary response, class labels ti ∈ {0,1} for i = 1, . . . , n are observed. We assume ti ∼ Binomial(1; pi) and define pi = P(ti = 1|z(xi)) with z(X) as in (2). For logit models, for example, we have pi = F(z(xi)) = 1/[1 + exp(−z(xi))]. Similarly, for binary probit we can directly define the inverse link function as pi = Φ(z(xi)), with Φ(·) the cdf of the standard normal distribution. However, a more common approach to inference in probit models uses data augmentation; see Albert and Chib (1993). This approach defines latent values yi which are related to the response via a regression model, that is, in our latent GP framework, yi = z(xi) + εi, with εi ∼ N(0,1), and associated to the observed classes, ti, via the rule ti = 1 if yi > 0 and ti = 0 if yi < 0. Notice that the latent variable approach results in a GP on y with a covariance function obtained by adding a "jitter" of variance one to C, with a similar effect to the noise component in the regression models (3) and (4). Neal (1999) argues that an effect close to a probit model can be produced by a logit model by introducing a large amount of jitter in its covariance matrix. Extensions to multivariate models for continuous and categorical responses are quite straightforward.
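For the probit case, a minimal sketch of the Albert and Chib (1993) data-augmentation step referenced above is given below; the function name and toy inputs are ours, and we assume the standard form of the update, a unit-variance truncated normal draw centered at the current latent GP value.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_augmented_y(z, t):
    """One Gibbs update of the latent y_i in the probit GP model:
    y_i | z, t_i ~ N(z_i, 1), truncated to (0, inf) if t_i = 1 and to (-inf, 0) if t_i = 0."""
    z = np.asarray(z, dtype=float)
    lower = np.where(t == 1, 0.0, -np.inf)
    upper = np.where(t == 1, np.inf, 0.0)
    # truncnorm expects bounds standardized by (loc, scale); here scale = 1
    return truncnorm.rvs(lower - z, upper - z, loc=z, scale=1.0)

z = np.array([0.4, -0.2, 1.1, 0.0, -0.7])   # current latent GP values z(x_i)
t = np.array([1, 0, 1, 1, 0])               # observed class labels
print(sample_augmented_y(z, t))
```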

As another example, count data models can be obtained by choosing the canonical link function for the Poisson distribution as log(λ) = z(X), with z(X) as in (2). Over-dispersion, possibly caused by lack of inclusion of all possible predictors, is taken into account by modeling the extra variability via random effects, ui, that is, λ*i = exp(z(xi) + ui) = exp(z(xi)) exp(ui) = λiδi. For identifiability, one can impose E(δi) = 1 and marginalize over δi using a conjugate prior, δi ∼ G(τ, τ), to achieve the negative binomial likelihood as in Long (1997),

π(si|λi, τ) = [Γ(si + τ) / (Γ(si + 1)Γ(τ))] (τ/(τ + λi))^τ (λi/(τ + λi))^si,   (5)

for si ∈ N ∪ {0}, with the same mean as the Poisson regression model, that is, E(si) = λi, and Var(si) = λi + λi²/τ, with the added parameter τ capturing the variance inflation associated with over-dispersion.
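The marginalized count likelihood (5) can be evaluated on the log scale for numerical stability; the sketch below (function name ours) assumes λi = exp(z(xi)) as defined above.

```python
import numpy as np
from scipy.special import gammaln

def nb_logpmf(s, lam, tau):
    """Log of eq. (5): the negative binomial likelihood obtained by marginalizing the
    multiplicative gamma random effect delta_i ~ G(tau, tau).
    Mean is lam and variance lam + lam**2 / tau."""
    return (gammaln(s + tau) - gammaln(s + 1) - gammaln(tau)
            + tau * (np.log(tau) - np.log(tau + lam))
            + s * (np.log(lam) - np.log(tau + lam)))

z_i = 0.7                    # latent GP value for subject i
lam_i = np.exp(z_i)          # Poisson mean: log(lambda_i) = z(x_i)
print(nb_logpmf(s=3, lam=lam_i, tau=2.0))
```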

2.2 Survival Data

The modeling approach via Gaussian processes exploited above extends to other classes of models, for example, those for survival data. In survival studies the task is typically to measure the effect of a set of variables on the survival time, that is, the time to a particular event or "failure" of interest, such as death or occurrence of a disease. The Cox proportional hazard model of Cox (1972) is an extremely popular choice. The model is defined through the hazard rate function h(t|xi) = h0(t) exp(xi′β), where h0(·) is the baseline hazard function, t is the failure time and β the p-dimensional regression coefficient vector. The cumulative baseline hazard function is denoted as H0(t) = ∫0^t h0(u) du and the survivor function becomes S(t|xi) = S0(t)^exp(xi′β), where S0(t) = exp{−H0(t)} is the baseline survivor function.

Let us indicate the data as (t1, x1, d1), . . . , (tn, xn, dn), with censoring index di = 0 if the observation is right censored and di = 1 if the failure time ti is observed. A GP model for survival data is defined as

h(ti|z(xi)) = h0(ti) exp(z(xi)),   i = 1, 2, . . . , n,   (6)

with z(X) as in (2). In this general setting, defining a probability model for Bayesian analysis requires the identification of a prior formulation for the cumulative baseline hazard function. One strategy often adopted in the literature on survival models is to utilize the partial likelihood of Cox (1972) that avoids prior specification and estimation of the baseline hazard, achieving a parsimonious representation of the model. Alternatively, Kalbfleisch (1978) employs a nonparametric gamma process prior on H0(ti) and then calculates a marginalized likelihood. This "full" likelihood formulation tends to behave similarly to the partial likelihood one when the concentration parameter of the gamma process prior tends to 0, placing no confidence in the initial parametric guess. Sinha, Ibrahim and Chen (2003) extend this theoretical justification to time-dependent covariates and time-varying regression parameters, as well as to grouped survival data.
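When the partial likelihood strategy mentioned above is adopted, the quantity that enters the posterior computations is the Cox log partial likelihood evaluated at the latent GP values. The sketch below is a minimal version assuming no tied event times (a Breslow-type correction would be needed otherwise); it is an illustration of ours, not the authors' implementation.

```python
import numpy as np

def cox_log_partial_lik(time, delta, z):
    """Cox log partial likelihood with the linear predictor replaced by the GP value z(x_i):
    sum over observed events of [ z_i - log(sum of exp(z_j) over the risk set {j: t_j >= t_i}) ].
    Assumes no tied event times."""
    time, delta, z = map(np.asarray, (time, delta, z))
    loglik = 0.0
    for i in np.where(delta == 1)[0]:
        at_risk = time >= time[i]
        loglik += z[i] - np.log(np.exp(z[at_risk]).sum())
    return loglik

t = np.array([2.0, 5.0, 3.5, 7.0, 1.0])
d = np.array([1, 0, 1, 1, 0])            # 1 = observed failure, 0 = right censored
z = np.array([0.3, -0.1, 0.8, -0.5, 0.2])
print(cox_log_partial_lik(t, d, z))
```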

3. CHOICE OF THE GP COVARIANCE MATRIX

We explore alternative covariance formulations for the Gaussian process prior (2) and demonstrate the flexibility of the construction. In general, any plausible relationship between the covariates and the response can be represented through the choice of C, as long as the condition of positive definiteness of the matrix is satisfied; see Thrun, Saul and Scholkopf (2004). In the Appendix we further address practical computational issues that arise in the application of Gaussian processes due to numerical instability in the construction of the covariance matrix and the calculation of its inverse.

3.1 1-term vs. 2-term Exponential Forms

We consider covariance functions that include a constant term and a nonlinear, exponential term as

C = Cov(z(X)) = (1/λa) Jn + (1/λz) exp(−G),   (7)

with Jn an n × n matrix of 1's and exp(G) a matrix with elements exp(gij), where gij = (xi − xj)′P(xi − xj) and P = diag(−log(ρ1, . . . , ρp)), with ρk ∈ [0,1] associated to xk, k = 1, . . . , p. In the literature on Gaussian processes a noise component, called "jitter," is sometimes added to the covariance matrix C, in addition to the term (1/λa)Jn, in order to make the matrix computations better conditioned; see Neal (1999). This is consistent with the belief that there may be unexplained sources of variation in the data, perhaps due to explanatory variables that were not recorded in the original study. The parametrization of G we adopt allows simpler prior specifications (see below), and it is also used by Linkletter et al. (2006) as a transformation of the exponential term used by Neal (1999) and Sacks, Schiller and Welch (1989) in their formulations. Neal (1999) notices that introducing an intercept in model (3), with precision parameter λa, placing a Gaussian prior on it and then marginalizing over the intercept produces the additive covariance structure (7). The parameter for the exponential term, λz, serves as a scaling factor for this term. In our empirical investigations we found that construction (7) is sensitive to scaling and that best results can be obtained by normalizing X to lie in the unit cube, [0,1]^p, though standardizing the columns to mean 0 and variance 1 produces similar results.
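A minimal sketch of the single-term covariance (7) follows, assuming X has already been rescaled to the unit cube as recommended above; the small diagonal jitter stands in for the conditioning fix discussed in the Appendix, and all names are ours.

```python
import numpy as np

def exp_cov(X, rho, lam_a, lam_z, jitter=1e-6):
    """Covariance (7): C = (1/lam_a) J_n + (1/lam_z) exp(-G), with
    g_ij = (x_i - x_j)' P (x_i - x_j) and P = diag(-log rho).
    Setting rho_k = 1 gives -log(rho_k) = 0, so covariate k drops out of the distance."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    P = -np.log(np.asarray(rho, dtype=float))         # diagonal of P
    diff = X[:, None, :] - X[None, :, :]
    G = (diff ** 2 * P).sum(axis=-1)                   # quadratic form with diagonal P
    C = np.full((n, n), 1.0 / lam_a) + np.exp(-G) / lam_z
    return C + jitter * np.eye(n)                      # "jitter" for numerical stability

rng = np.random.default_rng(3)
X = rng.uniform(size=(30, 4))                          # predictors in the unit cube
rho = np.array([0.05, 0.5, 1.0, 1.0])                  # last two covariates excluded
C = exp_cov(X, rho, lam_a=1.0, lam_z=1.0)
print(C.shape, np.linalg.eigvalsh(C).min() > 0)        # positive definiteness check
```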

The single-term exponential covariance provides a parsimonious representation that enables a broad class of linear and nonlinear response surfaces. Plots (a)–(c) of Figure 1 show response curves produced by utilizing a GP with the exponential covariance matrix (7) and three different values of ρ. One readily notes how higher order polynomial-type response surfaces can be generated by choosing relatively lower values for ρ, whereas the assignment of higher values provides lower order polynomial-type surfaces that can also include roughly linear response surfaces [plot (c)].

We also consider a two-term covariance obtained by adding a second exponential term to (7), that is,

C = Cov(z(X)) = (1/λa) Jn + (1/λ1,z) exp(−G1) + (1/λ2,z) exp(−G2),   (8)

where G1 and G2 are parameterized as P1 = diag(−log(ρ1,1, . . . , ρ1,p)) and P2 = diag(−log(ρ2,1, . . . , ρ2,p)), respectively. As noted in Neal (2000), adding multiple terms results in rougher, more complex, surfaces while retaining the relative computational efficiency of the exponential formulation. For example, plot (d) of Figure 1 shows examples of surfaces that can be generated by employing the 2-term covariance formulation with (ρ1, ρ2) = (0.5, 0.05) and (λ1,z, λ2,z) = (1, 8).

FIG. 1. Response curves drawn from a GP. Each plot shows two (solid and dashed) random realizations. Plots (a)–(c) were obtained with the exponential covariance (7) and plot (d) with the 2-term formulation (8). Plots (e) and (f) show realizations from the Matern construction. All curves employ a one-dimensional covariate.

3.2 The Matern Construction

An alternative choice to the exponential covariance term is the Matern formulation. This introduces an explicit smoothing parameter, ν, such that the resulting Gaussian process is k times differentiable for k ≤ ν,

C(z(xi), z(xj)) = [1 / (2^(ν−1) Γ(ν))] [2√ν d(xi, xj)]^ν Kν[2√ν d(xi, xj)],   (9)

with d(xi, xj) = (xi − xj)′P(xi − xj), Kν(·) the Bessel function and P parameterized as in (7). Banerjee et al. (2008) employ such a construction with ν fixed to 0.5 for modeling a spatial random effects process characterized by roughness. One recovers the exponential covariance term from the Matern construction in the limit as ν → ∞. However, Rasmussen and Williams (2006) point out that the two formulations are essentially the same for ν > 7/2, as confirmed by our own simulations.
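The Matern term (9) can be sketched with SciPy's modified Bessel function Kν; we set the diagonal to 1, which is the d → 0 limit of the expression under this parameterization, and the function name and test values are ours.

```python
import numpy as np
from scipy.special import kv, gamma

def matern_term(X, rho, nu):
    """Matern covariance term of eq. (9):
    C_ij = [1 / (2**(nu-1) Gamma(nu))] * (2*sqrt(nu)*d_ij)**nu * K_nu(2*sqrt(nu)*d_ij),
    with d_ij = (x_i - x_j)' P (x_i - x_j), P = diag(-log rho); C_ii = 1 (the d -> 0 limit)."""
    X = np.asarray(X, dtype=float)
    P = -np.log(np.asarray(rho, dtype=float))
    diff = X[:, None, :] - X[None, :, :]
    d = (diff ** 2 * P).sum(axis=-1)
    u = 2.0 * np.sqrt(nu) * d
    with np.errstate(invalid="ignore"):               # 0 * inf appears on the diagonal
        C = (u ** nu) * kv(nu, u) / (2 ** (nu - 1) * gamma(nu))
    C[d == 0] = 1.0                                    # limit as d -> 0
    return C

rng = np.random.default_rng(4)
X = rng.uniform(size=(20, 3))
C = matern_term(X, rho=[0.1, 0.5, 1.0], nu=1.5)
print(C.shape, np.allclose(np.diag(C), 1.0))
```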

4. PRIOR MODEL FOR BAYESIAN VARIABLE SELECTION

The unified modeling approach we have described allows us to put forward a general framework for variable selection that employs Bayesian methods and mixture priors for the selection of the predictors. In particular, variable selection can be achieved within the GP modeling framework by imposing "spike-and-slab" mixture priors on the covariance parameters in (7), that is,

π(ρk|γk) = γk I[0 ≤ ρk ≤ 1] + (1 − γk)δ1(ρk),   (10)

for k = 1, . . . , p, with δ1(·) a point mass distribution at one. Clearly, ρk = 1 causes the predictor xk to have no effect on the computation of the GP covariance matrix. This formulation is similar in spirit to the use of selection priors for linear regression models and is employed by Linkletter et al. (2006) in the univariate GP regression framework (3). Further, Bernoulli priors are imposed on the selection parameters, that is, γk ∼ Bernoulli(αk), and Gamma priors are specified on the precision terms (λa, λz).
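A draw of (γ, ρ) from the selection prior is straightforward; the sketch below combines the Bernoulli prior on γk with the mixture (10), so that ρk = 1 whenever γk = 0 (names and values are illustrative).

```python
import numpy as np

def draw_prior(p, alpha, rng):
    """Draw (gamma, rho) from the selection prior: gamma_k ~ Bernoulli(alpha) and
    rho_k | gamma_k from eq. (10): a U(0,1) slab if gamma_k = 1, a point mass at 1 otherwise."""
    gamma = rng.binomial(1, alpha, size=p)
    rho = np.where(gamma == 1, rng.uniform(size=p), 1.0)
    return gamma, rho

rng = np.random.default_rng(5)
gamma, rho = draw_prior(p=10, alpha=0.025, rng=rng)
print(gamma)
print(rho)          # rho_k = 1 wherever gamma_k = 0, so those covariates drop out of C
```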

Variable selection with a covariance matrix that employs two exponential terms as in (8) is more complex. In particular, one can select covariates separately for each exponential term by assigning a specific set of variable selection parameters to each term, that is, (γ1, γ2) associated to (ρ1, ρ2), and simply extending the single term formulation via independent spike-and-slab priors of the form

π(ρ1,k|γ1,k) = γ1,k I[0 ≤ ρ1,k ≤ 1] + (1 − γ1,k)δ1(ρ1,k),   (11)

π(ρ2,k|γ2,k) = γ2,k I[0 ≤ ρ2,k ≤ 1] + (1 − γ2,k)δ1(ρ2,k),   (12)

with k = 1, . . . , p. Assuming a priori independence of the two model spaces, Bernoulli priors can be imposed on the selection parameters, that is, γi,k ∼ Bernoulli(αi,k), i = 1, 2. This variable selection framework identifies the association of each covariate, xk, to one or both terms. Final selection can then be accomplished by choosing the covariates in the union of those selected by either of the two terms. An alternative strategy for variable selection may employ a common set of variable selection parameters, γ = (γ1, . . . , γp), for both ρ1 and ρ2, in a joint spike-and-slab (product) prior formulation,

π(ρ1,k, ρ2,k|γk) = γk I[0 ≤ ρ1,k ≤ 1] I[0 ≤ ρ2,k ≤ 1] + (1 − γk)δ1(ρ1,k)δ1(ρ2,k),   (13)

where we assume a priori independence of the parameter spaces, ρ1 and ρ2. This prior choice focuses more on overall covariate selection, rather than simultaneous selection and assignment to each term in (8). While we lose the ability to align the ρi,k to each covariance function term, we expect to improve computational efficiency by jointly sampling (γ, ρ1, ρ2) at each iteration of the MCMC scheme as compared to a separate joint sampling on (γ1, ρ1) and (γ2, ρ2). Some investigation is done in Savitsky (2010).

5. POSTERIOR INFERENCE

The methods for posterior inference we are going to describe apply to all GP formulations, even though we focus our simulation work on the continuous and count data models. We therefore express the posterior formulation employing a generalized notation. First, we collect all parameters of the GP covariance matrix in Θ and write C = C(Θ). For example, for a covariance matrix of type (7) we have Θ = (ρ, λa, λz). Next, we extend our notation to include the selection parameter γ by using Θγ = (ργ, λa, λz) to indicate that ρk = 1 when γk = 0, for k = 1, . . . , p. For covariance of type (8) we write Θγ = {Θγ1, Θγ2, λa}, where γ = (γ1, γ2)′ and Θγi = (ρiγi, λi,z), i ∈ {1, 2}, for priors of type (11)–(12), and Θγ = (ρ1γ, ρ2γ, λa, λ1,z, λ2,z) for the prior of type (13), and similarly for the Matern construction. Next, we define Di ∈ {yi, {si, z(xi)}} and D := {D1, . . . , Dn} to capture the observed data augmented by the unobserved GP variate, z(X), for the latent response models [such as model (5) for count data]. Finally, we set h := {r, τ} to group unique parameters ∉ Θγ and we collect hyperparameters in m := {a, b}, with a = {aλa, aλz, ar, aτ} and similarly for b, where a and b include the shape and rate hyperparameters of the Gamma priors on the associated parameters. With this notation we can finally outline a generalized expression for the full conditional of (γ, ργ) as

π(γ, ργ | Θγ\ργ, D, h, m) ∝ La(γ, ργ | Θγ\ργ, D, h, m) π(γ),   (14)

with La the augmented likelihood. Notice that the term π(ργ|γ) does not appear in (14) since π(ρk|γk) = 1, for k = 1, . . . , p.

5.1 Markov Chain Monte Carlo—Scheme 1

We first describe a Metropolis–Hastings scheme within Gibbs sampling to jointly sample (γ, ργ), which is an adaptation of the MCMC model composition (MC3) algorithm originally outlined in Madigan and York (1995) and extensively used in the variable selection literature. As we are unable to marginalize over the parameter space, we need to modify the algorithm in a hierarchical fashion, using the move types outlined below. Additionally, we need to sample all the other nuisance parameters.

A generic iteration of this MCMC procedure comprises the following steps:

(1) Update (γ, ργ): Randomly choose among three between-models transition moves:

(i) Add: set γ′k = 1 and sample ρ′k from a U(0,1) proposal. Position k is randomly chosen from the set of k's where γk = 0 at the previous iteration.

(ii) Delete: set (γ′k = 0, ρ′k = 1). This results in covariate xk being excluded in the current iteration. Position k is randomly chosen from among those included in the model at the previous iteration.

(iii) Swap: perform both an Add and a Delete move. This move type helps to more quickly traverse a large covariate space.

The proposed value (γ′, ρ′γ′) is accepted with probability

α = min{ 1, [π(γ′, ρ′γ′ | Θγ′\ρ′γ′, D, h, m) q(γ | γ′)] / [π(γ, ργ | Θγ\ργ, D, h, m) q(γ′ | γ)] },

where the ratio of the proposals q(ργ)/q(ρ′γ′) drops out of the computation since we employ a U(0,1) proposal.

(2) Execute a Gibbs-type move, Keep, by sampling from a U(0,1) all ρ′k's such that γ′k = 1. This move is not required for ergodicity, but it allows us to perform a refinement of the parameter space within the existing model, for faster convergence.

(3) Update {λa, λz}: These are updated using Metropolis–Hastings moves with Gamma proposals centered on the previously sampled values.

(4) Update h: Individual model parameters in h are updated using Metropolis–Hastings moves with proposals centered on the previously sampled values.

(5) Update z: Jointly sample z for latent response models using the approach enumerated in Neal (1999) with proposal z′ = (1 − ε²)^(1/2) z + εLu, where u is a vector of i.i.d. standard Gaussian values and L is the Cholesky decomposition of the GP covariance matrix. For faster convergence, R consecutive updates are performed at each iteration (see the sketch after this list).
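The following sketch illustrates step (5) only, under the standard argument that the proposal z′ = (1 − ε²)^(1/2) z + εLu leaves the N(0, C) prior invariant, so the Metropolis ratio reduces to a likelihood ratio; the log-likelihood callable, step size and toy data are ours and not part of the paper.

```python
import numpy as np

def update_z(z, L, log_lik, eps, R, rng):
    """R consecutive updates of the latent GP vector z (step 5 of Scheme 1), using
    Neal's (1999) proposal z' = sqrt(1 - eps**2) * z + eps * L @ u, u ~ N(0, I), with L the
    Cholesky factor of the GP covariance C. The proposal preserves the N(0, C) prior, so
    the Metropolis ratio is the data likelihood ratio; log_lik(z) returns the data
    log-likelihood given z (e.g., for the count model (5))."""
    cur_ll = log_lik(z)
    for _ in range(R):
        u = rng.standard_normal(len(z))
        z_prop = np.sqrt(1.0 - eps ** 2) * z + eps * (L @ u)
        prop_ll = log_lik(z_prop)
        if np.log(rng.uniform()) < prop_ll - cur_ll:
            z, cur_ll = z_prop, prop_ll
    return z

# Toy usage: Gaussian likelihood around noisy observations y of a smooth latent curve.
rng = np.random.default_rng(6)
n = 30
grid = np.linspace(0, 1, n)
C = np.exp(-((grid[:, None] - grid[None, :]) ** 2) / 0.1) + 1e-6 * np.eye(n)
L = np.linalg.cholesky(C)
z_true = L @ rng.standard_normal(n)
y = z_true + 0.1 * rng.standard_normal(n)
z_new = update_z(np.zeros(n), L,
                 lambda z: -0.5 * ((y - z) ** 2).sum() / 0.01, eps=0.3, R=20, rng=rng)
```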

Green (1995) introduced a Markov chain Monte Carlo method for Bayesian model determination for the situation where the dimensionality of the parameter vector varies iteration by iteration. Recently, Gottardo and Raftery (2008) have shown that the reversible jump can be formulated in terms of a mixture of singular distributions. Following the results given in their examples, it is possible to show that the acceptance probability of the reversible jump formulation is the same as in the Metropolis–Hastings algorithm described above, and therefore that the two algorithms are equivalent; see Savitsky (2010).

For inference, estimates of the marginal posterior probabilities of γk = 1, for k = 1, . . . , p, can be computed based on the MCMC output. A simple strategy is to compute Monte Carlo estimates by counting the number of appearances of each covariate across the visited models. Alternatively, Rao–Blackwellized estimates can be calculated by averaging the full conditional probabilities of γk = 1. Although computationally more expensive, the latter strategy may result in estimates with better precision, as noted by Guan and Stephens (2011). In all simulations and examples reported below we obtained satisfactory results by estimating the marginal posterior probabilities by counts restricted to between-models moves, to avoid over-estimation.

5.2 Markov Chain Monte Carlo—Scheme 2

Next we enumerate a Markov chain Monte Carlo algorithm to directly sample (γ, ργ) with a Gibbs scan that employs a Metropolis acceptance step. We formulate a proposal distribution of a similar mixture form as the joint posterior by extending a result from Gottardo and Raftery (2008) to produce a move to (γk = 0, ρk = 1), as well as to (γk = 1, ρk ∈ [0,1)).

A generic iteration of this MCMC procedure comprises the following steps:

(1) For k = 1, . . . , p perform a joint update for (γk, ρk) with two moves, conducted in succession:

(i) Between-models: Jointly propose a new model such that if γk = 1, propose γ′k = 0 and set ρ′k = 1; otherwise, propose γ′k = 1 and draw ρ′k ∼ U(0,1). Accept the proposal for (γ′k, ρ′k) with probability

α = min{ 1, π(γ′k, ρ′k | γ′(k), Θγ′(k), D, h, m) / π(γk, ρk | γ′(k), Θγ′(k), D, h, m) },

where now γ′(k) := (γ′1, . . . , γ′k−1, γk+1, . . . , γp) and similarly for ρ′(k) ∈ Θγ′(k). The joint proposal ratio for (γk, ρk) reduces to 1 since we employ a U(0,1) proposal for ρk ∈ [0,1] and a symmetric Dirac measure proposal for γk.

(ii) Within model: This move is performed only if we sample γ′k = 1 from the between-models move, in which case we propose γ′′k = 1 and, as before, draw ρ′′k ∼ U(0,1). Similar to the between-models move, accept the joint proposal for (γ′′k, ρ′′k) with probability

α = min{ 1, π(γ′′k, ρ′′k | γ′(k), Θγ′(k), D, h, m) / π(γ′k, ρ′k | γ′(k), Θγ′(k), D, h, m) },

which further reduces to just the ratio of posteriors since we propose a move within the current model and utilize a U(0,1) proposal for ρk (a code sketch of this per-coordinate scan follows the list).

(2) Sample the parameters {λa, λz, h} and latent responses z as outlined in Scheme 1.
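A sketch of the per-coordinate scan in step (1) is given below; `log_post` stands for the log of the unnormalized full conditional of (γ, ρ) with all other parameters fixed (this is where the marginalized likelihood (4) or the augmented likelihood enters), and the reading that the within-model refinement applies whenever the covariate is included after move (i) is our interpretation of the description above.

```python
import numpy as np

def scheme2_scan(gamma, rho, log_post, rng):
    """One Gibbs scan of step (1) of Scheme 2. For each covariate k:
    (i) between-models: flip gamma_k (rho_k -> 1 if excluded, rho_k ~ U(0,1) if included)
        and accept with the ratio of posteriors (the proposal ratio is 1);
    (ii) within-model: if covariate k is included after move (i), refresh rho_k ~ U(0,1)
        and accept with the ratio of posteriors.
    log_post(gamma, rho) returns the log of the unnormalized joint posterior with all
    other parameters held fixed."""
    gamma, rho = gamma.copy(), rho.copy()
    cur = log_post(gamma, rho)
    for k in range(len(gamma)):
        # (i) between-models move
        g_prop, r_prop = gamma.copy(), rho.copy()
        if gamma[k] == 1:
            g_prop[k], r_prop[k] = 0, 1.0
        else:
            g_prop[k], r_prop[k] = 1, rng.uniform()
        prop = log_post(g_prop, r_prop)
        if np.log(rng.uniform()) < prop - cur:
            gamma, rho, cur = g_prop, r_prop, prop
        # (ii) within-model refinement of rho_k, only if covariate k is included
        if gamma[k] == 1:
            r_prop = rho.copy()
            r_prop[k] = rng.uniform()
            prop = log_post(gamma, r_prop)
            if np.log(rng.uniform()) < prop - cur:
                rho, cur = r_prop, prop
    return gamma, rho

# Toy usage with a dummy target that favors sparse models.
rng = np.random.default_rng(7)
p = 5
dummy_log_post = lambda g, r: -np.sum(g)
g, r = scheme2_scan(np.zeros(p, dtype=int), np.ones(p), dummy_log_post, rng)
print(g, r)
```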

In simulations we also investigate performances of an adaptive scheme that employs a proposal with tuning parameters adapted based on "learning" from the data. In particular, we employ the method of Haario, Saksman and Tamminen (2001) for our Bernoulli proposal for γ|α to successively update the mean parameter, αk, k = 1, . . . , p, based on prior sampled values for γk. The construction does not require additional likelihood computations and it is expected to achieve more rapid convergence in the model space than the nonadaptive scheme. Roberts and Rosenthal (2007) and Ji and Schmidler (2009) note conditions under which adaptive schemes achieve convergence to the target posterior distribution.

Schemes 1 and 2 enumerated above may be easily modified when employing the 2-term covariance formulation (8); see Savitsky (2010).

5.3 Prediction

Let zf = z(Xf) be an nf × 1 latent vector of future cases. We use the regression model (3) to demonstrate prediction under the GP framework. The joint distribution over training and test sets is defined to be z* := [z′, zf′]′ ∼ N(0, Cn+nf) with covariance

Cn+nf := [ C(X,X)    C(X,Xf)
           C(Xf,X)   C(Xf,Xf) ],

where C(X,X) := C(X,X)(Θ). The conditional joint predictive distribution over the test cases, zf|z, is also multivariate normal, with expectation E[zf|z] = C(Xf,X) C(X,X)^(−1) z. Estimation is based on the posterior MCMC samples. Here we take a computationally simple approach by first estimating z̄ as the mean of all sampled values of z, defining

D(Θ) := C(Xf,X) C(X,X)^(−1) z̄,   (15)

and then estimating the response value as

ŷf = ẑf|z = (1/K) Σ_{t=1}^{K} D(Θ^(t)),   (16)

with K the number of MCMC iterations and where calculations of the covariance matrices in (15) are restricted to the variables selected based on the marginal posterior probabilities of γk = 1. A more coherent estimation procedure, that may return more precise estimates but that is also computationally more expensive, would compute Rao–Blackwellized estimates by averaging the predictive probabilities over all visited models; see Guan and Stephens (2011). In the simulations and examples reported below we have calculated (16) using every 10th MCMC sampled value, to provide a relatively less correlated sample and save on computational time. In addition, when computing the variance product term in (15), we have employed the Cholesky decomposition C = LL′, following Neal (1999), to avoid direct computation of the inverse of C(X,X).
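The predictive mean (15) can be computed with a Cholesky solve rather than an explicit inverse, as noted above; the sketch below assumes a user-supplied cross-covariance routine (here the single-term exponential form (7)) and uses names of our choosing.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def exp_cross_cov(A, B, rho, lam_a, lam_z):
    """Cross-covariance between the rows of A and B under the exponential covariance (7)."""
    P = -np.log(np.asarray(rho, dtype=float))
    diff = A[:, None, :] - B[None, :, :]
    G = (diff ** 2 * P).sum(axis=-1)
    return 1.0 / lam_a + np.exp(-G) / lam_z

def predict_mean(X, Xf, zbar, rho, lam_a, lam_z, cov_fun, jitter=1e-6):
    """Predictive mean of eq. (15): D = C(Xf, X) C(X, X)^{-1} zbar, computed with a
    Cholesky factorization of C(X, X) instead of an explicit inverse."""
    C_xx = cov_fun(X, X, rho, lam_a, lam_z) + jitter * np.eye(len(X))
    C_fx = cov_fun(Xf, X, rho, lam_a, lam_z)
    factor = cho_factor(C_xx, lower=True)
    return C_fx @ cho_solve(factor, zbar)

rng = np.random.default_rng(8)
X, Xf = rng.uniform(size=(40, 3)), rng.uniform(size=(5, 3))
zbar = rng.standard_normal(40)                        # stand-in for the posterior mean of z
print(predict_mean(X, Xf, zbar, rho=[0.2, 0.6, 1.0],
                   lam_a=1.0, lam_z=1.0, cov_fun=exp_cross_cov))
```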

For categorical data models, we may predict the new class labels, tf, via the rule of largest probability in the case of a binary logit model, with estimated latent realizations zf, and via data augmentation based on the values of yf in the case of a binary probit model.

5.3.1 Survival Function Estimation. For survival data it is of interest to estimate the survivor function for a new subject with unknown event time, Ti, and associated zf,i := zf,i(xf,i). This is defined as

P(Ti ≥ t | zf,i, z) = Si(t | zf,i, z) = S0(t | z)^exp(zf,i).   (17)

When using the partial likelihood formulation an empirical Bayes estimate of the baseline survivor function, S0(t|z), must be calculated, since the model does not specifically enumerate the baseline hazard. Weng and Wong (2007), for example, propose a method that discretizes the likelihood to produce an estimator with the useful property that it cannot take negative values. Accuracy of this estimate may be potentially improved by Rao–Blackwellizing the computation by averaging over the MCMC runs.

6. SIMULATION STUDY

6.1 Parameter Settings

In all simulations and applications reported in this paper we set both priors on λa and λz as G(1,1). We did not observe any strong sensitivity to this choice. In particular, we considered different choices of the two parameters of these Gamma priors in the range (0.01, 1), keeping the prior mean at 1 but with progressively larger variances, and observed very little change in the range of posterior sampled values. We also experimented with prior mean values of 10 and 100, which produced only a small impact on the posterior. For model (3) we set r ∼ G(ar, br) with (ar, br) = (2, 0.1) to reflect our a priori expected residual variance. For the count model (5), we set τ ∼ G(1,1). For survival data, when using the full likelihood from Kalbfleisch (1978) we specified a G(1,1) prior for both the parameter of the exponential base distribution and the concentration parameter of the Gamma process prior on the baseline.

Some sensitivity to the Bernoulli priors on the γk's is, of course, to be expected, since these priors drive the sparsity of the model. Generally speaking, parsimonious models can be selected by specifying γk ∼ Bernoulli(αk) with αk = α and α a small percentage of the total number of variables. In our simulations we set αk to 0.025. We observed little sensitivity in the results for small changes around this value, in the range of 0.01–0.05, though we would expect to see significant sensitivity for much higher values of α. We also investigated sensitivity to a Beta hyperprior on α; see below.

When running the MCMC algorithms, independent chain samplers with U(0,1) proposals for the ρk's have worked well in all applications reported in this paper, where we have always approximately achieved the target acceptance rate of 40–60%, indicating efficient posterior sampling.

6.2 Use of Variable Selection Parameters

We first demonstrate the advantage of introducing selection parameters in the model. Figure 2 shows results with and without the inclusion of the variable selection parameter vector γ on a simulated scenario with a kernel that incorporates both linear and nonlinear associations. The observed continuous response, y, is constructed from a mix of linear and nonlinear relationships to 4 variables, each generated from a U(0,1),

y = x1 + x2 + sin(3x3) + sin(5x4) + ε,

with ε ∼ N(0, σ²) and σ = 0.05. Additional variables are randomly generated, again from U(0,1). In this simulation we used (n, p) = (80, 20). We ran 70,000 MCMC iterations, of which 10,000 were discarded as burn-in.
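For reference, data of this form can be generated as follows (seed and function name are arbitrary choices of ours).

```python
import numpy as np

def simulate_section_6_2(n=80, p=20, sigma=0.05, seed=9):
    """Generate the Section 6.2 scenario: X ~ U(0,1) of size n x p, with only the first
    four columns entering the response through y = x1 + x2 + sin(3 x3) + sin(5 x4) + eps."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, p))
    y = (X[:, 0] + X[:, 1] + np.sin(3 * X[:, 2]) + np.sin(5 * X[:, 3])
         + rng.normal(scale=sigma, size=n))
    return X, y

X, y = simulate_section_6_2()
print(X.shape, y.shape)
```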

Plot (a) of Figure 2 displays box plots of the MCMC samples for the ρk's, k = 1, . . . , 20, for the case of no variable selection, that is, by using a simple "slab" prior on the ρk's. As both Linkletter et al. (2006) and Neal (2000) note, the single covariates demonstrate an association to the response whose strength may be assessed utilizing the distance of the posterior samples of the ρk's from 1. One notes that, according to this criterion, the true covariates are all selected. It is conceivable, however, for some of the unrelated covariates to be selected using the same criterion, since the ρk's all sample below 1, and that this problem would be compounded as p grows. Plot (b) of Figure 2, instead, captures results from employing the variable selection parameters γ and shows how the inclusion of these parameters results in the sampled values of the ρk's for variables unrelated to the response being all pushed up against 1.

FIG. 2. Use of variable selection parameters: Simulated data (n = 80, p = 20). Box plots of posterior samples for ρk ∈ [0,1]. Plots (a) and (b) demonstrate selection without and with, respectively, the inclusion of the selection parameter γ.

This simple simulated scenario also helps us to illustrate a couple of other features. First, a single exponential term in (7) is able to capture a wide variety of continuous response surfaces, allowing a great flexibility in the shape of the response surface, with the linear fit being a subset of one of many types of surfaces that can be generated. Second, the effect of covariates with higher-order polynomial-like association to the response is captured by having estimates of the corresponding ρk's further away from 1; see, for example, covariate x4 in Figure 2, which expresses the highest order association to the response.

6.3 Large p

Next we show simulation results on continuous, count and survival data models, for (n, p) = (100, 1,000). We employ an additive term as the kernel for all models,

y = a1x1 + a2x2 + a3x3 + a4x4 + a5 sin(a6x5) + a7 sin(a8x6) + ε.   (18)

The functional form for the simulation kernel is designed so that the first four covariates express a linear relationship to the response while the next two express nonlinear associations. Model-specific coefficient values are displayed in Table 1. Methods employed to randomly generate the observed count and event time data from the latent response kernel are also outlined in the table. For example, the kernel captures the log-mean of the Poisson distribution used to generate count data, and it is used to generate the survivor function that is inverted to provide event time data for the Cox model. As in the previous simulation, all covariates are generated from U(0,1).

We set the hyperparameters as described in Section 6.1. We used MCMC scheme 1 and increased the number of total iterations, with respect to the simpler simulation with only p = 20, to 800,000 iterations, discarding half of them for burn-in.

Results are reported in Table 1. While the continuous and count data GP models readily assigned high marginal posterior probabilities to the correct covariates (figures not shown), the Cox GP model correctly identified only 5 of 6 predictors; see Figure 3 for the posterior distributions of γk = 1 and the box plots for the posterior samples of ρk for this model (for readability, only the first 20 covariates are displayed). The predictive power for the continuous and count data models was assessed by normalizing the mean squared prediction error (MSPE) with the variance of the test set. Excellent results were achieved in our simulations. For the Cox GP model, the averaged survivor function estimated on the test set is shown in Figure 4, where we observe a tight fit between the estimated curve and the Kaplan–Meier empirical estimate constructed from the same test data.

TABLE 1
Large p: Simulations for continuous, count and survival data models with (n, p) = (100, 1,000)

Coefficients          Continuous data   Count data      Cox model
  a1                  1.0               1.6             3.0
  a2                  1.0               1.6             −2.5
  a3                  1.0               1.6             3.5
  a4                  1.0               1.6             −3.0
  a5                  1.0               1.0             1.0
  a6                  3.0               3.0             3.0
  a7                  1.0               1.0             −1.0
  a8                  5.0               5.0             5.0
Model                 Identity link     log(λ) = y,     S(t|y) = exp[−H0(t) exp(y)],
                                        t ∼ Pois(λ)     H0(t) = λt, λ = 0.2,
                                                        t = M/(λ exp(y)), M ∼ Exp(1),
                                                        5% uniform randomly censored,
                                                        tcens = U(0, tevent)
Train/test            100/20            100/20          100/60
Correctly selected    6 out of 6        6 out of 6      5 out of 6
False positives       0                 0               0
MSPE (normalized)     0.0067            0.045           see Figure 4

FIG. 3. Cox GP model with large p: Simulated data (n = 100, p = 1,000). Posterior distributions for γk = 1 and box plots of posterior samples for ρk.

Though for the Cox model we only report results obtained using the partial likelihood formulation, we conducted the same simulation study with the model based on the full likelihood of Kalbfleisch (1978). The partial likelihood model formulation produced more consistent results across multiple chains, with the same data, and was able to detect much weaker signals. The Kalbfleisch (1978) model did, however, produce lower posterior values near 0 for nonselected covariates, unlike the partial likelihood formulation, which shows values typically from 10–40%, pointing to a potential bias toward false positives.

Additional simulations, including cases with larger sample sizes, are reported in Savitsky (2010).

6.4 Comparison of MCMC Methods

We compare the 2 MCMC schemes previously described for posterior inference on (γ, ρ) on the basis of sampling and computational efficiency. We use the univariate regression simulation kernel

y = x1 + 0.8x2 + 1.3x3 + sin(x4) + sin(3x5) + sin(5x6) + (1.5x7)(1.5x8) + ε,

with ε ∼ N(0, σ²) and σ = 0.05. We utilize 1,000 covariates with all but the first 8 defined as nuisance. We use a training and a validation set of 100 observations each.

The two schemes differ in the way they update (γ, ρ). While scheme 1 samples either one or two positions in the model space on each iteration, scheme 2 samples (γk, ρk) for each of the p covariates. Because of this, a good "rule-of-thumb" should employ a number of iterations for scheme 1 which is roughly p times the number of iterations employed for scheme 2. The use of the Keep move in scheme 1, however, reduces the need of scaling the number of iterations by exactly p, since all ρk's are sampled at each iteration. In our simulations we found stable convergence under moderate correlation among covariates for scheme 2 in 5,000 iterations and for scheme 1 in 500,000 iterations. For both schemes, we discarded half of the iterations as burn-in. The CPU run times we report in Table 2 are based on utilization of Matlab with a 2.4 GHz Quad Core (Q6600) PC with 4 GB of RAM running 64-bit Windows XP.

FIG. 4. Cox GP model with large p: Simulated data (n = 100, p = 1,000). Average survivor function curve on the validation set (dashed line) compared to the Kaplan–Meier empirical estimate (solid line).

TABLE 2
Efficiency comparison of GP MCMC methods

                            MCMC scheme 2                MCMC scheme 1
                            Adaptive      Nonadaptive
Iterations (computation)    5,000         5,000          500,000
Autocorrelation time
  ρ6                        310           82             441
  ρ8                        59            35             121
Computation
  CPU-time (sec)            980           4,956          10,224

We compared sampling efficiency looking at autocorrelation for selected ρk. The autocorrelation time is defined as one plus twice the sum of the autocorrelations at all lags and serves as a measure of the relative dependence for MCMC samples. We used the number of MCMC iterations divided by this factor as an "effective sample size." We followed a procedure outlined by Neal (2000) and first ran scheme 2 for 1,000 iterations, to obtain a state near the posterior distribution. We then employed this state to initiate a chain for each of the two schemes. We ran scheme 2 for an additional 2,000 iterations and scheme 1 for 200,000 (using the last 2,000 draws for each of the target ρk for final comparison). For scheme 2 we used both the adaptive and nonadaptive versions. Table 2 reports results for ρ8, aligned to a covariate expressing a linear interaction, and for ρ6, for a highly nonlinear interaction. We observe that both versions of scheme 2 express notable improvements in computational efficiency as compared to scheme 1. We note, however, that the adaptive scheme produces draws of higher autocorrelation than the nonadaptive method.
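The autocorrelation-time diagnostic described above can be sketched as follows; the truncation of the sum at the first nonpositive autocorrelation estimate is a common practical choice that the paper does not specify, and the AR(1) chain is only a stand-in for MCMC output.

```python
import numpy as np

def autocorr_time(x):
    """Estimate the autocorrelation time 1 + 2 * (sum of autocorrelations over lags).
    The infinite sum is truncated at the first lag with a nonpositive estimate,
    a common practical rule (not stated in the paper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (x @ x)
    tau = 1.0
    for lag in range(1, n):
        if acf[lag] <= 0:
            break
        tau += 2.0 * acf[lag]
    return tau

# AR(1) draws as a stand-in for MCMC samples of one rho_k.
rng = np.random.default_rng(10)
phi, m = 0.9, 2000
e = rng.standard_normal(m)
chain = np.zeros(m)
for t in range(1, m):
    chain[t] = phi * chain[t - 1] + e[t]
tau = autocorr_time(chain)
print(round(tau, 1), "effective sample size ~", round(m / tau))
```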

6.5 Sensitivity Analysis

We begin with a sensitivity analysis on the prior for ρk|γk = 1. Table 3 shows results under a full factorial combination for hyperparameters (a, b) of a Beta prior construction, where we recall Beta(1,1) ≡ U(0,1). Results were obtained with the univariate regression simulation kernel

y = x1 + x2 + sin(1.5x3) sin(1.5x4) + sin(3x5) + sin(3x6) + (1.5x7)(1.5x8) + ε,

with ε ∼ N(0, σ²) and where we employed a higher error variance of σ = 0.28. As before, we employ 1,000 covariates with all but the first 8 defined as nuisance. A training sample of 110 was simulated, along with a test set of 100 observations. We employed the adaptive scheme 2, with 5,000 iterations, half discarded as burn-in.

Figure 5 shows box plots of posterior samples for ρk for two symmetric alternatives, 1: (a, b) = (0.5, 0.5) (U-shaped) and 2: (a, b) = (2.0, 2.0) (symmetric unimodal). For scenario 2 we observe a reduction in posterior jitter on nuisance covariates and a stabilization of posterior sampling for associated covariates, but also a greater tendency to exclude x3, x4. One would expect the differences in posterior sampling behavior across prior hyperparameter values to decline as the sample size increases. Table 3 displays the number of nonselected true variables (false negatives), out of 8, along with the normalized MSPEs for all scenarios. There were no false positives to report across all hyperparameter settings. Overall, results are similar across the chosen settings for (a, b), with slightly better performances for a < 1 and b ≥ 1, corresponding to strictly decreasing shapes that aid selection by pushing more mass away from 1, increasing the prior probability of the good variables to be selected, especially in the presence of a large number of noisy variables.

TABLE 3
Prior sensitivity for ρk|γk = 1 ∼ Beta(a, b). Results are reported as (number of false negatives)/(normalized MSPE)

b\a      0.5        1.0        2.0
0.5      2/0.18     2/0.15     2/0.18
1.0      1/0.14     1/0.16     2/0.18
2.0      1/0.15     2/0.16     2/0.17

FIG. 5. Prior sensitivity for ρk|γk = 1 ∼ Beta(a, b): Box plots of posterior samples for ρk for (a, b) = (0.5, 0.5), plot (a), and (a, b) = (2.0, 2.0), plot (b).

Next we imposed a Beta distribution on the hyperparameter α of the priors γk ∼ Bernoulli(α) for covariate inclusion. We follow Brown, Vannucci and Fearn (1998a) to specify a vague prior by setting the mean of the Beta prior to 0.025, reflecting a prior expectation for model sparsity, and the sum of the two parameters of the distribution to 2. We ran the same univariate regression simulation kernel as above with the hyperparameter settings for the Beta prior on ρk equal to (1,1) and obtained the same selection results as in the case of α fixed, with a slightly lower normalized MSPE of 0.14.

Last, we explored performances with respect to correlation among the predictors. We utilized the same kernel as above with 8 true predictors from which to construct the response. We then induced a 70% correlation among 20 randomly chosen nuisance covariates and the true predictor x6. We found 2 false negatives and 1 false positive, which demonstrates a relative robustness of the selection under correlation. We did observe a significant decline in prediction accuracy, however, with the normalized MSPE increasing to 0.33 as compared to previous runs.

7. BENCHMARK DATA APPLICATIONS

We now present results on two data sets often used in the literature as benchmarks. For both analyses we performed inference by using MCMC scheme 2, with 5,000 iterations and half discarded as burn-in.

7.1 Ozone data

We start by revisiting the ozone data, first analyzed for variable selection by Breiman and Friedman (1985) and more recently by Liang et al. (2008). This data set supplies integer counts for the maximum number of ozone particles per one million particles of air near Los Angeles for n = 330 days and includes an associated set of 8 meteorological predictors. We held out a randomly chosen set of 165 observations for validation.

Liang et al. (2008) use a linear regression model including all linear and quadratic terms for a total of p = 44 covariates. They achieve variable selection by imposing a mixture prior on the vector β of regression coefficients and specifying a g-prior of the type βγ|φ ∼ N(0, (g/φ)(Xγ′Xγ)^(−1)). Their results are reported in Table 4 with various formulations for g. In particular, the local empirical Bayes method offers a model-dependent maximizer of the marginal likelihood on g, while the hyper-g formulation with a = 4 is one member of a continuous set of hyper-prior distributions on the shrinkage factor, g/(1 + g) ∼ Beta(1, a/2 − 1). Since the design matrix expresses a high condition number, a situation that can at times induce poor results with g-priors, we additionally applied the method of Brown, Vannucci and Fearn (2002) who used a mixture prior of the type βγ ∼ N(0, cI). Results shown in Table 4 were obtained from the Matlab code made available by the authors.

Though previous variable selection analyses of the ozone data all chose a Gaussian likelihood, a more precise approach employs a discrete Poisson or negative binomial formulation on data with low count values, or a log-normal approximation where counts are high. With a maximum value of 38 and a mean of 11, we chose to model the data with the negative binomial count data model (5). We used the same hyperparameter settings as in our simulation study. Results are shown in Figure 6. By selecting, for example, the best 3 variables, we achieve a notable decrease in the root MSPE as compared to the linear models.


TABLE 4
Ozone data: Results

Prior on g                            Mγ                                   pγ   RMSPE
Local empirical Bayes                 X5, X6, X7, X6², X7², X3X5           6    4.5
Hyper-g (a = 4)                       X5, X6, X7, X6², X7², X3X5           6    4.5
Fixed (BIC)                           X5, X6, X7, X6², X7², X3X5           6    4.5
Brown, Vannucci and Fearn (2002)      X1X6, X1X7, X6X7, X1², X3², X7²      6    4.5
GP model                              X3, X6, X7                           3    3.7

Also, by allowing an a priori unspecified functional form for how covariates relate to the response, we end up selecting a much more parsimonious model, although, of course, we lose in interpretability of the selected terms with respect to linear formulations that specifically include linear, quadratic and interaction terms in the model.

7.2 Boston Housing data

Next we utilize the Boston Housing data set, also analyzed by Breiman and Friedman (1985), who used an additive model and employed an algorithm to empirically determine the functional relationship for each predictor. This data set relates p = 13 predictors to the median value of owner-occupied homes in each of n = 506 census tracts in the Boston metropolitan area. As with the previous data set, we held out a random set of 250 observations to assess prediction.

We employed the continuous data model (3) with the same hyperparameter settings as in our simulations. The four predictors chosen by Breiman and Friedman (1985), (x6, x10, x11, x13), all had marginal posterior probabilities of inclusion greater than 0.9 in our model. Other variables with high marginal posterior probability were (x5, x7, x8, x12). The adaptability of the GP response surface is illustrated by a closer examination of covariate x5, which measures the level of nitrogen oxide (NOX), a pollutant emitted by cars and factories. At low levels, indicating proximity to jobs, x5 presents a positive association to the response, and at high levels, indicating overly industrialized areas, a negative association. This inverted parabolic association over the covariate range probably drove its exclusion in the model of Breiman and Friedman (1985). The GP formulation is, however, able to capture this strong nonlinear relationship, as noted in Figure 7. By using only the subset of the best eight predictors, we achieved a normalized MSE of 0.1 and a prediction R² of 0.9, very close to the value of 0.89 reported by Breiman and Friedman (1985) on the training data.

We also employed the Matern covariance construction (9), which we recall employs an explicit smoothing parameter, ν ∈ [0, ∞).

FIG. 6. Ozone data: Posterior distributions for γk = 1 and box plots of posterior samples for ρk.


FIG. 7. Boston housing data: Posterior distributions for γk = 1 and box plots of posterior samples for ρk.

While selection results were roughly similar, the prediction results for the Matern model were significantly worse than for the exponential model, with a normalized MSPE of 0.16, probably due to overfitting. It is worth noticing that the more complex form for the Bessel function increases the CPU computation time by a factor of 5–10 under the Matern covariance as compared to the exponential construction.
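To illustrate the source of this overhead, the short Python sketch below (our own illustration under a generic Matern parameterization; it does not reproduce the specific forms of (7) or (9), and names such as ell and nu are ours) compares a squared-exponential kernel, which needs only an exponential, with a Matern kernel, which requires the modified Bessel function Kν from scipy.special.

import numpy as np
from scipy.special import gamma, kv  # kv: modified Bessel function of the second kind

def sq_exp_kernel(d, ell=1.0):
    # Squared-exponential kernel on a matrix of distances d (generic form).
    return np.exp(-(d / ell) ** 2)

def matern_kernel(d, ell=1.0, nu=1.5):
    # Standard Matern kernel with smoothness nu; the Bessel call dominates the cost.
    d = np.asarray(d, dtype=float)
    scaled = np.sqrt(2.0 * nu) * d / ell
    out = np.ones_like(scaled)                  # kernel equals 1 at zero distance
    nz = scaled > 0
    out[nz] = (2.0 ** (1.0 - nu) / gamma(nu)) * (scaled[nz] ** nu) * kv(nu, scaled[nz])
    return out

rng = np.random.default_rng(0)
D = np.abs(rng.standard_normal((300, 300)))     # arbitrary distances, for timing only
K_exp = sq_exp_kernel(D)
K_mat = matern_kernel(D, nu=1.5)

Timing the two kernel evaluations on the same distance matrix reproduces, qualitatively, the severalfold slowdown of the Matern construction noted above.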

For comparison, we looked at GBMs. We used version 3.1 of the gbm package for the R software environment, with the same training and validation data as above. After experimentation and use of 10-fold cross-validation, we chose a small value for the input regularization parameter, ν = 0.0005, to provide a smoother fit that prevents overfitting; larger values of ν resulted in higher prediction errors. The GBM was run for 50,000 iterations to achieve minimum fit error. The result provided a normalized MSPE of 0.13 on the test set, similar to, though slightly higher than, the GP result. The left-hand chart of Figure 8 displays the relative covariate importance. Higher values correspond to (x13, x6, x8) and agree with our GP results. A number of other covariates show similar importance values to one another, though lower than these top 3, making it unclear whether they are truly related or nuisance covariates. Similar conclusions are reported by other authors. For example, Tokdar, Zhu and Ghosh (2010) analyze a subset of the same data set with a Bayesian density regression model based on logistic Gaussian processes and subspace projections and find (x13, x6) as the most influential predictors, with a number of others having a mild influence as well. The right-hand plot of Figure 8 supplies a partial dependence plot obtained by the GBM for variable x13 by averaging over the associations for the other covariates. We note that the nonlinear association is not constrained to be smooth under GBM.
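As a rough analogue of this comparison (not the code used in the paper, which relied on the R gbm package), a gradient boosting fit with a small learning rate, variable importances and a partial dependence profile for x13 could be set up in Python with scikit-learn as sketched below; the data here are placeholders, and the column index 12 assumes x13 is the thirteenth predictor.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

# Placeholder data standing in for the 13 Boston Housing predictors and the
# median home values; in practice X and y would come from the actual data set.
rng = np.random.default_rng(1)
X = rng.standard_normal((506, 13))
y = rng.standard_normal(506)

# A small learning rate mirrors the small shrinkage value chosen above to keep the
# fit smooth; n_estimators plays the role of the number of boosting iterations.
gbm = GradientBoostingRegressor(learning_rate=0.0005, n_estimators=50_000, random_state=0)
gbm.fit(X, y)

importances = gbm.feature_importances_               # relative covariate importance
pd_x13 = partial_dependence(gbm, X, features=[12])   # partial dependence for x13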

8. DISCUSSION

In this paper we have presented a unified modeling approach via Gaussian processes that extends to data from the exponential dispersion family and to survival data. Such a model formulation allows for nonlinear associations of the predictors to the response. We have considered, in particular, continuous, categorical and count responses and survival data. Next we have addressed the important problem of selecting variables from a set of possible predictors and have put forward a general framework that employs Bayesian variable selection methods and mixture priors for the selection of the predictors. We have investigated strategies for posterior inference and have demonstrated performances on simulated and benchmark data. GP models provide a parsimonious approach to model formulation with a great degree of freedom for the data to define the fit. Our results, in particular, have shown that GP models can achieve good prediction performances without the requirement of prespecifying higher order and nonlinear additive functions of the predictors.


FIG. 8. Boston housing data: GBM covariate analysis. The left-hand chart provides variable importances, normalized to sum to 100. The right-hand plot shows the partial association of x13 to the response.

The benchmark data applications have shown that a GP formulation may be appropriate in cases of heterogeneous covariates, where the inability to employ an obvious transformation would require higher order polynomial terms in an additive linear fashion, or even in the case of a homogeneous covariate space where the transformation overly reduces structure in the data. Our simulation results have further highlighted the ability of the GP formulation to manage data sets with p ≫ n.

A challenge in the use of variable selection methods in the GP framework is to manage the numerical instability in the construction of the GP covariance matrix. In the Appendix we describe a projection method to reduce the effective dimension of this matrix. Another practical limitation of the models we have described is the difficulty of using them with qualitative predictors. Qian, Wu and Wu (2008) provide a modification of the GP covariance kernel that allows for nominal qualitative predictors consisting of any number of levels. In particular, the authors model the covariance structure under a mixture of qualitative and quantitative predictors by employing a multiplicative factor against the usual GP kernel for each qualitative predictor to capture the by-level categorical effects.

Some generalizations of the methods we have presented are possible. For example, as with GLM models, we may employ an additional set of variance inflation parameters, in a construction similar to Neal (1999) and others, to allow for heavier-tailed distributions while maintaining the conjugate framework.

APPENDIX: COMPUTATIONAL ASPECTS

We focus on the exponential form (7) and introduce an efficient computational algorithm to generate C. We also review a method of Banerjee et al. (2008) to approximate the inverse matrix that employs a random subset of observations, and we provide pseudo-code.

A.1 Generating the Covariance Matrix C

Let us begin with the quadratic expression, G = {gi,j}, in (7). We rewrite gi,j = A′i,j[−log(ρ)], with Ai,j constructed as a p × 1 vector of term-by-term squared differences, (xik − xjk)², k = 1, . . . , p. We may directly employ the p × 1 vector, ρ, as P is diagonal. As a first step, we may then directly compute G = A[−log(ρ)], where A is n × n × p. We are, however, able to reduce the more complex structure of A to a two-dimensional matrix form by simply stacking each {i, j} row of dimension 1 × p under each other, such that our revised structure, A∗, is of dimension n² × p and the computation, G = A∗[−log(ρ)], reduces to a series of inner products. Next, we note that log(ρk) = 0 for ρk = 1. So we may reduce the dimension for each of the n² inner products by reducing the dimension of ρ to the pγ < p nontrivial covariates. We may further improve efficiency by recognizing that, since our resultant covariance matrix, C, is symmetric positive definite, we need only compute the inner products for a reduced set of unique terms (by removing redundant rows from A∗) and then “re-inflate” the result to a vector of the correct length. Finally, we exponentiate this vector, multiply by the nonlinear weight (1/λz), add the affine intercept term, (1/λa), and then reshape this vector into the resulting n × n matrix, C.


The resulting improvement in computational efficiency at n = 100 over the naive approach that employs double loops of inner products is on the order of 500 times.
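A minimal NumPy sketch of this vectorized construction follows (our own rendering, not the authors' code; it assumes the exponential form C = (1/λa)Jn + (1/λz) exp(−G) stated in Section A.3 and uses np.unique for the unique-row reduction and re-inflation).

import numpy as np

def gp_covariance(X, rho, lambda_a, lambda_z):
    # C = (1/lambda_a) J_n + (1/lambda_z) exp(-G), with
    # G_ij = sum_k -log(rho_k) (x_ik - x_jk)^2, computed via the stacked
    # (n^2 x p) matrix of squared differences and a unique-row reduction.
    n, p = X.shape
    A_star = ((X[:, None, :] - X[None, :, :]) ** 2).reshape(n * n, p)
    sel = rho < 1.0                                   # log(rho_k) = 0 when rho_k = 1
    A_unique, inv = np.unique(A_star[:, sel], axis=0, return_inverse=True)
    g_unique = A_unique @ (-np.log(rho[sel]))         # one inner product per unique row
    c_vec = 1.0 / lambda_a + (1.0 / lambda_z) * np.exp(-g_unique)
    return c_vec[inv].reshape(n, n)                   # re-inflate and reshape to n x n

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
rho = np.ones(10); rho[:3] = 0.2                      # three nontrivial covariates
C = gp_covariance(X, rho, lambda_a=1.0, lambda_z=1.0)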

Our MCMC scheme 2 proposes a change to ρk ∈ ρ, one-at-a-time, conditionally on ρ−k and the other sampled parameters. Changing a single ρk requires updating only one column of the inner product computation of A∗ and [−log(ρ)]. Rather than conducting an entire recomputation of C, we multiply the kth column of A∗ (with the number of rows reduced to only the unique terms in C) by log(ρk,prop/ρk,old), where “prop” denotes the proposed value for ρk. This result is next exponentiated (to a covariance kernel), re-inflated and shaped into an n × n matrix, Δ. We then take the current value less the affine term, Cold − (1/λa)Jn, multiply by Δ, term-by-term, and add back the affine term to achieve the new covariance matrix associated to the proposed value for ρk. So we may devise an algorithm to update an existing covariance matrix, C, rather than conducting an entire recomputation. At p = 1,000 with 6 nontrivial covariates and n = 100, this algorithm further reduces the computation time over recomputing the full covariance by a factor of 2. This efficiency grows nonlinearly with the number of nontrivial covariates.
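A companion sketch of the single-ρk update (again our own illustration) shows the elementwise rescaling: only the kth column of the squared-difference matrix enters, and the affine term is removed before the multiplication and added back afterwards.

import numpy as np

def update_covariance(C_old, A_star_col_k, rho_k_new, rho_k_old, lambda_a):
    # Update C after a proposal for a single rho_k, without recomputing G.
    # A_star_col_k is the kth column of the stacked (n^2 x p) squared-difference
    # matrix (shown here without the unique-row reduction, for simplicity).
    n = C_old.shape[0]
    neg_delta_g = A_star_col_k * np.log(rho_k_new / rho_k_old)   # -(G_new - G_old), entrywise
    Delta = np.exp(neg_delta_g).reshape(n, n)
    affine = np.full((n, n), 1.0 / lambda_a)                     # (1/lambda_a) J_n
    return affine + (C_old - affine) * Delta                     # term-by-term product

Rebuilding the covariance from scratch with the proposed ρ and applying this update agree up to floating point error, which provides a convenient unit test for an implementation.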

A.2 Projection Method for Large n

In order to ease the computations, we have also adapted a dimension reduction method proposed by Banerjee et al. (2008) for spatial data. The method achieves a reduced-dimension computation of the inverse of the full (n × n) covariance matrix. It can also help with the accuracy and stability of the posterior computations when working with possibly ill-conditioned GP covariance matrices, particularly for large n. To begin, randomly choose m < n points (knots), sampled within fixed intervals on a grid to ensure relatively uniform coverage, and label these m points z∗. Then define zm→n as the orthogonal projection of z onto the lower dimensional space spanned by z∗, computed as the conditional expectation

zm→n = E(z | z∗) = C′(z∗,z) C⁻¹(z∗,z∗) z∗.

We use the univariate regression framework in (3) to illustrate the dimension reduction obtained by constructing the projection model with zm→n in place of z(x). Recast the model from (3) to

y = zm→n + ε = C′(z∗,z) C⁻¹(z∗,z∗) z∗ + ε,

where εi ∼ N(0, 1/r). Then derive Σn = Cov(y) = (1/r)In + C′(z∗,z) C⁻¹(z∗,z∗) C(z∗,z). Finally, employ the Woodbury matrix identity to transform the inverse computation,

Σn⁻¹ = rIn − r²C′(z∗,z)[C(z∗,z∗) + rC(z∗,z)C′(z∗,z)]⁻¹C(z∗,z),

where the quantity inside the square brackets, now being inverted, is m × m, supplying the dimension reduction for the inverse computation that we seek. We note that, in the absence of the projection method, a large jitter term would be required to invert the GP covariance matrix, trading accuracy for stability. Though the projection method approximates a higher dimensional covariance matrix with a lower dimensional projection, we still improve performance and avoid the accuracy/stability trade-off. We do, however, expect to use more iterations for MCMC convergence when employing a relatively lower projection ratio.

All results shown in this paper were obtained with m/n = 0.35 for the simulated data and with m/n = 0.25 for the benchmark applications, where we enhanced computational stability in the presence of the high condition number for the design matrix. We have also employed the Cholesky decomposition, in a similar fashion as in Neal (1999), in lieu of directly computing the resulting m × m inverse.
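A sketch of the resulting inverse computation is given below (our illustration, assuming generic positive definite inputs; the knot selection and covariance construction themselves are omitted). It combines the Woodbury identity above with a Cholesky solve of the m × m system, as suggested, rather than forming that inverse explicitly.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def projected_precision(C_mm, C_mn, r):
    # Sigma_n^{-1} for Sigma_n = (1/r) I_n + C_mn' C_mm^{-1} C_mn, via the
    # Woodbury identity: r I_n - r^2 C_mn' [C_mm + r C_mn C_mn']^{-1} C_mn.
    m, n = C_mn.shape
    inner = C_mm + r * (C_mn @ C_mn.T)          # the m x m matrix in brackets
    chol = cho_factor(inner)                    # Cholesky in lieu of a direct inverse
    solved = cho_solve(chol, C_mn)              # inner^{-1} C_mn via triangular solves
    return r * np.eye(n) - (r ** 2) * (C_mn.T @ solved)

# Small numerical check against the direct inverse.
rng = np.random.default_rng(0)
n, m, r = 60, 20, 4.0
B = rng.standard_normal((m, m))
C_mm = B @ B.T + m * np.eye(m)                  # a positive definite m x m knot covariance
C_mn = rng.standard_normal((m, n))              # an m x n cross-covariance
Sigma_n = (1.0 / r) * np.eye(n) + C_mn.T @ np.linalg.solve(C_mm, C_mn)
assert np.allclose(projected_precision(C_mm, C_mn, r), np.linalg.inv(Sigma_n))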

A.3 Pseudo-code

Procedure to compute C = (1/λa)Jn + (1/λz) exp(−G):

Input: data matrices X1, X2 of dimensions n1 × p and n2 × p
Output: [A∗, Ifull] = difference(X1, X2)
% A∗ is the matrix of squared L2 distances for two data matrices of p columns
% A∗ has size ℓ × p, ℓ ≤ n1n2: only unique entries
% Ifull re-inflates A∗ with the duplicate entries
% Key point: compute A∗ once and re-use it in the GP posterior computations
% Set a counter to stack all (i, j) observations from X1, X2 in a vectorized construction
count = 1;
% Compute squared distances
FOR i = 1 to n1
    FOR j = 1 to n2
        A∗full(count, :) = (x1,i − x2,j)²;
        count = count + 1;
    END
END
% Reduce A∗full to A∗
[A∗, Ifull] = unique(A∗full, by row);
END FUNCTION

Input: Data = (A∗, Ifull); covariance parameters (ρ, λa, λz)
Output: [C] = C(A∗, Ifull, ρ, λa, λz)
% Returns an n1 × n2 GP covariance matrix
% Only compute the inner product for columns k where ρk < 1
selρ = {k : ρk < 1}; ρ = ρ(selρ);
A∗ = A∗(:, selρ);
% Compute the vector of unique values for C
−Gvec = A∗[log(ρ)]′;
Cvec = 1/λa + (1/λz) exp(−Gvec);
% Re-inflate Cvec to include the duplicate values
Cvec = Cvec(Ifull);
% Snap Cvec into matrix form, C
C = reshape(Cvec, n2, n1)′;
END FUNCTION

Input: Previous covariance = Cold; Data = (A∗, Ifull); Position changed = k;
    Parameters = (ρk,new, ρk,old); Intercept = λa
Output: [Cnew] = Cpartial(Cold, A∗, Ifull, k, λa)
% Compose the new covariance matrix, Cnew, from the old one, Cold
% Compute inner products only for column k of A∗
% Produce the matrix of multiplicative differences from old to new
−ΔGvec = A∗(:, k) × log(ρk,new/ρk,old);
% Re-inflate exp(−ΔGvec)
exp(−ΔGvec) = exp(−ΔGvec)(Ifull);
% Re-shape exp(−ΔGvec) into the matrix Δ
Δ = reshape(exp[−ΔGvec], n2, n1)′;
% Compute Cnew
Cnew = (1/λa)Jn + (Cold − (1/λa)Jn) ⊙ Δ;
END FUNCTION

Procedure to compute the inverse of Σn = (1/r)In + C:

Input: Number of sub-sampled knots = m; Data = X; Error precision = r;
    Covariance parameters (ρ, λa, λz)
Output: Σn⁻¹
% Randomly select m < n observations on which to project the n × 1 vector z(x)
ind = random.permutations.latin.hypercube(n); % space-filling
Xm = X(ind(1:m), :);
% Compute squared distances A∗m and A∗
[A∗m, Im,full] = difference(Xm, Xm); % pairs for the m × m knot covariance
[A∗, Ifull] = difference(Xm, X); % pairs for the m × n cross-covariance
% Compose the associated covariance matrices
C(m,m) = C(A∗m, Im,full, ρ, λa, λz);
C(m,n) = C(A∗, Ifull, ρ, λa, λz);
% Compute Σn
Σn = (1/r)In + C′(m,n) C⁻¹(m,m) C(m,n);
% Compute Σn⁻¹ via the Woodbury matrix identity
Σn⁻¹ = rIn − r²C′(m,n)[C(m,m) + rC(m,n)C′(m,n)]⁻¹C(m,n);
END

ACKNOWLEDGMENTS

Marina Vannucci was supported in part by NIH-NHGRI Grant R01-HG003319 and by NSF Grant DMS-10-07871. Naijun Sha was supported in part by NSF Grant CMMI-0654417. Terrance Savitsky was supported under NIH Grant NCI T32 CA096520 while at Rice University.

REFERENCES

ALBERT, J. H. and CHIB, S. (1993). Bayesian analysis of binary and polychotomous response data. J. Amer. Statist. Assoc. 88 669–679. MR1224394
BANERJEE, S., GELFAND, A. E., FINLEY, A. O. and SANG, H. (2008). Gaussian predictive process models for large spatial data sets. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 825–848. MR2523906
BISHOP, C. M. (2006). Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York. MR2247587
BREIMAN, L. and FRIEDMAN, J. H. (1985). Estimating optimal transformations for multiple regression and correlation (with discussion). J. Amer. Statist. Assoc. 80 580–619. MR0803258
BROWN, P. J., VANNUCCI, M. and FEARN, T. (1998a). Bayesian wavelength selection in multicomponent analysis. J. Chemometrics 12 173–182.
BROWN, P. J., VANNUCCI, M. and FEARN, T. (1998b). Multivariate Bayesian variable selection and prediction. J. R. Stat. Soc. Ser. B Stat. Methodol. 60 627–641. MR1626005
BROWN, P. J., VANNUCCI, M. and FEARN, T. (2002). Bayes model averaging with selection of regressors. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 519–536. MR1924304
CHEN, M.-H., IBRAHIM, J. G. and YIANNOUTSOS, C. (1999). Prior elicitation, variable selection and Bayesian computation for logistic regression models. J. R. Stat. Soc. Ser. B Stat. Methodol. 61 223–242. MR1664057
CHIPMAN, H., GEORGE, E. and MCCULLOCH, R. (2001). Practical implementation of Bayesian model selection. In Model Selection (P. Lahiri, ed.) 65–134. IMS, Beachwood, OH. MR2000752
CHIPMAN, H., GEORGE, E. and MCCULLOCH, R. (2002). Bayesian treed models. Machine Learning 48 303–324.
COX, D. R. (1972). Regression models and life-tables (with discussion). J. Roy. Statist. Soc. Ser. B 34 187–220. MR0341758
DIGGLE, P. J., TAWN, J. A. and MOYEED, R. A. (1998). Model-based geostatistics (with discussion). J. Roy. Statist. Soc. Ser. C 47 299–350. MR1626544
FAHRMEIR, L., KNEIB, T. and LANG, S. (2004). Penalized structured additive regression for space-time data: A Bayesian perspective. Statist. Sinica 14 731–761. MR2087971
FREUND, Y. and SCHAPIRE, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119–139. MR1473055
GEORGE, E. I. and MCCULLOCH, R. E. (1993). Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88 881–889.


GEORGE, E. I. and MCCULLOCH, R. E. (1997). Approaches for Bayesian variable selection. Statist. Sinica 7 339–373.
GOTTARDO, R. and RAFTERY, A. E. (2008). Markov chain Monte Carlo with mixtures of mutually singular distributions. J. Comput. Graph. Statist. 17 949–975. MR2479418
GRAMACY, R. B. and LEE, H. K. H. (2008). Bayesian treed Gaussian process models with an application to computer modeling. J. Amer. Statist. Assoc. 103 1119–1130. MR2528830
GREEN, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 711–732. MR1380810
GUAN, Y. and STEPHENS, M. (2011). Bayesian variable selection regression for genome-wide association studies, and other large-scale problems. Ann. Appl. Stat. To appear.
HAARIO, H., SAKSMAN, E. and TAMMINEN, J. (2001). An adaptive Metropolis algorithm. Bernoulli 7 223–242. MR1828504
HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York. MR1851606
JI, C. and SCHMIDLER, S. (2009). Adaptive Markov chain Monte Carlo for Bayesian variable selection. Technical report.
KALBFLEISCH, J. D. (1978). Non-parametric Bayesian analysis of survival time data. J. Roy. Statist. Soc. Ser. B 40 214–221. MR0517442
LEE, K. E. and MALLICK, B. K. (2004). Bayesian methods for variable selection in survival models with application to DNA microarray data. Sankhya 66 756–778. MR2205819
LIANG, F., PAULO, R., MOLINA, G., CLYDE, M. A. and BERGER, J. O. (2008). Mixtures of g priors for Bayesian variable selection. J. Amer. Statist. Assoc. 103 410–423. MR2420243
LINKLETTER, C., BINGHAM, D., HENGARTNER, N., HIGDON, D. and YE, K. Q. (2006). Variable selection for Gaussian process models in computer experiments. Technometrics 48 478–490. MR2328617
LONG, J. (1997). Regression Models for Categorical and Limited Dependent Variables. Sage, Thousand Oaks, CA.
MADIGAN, D. and YORK, J. (1995). Bayesian graphical models for discrete data. Internat. Statist. Rev. 63 215–232.
MCCULLAGH, P. and NELDER, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman & Hall, London.
NEAL, R. M. (1999). Regression and classification using Gaussian process priors. In Bayesian Statistics 6 (A. P. Dawid, J. M. Bernardo, J. O. Berger and A. F. M. Smith, eds.) 475–501. Oxford Univ. Press, New York. MR1723510
NEAL, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Statist. 9 249–265. MR1823804
O’HAGAN, A. (1978). Curve fitting and optimal design for prediction. J. Roy. Statist. Soc. Ser. B 40 1–42. MR0512140
PANAGIOTELIS, A. and SMITH, M. (2008). Bayesian identification, selection and estimation of semiparametric functions in high-dimensional additive models. J. Econometrics 143 291–316. MR2389611
PARZEN, E. (1963). Probability density functionals and reproducing kernel Hilbert spaces. In Proc. Sympos. Time Series Analysis (Brown Univ., 1962) (M. Rosenblatt, ed.) 155–169. Wiley, New York. MR0149634
QIAN, P. Z. G., WU, H. and WU, C. F. J. (2008). Gaussian process models for computer experiments with qualitative and quantitative factors. Technometrics 50 383–396. MR2457574
RAFTERY, A. E., MADIGAN, D. and VOLINSKY, C. T. (1996). Accounting for model uncertainty in survival analysis improves predictive performance. In Bayesian Statistics 5. Oxford Sci. Publ. 323–349. Oxford Univ. Press, New York. MR1425413
RASMUSSEN, C. E. and WILLIAMS, C. K. I. (2006). Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA. MR2514435
ROBERTS, G. O. and ROSENTHAL, J. S. (2007). Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. J. Appl. Probab. 44 458–475. MR2340211
RUPPERT, D., WAND, M. P. and CARROLL, R. J. (2003). Semiparametric Regression. Cambridge Series in Statistical and Probabilistic Mathematics 12. Cambridge Univ. Press, Cambridge. MR1998720
SACKS, J., SCHILLER, S. B. and WELCH, W. J. (1989). Designs for computer experiments. Technometrics 31 41–47. MR0997669
SAVITSKY, T. D. (2010). Generalized Gaussian process models with Bayesian variable selection. Ph.D. thesis, Dept. Statistics, Rice Univ. MR2782375
SHA, N., TADESSE, M. G. and VANNUCCI, M. (2006). Bayesian variable selection for the analysis of microarray data with censored outcomes. Bioinformatics 22 2262–2268.
SHA, N., VANNUCCI, M., TADESSE, M. G., BROWN, P. J., DRAGONI, I., DAVIES, N., ROBERTS, T. C., CONTESTABILE, A., SALMON, M., BUCKLEY, C. and FALCIANI, F. (2004). Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage. Biometrics 60 812–828. MR2089459
SHAWE-TAYLOR, J. and CRISTIANINI, N. (2004). Kernel Methods for Pattern Analysis. Cambridge Univ. Press, Cambridge.
SINHA, D., IBRAHIM, J. G. and CHEN, M.-H. (2003). A Bayesian justification of Cox’s partial likelihood. Biometrika 90 629–641. MR2006840
THRUN, S., SAUL, L. K. and SCHOLKOPF, B. (2004). Advances in Neural Information Processing Systems. MIT Press, Cambridge.
TOKDAR, S., ZHU, Y. and GHOSH, J. (2010). Bayesian density regression with logistic Gaussian process and subspace projection. Bayesian Anal. 5 319–344.
VOLINSKY, C., MADIGAN, D., RAFTERY, A. and KRONMAL, R. (1997). Bayesian model averaging in proportional hazard models: Assessing the risk of stroke. Appl. Statist. 46 433–448.
WAHBA, G. (1990). Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics 59. SIAM, Philadelphia, PA. MR1045442
WENG, Y. P. and WONG, K. F. (2007). Baseline Survival Function Estimators under Proportional Hazards Assumption. Ph.D. thesis, Institute of Statistics, National Univ. Kaohsiung, Taiwan.