Variable Selection and Model Building via Likelihood Basis Pursuit

Hao Helen ZHANG, Grace WAHBA, Yi LIN, Meta VOELKER, Michael FERRIS, Ronald KLEIN, and Barbara KLEIN

This article presents a nonparametric penalized likelihood approach for variable selection and model building, called likelihood basis pursuit (LBP). In the setting of a tensor product reproducing kernel Hilbert space, we decompose the log-likelihood into the sum of different functional components such as main effects and interactions, with each component represented by appropriate basis functions. Basis functions are chosen to be compatible with variable selection and model building in the context of a smoothing spline ANOVA model. Basis pursuit is applied to obtain the optimal decomposition in terms of having the smallest $l_1$ norm on the coefficients. We use the functional $L_1$ norm to measure the importance of each component and determine the "threshold" value by a sequential Monte Carlo bootstrap test algorithm. As a generalized LASSO-type method, LBP produces shrinkage estimates for the coefficients, which greatly facilitates the variable selection process and provides highly interpretable multivariate functional estimates at the same time. To choose the regularization parameters appearing in the LBP models, generalized approximate cross-validation (GACV) is derived as a tuning criterion. To make GACV widely applicable to large datasets, its randomized version is proposed as well. A technique, "slice modeling," is used to solve the optimization problem and makes the computation more efficient. LBP has great potential for a wide range of research and application areas such as medical studies, and in this article we apply it to two large ongoing epidemiologic studies, the Wisconsin Epidemiologic Study of Diabetic Retinopathy (WESDR) and the Beaver Dam Eye Study (BDES).

KEY WORDS: Generalized approximate cross-validation; LASSO; Monte Carlo bootstrap test; Nonparametric variable selection; Slice modeling; Smoothing spline ANOVA.

1. INTRODUCTION

Variable selection, or dimension reduction, is fundamental to multivariate statistical model building. Not only does judicious variable selection improve the model's predictive ability, it also generally provides a better understanding of the underlying concept that generates the data. Due to the recent proliferation of large, high-dimensional databases, variable selection has become the focus of intensive research in several areas such as text processing, environmental sciences, and genomics, particularly gene expression array data involving tens or hundreds of thousands of variables.

Traditional variable selection approaches, such as stepwise selection and best subset selection, are built on linear regression models, and well-known criteria like Mallows's $C_p$, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) are often used to penalize the number of nonzero parameters (see Linhart and Zucchini 1986 for an introduction). To achieve better prediction and reduce the variances of estimators, many shrinkage estimation approaches have been proposed. Bridge regression, introduced by Frank and Friedman (1993), is a constrained least squares method subject to an $L_p$ penalty with $p \ge 1$. Two special cases of bridge regression are the LASSO, proposed by Tibshirani (1996), when $p = 1$ and ridge regression when $p = 2$.
Hao Helen Zhang is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: hzhang2@stat.ncsu.edu). Grace Wahba is I. J. Schoenberg and Bascom Professor (E-mail: wahba@stat.wisc.edu) and Yi Lin is Associate Professor (E-mail: yilin@stat.wisc.edu), Department of Statistics, and Michael Ferris is Professor (E-mail: ferris@cs.wisc.edu), Department of Computer Sciences and Industrial Engineering, University of Wisconsin-Madison, Madison, WI 53706. Meta Voelker is Senior Analyst, Alphatech Inc., Arlington, VA 22203 (E-mail: meta.voelker@dc.alphatech.com). Ronald Klein is Professor (E-mail: kleinr@epi.ophth.wisc.edu) and Barbara Klein is Professor (E-mail: kleinB@epi.ophth.wisc.edu), Department of Ophthalmology, Medical School, University of Wisconsin-Madison, Madison, WI 53726. This work was supported in part by National Science Foundation grants DMS-00-72292, DMS-01-34987, DMS-04-05913, and CCR-9972372; National Institutes of Health grants EY09946 and EY03083; and AFOSR grant F49620-01-1-0040. The authors thank the editor, the associate editor, and the two referees for their constructive comments and suggestions that have led to significant improvement of this article.

(c) 2004 American Statistical Association, Journal of the American Statistical Association, September 2004, Vol. 99, No. 467, Theory and Methods. DOI 10.1198/016214504000000593.

Due to the nature of the $L_1$ penalty, the LASSO tends to shrink small coefficients to 0 and hence gives concise models. It also exhibits the stability of ridge regression estimates. Fu (1998) made a thorough comparison between the bridge model and the LASSO. Knight and Fu (2000) proved some asymptotic results for LASSO-type estimators. In the case of wavelet regression, this $L_1$ penalty approach is called "basis pursuit." Chen, Donoho, and Saunders (1998) discussed atomic decomposition by basis pursuit in some detail. (A related development can be found in Bakin 1999.) Gunn and Kandola (2002) proposed a structural modeling approach with sparse kernels. Recently, Fan and Li (2001) suggested a nonconcave penalized likelihood approach with the smoothly clipped absolute deviation (SCAD) penalty function, which results in an unbiased, sparse, and continuous estimator. Our motivation in this study is to provide a flexible nonparametric alternative to the parametric approaches for variable selection as well as model building. Yau, Kohn, and Wood (2003) presented a Bayesian method for variable selection in a nonparametric manner.

Smoothing spline analysis of variance (SS-ANOVA) provides a general framework for nonparametric multivariate function estimation and has been studied intensively for Gaussian data. Wahba, Wang, Gu, Klein, and Klein (1995) gave a general setting for applying the SS-ANOVA model to exponential families. Gu (2002) provided a comprehensive review of SS-ANOVA and some recent progress. In this work we develop a unified model that appropriately combines the SS-ANOVA model and basis pursuit for variable selection and model building.

The article is organized as follows. Section 2 introduces the notation and illustrates the general structure of the likelihood basis pursuit (LBP) model. We focus on the main-effects model and the two-factor interaction model, then generalize the models to incorporate categorical variables. Section 3 discusses the important issue of adaptively choosing regularization parameters.


An extension of generalized approximate cross-validation (GACV), proposed by Xiang and Wahba (1996), is derived as a tuning criterion. Section 4 proposes the measure of importance for the variables and, if desired, their interactions. We develop a sequential Monte Carlo bootstrap test algorithm to determine the selection threshold. Section 5 covers the numerical computation issue. Sections 6-8 present several simulation examples and the applications of LBP to two large epidemiologic studies: we carry out a data analysis for the 4-year risk of progression of diabetic retinopathy in the Wisconsin Epidemiologic Study of Diabetic Retinopathy (WESDR) and for the 5-year risk of mortality in the Beaver Dam Eye Study (BDES). The final section contains some concluding remarks. Proofs are relegated to Appendixes A and B.

2. LIKELIHOOD BASIS PURSUIT

2.1 Smoothing Spline ANOVA for Exponential Families

We are interested in estimating the dependence of $Y$ on the covariates $X = (X^{(1)}, \ldots, X^{(d)})$. Typically $X$ is in a high-dimensional space $\mathcal{X} = \mathcal{X}^{(1)} \otimes \cdots \otimes \mathcal{X}^{(d)}$, where $\mathcal{X}^{(\alpha)}$, $\alpha = 1, \ldots, d$, is some measurable space and $\otimes$ denotes the tensor product operation. Conditioning on $x$, assume that $Y$ is from an exponential family with density of the form $h(y|x) = \exp\{[yf(x) - b(f(x))]/a(\phi) + c(y, \phi)\}$, where $a > 0$, $b$, and $c$ are known functions, $f(x)$ is the parameter of interest dependent on $x$, and $\phi$ is either known or a nuisance parameter independent of $x$. We denote the observations of $Y_i|x_i \sim h(y|x_i)$, $i = 1, \ldots, n$, by the vector $\mathbf{y} = (y_1, \ldots, y_n)'$. The scaled conditional log-likelihood is

$$\mathcal{L}(\mathbf{y}, f) = \frac{1}{n}\sum_{i=1}^n [-l(y_i, f(x_i))] \equiv \frac{1}{n}\sum_{i=1}^n [-y_i f(x_i) + b(f(x_i))]. \qquad (1)$$

Although the methodology proposed here is general and valid for any exponential family, we use the Bernoulli case as our working example. In Bernoulli data, $Y$ takes on values $\{0, 1\}$ with the conditional probability $p(x) \equiv \Pr(Y = 1|X = x)$. The logit function is $f(x) = \log(p(x)/(1 - p(x)))$, and the log-likelihood is $l(y, f) = yf - \log(1 + e^f)$. Many parametric approaches, such as those of Tibshirani (1996), Fu (1998), and Fan and Li (2001), assume $f(x)$ to be a linear function of $x$. Instead, we allow $f$ to vary in a high-dimensional function space, which leads to a more flexible estimate for the target function. In this section and Section 2.2 we assume that all of the covariates are continuous; later, in Section 2.3, we take categorical variables into account. Similar to the classical ANOVA, for any function $f(x) = f(x^{(1)}, \ldots, x^{(d)})$ on a product domain $\mathcal{X}$, we can define its functional ANOVA decomposition as

$$f(x) = b_0 + \sum_{\alpha=1}^d f_\alpha(x^{(\alpha)}) + \sum_{\alpha<\beta} f_{\alpha\beta}(x^{(\alpha)}, x^{(\beta)}) + \text{all higher-order interactions}, \qquad (2)$$

where $b_0$ is constant, the $f_\alpha$'s are the main effects, and the $f_{\alpha\beta}$'s are the two-factor interactions. The identifiability of the terms is ensured by side conditions through averaging operators. We are to estimate each $f_\alpha$ in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}^{(\alpha)}$, each $f_{\alpha\beta}$ in the tensor product space $\mathcal{H}^{(\alpha)} \otimes \mathcal{H}^{(\beta)}$, and so on. The full model space is then the $d$th-order tensor product space $\otimes_{\alpha=1}^d \mathcal{H}^{(\alpha)}$. Each functional component in the decomposition (2) falls in the corresponding subspace of $\otimes_{\alpha=1}^d \mathcal{H}^{(\alpha)}$.

For any continuous covariate $x^{(\alpha)}$, we scale it onto $[0, 1]$ and choose $\mathcal{H}^{(\alpha)}$ to be the second-order Sobolev space $W_2[0, 1]$, defined as $\{g : g(t), g'(t) \text{ absolutely continuous}, g''(t) \in L_2[0, 1]\}$. When endowed with a certain inner product, $W_2[0, 1]$ is an RKHS with the reproducing kernel (RK) $1 + K_0(s, t) + K_1(s, t)$. Here $K_0(s, t) = k_1(s)k_1(t)$ and $K_1(s, t) = k_2(s)k_2(t) - k_4(|s - t|)$, with $k_1(t) = t - \frac{1}{2}$, $k_2(t) = \frac{1}{2}(k_1^2(t) - \frac{1}{12})$, and $k_4(t) = \frac{1}{24}(k_1^4(t) - \frac{1}{2}k_1^2(t) + \frac{7}{240})$. Notice that $\mathcal{H}^{(\alpha)} = [1] \oplus \mathcal{H}^{(\alpha)}_\pi \oplus \mathcal{H}^{(\alpha)}_s$, with $[1]$ the constant subspace, $\mathcal{H}^{(\alpha)}_\pi$ the "parametric" subspace generated by $K_0$, consisting of linear functions, and $\mathcal{H}^{(\alpha)}_s$ the "nonparametric" subspace generated by $K_1$, consisting of smooth functions. The reproducing kernel of $\otimes_{\alpha=1}^d \mathcal{H}^{(\alpha)}$ is

$$\prod_{\alpha=1}^d \left(1 + k_1(s^{(\alpha)})k_1(t^{(\alpha)}) + K_1(s^{(\alpha)}, t^{(\alpha)})\right). \qquad (3)$$

Correspondingly, $\otimes_{\alpha=1}^d \mathcal{H}^{(\alpha)}$ can be decomposed into the tensor sum of parametric main-effect subspaces, smooth main-effect subspaces, and two-factor interaction subspaces of three possible forms (parametric $\otimes$ parametric, smooth $\otimes$ parametric, and smooth $\otimes$ smooth), and similarly for three-factor or higher interaction subspaces. The RKs for these subspaces are given by the corresponding terms in the expansion of (3), and the terms in (2) can be expanded in terms corresponding to those RKs in the expansion of (3). Therefore, our model encompasses the linear model as a special case (see Wahba 1990 for more details). In various situations the ANOVA decomposition in (2) and in the expansion of (3) is truncated at some point. In this work we consider truncation for the continuous variables no later than after the two-factor interactions. The remaining RKs will be used to construct an overcomplete set of basis functions for the likelihood basis pursuit, with details given later.

2.2 Likelihood Basis Pursuit Models

Basis pursuit is a principle for decomposing a signal into an optimal superposition of dictionary elements, where "optimal" means having the smallest $l_1$ norm of the coefficients among all such decompositions. Chen et al. (1998) illustrated atomic decomposition by basis pursuit in wavelet regression. In this article we apply basis pursuit to the negative log-likelihood in the context of a dictionary based on the SS-ANOVA decomposition, then select the important components using the multivariate function estimates. Let $\mathcal{H}$ be the model space after truncation. The variational problem for the LBP model is

$$\min_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n [-l(y_i, f(x_i))] + J_\lambda(f), \qquad (4)$$


where $J_\lambda(f)$ denotes the $l_1$ norm of the coefficients of the basis functions in the representation of $f$. It is a generalization of the LASSO penalty in the context of nonparametric models. The $l_1$ penalty often produces coefficients that are exactly 0 and therefore gives sparse solutions. This sparsity helps distinguish important variables from unimportant ones easily and more effectively. (See Tibshirani 1996 and Fan and Li 2001 for comparisons of the $l_1$ penalty with other forms of penalty.) The regularization parameter $\lambda$ balances the fit to the likelihood against the penalty part.

For the usual smoothing spline modeling, the penalty $J_\lambda$ is a quadratic norm or seminorm in an RKHS. Kimeldorf and Wahba (1971) showed that the minimizer $f_\lambda$ for the traditional smoothing spline model falls in a finite-dimensional space, although the model space is of infinite dimension. For the penalized likelihood approach with a nonquadratic penalty like the $l_1$ penalty, it is very hard to obtain analytic solutions. In light of the results for the quadratic penalty situation, we propose using a sufficiently large number of basis functions to span the model space and estimating the target function in that space. If all of the data $x_1, \ldots, x_n$ are used to generate the bases, the resulting functional space demands intensive computation, and the application is limited for large-scale problems. Thus we adopt the parsimonious bases approach used by Xiang and Wahba (1998), Ruppert and Carroll (2000), Lin et al. (2000), and Yau et al. (2003). Gu and Kim (2001) showed that the number of basis terms can be much smaller than $n$ without degrading the performance of the estimation. For $N \le n$, we subsample $N$ points $x_{1*}, \ldots, x_{N*}$ from the original data and use them to generate basis functions that then span the model space $\mathcal{H}^*$. Notice that we are not wasting any data resource here, because all of the data points are used for model fitting, although only a subset of them are selected to generate basis functions.

The issue of choosing $N$ and the subsamples is important. In practice we generally start with a reasonably large $N$; it is well known that "reasonably large" is not actually very large (see Lin et al. 2000). In principle, the subspace spanned by the chosen basis terms needs to be rich enough to provide a decent fit to the true curve. In this article we use the simple random sampling technique to choose the subsamples. Alternatively, a cluster algorithm may be used, as done by Xiang and Wahba (1998) and Yau et al. (2003). Its basic idea involves two steps. First, we group the data into $N$ clusters that have maximum separation by some good algorithm. Then, within each cluster, we randomly choose one data point as a representative to be included in the basis pool. This scheme usually provides well-separated subsamples.

2.2.1 Main-Effects Model. The main-effects model, also known as the additive model, is a sum of $d$ functions of one variable. The function space is the tensor sum of the constant subspace and the main-effect subspaces of $\otimes_{\alpha=1}^d \mathcal{H}^{(\alpha)}$. Define $\mathcal{H}^* = [1] \oplus_{\alpha=1}^d \text{span}\{k_1(x^{(\alpha)}),\ K_1(x^{(\alpha)}, x_{j*}^{(\alpha)}),\ j = 1, \ldots, N\} \equiv [1] \oplus_{\alpha=1}^d \mathcal{H}_*^{(\alpha)}$, where $k_1(\cdot)$ and $K_1(\cdot, \cdot)$ are as defined in Section 2.1. Then any component function $f_\alpha \in \mathcal{H}_*^{(\alpha)}$ has the representation $f_\alpha(x^{(\alpha)}) = b_\alpha k_1(x^{(\alpha)}) + \sum_{j=1}^N c_{\alpha j} K_1(x^{(\alpha)}, x_{j*}^{(\alpha)})$. The LBP estimate $\hat{f} \in \mathcal{H}^*$ is obtained by minimizing

$$\frac{1}{n}\sum_{i=1}^n [-l(y_i, f(x_i))] + \lambda_\pi \sum_{\alpha=1}^d |b_\alpha| + \lambda_s \sum_{\alpha=1}^d \sum_{j=1}^N |c_{\alpha j}|, \qquad (5)$$

where $f(x) = b_0 + \sum_{\alpha=1}^d f_\alpha(x^{(\alpha)})$ and $(\lambda_\pi, \lambda_s)$ are the regularization parameters. Here and in the sequel we have chosen to group terms of similar types (here "parametric" and "smooth") and to allow distinct $\lambda$'s for different groups. Of course, we could choose to set $\lambda_\pi = \lambda_s$; note that $\lambda_s = \infty$ yields a parametric model. Alternatively, we could choose different $\lambda$'s for each coefficient if we decide to incorporate some prior knowledge of certain variables or their effects.
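To make the construction concrete, here is a minimal sketch of the main-effects objective (5) for Bernoulli data, reusing the kernel functions k1 and K1 from the sketch above; the function name and coefficient layout are our illustrative choices, not the authors' MINOS formulation.

```python
# A minimal sketch of the main-effects LBP objective (5). X is the n x d
# covariate matrix scaled to [0, 1], Xstar the N x d matrix of subsampled
# basis points, and theta packs (b0, b_alpha, c_alpha_j).
import numpy as np

def lbp_objective(theta, y, X, Xstar, lam_pi, lam_s):
    n, d = X.shape
    N = Xstar.shape[0]
    b0, b = theta[0], theta[1:d + 1]        # constant and parametric terms
    c = theta[d + 1:].reshape(d, N)         # smooth-term coefficients
    f = np.full(n, b0)
    for a in range(d):
        f += b[a] * k1(X[:, a])
        for j in range(N):
            f += c[a, j] * K1(X[:, a], Xstar[j, a])
    nll = np.mean(-y * f + np.logaddexp(0.0, f))   # b(f) = log(1 + e^f)
    return nll + lam_pi * np.abs(b).sum() + lam_s * np.abs(c).sum()
```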

2.2.2 Two-Factor Interaction Model. Two-factor interactions arise in many practical problems. (See Hastie and Tibshirani 1990, sec. 9.5.5, or Lin et al. 2000, figs. 9 and 10, for interpretable plots of two-factor interactions with continuous variables.) In the LBP model the two-factor interaction space consists of a "parametric" part and a "smooth" part. The parametric part is generated by the $d$ parametric main-effect terms and the $d(d-1)/2$ parametric-parametric interaction terms. The smooth part is the tensor sum of the subspaces generated by smooth main-effect terms, parametric-smooth interaction terms, and smooth-smooth interaction terms. For each pair $\alpha \ne \beta$, the two-factor interaction subspace is $\mathcal{H}_*^{(\alpha\beta)} = \text{span}\{k_1(x^{(\alpha)})k_1(x^{(\beta)}),\ K_1(x^{(\alpha)}, x_{j*}^{(\alpha)})k_1(x^{(\beta)})k_1(x_{j*}^{(\beta)}),\ K_1(x^{(\alpha)}, x_{j*}^{(\alpha)})K_1(x^{(\beta)}, x_{j*}^{(\beta)}),\ j = 1, \ldots, N\}$, and the interaction term $f_{\alpha\beta}(x^{(\alpha)}, x^{(\beta)})$ has the representation

$$f_{\alpha\beta} = b_{\alpha\beta} k_1(x^{(\alpha)}) k_1(x^{(\beta)}) + \sum_{j=1}^N c^{\pi s}_{\alpha\beta j} K_1(x^{(\alpha)}, x_{j*}^{(\alpha)}) k_1(x^{(\beta)}) k_1(x_{j*}^{(\beta)}) + \sum_{j=1}^N c^{ss}_{\alpha\beta j} K_1(x^{(\alpha)}, x_{j*}^{(\alpha)}) K_1(x^{(\beta)}, x_{j*}^{(\beta)}).$$

The whole function space is $\mathcal{H}^* \equiv [1] \oplus_{\alpha=1}^d \mathcal{H}_*^{(\alpha)} + \oplus_{\beta<\alpha} \mathcal{H}_*^{(\alpha\beta)}$. The LBP optimization problem is

$$\min_{f \in \mathcal{H}^*} \frac{1}{n}\sum_{i=1}^n [-l(y_i, f(x_i))] + \lambda_\pi \sum_{\alpha=1}^d |b_\alpha| + \lambda_{\pi\pi} \sum_{\alpha<\beta} |b_{\alpha\beta}| + \lambda_{\pi s} \sum_{\alpha \ne \beta} \sum_{j=1}^N |c^{\pi s}_{\alpha\beta j}| + \lambda_s \sum_{\alpha=1}^d \sum_{j=1}^N |c_{\alpha j}| + \lambda_{ss} \sum_{\alpha<\beta} \sum_{j=1}^N |c^{ss}_{\alpha\beta j}|. \qquad (6)$$

Note that different penalties are allowed for the five different types of terms.

2.3 Incorporating Categorical Variables

In real applications some covariates may be categorical, such as sex, race, and smoking history in many medical studies. In the previous sections we proposed the main-effects model (5) and the two-factor interaction model (6) for continuous variables only; now we generalize these models to incorporate categorical variables, denoted by $Z = (Z^{(1)}, \ldots, Z^{(r)})$. For simplicity, we assume that each $Z^{(\gamma)}$ has two categories, $\{T, F\}$, for $\gamma = 1, \ldots, r$; the following idea is easily extended to the situation when some variables have more than two categories. Define the mapping $\phi_\gamma$ by

$$\phi_\gamma(z^{(\gamma)}) = \begin{cases} \frac{1}{2} & \text{if } z^{(\gamma)} = T, \\ -\frac{1}{2} & \text{if } z^{(\gamma)} = F. \end{cases}$$

Generally the mapping is chosen to make the range of the categorical variables comparable with that of the continuous variables. For any variable with $C > 2$ categories, $C - 1$ contrasts are needed.

The main-effects model that incorporates the categorical variables is as follows: Minimize

$$\frac{1}{n}\sum_{i=1}^n [-l(y_i, f(x_i, z_i))] + \lambda_\pi \left(\sum_{\alpha=1}^d |b_\alpha| + \sum_{\gamma=1}^r |B_\gamma|\right) + \lambda_s \sum_{\alpha=1}^d \sum_{j=1}^N |c_{\alpha j}|, \qquad (7)$$

where $f(x, z) = b_0 + \sum_{\alpha=1}^d b_\alpha k_1(x^{(\alpha)}) + \sum_{\gamma=1}^r B_\gamma \phi_\gamma(z^{(\gamma)}) + \sum_{\alpha=1}^d \sum_{j=1}^N c_{\alpha j} K_1(x^{(\alpha)}, x_{j*}^{(\alpha)})$. For each $\gamma$, the function $\phi_\gamma$ can be considered as the main effect of the covariate $Z^{(\gamma)}$; thus we choose to associate the coefficients $|B_\gamma|$'s and $|b_\alpha|$'s with the same parameter $\lambda_\pi$.

Adding two-factor interactions with categorical variables to a model that already includes parametric and smooth terms adds a number of additional terms to the general model. Compared with (6), four new types of terms are involved when we take the categorical variables into account: categorical main effects, categorical-categorical interactions, "parametric continuous"-categorical interactions, and "smooth continuous"-categorical interactions. The modified two-factor interaction model is as follows: Minimize

$$\begin{aligned} \frac{1}{n}\sum_{i=1}^n [-l(y_i, f(x_i, z_i))] &+ \lambda_\pi \left(\sum_{\alpha=1}^d |b_\alpha| + \sum_{\gamma=1}^r |B_\gamma|\right) + \lambda_{\pi\pi} \left(\sum_{\alpha<\beta} |b_{\alpha\beta}| + \sum_{\gamma<\theta} |B_{\gamma\theta}| + \sum_{\alpha=1}^d \sum_{\gamma=1}^r |P_{\alpha\gamma}|\right) \\ &+ \lambda_{\pi s} \left(\sum_{\alpha \ne \beta} \sum_{j=1}^N |c^{\pi s}_{\alpha\beta j}| + \sum_{\alpha=1}^d \sum_{\gamma=1}^r \sum_{j=1}^N |c^{\pi s}_{\alpha\gamma j}|\right) + \lambda_s \left(\sum_{\alpha=1}^d \sum_{j=1}^N |c_{\alpha j}|\right) + \lambda_{ss} \left(\sum_{\alpha<\beta} \sum_{j=1}^N |c^{ss}_{\alpha\beta j}|\right), \qquad (8) \end{aligned}$$

where

$$\begin{aligned} f(x, z) = b_0 &+ \sum_{\alpha=1}^d b_\alpha k_1(x^{(\alpha)}) + \sum_{\gamma=1}^r B_\gamma \phi_\gamma(z^{(\gamma)}) + \sum_{\alpha<\beta} b_{\alpha\beta} k_1(x^{(\alpha)}) k_1(x^{(\beta)}) + \sum_{\gamma<\theta} B_{\gamma\theta} \phi_\gamma(z^{(\gamma)}) \phi_\theta(z^{(\theta)}) \\ &+ \sum_{\alpha=1}^d \sum_{\gamma=1}^r P_{\alpha\gamma} k_1(x^{(\alpha)}) \phi_\gamma(z^{(\gamma)}) + \sum_{\alpha \ne \beta} \sum_{j=1}^N c^{\pi s}_{\alpha\beta j} K_1(x^{(\alpha)}, x_{j*}^{(\alpha)}) k_1(x^{(\beta)}) k_1(x_{j*}^{(\beta)}) \\ &+ \sum_{\alpha=1}^d \sum_{\gamma=1}^r \sum_{j=1}^N c^{\pi s}_{\alpha\gamma j} K_1(x^{(\alpha)}, x_{j*}^{(\alpha)}) \phi_\gamma(z^{(\gamma)}) + \sum_{\alpha=1}^d \sum_{j=1}^N c_{\alpha j} K_1(x^{(\alpha)}, x_{j*}^{(\alpha)}) \\ &+ \sum_{\alpha<\beta} \sum_{j=1}^N c^{ss}_{\alpha\beta j} K_1(x^{(\alpha)}, x_{j*}^{(\alpha)}) K_1(x^{(\beta)}, x_{j*}^{(\beta)}). \end{aligned}$$

We assign different regularization parameters for the main-effect terms, the parametric-parametric interaction terms, the parametric-smooth interaction terms, and the smooth-smooth interaction terms. In particular, the coefficients $|B_{\gamma\theta}|$'s and $|P_{\alpha\gamma}|$'s are associated with the same parameter $\lambda_{\pi\pi}$, and the coefficients $|c^{\pi s}_{\alpha\gamma j}|$'s and $|c^{\pi s}_{\alpha\beta j}|$'s are associated with the same parameter $\lambda_{\pi s}$.

3. GENERALIZED APPROXIMATE CROSS-VALIDATION

Regularization parameter selection has been a very active field of research. Ordinary cross-validation (OCV) (Wahba and Wold 1975), generalized cross-validation (GCV) (Craven and Wahba 1979), and GACV (Xiang and Wahba 1996) are widely used in various contexts of smoothing spline models. We derive the GACV to select the $\lambda$'s in the LBP models. Here we present the GACV score and its derivation only for the main-effects model; the extension to higher-order interaction models is straightforward. With an abuse of notation, we use $\lambda$ to represent the collective set of tuning parameters: in particular, $\lambda = (\lambda_\pi, \lambda_s)$ for the main-effects model and $\lambda = (\lambda_\pi, \lambda_{\pi\pi}, \lambda_{\pi s}, \lambda_s, \lambda_{ss})$ for the two-factor interaction model.

3.1 Approximate Cross-Validation

Let $p$ be the "true" but unknown probability function, and let $p_\lambda$ be its estimate associated with $\lambda$. Similarly, let $f$ and $\mu$ denote the true logit and mean functions, and let $f_\lambda$ and $\mu_\lambda$ denote the corresponding estimates. The Kullback-Leibler (KL) distance, also known as the relative entropy, is often used to measure the distance between two probability distributions. For Bernoulli data we have $\mathrm{KL}(p, p_\lambda) = E_X[\frac{1}{2}\mu(f - f_\lambda) - (b(f) - b(f_\lambda))]$, with $b(f) = \log(1 + e^f)$. Removing the quantity that does not depend on $\lambda$ from the KL distance expression, we get the comparative KL (CKL) distance $E_X[-\mu f_\lambda + b(f_\lambda)]$. The ordinary leave-one-out cross-validation (CV) for CKL is

$$\mathrm{CV}(\lambda) = \frac{1}{n}\sum_{i=1}^n \left[-y_i f_\lambda^{[-i]}(x_i) + b(f_\lambda(x_i))\right], \qquad (9)$$

where $f_\lambda^{[-i]}$ is the minimizer of the objective function with the $i$th data point omitted. CV is commonly used as a roughly unbiased estimate for CKL (see Xiang and Wahba 1996; Gu 2002). Direct calculation of CV involves computing $n$ leave-one-out estimates, which is expensive and almost infeasible for large-scale problems. Thus we derive a second-order approximate cross-validation (ACV) score for the CV. (See Xiang and Wahba 1996; Lin et al. 2000; Gao, Wahba, Klein, and Klein 2001 for the ACV in traditional smoothing spline models.) We first establish the leave-one-out lemma for the LBP models; the proof of Lemma 1 is given in Appendix A.

Lemma 1 (Leave-one-out lemma). Denote the objective function in the LBP model by

$$I_\lambda(f, \mathbf{y}) = \mathcal{L}(\mathbf{y}, f) + J_\lambda(f). \qquad (10)$$

Let $f_\lambda^{[-i]}$ be the minimizer of $I_\lambda(f, \mathbf{y})$ with the $i$th observation omitted, and let $\mu_\lambda^{[-i]}(\cdot)$ be the mean function corresponding to $f_\lambda^{[-i]}(\cdot)$. For any $v \in \mathbb{R}$, define the vector $V = (y_1, \ldots, y_{i-1}, v, y_{i+1}, \ldots, y_n)'$. Let $h_\lambda(i, v, \cdot)$ be the minimizer of $I_\lambda(f, V)$; then $h_\lambda(i, \mu_\lambda^{[-i]}(x_i), \cdot) = f_\lambda^{[-i]}(\cdot)$.

Using Taylor series approximations and Lemma 1, we can derive (in Appendix B) the ACV score

$$\mathrm{ACV}(\lambda) = \frac{1}{n}\sum_{i=1}^n [-y_i f_\lambda(x_i) + b(f_\lambda(x_i))] + \frac{1}{n}\sum_{i=1}^n h_{ii}\, \frac{y_i(y_i - \mu_\lambda(x_i))}{1 - \sigma^2_{\lambda i} h_{ii}}, \qquad (11)$$

where $\sigma^2_{\lambda i} \equiv p_\lambda(x_i)(1 - p_\lambda(x_i))$ and $h_{ii}$ is the $ii$th entry of the matrix $H$ defined in (B.9) (more details have been given in Zhang 2002). Let $W$ be the diagonal matrix with $\sigma^2_{\lambda i}$ in the $ii$th position. By replacing $h_{ii}$ with $\frac{1}{n}\sum_{i=1}^n h_{ii} \equiv \frac{1}{n}\mathrm{tr}(H)$ and replacing $1 - \sigma^2_{\lambda i} h_{ii}$ with $\frac{1}{n}\mathrm{tr}[I - (W^{1/2} H W^{1/2})]$ in (11), we obtain the GACV score

$$\mathrm{GACV}(\lambda) = \frac{1}{n}\sum_{i=1}^n [-y_i f_\lambda(x_i) + b(f_\lambda(x_i))] + \frac{\mathrm{tr}(H)}{n}\, \frac{\sum_{i=1}^n y_i(y_i - \mu_\lambda(x_i))}{\mathrm{tr}[I - W^{1/2} H W^{1/2}]}. \qquad (12)$$
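As a summary of the algebra, the following is a minimal sketch of (12) in Python/NumPy, assuming the fitted logit values, fitted means, and the matrix $H$ of Appendix B are already in hand; the function name and signature are ours.

```python
# A minimal sketch of the GACV score (12); W^{1/2} is diagonal with entries
# sqrt(mu_i (1 - mu_i)) for Bernoulli data.
import numpy as np

def gacv(y, f_lam, mu_lam, H):
    n = len(y)
    fit = np.mean(-y * f_lam + np.logaddexp(0.0, f_lam))
    w_half = np.sqrt(mu_lam * (1.0 - mu_lam))          # diagonal of W^{1/2}
    denom = np.trace(np.eye(n) - w_half[:, None] * H * w_half[None, :])
    return fit + (np.trace(H) / n) * np.sum(y * (y - mu_lam)) / denom
```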

3.2 Randomized GACV

Direct computation of (12) involves the inversion of a large-scale matrix whose size depends on the sample size $n$, the basis size $N$, and the dimension $d$. Large values of $n$, $N$, or $d$ may make the computation expensive and produce unstable solutions. Thus the randomized GACV (ranGACV) score is proposed as a computable proxy for GACV. Essentially, we use randomized trace estimates for $\mathrm{tr}(H)$ and $\mathrm{tr}[I - \frac{1}{2}(W^{1/2} H W^{1/2})]$, based on the following theorem (which has been exploited by numerous authors, e.g., Girard 1998):

If $A$ is any square matrix and $\varepsilon$ is a mean-0 random $n$-vector with independent components with variance $\sigma^2_\varepsilon$, then $\frac{1}{\sigma^2_\varepsilon} E\, \varepsilon^T A \varepsilon = \mathrm{tr}(A)$.

Let $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$ be a mean-0 random $n$-vector of independent components with variance $\sigma^2_\varepsilon$. Let $f_\lambda^{\mathbf{y}}$ and $f_\lambda^{\mathbf{y}+\varepsilon}$ be the minimizers of (5) using the original data $\mathbf{y}$ and the perturbed data $\mathbf{y} + \varepsilon$. Using the derivation procedure of Lin et al. (2000), we get the ranGACV score for the LBP estimates:

$$\mathrm{ranGACV}(\lambda) = \frac{1}{n}\sum_{i=1}^n [-y_i f_\lambda(x_i) + b(f_\lambda(x_i))] + \frac{\varepsilon^T (f_\lambda^{\mathbf{y}+\varepsilon} - f_\lambda^{\mathbf{y}})}{n}\, \frac{\sum_{i=1}^n y_i(y_i - \mu_\lambda(x_i))}{\varepsilon^T \varepsilon - \varepsilon^T W (f_\lambda^{\mathbf{y}+\varepsilon} - f_\lambda^{\mathbf{y}})}. \qquad (13)$$

In addition, two ideas help reduce the variance of the second term in (13):

1. Choose $\varepsilon$ as Bernoulli(.5), taking values in $\{+\sigma_\varepsilon, -\sigma_\varepsilon\}$. This guarantees that the randomized trace estimate has minimal variance for a fixed $\sigma^2_\varepsilon$ (see Hutchinson 1989).
2. Generate $U$ independent perturbations $\varepsilon^{(u)}$, $u = 1, \ldots, U$, compute $U$ replicated ranGACVs, and use their average as the GACV estimate (see the sketch below).
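The following minimal sketch illustrates the randomized pieces of (13): a Bernoulli(.5) probe taking values $\pm\sigma_\varepsilon$ (idea 1) and averaging over $U$ independent perturbations (idea 2). Here refit is a hypothetical stand-in for re-solving (5) with perturbed data; the routine returns the averaged second term of (13), to which the unperturbed fit term is added.

```python
# A minimal sketch of the randomized part of ranGACV (13).
import numpy as np

def rangacv_second_term(y, f_y, mu_y, refit, sigma_eps=0.25, U=5, seed=0):
    rng = np.random.default_rng(seed)
    n, w = len(y), mu_y * (1.0 - mu_y)          # w = diagonal of W
    terms = []
    for _ in range(U):
        eps = sigma_eps * rng.choice([-1.0, 1.0], size=n)   # +/- sigma_eps
        diff = refit(y + eps) - f_y                         # f^{y+eps} - f^{y}
        num = (eps @ diff / n) * np.sum(y * (y - mu_y))
        den = eps @ eps - eps @ (w * diff)
        terms.append(num / den)
    return np.mean(terms)                        # average over U replicates
```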

4. SELECTION CRITERIA FOR MAIN EFFECTS AND TWO-FACTOR INTERACTIONS

4.1 The $L_1$ Importance Measure

After choosing the optimal $\lambda$ by the GACV or ranGACV criteria, the LBP estimate $\hat{f}_\lambda$ is obtained by minimizing (5), (6), (7), or (8). How to measure the importance of a particular component in the fitted model is a key question. We propose using the functional $L_1$ norm as the importance measure. The sparsity of the solutions helps distinguish the significant terms from the insignificant ones effectively and thus improves the performance of our importance measure. In practice, for each functional component, its $L_1$ norm is empirically calculated as the average of the absolute function values evaluated at all of the data points; for instance, $L_1(f_\alpha) = \frac{1}{n}\sum_{i=1}^n |f_\alpha(x_i^{(\alpha)})|$ and $L_1(f_{\alpha\beta}) = \frac{1}{n}\sum_{i=1}^n |f_{\alpha\beta}(x_i^{(\alpha)}, x_i^{(\beta)})|$. For the categorical variables in model (7), the empirical $L_1$ norm of the main effect $f_\gamma$ is $L_1(f_\gamma) = \frac{1}{n}\sum_{i=1}^n |B_\gamma \phi_\gamma(z_i^{(\gamma)})|$ for $\gamma = 1, \ldots, r$. The norms of the interaction terms involving categorical variables are defined similarly. The rank of the $L_1$ norm scores is used to rank the relative importance of the functional components; for instance, the component with the largest $L_1$ norm is ranked as the most important one, and any component with a zero or tiny $L_1$ norm is ranked as unimportant. We have also tried using the functional $L_2$ norm to rank the components; this gave almost identical results in terms of the set of variables selected in numerous simulation studies (not reproduced here).
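In code the importance measure is one line per component; a minimal sketch (our names), assuming each fitted component has been evaluated at the $n$ data points:

```python
# A minimal sketch of the empirical L1 importance scores of Section 4.1:
# each fitted component is averaged in absolute value over the data points.
# comps maps a component's name to its vector of fitted values.
import numpy as np

def l1_scores(comps):
    return {name: np.mean(np.abs(vals)) for name, vals in comps.items()}

# Sorting the resulting scores in decreasing order gives the entry order used
# by the sequential bootstrap tests of Section 4.2.
```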

4.2 Choosing the Threshold

In this section we focus on the main-effects model. Using the chosen parameter $\lambda$, we obtain the estimated main-effects components $\hat{f}_1, \ldots, \hat{f}_d$ and calculate their $L_1$ norms $L_1(\hat{f}_1), \ldots, L_1(\hat{f}_d)$. We use a sequential procedure to select the important terms. Denote the decreasingly ordered norms by $L_{(1)}, \ldots, L_{(d)}$ and the corresponding components by $f_{(1)}, \ldots, f_{(d)}$. A universal threshold value is needed to differentiate the important components from the unimportant ones; call the threshold $q$. Only variables with $L_1$ norms greater than or equal to $q$ are "important."


Now we develop a sequential Monte Carlo bootstrap test procedure to determine $q$. Essentially, we test the variables' importance one by one, in their $L_1$ norm rank order. If one variable passes the test (hence is "important"), then it enters the null model for testing the next variable; otherwise, the procedure stops. After the first $\eta$ ($0 \le \eta \le d - 1$) variables enter the model, it is a one-sided hypothesis-testing problem to decide whether the next component $f_{(\eta+1)}$ is important. When $\eta = 0$, the null model $f$ is the constant, say $f = b_0$, and the hypotheses are $H_0: L_{(1)} = 0$ versus $H_1: L_{(1)} > 0$. When $1 \le \eta \le d - 1$, the null model is $f = b_0 + f_{(1)} + \cdots + f_{(\eta)}$, and the hypotheses are $H_0: L_{(\eta+1)} = 0$ versus $H_1: L_{(\eta+1)} > 0$. Let the desired one-sided test level be $\alpha$. If the null distribution of $L_{(\eta+1)}$ were known, then we could get the critical value (the $\alpha$-percentile) and make a decision of rejection or acceptance. In practice, calculating the exact $\alpha$-percentile is difficult or impossible; however, the Monte Carlo bootstrap test provides a convenient approximation to the full test. Conditional on the original covariates $x_1, \ldots, x_n$, we generate $y_1^{*(\eta)}, \ldots, y_n^{*(\eta)}$ (response 0 or 1) using the null model $f = b_0 + f_{(1)} + \cdots + f_{(\eta)}$ as the true logit function. We sample a total of $T$ independent sets of data $(x_1, y_{1t}^{*(\eta)}), \ldots, (x_n, y_{nt}^{*(\eta)})$, $t = 1, \ldots, T$, from the null model $f$, fit the main-effects model for each set, and compute $L_t^{*(\eta+1)}$, $t = 1, \ldots, T$. If exactly $k$ of the simulated $L^{*(\eta+1)}$ values exceed $L_{(\eta+1)}$ and none equals it, then the Monte Carlo $p$ value is $\frac{k+1}{T+1}$. (See Davison and Hinkley 1997 for an introduction to the Monte Carlo bootstrap test.)

Sequential Monte Carlo Bootstrap Test Algorithm

Step 1. Let $\eta = 0$ and $f = b_0$. Test $H_0: L_{(1)} = 0$ versus $H_1: L_{(1)} > 0$. Generate $T$ independent sets of data $(x_1, y_{1t}^{*(0)}), \ldots, (x_n, y_{nt}^{*(0)})$, $t = 1, \ldots, T$, from $f = b_0$. Fit the LBP main-effects model and compute the Monte Carlo $p$ value $p_0$. If $p_0 < \alpha$, then go to step 2; otherwise, stop and define $q$ as any number slightly larger than $L_{(1)}$.

Step 2. Let $\eta = \eta + 1$ and $f = b_0 + f_{(1)} + \cdots + f_{(\eta)}$. Test $H_0: L_{(\eta+1)} = 0$ versus $H_1: L_{(\eta+1)} > 0$. Generate $T$ independent sets of data $(x_1, y_{1t}^{*(\eta)}), \ldots, (x_n, y_{nt}^{*(\eta)})$ based on $f$, fit the main-effects model, and compute the Monte Carlo $p$ value $p_\eta$. If $p_\eta < \alpha$ and $\eta < d - 1$, then repeat step 2; if $p_\eta < \alpha$ and $\eta = d - 1$, then go to step 3; otherwise, stop and define $q = L_{(\eta)}$.

Step 3. Stop the procedure and define $q = L_{(d)}$.
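A skeleton sketch of the procedure follows. The helpers are hypothetical stand-ins: fit_lbp(X, y) returns the decreasingly ordered component norms $L_{(1)} \ge \cdots \ge L_{(d)}$; null_logit(X, y, eta) returns the fitted logit of the null model $b_0 + f_{(1)} + \cdots + f_{(\eta)}$; and simulate(f0, rng) draws Bernoulli responses with logit f0.

```python
# A skeleton sketch of the sequential Monte Carlo bootstrap test of Sec. 4.2.
import numpy as np

def sequential_threshold(X, y, fit_lbp, null_logit, simulate, T=50, alpha=0.05):
    rng = np.random.default_rng(0)
    L = fit_lbp(X, y)                          # observed ordered norms
    d = len(L)
    for eta in range(d):                       # test H0: L_(eta+1) = 0
        f0 = null_logit(X, y, eta)
        exceed = sum(fit_lbp(X, simulate(f0, rng))[eta] > L[eta]
                     for _ in range(T))
        if (exceed + 1) / (T + 1) >= alpha:    # Monte Carlo p value too large
            return L[eta - 1] if eta > 0 else 1.01 * L[0]   # threshold q
    return L[-1]                               # all d components pass
```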

The order of entry for sequential testing of the terms being entertained for the model is determined by the magnitude of the component $L_1$ norms. There are other reasonable ways to determine the order of entry, and the particular strategy used can affect the results, as is the case for any stepwise procedure. For the LBP approach, the relative ranking among the important terms usually does not affect the final component selection solution, as long as the important terms are all ranked higher than the unimportant terms. Thus our procedure usually can distinguish between important and unimportant terms, as in most of our examples. When the distinction between important and unimportant terms is ambiguous with our method, the ambiguous terms can be recognized, and further investigation will be needed on these terms.

5. NUMERICAL COMPUTATION

Because the objective function in (5) or (6) is not differentiable with respect to the coefficients $b$'s and $c$'s, some numerical methods for optimization fail on this kind of problem. Using standard mathematical programming techniques, we can replace the $l_1$ norms in the objective function by nonnegative variables constrained linearly to equal the corresponding absolute values, and then solve a series of programs with nonlinear objective functions and linear constraints. Many optimization methods can solve such problems. We use MINOS (Murtagh and Saunders 1983), because it generally performs well with linearly constrained models and returns consistent results.
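The device is standard: write each coefficient as the difference of two nonnegative variables, so the $l_1$ term becomes linear. A minimal sketch with SciPy follows, using a generic basis matrix B and a single penalty level lam rather than the paper's grouped $\lambda$'s and MINOS solver.

```python
# A minimal sketch of the split-variable reformulation: b = u - v with
# u, v >= 0, so |b| = u + v and the problem has a smooth objective with
# simple bound constraints.
import numpy as np
from scipy.optimize import minimize

def fit_l1_logistic(B, y, lam):
    n, m = B.shape
    def obj(uv):
        u, v = uv[:m], uv[m:]
        f = B @ (u - v)
        return np.mean(-y * f + np.logaddexp(0.0, f)) + lam * np.sum(u + v)
    res = minimize(obj, np.zeros(2 * m), method="L-BFGS-B",
                   bounds=[(0.0, None)] * (2 * m))
    return res.x[:m] - res.x[m:]   # recovered coefficients, many near zero
```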

Consider choosing the optimal $\lambda$ by selecting the grid point at which ranGACV achieves its minimum. For each $\lambda$, to find ranGACV, the program (5), (6), (7), or (8) must be solved twice: once with $\mathbf{y}$ (the original problem) and once with $\mathbf{y} + \varepsilon$ (the perturbed problem). This often results in hundreds or thousands of individual solves, depending on the range for $\lambda$. To obtain solutions in a reasonable amount of time, we use an efficient solving approach known as slice modeling (see Ferris and Voelker 2000, 2001). Slice modeling is an approach to solving a series of mathematical programs with the same structure but different data. Because for LBP we can consider the values $(\lambda, \mathbf{y})$ to be individual slices of data (because only these values change between solves), the program can be reduced to an example of nonlinear slice modeling. By applying slice modeling ideas (namely, maintaining the common program structure and "core" data shared between solves, and using previous solutions as starting points for later solves), we can improve efficiency and make the grid search feasible. We have developed efficient code for the main-effects and two-way interaction LBP models. The code is easy to use and runs fast: in one simulation example with $n = 1{,}000$, $d = 10$, and $\lambda$ fixed, the main-effects model takes less than 2 seconds and the two-way interaction model takes less than 50 seconds.

6. SIMULATION

6.1 Simulation 1: Main-Effects Model

In this example there are a total of $d = 10$ covariates, $X^{(1)}, \ldots, X^{(10)}$, taken to be independently and uniformly distributed in $[0, 1]$. The sample size is $n = 1{,}000$. We use the simple random subsampling technique to select $N = 50$ basis functions. The perturbation $\varepsilon$ is distributed as Bernoulli(.5), taking the two values $\{+.25, -.25\}$. Four variables, $X^{(1)}$, $X^{(3)}$, $X^{(6)}$, and $X^{(8)}$, are important, and the others are noise variables. The true conditional logit function is

$$f(x) = \frac{4}{3}x^{(1)} + \pi \sin(\pi x^{(3)}) + 8(x^{(6)})^5 + \frac{2}{e - 1}e^{x^{(8)}} - 5.$$

We fit the main-effects LBP model and search the parameters $(\lambda_\pi, \lambda_s)$ globally. Because the true $f$ is known, both CKL and ranGACV can be used for choosing the $\lambda$'s.
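For reference, a minimal sketch (assuming NumPy) of this data-generating mechanism:

```python
# Simulation 1 sketch: d = 10 uniform covariates, the four-term true logit,
# and Bernoulli responses.
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 10
X = rng.uniform(size=(n, d))
f = (4.0 / 3 * X[:, 0] + np.pi * np.sin(np.pi * X[:, 2])
     + 8 * X[:, 5] ** 5 + 2 / (np.e - 1) * np.exp(X[:, 7]) - 5)
y = rng.binomial(1, 1 / (1 + np.exp(-f)))   # Pr(Y = 1) = logistic(f)
```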

Figure 1 depicts the values of $\mathrm{CKL}(\lambda)$ and $\mathrm{ranGACV}(\lambda)$ as functions of $(\lambda_\pi, \lambda_s)$ within the region of interest $[2^{-20}, 2^{-1}] \times [2^{-20}, 2^{-1}]$. In the top row are the contours for $\mathrm{CKL}(\lambda)$ and $\mathrm{ranGACV}(\lambda)$, where the white cross "x" indicates the location of the optimal regularization parameter; here $\lambda_{\mathrm{CKL}} = (2^{-17}, 2^{-15})$ and $\lambda_{\mathrm{ranGACV}} = (2^{-8}, 2^{-15})$. The bottom row shows their three-dimensional plots. In general, $\mathrm{ranGACV}(\lambda)$ approximates $\mathrm{CKL}(\lambda)$ quite well globally.

Figure 1. Contours and Three-Dimensional Plots for CKL(λ) and GACV(λ).

Using the optimal parameters, we fit the main-effects model and calculate the $L_1$ norm scores for the individual components $\hat{f}_1, \ldots, \hat{f}_{10}$. Figure 2 plots the two sets of $L_1$ norm scores obtained using $\lambda_{\mathrm{CKL}}$ and $\lambda_{\mathrm{ranGACV}}$, in decreasing order. The dashed line indicates the threshold chosen by the proposed sequential Monte Carlo bootstrap test algorithm. Using this threshold, variables $X^{(6)}$, $X^{(3)}$, $X^{(1)}$, and $X^{(8)}$ are correctly selected as "important."

Figure 2. L1 Norm Scores for the Main-Effects Model (CKL; GACV).

Figure 3. Monte Carlo Bootstrap Tests for Simulation 1 [(a) original, boot 1; (b) original, boot 2; (c) original, boot 3; (d) original, boot 4; (e) original, boot 5].

Figure 4. True and Estimated Univariate Logit Components: (a) X1; (b) X3; (c) X6; (d) X8.

Figure 3 depicts the procedure of the sequential bootstrap tests used to determine $q$. We fit the main-effects model using $\lambda_{\mathrm{ranGACV}}$ and sequentially test the hypotheses $H_0: L_{(\eta)} = 0$ versus $H_1: L_{(\eta)} > 0$, $\eta = 1, \ldots, 10$. In each plot of Figure 3, gray circles denote the $L_1$ norms for the variables in the null model,

and black circles denote the $L_1$ norms for those not in the null model. (In a color plot, gray circles are shown in green and black circles in blue.) Along the horizontal axis, the variable being tested for importance is bracketed by a pair of markers. Our experiment shows that the null hypotheses of the first four tests are all rejected at level $\alpha = .05$, based on their Monte Carlo $p$ value $1/51 = .02$. However, the null hypothesis for the fifth component, $f_2$, is accepted, with $p$ value $10/51 = .20$. Thus $f_6$, $f_3$, $f_1$, and $f_8$ are selected as "important" components, and $q = L_{(4)} = .21$.

In addition to selecting important variables, LBP also produces functional estimates for the individual components in the model. Figure 4 plots the true main effects $f_1$, $f_3$, $f_6$, and $f_8$ and their estimates fitted using $\lambda_{\mathrm{ranGACV}}$. In each panel, the dashed line denotes the true curve and the solid line denotes the corresponding estimate. In general, the fitted main-effects model provides a reasonably good estimate of each important component. Altogether we generated 20 datasets and fitted the main-effects model for each dataset, with the regularization parameters tuned separately. Throughout all of these 20 runs, variables $X^{(1)}$, $X^{(3)}$, $X^{(6)}$, and $X^{(8)}$ were always the four top-ranked variables. The results and figures shown here are based on the first dataset.

6.2 Simulation 2: Two-Factor Interaction Model

There are $d = 4$ continuous covariates, independently and uniformly distributed in $[0, 1]$. The true model is a two-factor interaction model, and the important effects are $X^{(1)}$, $X^{(2)}$, and their interaction. The true logit function $f$ is

$$f(x) = 4x^{(1)} + \pi \sin(\pi x^{(1)}) + 6x^{(2)} - 8(x^{(2)})^3 + \cos(2\pi(x^{(1)} - x^{(2)})) - 4.$$

We choose $n = 1{,}000$ and $N = 50$ and use the same perturbation $\varepsilon$ as in the previous example. There are five tuning parameters ($\lambda_\pi$, $\lambda_{\pi\pi}$, $\lambda_s$, $\lambda_{\pi s}$, and $\lambda_{ss}$) in the two-factor interaction model. In practice, extra constraints may be added on the parameters for different needs. Here we penalize all of the two-factor interaction terms equally by setting $\lambda_{\pi\pi} = \lambda_{\pi s} = \lambda_{ss}$. The optimal parameters are $\lambda_{\mathrm{CKL}} = (2^{-10}, 2^{-10}, 2^{-15}, 2^{-10}, 2^{-10})$ and $\lambda_{\mathrm{ranGACV}} = (2^{-20}, 2^{-20}, 2^{-18}, 2^{-20}, 2^{-20})$. Figure 5 plots the ranked $L_1$ norm scores; the dashed line represents the threshold $q$ chosen by the Monte Carlo bootstrap test procedure. The LBP two-factor interaction model using either $\lambda_{\mathrm{CKL}}$ or $\lambda_{\mathrm{GACV}}$ selects all of the important effects, $X^{(1)}$, $X^{(2)}$, and their interaction, correctly.

Figure 5. L1 Norm Scores for the Two-Factor Interaction Model (CKL; GACV).

There is a strong interaction effect between variables $X^{(1)}$ and $X^{(2)}$, which is shown clearly by the cross-sectional plots in Figure 6. The solid lines are the cross-sections of the true logit function $f(x^{(1)}, x^{(2)})$ at the distinct values $x^{(1)} = .2, .5, .8$, and the dashed lines are their corresponding estimates given by the LBP model. The parameters are tuned by the ranGACV criterion.

Figure 6. Cross-Sectional Plots for the Two-Factor Interaction Model: (a) x = .2; (b) x = .5; (c) x = .8 (true; estimate).

6.3 Simulation 3: Main-Effects Model Incorporating Categorical Variables

In this example, among the 12 covariates, $X^{(1)}, \ldots, X^{(10)}$ are continuous and $Z^{(1)}$ and $Z^{(2)}$ are categorical. The continuous variables are uniformly distributed in $[0, 1]$, and the categorical variables are Bernoulli(.5) with values $\{0, 1\}$. The true logit function is

$$f(x) = \frac{4}{3}x^{(1)} + \pi \sin(\pi x^{(3)}) + 8(x^{(6)})^5 + \frac{2}{e - 1}e^{x^{(8)}} + 4z^{(1)} - 7.$$

The important main effects are $X^{(1)}$, $X^{(3)}$, $X^{(6)}$, $X^{(8)}$, and $Z^{(1)}$. The sample size is $n = 1{,}000$, and the basis size is $N = 50$. We use the same perturbation $\varepsilon$ as in the previous examples. The main-effects model incorporating categorical variables in (7) is fitted. Figure 7 plots the ranked $L_1$ norm scores for all of the covariates. The LBP main-effects models using $\lambda_{\mathrm{CKL}}$ and $\lambda_{\mathrm{GACV}}$ both select the important continuous and categorical variables correctly.

Figure 7. L1 Norm Scores for the Main-Effects Model Incorporating Categorical Variables (CKL; GACV).

7. WISCONSIN EPIDEMIOLOGIC STUDY OF DIABETIC RETINOPATHY

The Wisconsin Epidemiologic Study of Diabetic Retinopathy (WESDR) is an ongoing epidemiologic study of a cohort of patients receiving their medical care in southern Wisconsin. All younger-onset diabetic persons (defined as under age 30 at diagnosis and taking insulin) and a probability sample of older-onset persons receiving primary medical care in an 11-county area of southwestern Wisconsin in 1979-1980 were invited to participate. Among 1,210 identified younger-onset patients, 996 agreed to participate in the baseline examination in 1980-1982; of those, 891 participated in the first follow-up examination. Details about the study have been given by Klein, Klein, Moss, Davis, and DeMets (1984a,b, 1989), Klein, Klein, Moss, and Cruickshanks (1998), and others. A large number of medical, demographic, ocular, and other covariates were recorded in each examination. A multilevel retinopathy score was assigned to each eye based on its stereoscopic color fundus photographs. This dataset has been extensively analyzed using a variety of statistical methods (see, e.g., Craig, Fryback, Klein, and Klein 1999; Kim 1995).

In this section we examine the relation of a large number of possible risk factors at baseline to the 4-year progression of diabetic retinopathy. Each person's retinopathy score was defined as the score for the worse eye, and 4-year progression of retinopathy was considered to occur if the retinopathy score degraded two levels from baseline. Wahba et al. (1995) examined risk factors for progression of diabetic retinopathy on a subset of the younger-onset group whose members had no or nonproliferative retinopathy at baseline; that dataset comprised 669 persons. A model of the risk of progression of diabetic retinopathy in this population was built using a SS-ANOVA model (which has a quadratic penalty functional) with the predictor variables glycosylated hemoglobin (gly), duration of diabetes (dur), and body mass index (bmi); these variables are described further in Appendix C. That study began with these variables and two other (nonindependent) variables, age at baseline and age at diagnosis; these latter two were eliminated at the start. Although it was not discussed by Wahba et al. (1995), we report here that that study began with a large number (perhaps about 20) of potential risk factors, which were reduced to gly, dur, and bmi as likely the most important after many extended and laborious parametric and nonparametric regression analyses of small groups of variables at a time, and by linear logistic regression, by the authors and others. At that time it was recognized that a (smoothly) nonparametric model selection method that could rapidly flag important variables in a dataset with many candidate variables was very desirable. For the purposes of the present study, we make the reasonable assumption that gly, dur, and bmi are the "truth" (i.e., the most important risk factors in the analyzed population); thus we are presented with a unique opportunity to examine the behavior of the LBP method in a real dataset where arguably the truth is known, by giving it many variables in this dataset and comparing the results to those of Wahba et al. (1995). Minor corrections and updatings of that dataset have been made (but are not believed to affect the conclusions), and we have 648 persons in the updated dataset used here.

We performed some preliminary winnowing of the many potential prediction variables available, reducing the set for examination to 14 potential risk factors. The continuous covariates are dur, gly, bmi, sys, ret, pulse, ins, sch, and iop, and the categorical covariates are smk, sex, asp, famdb, and mar; the full names for these are given in Appendix C. Because the true $f$ is not known for real data, only ranGACV is available for tuning $\lambda$. Figure 8 plots the $L_1$ norm scores of the individual functional components in the fitted LBP main-effects model. The dashed line indicates the threshold $q = .39$, which is chosen by the sequential bootstrap tests.

Figure 8. L1 Norm Scores for the WESDR Main-Effects Model.

We note that the LBP picks out the three most important variables, gly, dur, and bmi, specified by Wahba et al. (1995). The LBP also chose sch (highest year of school/college completed). This variable frequently shows up in demographic studies when one looks for it, because it is likely a proxy for other variables related to disease, such as lifestyle and quality of medical care. It did show up in preliminary studies of Wahba et al. (1995) (not reported there), but was not included because it was not considered a direct cause of disease itself. The sequential Monte Carlo bootstrap tests for gly, dur, sch, and bmi all have $p$ value $1/51 = .02$; thus these four covariates are selected as important risk factors at the significance level $\alpha = .05$.

Figure 9. Estimated Logit Component for dur.

Figure 9 plots the estimated logit component for dur. The risk of progression of diabetic retinopathy increases up to a duration of about 15 years and decreases thereafter, which generally agrees with the analysis of Wahba et al. (1995). The linear logistic regression model (using the function glm in the R package) shows that dur is not significant at level $\alpha = .05$. The curve in Figure 9 exhibits a hilly shape, which means that a quadratic function fits the curve better than a linear function. We refit the linear logistic model by intentionally including $\mathrm{dur}^2$; the term $\mathrm{dur}^2$ is significant, with a $p$ value of .02. This fact confirms the discovery of the LBP and shows that LBP can be a valid screening tool to aid in determining the appropriate functional form for individual covariates.
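A minimal sketch of such a confirmatory parametric refit (assuming pandas/statsmodels, with a hypothetical DataFrame wesdr holding the named columns; the authors used glm in R):

```python
# Refit the linear logistic model with a quadratic term for dur, mirroring
# the hilly shape that LBP uncovered. wesdr is a hypothetical DataFrame with
# columns y, gly, dur, bmi, and sch.
import statsmodels.formula.api as smf

model = smf.logit("y ~ gly + dur + I(dur**2) + bmi + sch", data=wesdr).fit()
print(model.summary())   # dur^2 enters significantly (p ~ .02 in the paper)
```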

When fitting the two-factor interaction model in (6) with the constraints $\lambda_{\pi\pi} = \lambda_{\pi s} = \lambda_{ss}$, we did not find the dur-bmi interaction of Wahba et al. (1995) here. We note that the interaction terms tend to be washed out if there are only a few interactions. However, further exploratory analysis may be done by rearranging the constraints and/or varying the tuning parameters subjectively. Note that the solution to the optimization problem is very sparse; in this example, we observed that approximately 90% of the coefficients are 0's in the solution.

8. BEAVER DAM EYE STUDY

The Beaver Dam Eye Study (BDES) is an ongoing population-based study of age-related ocular disorders. It aims to collect information related to the prevalence, incidence, and severity of age-related cataract, macular degeneration, and diabetic retinopathy. Between 1987 and 1988, 5,924 eligible people (age 43-84 years) were identified in Beaver Dam, Wisconsin, and of those, 4,926 (83.1%) participated in the baseline exam. Five- and ten-year follow-up data have been collected, and results are being reported. Many variables of various kinds have been collected, including mortality between baseline and the follow-ups. A detailed description of the study has been given by Klein, Klein, Linton, and DeMets (1991); recent reports include that of Klein, Klein, Lee, Cruickshanks, and Chappell (2001).

We are interested in the relation between the 5-year mortality occurrence for the nondiabetic study participants and possible risk factors at baseline. We focus on the nondiabetic participants because the pattern of risk factors for people with diabetes differs from that for the rest of the population. We consider 10 continuous and 8 categorical covariates, the detailed information for which is given in Appendix D. The abbreviated names of the continuous covariates are pky, sch, inc, bmi, glu, cal, chl, hgb, sys, and age, and those of the categorical covariates are cv, sex, hair, hist, nout, mar, sum, and vtm. We deliberately take into account some "noisy" variables in the analysis, including hair, nout, and sum, which are not directly related to mortality in general; we include these to show the performance of the LBP approach, and they are not expected to be picked out eventually by the model. $Y$ is assigned a value of 1 if a person participated in the baseline examination and died before the start of the first 5-year follow-up, and 0 otherwise. There are 4,422 nondiabetic study participants in the baseline examination, 395 of whom have missing data in the covariates. For the purpose of this study we assume that the missing data are missing at random; thus these 395 subjects are not included in our analysis. This assumption is not necessarily valid, because age, blood pressure, body mass index, cholesterol, sex, smoking, and hemoglobin may well affect the missingness, but a further examination of the missingness is beyond the scope of the present study. In addition, we exclude another 10 participants who have either outlier values (pky > 158) or very abnormal records (bmi > 58 or hgb < 6). Thus we report an analysis of the remaining 4,017 nondiabetic participants from the baseline population.

We fit the main-effects model incorporating categorical variables in (7). The sequential Monte Carlo bootstrap tests for six covariates, age, hgb, pky, sex, sys, and cv, all have Monte Carlo $p$ values $1/51 = .02$, whereas the test for glu is not significant, with $p$ value $9/51 = .18$. The threshold is chosen as $q = L_{(6)} = .25$. Figure 10 plots the $L_1$ norm scores for all of the potential risk factors. Using the threshold (dashed line) .25 chosen by the sequential bootstrap test procedure, the LBP model identifies six important risk factors, age, hgb, pky, sex, sys, and cv, for the 5-year mortality.

Figure 10. L1 Norm Scores for the BDES Main-Effects Model.

Compared with the LBP model the linear logistic model withstepwise selection using the AIC implemented by the func-tion glm in R misses the variable sys but selects three morevariables inc bmi and sum Figure 11 depicts the estimatedunivariate logit components for the important continuous vari-ables selected by the LBP model All of the curves can be ap-proximated reasonably well by linear models except sys whosefunctional form exhibits a quadratic shape This explains whysys is not selected by the linear logistic model When we refitthe logistic regression model by including sys2 in the modelthe stepwise selection picked out both sys and sys2

Figure 10. L1 Norm Scores for the BDES Main-Effects Model.

Figure 11. Estimated Univariate Logit Components for Important Variables: (a) age; (b) hgb; (c) pky; (d) sys.

9. DISCUSSION

We have proposed the LBP approach for variable selection in high-dimensional nonparametric model building. In the spirit of the LASSO, LBP produces shrinkage functional estimates by imposing the $l_1$ penalty on the coefficients of the basis functions. Using the proposed measure of importance for the functional components, LBP effectively selects important variables, and the results are highly interpretable. LBP can handle continuous variables and categorical variables simultaneously. Although in this article our continuous variables have all been on subsets of the real line, it is clear that other continuous domains are possible. We have used LBP in the context of the Bernoulli distribution, but it can be extended to other exponential family distributions as well (and, of course, to Gaussian data). We expect that larger numbers of variables than considered here may be handled, and we expect that there will be many other scientific applications of the method. We plan to provide freeware for public use.

We believe that this method is a useful addition to the toolbox of the data analyst. It provides a way to examine the possible effects of a large number of variables in a nonparametric manner, complementary to standard parametric models in its ability to find nonparametric terms that may be missed by parametric methods. It has an advantage over quadratically penalized likelihood methods when the aim is to examine a large number of variables or terms simultaneously, inasmuch as the $l_1$ penalties result in sparse solutions. In practice, it can be an efficient tool for examining complex datasets to identify and prioritize variables (and possibly interactions) for further study. Based only on the variables or interactions identified by the LBP, one can build more traditional parametric or penalized likelihood models, for which confidence intervals and theoretical properties are known.

APPENDIX A: PROOF OF LEMMA 1

For $i = 1, \ldots, n$, we have
$$ -l\bigl(\mu_\lambda^{[-i]}(x_i), \tau\bigr) = -\mu_\lambda^{[-i]}(x_i)\,\tau + b(\tau). $$
Let $f_\lambda^{[-i]}$ be the minimizer of the objective function
$$ \frac{1}{n}\sum_{j \neq i}\bigl[-l\bigl(y_j, f(x_j)\bigr)\bigr] + J_\lambda(f). \tag{A1} $$
Because
$$ \frac{\partial\bigl(-l(\mu_\lambda^{[-i]}(x_i), \tau)\bigr)}{\partial \tau} = -\mu_\lambda^{[-i]}(x_i) + b'(\tau) $$
and
$$ \frac{\partial^2\bigl(-l(\mu_\lambda^{[-i]}(x_i), \tau)\bigr)}{\partial \tau^2} = b''(\tau) > 0, $$
we see that $-l(\mu_\lambda^{[-i]}(x_i), \tau)$ achieves its unique minimum at the $\hat\tau$ that satisfies $b'(\hat\tau) = \mu_\lambda^{[-i]}(x_i)$. So $\hat\tau = f_\lambda^{[-i]}(x_i)$. Then, for any $f$, we have
$$ -l\bigl(\mu_\lambda^{[-i]}(x_i), f_\lambda^{[-i]}(x_i)\bigr) \le -l\bigl(\mu_\lambda^{[-i]}(x_i), f(x_i)\bigr). \tag{A2} $$
Define $\mathbf{y}_{-i} = \bigl(y_1, \ldots, y_{i-1}, \mu_\lambda^{[-i]}(x_i), y_{i+1}, \ldots, y_n\bigr)'$. For any $f$,
$$
\begin{aligned}
I_\lambda(f, \mathbf{y}_{-i}) &= \frac{1}{n}\Bigl\{-l\bigl(\mu_\lambda^{[-i]}(x_i), f(x_i)\bigr) + \sum_{j \neq i}\bigl[-l\bigl(y_j, f(x_j)\bigr)\bigr]\Bigr\} + J_\lambda(f) \\
&\ge \frac{1}{n}\Bigl\{-l\bigl(\mu_\lambda^{[-i]}(x_i), f_\lambda^{[-i]}(x_i)\bigr) + \sum_{j \neq i}\bigl[-l\bigl(y_j, f(x_j)\bigr)\bigr]\Bigr\} + J_\lambda(f) \\
&\ge \frac{1}{n}\Bigl\{-l\bigl(\mu_\lambda^{[-i]}(x_i), f_\lambda^{[-i]}(x_i)\bigr) + \sum_{j \neq i}\bigl[-l\bigl(y_j, f_\lambda^{[-i]}(x_j)\bigr)\bigr]\Bigr\} + J_\lambda\bigl(f_\lambda^{[-i]}\bigr).
\end{aligned}
$$
The first inequality comes from (A2). The second inequality is due to the fact that $f_\lambda^{[-i]}(\cdot)$ is the minimizer of (A1). Thus we have that $h_\lambda\bigl(i, \mu_\lambda^{[-i]}(x_i), \cdot\bigr) = f_\lambda^{[-i]}(\cdot)$.

APPENDIX B: DERIVATION OF ACV

Here we derive the ACV in (11) for the main-effects model; the derivation for two- or higher-order interaction models is similar. Write $\mu(x) = E(Y|X = x)$ and $\sigma^2(x) = \operatorname{var}(Y|X = x)$. For the functional estimate $f$, we define $f_i = f(x_i)$ and $\mu_i = \mu(x_i)$ for $i = 1, \ldots, n$. To emphasize that an estimate is associated with the parameter $\lambda$, we also use $f_{\lambda i} = f_\lambda(x_i)$ and $\mu_{\lambda i} = \mu_\lambda(x_i)$. Now define
$$ \mathrm{OBS}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\bigl[-y_i f_{\lambda i} + b(f_{\lambda i})\bigr]. $$
This is the observed CKL, which, because $y_i$ and $f_{\lambda i}$ are usually positively correlated, tends to underestimate the CKL. To correct this, we use the leave-one-out CV in (9):
$$
\begin{aligned}
\mathrm{CV}(\lambda) &= \frac{1}{n}\sum_{i=1}^{n}\bigl[-y_i f_{\lambda i}^{[-i]} + b(f_{\lambda i})\bigr] \\
&= \mathrm{OBS}(\lambda) + \frac{1}{n}\sum_{i=1}^{n} y_i\bigl(f_{\lambda i} - f_{\lambda i}^{[-i]}\bigr) \\
&= \mathrm{OBS}(\lambda) + \frac{1}{n}\sum_{i=1}^{n} y_i\,\frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} \times \frac{y_i - \mu_{\lambda i}}{1 - \bigl(\mu_{\lambda i} - \mu_{\lambda i}^{[-i]}\bigr)\big/\bigl(y_i - \mu_{\lambda i}^{[-i]}\bigr)}.
\end{aligned} \tag{B1}
$$
For exponential families, we have $\mu_{\lambda i} = b'(f_{\lambda i})$ and $\sigma^2_{\lambda i} = b''(f_{\lambda i})$. If we approximate the finite difference by the functional derivative, then we get
$$ \frac{\mu_{\lambda i} - \mu_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} = \frac{b'(f_{\lambda i}) - b'(f_{\lambda i}^{[-i]})}{y_i - \mu_{\lambda i}^{[-i]}} \approx b''(f_{\lambda i})\,\frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} = \sigma^2_{\lambda i}\,\frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}}. \tag{B2} $$
Denote
$$ G_i \equiv \frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}}; \tag{B3} $$
then from (B2) and (B1) we get
$$ \mathrm{CV}(\lambda) \approx \mathrm{OBS}(\lambda) + \frac{1}{n}\sum_{i=1}^{n} \frac{G_i y_i (y_i - \mu_{\lambda i})}{1 - G_i \sigma^2_{\lambda i}}. \tag{B4} $$

Now we proceed to approximate $G_i$. Consider the main-effects model with
$$ f(x) = b_0 + \sum_{\alpha=1}^{d} b_\alpha k_1\bigl(x^{(\alpha)}\bigr) + \sum_{\alpha=1}^{d}\sum_{j=1}^{N} c_{\alpha j} K_1\bigl(x^{(\alpha)}, x_{j*}^{(\alpha)}\bigr). $$
Let $m = Nd$. Define the vectors $\mathbf{b} = (b_0, b_1, \ldots, b_d)'$ and $\mathbf{c} = (c_{11}, \ldots, c_{dN})' = (c_1, \ldots, c_m)'$. For $\alpha = 1, \ldots, d$, define the $n \times N$ matrix $R_\alpha = \bigl(K_1(x_i^{(\alpha)}, x_{j*}^{(\alpha)})\bigr)$. Let $R = [R_1, \ldots, R_d]$ and
$$ T = \begin{bmatrix} 1 & k_1(x_1^{(1)}) & \cdots & k_1(x_1^{(d)}) \\ \vdots & \vdots & & \vdots \\ 1 & k_1(x_n^{(1)}) & \cdots & k_1(x_n^{(d)}) \end{bmatrix}. $$
Define $\mathbf{f} = (f_1, \ldots, f_n)'$; then $\mathbf{f} = T\mathbf{b} + R\mathbf{c}$. The objective function in (10) can be expressed as
$$
\begin{aligned}
I_\lambda(\mathbf{b}, \mathbf{c}, \mathbf{y}) = {}& \frac{1}{n}\sum_{i=1}^{n}\Biggl[-y_i\Biggl(\sum_{\alpha=1}^{d+1} T_{i\alpha} b_\alpha + \sum_{j=1}^{m} R_{ij} c_j\Biggr) + b\Biggl(\sum_{\alpha=1}^{d+1} T_{i\alpha} b_\alpha + \sum_{j=1}^{m} R_{ij} c_j\Biggr)\Biggr] \\
& + \lambda_\pi \sum_{\alpha=2}^{d+1} |b_\alpha| + \lambda_s \sum_{j=1}^{m} |c_j|.
\end{aligned} \tag{B5}
$$
With the observed response $\mathbf{y}$, we denote the minimizer of (B5) by $(\hat{\mathbf{b}}, \hat{\mathbf{c}})$. When a small perturbation $\boldsymbol{\varepsilon}$ is imposed on the response, we denote the minimizer of $I_\lambda(\mathbf{b}, \mathbf{c}, \mathbf{y} + \boldsymbol{\varepsilon})$ by $(\tilde{\mathbf{b}}, \tilde{\mathbf{c}})$. Then we have $\mathbf{f}_\lambda^{\mathbf{y}} = T\hat{\mathbf{b}} + R\hat{\mathbf{c}}$ and $\mathbf{f}_\lambda^{\mathbf{y}+\boldsymbol{\varepsilon}} = T\tilde{\mathbf{b}} + R\tilde{\mathbf{c}}$. The $l_1$ penalty in (B5) tends to produce sparse solutions; that is, many components of $\hat{\mathbf{b}}$ and $\hat{\mathbf{c}}$ are exactly 0 in the solution. For simplicity of explanation, we assume that the first $s$ components of $\hat{\mathbf{b}}$ and the first $r$ components of $\hat{\mathbf{c}}$ are nonzero; that is,
$$ \hat{\mathbf{b}} = \bigl(\underbrace{\hat b_1, \ldots, \hat b_s}_{\neq 0},\ \underbrace{\hat b_{s+1}, \ldots, \hat b_{d+1}}_{=0}\bigr)' \quad\text{and}\quad \hat{\mathbf{c}} = \bigl(\underbrace{\hat c_1, \ldots, \hat c_r}_{\neq 0},\ \underbrace{\hat c_{r+1}, \ldots, \hat c_m}_{=0}\bigr)'. $$
The general case is similar but notationally more complicated. The 0's in the solutions are robust against a small perturbation in the data; that is, when the magnitude of the perturbation $\boldsymbol{\varepsilon}$ is small enough, the 0 elements will stay at 0. This can be seen by transforming the optimization problem (B5) into a nonlinear programming problem with a differentiable convex objective function and linear constraints and considering the Karush-Kuhn-Tucker (KKT) conditions (see, e.g., Mangasarian 1969). Thus it is reasonable to assume that the solution of (B5) with the perturbed data $\mathbf{y} + \boldsymbol{\varepsilon}$ has the form
$$ \tilde{\mathbf{b}} = \bigl(\underbrace{\tilde b_1, \ldots, \tilde b_s}_{\neq 0},\ \underbrace{\tilde b_{s+1}, \ldots, \tilde b_{d+1}}_{=0}\bigr)' \quad\text{and}\quad \tilde{\mathbf{c}} = \bigl(\underbrace{\tilde c_1, \ldots, \tilde c_r}_{\neq 0},\ \underbrace{\tilde c_{r+1}, \ldots, \tilde c_m}_{=0}\bigr)'. $$
For convenience, we denote the first $s$ components of $\mathbf{b}$ by $\mathbf{b}_*$ and the first $r$ components of $\mathbf{c}$ by $\mathbf{c}_*$. Correspondingly, let $T_*$ be the submatrix containing the first $s$ columns of $T$, and let $R_*$ be the submatrix consisting of the first $r$ columns of $R$. Then
$$ \mathbf{f}_\lambda^{\mathbf{y}+\boldsymbol{\varepsilon}} - \mathbf{f}_\lambda^{\mathbf{y}} = T(\tilde{\mathbf{b}} - \hat{\mathbf{b}}) + R(\tilde{\mathbf{c}} - \hat{\mathbf{c}}) = [\,T_*\ R_*\,]\begin{bmatrix} \tilde{\mathbf{b}}_* - \hat{\mathbf{b}}_* \\ \tilde{\mathbf{c}}_* - \hat{\mathbf{c}}_* \end{bmatrix}. \tag{B6} $$
Now
$$ \biggl[\frac{\partial I_\lambda}{\partial \mathbf{b}_*}, \frac{\partial I_\lambda}{\partial \mathbf{c}_*}\biggr]'\bigg|_{(\tilde{\mathbf{b}}, \tilde{\mathbf{c}}, \mathbf{y}+\boldsymbol{\varepsilon})} = 0 \quad\text{and}\quad \biggl[\frac{\partial I_\lambda}{\partial \mathbf{b}_*}, \frac{\partial I_\lambda}{\partial \mathbf{c}_*}\biggr]'\bigg|_{(\hat{\mathbf{b}}, \hat{\mathbf{c}}, \mathbf{y})} = 0. \tag{B7} $$
The first-order Taylor approximation of $[\partial I_\lambda/\partial \mathbf{b}_*, \partial I_\lambda/\partial \mathbf{c}_*]'$ at $(\tilde{\mathbf{b}}, \tilde{\mathbf{c}}, \mathbf{y}+\boldsymbol{\varepsilon})$ around $(\hat{\mathbf{b}}, \hat{\mathbf{c}}, \mathbf{y})$ gives
$$
\begin{aligned}
\biggl[\frac{\partial I_\lambda}{\partial \mathbf{b}_*}, \frac{\partial I_\lambda}{\partial \mathbf{c}_*}\biggr]'\bigg|_{(\tilde{\mathbf{b}}, \tilde{\mathbf{c}}, \mathbf{y}+\boldsymbol{\varepsilon})} \approx{}& \biggl[\frac{\partial I_\lambda}{\partial \mathbf{b}_*}, \frac{\partial I_\lambda}{\partial \mathbf{c}_*}\biggr]'\bigg|_{(\hat{\mathbf{b}}, \hat{\mathbf{c}}, \mathbf{y})} + \begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}_* \partial \mathbf{b}_*'} & \dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}_* \partial \mathbf{c}_*'} \\ \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}_* \partial \mathbf{b}_*'} & \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}_* \partial \mathbf{c}_*'} \end{bmatrix}\Bigg|_{(\hat{\mathbf{b}}, \hat{\mathbf{c}}, \mathbf{y})} \begin{bmatrix} \tilde{\mathbf{b}}_* - \hat{\mathbf{b}}_* \\ \tilde{\mathbf{c}}_* - \hat{\mathbf{c}}_* \end{bmatrix} \\
& + \begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}_* \partial \mathbf{y}'} \\ \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}_* \partial \mathbf{y}'} \end{bmatrix}\Bigg|_{(\hat{\mathbf{b}}, \hat{\mathbf{c}}, \mathbf{y})} (\mathbf{y} + \boldsymbol{\varepsilon} - \mathbf{y})
\end{aligned} \tag{B8}
$$
when the magnitude of $\boldsymbol{\varepsilon}$ is small. Define
$$ U \equiv \begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}_* \partial \mathbf{b}_*'} & \dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}_* \partial \mathbf{c}_*'} \\ \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}_* \partial \mathbf{b}_*'} & \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}_* \partial \mathbf{c}_*'} \end{bmatrix}\Bigg|_{(\hat{\mathbf{b}}, \hat{\mathbf{c}}, \mathbf{y})} \quad\text{and}\quad V \equiv -\begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}_* \partial \mathbf{y}'} \\ \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}_* \partial \mathbf{y}'} \end{bmatrix}\Bigg|_{(\hat{\mathbf{b}}, \hat{\mathbf{c}}, \mathbf{y})}. $$
From (B7) and (B8), we have
$$ U\begin{bmatrix} \tilde{\mathbf{b}}_* - \hat{\mathbf{b}}_* \\ \tilde{\mathbf{c}}_* - \hat{\mathbf{c}}_* \end{bmatrix} \approx V\boldsymbol{\varepsilon}. $$
Then (B6) gives us $\mathbf{f}_\lambda^{\mathbf{y}+\boldsymbol{\varepsilon}} - \mathbf{f}_\lambda^{\mathbf{y}} \approx H\boldsymbol{\varepsilon}$, where
$$ H \equiv [\,T_*\ R_*\,]\,U^{-1}V. \tag{B9} $$
Now consider a special form of perturbation, $\boldsymbol{\varepsilon}_0 = (0, \ldots, \mu_{\lambda i}^{[-i]} - y_i, \ldots, 0)'$, with the nonzero entry in the $i$th position; then $\mathbf{f}_\lambda^{\mathbf{y}+\boldsymbol{\varepsilon}_0} - \mathbf{f}_\lambda^{\mathbf{y}} \approx \varepsilon_{0i} H_{\cdot i}$, where $\varepsilon_{0i} = \mu_{\lambda i}^{[-i]} - y_i$. Lemma 1 in Section 3 shows that $f_\lambda^{[-i]}$ is the minimizer of $I_\lambda(f, \mathbf{y} + \boldsymbol{\varepsilon}_0)$. Therefore, $G_i$ in (B3) is
$$ G_i = \frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} = \frac{f_{\lambda i}^{[-i]} - f_{\lambda i}}{\varepsilon_{0i}} \approx h_{ii}. $$
From (B4), an ACV score is then
$$ \mathrm{ACV}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\bigl[-y_i f_{\lambda i} + b(f_{\lambda i})\bigr] + \frac{1}{n}\sum_{i=1}^{n} \frac{h_{ii}\,y_i (y_i - \mu_{\lambda i})}{1 - \sigma^2_{\lambda i} h_{ii}}. \tag{B10} $$
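As a concrete illustration of (B10), the small R function below computes the ACV score for Bernoulli data. It is a sketch under stated assumptions, not the authors' implementation: the responses y, the fitted logits f, and the diagonal entries h of the matrix H in (B9) are assumed to be available from the fitting step.

    # Hedged sketch of the ACV score (B10) for Bernoulli data.
    # y: 0/1 responses; f: fitted logits f_lambda(x_i); h: diagonal h_ii of H.
    acv <- function(y, f, h) {
      p      <- plogis(f)                     # mu_lambda,i = b'(f) = e^f / (1 + e^f)
      sigma2 <- p * (1 - p)                   # sigma^2_lambda,i = b''(f)
      obs    <- mean(-y * f + log1p(exp(f)))  # OBS(lambda), with b(f) = log(1 + e^f)
      obs + mean(h * y * (y - p) / (1 - sigma2 * h))
    }

In practice one would evaluate this score on a grid of regularization parameters and keep the minimizer, exactly as described for the GACV and ranGACV criteria in Section 3.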

APPENDIX C: WISCONSIN EPIDEMIOLOGIC STUDY OF DIABETIC RETINOPATHY

• Continuous covariates:
  X1 (dur): duration of diabetes at the time of baseline examination, years
  X2 (gly): glycosylated hemoglobin, a measure of hyperglycemia
  X3 (bmi): body mass index, kg/m^2
  X4 (sys): systolic blood pressure, mm Hg
  X5 (ret): retinopathy level
  X6 (pulse): pulse rate, count for 30 seconds
  X7 (ins): insulin dose, kg/day
  X8 (sch): years of school completed
  X9 (iop): intraocular pressure, mm Hg

• Categorical covariates:
  Z1 (smk): smoking status (0 = no, 1 = any)
  Z2 (sex): gender (0 = female, 1 = male)
  Z3 (asp): use of at least one aspirin for at least three months while diabetic (0 = no, 1 = yes)
  Z4 (famdb): family history of diabetes (0 = none, 1 = yes)
  Z5 (mar): marital status (0 = no, 1 = yes/ever)

APPENDIX D: BEAVER DAM EYE STUDY

• Continuous covariates:
  X1 (pky): pack years smoked, (packs per day/20) x years smoked
  X2 (sch): highest year of school/college completed, years
  X3 (inc): total household personal income, thousands/month
  X4 (bmi): body mass index, kg/m^2
  X5 (glu): glucose (serum), mg/dL
  X6 (cal): calcium (serum), mg/dL
  X7 (chl): cholesterol (serum), mg/dL
  X8 (hgb): hemoglobin (blood), g/dL
  X9 (sys): systolic blood pressure, mm Hg
  X10 (age): age at examination, years

• Categorical covariates:
  Z1 (cv): history of cardiovascular disease (0 = no, 1 = yes)
  Z2 (sex): gender (0 = female, 1 = male)
  Z3 (hair): hair color (0 = blond/red, 1 = brown/black)
  Z4 (hist): history of heavy drinking (0 = never, 1 = past/currently)
  Z5 (nout): winter leisure time (0 = indoors, 1 = outdoors)
  Z6 (mar): marital status (0 = no, 1 = yes/ever)
  Z7 (sum): part of day spent outdoors in summer (0 = <1/4 day, 1 = >1/4 day)
  Z8 (vtm): vitamin use (0 = no, 1 = yes)

[Received July 2002. Revised February 2004.]

REFERENCES

Bakin, S. (1999), "Adaptive Regression and Model Selection in Data Mining Problems," unpublished doctoral thesis, Australian National University.

Chen, S., Donoho, D., and Saunders, M. (1998), "Atomic Decomposition by Basis Pursuit," SIAM Journal on Scientific Computing, 20, 33-61.

Craig, B. A., Fryback, D. G., Klein, R., and Klein, B. (1999), "A Bayesian Approach to Modeling the Natural History of a Chronic Condition From Observations With Intervention," Statistics in Medicine, 18, 1355-1371.

Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation," Numerische Mathematik, 31, 377-403.

Davison, A. C., and Hinkley, D. V. (1997), Bootstrap Methods and Their Application, Cambridge, U.K.: Cambridge University Press.

Fan, J., and Li, R. Z. (2001), "Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties," Journal of the American Statistical Association, 96, 1348-1360.

Ferris, M. C., and Voelker, M. M. (2000), "Slice Models in General Purpose Modeling Systems," Optimization Methods and Software, 17, 1009-1032.

Ferris, M. C., and Voelker, M. M. (2001), "Slice Models in GAMS," in Operations Research Proceedings, eds. P. Chamoni, R. Leisten, A. Martin, J. Minnemann, and H. Stadtler, New York: Springer-Verlag, pp. 239-246.

Frank, I. E., and Friedman, J. H. (1993), "A Statistical View of Some Chemometrics Regression Tools," Technometrics, 35, 109-148.

Fu, W. J. (1998), "Penalized Regressions: The Bridge Versus the LASSO," Journal of Computational and Graphical Statistics, 7, 397-416.

Gao, F., Wahba, G., Klein, R., and Klein, B. (2001), "Smoothing Spline ANOVA for Multivariate Bernoulli Observations, With Application to Ophthalmology Data," Journal of the American Statistical Association, 96, 127-160.

Girard, D. (1998), "Asymptotic Comparison of (Partial) Cross-Validation, GCV and Randomized GCV in Nonparametric Regression," The Annals of Statistics, 26, 315-334.

Gu, C. (2002), Smoothing Spline ANOVA Models, New York: Springer-Verlag.

Gu, C., and Kim, Y. J. (2002), "Penalized Likelihood Regression: General Formulation and Efficient Approximation," Canadian Journal of Statistics, 30, 619-628.

Gunn, S. R., and Kandola, J. S. (2002), "Structural Modelling With Sparse Kernels," Machine Learning, 48, 115-136.

Hastie, T. J., and Tibshirani, R. J. (1990), Generalized Additive Models, New York: Chapman & Hall.

Hutchinson, M. (1989), "A Stochastic Estimator for the Trace of the Influence Matrix for Laplacian Smoothing Splines," Communications in Statistics: Simulation and Computation, 18, 1059-1076.

Kim, K. (1995), "A Bivariate Cumulative Probit Regression Model for Ordered Categorical Data," Statistics in Medicine, 14, 1341-1352.

Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82-95.

Klein, R., Klein, B., Lee, K., Cruickshanks, K., and Chappell, R. (2001), "Changes in Visual Acuity in a Population Over a 10-Year Period: Beaver Dam Eye Study," Ophthalmology, 108, 1757-1766.

Klein, R., Klein, B., Linton, K., and DeMets, D. L. (1991), "The Beaver Dam Eye Study: Visual Acuity," Ophthalmology, 98, 1310-1315.

Klein, R., Klein, B., Moss, S., and Cruickshanks, K. (1998), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: XVII. The 14-Year Incidence and Progression of Diabetic Retinopathy and Associated Risk Factors in Type 1 Diabetes," Ophthalmology, 105, 1801-1815.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1984a), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: II. Prevalence and Risk of Diabetic Retinopathy When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 102, 520-526.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1984b), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: III. Prevalence and Risk of Diabetic Retinopathy When Age at Diagnosis Is 30 or More Years," Archives of Ophthalmology, 102, 527-532.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1989), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: IX. Four-Year Incidence and Progression of Diabetic Retinopathy When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 107, 237-243.

Knight, K., and Fu, W. J. (2000), "Asymptotics for Lasso-Type Estimators," The Annals of Statistics, 28, 1356-1378.

Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., and Klein, B. (2000), "Smoothing Spline ANOVA Models for Large Data Sets With Bernoulli Observations and the Randomized GACV," The Annals of Statistics, 28, 1570-1600.

Linhart, H., and Zucchini, W. (1986), Model Selection, New York: Wiley.

Mangasarian, O. (1969), Nonlinear Programming, New York: McGraw-Hill.

Murtagh, B. A., and Saunders, M. A. (1983), "MINOS 5.5 User's Guide," Technical Report SOL 83-20R, Department of Operations Research, Stanford University.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205-223.

Tibshirani, R. J. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Ser. B, 58, 267-288.

Wahba, G. (1990), Spline Models for Observational Data, Vol. 59, Philadelphia: SIAM.

Wahba, G., Wang, Y., Gu, C., Klein, R., and Klein, B. (1995), "Smoothing Spline ANOVA for Exponential Families, With Application to the Wisconsin Epidemiological Study of Diabetic Retinopathy," The Annals of Statistics, 23, 1865-1895.

Wahba, G., and Wold, S. (1975), "A Completely Automatic French Curve," Communications in Statistics, 4, 1-17.

Xiang, D., and Wahba, G. (1996), "A Generalized Approximate Cross-Validation for Smoothing Splines With Non-Gaussian Data," Statistica Sinica, 6, 675-692.

Xiang, D., and Wahba, G. (1998), "Approximate Smoothing Spline Methods for Large Data Sets in the Binary Case," in Proceedings of the American Statistical Association Joint Statistical Meetings, Biometrics Section, pp. 94-98.

Yau, P., Kohn, R., and Wood, S. (2003), "Bayesian Variable Selection and Model Averaging in High-Dimensional Multinomial Nonparametric Regression," Journal of Computational and Graphical Statistics, 12, 23-54.

Zhang, H. H. (2002), "Nonparametric Variable Selection and Model Building via Likelihood Basis Pursuit," Technical Report 1066, University of Wisconsin-Madison, Dept. of Statistics.

Page 2: Variable Selection and Model Building viawahba/ftp1/zhang.lbp.pdf · Variable Selection and Model Building via ... and model build ing in the context of a smoothing spline ANOVA

660 Journal of the American Statistical Association September 2004

parameters An extension of generalized approximate cross-validation (GACV) proposed by Xiang and Wahba (1996) isderived as a tuning criterion Section 4 proposes the measureof importance for the variables and if desired their interac-tions We develop sequential Monte Carlo bootstrap test algo-rithm to determine the selection threshold Section 5 covers thenumerical computation issue Sections 6ndash8 present several sim-ulation examples and the applications of LBP to two large epi-demiologic studies We carry out a data analysis for the 4-yearrisk of progression of diabetic retinopathy in the WisconsinEpidemiologic Study of Diabetic Retinopathy (WESDR) andfor the 5-year risk of mortality in the Beaver Dam Eye Study(BDES) The final section contains some concluding remarksProofs are relegated to Appendixes A and B

2 LIKELIHOOD BASIS PURSUIT

21 Smoothing Spline ANOVA for Exponential Families

We are interested in estimating the dependence of Y onthe covariates X = (X(1) X(d)) Typically X is in ahigh-dimensional space X = X (1) otimes middot middot middot otimes X (d) where X (α)α = 1 d is some measurable space and otimes denotes the ten-sor product operation Conditioning on x assume that Y isfrom an exponential family with density of form h(y|x) =exp[yf (x) minus b(f (x))a(φ) + c(yφ)] where a gt 0 b andc are known functions f (x) is the parameter of interest depen-dent on x and φ is either known or a nuisance parameter in-dependent of x We denote the observations of Yi |xi sim h(y|xi )i = 1 n by the vector y = (y1 yn)

prime The scaled condi-tional log-likelihood is

L(y f ) = 1

n

nsum

i=1

[minuslyi f (xi)]

equiv 1

n

nsum

i=1

[minusyif (xi) + bf (xi )] (1)

Although the methodology proposed here is general and validfor any exponential family we use the Bernoulli case as ourworking example In Bernoulli data Y takes on values 01with the conditional probability p(x) equiv Pr(Y = 1|X = x)The logit function f (x) = log(

p(x)1minusp(x)

) and the log-likelihoodl(y f ) = yf minus log(1 + ef ) Many parametric approaches suchas those of Tibshirani (1996) Fu (1998) and Fan and Li (2001)assume f (x) to be a linear function of x Instead we allow f

to vary in a high-dimensional function space which leads toa more flexible estimate for the target function In this sec-tion and Section 22 we assume that all of the covariates arecontinuous Later in Section 23 we take categorical variablesinto account Similar to the classical ANOVA for any functionf (x) = f (x(1) x(d)) on a product domain X we can defineits functional ANOVA decomposition as

f (x) = b0 +dsum

α=1

(x(α)

)

+sum

αltβ

fαβ

(x(α) x(β)

)

+ all higher-order interactions (2)

where b0 is constant fαrsquos are the main effects and fαβ rsquosare the two-factor interactions The identifiability of the termsis ensured by side conditions through averaging operatorsWe are to estimate each fα in an reproducing kernel Hilbertspace (RKHS) H(α) each fαβ in the tensor product spaceH(α) otimesH(β) and so on The full model space is then the d th-order tensor product space

otimesdα=1 H(α) Each functional com-

ponent in the decomposition (2) falls in the correspondingsubspace of

otimesdα=1 H(α)

For any continuous covariate x(α) we scale it onto [01] andchoose H(α) to be the second-order Sobolev space W2[01]which is defined as g g(t) gprime(t) are absolutely continu-ous gprimeprime(t) isin L2[01] When endowed with a certain innerproduct W2[01] is an RKHS with the reproducing ker-nel (RK) 1 + K0(s t) + K1(s t) Here K0(s t) = k1(s)k1(t)

and K1(s t) = k2(s)k2(t) minus k4(|s minus t|) with k1(t) = t minus 12

k2(t) = 12 (k2

1(t) minus 112 ) and k4(t) = 1

24 (k41(t) minus 1

2k21(t) + 7

240 )

Notice that H(α) = [1] oplus H(α)π oplus H(α)

s with [1] the constantsubspace H(α)

π the ldquoparametricrdquo subspace generated by K0

consisting of linear functions and H(α)s the ldquononparametricrdquo

subspace generated by K1 consisting of smooth functions Thereproducing kernel of

otimesdα=1 H(α) is

dprod

α=1

(1 + k1

(s(α)

)k1(t(α)

)+ K1(s(α) t(α)

)) (3)

Correspondinglyotimesd

α=1 H(α) can be decomposed into the ten-sor sum of parametric main-effect subspaces and smooth main-effect subspaces and two-factor interaction subspaces of threepossible forms parametric otimes parametric smooth otimes parametricand smooth otimes smooth and similarly for three-factor or higherinteraction subspaces The RKs for these subspaces are given bythe corresponding terms in the expansion of (3) and the termsin (2) can be expanded in terms corresponding to those RKsin the expansion of (3) Therefore our model encompasses thelinear model as a special case (see Wahba 1990 for more de-tails) In various situations the ANOVA decomposition in (2)and in the expansion of (3) is truncated at some point In thiswork we consider truncation for the continuous variables nolater than after the two-factor interactions The remaining RKswill be used to construct an overcomplete set of basis functionsfor the likelihood basis pursuit via details given later

22 Likelihood Basis Pursuit Models

Basis pursuit is a principle for decomposing a signal into anoptimal superposition of dictionary elements where ldquooptimalrdquomeans having the smallest l1 norm of the coefficients amongall such decompositions Chen et al (1998) illustrated atomicdecomposition by basis pursuit in wavelet regression In thisarticle we apply basis pursuit to the negative log-likelihood inthe context of a dictionary based on the SSndashANOVA decom-position then select the important components using the mul-tivariate function estimates Let H be the model space aftertruncation The variational problem for the LBP model is

minfisinH

1

n

nsum

i=1

[minusl(yi f (xi)

)]+ Jλ(f ) (4)

Zhang et al Likelihood Basis Pursuit 661

where Jλ(f ) denotes the l1 norm of the coefficients of the ba-sis functions in the representation of f It is a generalizationof the LASSO penalty in the context of nonparametric mod-els The l1 penalty often produces coefficients that are exactly 0and therefore gives sparse solutions This sparsity helps dis-tinguish important variables from unimportant ones easily andmore effectively (See Tibshirani 1996 and Fan and Li 2001for comparison of the l1 penalty with other forms of penalty)The regularization parameter λ balances the fitness in the like-lihood and the penalty part

For the usual smoothing spline modeling the penalty Jλ isa quadratic norm or seminorm in an RKHS Kimeldorf andWahba (1971) showed that the minimizer fλ for the traditionalsmoothing spline model falls in a finite-dimensional spacealthough the model space is of infinite dimension For the penal-ized likelihood approach with a nonquadratic penalty like thel1 penalty it is very hard to obtain analytic solutions In light ofthe results for the quadratic penalty situation we propose usinga sufficiently large number of basis functions to span the modelspace and estimate the target function in that space If all of thedata x1 xn are used to generate the bases the resultingfunctional space demands intensive computation and the ap-plication is limited for large-scale problems Thus we adopt theparsimonious bases approach used by Xiang and Wahba (1998)Ruppert and Carroll (2000) Lin et al (2000) and Yau et al(2003) Gu and Kim (2001) showed that the number of basisterms can be much smaller than n without degrading the per-formance of the estimation For N le n we subsample N pointsx1lowast xNlowast from the original data and use them to generatebasis functions that then span the model space Hlowast Notice thatwe are not wasting any data resource here because all of thedata points are used for model fitting although only a subset ofthem are selected to generate basis functions

The issue of choosing N and the subsamples is importantIn practice we generally start with a reasonably large N It iswell known that ldquoreasonably largerdquo is not actually very large(see Lin et al 2000) In principle the subspace spanned bythe chosen basis terms needs to be rich enough to provide adecent fit to the true curve In this article we use the simplerandom sampling technique to choose the subsamples Alterna-tively a cluster algorithm may be used as done by Xiang andWahba (1998) and Yau et al (2003) Its basic idea involves twosteps First we group the data into N clusters that have max-imum separation by some good algorithm Then within eachcluster we randomly choose one data point as a representativeto be included in the base pool This scheme usually provideswell-separated subsamples

221 Main-Effects Model The main-effects model alsoknown as the additive model is a sum of d functions ofone variable The function space is the tensor sum of con-stant and the main effect subspaces of

otimesdα=1 H(α) Define

Hlowast = [1]oplusdα=1 spank1(x

(α))K1(x(α) x

(α)jlowast ) j = 1 N equiv

[1]oplusdα=1 H

(α)lowast where k1(middot) and K1(middot middot) are as defined in Sec-

tion 21 Then any component function fα isin H(α)lowast has the rep-resentation fα(x(α)) = bαk1(x

(α)) + sumNj=1 cαjK1(x

(α) x(α)jlowast )

The LBP estimate f isinHlowast is obtained by minimizing

1

n

nsum

i=1

[minusl(yi f (xi )

)]+ λπ

dsum

α=1

|bα| + λs

dsum

α=1

Nsum

j=1

|cαj | (5)

where f (x) = b0 + sumdα=1 fα(x(α)) and (λπ λs) are the regu-

larization parameters Here and in the sequel we have chosen togroup terms of similar types (here ldquoparametricrdquo and ldquosmoothrdquo)and to allow distinct λrsquos for different groups Of coursewe could choose to set λπ = λs Note that λs = infin yields a para-metric model Alternatively we could choose different λrsquos foreach coefficient if we decide to incorporate some prior knowl-edge of certain variables or their effects

222 Two-Factor Interaction Model Two-factor inte-ractions arise in many practical problems (See Hastie andTibshirani 1990 sec 955 or Lin et al 2000 figs 9 and 10for interpretable plots of two-factor interactions with contin-uous variables) In the LBP model the two-factor interactionspace consists of the ldquoparametricrdquo part and the ldquosmoothrdquo partThe parametric part is generated by d parametric main-effectterms and d(dminus1)

2 parametricndashparametric interaction termsThe smooth part is the tensor sum of the subspaces gener-ated by smooth main-effect terms parametricndashsmooth inter-action terms and smoothndashsmooth interaction terms For eachpair α = β the two-factor interaction subspace is H(αβ)lowast =spank1(x

(α))k1(x(β))K1(x

(α) x(α)jlowast )k1(x

(β))k1(x(β)jlowast )K1(x

(α)

x(α)jlowast )K1(x

(β) x(β)

jlowast ) j = 1 N and the interaction term

fαβ(x(α) x(β)) has the representation

fαβ = bαβk1(x(α)

)k1(x(β)

)

+Nsum

j=1

cπsαβjK1

(x(α) x

(α)jlowast

)k1(x(β)

)k1(x

(β)jlowast

)

+Nsum

j=1

cssαβjK1

(x(α) x

(α)jlowast

)K1

(x(β) x

(β)jlowast

)

The whole function space is Hlowast equiv [1]oplusdα=1 H

(α)lowast +oplusβltα H

(αβ)lowast The LBP optimization problem is

minf isinHlowast

1

n

nsum

i=1

[minusl(yi f (xi )

)]+ λπ

(dsum

α=1

|bα|)

+ λππ

(sum

αltβ

|bαβ |)

+ λπs

(sum

α =β

Nsum

j=1

|cπsαβj |

)

+ λs

(dsum

α=1

Nsum

j=1

|cαj |)

+ λss

(sum

αltβ

Nsum

j=1

|cssαβj |

) (6)

Note that different penalties are allowed for the five differenttypes of terms

23 Incorporating Categorical Variables

In real applications some covariates may be categorical suchas sex race and smoking history in many medical studies Inprevious sections we proposed the main-effects model (5) andthe two-factor interaction model (6) for continuous variablesonly Now we generalize these models to incorporate categor-ical variables which are denoted by Z = (Z(1) Z(r)) Forsimplicity we assume that each Z(γ ) has two categories T F for γ = 1 r The following idea is easily extended to the

662 Journal of the American Statistical Association September 2004

situation when some variables have more than two categoriesDefine the mapping γ by

γ

(z(γ )

)=

1

2if z(γ ) = T

minus1

2if z(γ ) = F

Generally the mapping is chosen to make the range of categori-cal variables comparable with that of continuous variables Forany variable with C gt 2 categories C minus 1 contrasts are needed

The main-effects model that incorporates the categori-cal variables is as follows

Minimize

1

n

nsum

i=1

[minusl(yi f (xi zi )

)]+ λπ

(dsum

α=1

|bα| +rsum

γ=1

|Bγ |)

+ λs

dsum

α=1

Nsum

j=1

|cαj | (7)

where f (x z) = b0 + sumdα=1 bαk1(x

(α)) + sumrγ=1 Bγ times

γ (z(γ )) + sumdα=1

sumNj=1 cαjK1(x

(α) x(α)jlowast ) For each γ

the function γ can be considered as the main effect ofthe covariate Z(γ ) Thus we choose to associate the coef-ficients |B|rsquos and |b|rsquos with the same parameter λπ

Adding two-factor interactions with categorical vari-ables to a model that already includes parametric andsmooth terms adds a number of additional terms to thegeneral model Compared with (6) four new types ofterms are involved when we take into account categoricalvariables categorical main effects categoricalndashcategoricalinteractions ldquoparametric continuousrdquondashcategorical interac-tions and ldquosmooth continuousrdquondashcategorical interactionsThe modified two-factor interaction model is as follows

Minimize

1

n

nsum

i=1

[minusl(yi f (xi zi )

)]+ λπ

(dsum

α=1

|bα| +rsum

γ=1

|Bγ |)

+ λππ

(sum

αltβ

|bαβ | +sum

γltθ

|Bγθ | +dsum

α=1

rsum

γ=1

|Pαγ |)

+ λπs

(sum

α =β

Nsum

j=1

|cπsαβj | +

dsum

α=1

rsum

γ=1

Nsum

j=1

|cπsαγj |

)

+ λs

(dsum

α=1

Nsum

j=1

|cαj |)

+ λss

(sum

αltβ

Nsum

j=1

|cssαβj |

) (8)

where

f (x z) = b0 +dsum

α=1

bαk1(x(α)

)+rsum

γ=1

Bγ γ

(z(γ )

)

+sum

αltβ

bαβk1(x(α)

)k1(x(β)

)

+sum

γltθ

Bγ θγ

(z(γ )

(z(θ)

)

+dsum

α=1

rsum

γ=1

Pαγ k1(x(α)

(z(γ )

)

+sum

α =β

Nsum

j=1

cπsαβjK1

(x(α) x

(α)jlowast

)k1(x(β)

)k1(x

(β)jlowast

)

+dsum

α=1

rsum

γ=1

Nsum

j=1

cπsαγjK1

(x(α) x

(α)jlowast

(z(γ )

)

+dsum

α=1

Nsum

j=1

cαjK1(x(α) x

(α)jlowast

)

+sum

αltβ

Nsum

j=1

cssαβjK1

(x(α) x

(α)jlowast

)K1

(x(β) x

(β)jlowast

)

We assign different regularization parameters for main-effect terms parametricndashparametric interaction termsparametricndashsmooth interaction terms and smoothndashsmoothinteraction terms In particular the coefficients |Bγθ |rsquos and|Pαγ |rsquos are associated with the same parameter λππ andthe coefficients |cπs

αγj |rsquos and |cπsαβj |rsquos are associated with

the same parameter λπs

3 GENERALIZED APPROXIMATECROSSndashVALIDATION

Regularization parameter selection has been a very activefield of research Ordinary cross-validation (OCV) (Wahba andWold 1975) generalized cross-validation (GCV) (Craven andWahba 1979) and GACV (Xiang and Wahba 1996) are widelyused in various contexts of smoothing spline models We derivethe GACV to select λrsquos in the LBP models Here we present theGACV score and its derivation only for the main-effects modelthe extension to high-order interaction models is straightfor-ward With an abuse of notation we use λ to represent the col-lective set of tuning parameters in particular λ = (λπ λs) forthe main-effects model and λ = (λπ λππ λπs λs λss) for thetwo-factor interaction model

31 Approximate Cross-Validation

Let p be the ldquotruerdquo but unknown probability function andlet pλ be its estimate associated with λ Similarly let f and micro

denote the true logit and mean functions and let fλ and microλ

denote the corresponding estimates KullbackndashLeibler (KL)distance also known as the relative entropy is often used tomeasure the distance between two probability distributionsFor Bernoulli data we have KL(ppλ) = EX[ 1

2 micro(f minus fλ) minus(b(f )minusb(fλ))] with b(f ) = log(1+ef ) Removing the quan-tity that does not depend on λ from the KL distance expressionwe get the comparative KL (CKL) distance EX[minusmicrofλ + b(fλ)]The ordinary leave-one-out cross validation (CV) for CKL is

CV(λ) = 1

n

nsum

i=1

[minusyif[minusi]λ (xi ) + b(fλ(xi ))

] (9)

where fλ[minusi] is the minimizer of the objective function with

the ith data point omitted CV is commonly used as a roughlyunbiased estimate for CKL (see Xiang and Wahba 1996

Zhang et al Likelihood Basis Pursuit 663

Gu 2002) Direct calculation of CV involves computing n leave-one-out estimates which is expensive and almost infeasible forlarge-scale problems Thus we derive a second-order approxi-mate cross-validation (ACV) score to the CV (See Xiang andWahba 1996 Lin et al 2000 Gao Wahba Klein and Klein2001 for the ACV in traditional smoothing spline models)We first establish the leave-one-out lemma for the LBP mod-els The proof of Lemma 1 is given in Appendix A

Lemma 1 (Leave-One-Out Lemma) Denote the objectivefunction in the LBP model by

Iλ(fy) = L(y f ) + Jλ(f ) (10)

Let fλ[minusi] be the minimizer of Iλ(fy) with the ith obser-

vation omitted and let micro[minusi]λ (middot) be the mean function corre-

sponding to f[minusi]λ (middot) For any v isin R we define the vector

V = (y1 yiminus1 v yi+1 yn)prime Let hλ(i v middot) be the mini-

mizer of Iλ(fV) then hλ(imicro[minusi]λ (xi ) middot) = f

[minusi]λ (middot)

Using Taylor series approximations and Lemma 1 we canderive (in App B) the ACV score

ACV(λ) = 1

n

nsum

i=1

[minusyifλ(xi ) + b(fλ(xi ))]

+ 1

n

nsum

i=1

hii

yi(yi minus microλ(xi))

1 minus σ 2λihii

(11)

where σ 2λi equiv pλ(xi )(1 minus pλ(xi )) and hii is the iith entry of

the matrix H defined in (B9) (more details have been given inZhang 2002) Let W be the diagonal matrix with σ 2

λi in the iithposition By replacing hii with 1

n

sumni=1 hii equiv 1

ntr(H) and re-

placing 1 minus σ 2λihii with 1

ntr[I minus (W12HW12)] in (11) we ob-

tain the GACV score

GACV(λ) = 1

n

nsum

i=1

[minusyifλ(xi) + b(fλ(xi ))]

+ tr(H)

n

sumni=1 yi(yi minus microλ(xi ))

tr[I minus W12HW12] (12)

32 Randomized GACV

Direct computation of (12) involves the inversion of a large-scale matrix whose size depends on sample size n basissize N and dimension d Large values of n N or d may makethe computation expensive and produce unstable solutionsThus the randomized GACV (ranGACV) score is proposed asa computable proxy for GACV Essentially we use the random-ized trace estimates for tr(H) and tr[I minus 1

2 (W12HW12)] basedon the following theorem (which has been exploited by numer-ous authors eg Girard 1998)

If A is any square matrix and ε is a mean 0 ran-dom n-vector with independent components withvariance σ 2

ε then 1σ 2

εEεT Aε = tr(A)

Let ε = (ε1 εn)prime be a mean 0 random n-vector of inde-

pendent components with variance σ 2ε Let f

yλ and f

y+ελ be the

minimizer of (5) using the original data y and the perturbed

data y + ε Using the derivation procedure of Lin et al (2000)we get the ranGACV score for the LBP estimates

ranGACV(λ)

= 1

n

nsum

i=1

[minusyifλ(xi ) + b(fλ(xi ))]

+ εT (fy+ελ minus f

yλ )

n

sumni=1 yi(yi minus microλ(xi ))

εT ε minus εT W(fy+ελ minus f

yλ )

(13)

In addition two ideas help reduce the variance of the secondterm in (13)

1 Choose ε as Bernoulli (5) taking values in +σεminusσεThis guarantees that the randomized trace estimate has theminimal variance given a fixed σ 2

ε (see Hutchinson 1989)2 Generate U independent perturbations ε(u) u = 1 U

and compute U replicated ranGACVs Then use their av-erage to compute the GACV estimate

4 SELECTION CRITERIA FOR MAINndashEFFECTSAND TWOndashFACTOR INTERACTIONS

41 The L1 Importance Measure

After choosing the optimal λ by the GACV or ranGACVcriteria the LBP estimate f

λis obtained by minimizing (5)

(6) (7) or (8) How to measure the importance of a particu-lar component in the fitted model is a key question We pro-pose using the functional L1 norm as the importance measureThe sparsity in the solutions will help distinguish the signifi-cant terms from insignificant ones effectively and thus improvethe performance of our importance measure In practice foreach functional component its L1 norm is empirically calcu-lated as the average of the function values evaluated at all ofthe data points for instance L1(fα) = 1

n

sumni=1 |fα(x

(α)i )| and

L1(fαβ) = 1n

sumni=1 |fαβ(x

(α)i x

(β)i )| For the categorical vari-

ables in model (7) the empirical L1 norm of the main ef-fect fγ is L1(fγ ) = 1

n

sumni=1 |Bγ γ (z

(γ )

i )| for γ = 1 r The norms of the interaction terms involved with categoricalvariables are defined similarly The rank of the L1 norm scoresis used to rank the relative importance of functional compo-nents For instance the component with the largest L1 normis ranked as the most important one and any component witha zero or tiny L1 norm is ranked as an unimportant one We havealso tried using the functional L2 norm to rank the componentsthis gave almost identical results in terms of the set of variablesselected in numerous simulation studies (not reproduced here)

42 Choosing the Threshold

In this section we focus on the main-effects model Usingthe chosen parameter λ we obtain the estimated main-effectscomponents f1 fd and calculate their L1 norms L1(f1)

L1(fd ) We use a sequential procedure to select importantterms Denote the decreasingly ordered norms by L(1) L(d)

and the corresponding components by f(1) f(d) A uni-versal threshold value is needed to differentiate the impor-tant components from unimportant ones Call the threshold q Only variables with their L1 norms greater than or equal to q

are ldquoimportantrdquo

664 Journal of the American Statistical Association September 2004

Now we develop a sequential Monte Carlo bootstrap testprocedure to determine q Essentially we test the variablesrsquoimportance one by one in their L1 norm rank order If onevariable passes the test (hence ldquoimportantrdquo) then it enters thenull model for testing the next variable otherwise the proce-dure stops After the first η (0 le η le d minus 1) variables enterthe model it is a one-sided hypothesis-testing problem to de-cide whether the next component f(η+1) is important or notWhen η = 0 the null model f is the constant say f = b0 andthe hypotheses are H0 L(1) = 0 versus H1 L(1) gt 0 When1 le η le d minus 1 the null model is f = b0 + f(1) + middot middot middot + f(η)and the hypotheses are H0 L(η+1) = 0 versus H1 L(η+1) gt 0Let the desired one-sided test level be α If the null distribu-tion of L(η+1) were knownthen we could get the critical valueα-percentile and make a decision of rejection or acceptanceIn practice calculating the exact α-percentile is difficult or im-possible however the Monte Carlo bootstrap test provides aconvenient approximation to the full test Conditional on theoriginal covariates x1 xn we generate ylowast(η)

1 ylowast(η)n

(response 0 or 1) using the null model f = b0 + f(1) +middot middot middot+ f(η)

as the true logit function We sample a total of T independentsets of data (x1 y

lowast(η)1t ) (xn y

lowast(η)nt ) t = 1 T from the

null model f fit the main-effects model for each set and com-pute L

lowast(η+1)t t = 1 T If exactly k of the simulated Llowast(η+1)

values exceed L(η+1) and none equals it then the Monte Carlop value is k+1

T +1 (See Davison and Hinkley 1997 for an intro-duction to the Monte Carlo bootstrap test)

Sequential Monte Carlo Bootstrap Test Algorithm

Step 1 Let η = 0 and f = b0 Test H0 L(1) = 0 ver-sus H1 L(1) gt 0 Generate T independent sets of

data (x1 ylowast(0)1t ) (xn y

lowast(0)nt ) t = 1 T from

f = b0 Fit the LBP main-effects model and com-pute the Monte Carlo p value p0 If p0 lt α then goto step 2 otherwise stop and define q as any numberslightly larger than L(1)

Step 2 Let η = η + 1 and f = b0 + f(1) + middot middot middot + f(η) TestH0 L(η+1) = 0 versus H1 L(η+1) gt 0 Generate

T independent sets of data (x1 ylowast(η)1t ) (xn y

lowast(η)nt )

based on f fit the main-effects model and com-pute the Monte Carlo p value pη If pη lt α andη lt d minus 1 then repeat step 2 and if pη lt α andη = d minus 1 then go to step 3 otherwise stop and de-fine q = L(η)

Step 3 Stop the procedure and define q = L(d)

The order of entry for sequential testing of the terms beingentertained for the model is determined by the magnitude ofthe component L1 norms There are other reasonable ways todetermine the order of entry and the particular strategy usedcan affect the results as is the case for any stepwise procedureFor the LBP approach the relative ranking among the impor-tant terms usually does not affect the final component selectionsolution as long as the important terms are all ranked higherthan the unimportant terms Thus our procedure usually can dis-tinguish between important and unimportant terms as in mostof our examples When the distinction between important and

unimportant terms is ambiguous with our method the ambigu-ous terms can be recognized and further investigation will beneeded on these terms

5 NUMERICAL COMPUTATION

Because the objective function in either (5) or (6) is not dif-ferentiable with respect to the coefficients brsquos and crsquos somenumerical methods for optimization fail to solve this kind ofproblem We can replace the l1 norms in the objective func-tion by nonnegative variables constrained linearly to be thecorresponding absolute values using standard mathematicalprogramming techniques and then solve a series of programswith nonlinear objective functions and linear constraints Manyoptimization methods can solve such problems We use MINOS(Murtagh and Saunders 1983) because it generally performswell with the linearly constrained models and returns consis-tent results

Consider choosing the optimal λ by selecting the grid pointfor which ranGACV achieves a minimum For each λ tofind ranGACV the program (5) (6) (7) or (8) must be solvedtwicemdashonce with y (the original problem) and once with y + ε

(the perturbed problem) This often results in hundreds orthousands of individual solves depending on the range for λTo obtain solutions in a reasonable amount of time we use anefficient solving approach known as slice modeling (see Ferrisand Voelker 2000 2001) Slice modeling is an approach to solv-ing a series of mathematical programs with the same structurebut different data Because for LBP we can consider the val-ues (λy) to be individual slices of data (because only thesevalues change between solves) the program can be reduced toan example of nonlinear slice modeling By applying slice mod-eling ideas (namely maintaining the common program struc-ture and ldquocorerdquo data shared between solves and using previoussolutions as starting points for late solves) we can improve ef-ficiency and make the grid search feasible We have developedefficient code for the main-effects and two-way interaction LBPmodels The code is easy to use and runs fast For examplein one simulation example with n = 1000 d = 10 and λ fixedthe main-effects model takes less than 2 seconds and the two-way interaction model takes less than 50 seconds

6 SIMULATION

61 Simulation 1 Main-Effects Model

In this example there are a total of d = 10 covariatesX(1) X(10) They are taken to be uniformly distributedin [01] independently The sample size n = 1000 We use thesimple random subsampling technique to select N = 50 basisfunctions The perturbation ε is distributed as Bernoulli(5) tak-ing two values +25minus25 Four variables X(1) X(3) X(6)and X(8) are important and the others are noise variablesThe true conditional logit function is

f (x) = 4

3x(1) + π sin

(πx(3)

)+ 8(x(6)

)5 + 2

e minus 1ex(8) minus 5

We fit the main-effects LBP model and search the parameters(λπ λs) globally Because the true f is known both CKL andranGACV can be used for choosing the λrsquos Figure 1 depicts thevalues of CKL(λ) and ranGACV(λ) as functions of (λπ λs)

Zhang et al Likelihood Basis Pursuit 665

Figure 1 Contours and Three-Dimensional Plots for CKL(λ) and GACV(λ)

within the region of interest [2minus202minus1] times [2minus202minus1] In thetop row are the contours for CKL(λ) and ranGACV(λ) wherethe white cross ldquoxrdquo indicates the location of the optimal regular-ization parameter Here λCKL = (2minus172minus15) and λranGACV =(2minus82minus15) The bottom row shows their three-dimensionalplots In general ranGACV(λ) approximates CKL(λ) quitewell globally

Using the optimal parameters we fit the main-effects modeland calculate the L1 norm scores for the individual componentsf1 f10 Figure 2 plots two sets of L1 norm scores obtainedusing λCKL and λranGACV in their decreasing order The dashedline indicates the threshold chosen by the proposed sequen-tial Monte Carlo bootstrap test algorithm Using this threshold

Figure 2 L1 Norm Scores for the Main-Effects Model ( CKLGACV)

variables X(6) X(3) X(1) and X(8) are selected as ldquoimportantrdquovariables correctly

Figure 3 depicts the procedure of the sequential bootstraptests to determine q We fit the main-effects model usingλranGACV and sequentially test the hypotheses H0 L(η) = 0 ver-sus H1 L(η) gt 0 η = 1 10 In each plot of Figure 3 graycircles denote the L1 norms for the variables in the null model

(a) (b)

(c) (d)

(e)

Figure 3 Monte Carlo Bootstrap Tests for Simulation 1 [(a) orig-inal boot 1 (b) original boot 2 (c) original boot 3(d) original boot 4 (e) original boot 5]

666 Journal of the American Statistical Association September 2004

(a) (b)

(c) (d)

Figure 4 True ( ) and Estimated ( ) Univariate Logit Compo-nents (a) X1 (b) X3 (c) X6 (d) X8

and black circles denote the L1 norms for those not in the nullmodel (In a color plot gray circles are shown in green andblack circles are shown in blue) Along the horizontal axis thevariable being tested for importance is bracketed by a pair of Our experiment shows that the null hypotheses of the first fourtests are all rejected at level α = 05 based on their Monte Carlop value 151

= 02 However the null hypothesis for the fifthcomponent f2 is accepted with the p value 1051

= 20 Thusf6 f3 f1 and f8 are selected as ldquoimportantrdquo components andq = L(4) = 21

In addition to selecting important variables LBP also pro-duces functional estimates for the individual components in themodel Figure 4 plots the true main effects f1 f3 f6 and f8and their estimates fitted using λranGACV In each panel thedashed line denoted the true curve and the solid line denotesthe corresponding estimate In general the fitted main-effectsmodel provides a reasonably good estimate for each importantcomponent Altogether we generated 20 datasets and fitted themain-effects model for each dataset with regularization para-meters tuned separately Throughout all of these 20 runs vari-ables X(1) X(3) X(6) and X(8) are always the four top-rankedvariables The results and figures shown here are based on thefirst dataset

62 Simulation 2 Two-Factor Interaction Model

There are d = 4 continuous covariates independently anduniformly distributed in [01] The true model is a two-factorinteraction model and the important effects are X(1) X(2) andtheir interaction The true logit function f is

f (x) = 4x(1) + π sin(πx(1)

)+ 6x(2)

minus 8(x(2)

)3 + cos(2π

(x(1) minus x(2)

))minus 4

We choose n = 1000 and N = 50 and use the same perturba-tion ε as in the previous example There are five tuning para-meters (λπ λππ λs λπs and λss ) in the two-factor interactionmodel In practice extra constraints may be added on the pa-rameters for different needs Here we penalize all of the two-factor interaction terms equally by setting λππ = λπs = λss The optimal parameters are λCKL = (2minus102minus102minus15

Figure 5 L1 Norm Scores for the Two-Factor Interaction Model( CKL GACV)

2minus102minus10) and λranGACV = (2minus202minus202minus182minus202minus20)Figure 5 plots the ranked L1 norm scores The dashed line rep-resents the threshold q chosen by the Monte Carlo bootstraptest procedure The LBP two-factor interaction model using ei-ther λCKL or λGACV selects all of the important effects X(1)X(2) and their interaction effect correctly

There is a strong interaction effect between variables X(1)

and X(2) which is shown clearly by the cross-sectional plots inFigure 6 The solid lines are the cross-sections of the true logitfunction f (x(1) x(2)) at distinct values x(1) = 2 5 8 and thedashed lines are their corresponding estimates given by the LBPmodel The parameters are tuned by the ranGACV criterion

63 Simulation 3 Main-Effects Model IncorporatingCategorical Variables

In this example among the 12 covariates X(1) X(10) arecontinuous and Z(1) and Z(2) are categorical The continuous

(a) (b) (c)

Figure 6 Cross-Sectional Plots for the Two-Factor Interaction Model(a) x = 2 (b) x = 5 (c) x = 8 ( true estimate)

Zhang et al Likelihood Basis Pursuit 667

Figure 7 L1 Norm Scores for the Main-Effects Model IncorporatingCategorical Variables ( CKL GACV)

variables are uniformly distributed in [01] and the categori-cal variables are Bernoulli(5) with values 01 The true logitfunction is

f (x) = 4

3x(1) + π sin

(πx(3)

)+ 8(x(6)

)5

+ 2

e minus 1ex(8) + 4z(1) minus 7

The important main effects are X(1) X(3) X(6) X(8) and Z(1)The sample size is n = 1000 and the basis size is N = 50We use the same perturbation ε as in the previous exam-ples The main-effects model incorporating categorical vari-ables in (7) is fitted Figure 7 plots the ranked L1 norm scoresfor all of the covariates The LBP main-effects models usingλCKL and λGACV both select the important continuous and cat-egorical variables correctly

7 WISCONSIN EPIDEMIOLOGIC STUDY OFDIABETIC RETINOPATHY

The Wisconsin Epidemiologic Study of Diabetic Retinopa-thy (WESDR) is an ongoing epidemiologic study of a cohortof patients receiving their medical care southern WisconsinAll younger-onset diabetic persons (defined as under age 30of age at diagnosis and taking insulin) and a probability sam-ple of older-onset persons receiving primary medical care in an11-county area of southwestern Wisconsin in 1979ndash1980 wereinvited to participate Among 1210 identified younger onsetpatients 996 agreed to participate in the baseline examinationin 1980ndash1982 of those 891 participated in the first follow-upexamination Details about the study have been given by KleinKlein Moss Davis and DeMets (1984ab 1989) Klein KleinMoss and Cruickshanks (1998) and others A large numberof medical demographic ocular and other covariates wererecorded in each examination A multilevel retinopathy scorewas assigned to each eye based on its stereoscopic color fundusphotographs This dataset has been extensively analyzed usinga variety of statistical methods (see eg Craig Fryback Kleinand Klein 1999 Kim 1995)

In this section we examine the relation of a large numberof possible risk factors at baseline to the 4-year progression ofdiabetic retinopathy Each personrsquos retinopathy score was de-fined as the score for the worse eye and 4-year progression ofretinopathy was considered to occur if the retinopathy score de-graded two levels from baseline Wahba et al (1995) examinedrisk factors for progression of diabetic retinopathy on a subset

of the younger-onset group whose members had no or nonpro-liferative retinopathy at baseline the dataset comprised 669 per-sons A model of the risk of progression of diabetic retinopathyin this population was built using a SSndashANOVA model (whichhas a quadratic penalty functional) using the predictor variablesglycosylated hemoglobin (gly) duration of diabetes (dur) andbody mass index (bmi) these variables are described furtherin Appendix B That study began with these variables and twoother (nonindependent) variables age at baseline and age at di-agnosis these latter two were eliminated at the start Althoughit was not discussed by Wahba et al (1995) we report herethat that study began with a large number (perhaps about 20)of potential risk factors which were reduced to gly dur andbmi as likely the most important after many extended and la-borious parametric and nonparametric regression analyses ofsmall groups of variables at a time and by linear logistic re-gression by the authors and others At that time it was recog-nized that a (smoothly) nonparametric model selection methodthat could rapidly flag important variables in a dataset withmany candidate variables was very desirable For the purposesof the present study we make the reasonable assumption thatgly dur and bmi are the ldquotruthrdquo (ie the most important riskfactors in the analyzed population) and thus we are presentedwith a unique opportunity to examine the behavior of the LBPmethod in a real dataset where arguably the truth is knownby giving it many variables in this dataset and comparing theresults to those of Wahba et al (1995) Minor corrections andupdatings of that dataset have been made (but are not believedto affect the conclusions) and we have 648 persons in the up-dated dataset used here

We performed some preliminary winnowing of the manypotential prediction variables available reducing the set forexamination to 14 potential risk factors The continuous covari-ates are dur gly bmi sys ret pulse ins sch iop and the cate-gorical covariates are smk sex asp famdb mar the full namesfor these are given in Appendix C Because the true f is notknown for real data only ranGACV is available for tuning λFigure 8 plots the L1 norm scores of the individual functionalcomponents in the fitted LBP main-effects model The dashedline indicates the threshold q = 39 which is chosen by the se-quential bootstrap tests

We note that the LBP picks out three most important vari-ables gly dur and bmi specified by Wahba et al (1995)The LBP also chose sch (highest year of schoolcollege com-pleted) This variable frequently shows up in demographic stud-ies when one looks for it because it is likely a proxy for othervariables related to disease such as lifestyle and quality of med-ical care It did show up in preliminary studies of Wahba et al

Figure 8 L1 Norm Scores for the WESDR Main-Effects Model

668 Journal of the American Statistical Association September 2004

Figure 9 Estimated Logit Component for dur

(1995) (not reported there) but was not included because it wasnot considered a direct cause of disease itself The sequentialMonte Carlo bootstrap tests for gly dur sch and bmi all havep value 151

= 02 thus these four covariates are selected asimportant risk factors at the significance level α = 05

Figure 9 plots the estimated logit for dur The risk of progres-sion of diabetic retinopathy increases up to a duration of about15 years then decreases thereafter which generally agrees withthe analysis of Wahba et al (1995) The linear logistic regres-sion model (using the function glm in the R package) shows thatdur is not significant at level α = 05 The curve in Figure 9 ex-hibits a hilly shape which means that a quadratic function fitsthe curve better than a linear function We refit the linear lo-gistic model by intentionally including dur2 the term dur2 issignificant with a p value of 02 This fact confirms the discov-ery of the LBP and shows that LBP can be a valid screeningtool to aid in determining the appropriate functional form forthe individual covariate

When fitting the two-factor interaction model in (6) with theconstraints λππ = λπs = λss we did not find the durndashbmi inter-action of Wahba et al (1995) here We note that the interactionterms tend to be washed out if there are only a few interactionsHowever further exploratory analysis may be done by rearrang-ing the constraints andor varying the tuning parameters subjec-tively Note that the solution to the optimization problem is verysparse In this example we observed that approximately 90 ofthe coefficients are 0rsquos in the solution

8 BEAVER DAM EYE STUDY

The Beaver Dam Eye Study (BDES) is an ongoing population-based study of age-related ocular disorders. It aims to collect information related to the prevalence, incidence, and severity of age-related cataract, macular degeneration, and diabetic retinopathy. Between 1987 and 1988, 5,924 eligible people (age 43–84 years) were identified in Beaver Dam, Wisconsin, and of those, 4,926 (83.1%) participated in the baseline exam. 5- and 10-year follow-up data have been collected, and results are being reported. Many variables of various kinds have been collected, including mortality between baseline and the follow-ups. A detailed description of the study has been given by Klein, Klein, Linton, and DeMets (1991); recent reports include that of Klein, Klein, Lee, Cruickshanks, and Chappell (2001).

We are interested in the relation between the 5-year mortality occurrence for the nondiabetic study participants and possible risk factors at baseline. We focus on the nondiabetic participants because the pattern of risk factors for people with diabetes and the rest of the population differs. We consider 10 continuous and 8 categorical covariates, the detailed information for which is given in Appendix D. The abbreviated names of the continuous covariates are pky, sch, inc, bmi, glu, cal, chl, hgb, sys, and age, and those of the categorical covariates are cv, sex, hair, hist, nout, mar, sum, and vtm. We deliberately take into account some "noisy" variables in the analysis, including hair, nout, and sum, which are not directly related to mortality in general. We include these to show the performance of the LBP approach, and they are not expected to be picked out eventually by the model. Y is assigned a value of 1 if a person participated in the baseline examination and died before the start of the first 5-year follow-up, and 0 otherwise. There are 4,422 nondiabetic study participants in the baseline examination, 395 of whom have missing data in the covariates. For the purpose of this study, we assume that the missing data are missing at random; thus these 395 subjects are not included in our analysis. This assumption is not necessarily valid, because age, blood pressure, body mass index, cholesterol, sex, smoking, and hemoglobin may well affect the missingness, but a further examination of the missingness is beyond the scope of the present study. In addition, we exclude another 10 participants who have either outlier values (pky > 158) or very abnormal records (bmi > 58 or hgb < 6). Thus we report an analysis of the remaining 4,017 nondiabetic participants from the baseline population.
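These inclusion and exclusion rules are easy to reproduce. The following pandas sketch shows one way to assemble the analysis cohort, assuming a hypothetical baseline data frame with covariate columns named as in Appendix D plus a diabetes indicator; the file name and the indicator column are illustrative, not part of the original study's data handling.

    import pandas as pd

    bdes = pd.read_csv("bdes_baseline.csv")       # hypothetical baseline file

    cohort = (bdes[bdes["diabetic"] == 0]         # 4,422 nondiabetic participants
              .dropna()                           # drop the 395 with missing covariates
              .query("pky <= 158 and bmi <= 58 and hgb >= 6"))  # drop the 10 outliers

    assert len(cohort) == 4017                    # the analysis sample reported above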

We fit the main-effects model incorporating categorical variables in (7). The sequential Monte Carlo bootstrap tests for six covariates, age, hgb, pky, sex, sys, and cv, all have Monte Carlo p values 1/51 = .02, whereas the test for glu is not significant, with a p value 9/51 = .18. The threshold is chosen as q = L(6) = .25. Figure 10 plots the L1 norm scores for all of the potential risk factors. Using the threshold (dashed line) .25 chosen by the sequential bootstrap test procedure, the LBP model identifies six important risk factors, age, hgb, pky, sex, sys, and cv, for the 5-year mortality.

Compared with the LBP model, the linear logistic model with stepwise selection using the AIC (implemented with the function glm in R) misses the variable sys but selects three more variables: inc, bmi, and sum. Figure 11 depicts the estimated univariate logit components for the important continuous variables selected by the LBP model. All of the curves can be approximated reasonably well by linear models, except sys, whose functional form exhibits a quadratic shape. This explains why sys is not selected by the linear logistic model. When we refit the logistic regression model by including sys² in the model, the stepwise selection picked out both sys and sys².

Figure 10. L1 Norm Scores for the BDES Main-Effects Model.


Figure 11. Estimated Univariate Logit Components for Important Variables: (a) age; (b) hgb; (c) pky; (d) sys.

9. DISCUSSION

We have proposed the LBP approach for variable selection in high-dimensional nonparametric model building. In the spirit of the LASSO, LBP produces shrinkage functional estimates by imposing the l1 penalty on the coefficients of the basis functions. Using the proposed measure of importance for the functional components, LBP effectively selects important variables, and the results are highly interpretable. LBP can handle continuous variables and categorical variables simultaneously. Although in this article our continuous variables have all been on subsets of the real line, it is clear that other continuous domains are possible. We have used LBP in the context of the Bernoulli distribution, but it can be extended to other exponential family distributions as well (and, of course, to Gaussian data). We expect that larger numbers of variables than considered here may be handled, and we expect that there will be many other scientific applications of the method. We plan to provide freeware for public use.

We believe that this method is a useful addition to the toolbox of the data analyst. It provides a way to examine the possible effects of a large number of variables in a nonparametric manner, complementary to standard parametric models in its ability to find nonparametric terms that may be missed by parametric methods. It has an advantage over quadratically penalized likelihood methods when the aim is to examine a large number of variables or terms simultaneously, inasmuch as the l1 penalties result in sparse solutions. In practice, it can be an efficient tool for examining complex datasets to identify and prioritize variables (and possibly interactions) for further study. Based only on the variables or interactions identified by the LBP, one can build more traditional parametric or penalized likelihood models, for which confidence intervals and theoretical properties are known.

APPENDIX A: PROOF OF LEMMA 1

For $i = 1, \dots, n$, we have $-l(\mu_\lambda^{[-i]}(x_i), \tau) = -\mu_\lambda^{[-i]}(x_i)\,\tau + b(\tau)$. Let $f_\lambda^{[-i]}$ be the minimizer of the objective function

$$\frac{1}{n}\sum_{j \neq i}\big[-l\big(y_j, f(x_j)\big)\big] + J_\lambda(f). \tag{A.1}$$

Because

$$\frac{\partial\big(-l(\mu_\lambda^{[-i]}(x_i), \tau)\big)}{\partial\tau} = -\mu_\lambda^{[-i]}(x_i) + b'(\tau)
\qquad\text{and}\qquad
\frac{\partial^2\big(-l(\mu_\lambda^{[-i]}(x_i), \tau)\big)}{\partial\tau^2} = b''(\tau) > 0,$$

we see that $-l(\mu_\lambda^{[-i]}(x_i), \tau)$ achieves its unique minimum at the $\hat\tau$ that satisfies $b'(\hat\tau) = \mu_\lambda^{[-i]}(x_i)$. So $\hat\tau = f_\lambda^{[-i]}(x_i)$. Then for any $f$, we have

$$-l\big(\mu_\lambda^{[-i]}(x_i), f_\lambda^{[-i]}(x_i)\big) \le -l\big(\mu_\lambda^{[-i]}(x_i), f(x_i)\big). \tag{A.2}$$

Define $\mathbf{y}_{-i} = (y_1, \dots, y_{i-1}, \mu_\lambda^{[-i]}(x_i), y_{i+1}, \dots, y_n)'$. For any $f$,

$$\begin{aligned}
I_\lambda(f, \mathbf{y}_{-i}) &= \frac{1}{n}\Big\{-l\big(\mu_\lambda^{[-i]}(x_i), f(x_i)\big) + \sum_{j \neq i}\big[-l\big(y_j, f(x_j)\big)\big]\Big\} + J_\lambda(f)\\
&\ge \frac{1}{n}\Big\{-l\big(\mu_\lambda^{[-i]}(x_i), f_\lambda^{[-i]}(x_i)\big) + \sum_{j \neq i}\big[-l\big(y_j, f(x_j)\big)\big]\Big\} + J_\lambda(f)\\
&\ge \frac{1}{n}\Big\{-l\big(\mu_\lambda^{[-i]}(x_i), f_\lambda^{[-i]}(x_i)\big) + \sum_{j \neq i}\big[-l\big(y_j, f_\lambda^{[-i]}(x_j)\big)\big]\Big\} + J_\lambda\big(f_\lambda^{[-i]}\big).
\end{aligned}$$

The first inequality comes from (A.2). The second inequality is due to the fact that $f_\lambda^{[-i]}(\cdot)$ is the minimizer of (A.1). Thus we have that $h_\lambda\big(i, \mu_\lambda^{[-i]}(x_i), \cdot\big) = f_\lambda^{[-i]}(\cdot)$.
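For the Bernoulli family used throughout this article, the stationarity step can be made concrete: with $b(\tau) = \log(1 + e^\tau)$,

$$b'(\tau) = \frac{e^\tau}{1 + e^\tau}, \qquad b''(\tau) = \frac{e^\tau}{(1 + e^\tau)^2} > 0,$$

so $b'(\hat\tau) = \mu_\lambda^{[-i]}(x_i)$ inverts to $\hat\tau = \log\big(\mu_\lambda^{[-i]}(x_i)\big/\big(1 - \mu_\lambda^{[-i]}(x_i)\big)\big)$, the logit of the leave-one-out mean, consistent with $\hat\tau = f_\lambda^{[-i]}(x_i)$.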

APPENDIX B: DERIVATION OF ACV

Here we derive the ACV in (11) for the main-effects model; the derivation for two- or higher-order interaction models is similar. Write $\mu(x) = E(Y|X = x)$ and $\sigma^2(x) = \mathrm{var}(Y|X = x)$. For the functional estimate $f$, we define $f_i = f(x_i)$ and $\mu_i = \mu(x_i)$ for $i = 1, \dots, n$. To emphasize that an estimate is associated with the parameter $\lambda$, we also use $f_{\lambda i} = f_\lambda(x_i)$ and $\mu_{\lambda i} = \mu_\lambda(x_i)$. Now define

$$\mathrm{OBS}(\lambda) = \frac{1}{n}\sum_{i=1}^n \big[-y_i f_{\lambda i} + b(f_{\lambda i})\big].$$

This is the observed CKL, which, because $y_i$ and $f_{\lambda i}$ are usually positively correlated, tends to underestimate the CKL. To correct this, we use the leave-one-out CV in (9),

$$\begin{aligned}
\mathrm{CV}(\lambda) &= \frac{1}{n}\sum_{i=1}^n \big[-y_i f_{\lambda i}^{[-i]} + b(f_{\lambda i})\big]\\
&= \mathrm{OBS}(\lambda) + \frac{1}{n}\sum_{i=1}^n y_i\big(f_{\lambda i} - f_{\lambda i}^{[-i]}\big)\\
&= \mathrm{OBS}(\lambda) + \frac{1}{n}\sum_{i=1}^n y_i\,\frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} \times \frac{y_i - \mu_{\lambda i}}{1 - \big(\mu_{\lambda i} - \mu_{\lambda i}^{[-i]}\big)\big/\big(y_i - \mu_{\lambda i}^{[-i]}\big)}.
\end{aligned} \tag{B.1}$$

For exponential families, we have $\mu_{\lambda i} = b'(f_{\lambda i})$ and $\sigma^2_{\lambda i} = b''(f_{\lambda i})$. If we approximate the finite difference by the functional derivative, then we get

$$\frac{\mu_{\lambda i} - \mu_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} = \frac{b'(f_{\lambda i}) - b'(f_{\lambda i}^{[-i]})}{y_i - \mu_{\lambda i}^{[-i]}} \approx b''(f_{\lambda i})\,\frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} = \sigma^2_{\lambda i}\,\frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}}. \tag{B.2}$$

Denote

$$G_i \equiv \frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}}; \tag{B.3}$$

then from (B.2) and (B.1) we get

$$\mathrm{CV}(\lambda) \approx \mathrm{OBS}(\lambda) + \frac{1}{n}\sum_{i=1}^n \frac{G_i\, y_i\,(y_i - \mu_{\lambda i})}{1 - G_i\,\sigma^2_{\lambda i}}. \tag{B.4}$$

Now we proceed to approximate $G_i$. Consider the main-effects model with

$$f(x) = b_0 + \sum_{\alpha=1}^d b_\alpha k_1\big(x^{(\alpha)}\big) + \sum_{\alpha=1}^d \sum_{j=1}^N c_{\alpha j} K_1\big(x^{(\alpha)}, x^{(\alpha)}_{j*}\big).$$

Let $m = Nd$. Define the vectors $\mathbf{b} = (b_0, b_1, \dots, b_d)'$ and $\mathbf{c} = (c_{11}, \dots, c_{dN})' = (c_1, \dots, c_m)'$. For $\alpha = 1, \dots, d$, define the $n \times N$ matrix $R_\alpha = \big(K_1\big(x_i^{(\alpha)}, x_{j*}^{(\alpha)}\big)\big)$. Let $R = [R_1, \dots, R_d]$ and

$$T = \begin{pmatrix} 1 & k_1\big(x_1^{(1)}\big) & \cdots & k_1\big(x_1^{(d)}\big)\\ \vdots & \vdots & & \vdots\\ 1 & k_1\big(x_n^{(1)}\big) & \cdots & k_1\big(x_n^{(d)}\big) \end{pmatrix}.$$

Define $\mathbf{f} = (f_1, \dots, f_n)'$; then $\mathbf{f} = T\mathbf{b} + R\mathbf{c}$. The objective function in (10) can be expressed as

$$I_\lambda(\mathbf{b}, \mathbf{c}, \mathbf{y}) = \frac{1}{n}\sum_{i=1}^n \Bigg[-y_i\Bigg(\sum_{\alpha=1}^{d+1} T_{i\alpha} b_\alpha + \sum_{j=1}^m R_{ij} c_j\Bigg) + b\Bigg(\sum_{\alpha=1}^{d+1} T_{i\alpha} b_\alpha + \sum_{j=1}^m R_{ij} c_j\Bigg)\Bigg] + \lambda_\pi \sum_{\alpha=2}^{d+1} |b_\alpha| + \lambda_s \sum_{j=1}^m |c_j|. \tag{B.5}$$

With the observed response $\mathbf{y}$, we denote the minimizer of (B.5) by $(\hat{\mathbf{b}}, \hat{\mathbf{c}})$. When a small perturbation $\boldsymbol\varepsilon$ is imposed on the response, we denote the minimizer of $I_\lambda(\mathbf{b}, \mathbf{c}, \mathbf{y} + \boldsymbol\varepsilon)$ by $(\tilde{\mathbf{b}}, \tilde{\mathbf{c}})$. Then we have $\mathbf{f}_\lambda^{\mathbf{y}} = T\hat{\mathbf{b}} + R\hat{\mathbf{c}}$ and $\mathbf{f}_\lambda^{\mathbf{y}+\boldsymbol\varepsilon} = T\tilde{\mathbf{b}} + R\tilde{\mathbf{c}}$. The $l_1$ penalty in (B.5) tends to produce sparse solutions; that is, many components of $\hat{\mathbf{b}}$ and $\hat{\mathbf{c}}$ are exactly 0 in the solution. For simplicity of explanation, we assume that the first $s$ components of $\hat{\mathbf{b}}$ and the first $r$ components of $\hat{\mathbf{c}}$ are nonzero; that is,

$$\hat{\mathbf{b}} = (\underbrace{\hat b_1, \dots, \hat b_s}_{\neq 0}, \underbrace{\hat b_{s+1}, \dots, \hat b_{d+1}}_{=0})' \qquad\text{and}\qquad \hat{\mathbf{c}} = (\underbrace{\hat c_1, \dots, \hat c_r}_{\neq 0}, \underbrace{\hat c_{r+1}, \dots, \hat c_m}_{=0})'.$$

The general case is similar but notationally more complicated. The 0's in the solutions are robust against a small perturbation in the data. That is, when the magnitude of the perturbation $\boldsymbol\varepsilon$ is small enough, the 0 elements will stay at 0. This can be seen by transforming the optimization problem (B.5) into a nonlinear programming problem with a differentiable convex objective function and linear constraints, and considering the Karush–Kuhn–Tucker (KKT) conditions (see, e.g., Mangasarian 1969). Thus it is reasonable to assume that the solution of (B.5) with the perturbed data $\mathbf{y} + \boldsymbol\varepsilon$ has the form

$$\tilde{\mathbf{b}} = (\underbrace{\tilde b_1, \dots, \tilde b_s}_{\neq 0}, \underbrace{\tilde b_{s+1}, \dots, \tilde b_{d+1}}_{=0})' \qquad\text{and}\qquad \tilde{\mathbf{c}} = (\underbrace{\tilde c_1, \dots, \tilde c_r}_{\neq 0}, \underbrace{\tilde c_{r+1}, \dots, \tilde c_m}_{=0})'.$$

For convenience, we denote the first $s$ components of $\mathbf{b}$ by $\mathbf{b}^*$ and the first $r$ components of $\mathbf{c}$ by $\mathbf{c}^*$. Correspondingly, let $T_*$ be the submatrix containing the first $s$ columns of $T$, and let $R_*$ be the submatrix consisting of the first $r$ columns of $R$. Then

$$\mathbf{f}_\lambda^{\mathbf{y}+\boldsymbol\varepsilon} - \mathbf{f}_\lambda^{\mathbf{y}} = T(\tilde{\mathbf{b}} - \hat{\mathbf{b}}) + R(\tilde{\mathbf{c}} - \hat{\mathbf{c}}) = [T_*\ R_*]\begin{bmatrix} \tilde{\mathbf{b}}^* - \hat{\mathbf{b}}^*\\ \tilde{\mathbf{c}}^* - \hat{\mathbf{c}}^* \end{bmatrix}. \tag{B.6}$$

Now

$$\bigg[\frac{\partial I_\lambda}{\partial \mathbf{b}^*}, \frac{\partial I_\lambda}{\partial \mathbf{c}^*}\bigg]'_{(\tilde{\mathbf{b}}, \tilde{\mathbf{c}}, \mathbf{y}+\boldsymbol\varepsilon)} = \mathbf{0} \qquad\text{and}\qquad \bigg[\frac{\partial I_\lambda}{\partial \mathbf{b}^*}, \frac{\partial I_\lambda}{\partial \mathbf{c}^*}\bigg]'_{(\hat{\mathbf{b}}, \hat{\mathbf{c}}, \mathbf{y})} = \mathbf{0}. \tag{B.7}$$

The first-order Taylor approximation of $[\partial I_\lambda/\partial\mathbf{b}^*, \partial I_\lambda/\partial\mathbf{c}^*]'_{(\tilde{\mathbf{b}}, \tilde{\mathbf{c}}, \mathbf{y}+\boldsymbol\varepsilon)}$ at $(\hat{\mathbf{b}}, \hat{\mathbf{c}}, \mathbf{y})$ gives

$$\begin{aligned}
\bigg[\frac{\partial I_\lambda}{\partial \mathbf{b}^*}, \frac{\partial I_\lambda}{\partial \mathbf{c}^*}\bigg]'_{(\tilde{\mathbf{b}}, \tilde{\mathbf{c}}, \mathbf{y}+\boldsymbol\varepsilon)} &\approx \bigg[\frac{\partial I_\lambda}{\partial \mathbf{b}^*}, \frac{\partial I_\lambda}{\partial \mathbf{c}^*}\bigg]'_{(\hat{\mathbf{b}}, \hat{\mathbf{c}}, \mathbf{y})} + \begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}^* \partial \mathbf{b}^{*\prime}} & \dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}^* \partial \mathbf{c}^{*\prime}}\\[4pt] \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}^* \partial \mathbf{b}^{*\prime}} & \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}^* \partial \mathbf{c}^{*\prime}} \end{bmatrix}_{(\hat{\mathbf{b}}, \hat{\mathbf{c}}, \mathbf{y})} \begin{bmatrix} \tilde{\mathbf{b}}^* - \hat{\mathbf{b}}^*\\ \tilde{\mathbf{c}}^* - \hat{\mathbf{c}}^* \end{bmatrix}\\
&\quad + \begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}^* \partial \mathbf{y}'}\\[4pt] \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}^* \partial \mathbf{y}'} \end{bmatrix}_{(\hat{\mathbf{b}}, \hat{\mathbf{c}}, \mathbf{y})} (\mathbf{y} + \boldsymbol\varepsilon - \mathbf{y}),
\end{aligned} \tag{B.8}$$

when the magnitude of $\boldsymbol\varepsilon$ is small. Define

$$U \equiv \begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}^* \partial \mathbf{b}^{*\prime}} & \dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}^* \partial \mathbf{c}^{*\prime}}\\[4pt] \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}^* \partial \mathbf{b}^{*\prime}} & \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}^* \partial \mathbf{c}^{*\prime}} \end{bmatrix}_{(\hat{\mathbf{b}}, \hat{\mathbf{c}}, \mathbf{y})} \qquad\text{and}\qquad V \equiv -\begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}^* \partial \mathbf{y}'}\\[4pt] \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}^* \partial \mathbf{y}'} \end{bmatrix}_{(\hat{\mathbf{b}}, \hat{\mathbf{c}}, \mathbf{y})}.$$

From (B.7) and (B.8), we have

$$U\begin{bmatrix} \tilde{\mathbf{b}}^* - \hat{\mathbf{b}}^*\\ \tilde{\mathbf{c}}^* - \hat{\mathbf{c}}^* \end{bmatrix} \approx V\boldsymbol\varepsilon.$$

Then (B.6) gives us $\mathbf{f}_\lambda^{\mathbf{y}+\boldsymbol\varepsilon} - \mathbf{f}_\lambda^{\mathbf{y}} \approx H\boldsymbol\varepsilon$, where

$$H \equiv [T_*\ R_*]\, U^{-1} V. \tag{B.9}$$

Now consider a special form of perturbation, $\boldsymbol\varepsilon_0 = \big(0, \dots, 0, \mu_{\lambda i}^{[-i]} - y_i, 0, \dots, 0\big)'$, with the nonzero entry in the $i$th position; then $\mathbf{f}_\lambda^{\mathbf{y}+\boldsymbol\varepsilon_0} - \mathbf{f}_\lambda^{\mathbf{y}} \approx \varepsilon_{0i} H_{\cdot i}$, where $\varepsilon_{0i} = \mu_{\lambda i}^{[-i]} - y_i$. Lemma 1 in Section 3 shows that $f_\lambda^{[-i]}$ is the minimizer of $I(f, \mathbf{y} + \boldsymbol\varepsilon_0)$. Therefore, $G_i$ in (B.3) is

$$G_i = \frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} = \frac{f_{\lambda i}^{[-i]} - f_{\lambda i}}{\varepsilon_{0i}} \approx h_{ii}.$$

From (B.4), an ACV score is then

$$\mathrm{ACV}(\lambda) = \frac{1}{n}\sum_{i=1}^n \big[-y_i f_{\lambda i} + b(f_{\lambda i})\big] + \frac{1}{n}\sum_{i=1}^n \frac{h_{ii}\, y_i\, (y_i - \mu_{\lambda i})}{1 - \sigma^2_{\lambda i} h_{ii}}. \tag{B.10}$$
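As a numerical sketch, for Bernoulli data (where $b(f) = \log(1 + e^f)$) the ACV score (B.10) can be computed directly from the fitted logits and the diagonal of $H$; the sketch below assumes these inputs are already available from the fitted model.

    import numpy as np

    def acv(y, f_hat, h_diag):
        """ACV score (B.10) for Bernoulli data.

        y      : 0/1 responses
        f_hat  : fitted logits f_lambda(x_i)
        h_diag : diagonal entries h_ii of the matrix H in (B.9)
        """
        mu = 1.0 / (1.0 + np.exp(-f_hat))          # mu_i = b'(f_i)
        sigma2 = mu * (1.0 - mu)                   # sigma_i^2 = b''(f_i)
        obs = np.mean(-y * f_hat + np.log1p(np.exp(f_hat)))   # OBS(lambda)
        return obs + np.mean(h_diag * y * (y - mu) / (1.0 - sigma2 * h_diag))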

APPENDIX C: WISCONSIN EPIDEMIOLOGIC STUDY OF DIABETIC RETINOPATHY

• Continuous covariates:
  X1 (dur): duration of diabetes at the time of baseline examination, years
  X2 (gly): glycosylated hemoglobin, a measure of hyperglycemia
  X3 (bmi): body mass index, kg/m²
  X4 (sys): systolic blood pressure, mm Hg
  X5 (ret): retinopathy level
  X6 (pulse): pulse rate, count for 30 seconds
  X7 (ins): insulin dose, kg/day
  X8 (sch): years of school completed
  X9 (iop): intraocular pressure, mm Hg

• Categorical covariates:
  Z1 (smk): smoking status (0 = no, 1 = any)
  Z2 (sex): gender (0 = female, 1 = male)
  Z3 (asp): use of at least one aspirin for at least three months while diabetic (0 = no, 1 = yes)
  Z4 (famdb): family history of diabetes (0 = none, 1 = yes)
  Z5 (mar): marital status (0 = no, 1 = yes/ever)

APPENDIX D: BEAVER DAM EYE STUDY

• Continuous covariates:
  X1 (pky): pack years smoked, (packs per day/20) × years smoked
  X2 (sch): highest year of school/college completed, years
  X3 (inc): total household personal income, thousands/month
  X4 (bmi): body mass index, kg/m²
  X5 (glu): glucose (serum), mg/dL
  X6 (cal): calcium (serum), mg/dL
  X7 (chl): cholesterol (serum), mg/dL
  X8 (hgb): hemoglobin (blood), g/dL
  X9 (sys): systolic blood pressure, mm Hg
  X10 (age): age at examination, years

• Categorical covariates:
  Z1 (cv): history of cardiovascular disease (0 = no, 1 = yes)
  Z2 (sex): gender (0 = female, 1 = male)
  Z3 (hair): hair color (0 = blond/red, 1 = brown/black)
  Z4 (hist): history of heavy drinking (0 = never, 1 = past/currently)
  Z5 (nout): winter leisure time (0 = indoors, 1 = outdoors)
  Z6 (mar): marital status (0 = no, 1 = yes/ever)
  Z7 (sum): part of day spent outdoors in summer (0 = < 1/4 day, 1 = > 1/4 day)
  Z8 (vtm): vitamin use (0 = no, 1 = yes)

[Received July 2002. Revised February 2004.]

REFERENCES

Bakin, S. (1999), "Adaptive Regression and Model Selection in Data Mining Problems," unpublished doctoral thesis, Australian National University.

Chen, S., Donoho, D., and Saunders, M. (1998), "Atomic Decomposition by Basis Pursuit," SIAM Journal on Scientific Computing, 20, 33–61.

Craig, B. A., Fryback, D. G., Klein, R., and Klein, B. (1999), "A Bayesian Approach to Modeling the Natural History of a Chronic Condition From Observations With Intervention," Statistics in Medicine, 18, 1355–1371.

Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation," Numerische Mathematik, 31, 377–403.

Davison, A. C., and Hinkley, D. V. (1997), Bootstrap Methods and Their Application, Cambridge, U.K.: Cambridge University Press.

Fan, J., and Li, R. Z. (2001), "Variable Selection via Penalized Likelihood," Journal of the American Statistical Association, 96, 1348–1360.

Ferris, M. C., and Voelker, M. M. (2000), "Slice Models in General Purpose Modeling Systems," Optimization Methods and Software, 17, 1009–1032.

Ferris, M. C., and Voelker, M. M. (2001), "Slice Models in GAMS," in Operations Research Proceedings, eds. P. Chamoni, R. Leisten, A. Martin, J. Minnemann, and H. Stadtler, New York: Springer-Verlag, pp. 239–246.

Frank, I. E., and Friedman, J. H. (1993), "A Statistical View of Some Chemometrics Regression Tools," Technometrics, 35, 109–148.

Fu, W. J. (1998), "Penalized Regression: The Bridge versus the LASSO," Journal of Computational and Graphical Statistics, 7, 397–416.

Gao, F., Wahba, G., Klein, R., and Klein, B. (2001), "Smoothing Spline ANOVA for Multivariate Bernoulli Observations With Application to Ophthalmology Data," Journal of the American Statistical Association, 96, 127–160.

Girard, D. (1998), "Asymptotic Comparison of (Partial) Cross-Validation, GCV and Randomized GCV in Nonparametric Regression," The Annals of Statistics, 26, 315–334.

Gu, C. (2002), Smoothing Spline ANOVA Models, New York: Springer-Verlag.

Gu, C., and Kim, Y. J. (2002), "Penalized Likelihood Regression: General Formulation and Efficient Approximation," Canadian Journal of Statistics, 30, 619–628.

Gunn, S. R., and Kandola, J. S. (2002), "Structural Modelling With Sparse Kernels," Machine Learning, 48, 115–136.

Hastie, T. J., and Tibshirani, R. J. (1990), Generalized Additive Models, New York: Chapman & Hall.

Hutchinson, M. (1989), "A Stochastic Estimator for the Trace of the Influence Matrix for Laplacian Smoothing Splines," Communications in Statistics: Simulation, 18, 1059–1076.

Kim, K. (1995), "A Bivariate Cumulative Probit Regression Model for Ordered Categorical Data," Statistics in Medicine, 14, 1341–1352.

Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82–95.

Klein, R., Klein, B., Lee, K., Cruickshanks, K., and Chappell, R. (2001), "Changes in Visual Acuity in a Population Over a 10-Year Period: Beaver Dam Eye Study," Ophthalmology, 108, 1757–1766.

Klein, R., Klein, B., Linton, K., and DeMets, D. L. (1991), "The Beaver Dam Eye Study: Visual Acuity," Ophthalmology, 98, 1310–1315.

Klein, R., Klein, B., Moss, S., and Cruickshanks, K. (1998), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: XVII. The 14-Year Incidence and Progression of Diabetic Retinopathy and Associated Risk Factors in Type 1 Diabetes," Ophthalmology, 105, 1801–1815.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1984a), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: II. Prevalence and Risk of Diabetes When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 102, 520–526.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1984b), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: III. Prevalence and Risk of Diabetes When Age at Diagnosis Is 30 or More Years," Archives of Ophthalmology, 102, 527–532.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1989), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: IX. Four-Year Incidence and Progression of Diabetic Retinopathy When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 107, 237–243.

Knight, K., and Fu, W. J. (2000), "Asymptotics for Lasso-Type Estimators," The Annals of Statistics, 28, 1356–1378.

Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., and Klein, B. (2000), "Smoothing Spline ANOVA Models for Large Data Sets With Bernoulli Observations and the Randomized GACV," The Annals of Statistics, 28, 1570–1600.

Linhart, H., and Zucchini, W. (1986), Model Selection, New York: Wiley.

Mangasarian, O. (1969), Nonlinear Programming, New York: McGraw-Hill.

Murtagh, B. A., and Saunders, M. A. (1983), "MINOS 5.5 User's Guide," Technical Report SOL 83-20R, Dept. of Operations Research, Stanford University.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 45, 205–223.

Tibshirani, R. J. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Ser. B, 58, 267–288.

Wahba, G. (1990), Spline Models for Observational Data, Vol. 59, Philadelphia: SIAM.

Wahba, G., Wang, Y., Gu, C., Klein, R., and Klein, B. (1995), "Smoothing Spline ANOVA for Exponential Families With Application to the Wisconsin Epidemiological Study of Diabetic Retinopathy," The Annals of Statistics, 23, 1865–1895.

Wahba, G., and Wold, S. (1975), "A Completely Automatic French Curve," Communications in Statistics, 4, 1–17.

Xiang, D., and Wahba, G. (1996), "A Generalized Approximate Cross-Validation for Smoothing Splines With Non-Gaussian Data," Statistica Sinica, 6, 675–692.

Xiang, D., and Wahba, G. (1998), "Approximate Smoothing Spline Methods for Large Data Sets in the Binary Case," in Proceedings of the American Statistical Association Joint Statistical Meetings, Biometrics Section, pp. 94–98.

Yau, P., Kohn, R., and Wood, S. (2003), "Bayesian Variable Selection and Model Averaging in High-Dimensional Multinomial Nonparametric Regression," Journal of Computational and Graphical Statistics, 12, 23–54.

Zhang, H. H. (2002), "Nonparametric Variable Selection and Model Building via Likelihood Basis Pursuit," Technical Report 1066, Dept. of Statistics, University of Wisconsin-Madison.

Page 3: Variable Selection and Model Building viawahba/ftp1/zhang.lbp.pdf · Variable Selection and Model Building via ... and model build ing in the context of a smoothing spline ANOVA

Zhang et al Likelihood Basis Pursuit 661

where Jλ(f ) denotes the l1 norm of the coefficients of the ba-sis functions in the representation of f It is a generalizationof the LASSO penalty in the context of nonparametric mod-els The l1 penalty often produces coefficients that are exactly 0and therefore gives sparse solutions This sparsity helps dis-tinguish important variables from unimportant ones easily andmore effectively (See Tibshirani 1996 and Fan and Li 2001for comparison of the l1 penalty with other forms of penalty)The regularization parameter λ balances the fitness in the like-lihood and the penalty part

For the usual smoothing spline modeling the penalty Jλ isa quadratic norm or seminorm in an RKHS Kimeldorf andWahba (1971) showed that the minimizer fλ for the traditionalsmoothing spline model falls in a finite-dimensional spacealthough the model space is of infinite dimension For the penal-ized likelihood approach with a nonquadratic penalty like thel1 penalty it is very hard to obtain analytic solutions In light ofthe results for the quadratic penalty situation we propose usinga sufficiently large number of basis functions to span the modelspace and estimate the target function in that space If all of thedata x1 xn are used to generate the bases the resultingfunctional space demands intensive computation and the ap-plication is limited for large-scale problems Thus we adopt theparsimonious bases approach used by Xiang and Wahba (1998)Ruppert and Carroll (2000) Lin et al (2000) and Yau et al(2003) Gu and Kim (2001) showed that the number of basisterms can be much smaller than n without degrading the per-formance of the estimation For N le n we subsample N pointsx1lowast xNlowast from the original data and use them to generatebasis functions that then span the model space Hlowast Notice thatwe are not wasting any data resource here because all of thedata points are used for model fitting although only a subset ofthem are selected to generate basis functions

The issue of choosing N and the subsamples is importantIn practice we generally start with a reasonably large N It iswell known that ldquoreasonably largerdquo is not actually very large(see Lin et al 2000) In principle the subspace spanned bythe chosen basis terms needs to be rich enough to provide adecent fit to the true curve In this article we use the simplerandom sampling technique to choose the subsamples Alterna-tively a cluster algorithm may be used as done by Xiang andWahba (1998) and Yau et al (2003) Its basic idea involves twosteps First we group the data into N clusters that have max-imum separation by some good algorithm Then within eachcluster we randomly choose one data point as a representativeto be included in the base pool This scheme usually provideswell-separated subsamples

221 Main-Effects Model The main-effects model alsoknown as the additive model is a sum of d functions ofone variable The function space is the tensor sum of con-stant and the main effect subspaces of

otimesdα=1 H(α) Define

Hlowast = [1]oplusdα=1 spank1(x

(α))K1(x(α) x

(α)jlowast ) j = 1 N equiv

[1]oplusdα=1 H

(α)lowast where k1(middot) and K1(middot middot) are as defined in Sec-

tion 21 Then any component function fα isin H(α)lowast has the rep-resentation fα(x(α)) = bαk1(x

(α)) + sumNj=1 cαjK1(x

(α) x(α)jlowast )

The LBP estimate f isinHlowast is obtained by minimizing

1

n

nsum

i=1

[minusl(yi f (xi )

)]+ λπ

dsum

α=1

|bα| + λs

dsum

α=1

Nsum

j=1

|cαj | (5)

where f (x) = b0 + sumdα=1 fα(x(α)) and (λπ λs) are the regu-

larization parameters Here and in the sequel we have chosen togroup terms of similar types (here ldquoparametricrdquo and ldquosmoothrdquo)and to allow distinct λrsquos for different groups Of coursewe could choose to set λπ = λs Note that λs = infin yields a para-metric model Alternatively we could choose different λrsquos foreach coefficient if we decide to incorporate some prior knowl-edge of certain variables or their effects

222 Two-Factor Interaction Model Two-factor inte-ractions arise in many practical problems (See Hastie andTibshirani 1990 sec 955 or Lin et al 2000 figs 9 and 10for interpretable plots of two-factor interactions with contin-uous variables) In the LBP model the two-factor interactionspace consists of the ldquoparametricrdquo part and the ldquosmoothrdquo partThe parametric part is generated by d parametric main-effectterms and d(dminus1)

2 parametricndashparametric interaction termsThe smooth part is the tensor sum of the subspaces gener-ated by smooth main-effect terms parametricndashsmooth inter-action terms and smoothndashsmooth interaction terms For eachpair α = β the two-factor interaction subspace is H(αβ)lowast =spank1(x

(α))k1(x(β))K1(x

(α) x(α)jlowast )k1(x

(β))k1(x(β)jlowast )K1(x

(α)

x(α)jlowast )K1(x

(β) x(β)

jlowast ) j = 1 N and the interaction term

fαβ(x(α) x(β)) has the representation

fαβ = bαβk1(x(α)

)k1(x(β)

)

+Nsum

j=1

cπsαβjK1

(x(α) x

(α)jlowast

)k1(x(β)

)k1(x

(β)jlowast

)

+Nsum

j=1

cssαβjK1

(x(α) x

(α)jlowast

)K1

(x(β) x

(β)jlowast

)

The whole function space is Hlowast equiv [1]oplusdα=1 H

(α)lowast +oplusβltα H

(αβ)lowast The LBP optimization problem is

minf isinHlowast

1

n

nsum

i=1

[minusl(yi f (xi )

)]+ λπ

(dsum

α=1

|bα|)

+ λππ

(sum

αltβ

|bαβ |)

+ λπs

(sum

α =β

Nsum

j=1

|cπsαβj |

)

+ λs

(dsum

α=1

Nsum

j=1

|cαj |)

+ λss

(sum

αltβ

Nsum

j=1

|cssαβj |

) (6)

Note that different penalties are allowed for the five differenttypes of terms

23 Incorporating Categorical Variables

In real applications some covariates may be categorical suchas sex race and smoking history in many medical studies Inprevious sections we proposed the main-effects model (5) andthe two-factor interaction model (6) for continuous variablesonly Now we generalize these models to incorporate categor-ical variables which are denoted by Z = (Z(1) Z(r)) Forsimplicity we assume that each Z(γ ) has two categories T F for γ = 1 r The following idea is easily extended to the

662 Journal of the American Statistical Association September 2004

situation when some variables have more than two categoriesDefine the mapping γ by

γ

(z(γ )

)=

1

2if z(γ ) = T

minus1

2if z(γ ) = F

Generally the mapping is chosen to make the range of categori-cal variables comparable with that of continuous variables Forany variable with C gt 2 categories C minus 1 contrasts are needed

The main-effects model that incorporates the categori-cal variables is as follows

Minimize

1

n

nsum

i=1

[minusl(yi f (xi zi )

)]+ λπ

(dsum

α=1

|bα| +rsum

γ=1

|Bγ |)

+ λs

dsum

α=1

Nsum

j=1

|cαj | (7)

where f (x z) = b0 + sumdα=1 bαk1(x

(α)) + sumrγ=1 Bγ times

γ (z(γ )) + sumdα=1

sumNj=1 cαjK1(x

(α) x(α)jlowast ) For each γ

the function γ can be considered as the main effect ofthe covariate Z(γ ) Thus we choose to associate the coef-ficients |B|rsquos and |b|rsquos with the same parameter λπ

Adding two-factor interactions with categorical vari-ables to a model that already includes parametric andsmooth terms adds a number of additional terms to thegeneral model Compared with (6) four new types ofterms are involved when we take into account categoricalvariables categorical main effects categoricalndashcategoricalinteractions ldquoparametric continuousrdquondashcategorical interac-tions and ldquosmooth continuousrdquondashcategorical interactionsThe modified two-factor interaction model is as follows

Minimize

1

n

nsum

i=1

[minusl(yi f (xi zi )

)]+ λπ

(dsum

α=1

|bα| +rsum

γ=1

|Bγ |)

+ λππ

(sum

αltβ

|bαβ | +sum

γltθ

|Bγθ | +dsum

α=1

rsum

γ=1

|Pαγ |)

+ λπs

(sum

α =β

Nsum

j=1

|cπsαβj | +

dsum

α=1

rsum

γ=1

Nsum

j=1

|cπsαγj |

)

+ λs

(dsum

α=1

Nsum

j=1

|cαj |)

+ λss

(sum

αltβ

Nsum

j=1

|cssαβj |

) (8)

where

f (x z) = b0 +dsum

α=1

bαk1(x(α)

)+rsum

γ=1

Bγ γ

(z(γ )

)

+sum

αltβ

bαβk1(x(α)

)k1(x(β)

)

+sum

γltθ

Bγ θγ

(z(γ )

(z(θ)

)

+dsum

α=1

rsum

γ=1

Pαγ k1(x(α)

(z(γ )

)

+sum

α =β

Nsum

j=1

cπsαβjK1

(x(α) x

(α)jlowast

)k1(x(β)

)k1(x

(β)jlowast

)

+dsum

α=1

rsum

γ=1

Nsum

j=1

cπsαγjK1

(x(α) x

(α)jlowast

(z(γ )

)

+dsum

α=1

Nsum

j=1

cαjK1(x(α) x

(α)jlowast

)

+sum

αltβ

Nsum

j=1

cssαβjK1

(x(α) x

(α)jlowast

)K1

(x(β) x

(β)jlowast

)

We assign different regularization parameters for main-effect terms parametricndashparametric interaction termsparametricndashsmooth interaction terms and smoothndashsmoothinteraction terms In particular the coefficients |Bγθ |rsquos and|Pαγ |rsquos are associated with the same parameter λππ andthe coefficients |cπs

αγj |rsquos and |cπsαβj |rsquos are associated with

the same parameter λπs

3 GENERALIZED APPROXIMATECROSSndashVALIDATION

Regularization parameter selection has been a very activefield of research Ordinary cross-validation (OCV) (Wahba andWold 1975) generalized cross-validation (GCV) (Craven andWahba 1979) and GACV (Xiang and Wahba 1996) are widelyused in various contexts of smoothing spline models We derivethe GACV to select λrsquos in the LBP models Here we present theGACV score and its derivation only for the main-effects modelthe extension to high-order interaction models is straightfor-ward With an abuse of notation we use λ to represent the col-lective set of tuning parameters in particular λ = (λπ λs) forthe main-effects model and λ = (λπ λππ λπs λs λss) for thetwo-factor interaction model

31 Approximate Cross-Validation

Let p be the ldquotruerdquo but unknown probability function andlet pλ be its estimate associated with λ Similarly let f and micro

denote the true logit and mean functions and let fλ and microλ

denote the corresponding estimates KullbackndashLeibler (KL)distance also known as the relative entropy is often used tomeasure the distance between two probability distributionsFor Bernoulli data we have KL(ppλ) = EX[ 1

2 micro(f minus fλ) minus(b(f )minusb(fλ))] with b(f ) = log(1+ef ) Removing the quan-tity that does not depend on λ from the KL distance expressionwe get the comparative KL (CKL) distance EX[minusmicrofλ + b(fλ)]The ordinary leave-one-out cross validation (CV) for CKL is

CV(λ) = 1

n

nsum

i=1

[minusyif[minusi]λ (xi ) + b(fλ(xi ))

] (9)

where fλ[minusi] is the minimizer of the objective function with

the ith data point omitted CV is commonly used as a roughlyunbiased estimate for CKL (see Xiang and Wahba 1996

Zhang et al Likelihood Basis Pursuit 663

Gu 2002) Direct calculation of CV involves computing n leave-one-out estimates which is expensive and almost infeasible forlarge-scale problems Thus we derive a second-order approxi-mate cross-validation (ACV) score to the CV (See Xiang andWahba 1996 Lin et al 2000 Gao Wahba Klein and Klein2001 for the ACV in traditional smoothing spline models)We first establish the leave-one-out lemma for the LBP mod-els The proof of Lemma 1 is given in Appendix A

Lemma 1 (Leave-One-Out Lemma) Denote the objectivefunction in the LBP model by

Iλ(fy) = L(y f ) + Jλ(f ) (10)

Let fλ[minusi] be the minimizer of Iλ(fy) with the ith obser-

vation omitted and let micro[minusi]λ (middot) be the mean function corre-

sponding to f[minusi]λ (middot) For any v isin R we define the vector

V = (y1 yiminus1 v yi+1 yn)prime Let hλ(i v middot) be the mini-

mizer of Iλ(fV) then hλ(imicro[minusi]λ (xi ) middot) = f

[minusi]λ (middot)

Using Taylor series approximations and Lemma 1 we canderive (in App B) the ACV score

ACV(λ) = 1

n

nsum

i=1

[minusyifλ(xi ) + b(fλ(xi ))]

+ 1

n

nsum

i=1

hii

yi(yi minus microλ(xi))

1 minus σ 2λihii

(11)

where σ 2λi equiv pλ(xi )(1 minus pλ(xi )) and hii is the iith entry of

the matrix H defined in (B9) (more details have been given inZhang 2002) Let W be the diagonal matrix with σ 2

λi in the iithposition By replacing hii with 1

n

sumni=1 hii equiv 1

ntr(H) and re-

placing 1 minus σ 2λihii with 1

ntr[I minus (W12HW12)] in (11) we ob-

tain the GACV score

GACV(λ) = 1

n

nsum

i=1

[minusyifλ(xi) + b(fλ(xi ))]

+ tr(H)

n

sumni=1 yi(yi minus microλ(xi ))

tr[I minus W12HW12] (12)

32 Randomized GACV

Direct computation of (12) involves the inversion of a large-scale matrix whose size depends on sample size n basissize N and dimension d Large values of n N or d may makethe computation expensive and produce unstable solutionsThus the randomized GACV (ranGACV) score is proposed asa computable proxy for GACV Essentially we use the random-ized trace estimates for tr(H) and tr[I minus 1

2 (W12HW12)] basedon the following theorem (which has been exploited by numer-ous authors eg Girard 1998)

If A is any square matrix and ε is a mean 0 ran-dom n-vector with independent components withvariance σ 2

ε then 1σ 2

εEεT Aε = tr(A)

Let ε = (ε1 εn)prime be a mean 0 random n-vector of inde-

pendent components with variance σ 2ε Let f

yλ and f

y+ελ be the

minimizer of (5) using the original data y and the perturbed

data y + ε Using the derivation procedure of Lin et al (2000)we get the ranGACV score for the LBP estimates

ranGACV(λ)

= 1

n

nsum

i=1

[minusyifλ(xi ) + b(fλ(xi ))]

+ εT (fy+ελ minus f

yλ )

n

sumni=1 yi(yi minus microλ(xi ))

εT ε minus εT W(fy+ελ minus f

yλ )

(13)

In addition two ideas help reduce the variance of the secondterm in (13)

1 Choose ε as Bernoulli (5) taking values in +σεminusσεThis guarantees that the randomized trace estimate has theminimal variance given a fixed σ 2

ε (see Hutchinson 1989)2 Generate U independent perturbations ε(u) u = 1 U

and compute U replicated ranGACVs Then use their av-erage to compute the GACV estimate

4 SELECTION CRITERIA FOR MAINndashEFFECTSAND TWOndashFACTOR INTERACTIONS

41 The L1 Importance Measure

After choosing the optimal λ by the GACV or ranGACVcriteria the LBP estimate f

λis obtained by minimizing (5)

(6) (7) or (8) How to measure the importance of a particu-lar component in the fitted model is a key question We pro-pose using the functional L1 norm as the importance measureThe sparsity in the solutions will help distinguish the signifi-cant terms from insignificant ones effectively and thus improvethe performance of our importance measure In practice foreach functional component its L1 norm is empirically calcu-lated as the average of the function values evaluated at all ofthe data points for instance L1(fα) = 1

n

sumni=1 |fα(x

(α)i )| and

L1(fαβ) = 1n

sumni=1 |fαβ(x

(α)i x

(β)i )| For the categorical vari-

ables in model (7) the empirical L1 norm of the main ef-fect fγ is L1(fγ ) = 1

n

sumni=1 |Bγ γ (z

(γ )

i )| for γ = 1 r The norms of the interaction terms involved with categoricalvariables are defined similarly The rank of the L1 norm scoresis used to rank the relative importance of functional compo-nents For instance the component with the largest L1 normis ranked as the most important one and any component witha zero or tiny L1 norm is ranked as an unimportant one We havealso tried using the functional L2 norm to rank the componentsthis gave almost identical results in terms of the set of variablesselected in numerous simulation studies (not reproduced here)

42 Choosing the Threshold

In this section we focus on the main-effects model Usingthe chosen parameter λ we obtain the estimated main-effectscomponents f1 fd and calculate their L1 norms L1(f1)

L1(fd ) We use a sequential procedure to select importantterms Denote the decreasingly ordered norms by L(1) L(d)

and the corresponding components by f(1) f(d) A uni-versal threshold value is needed to differentiate the impor-tant components from unimportant ones Call the threshold q Only variables with their L1 norms greater than or equal to q

are ldquoimportantrdquo

664 Journal of the American Statistical Association September 2004

Now we develop a sequential Monte Carlo bootstrap testprocedure to determine q Essentially we test the variablesrsquoimportance one by one in their L1 norm rank order If onevariable passes the test (hence ldquoimportantrdquo) then it enters thenull model for testing the next variable otherwise the proce-dure stops After the first η (0 le η le d minus 1) variables enterthe model it is a one-sided hypothesis-testing problem to de-cide whether the next component f(η+1) is important or notWhen η = 0 the null model f is the constant say f = b0 andthe hypotheses are H0 L(1) = 0 versus H1 L(1) gt 0 When1 le η le d minus 1 the null model is f = b0 + f(1) + middot middot middot + f(η)and the hypotheses are H0 L(η+1) = 0 versus H1 L(η+1) gt 0Let the desired one-sided test level be α If the null distribu-tion of L(η+1) were knownthen we could get the critical valueα-percentile and make a decision of rejection or acceptanceIn practice calculating the exact α-percentile is difficult or im-possible however the Monte Carlo bootstrap test provides aconvenient approximation to the full test Conditional on theoriginal covariates x1 xn we generate ylowast(η)

1 ylowast(η)n

(response 0 or 1) using the null model f = b0 + f(1) +middot middot middot+ f(η)

as the true logit function We sample a total of T independentsets of data (x1 y

lowast(η)1t ) (xn y

lowast(η)nt ) t = 1 T from the

null model f fit the main-effects model for each set and com-pute L

lowast(η+1)t t = 1 T If exactly k of the simulated Llowast(η+1)

values exceed L(η+1) and none equals it then the Monte Carlop value is k+1

T +1 (See Davison and Hinkley 1997 for an intro-duction to the Monte Carlo bootstrap test)

Sequential Monte Carlo Bootstrap Test Algorithm

Step 1 Let η = 0 and f = b0 Test H0 L(1) = 0 ver-sus H1 L(1) gt 0 Generate T independent sets of

data (x1 ylowast(0)1t ) (xn y

lowast(0)nt ) t = 1 T from

f = b0 Fit the LBP main-effects model and com-pute the Monte Carlo p value p0 If p0 lt α then goto step 2 otherwise stop and define q as any numberslightly larger than L(1)

Step 2 Let η = η + 1 and f = b0 + f(1) + middot middot middot + f(η) TestH0 L(η+1) = 0 versus H1 L(η+1) gt 0 Generate

T independent sets of data (x1 ylowast(η)1t ) (xn y

lowast(η)nt )

based on f fit the main-effects model and com-pute the Monte Carlo p value pη If pη lt α andη lt d minus 1 then repeat step 2 and if pη lt α andη = d minus 1 then go to step 3 otherwise stop and de-fine q = L(η)

Step 3 Stop the procedure and define q = L(d)

The order of entry for sequential testing of the terms beingentertained for the model is determined by the magnitude ofthe component L1 norms There are other reasonable ways todetermine the order of entry and the particular strategy usedcan affect the results as is the case for any stepwise procedureFor the LBP approach the relative ranking among the impor-tant terms usually does not affect the final component selectionsolution as long as the important terms are all ranked higherthan the unimportant terms Thus our procedure usually can dis-tinguish between important and unimportant terms as in mostof our examples When the distinction between important and

unimportant terms is ambiguous with our method the ambigu-ous terms can be recognized and further investigation will beneeded on these terms

5 NUMERICAL COMPUTATION

Because the objective function in either (5) or (6) is not dif-ferentiable with respect to the coefficients brsquos and crsquos somenumerical methods for optimization fail to solve this kind ofproblem We can replace the l1 norms in the objective func-tion by nonnegative variables constrained linearly to be thecorresponding absolute values using standard mathematicalprogramming techniques and then solve a series of programswith nonlinear objective functions and linear constraints Manyoptimization methods can solve such problems We use MINOS(Murtagh and Saunders 1983) because it generally performswell with the linearly constrained models and returns consis-tent results

Consider choosing the optimal λ by selecting the grid pointfor which ranGACV achieves a minimum For each λ tofind ranGACV the program (5) (6) (7) or (8) must be solvedtwicemdashonce with y (the original problem) and once with y + ε

(the perturbed problem) This often results in hundreds orthousands of individual solves depending on the range for λTo obtain solutions in a reasonable amount of time we use anefficient solving approach known as slice modeling (see Ferrisand Voelker 2000 2001) Slice modeling is an approach to solv-ing a series of mathematical programs with the same structurebut different data Because for LBP we can consider the val-ues (λy) to be individual slices of data (because only thesevalues change between solves) the program can be reduced toan example of nonlinear slice modeling By applying slice mod-eling ideas (namely maintaining the common program struc-ture and ldquocorerdquo data shared between solves and using previoussolutions as starting points for late solves) we can improve ef-ficiency and make the grid search feasible We have developedefficient code for the main-effects and two-way interaction LBPmodels The code is easy to use and runs fast For examplein one simulation example with n = 1000 d = 10 and λ fixedthe main-effects model takes less than 2 seconds and the two-way interaction model takes less than 50 seconds

6 SIMULATION

61 Simulation 1 Main-Effects Model

In this example there are a total of d = 10 covariatesX(1) X(10) They are taken to be uniformly distributedin [01] independently The sample size n = 1000 We use thesimple random subsampling technique to select N = 50 basisfunctions The perturbation ε is distributed as Bernoulli(5) tak-ing two values +25minus25 Four variables X(1) X(3) X(6)and X(8) are important and the others are noise variablesThe true conditional logit function is

f (x) = 4

3x(1) + π sin

(πx(3)

)+ 8(x(6)

)5 + 2

e minus 1ex(8) minus 5

We fit the main-effects LBP model and search the parameters(λπ λs) globally Because the true f is known both CKL andranGACV can be used for choosing the λrsquos Figure 1 depicts thevalues of CKL(λ) and ranGACV(λ) as functions of (λπ λs)

Zhang et al Likelihood Basis Pursuit 665

Figure 1 Contours and Three-Dimensional Plots for CKL(λ) and GACV(λ)

within the region of interest [2minus202minus1] times [2minus202minus1] In thetop row are the contours for CKL(λ) and ranGACV(λ) wherethe white cross ldquoxrdquo indicates the location of the optimal regular-ization parameter Here λCKL = (2minus172minus15) and λranGACV =(2minus82minus15) The bottom row shows their three-dimensionalplots In general ranGACV(λ) approximates CKL(λ) quitewell globally

Using the optimal parameters we fit the main-effects modeland calculate the L1 norm scores for the individual componentsf1 f10 Figure 2 plots two sets of L1 norm scores obtainedusing λCKL and λranGACV in their decreasing order The dashedline indicates the threshold chosen by the proposed sequen-tial Monte Carlo bootstrap test algorithm Using this threshold

Figure 2 L1 Norm Scores for the Main-Effects Model ( CKLGACV)

variables X(6) X(3) X(1) and X(8) are selected as ldquoimportantrdquovariables correctly

Figure 3 depicts the procedure of the sequential bootstraptests to determine q We fit the main-effects model usingλranGACV and sequentially test the hypotheses H0 L(η) = 0 ver-sus H1 L(η) gt 0 η = 1 10 In each plot of Figure 3 graycircles denote the L1 norms for the variables in the null model

(a) (b)

(c) (d)

(e)

Figure 3 Monte Carlo Bootstrap Tests for Simulation 1 [(a) orig-inal boot 1 (b) original boot 2 (c) original boot 3(d) original boot 4 (e) original boot 5]

666 Journal of the American Statistical Association September 2004

(a) (b)

(c) (d)

Figure 4 True ( ) and Estimated ( ) Univariate Logit Compo-nents (a) X1 (b) X3 (c) X6 (d) X8

and black circles denote the L1 norms for those not in the nullmodel (In a color plot gray circles are shown in green andblack circles are shown in blue) Along the horizontal axis thevariable being tested for importance is bracketed by a pair of Our experiment shows that the null hypotheses of the first fourtests are all rejected at level α = 05 based on their Monte Carlop value 151

= 02 However the null hypothesis for the fifthcomponent f2 is accepted with the p value 1051

= 20 Thusf6 f3 f1 and f8 are selected as ldquoimportantrdquo components andq = L(4) = 21

In addition to selecting important variables LBP also pro-duces functional estimates for the individual components in themodel Figure 4 plots the true main effects f1 f3 f6 and f8and their estimates fitted using λranGACV In each panel thedashed line denoted the true curve and the solid line denotesthe corresponding estimate In general the fitted main-effectsmodel provides a reasonably good estimate for each importantcomponent Altogether we generated 20 datasets and fitted themain-effects model for each dataset with regularization para-meters tuned separately Throughout all of these 20 runs vari-ables X(1) X(3) X(6) and X(8) are always the four top-rankedvariables The results and figures shown here are based on thefirst dataset

62 Simulation 2 Two-Factor Interaction Model

There are d = 4 continuous covariates independently anduniformly distributed in [01] The true model is a two-factorinteraction model and the important effects are X(1) X(2) andtheir interaction The true logit function f is

f (x) = 4x(1) + π sin(πx(1)

)+ 6x(2)

minus 8(x(2)

)3 + cos(2π

(x(1) minus x(2)

))minus 4

We choose n = 1000 and N = 50 and use the same perturba-tion ε as in the previous example There are five tuning para-meters (λπ λππ λs λπs and λss ) in the two-factor interactionmodel In practice extra constraints may be added on the pa-rameters for different needs Here we penalize all of the two-factor interaction terms equally by setting λππ = λπs = λss The optimal parameters are λCKL = (2minus102minus102minus15

Figure 5 L1 Norm Scores for the Two-Factor Interaction Model( CKL GACV)

2minus102minus10) and λranGACV = (2minus202minus202minus182minus202minus20)Figure 5 plots the ranked L1 norm scores The dashed line rep-resents the threshold q chosen by the Monte Carlo bootstraptest procedure The LBP two-factor interaction model using ei-ther λCKL or λGACV selects all of the important effects X(1)X(2) and their interaction effect correctly

There is a strong interaction effect between variables X(1)

and X(2) which is shown clearly by the cross-sectional plots inFigure 6 The solid lines are the cross-sections of the true logitfunction f (x(1) x(2)) at distinct values x(1) = 2 5 8 and thedashed lines are their corresponding estimates given by the LBPmodel The parameters are tuned by the ranGACV criterion

63 Simulation 3 Main-Effects Model IncorporatingCategorical Variables

In this example among the 12 covariates X(1) X(10) arecontinuous and Z(1) and Z(2) are categorical The continuous

(a) (b) (c)

Figure 6 Cross-Sectional Plots for the Two-Factor Interaction Model(a) x = 2 (b) x = 5 (c) x = 8 ( true estimate)

Zhang et al Likelihood Basis Pursuit 667

Figure 7 L1 Norm Scores for the Main-Effects Model IncorporatingCategorical Variables ( CKL GACV)

variables are uniformly distributed in [01] and the categori-cal variables are Bernoulli(5) with values 01 The true logitfunction is

f (x) = 4

3x(1) + π sin

(πx(3)

)+ 8(x(6)

)5

+ 2

e minus 1ex(8) + 4z(1) minus 7

The important main effects are X(1) X(3) X(6) X(8) and Z(1)The sample size is n = 1000 and the basis size is N = 50We use the same perturbation ε as in the previous exam-ples The main-effects model incorporating categorical vari-ables in (7) is fitted Figure 7 plots the ranked L1 norm scoresfor all of the covariates The LBP main-effects models usingλCKL and λGACV both select the important continuous and cat-egorical variables correctly

7 WISCONSIN EPIDEMIOLOGIC STUDY OFDIABETIC RETINOPATHY

The Wisconsin Epidemiologic Study of Diabetic Retinopa-thy (WESDR) is an ongoing epidemiologic study of a cohortof patients receiving their medical care southern WisconsinAll younger-onset diabetic persons (defined as under age 30of age at diagnosis and taking insulin) and a probability sam-ple of older-onset persons receiving primary medical care in an11-county area of southwestern Wisconsin in 1979ndash1980 wereinvited to participate Among 1210 identified younger onsetpatients 996 agreed to participate in the baseline examinationin 1980ndash1982 of those 891 participated in the first follow-upexamination Details about the study have been given by KleinKlein Moss Davis and DeMets (1984ab 1989) Klein KleinMoss and Cruickshanks (1998) and others A large numberof medical demographic ocular and other covariates wererecorded in each examination A multilevel retinopathy scorewas assigned to each eye based on its stereoscopic color fundusphotographs This dataset has been extensively analyzed usinga variety of statistical methods (see eg Craig Fryback Kleinand Klein 1999 Kim 1995)

In this section we examine the relation of a large numberof possible risk factors at baseline to the 4-year progression ofdiabetic retinopathy Each personrsquos retinopathy score was de-fined as the score for the worse eye and 4-year progression ofretinopathy was considered to occur if the retinopathy score de-graded two levels from baseline Wahba et al (1995) examinedrisk factors for progression of diabetic retinopathy on a subset

of the younger-onset group whose members had no or nonpro-liferative retinopathy at baseline the dataset comprised 669 per-sons A model of the risk of progression of diabetic retinopathyin this population was built using a SSndashANOVA model (whichhas a quadratic penalty functional) using the predictor variablesglycosylated hemoglobin (gly) duration of diabetes (dur) andbody mass index (bmi) these variables are described furtherin Appendix B That study began with these variables and twoother (nonindependent) variables age at baseline and age at di-agnosis these latter two were eliminated at the start Althoughit was not discussed by Wahba et al (1995) we report herethat that study began with a large number (perhaps about 20)of potential risk factors which were reduced to gly dur andbmi as likely the most important after many extended and la-borious parametric and nonparametric regression analyses ofsmall groups of variables at a time and by linear logistic re-gression by the authors and others At that time it was recog-nized that a (smoothly) nonparametric model selection methodthat could rapidly flag important variables in a dataset withmany candidate variables was very desirable For the purposesof the present study we make the reasonable assumption thatgly dur and bmi are the ldquotruthrdquo (ie the most important riskfactors in the analyzed population) and thus we are presentedwith a unique opportunity to examine the behavior of the LBPmethod in a real dataset where arguably the truth is knownby giving it many variables in this dataset and comparing theresults to those of Wahba et al (1995) Minor corrections andupdatings of that dataset have been made (but are not believedto affect the conclusions) and we have 648 persons in the up-dated dataset used here

We performed some preliminary winnowing of the manypotential prediction variables available reducing the set forexamination to 14 potential risk factors The continuous covari-ates are dur gly bmi sys ret pulse ins sch iop and the cate-gorical covariates are smk sex asp famdb mar the full namesfor these are given in Appendix C Because the true f is notknown for real data only ranGACV is available for tuning λFigure 8 plots the L1 norm scores of the individual functionalcomponents in the fitted LBP main-effects model The dashedline indicates the threshold q = 39 which is chosen by the se-quential bootstrap tests

We note that the LBP picks out three most important vari-ables gly dur and bmi specified by Wahba et al (1995)The LBP also chose sch (highest year of schoolcollege com-pleted) This variable frequently shows up in demographic stud-ies when one looks for it because it is likely a proxy for othervariables related to disease such as lifestyle and quality of med-ical care It did show up in preliminary studies of Wahba et al

Figure 8 L1 Norm Scores for the WESDR Main-Effects Model

668 Journal of the American Statistical Association September 2004

Figure 9 Estimated Logit Component for dur

(1995) (not reported there) but was not included because it wasnot considered a direct cause of disease itself The sequentialMonte Carlo bootstrap tests for gly dur sch and bmi all havep value 151

= 02 thus these four covariates are selected asimportant risk factors at the significance level α = 05

Figure 9 plots the estimated logit for dur The risk of progres-sion of diabetic retinopathy increases up to a duration of about15 years then decreases thereafter which generally agrees withthe analysis of Wahba et al (1995) The linear logistic regres-sion model (using the function glm in the R package) shows thatdur is not significant at level α = 05 The curve in Figure 9 ex-hibits a hilly shape which means that a quadratic function fitsthe curve better than a linear function We refit the linear lo-gistic model by intentionally including dur2 the term dur2 issignificant with a p value of 02 This fact confirms the discov-ery of the LBP and shows that LBP can be a valid screeningtool to aid in determining the appropriate functional form forthe individual covariate

When fitting the two-factor interaction model in (6) with theconstraints λππ = λπs = λss we did not find the durndashbmi inter-action of Wahba et al (1995) here We note that the interactionterms tend to be washed out if there are only a few interactionsHowever further exploratory analysis may be done by rearrang-ing the constraints andor varying the tuning parameters subjec-tively Note that the solution to the optimization problem is verysparse In this example we observed that approximately 90 ofthe coefficients are 0rsquos in the solution

8 BEAVER DAM EYE STUDY

The Beaver Dam Eye Study (BDES) is an ongoing popula-tion-based study of age-related ocular disorders It aims tocollect information related to the prevalence incidence andseverity of age-related cataract macular degeneration and di-abetic retinopathy Between 1987 and 1988 5924 eligiblepeople (age 43ndash84 years) were identified in Beaver DamWisconsin and of those 4926 (831) participated in thebaseline exam 5- and 10-year follow-up data have been col-lected and results are being reported Many variables of variouskinds have been collected including mortality between base-line and the follow-ups A detailed description of the study hasbeen given by Klein Klein Linton and DeMets (1991) recentreports include that of Klein Klein Lee Cruickshanks andChappell (2001)

We are interested in the relation between the 5-year mor-tality occurrence for the nondiabetic study participants andpossible risk factors at baseline We focus on the nondiabeticparticipants because the pattern of risk factors for people with

diabetes and the rest of the population differs We consider10 continuous and 8 categorical covariates the detailed in-formation for which is given in Appendix D The abbreviatednames of continuous covariates are pky sch inc bmi glu calchl hgb sys and age and those of the categorical covariatesare cv sex hair hist nout mar sum and vtm We deliberatelytake into account some ldquonoisyrdquo variables in the analysis in-cluding hair nout and sum which are not directly related tomortality in general We include these to show the performanceof the LBP approach and they are not expected to be pickedout eventually by the model Y is assigned a value of 1 if a per-son participated in the baseline examination and died beforethe start of the first 5-year follow-up and 0 otherwise Thereare 4422 nondiabetic study participants in the baseline exam-ination 395 of whom have missing data in the covariates Forthe purpose of this study we assume that the missing data aremissing at random thus these 335 subjects are not includedin our analysis This assumption is not necessarily valid be-cause age blood pressure body mass index cholesterol sexsmoking and hemoglobin may well affect the missingness buta further examination of the missingness is beyond the scopeof the present study In addition we exclude another 10 partic-ipants who have either outlier values pky gt 158 or very abnor-mal records bmi gt 58 or hgb lt 6 Thus we report an analysisof the remaining 4017 nondiabetic participants from the base-line population

We fit the main-effects model incorporating categorical variables in (7). The sequential Monte Carlo bootstrap tests for six covariates, age, hgb, pky, sex, sys, and cv, all have Monte Carlo p values 1/51 = .02, whereas the test for glu is not significant, with a p value 9/51 = .18. The threshold is chosen as q̂ = L(6) = .25. Figure 10 plots the L1 norm scores for all of the potential risk factors. Using the threshold (dashed line) .25 chosen by the sequential bootstrap test procedure, the LBP model identifies six important risk factors for the 5-year mortality: age, hgb, pky, sex, sys, and cv.

Compared with the LBP model, the linear logistic model with stepwise selection using the AIC (implemented by the function glm in R) misses the variable sys but selects three more variables: inc, bmi, and sum. Figure 11 depicts the estimated univariate logit components for the important continuous variables selected by the LBP model. All of the curves can be approximated reasonably well by linear models except sys, whose functional form exhibits a quadratic shape. This explains why sys is not selected by the linear logistic model. When we refit the logistic regression model by including sys² in the model, the stepwise selection picked out both sys and sys².

Figure 10. L1 Norm Scores for the BDES Main-Effects Model.


Figure 11. Estimated Univariate Logit Components for Important Variables: (a) age; (b) hgb; (c) pky; (d) sys.

9. DISCUSSION

We have proposed the LBP approach for variable selection in high-dimensional nonparametric model building. In the spirit of the LASSO, LBP produces shrinkage functional estimates by imposing the l1 penalty on the coefficients of the basis functions. Using the proposed measure of importance for the functional components, LBP effectively selects important variables, and the results are highly interpretable. LBP can handle continuous variables and categorical variables simultaneously. Although in this article our continuous variables have all been on subsets of the real line, it is clear that other continuous domains are possible. We have used LBP in the context of the Bernoulli distribution, but it can be extended to other exponential distributions as well (and, of course, to Gaussian data). We expect that larger numbers of variables than considered here may be handled, and we expect that there will be many other scientific applications of the method. We plan to provide freeware for public use.

We believe that this method is a useful addition to the toolbox of the data analyst. It provides a way to examine the possible effects of a large number of variables in a nonparametric manner, complementary to standard parametric models in its ability to find nonparametric terms that may be missed by parametric methods. It has an advantage over quadratically penalized likelihood methods when the aim is to examine a large number of variables or terms simultaneously, inasmuch as the l1 penalties result in sparse solutions. In practice, it can be an efficient tool for examining complex datasets to identify and prioritize variables (and possibly interactions) for further study. Based only on the variables or interactions identified by the LBP, one can build more traditional parametric or penalized likelihood models, for which confidence intervals and theoretical properties are known.

APPENDIX A: PROOF OF LEMMA 1

For $i = 1, \ldots, n$, we have $-l(\mu_\lambda^{[-i]}(x_i), \tau) = -\mu_\lambda^{[-i]}(x_i)\,\tau + b(\tau)$. Let $f_\lambda^{[-i]}$ be the minimizer of the objective function
$$\frac{1}{n}\sum_{j \neq i}\bigl[-l\bigl(y_j, f(x_j)\bigr)\bigr] + J_\lambda(f). \tag{A.1}$$
Because
$$\frac{\partial\bigl(-l(\mu_\lambda^{[-i]}(x_i), \tau)\bigr)}{\partial \tau} = -\mu_\lambda^{[-i]}(x_i) + b'(\tau)$$
and
$$\frac{\partial^2\bigl(-l(\mu_\lambda^{[-i]}(x_i), \tau)\bigr)}{\partial \tau^2} = b''(\tau) > 0,$$
we see that $-l(\mu_\lambda^{[-i]}(x_i), \tau)$ achieves its unique minimum at the $\hat\tau$ that satisfies $b'(\hat\tau) = \mu_\lambda^{[-i]}(x_i)$. So $\hat\tau = f_\lambda^{[-i]}(x_i)$. Then for any $f$, we have
$$-l\bigl(\mu_\lambda^{[-i]}(x_i), f_\lambda^{[-i]}(x_i)\bigr) \le -l\bigl(\mu_\lambda^{[-i]}(x_i), f(x_i)\bigr). \tag{A.2}$$
Define $y_{-i} = (y_1, \ldots, y_{i-1}, \mu_\lambda^{[-i]}(x_i), y_{i+1}, \ldots, y_n)'$. For any $f$,
$$\begin{aligned}
I_\lambda(f, y_{-i}) &= \frac{1}{n}\Bigl\{-l\bigl(\mu_\lambda^{[-i]}(x_i), f(x_i)\bigr) + \sum_{j \neq i}\bigl[-l\bigl(y_j, f(x_j)\bigr)\bigr]\Bigr\} + J_\lambda(f) \\
&\ge \frac{1}{n}\Bigl\{-l\bigl(\mu_\lambda^{[-i]}(x_i), f_\lambda^{[-i]}(x_i)\bigr) + \sum_{j \neq i}\bigl[-l\bigl(y_j, f(x_j)\bigr)\bigr]\Bigr\} + J_\lambda(f) \\
&\ge \frac{1}{n}\Bigl\{-l\bigl(\mu_\lambda^{[-i]}(x_i), f_\lambda^{[-i]}(x_i)\bigr) + \sum_{j \neq i}\bigl[-l\bigl(y_j, f_\lambda^{[-i]}(x_j)\bigr)\bigr]\Bigr\} + J_\lambda\bigl(f_\lambda^{[-i]}\bigr).
\end{aligned}$$
The first inequality comes from (A.2). The second inequality is due to the fact that $f_\lambda^{[-i]}(\cdot)$ is the minimizer of (A.1). Thus we have that $h_\lambda(i, \mu_\lambda^{[-i]}(x_i), \cdot) = f_\lambda^{[-i]}(\cdot)$.

APPENDIX B: DERIVATION OF ACV

Here we derive the ACV in (11) for the main-effects model; the derivation for two- or higher-order interaction models is similar. Write $\mu(x) = E(Y|X = x)$ and $\sigma^2(x) = \operatorname{var}(Y|X = x)$. For the functional estimate $f$, we define $f_i = f(x_i)$ and $\mu_i = \mu(x_i)$ for $i = 1, \ldots, n$. To emphasize that an estimate is associated with the parameter $\lambda$, we also use $f_{\lambda i} = f_\lambda(x_i)$ and $\mu_{\lambda i} = \mu_\lambda(x_i)$. Now define
$$\mathrm{OBS}(\lambda) = \frac{1}{n}\sum_{i=1}^n [-y_i f_{\lambda i} + b(f_{\lambda i})].$$
This is the observed CKL, which, because $y_i$ and $f_{\lambda i}$ are usually positively correlated, tends to underestimate the CKL. To correct this, we use the leave-one-out CV in (9),
$$\begin{aligned}
\mathrm{CV}(\lambda) &= \frac{1}{n}\sum_{i=1}^n \bigl[-y_i f_{\lambda i}^{[-i]} + b(f_{\lambda i})\bigr] \\
&= \mathrm{OBS}(\lambda) + \frac{1}{n}\sum_{i=1}^n y_i\bigl(f_{\lambda i} - f_{\lambda i}^{[-i]}\bigr) \\
&= \mathrm{OBS}(\lambda) + \frac{1}{n}\sum_{i=1}^n y_i\, \frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} \times \frac{y_i - \mu_{\lambda i}}{1 - (\mu_{\lambda i} - \mu_{\lambda i}^{[-i]})/(y_i - \mu_{\lambda i}^{[-i]})}.
\end{aligned} \tag{B.1}$$
For exponential families, we have $\mu_{\lambda i} = b'(f_{\lambda i})$ and $\sigma^2_{\lambda i} = b''(f_{\lambda i})$. If we approximate the finite difference by the functional derivative, then we get
$$\frac{\mu_{\lambda i} - \mu_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} = \frac{b'(f_{\lambda i}) - b'(f_{\lambda i}^{[-i]})}{y_i - \mu_{\lambda i}^{[-i]}} \approx b''(f_{\lambda i})\,\frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} = \sigma^2_{\lambda i}\,\frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}}. \tag{B.2}$$
Denote
$$G_i \equiv \frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}}; \tag{B.3}$$
then from (B.2) and (B.1) we get
$$\mathrm{CV}(\lambda) \approx \mathrm{OBS}(\lambda) + \frac{1}{n}\sum_{i=1}^n \frac{G_i\, y_i (y_i - \mu_{\lambda i})}{1 - G_i \sigma^2_{\lambda i}}. \tag{B.4}$$
Now we proceed to approximate $G_i$. Consider the main-effects model with
$$f(x) = b_0 + \sum_{\alpha=1}^d b_\alpha k_1\bigl(x^{(\alpha)}\bigr) + \sum_{\alpha=1}^d \sum_{j=1}^N c_{\alpha j} K_1\bigl(x^{(\alpha)}, x^{(\alpha)}_{j*}\bigr).$$
Let $m = Nd$. Define the vectors $b = (b_0, b_1, \ldots, b_d)'$ and $c = (c_{11}, \ldots, c_{dN})' = (c_1, \ldots, c_m)'$. For $\alpha = 1, \ldots, d$, define the $n \times N$ matrix $R_\alpha = (K_1(x^{(\alpha)}_i, x^{(\alpha)}_{j*}))$. Let $R = [R_1, \ldots, R_d]$ and
$$T = \begin{bmatrix} 1 & k_1(x^{(1)}_1) & \cdots & k_1(x^{(d)}_1) \\ \vdots & \vdots & & \vdots \\ 1 & k_1(x^{(1)}_n) & \cdots & k_1(x^{(d)}_n) \end{bmatrix}.$$
Define $f = (f_1, \ldots, f_n)'$; then $f = Tb + Rc$. The objective function in (10) can be expressed as
$$\begin{aligned}
I_\lambda(b, c, y) ={}& \frac{1}{n}\sum_{i=1}^n \Biggl[-y_i\Biggl(\sum_{\alpha=1}^{d+1} T_{i\alpha} b_\alpha + \sum_{j=1}^m R_{ij} c_j\Biggr) + b\Biggl(\sum_{\alpha=1}^{d+1} T_{i\alpha} b_\alpha + \sum_{j=1}^m R_{ij} c_j\Biggr)\Biggr] \\
&+ \lambda_\pi \sum_{\alpha=2}^{d+1} |b_\alpha| + \lambda_s \sum_{j=1}^m |c_j|.
\end{aligned} \tag{B.5}$$
With the observed response $y$, we denote the minimizer of (B.5) by $(\hat b, \hat c)$. When a small perturbation $\epsilon$ is imposed on the response, we denote the minimizer of $I_\lambda(b, c, y + \epsilon)$ by $(\tilde b, \tilde c)$. Then we have $f^y_\lambda = T\hat b + R\hat c$ and $f^{y+\epsilon}_\lambda = T\tilde b + R\tilde c$. The $l_1$ penalty in (B.5) tends to produce sparse solutions; that is, many components of $\hat b$ and $\hat c$ are exactly 0 in the solution. For simplicity of explanation, we assume that the first $s$ components of $\hat b$ and the first $r$ components of $\hat c$ are nonzero; that is,
$$\hat b = (\underbrace{\hat b_1, \ldots, \hat b_s}_{\neq 0}, \underbrace{\hat b_{s+1}, \ldots, \hat b_{d+1}}_{= 0})' \quad\text{and}\quad \hat c = (\underbrace{\hat c_1, \ldots, \hat c_r}_{\neq 0}, \underbrace{\hat c_{r+1}, \ldots, \hat c_m}_{= 0})'.$$
The general case is similar, but notationally more complicated. The 0's in the solutions are robust against a small perturbation in the data. That is, when the magnitude of the perturbation $\epsilon$ is small enough, the 0 elements will stay at 0. This can be seen by transforming the optimization problem (B.5) into a nonlinear programming problem with a differentiable convex objective function and linear constraints, and considering the Karush–Kuhn–Tucker (KKT) conditions (see, e.g., Mangasarian 1969). Thus it is reasonable to assume that the solution of (B.5) with the perturbed data $y + \epsilon$ has the form
$$\tilde b = (\underbrace{\tilde b_1, \ldots, \tilde b_s}_{\neq 0}, \underbrace{\tilde b_{s+1}, \ldots, \tilde b_{d+1}}_{= 0})' \quad\text{and}\quad \tilde c = (\underbrace{\tilde c_1, \ldots, \tilde c_r}_{\neq 0}, \underbrace{\tilde c_{r+1}, \ldots, \tilde c_m}_{= 0})'.$$
For convenience, we denote the first $s$ components of $b$ by $b_*$ and the first $r$ components of $c$ by $c_*$. Correspondingly, let $T_*$ be the submatrix containing the first $s$ columns of $T$, and let $R_*$ be the submatrix consisting of the first $r$ columns of $R$. Then
$$f^{y+\epsilon}_\lambda - f^y_\lambda = T(\tilde b - \hat b) + R(\tilde c - \hat c) = [\,T_*\;R_*\,]\begin{bmatrix} \tilde b_* - \hat b_* \\ \tilde c_* - \hat c_* \end{bmatrix}. \tag{B.6}$$
Now
$$\biggl[\frac{\partial I_\lambda}{\partial b_*}, \frac{\partial I_\lambda}{\partial c_*}\biggr]'_{(\tilde b, \tilde c, y+\epsilon)} = 0 \quad\text{and}\quad \biggl[\frac{\partial I_\lambda}{\partial b_*}, \frac{\partial I_\lambda}{\partial c_*}\biggr]'_{(\hat b, \hat c, y)} = 0. \tag{B.7}$$
The first-order Taylor approximation of $[\partial I_\lambda/\partial b_*, \partial I_\lambda/\partial c_*]'_{(\tilde b, \tilde c, y+\epsilon)}$ at $(\hat b, \hat c, y)$ gives
$$\begin{aligned}
\biggl[\frac{\partial I_\lambda}{\partial b_*}, \frac{\partial I_\lambda}{\partial c_*}\biggr]'_{(\tilde b, \tilde c, y+\epsilon)} \approx{}& \biggl[\frac{\partial I_\lambda}{\partial b_*}, \frac{\partial I_\lambda}{\partial c_*}\biggr]'_{(\hat b, \hat c, y)} + \begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial b_* \partial b_*'} & \dfrac{\partial^2 I_\lambda}{\partial b_* \partial c_*'} \\ \dfrac{\partial^2 I_\lambda}{\partial c_* \partial b_*'} & \dfrac{\partial^2 I_\lambda}{\partial c_* \partial c_*'} \end{bmatrix}_{(\hat b, \hat c, y)} \begin{bmatrix} \tilde b_* - \hat b_* \\ \tilde c_* - \hat c_* \end{bmatrix} \\
&+ \begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial b_* \partial y'} \\ \dfrac{\partial^2 I_\lambda}{\partial c_* \partial y'} \end{bmatrix}_{(\hat b, \hat c, y)} (y + \epsilon - y),
\end{aligned} \tag{B.8}$$
when the magnitude of $\epsilon$ is small. Define
$$U \equiv \begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial b_* \partial b_*'} & \dfrac{\partial^2 I_\lambda}{\partial b_* \partial c_*'} \\ \dfrac{\partial^2 I_\lambda}{\partial c_* \partial b_*'} & \dfrac{\partial^2 I_\lambda}{\partial c_* \partial c_*'} \end{bmatrix}_{(\hat b, \hat c, y)} \quad\text{and}\quad V \equiv -\begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial b_* \partial y'} \\ \dfrac{\partial^2 I_\lambda}{\partial c_* \partial y'} \end{bmatrix}_{(\hat b, \hat c, y)}.$$
From (B.7) and (B.8), we have
$$U\begin{bmatrix} \tilde b_* - \hat b_* \\ \tilde c_* - \hat c_* \end{bmatrix} \approx V\epsilon.$$
Then (B.6) gives us $f^{y+\epsilon}_\lambda - f^y_\lambda \approx H\epsilon$, where
$$H \equiv [\,T_*\;R_*\,]\,U^{-1} V. \tag{B.9}$$
Now consider a special form of perturbation, $\epsilon_0 = (0, \ldots, \mu^{[-i]}_{\lambda i} - y_i, \ldots, 0)'$; then $f^{y+\epsilon_0}_\lambda - f^y_\lambda \approx \epsilon_{0i} H_{\cdot i}$, where $\epsilon_{0i} = \mu^{[-i]}_{\lambda i} - y_i$. Lemma 1 in Section 3 shows that $f^{[-i]}_\lambda$ is the minimizer of $I_\lambda(f, y + \epsilon_0)$. Therefore, $G_i$ in (B.3) is
$$G_i = \frac{f_{\lambda i} - f^{[-i]}_{\lambda i}}{y_i - \mu^{[-i]}_{\lambda i}} = \frac{f^{[-i]}_{\lambda i} - f_{\lambda i}}{\epsilon_{0i}} \approx h_{ii}.$$
From (B.4), an ACV score is then
$$\mathrm{ACV}(\lambda) = \frac{1}{n}\sum_{i=1}^n [-y_i f_{\lambda i} + b(f_{\lambda i})] + \frac{1}{n}\sum_{i=1}^n \frac{h_{ii}\, y_i (y_i - \mu_{\lambda i})}{1 - \sigma^2_{\lambda i} h_{ii}}. \tag{B.10}$$
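For Bernoulli data, the blocks of $U$ and $V$ take an explicit form: the $l_1$ terms contribute nothing to the second derivatives at the nonzero coefficients, so with $S_* = [\,T_*\;R_*\,]$ a short calculation gives $U = S_*'WS_*/n$ and $V = S_*'/n$, and the $1/n$ factors cancel in $H = S_*(S_*'WS_*)^{-1}S_*'$. The helper below is a sketch of this computation with illustrative names, not code from the article.

```python
# Sketch: the H matrix of (B.9) for Bernoulli data, assuming the columns of
# the nonzero coefficients have been collected in S_star = [T* R*].
import numpy as np

def hat_matrix(S_star, f_hat):
    mu = 1.0 / (1.0 + np.exp(-f_hat))         # mu_lambda = b'(f_lambda)
    w = mu * (1.0 - mu)                       # diagonal of W = diag(b''(f))
    U = S_star.T @ (w[:, None] * S_star)      # n * U; the 1/n cancels below
    return S_star @ np.linalg.solve(U, S_star.T)
```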

APPENDIX C: WISCONSIN EPIDEMIOLOGIC STUDY OF DIABETIC RETINOPATHY

• Continuous covariates:
  X1 (dur): duration of diabetes at the time of baseline examination, years
  X2 (gly): glycosylated hemoglobin, a measure of hyperglycemia, %
  X3 (bmi): body mass index, kg/m²
  X4 (sys): systolic blood pressure, mm Hg
  X5 (ret): retinopathy level
  X6 (pulse): pulse rate, count for 30 seconds
  X7 (ins): insulin dose, kg/day
  X8 (sch): years of school completed
  X9 (iop): intraocular pressure, mm Hg
• Categorical covariates:
  Z1 (smk): smoking status (0 = no, 1 = any)
  Z2 (sex): gender (0 = female, 1 = male)
  Z3 (asp): use of at least one aspirin for at least three months while diabetic (0 = no, 1 = yes)
  Z4 (famdb): family history of diabetes (0 = none, 1 = yes)
  Z5 (mar): marital status (0 = no, 1 = yes/ever)

APPENDIX D: BEAVER DAM EYE STUDY

• Continuous covariates:
  X1 (pky): pack years smoked, (packs per day/20) × years smoked
  X2 (sch): highest year of school/college completed, years
  X3 (inc): total household personal income, thousands/month
  X4 (bmi): body mass index, kg/m²
  X5 (glu): glucose (serum), mg/dL
  X6 (cal): calcium (serum), mg/dL
  X7 (chl): cholesterol (serum), mg/dL
  X8 (hgb): hemoglobin (blood), g/dL
  X9 (sys): systolic blood pressure, mm Hg
  X10 (age): age at examination, years
• Categorical covariates:
  Z1 (cv): history of cardiovascular disease (0 = no, 1 = yes)
  Z2 (sex): gender (0 = female, 1 = male)
  Z3 (hair): hair color (0 = blond/red, 1 = brown/black)
  Z4 (hist): history of heavy drinking (0 = never, 1 = past/currently)
  Z5 (nout): winter leisure time (0 = indoors, 1 = outdoors)
  Z6 (mar): marital status (0 = no, 1 = yes/ever)
  Z7 (sum): part of day spent outdoors in summer (0 = <1/4 day, 1 = >1/4 day)
  Z8 (vtm): vitamin use (0 = no, 1 = yes)

[Received July 2002. Revised February 2004.]

REFERENCES

Bakin, S. (1999), "Adaptive Regression and Model Selection in Data Mining Problems," unpublished doctoral thesis, Australian National University.
Chen, S., Donoho, D., and Saunders, M. (1998), "Atomic Decomposition by Basis Pursuit," SIAM Journal on Scientific Computing, 20, 33–61.
Craig, B. A., Fryback, D. G., Klein, R., and Klein, B. (1999), "A Bayesian Approach to Modeling the Natural History of a Chronic Condition From Observations With Intervention," Statistics in Medicine, 18, 1355–1371.
Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation," Numerische Mathematik, 31, 377–403.
Davison, A. C., and Hinkley, D. V. (1997), Bootstrap Methods and Their Application, Cambridge, U.K.: Cambridge University Press.
Fan, J., and Li, R. Z. (2001), "Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties," Journal of the American Statistical Association, 96, 1348–1360.
Ferris, M. C., and Voelker, M. M. (2000), "Slice Models in General Purpose Modeling Systems," Optimization Methods and Software, 17, 1009–1032.
——— (2001), "Slice Models in GAMS," in Operations Research Proceedings, eds. P. Chamoni, R. Leisten, A. Martin, J. Minnemann, and H. Stadtler, New York: Springer-Verlag, pp. 239–246.
Frank, I. E., and Friedman, J. H. (1993), "A Statistical View of Some Chemometrics Regression Tools," Technometrics, 35, 109–148.
Fu, W. J. (1998), "Penalized Regression: The Bridge versus the LASSO," Journal of Computational and Graphical Statistics, 7, 397–416.
Gao, F., Wahba, G., Klein, R., and Klein, B. (2001), "Smoothing Spline ANOVA for Multivariate Bernoulli Observations, With Application to Ophthalmology Data," Journal of the American Statistical Association, 96, 127–160.
Girard, D. (1998), "Asymptotic Comparison of (Partial) Cross-Validation, GCV and Randomized GCV in Nonparametric Regression," The Annals of Statistics, 26, 315–334.
Gu, C. (2002), Smoothing Spline ANOVA Models, New York: Springer-Verlag.
Gu, C., and Kim, Y. J. (2002), "Penalized Likelihood Regression: General Formulation and Efficient Approximation," Canadian Journal of Statistics, 30, 619–628.
Gunn, S. R., and Kandola, J. S. (2002), "Structural Modelling With Sparse Kernels," Machine Learning, 48, 115–136.
Hastie, T. J., and Tibshirani, R. J. (1990), Generalized Additive Models, New York: Chapman & Hall.
Hutchinson, M. (1989), "A Stochastic Estimator for the Trace of the Influence Matrix for Laplacian Smoothing Splines," Communications in Statistics: Simulation, 18, 1059–1076.
Kim, K. (1995), "A Bivariate Cumulative Probit Regression Model for Ordered Categorical Data," Statistics in Medicine, 14, 1341–1352.
Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82–95.
Klein, R., Klein, B., Lee, K., Cruickshanks, K., and Chappell, R. (2001), "Changes in Visual Acuity in a Population Over a 10-Year Period: Beaver Dam Eye Study," Ophthalmology, 108, 1757–1766.
Klein, R., Klein, B., Linton, K., and DeMets, D. L. (1991), "The Beaver Dam Eye Study: Visual Acuity," Ophthalmology, 98, 1310–1315.
Klein, R., Klein, B., Moss, S., and Cruickshanks, K. (1998), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: XVII. The 14-Year Incidence and Progression of Diabetic Retinopathy and Associated Risk Factors in Type 1 Diabetes," Ophthalmology, 105, 1801–1815.
Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1984a), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: II. Prevalence and Risk of Diabetes When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 102, 520–526.
——— (1984b), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: III. Prevalence and Risk of Diabetes When Age at Diagnosis Is 30 or More Years," Archives of Ophthalmology, 102, 527–532.
——— (1989), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: IX. Four-Year Incidence and Progression of Diabetic Retinopathy When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 107, 237–243.
Knight, K., and Fu, W. J. (2000), "Asymptotics for Lasso-Type Estimators," The Annals of Statistics, 28, 1356–1378.
Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., and Klein, B. (2000), "Smoothing Spline ANOVA Models for Large Data Sets With Bernoulli Observations and the Randomized GACV," The Annals of Statistics, 28, 1570–1600.
Linhart, H., and Zucchini, W. (1986), Model Selection, New York: Wiley.
Mangasarian, O. (1969), Nonlinear Programming, New York: McGraw-Hill.
Murtagh, B. A., and Saunders, M. A. (1983), "MINOS 5.5 User's Guide," Technical Report SOL 83-20R, Stanford University, Dept. of Operations Research.
Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 42, 205–223.
Tibshirani, R. J. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Ser. B, 58, 267–288.
Wahba, G. (1990), Spline Models for Observational Data, Vol. 59, Philadelphia: SIAM.
Wahba, G., Wang, Y., Gu, C., Klein, R., and Klein, B. (1995), "Smoothing Spline ANOVA for Exponential Families, With Application to the Wisconsin Epidemiological Study of Diabetic Retinopathy," The Annals of Statistics, 23, 1865–1895.
Wahba, G., and Wold, S. (1975), "A Completely Automatic French Curve," Communications in Statistics, 4, 1–17.
Xiang, D., and Wahba, G. (1996), "A Generalized Approximate Cross-Validation for Smoothing Splines With Non-Gaussian Data," Statistica Sinica, 6, 675–692.
——— (1998), "Approximate Smoothing Spline Methods for Large Data Sets in the Binary Case," in Proceedings of the American Statistical Association Joint Statistical Meetings, Biometrics Section, pp. 94–98.
Yau, P., Kohn, R., and Wood, S. (2003), "Bayesian Variable Selection and Model Averaging in High-Dimensional Multinomial Nonparametric Regression," Journal of Computational and Graphical Statistics, 12, 23–54.
Zhang, H. H. (2002), "Nonparametric Variable Selection and Model Building via Likelihood Basis Pursuit," Technical Report 1066, University of Wisconsin-Madison, Dept. of Statistics.


situation when some variables have more than two categories. Define the mapping $\gamma$ by
$$\gamma\bigl(z^{(\gamma)}\bigr) = \begin{cases} \tfrac{1}{2} & \text{if } z^{(\gamma)} = T, \\ -\tfrac{1}{2} & \text{if } z^{(\gamma)} = F. \end{cases}$$
Generally, the mapping is chosen to make the range of the categorical variables comparable with that of the continuous variables. For any variable with $C > 2$ categories, $C - 1$ contrasts are needed.

The main-effects model that incorporates the categorical variables is as follows: minimize
$$\frac{1}{n}\sum_{i=1}^n \bigl[-l\bigl(y_i, f(x_i, z_i)\bigr)\bigr] + \lambda_\pi\Biggl(\sum_{\alpha=1}^d |b_\alpha| + \sum_{\gamma=1}^r |B_\gamma|\Biggr) + \lambda_s \sum_{\alpha=1}^d \sum_{j=1}^N |c_{\alpha j}|, \tag{7}$$
where
$$f(x, z) = b_0 + \sum_{\alpha=1}^d b_\alpha k_1\bigl(x^{(\alpha)}\bigr) + \sum_{\gamma=1}^r B_\gamma\, \gamma\bigl(z^{(\gamma)}\bigr) + \sum_{\alpha=1}^d \sum_{j=1}^N c_{\alpha j} K_1\bigl(x^{(\alpha)}, x^{(\alpha)}_{j*}\bigr).$$
For each $\gamma$, the function $B_\gamma\,\gamma(z^{(\gamma)})$ can be considered the main effect of the covariate $Z^{(\gamma)}$. Thus we choose to associate the coefficients $|B|$'s and $|b|$'s with the same parameter $\lambda_\pi$.
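To make the ingredients of (7) concrete, the following minimal sketch fits a toy version of the main-effects LBP model with one binary covariate, using the generic convex solver cvxpy rather than the MINOS/slice-modeling implementation described in Section 5. The kernels k1 and K1 below are one standard spline ANOVA choice (Wahba 1990); the data, sizes, and λ values are illustrative assumptions only.

```python
# A toy fit of the main-effects LBP objective (7): one binary covariate z,
# d continuous covariates, l1 penalties on parametric and smooth coefficients.
import numpy as np
import cvxpy as cp

def k1(x):                       # scaled Bernoulli polynomial B1: x - 1/2
    return x - 0.5

def k2(x):
    return 0.5 * (k1(x)**2 - 1.0/12)

def k4(x):
    return (k1(x)**4 - 0.5 * k1(x)**2 + 7.0/240) / 24

def K1(s, t):                    # cubic-spline reproducing kernel on [0,1]
    return k2(s) * k2(t) - k4(np.abs(s - t))

rng = np.random.default_rng(0)
n, d, N = 200, 3, 20             # illustrative sizes only
X = rng.uniform(size=(n, d))
Z = rng.integers(0, 2, size=n)
f_true = 2*X[:, 0] + np.sin(np.pi * X[:, 1]) + 1.5*(Z - 0.5)
y = rng.binomial(1, 1.0/(1.0 + np.exp(-f_true)))

Xstar = X[rng.choice(n, N, replace=False)]          # subsampled basis points
zmap = np.where(Z == 1, 0.5, -0.5)                  # categorical mapping
T = np.column_stack([np.ones(n)] + [k1(X[:, a]) for a in range(d)] + [zmap])
R = np.hstack([K1(X[:, [a]], Xstar[:, a][None, :]) for a in range(d)])

b = cp.Variable(T.shape[1])      # b_0, the b_alpha's, and B_gamma
c = cp.Variable(R.shape[1])      # the c_alpha_j's
f = T @ b + R @ c
lam_pi, lam_s = 1e-2, 1e-3       # would be tuned by (ran)GACV in practice
nll = cp.sum(cp.multiply(-y, f) + cp.logistic(f)) / n   # -y*f + log(1+e^f)
pen = lam_pi * cp.norm1(b[1:]) + lam_s * cp.norm1(c)
cp.Problem(cp.Minimize(nll + pen)).solve()
```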

Adding two-factor interactions with categorical variables to a model that already includes parametric and smooth terms adds a number of additional terms to the general model. Compared with (6), four new types of terms are involved when we take into account categorical variables: categorical main effects, categorical–categorical interactions, "parametric continuous"–categorical interactions, and "smooth continuous"–categorical interactions. The modified two-factor interaction model is as follows: minimize
$$\begin{aligned}
\frac{1}{n}\sum_{i=1}^n \bigl[-l\bigl(y_i, f(x_i, z_i)\bigr)\bigr] &+ \lambda_\pi\Biggl(\sum_{\alpha=1}^d |b_\alpha| + \sum_{\gamma=1}^r |B_\gamma|\Biggr) \\
&+ \lambda_{\pi\pi}\Biggl(\sum_{\alpha<\beta} |b_{\alpha\beta}| + \sum_{\gamma<\theta} |B_{\gamma\theta}| + \sum_{\alpha=1}^d \sum_{\gamma=1}^r |P_{\alpha\gamma}|\Biggr) \\
&+ \lambda_{\pi s}\Biggl(\sum_{\alpha \neq \beta} \sum_{j=1}^N |c^{\pi s}_{\alpha\beta j}| + \sum_{\alpha=1}^d \sum_{\gamma=1}^r \sum_{j=1}^N |c^{\pi s}_{\alpha\gamma j}|\Biggr) \\
&+ \lambda_s\Biggl(\sum_{\alpha=1}^d \sum_{j=1}^N |c_{\alpha j}|\Biggr) + \lambda_{ss}\Biggl(\sum_{\alpha<\beta} \sum_{j=1}^N |c^{ss}_{\alpha\beta j}|\Biggr),
\end{aligned} \tag{8}$$

where
$$\begin{aligned}
f(x, z) ={}& b_0 + \sum_{\alpha=1}^d b_\alpha k_1\bigl(x^{(\alpha)}\bigr) + \sum_{\gamma=1}^r B_\gamma\,\gamma\bigl(z^{(\gamma)}\bigr) + \sum_{\alpha<\beta} b_{\alpha\beta} k_1\bigl(x^{(\alpha)}\bigr) k_1\bigl(x^{(\beta)}\bigr) \\
&+ \sum_{\gamma<\theta} B_{\gamma\theta}\,\gamma\bigl(z^{(\gamma)}\bigr)\,\theta\bigl(z^{(\theta)}\bigr) + \sum_{\alpha=1}^d \sum_{\gamma=1}^r P_{\alpha\gamma}\, k_1\bigl(x^{(\alpha)}\bigr)\,\gamma\bigl(z^{(\gamma)}\bigr) \\
&+ \sum_{\alpha \neq \beta} \sum_{j=1}^N c^{\pi s}_{\alpha\beta j} K_1\bigl(x^{(\alpha)}, x^{(\alpha)}_{j*}\bigr) k_1\bigl(x^{(\beta)}\bigr) k_1\bigl(x^{(\beta)}_{j*}\bigr) \\
&+ \sum_{\alpha=1}^d \sum_{\gamma=1}^r \sum_{j=1}^N c^{\pi s}_{\alpha\gamma j} K_1\bigl(x^{(\alpha)}, x^{(\alpha)}_{j*}\bigr)\,\gamma\bigl(z^{(\gamma)}\bigr) \\
&+ \sum_{\alpha=1}^d \sum_{j=1}^N c_{\alpha j} K_1\bigl(x^{(\alpha)}, x^{(\alpha)}_{j*}\bigr) + \sum_{\alpha<\beta} \sum_{j=1}^N c^{ss}_{\alpha\beta j} K_1\bigl(x^{(\alpha)}, x^{(\alpha)}_{j*}\bigr) K_1\bigl(x^{(\beta)}, x^{(\beta)}_{j*}\bigr).
\end{aligned}$$

We assign different regularization parameters to the main-effect terms, the parametric–parametric interaction terms, the parametric–smooth interaction terms, and the smooth–smooth interaction terms. In particular, the coefficients $|B_{\gamma\theta}|$'s and $|P_{\alpha\gamma}|$'s are associated with the same parameter $\lambda_{\pi\pi}$, and the coefficients $|c^{\pi s}_{\alpha\gamma j}|$'s and $|c^{\pi s}_{\alpha\beta j}|$'s are associated with the same parameter $\lambda_{\pi s}$.
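For orientation, here is a sketch of how the new interaction columns in (8) might be assembled, continuing the notation and helper functions of the previous sketch (k1, K1, X, Xstar, and zmap are as defined there). The basis forms follow the expansion of f(x, z) above, and the grouping of the column blocks determines which λ each block receives; the exact layout is an illustrative assumption.

```python
# Sketch: building the extra design columns for the interaction model (8).
import itertools
import numpy as np

def interaction_columns(X, Xstar, zmap, k1, K1):
    n, d = X.shape
    cols = {}
    # parametric-parametric: k1(x^a) k1(x^b), a < b
    for a, b in itertools.combinations(range(d), 2):
        cols[f"pp_{a}{b}"] = k1(X[:, a]) * k1(X[:, b])
    # parametric continuous-categorical: k1(x^a) * gamma(z)
    for a in range(d):
        cols[f"pcat_{a}"] = k1(X[:, a]) * zmap
    # smooth-parametric: K1(x^a, x^a_j*) k1(x^b) k1(x^b_j*), a != b
    for a, b in itertools.permutations(range(d), 2):
        Ka = K1(X[:, [a]], Xstar[:, a][None, :])          # shape (n, N)
        cols[f"ps_{a}{b}"] = Ka * k1(X[:, [b]]) * k1(Xstar[:, b][None, :])
    # smooth continuous-categorical: K1(x^a, .) * gamma(z)
    for a in range(d):
        Ka = K1(X[:, [a]], Xstar[:, a][None, :])
        cols[f"scat_{a}"] = Ka * zmap[:, None]
    # smooth-smooth: K1(x^a, .) K1(x^b, .), a < b
    for a, b in itertools.combinations(range(d), 2):
        Ka = K1(X[:, [a]], Xstar[:, a][None, :])
        Kb = K1(X[:, [b]], Xstar[:, b][None, :])
        cols[f"ss_{a}{b}"] = Ka * Kb
    return cols
```

Each block of columns then receives its own l1 weight (λ_ππ, λ_πs, or λ_ss) in the objective, exactly as in (8).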

3. GENERALIZED APPROXIMATE CROSS-VALIDATION

Regularization parameter selection has been a very active field of research. Ordinary cross-validation (OCV) (Wahba and Wold 1975), generalized cross-validation (GCV) (Craven and Wahba 1979), and GACV (Xiang and Wahba 1996) are widely used in various contexts of smoothing spline models. We derive the GACV to select the λ's in the LBP models. Here we present the GACV score and its derivation only for the main-effects model; the extension to higher-order interaction models is straightforward. With an abuse of notation, we use λ to represent the collective set of tuning parameters; in particular, λ = (λ_π, λ_s) for the main-effects model and λ = (λ_π, λ_ππ, λ_πs, λ_s, λ_ss) for the two-factor interaction model.

3.1 Approximate Cross-Validation

Let $p$ be the "true" but unknown probability function, and let $p_\lambda$ be its estimate associated with $\lambda$. Similarly, let $f$ and $\mu$ denote the true logit and mean functions, and let $f_\lambda$ and $\mu_\lambda$ denote the corresponding estimates. The Kullback–Leibler (KL) distance, also known as the relative entropy, is often used to measure the distance between two probability distributions. For Bernoulli data, we have $\mathrm{KL}(p, p_\lambda) = E_X\bigl[\frac{1}{2}\mu(f - f_\lambda) - \bigl(b(f) - b(f_\lambda)\bigr)\bigr]$, with $b(f) = \log(1 + e^f)$. Removing the quantity that does not depend on $\lambda$ from the KL distance expression, we get the comparative KL (CKL) distance $E_X[-\mu f_\lambda + b(f_\lambda)]$. The ordinary leave-one-out cross-validation (CV) for CKL is
$$\mathrm{CV}(\lambda) = \frac{1}{n}\sum_{i=1}^n \bigl[-y_i f_\lambda^{[-i]}(x_i) + b\bigl(f_\lambda(x_i)\bigr)\bigr], \tag{9}$$
where $f_\lambda^{[-i]}$ is the minimizer of the objective function with the $i$th data point omitted. CV is commonly used as a roughly unbiased estimate for CKL (see Xiang and Wahba 1996; Gu 2002). Direct calculation of CV involves computing $n$ leave-one-out estimates, which is expensive and almost infeasible for large-scale problems. Thus we derive a second-order approximate cross-validation (ACV) score for the CV (see Xiang and Wahba 1996; Lin et al. 2000; and Gao, Wahba, Klein, and Klein 2001 for the ACV in traditional smoothing spline models). We first establish the leave-one-out lemma for the LBP models; the proof of Lemma 1 is given in Appendix A.
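In the simulations of Section 6, where the true f is known, the CKL can be evaluated empirically; the one-line helper below (a sketch, averaging over the observed design points) makes the target of the tuning criteria concrete.

```python
# Sketch: empirical comparative Kullback-Leibler distance for Bernoulli data.
import numpy as np

def ckl(f_true, f_lam):
    mu = 1.0 / (1.0 + np.exp(-f_true))        # true mean, b'(f)
    return np.mean(-mu * f_lam + np.log1p(np.exp(f_lam)))
```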

Lemma 1 (Leave-one-out lemma). Denote the objective function in the LBP model by
$$I_\lambda(f, y) = L(y, f) + J_\lambda(f). \tag{10}$$
Let $f_\lambda^{[-i]}$ be the minimizer of $I_\lambda(f, y)$ with the $i$th observation omitted, and let $\mu_\lambda^{[-i]}(\cdot)$ be the mean function corresponding to $f_\lambda^{[-i]}(\cdot)$. For any $v \in \mathbb{R}$, define the vector $V = (y_1, \ldots, y_{i-1}, v, y_{i+1}, \ldots, y_n)'$. Let $h_\lambda(i, v, \cdot)$ be the minimizer of $I_\lambda(f, V)$; then $h_\lambda(i, \mu_\lambda^{[-i]}(x_i), \cdot) = f_\lambda^{[-i]}(\cdot)$.

Using Taylor series approximations and Lemma 1, we can derive (in Appendix B) the ACV score
$$\mathrm{ACV}(\lambda) = \frac{1}{n}\sum_{i=1}^n \bigl[-y_i f_\lambda(x_i) + b\bigl(f_\lambda(x_i)\bigr)\bigr] + \frac{1}{n}\sum_{i=1}^n h_{ii}\, \frac{y_i\bigl(y_i - \mu_\lambda(x_i)\bigr)}{1 - \sigma^2_{\lambda i} h_{ii}}, \tag{11}$$

where $\sigma^2_{\lambda i} \equiv p_\lambda(x_i)(1 - p_\lambda(x_i))$ and $h_{ii}$ is the $ii$th entry of the matrix $H$ defined in (B.9) (more details have been given in Zhang 2002). Let $W$ be the diagonal matrix with $\sigma^2_{\lambda i}$ in the $ii$th position. By replacing $h_{ii}$ with $\frac{1}{n}\sum_{i=1}^n h_{ii} \equiv \frac{1}{n}\operatorname{tr}(H)$ and replacing $1 - \sigma^2_{\lambda i} h_{ii}$ with $\frac{1}{n}\operatorname{tr}[I - W^{1/2} H W^{1/2}]$ in (11), we obtain the GACV score
$$\mathrm{GACV}(\lambda) = \frac{1}{n}\sum_{i=1}^n \bigl[-y_i f_\lambda(x_i) + b\bigl(f_\lambda(x_i)\bigr)\bigr] + \frac{\operatorname{tr}(H)}{n}\, \frac{\sum_{i=1}^n y_i\bigl(y_i - \mu_\lambda(x_i)\bigr)}{\operatorname{tr}[I - W^{1/2} H W^{1/2}]}. \tag{12}$$
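As a small computational illustration, (12) can be evaluated directly once the influence-type matrix H of (B.9) is available. The helper below is a sketch with illustrative names, using μ_λ = b'(f_λ) and W = diag(μ_λ(1 − μ_λ)) for Bernoulli data.

```python
# Sketch: evaluating the GACV score (12), assuming H from (B.9) is given.
import numpy as np

def gacv(y, f_hat, H):
    """y: 0/1 responses; f_hat: fitted logits f_lambda(x_i); H: n x n."""
    n = len(y)
    mu = 1.0 / (1.0 + np.exp(-f_hat))          # mu_lambda(x_i) = b'(f)
    obs = np.mean(-y * f_hat + np.log1p(np.exp(f_hat)))
    w_half = np.sqrt(mu * (1.0 - mu))          # diagonal of W^{1/2}
    # tr[I - W^{1/2} H W^{1/2}] = n - sum_i w_i H_ii
    denom = n - np.trace((w_half[:, None] * H) * w_half[None, :])
    return obs + (np.trace(H) / n) * np.sum(y * (y - mu)) / denom
```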

3.2 Randomized GACV

Direct computation of (12) involves the inversion of a large-scale matrix whose size depends on the sample size $n$, the basis size $N$, and the dimension $d$. Large values of $n$, $N$, or $d$ may make the computation expensive and produce unstable solutions. Thus the randomized GACV (ranGACV) score is proposed as a computable proxy for GACV. Essentially, we use randomized trace estimates for $\operatorname{tr}(H)$ and $\operatorname{tr}[I - W^{1/2} H W^{1/2}]$, based on the following fact (which has been exploited by numerous authors, e.g., Girard 1998): if $A$ is any square matrix and $\epsilon$ is a mean-0 random $n$-vector with independent components with variance $\sigma^2_\epsilon$, then $\frac{1}{\sigma^2_\epsilon} E\,\epsilon' A \epsilon = \operatorname{tr}(A)$.

Let $\epsilon = (\epsilon_1, \ldots, \epsilon_n)'$ be a mean-0 random $n$-vector of independent components with variance $\sigma^2_\epsilon$. Let $f^y_\lambda$ and $f^{y+\epsilon}_\lambda$ be the minimizers of (5) using the original data $y$ and the perturbed data $y + \epsilon$. Using the derivation procedure of Lin et al. (2000), we get the ranGACV score for the LBP estimates:
$$\mathrm{ranGACV}(\lambda) = \frac{1}{n}\sum_{i=1}^n \bigl[-y_i f_\lambda(x_i) + b\bigl(f_\lambda(x_i)\bigr)\bigr] + \frac{\epsilon'\bigl(f^{y+\epsilon}_\lambda - f^y_\lambda\bigr)}{n}\, \frac{\sum_{i=1}^n y_i\bigl(y_i - \mu_\lambda(x_i)\bigr)}{\epsilon'\epsilon - \epsilon' W \bigl(f^{y+\epsilon}_\lambda - f^y_\lambda\bigr)}. \tag{13}$$

In addition, two ideas help reduce the variance of the second term in (13):

1. Choose $\epsilon$ as Bernoulli(.5), taking the two values $\{+\sigma_\epsilon, -\sigma_\epsilon\}$. This guarantees that the randomized trace estimate has the minimal variance for a fixed $\sigma^2_\epsilon$ (see Hutchinson 1989).
2. Generate $U$ independent perturbations $\epsilon^{(u)}$, $u = 1, \ldots, U$, compute $U$ replicated ranGACV scores, and use their average as the GACV estimate.
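Putting (13) and the two variance-reduction ideas together gives the following sketch. Here fit_lbp is a hypothetical black-box LBP solver (such as the cvxpy problem sketched earlier) returning fitted logits, and σ_ε = .25 matches the perturbation used in the simulations of Section 6.

```python
# Sketch of ranGACV (13), averaged over U Hutchinson-type probes.
import numpy as np

def ran_gacv(y, lam, fit_lbp, sigma_eps=0.25, U=5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    f_y = fit_lbp(y, lam)                      # solve with the original data
    mu = 1.0 / (1.0 + np.exp(-f_y))
    obs = np.mean(-y * f_y + np.log1p(np.exp(f_y)))
    w = mu * (1.0 - mu)                        # diagonal of W
    scores = []
    for _ in range(U):
        eps = rng.choice([sigma_eps, -sigma_eps], size=n)   # +-sigma_eps probe
        df = fit_lbp(y + eps, lam) - f_y       # f^{y+eps} - f^y, approx. H eps
        denom = eps @ eps - eps @ (w * df)
        scores.append(obs + (eps @ df / n) * np.sum(y * (y - mu)) / denom)
    return float(np.mean(scores))
```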

4. SELECTION CRITERIA FOR MAIN EFFECTS AND TWO-FACTOR INTERACTIONS

4.1 The L1 Importance Measure

After choosing the optimal $\hat\lambda$ by the GACV or ranGACV criteria, the LBP estimate $\hat f_{\hat\lambda}$ is obtained by minimizing (5), (6), (7), or (8). How to measure the importance of a particular component in the fitted model is a key question. We propose using the functional $L_1$ norm as the importance measure. The sparsity of the solutions helps distinguish the significant terms from the insignificant ones effectively, and thus improves the performance of our importance measure. In practice, for each functional component, its $L_1$ norm is empirically calculated as the average of the absolute function values evaluated at all of the data points; for instance, $L_1(\hat f_\alpha) = \frac{1}{n}\sum_{i=1}^n |\hat f_\alpha(x^{(\alpha)}_i)|$ and $L_1(\hat f_{\alpha\beta}) = \frac{1}{n}\sum_{i=1}^n |\hat f_{\alpha\beta}(x^{(\alpha)}_i, x^{(\beta)}_i)|$. For the categorical variables in model (7), the empirical $L_1$ norm of the main effect $\hat f_\gamma$ is $L_1(\hat f_\gamma) = \frac{1}{n}\sum_{i=1}^n |\hat B_\gamma\, \gamma(z^{(\gamma)}_i)|$ for $\gamma = 1, \ldots, r$. The norms of the interaction terms involving categorical variables are defined similarly. The rank of the $L_1$ norm scores is used to rank the relative importance of the functional components. For instance, the component with the largest $L_1$ norm is ranked as the most important one, and any component with a zero or tiny $L_1$ norm is ranked as unimportant. We have also tried using the functional $L_2$ norm to rank the components; this gave almost identical results in terms of the set of variables selected in numerous simulation studies (not reproduced here).
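For the main-effects fit of the earlier sketch, the empirical L1 scores can be computed as below. This is a sketch; the coefficient layout assumes the T and R matrices built in that sketch, with the c's grouped by variable.

```python
# Sketch: empirical L1 importance scores for fitted main-effect components,
# where f_alpha(x) = b_alpha k1(x) + sum_j c_alpha_j K1(x, x_j*).
import numpy as np

def l1_scores(X, Xstar, b_hat, c_hat, k1, K1):
    n, d = X.shape
    N = Xstar.shape[0]
    scores = np.empty(d)
    for a in range(d):
        fa = b_hat[1 + a] * k1(X[:, a]) \
             + K1(X[:, [a]], Xstar[:, a][None, :]) @ c_hat[a*N:(a+1)*N]
        scores[a] = np.mean(np.abs(fa))        # L1(f_alpha)
    return scores
```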

4.2 Choosing the Threshold

In this section we focus on the main-effects model. Using the chosen parameter $\hat\lambda$, we obtain the estimated main-effects components $\hat f_1, \ldots, \hat f_d$ and calculate their $L_1$ norms $L_1(\hat f_1), \ldots, L_1(\hat f_d)$. We use a sequential procedure to select the important terms. Denote the decreasingly ordered norms by $L_{(1)}, \ldots, L_{(d)}$ and the corresponding components by $\hat f_{(1)}, \ldots, \hat f_{(d)}$. A universal threshold value is needed to differentiate the important components from the unimportant ones; call the threshold $\hat q$. Only variables with $L_1$ norms greater than or equal to $\hat q$ are "important."

Now we develop a sequential Monte Carlo bootstrap test procedure to determine $\hat q$. Essentially, we test the variables' importance one by one, in their $L_1$ norm rank order. If a variable passes the test (hence is "important"), then it enters the null model for testing the next variable; otherwise, the procedure stops. After the first $\eta$ ($0 \le \eta \le d - 1$) variables enter the model, deciding whether the next component $\hat f_{(\eta+1)}$ is important is a one-sided hypothesis-testing problem. When $\eta = 0$, the null model $\hat f$ is the constant, say $\hat f = \hat b_0$, and the hypotheses are $H_0: L_{(1)} = 0$ versus $H_1: L_{(1)} > 0$. When $1 \le \eta \le d - 1$, the null model is $\hat f = \hat b_0 + \hat f_{(1)} + \cdots + \hat f_{(\eta)}$, and the hypotheses are $H_0: L_{(\eta+1)} = 0$ versus $H_1: L_{(\eta+1)} > 0$. Let the desired one-sided test level be $\alpha$. If the null distribution of $L_{(\eta+1)}$ were known, then we could get the critical value, the $\alpha$-percentile, and make a decision of rejection or acceptance. In practice, calculating the exact $\alpha$-percentile is difficult or impossible; however, the Monte Carlo bootstrap test provides a convenient approximation to the full test. Conditional on the original covariates $x_1, \ldots, x_n$, we generate $y^{*(\eta)}_1, \ldots, y^{*(\eta)}_n$ (responses 0 or 1) using the null model $\hat f = \hat b_0 + \hat f_{(1)} + \cdots + \hat f_{(\eta)}$ as the true logit function. We sample a total of $T$ independent sets of data $(x_1, y^{*(\eta)}_{1t}), \ldots, (x_n, y^{*(\eta)}_{nt})$, $t = 1, \ldots, T$, from the null model $\hat f$, fit the main-effects model for each set, and compute $L^{*(\eta+1)}_t$, $t = 1, \ldots, T$. If exactly $k$ of the simulated $L^{*(\eta+1)}$ values exceed $L_{(\eta+1)}$ and none equals it, then the Monte Carlo $p$ value is $\frac{k+1}{T+1}$. (See Davison and Hinkley 1997 for an introduction to the Monte Carlo bootstrap test.)

Sequential Monte Carlo Bootstrap Test Algorithm

Step 1. Let $\eta = 0$ and $\hat f = \hat b_0$. Test $H_0: L_{(1)} = 0$ versus $H_1: L_{(1)} > 0$. Generate $T$ independent sets of data $(x_1, y^{*(0)}_{1t}), \ldots, (x_n, y^{*(0)}_{nt})$, $t = 1, \ldots, T$, from $\hat f = \hat b_0$. Fit the LBP main-effects model and compute the Monte Carlo $p$ value $\hat p_0$. If $\hat p_0 < \alpha$, go to Step 2; otherwise, stop and define $\hat q$ as any number slightly larger than $L_{(1)}$.

Step 2. Let $\eta = \eta + 1$ and $\hat f = \hat b_0 + \hat f_{(1)} + \cdots + \hat f_{(\eta)}$. Test $H_0: L_{(\eta+1)} = 0$ versus $H_1: L_{(\eta+1)} > 0$. Generate $T$ independent sets of data $(x_1, y^{*(\eta)}_{1t}), \ldots, (x_n, y^{*(\eta)}_{nt})$ based on $\hat f$, fit the main-effects model, and compute the Monte Carlo $p$ value $\hat p_\eta$. If $\hat p_\eta < \alpha$ and $\eta < d - 1$, repeat Step 2; if $\hat p_\eta < \alpha$ and $\eta = d - 1$, go to Step 3; otherwise, stop and define $\hat q = L_{(\eta)}$.

Step 3. Stop the procedure and define $\hat q = L_{(d)}$.
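A compact sketch of this algorithm follows. Here fit_lbp_components (returning the vector of per-variable empirical L1 scores for given responses) and null_logits (returning the fitted logits of the null model containing the accepted components) are hypothetical helpers, and T = 50 gives the 1/51 p value granularity seen in the data examples.

```python
# Sketch of the sequential Monte Carlo bootstrap threshold search.
import numpy as np

def sequential_threshold(y, fit_lbp_components, null_logits,
                         T=50, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    scores = fit_lbp_components(y)            # per-variable L1 scores (array)
    order = np.argsort(scores)[::-1]          # decreasing L1 rank order
    prev = None
    for eta in range(len(order)):
        f0 = null_logits(order[:eta], y)      # logits of the current null model
        L_obs = scores[order[eta]]            # L_(eta+1), the score under test
        exceed = 0
        for _ in range(T):
            y_star = rng.binomial(1, 1.0/(1.0 + np.exp(-f0)))
            exceed += fit_lbp_components(y_star)[order[eta]] > L_obs
        if (exceed + 1) / (T + 1) >= alpha:   # fail to reject H0: stop
            return 1.001 * L_obs if eta == 0 else prev
        prev = L_obs
    return prev                               # all d components pass: q = L_(d)
```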

The order of entry for sequential testing of the terms being entertained for the model is determined by the magnitude of the component $L_1$ norms. There are other reasonable ways to determine the order of entry, and the particular strategy used can affect the results, as is the case for any stepwise procedure. For the LBP approach, the relative ranking among the important terms usually does not affect the final component selection, as long as the important terms are all ranked higher than the unimportant terms. Thus our procedure usually can distinguish between important and unimportant terms, as in most of our examples. When the distinction between important and unimportant terms is ambiguous with our method, the ambiguous terms can be recognized, and further investigation of these terms will be needed.

5. NUMERICAL COMPUTATION

Because the objective function in either (5) or (6) is not differentiable with respect to the coefficients $b$'s and $c$'s, some numerical methods for optimization fail to solve this kind of problem. Using standard mathematical programming techniques, we can replace the $l_1$ norms in the objective function by nonnegative variables constrained linearly to equal the corresponding absolute values (e.g., writing $b_\alpha = b_\alpha^+ - b_\alpha^-$ with $b_\alpha^+, b_\alpha^- \ge 0$ and $|b_\alpha| = b_\alpha^+ + b_\alpha^-$), and then solve a series of programs with nonlinear objective functions and linear constraints. Many optimization methods can solve such problems. We use MINOS (Murtagh and Saunders 1983) because it generally performs well with linearly constrained models and returns consistent results.

Consider choosing the optimal λ by selecting the grid point at which ranGACV achieves its minimum. For each λ, to find ranGACV, the program (5), (6), (7), or (8) must be solved twice: once with y (the original problem) and once with y + ε (the perturbed problem). This often results in hundreds or thousands of individual solves, depending on the range for λ. To obtain solutions in a reasonable amount of time, we use an efficient solving approach known as slice modeling (see Ferris and Voelker 2000, 2001). Slice modeling is an approach to solving a series of mathematical programs with the same structure but different data. Because for LBP we can consider the values (λ, y) to be individual slices of data (because only these values change between solves), the program can be reduced to an example of nonlinear slice modeling. By applying slice-modeling ideas (namely, maintaining the common program structure and "core" data shared between solves, and using previous solutions as starting points for later solves), we can improve efficiency and make the grid search feasible. We have developed efficient code for the main-effects and two-way interaction LBP models. The code is easy to use and runs fast; for example, in one simulation example with n = 1000, d = 10, and λ fixed, the main-effects model takes less than 2 seconds, and the two-way interaction model takes less than 50 seconds.
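The spirit of slice modeling can be imitated in other toolchains. The sketch below uses cvxpy Parameters so that the program structure is built once and each grid point warm-starts from the previous solution; score is a hypothetical callback (such as ranGACV), and the article's actual implementation uses MINOS under GAMS, so this is only an analogy.

```python
# Sketch: a warm-started grid search in the slice-modeling spirit.
import numpy as np
import cvxpy as cp

def grid_search(T_mat, R_mat, y, lam_grid, score):
    n = T_mat.shape[0]
    b = cp.Variable(T_mat.shape[1])
    c = cp.Variable(R_mat.shape[1])
    lam_pi = cp.Parameter(nonneg=True)         # "slice" data: changes per solve
    lam_s = cp.Parameter(nonneg=True)
    yv = cp.Parameter(n)
    f = T_mat @ b + R_mat @ c
    obj = cp.sum(cp.multiply(-yv, f) + cp.logistic(f)) / n \
          + lam_pi * cp.norm1(b[1:]) + lam_s * cp.norm1(c)
    prob = cp.Problem(cp.Minimize(obj))        # core structure built once
    yv.value = y
    best = (np.inf, None)
    for lp, ls in lam_grid:
        lam_pi.value, lam_s.value = lp, ls
        prob.solve(warm_start=True)            # reuse the previous solution
        best = min(best, (score(b.value, c.value, lp, ls), (lp, ls)))
    return best
```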

6. SIMULATION

6.1 Simulation 1: Main-Effects Model

In this example, there are a total of $d = 10$ covariates, $X^{(1)}, \ldots, X^{(10)}$, taken to be independently uniformly distributed in $[0,1]$. The sample size is $n = 1000$. We use the simple random subsampling technique to select $N = 50$ basis functions. The perturbation $\epsilon$ is distributed as Bernoulli(.5), taking the two values $\{+.25, -.25\}$. Four variables, $X^{(1)}$, $X^{(3)}$, $X^{(6)}$, and $X^{(8)}$, are important, and the others are noise variables. The true conditional logit function is
$$f(x) = \frac{4}{3}x^{(1)} + \pi \sin\bigl(\pi x^{(3)}\bigr) + 8\bigl(x^{(6)}\bigr)^5 + \frac{2}{e-1}\,e^{x^{(8)}} - 5.$$
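The design is easy to reproduce; the following lines transcribe it directly (with an illustrative seed).

```python
# Sketch: generating data from the Simulation 1 design.
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 10
X = rng.uniform(size=(n, d))
f = (4/3)*X[:, 0] + np.pi*np.sin(np.pi*X[:, 2]) + 8*X[:, 5]**5 \
    + (2/(np.e - 1))*np.exp(X[:, 7]) - 5      # X^(1), X^(3), X^(6), X^(8)
y = rng.binomial(1, 1.0/(1.0 + np.exp(-f)))   # Bernoulli responses
```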

We fit the main-effects LBP model and search the parameters $(\lambda_\pi, \lambda_s)$ globally. Because the true $f$ is known, both CKL and ranGACV can be used for choosing the $\lambda$'s. Figure 1 depicts the values of CKL($\lambda$) and ranGACV($\lambda$) as functions of $(\lambda_\pi, \lambda_s)$ within the region of interest $[2^{-20}, 2^{-1}] \times [2^{-20}, 2^{-1}]$. The top row gives the contours for CKL($\lambda$) and ranGACV($\lambda$), where the white cross "x" indicates the location of the optimal regularization parameter; here $\hat\lambda_{\mathrm{CKL}} = (2^{-17}, 2^{-15})$ and $\hat\lambda_{\mathrm{ranGACV}} = (2^{-8}, 2^{-15})$. The bottom row shows their three-dimensional plots. In general, ranGACV($\lambda$) approximates CKL($\lambda$) quite well globally.

Figure 1. Contours and Three-Dimensional Plots for CKL(λ) and GACV(λ).

Using the optimal parameters, we fit the main-effects model and calculate the $L_1$ norm scores for the individual components $\hat f_1, \ldots, \hat f_{10}$. Figure 2 plots the two sets of $L_1$ norm scores, obtained using $\hat\lambda_{\mathrm{CKL}}$ and $\hat\lambda_{\mathrm{ranGACV}}$, in decreasing order. The dashed line indicates the threshold chosen by the proposed sequential Monte Carlo bootstrap test algorithm. Using this threshold, variables $X^{(6)}$, $X^{(3)}$, $X^{(1)}$, and $X^{(8)}$ are correctly selected as "important" variables.

Figure 2. L1 Norm Scores for the Main-Effects Model (CKL; GACV).

Figure 3 depicts the procedure of the sequential bootstrap tests to determine $\hat q$. We fit the main-effects model using $\hat\lambda_{\mathrm{ranGACV}}$ and sequentially test the hypotheses $H_0: L_{(\eta)} = 0$ versus $H_1: L_{(\eta)} > 0$, $\eta = 1, \ldots, 10$. In each plot of Figure 3, gray circles denote the $L_1$ norms for the variables in the null model, and black circles denote the $L_1$ norms for those not in the null model. (In a color plot, gray circles are shown in green and black circles in blue.) Along the horizontal axis, the variable being tested for importance is bracketed by a pair of marks. Our experiment shows that the null hypotheses of the first four tests are all rejected at level $\alpha = .05$, based on their Monte Carlo $p$ value $1/51 = .02$. However, the null hypothesis for the fifth component, $\hat f_2$, is accepted, with $p$ value $10/51 = .20$. Thus $\hat f_6$, $\hat f_3$, $\hat f_1$, and $\hat f_8$ are selected as "important" components, and $\hat q = L_{(4)} = .21$.

Figure 3. Monte Carlo Bootstrap Tests for Simulation 1 [(a) original, boot 1; (b) original, boot 2; (c) original, boot 3; (d) original, boot 4; (e) original, boot 5].

Figure 4. True and Estimated Univariate Logit Components: (a) X1; (b) X3; (c) X6; (d) X8.

In addition to selecting important variables, LBP also produces functional estimates for the individual components in the model. Figure 4 plots the true main effects $f_1$, $f_3$, $f_6$, and $f_8$ and their estimates fitted using $\hat\lambda_{\mathrm{ranGACV}}$. In each panel, the dashed line denotes the true curve and the solid line denotes the corresponding estimate. In general, the fitted main-effects model provides a reasonably good estimate of each important component. Altogether, we generated 20 datasets and fitted the main-effects model for each dataset, with the regularization parameters tuned separately. Throughout all 20 runs, variables $X^{(1)}$, $X^{(3)}$, $X^{(6)}$, and $X^{(8)}$ are always the four top-ranked variables. The results and figures shown here are based on the first dataset.

6.2 Simulation 2: Two-Factor Interaction Model

There are $d = 4$ continuous covariates, independently and uniformly distributed in $[0,1]$. The true model is a two-factor interaction model, and the important effects are $X^{(1)}$, $X^{(2)}$, and their interaction. The true logit function $f$ is
$$f(x) = 4x^{(1)} + \pi \sin\bigl(\pi x^{(1)}\bigr) + 6x^{(2)} - 8\bigl(x^{(2)}\bigr)^3 + \cos\bigl(2\pi\bigl(x^{(1)} - x^{(2)}\bigr)\bigr) - 4.$$
We choose $n = 1000$ and $N = 50$ and use the same perturbation $\epsilon$ as in the previous example. There are five tuning parameters ($\lambda_\pi$, $\lambda_{\pi\pi}$, $\lambda_s$, $\lambda_{\pi s}$, and $\lambda_{ss}$) in the two-factor interaction model. In practice, extra constraints may be added on the parameters for different needs. Here we penalize all of the two-factor interaction terms equally by setting $\lambda_{\pi\pi} = \lambda_{\pi s} = \lambda_{ss}$. The optimal parameters are $\hat\lambda_{\mathrm{CKL}} = (2^{-10}, 2^{-10}, 2^{-15}, 2^{-10}, 2^{-10})$ and $\hat\lambda_{\mathrm{ranGACV}} = (2^{-20}, 2^{-20}, 2^{-18}, 2^{-20}, 2^{-20})$. Figure 5 plots the ranked $L_1$ norm scores. The dashed line represents the threshold $\hat q$ chosen by the Monte Carlo bootstrap test procedure. The LBP two-factor interaction model, using either $\hat\lambda_{\mathrm{CKL}}$ or $\hat\lambda_{\mathrm{ranGACV}}$, correctly selects all of the important effects: $X^{(1)}$, $X^{(2)}$, and their interaction.

Figure 5. L1 Norm Scores for the Two-Factor Interaction Model (CKL; GACV).

There is a strong interaction effect between variables $X^{(1)}$ and $X^{(2)}$, which is shown clearly by the cross-sectional plots in Figure 6. The solid lines are the cross-sections of the true logit function $f(x^{(1)}, x^{(2)})$ at the distinct values $x^{(1)} = .2, .5, .8$, and the dashed lines are the corresponding estimates given by the LBP model. The parameters are tuned by the ranGACV criterion.

Figure 6. Cross-Sectional Plots for the Two-Factor Interaction Model: (a) x⁽¹⁾ = .2; (b) x⁽¹⁾ = .5; (c) x⁽¹⁾ = .8 (solid: true; dashed: estimate).

6.3 Simulation 3: Main-Effects Model Incorporating Categorical Variables

In this example, among the 12 covariates, $X^{(1)}, \ldots, X^{(10)}$ are continuous and $Z^{(1)}$ and $Z^{(2)}$ are categorical. The continuous variables are uniformly distributed in $[0,1]$, and the categorical variables are Bernoulli(.5) with values 0/1. The true logit function is

$$f(x) = \frac{4}{3}x^{(1)} + \pi \sin\bigl(\pi x^{(3)}\bigr) + 8\bigl(x^{(6)}\bigr)^5 + \frac{2}{e-1}\,e^{x^{(8)}} + 4z^{(1)} - 7.$$

The important main effects are $X^{(1)}$, $X^{(3)}$, $X^{(6)}$, $X^{(8)}$, and $Z^{(1)}$. The sample size is $n = 1000$, and the basis size is $N = 50$. We use the same perturbation $\epsilon$ as in the previous examples. The main-effects model incorporating categorical variables in (7) is fitted. Figure 7 plots the ranked $L_1$ norm scores for all of the covariates. The LBP main-effects models using $\hat\lambda_{\mathrm{CKL}}$ and $\hat\lambda_{\mathrm{ranGACV}}$ both correctly select the important continuous and categorical variables.

Figure 7. L1 Norm Scores for the Main-Effects Model Incorporating Categorical Variables (CKL; GACV).

7. WISCONSIN EPIDEMIOLOGIC STUDY OF DIABETIC RETINOPATHY

The Wisconsin Epidemiologic Study of Diabetic Retinopathy (WESDR) is an ongoing epidemiologic study of a cohort of patients receiving their medical care in southern Wisconsin. All younger-onset diabetic persons (defined as under 30 years of age at diagnosis and taking insulin) and a probability sample of older-onset persons receiving primary medical care in an 11-county area of southwestern Wisconsin in 1979–1980 were invited to participate. Among 1,210 identified younger-onset patients, 996 agreed to participate in the baseline examination in 1980–1982; of those, 891 participated in the first follow-up examination. Details about the study have been given by Klein, Klein, Moss, Davis, and DeMets (1984a,b, 1989), Klein, Klein, Moss, and Cruickshanks (1998), and others. A large number of medical, demographic, ocular, and other covariates were recorded in each examination. A multilevel retinopathy score was assigned to each eye, based on its stereoscopic color fundus photographs. This dataset has been extensively analyzed using a variety of statistical methods (see, e.g., Craig, Fryback, Klein, and Klein 1999; Kim 1995).

In this section we examine the relation of a large number of possible risk factors at baseline to the 4-year progression of diabetic retinopathy. Each person's retinopathy score was defined as the score for the worse eye, and 4-year progression of retinopathy was considered to occur if the retinopathy score degraded two levels from baseline. Wahba et al. (1995) examined risk factors for progression of diabetic retinopathy on a subset of the younger-onset group whose members had no or nonproliferative retinopathy at baseline; the dataset comprised 669 persons. A model of the risk of progression of diabetic retinopathy in this population was built using a SS–ANOVA model (which has a quadratic penalty functional), using the predictor variables glycosylated hemoglobin (gly), duration of diabetes (dur), and body mass index (bmi); these variables are described further in Appendix C. That study began with these variables and two other (nonindependent) variables, age at baseline and age at diagnosis; these latter two were eliminated at the start. Although it was not discussed by Wahba et al. (1995), we report here that that study began with a large number (perhaps about 20) of potential risk factors, which were reduced to gly, dur, and bmi as likely the most important after many extended and laborious parametric and nonparametric regression analyses of small groups of variables at a time, and by linear logistic regression, by the authors and others. At that time it was recognized that a (smoothly) nonparametric model selection method that could rapidly flag important variables in a dataset with many candidate variables was very desirable. For the purposes of the present study, we make the reasonable assumption that gly, dur, and bmi are the "truth" (i.e., the most important risk factors in the analyzed population), and thus we are presented with a unique opportunity to examine the behavior of the LBP method in a real dataset where arguably the truth is known, by giving it many variables in this dataset and comparing the results to those of Wahba et al. (1995). Minor corrections and updates of that dataset have been made (but are not believed to affect the conclusions), and we have 648 persons in the updated dataset used here.

We performed some preliminary winnowing of the many potential prediction variables available, reducing the set for examination to 14 potential risk factors. The continuous covariates are dur, gly, bmi, sys, ret, pulse, ins, sch, and iop, and the categorical covariates are smk, sex, asp, famdb, and mar; the full names for these are given in Appendix C. Because the true f is not known for real data, only ranGACV is available for tuning λ. Figure 8 plots the L1 norm scores of the individual functional components in the fitted LBP main-effects model. The dashed line indicates the threshold q̂ = .39, which is chosen by the sequential bootstrap tests.

We note that the LBP picks out the three most important variables, gly, dur, and bmi, specified by Wahba et al. (1995). The LBP also chose sch (highest year of school/college completed). This variable frequently shows up in demographic studies when one looks for it, because it is likely a proxy for other variables related to disease, such as lifestyle and quality of medical care. It did show up in preliminary studies of Wahba et al. (1995) (not reported there), but was not included because it was not considered a direct cause of disease itself. The sequential Monte Carlo bootstrap tests for gly, dur, sch, and bmi all have p value 1/51 = .02; thus these four covariates are selected as important risk factors at the significance level α = .05.

Figure 8. L1 Norm Scores for the WESDR Main-Effects Model.

Figure 9. Estimated Logit Component for dur.

Figure 9 plots the estimated logit for dur The risk of progres-sion of diabetic retinopathy increases up to a duration of about15 years then decreases thereafter which generally agrees withthe analysis of Wahba et al (1995) The linear logistic regres-sion model (using the function glm in the R package) shows thatdur is not significant at level α = 05 The curve in Figure 9 ex-hibits a hilly shape which means that a quadratic function fitsthe curve better than a linear function We refit the linear lo-gistic model by intentionally including dur2 the term dur2 issignificant with a p value of 02 This fact confirms the discov-ery of the LBP and shows that LBP can be a valid screeningtool to aid in determining the appropriate functional form forthe individual covariate

When fitting the two-factor interaction model in (6) with theconstraints λππ = λπs = λss we did not find the durndashbmi inter-action of Wahba et al (1995) here We note that the interactionterms tend to be washed out if there are only a few interactionsHowever further exploratory analysis may be done by rearrang-ing the constraints andor varying the tuning parameters subjec-tively Note that the solution to the optimization problem is verysparse In this example we observed that approximately 90 ofthe coefficients are 0rsquos in the solution

8 BEAVER DAM EYE STUDY

The Beaver Dam Eye Study (BDES) is an ongoing popula-tion-based study of age-related ocular disorders It aims tocollect information related to the prevalence incidence andseverity of age-related cataract macular degeneration and di-abetic retinopathy Between 1987 and 1988 5924 eligiblepeople (age 43ndash84 years) were identified in Beaver DamWisconsin and of those 4926 (831) participated in thebaseline exam 5- and 10-year follow-up data have been col-lected and results are being reported Many variables of variouskinds have been collected including mortality between base-line and the follow-ups A detailed description of the study hasbeen given by Klein Klein Linton and DeMets (1991) recentreports include that of Klein Klein Lee Cruickshanks andChappell (2001)

We are interested in the relation between the 5-year mor-tality occurrence for the nondiabetic study participants andpossible risk factors at baseline We focus on the nondiabeticparticipants because the pattern of risk factors for people with

diabetes and the rest of the population differs We consider10 continuous and 8 categorical covariates the detailed in-formation for which is given in Appendix D The abbreviatednames of continuous covariates are pky sch inc bmi glu calchl hgb sys and age and those of the categorical covariatesare cv sex hair hist nout mar sum and vtm We deliberatelytake into account some ldquonoisyrdquo variables in the analysis in-cluding hair nout and sum which are not directly related tomortality in general We include these to show the performanceof the LBP approach and they are not expected to be pickedout eventually by the model Y is assigned a value of 1 if a per-son participated in the baseline examination and died beforethe start of the first 5-year follow-up and 0 otherwise Thereare 4422 nondiabetic study participants in the baseline exam-ination 395 of whom have missing data in the covariates Forthe purpose of this study we assume that the missing data aremissing at random thus these 335 subjects are not includedin our analysis This assumption is not necessarily valid be-cause age blood pressure body mass index cholesterol sexsmoking and hemoglobin may well affect the missingness buta further examination of the missingness is beyond the scopeof the present study In addition we exclude another 10 partic-ipants who have either outlier values pky gt 158 or very abnor-mal records bmi gt 58 or hgb lt 6 Thus we report an analysisof the remaining 4017 nondiabetic participants from the base-line population

We fit the main-effects model incorporating categorical vari-ables in (7) The sequential Monte Carlo bootstrap tests for sixcovariates age hgb pky sex sys and cv all have Monte Carlop values 151

= 02 whereas the test for glu is not signifi-cant with a p value 951 = 18 The threshold is chosen asq = L(6) = 25 Figure 10 plots the L1 norm scores for all ofthe potential risk factors Using the threshold (dashed line) 25chosen by the sequential bootstrap test procedure the LBPmodel identifies six important risk factors age hgb pky sexsys and cv for the 5-year mortality

Compared with the LBP model the linear logistic model withstepwise selection using the AIC implemented by the func-tion glm in R misses the variable sys but selects three morevariables inc bmi and sum Figure 11 depicts the estimatedunivariate logit components for the important continuous vari-ables selected by the LBP model All of the curves can be ap-proximated reasonably well by linear models except sys whosefunctional form exhibits a quadratic shape This explains whysys is not selected by the linear logistic model When we refitthe logistic regression model by including sys2 in the modelthe stepwise selection picked out both sys and sys2

Figure 10 L1 Norm Scores for the BDES Main-Effects Model

Zhang et al Likelihood Basis Pursuit 669

(a) (b)

(c) (d)

Figure 11 Estimated Univariate Logit Components for ImportantVariables (a) age (b) hgb (c) pky (d) sys

9 DISCUSSION

We have proposed the LBP approach for variable selection inhigh-dimensional nonparametric model building In the spirit ofLASSO LBP produces shrinkage functional estimates by im-posing the l1 penalty on the coefficients of the basis functionsUsing the proposed measure of importance for the functionalcomponents LBP effectively selects important variables andthe results are highly interpretable LBP can handle continuousvariables and categorical variables simultaneously Although inthis article our continuous variables have all been on subsets ofthe real line it is clear that other continuous domains are possi-ble We have used LBP in the context of the Bernoulli distribu-tion but it can be extended to other exponential distributions aswell (of course to Gaussian data) We expect that larger num-bers of variables than that considered here may be handled andwe expect that there will be many other scientific applicationsof the method We plan to provide freeware for public use

We believe that this method is a useful addition to the toolboxof the data analyst It provides a way to examine the possibleeffects of a large number of variables in a nonparametric man-ner complementary to standard parametric models in its abilityto find nonparametric terms that may be missed by paramet-ric methods It has an advantage over quadratically penalizedlikelihood methods when the aim is to examine a large numberof variables or terms simultaneously inasmuch as the l1 penal-ties result in sparse solutions In practice it can be an efficienttool for examining complex datasets to identify and prioritizevariables (and possibly interactions) for further study Basedonly on the variables or interactions identified by the LBP onecan build more traditional parametric or penalized likelihoodmodels for which confidence intervals and theoretical proper-ties are known

APPENDIX A PROOF OF LEMMA 1

For i = 1 n we have minusl(micro[minusi]λ (xi ) τ ) = minusmicro

[minusi]λ (xi )τ + b(τ)

Let fλ[minusi] be the minimizer of the objective function

1

n

sum

j =i

[minusl(yj f (xj )

)]+ Jλ(f ) (A1)

Because

part(minusl(micro[minusi]λ (xi ) τ ))

partτ= minusmicro

[minusi]λ (xi ) + b(τ )

and

part2(minusl(micro[minusi]λ (xi ) τ ))

part2τ= b(τ ) gt 0

we see that minusl(micro[minusi]λ (xi ) τ ) achieves its unique minimum at τ that

satisfies b(τ ) = micro[minusi]λ (xi ) So τ = f

[minusi]λ (xi ) Then for any f we have

minusl(micro

[minusi]λ (xi ) f

[minusi]λ (xi )

)le minusl(micro

[minusi]λ (xi ) f (xi )

) (A2)

Define yminusi = (y1 yiminus1micro[minusi]λ (xi ) yi+1 yn)prime For any f

Iλ(fyminusi )

= 1

n

minusl

(micro

[minusi]λ (xi ) f (xi )

)+sum

j =i

[minusl(yj f (xj )

)]+ Jλ(f )

ge 1

n

minusl

(micro

[minusi]λ (xi ) f

[minusi]λ (xi )

)+sum

j =i

[minusl(yj f (xj )

)]+ Jλ(f )

ge 1

n

minusl

(micro

[minusi]λ (xi ) f

[minusi]λ (xi )

)+sum

j =i

[minusl(yj f

[minusi]λ (xj )

)]

+ Jλ

(f

[minusi]λ

)

The first inequality comes from (A2) The second inequality is dueto the fact that f

[minusi]λ (middot) is the minimizer of (A1) Thus we have that

hλ(imicro[minusi]λ (xi ) middot) = f

[minusi]λ (middot)



Direct calculation of CV (see, e.g., Gu 2002) involves computing n leave-one-out estimates, which is expensive and almost infeasible for large-scale problems. Thus we derive a second-order approximate cross-validation (ACV) score for the CV. (See Xiang and Wahba 1996; Lin et al. 2000; Gao, Wahba, Klein, and Klein 2001 for the ACV in traditional smoothing spline models.) We first establish the leave-one-out lemma for the LBP models. The proof of Lemma 1 is given in Appendix A.

Lemma 1 (Leave-One-Out Lemma). Denote the objective function in the LBP model by

I_λ(f, y) = L(y, f) + J_λ(f).   (10)

Let f_λ^{[−i]} be the minimizer of I_λ(f, y) with the ith observation omitted, and let μ_λ^{[−i]}(·) be the mean function corresponding to f_λ^{[−i]}(·). For any v ∈ R, define the vector V = (y_1, …, y_{i−1}, v, y_{i+1}, …, y_n)′. Let h_λ(i, v, ·) be the minimizer of I_λ(f, V); then h_λ(i, μ_λ^{[−i]}(x_i), ·) = f_λ^{[−i]}(·).

Using Taylor series approximations and Lemma 1, we can derive (in Appendix B) the ACV score

ACV(λ) = (1/n) Σ_{i=1}^n [−y_i f_λ(x_i) + b(f_λ(x_i))] + (1/n) Σ_{i=1}^n h_{ii} y_i (y_i − μ_λ(x_i)) / (1 − σ^2_{λi} h_{ii}),   (11)

where σ^2_{λi} ≡ p_λ(x_i)(1 − p_λ(x_i)) and h_{ii} is the iith entry of the matrix H defined in (B.9) (more details have been given in Zhang 2002). Let W be the diagonal matrix with σ^2_{λi} in the iith position. By replacing h_{ii} with (1/n) Σ_{i=1}^n h_{ii} ≡ (1/n) tr(H), and replacing 1 − σ^2_{λi} h_{ii} with (1/n) tr[I − W^{1/2}HW^{1/2}] in (11), we obtain the GACV score

GACV(λ) = (1/n) Σ_{i=1}^n [−y_i f_λ(x_i) + b(f_λ(x_i))] + (tr(H)/n) Σ_{i=1}^n y_i (y_i − μ_λ(x_i)) / tr[I − W^{1/2}HW^{1/2}].   (12)

3.2 Randomized GACV

Direct computation of (12) involves the inversion of a large-scale matrix whose size depends on the sample size n, the basis size N, and the dimension d. Large values of n, N, or d may make the computation expensive and produce unstable solutions. Thus the randomized GACV (ranGACV) score is proposed as a computable proxy for GACV. Essentially, we use randomized trace estimates for tr(H) and tr[I − W^{1/2}HW^{1/2}], based on the following theorem (which has been exploited by numerous authors; e.g., Girard 1998):

If A is any square matrix and ε is a mean-0 random n-vector with independent components with variance σ_ε^2, then (1/σ_ε^2) E ε^T A ε = tr(A).

Let ε = (ε_1, …, ε_n)′ be a mean-0 random n-vector of independent components with variance σ_ε^2. Let f_λ^y and f_λ^{y+ε} be the minimizers of (5) using the original data y and the perturbed data y + ε. Using the derivation procedure of Lin et al. (2000), we get the ranGACV score for the LBP estimates:

ranGACV(λ) = (1/n) Σ_{i=1}^n [−y_i f_λ(x_i) + b(f_λ(x_i))] + (ε^T (f_λ^{y+ε} − f_λ^y)/n) Σ_{i=1}^n y_i (y_i − μ_λ(x_i)) / [ε^T ε − ε^T W (f_λ^{y+ε} − f_λ^y)].   (13)

In addition, two ideas help reduce the variance of the second term in (13):

1. Choose ε as Bernoulli(.5), taking the two values +σ_ε and −σ_ε. This guarantees that the randomized trace estimate has minimal variance for a fixed σ_ε^2 (see Hutchinson 1989).
2. Generate U independent perturbations ε^{(u)}, u = 1, …, U, compute U replicated ranGACV scores, and use their average as the final GACV estimate.

A small numerical sketch of (13) follows.
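To make (13) concrete, here is a minimal Python sketch. It assumes Bernoulli data, so b(f) = log(1 + e^f); the vectors f_y and f_ye stand for the fitted logits from the original and perturbed solves, which the sketch takes as given rather than computing.

import numpy as np

def rangacv(y, f_y, f_ye, eps, mu, w):
    # y    : (n,) 0/1 responses
    # f_y  : (n,) fitted logits from the original data y
    # f_ye : (n,) fitted logits from the perturbed data y + eps
    # eps  : (n,) perturbation used for the second solve
    # mu   : (n,) fitted means mu_lambda(x_i)
    # w    : (n,) variances sigma^2_{lambda i} = mu * (1 - mu)
    n = len(y)
    obs = np.mean(-y * f_y + np.logaddexp(0.0, f_y))  # b(f) = log(1 + e^f)
    df = f_ye - f_y
    numerator = (eps @ df) / n * np.sum(y * (y - mu))
    denominator = eps @ eps - eps @ (w * df)
    return obs + numerator / denominator

# Minimal-variance Bernoulli(.5) perturbation on {+sigma, -sigma}:
#   eps = sigma * rng.choice([-1.0, 1.0], size=n)
# In practice the score is averaged over U independent draws of eps.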

4. SELECTION CRITERIA FOR MAIN EFFECTS AND TWO-FACTOR INTERACTIONS

4.1 The L1 Importance Measure

After choosing the optimal λ by the GACV or ranGACV criteria, the LBP estimate f̂_λ is obtained by minimizing (5), (6), (7), or (8). How to measure the importance of a particular component in the fitted model is a key question. We propose using the functional L1 norm as the importance measure. The sparsity in the solutions helps distinguish the significant terms from the insignificant ones effectively and thus improves the performance of our importance measure. In practice, for each functional component, its L1 norm is empirically calculated as the average of the absolute function values evaluated at all of the data points; for instance, L_1(f_α) = (1/n) Σ_{i=1}^n |f_α(x_i^{(α)})| and L_1(f_{αβ}) = (1/n) Σ_{i=1}^n |f_{αβ}(x_i^{(α)}, x_i^{(β)})|. For the categorical variables in model (7), the empirical L1 norm of the main effect f_γ is L_1(f_γ) = (1/n) Σ_{i=1}^n |f_γ(z_i^{(γ)})|, for γ = 1, …, r. The norms of the interaction terms involving categorical variables are defined similarly. The ranking of the L1 norm scores is used to rank the relative importance of the functional components; for instance, the component with the largest L1 norm is ranked as the most important one, and any component with a zero or tiny L1 norm is ranked as unimportant. We have also tried using the functional L2 norm to rank the components; this gave almost identical results in terms of the set of variables selected in numerous simulation studies (not reproduced here).
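In code, these empirical norms are simple averages (a minimal sketch; fhat_alpha and fhat_ab stand for fitted component functions, which this snippet takes as given):

import numpy as np

def l1_main(fhat_alpha, x_alpha):
    # empirical L1 norm of a main effect: (1/n) sum_i |f_alpha(x_i)|
    return np.mean(np.abs(fhat_alpha(x_alpha)))

def l1_interaction(fhat_ab, x_a, x_b):
    # empirical L1 norm of a two-factor interaction component
    return np.mean(np.abs(fhat_ab(x_a, x_b)))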

4.2 Choosing the Threshold

In this section we focus on the main-effects model. Using the chosen parameter λ, we obtain the estimated main-effects components f̂_1, …, f̂_d and calculate their L1 norms, L_1(f̂_1), …, L_1(f̂_d). We use a sequential procedure to select important terms. Denote the decreasingly ordered norms by L_(1), …, L_(d) and the corresponding components by f_(1), …, f_(d). A universal threshold value is needed to differentiate the important components from the unimportant ones; call this threshold q. Only variables with L1 norms greater than or equal to q are declared "important."


Now we develop a sequential Monte Carlo bootstrap test procedure to determine q. Essentially, we test the variables' importance one by one, in their L1 norm rank order. If a variable passes the test (hence is "important"), then it enters the null model for testing the next variable; otherwise the procedure stops. After the first η (0 ≤ η ≤ d − 1) variables have entered the model, deciding whether the next component f_(η+1) is important is a one-sided hypothesis-testing problem. When η = 0, the null model f is the constant, say f = b_0, and the hypotheses are H_0: L_(1) = 0 versus H_1: L_(1) > 0. When 1 ≤ η ≤ d − 1, the null model is f = b_0 + f_(1) + ··· + f_(η), and the hypotheses are H_0: L_(η+1) = 0 versus H_1: L_(η+1) > 0. Let the desired one-sided test level be α. If the null distribution of L_(η+1) were known, then we could get the critical value, the α-percentile, and make the decision to reject or accept. In practice, calculating the exact α-percentile is difficult or impossible; however, the Monte Carlo bootstrap test provides a convenient approximation to the full test. Conditional on the original covariates x_1, …, x_n, we generate y*_1^{(η)}, …, y*_n^{(η)} (responses 0 or 1) using the null model f = b_0 + f_(1) + ··· + f_(η) as the true logit function. We sample a total of T independent sets of data (x_1, y*^{(η)}_{1t}), …, (x_n, y*^{(η)}_{nt}), t = 1, …, T, from the null model f, fit the main-effects model for each set, and compute L*^{(η+1)}_t, t = 1, …, T. If exactly k of the simulated L*^{(η+1)} values exceed L_(η+1) and none equals it, then the Monte Carlo p value is (k + 1)/(T + 1). (See Davison and Hinkley 1997 for an introduction to the Monte Carlo bootstrap test.)

Sequential Monte Carlo Bootstrap Test Algorithm

Step 1. Let η = 0 and f = b_0. Test H_0: L_(1) = 0 versus H_1: L_(1) > 0. Generate T independent sets of data (x_1, y*^{(0)}_{1t}), …, (x_n, y*^{(0)}_{nt}), t = 1, …, T, from f = b_0. Fit the LBP main-effects model and compute the Monte Carlo p value p_0. If p_0 < α, then go to Step 2; otherwise stop and define q as any number slightly larger than L_(1).

Step 2. Let η = η + 1 and f = b_0 + f_(1) + ··· + f_(η). Test H_0: L_(η+1) = 0 versus H_1: L_(η+1) > 0. Generate T independent sets of data (x_1, y*^{(η)}_{1t}), …, (x_n, y*^{(η)}_{nt}) based on f, fit the main-effects model, and compute the Monte Carlo p value p_η. If p_η < α and η < d − 1, then repeat Step 2; if p_η < α and η = d − 1, then go to Step 3; otherwise stop and define q = L_(η).

Step 3. Stop the procedure and define q = L_(d).
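A compact sketch of the algorithm in Python follows; null_logits and refit_l1 are hypothetical hooks standing in for the null-model logit evaluation and the LBP refit (the actual fitting machinery is described in Section 5), so this illustrates the control flow rather than giving a full implementation.

import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_p(L_obs, L_sim):
    # p value (k + 1)/(T + 1), where k = #{simulated norms exceeding L_obs}
    return (int(np.sum(L_sim > L_obs)) + 1) / (len(L_sim) + 1)

def choose_threshold(X, null_logits, refit_l1, L_sorted, alpha=0.05, T=50):
    # L_sorted holds the observed ordered norms L_(1) >= ... >= L_(d)
    d = len(L_sorted)
    for eta in range(d):                          # stage eta tests L_(eta+1)
        p_null = 1.0 / (1.0 + np.exp(-null_logits(eta)))
        L_sim = np.array([refit_l1(X, rng.binomial(1, p_null), eta)
                          for _ in range(T)])
        if monte_carlo_p(L_sorted[eta], L_sim) >= alpha:
            # stop: only f_(1), ..., f_(eta) are declared important
            return L_sorted[eta - 1] if eta > 0 else 1.01 * L_sorted[0]
    return L_sorted[-1]                           # all d components pass

Here 1.01 * L_sorted[0] is one arbitrary choice of "any number slightly larger than L_(1)" from Step 1.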

The order of entry for sequential testing of the terms entertained for the model is determined by the magnitude of the component L1 norms. There are other reasonable ways to determine the order of entry, and the particular strategy used can affect the results, as is the case for any stepwise procedure. For the LBP approach, the relative ranking among the important terms usually does not affect the final component selection, as long as the important terms are all ranked higher than the unimportant terms. Thus our procedure usually can distinguish between important and unimportant terms, as in most of our examples. When the distinction between important and unimportant terms is ambiguous with our method, the ambiguous terms can be recognized, and further investigation will be needed for those terms.

5. NUMERICAL COMPUTATION

Because the objective function in either (5) or (6) is not differentiable with respect to the coefficients b's and c's, some numerical optimization methods fail on this kind of problem. Using standard mathematical programming techniques, we can replace the l1 norms in the objective function by nonnegative variables constrained linearly to equal the corresponding absolute values, and then solve a series of programs with nonlinear objective functions and linear constraints (one such splitting is sketched below). Many optimization methods can solve such problems. We use MINOS (Murtagh and Saunders 1983) because it generally performs well with linearly constrained models and returns consistent results.
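One standard way to carry out this replacement (a sketch; the article does not display the reformulation itself) is to split each penalized coefficient into nonnegative parts,

b_α = b_α^+ − b_α^−,   |b_α| = b_α^+ + b_α^−,   b_α^+, b_α^− ≥ 0,

and similarly c_j = c_j^+ − c_j^−. The penalty λ_d Σ_α |b_α| + λ_s Σ_j |c_j| then becomes linear in the nonnegative variables, the likelihood term stays smooth and convex, and all constraints are linear.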

Consider choosing the optimal λ by selecting the grid point at which ranGACV achieves its minimum. For each λ, computing ranGACV requires solving program (5), (6), (7), or (8) twice: once with y (the original problem) and once with y + ε (the perturbed problem). This often results in hundreds or thousands of individual solves, depending on the range for λ. To obtain solutions in a reasonable amount of time, we use an efficient solving approach known as slice modeling (see Ferris and Voelker 2000, 2001). Slice modeling is an approach to solving a series of mathematical programs with the same structure but different data. Because for LBP we can consider the values (λ, y) to be individual slices of data (because only these values change between solves), the program can be reduced to an example of nonlinear slice modeling. By applying slice modeling ideas (namely, maintaining the common program structure and "core" data shared between solves and using previous solutions as starting points for later solves), we can improve efficiency and make the grid search feasible. We have developed efficient code for the main-effects and two-way interaction LBP models. The code is easy to use and runs fast; for example, in one simulation example with n = 1000, d = 10, and λ fixed, the main-effects model takes less than 2 seconds and the two-way interaction model takes less than 50 seconds.

6. SIMULATION

6.1 Simulation 1: Main-Effects Model

In this example there are a total of d = 10 covariates, X^(1), …, X^(10), taken to be independently uniformly distributed in [0, 1]. The sample size is n = 1000. We use the simple random subsampling technique to select N = 50 basis functions. The perturbation ε is distributed as Bernoulli(.5), taking the two values +.25 and −.25. Four variables, X^(1), X^(3), X^(6), and X^(8), are important; the others are noise variables. The true conditional logit function is

f(x) = (4/3) x^(1) + π sin(πx^(3)) + 8 (x^(6))^5 + (2/(e − 1)) e^{x^(8)} − 5.
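For reference, the following lines generate one dataset from this setup (Python; the random seed is arbitrary):

import numpy as np

rng = np.random.default_rng(0)

n, d = 1000, 10
X = rng.uniform(size=(n, d))                 # X(1), ..., X(10) ~ U[0, 1]

# True logit: only X(1), X(3), X(6), X(8) enter.
f = (4.0 / 3.0) * X[:, 0] + np.pi * np.sin(np.pi * X[:, 2]) \
    + 8.0 * X[:, 5] ** 5 + 2.0 / (np.e - 1.0) * np.exp(X[:, 7]) - 5.0

p = 1.0 / (1.0 + np.exp(-f))                 # conditional probabilities
y = rng.binomial(1, p)                       # Bernoulli responses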

We fit the main-effects LBP model and search the parameters (λ_π, λ_s) globally. Because the true f is known, both CKL and ranGACV can be used for choosing the λ's. Figure 1 depicts the values of CKL(λ) and ranGACV(λ) as functions of (λ_π, λ_s) within the region of interest [2^{−20}, 2^{−1}] × [2^{−20}, 2^{−1}].


Figure 1. Contours and Three-Dimensional Plots for CKL(λ) and GACV(λ).

In the top row are the contours for CKL(λ) and ranGACV(λ), where the white cross "x" indicates the location of the optimal regularization parameter; here λ_CKL = (2^{−17}, 2^{−15}) and λ_ranGACV = (2^{−8}, 2^{−15}). The bottom row shows the corresponding three-dimensional plots. In general, ranGACV(λ) approximates CKL(λ) quite well globally.

Using the optimal parameters, we fit the main-effects model and calculate the L1 norm scores for the individual components f̂_1, …, f̂_10. Figure 2 plots the two sets of L1 norm scores obtained using λ_CKL and λ_ranGACV, in decreasing order; the dashed line indicates the threshold chosen by the proposed sequential Monte Carlo bootstrap test algorithm.

Figure 2. L1 Norm Scores for the Main-Effects Model (CKL and GACV).

Using this threshold, variables X^(6), X^(3), X^(1), and X^(8) are correctly selected as "important."

Figure 3 depicts the procedure of the sequential bootstrap tests to determine q. We fit the main-effects model using λ_ranGACV and sequentially test the hypotheses H_0: L_(η) = 0 versus H_1: L_(η) > 0, η = 1, …, 10.

Figure 3. Monte Carlo Bootstrap Tests for Simulation 1 [(a) original, boot 1; (b) original, boot 2; (c) original, boot 3; (d) original, boot 4; (e) original, boot 5].


Figure 4. True and Estimated Univariate Logit Components: (a) X1; (b) X3; (c) X6; (d) X8 (dashed: true; solid: estimated).

In each plot of Figure 3, gray circles denote the L1 norms for the variables in the null model, and black circles denote the L1 norms for those not in the null model. (In a color plot, gray circles are shown in green and black circles in blue.) Along the horizontal axis, the variable being tested for importance is bracketed by a pair of markers. Our experiment shows that the null hypotheses of the first four tests are all rejected at level α = .05, based on their Monte Carlo p value 1/51 = .02; however, the null hypothesis for the fifth component, f_2, is accepted, with p value 10/51 = .20. Thus f_6, f_3, f_1, and f_8 are selected as "important" components, and q = L_(4) = .21.

In addition to selecting important variables, LBP also produces functional estimates for the individual components in the model. Figure 4 plots the true main effects f_1, f_3, f_6, and f_8 and their estimates fitted using λ_ranGACV. In each panel, the dashed line denotes the true curve and the solid line denotes the corresponding estimate. In general, the fitted main-effects model provides a reasonably good estimate of each important component. Altogether we generated 20 datasets and fitted the main-effects model for each dataset, with regularization parameters tuned separately. Throughout all 20 runs, variables X^(1), X^(3), X^(6), and X^(8) are always the four top-ranked variables. The results and figures shown here are based on the first dataset.

6.2 Simulation 2: Two-Factor Interaction Model

There are d = 4 continuous covariates, independently and uniformly distributed in [0, 1]. The true model is a two-factor interaction model, and the important effects are X^(1), X^(2), and their interaction. The true logit function f is

f(x) = 4x^(1) + π sin(πx^(1)) + 6x^(2) − 8(x^(2))^3 + cos(2π(x^(1) − x^(2))) − 4.

We choose n = 1000 and N = 50 and use the same perturbation ε as in the previous example. There are five tuning parameters (λ_π, λ_ππ, λ_s, λ_πs, and λ_ss) in the two-factor interaction model. In practice, extra constraints may be added on the parameters for different needs; here we penalize all of the two-factor interaction terms equally by setting λ_ππ = λ_πs = λ_ss. The optimal parameters are λ_CKL = (2^{−10}, 2^{−10}, 2^{−15}, 2^{−10}, 2^{−10}) and λ_ranGACV = (2^{−20}, 2^{−20}, 2^{−18}, 2^{−20}, 2^{−20}).

Figure 5. L1 Norm Scores for the Two-Factor Interaction Model (CKL and GACV).

Figure 5 plots the ranked L1 norm scores. The dashed line represents the threshold q chosen by the Monte Carlo bootstrap test procedure. The LBP two-factor interaction model, using either λ_CKL or λ_GACV, correctly selects all of the important effects: X^(1), X^(2), and their interaction.

There is a strong interaction effect between variables X^(1) and X^(2), which is shown clearly by the cross-sectional plots in Figure 6. The solid lines are the cross-sections of the true logit function f(x^(1), x^(2)) at the distinct values x^(1) = .2, .5, .8, and the dashed lines are their corresponding estimates given by the LBP model. The parameters are tuned by the ranGACV criterion.

6.3 Simulation 3: Main-Effects Model Incorporating Categorical Variables

In this example, among the 12 covariates, X^(1), …, X^(10) are continuous and Z^(1) and Z^(2) are categorical.

Figure 6. Cross-Sectional Plots for the Two-Factor Interaction Model: (a) x^(1) = .2; (b) x^(1) = .5; (c) x^(1) = .8 (solid: true; dashed: estimated).


Figure 7. L1 Norm Scores for the Main-Effects Model Incorporating Categorical Variables (CKL and GACV).

The continuous variables are uniformly distributed in [0, 1], and the categorical variables are Bernoulli(.5) with values 0/1. The true logit function is

f(x) = (4/3) x^(1) + π sin(πx^(3)) + 8 (x^(6))^5 + (2/(e − 1)) e^{x^(8)} + 4z^(1) − 7.

The important main effects are X^(1), X^(3), X^(6), X^(8), and Z^(1). The sample size is n = 1000, and the basis size is N = 50. We use the same perturbation ε as in the previous examples. The main-effects model incorporating categorical variables in (7) is fitted. Figure 7 plots the ranked L1 norm scores for all of the covariates. The LBP main-effects models using λ_CKL and λ_GACV both select the important continuous and categorical variables correctly.

7. WISCONSIN EPIDEMIOLOGIC STUDY OF DIABETIC RETINOPATHY

The Wisconsin Epidemiologic Study of Diabetic Retinopathy (WESDR) is an ongoing epidemiologic study of a cohort of patients receiving their medical care in southern Wisconsin. All younger-onset diabetic persons (defined as under 30 years of age at diagnosis and taking insulin) and a probability sample of older-onset persons receiving primary medical care in an 11-county area of southwestern Wisconsin in 1979–1980 were invited to participate. Among 1210 identified younger-onset patients, 996 agreed to participate in the baseline examination in 1980–1982; of those, 891 participated in the first follow-up examination. Details about the study have been given by Klein, Klein, Moss, Davis, and DeMets (1984a,b, 1989), Klein, Klein, Moss, and Cruickshanks (1998), and others. A large number of medical, demographic, ocular, and other covariates were recorded in each examination. A multilevel retinopathy score was assigned to each eye based on its stereoscopic color fundus photographs. This dataset has been extensively analyzed using a variety of statistical methods (see, e.g., Craig, Fryback, Klein, and Klein 1999; Kim 1995).

In this section we examine the relation of a large number of possible risk factors at baseline to the 4-year progression of diabetic retinopathy. Each person's retinopathy score was defined as the score for the worse eye, and 4-year progression of retinopathy was considered to occur if the retinopathy score degraded two levels from baseline. Wahba et al. (1995) examined risk factors for progression of diabetic retinopathy on a subset of the younger-onset group whose members had no or nonproliferative retinopathy at baseline; that dataset comprised 669 persons. A model of the risk of progression of diabetic retinopathy in this population was built using an SS–ANOVA model (which has a quadratic penalty functional) with the predictor variables glycosylated hemoglobin (gly), duration of diabetes (dur), and body mass index (bmi); these variables are described further in Appendix C. That study began with these variables and two other (nonindependent) variables, age at baseline and age at diagnosis; these latter two were eliminated at the start. Although it was not discussed by Wahba et al. (1995), we report here that that study began with a large number (perhaps about 20) of potential risk factors, which were reduced to gly, dur, and bmi as likely the most important, after many extended and laborious parametric and nonparametric regression analyses of small groups of variables at a time, and by linear logistic regression, by the authors and others. At that time it was recognized that a (smoothly) nonparametric model selection method that could rapidly flag important variables in a dataset with many candidate variables was very desirable. For the purposes of the present study, we make the reasonable assumption that gly, dur, and bmi are the "truth" (i.e., the most important risk factors in the analyzed population); thus we are presented with a unique opportunity to examine the behavior of the LBP method in a real dataset where arguably the truth is known, by giving it many variables in this dataset and comparing the results with those of Wahba et al. (1995). Minor corrections and updatings of that dataset have been made (but are not believed to affect the conclusions), and we have 648 persons in the updated dataset used here.

We performed some preliminary winnowing of the many potential prediction variables available, reducing the set for examination to 14 potential risk factors. The continuous covariates are dur, gly, bmi, sys, ret, pulse, ins, sch, and iop, and the categorical covariates are smk, sex, asp, famdb, and mar; the full names are given in Appendix C. Because the true f is not known for real data, only ranGACV is available for tuning λ. Figure 8 plots the L1 norm scores of the individual functional components in the fitted LBP main-effects model. The dashed line indicates the threshold q = .39, which is chosen by the sequential bootstrap tests.

We note that the LBP picks out the three most important variables, gly, dur, and bmi, specified by Wahba et al. (1995). The LBP also chose sch (highest year of school/college completed). This variable frequently shows up in demographic studies when one looks for it, because it is likely a proxy for other variables related to disease, such as lifestyle and quality of medical care.

Figure 8. L1 Norm Scores for the WESDR Main-Effects Model.


Figure 9. Estimated Logit Component for dur.

It did show up in preliminary studies of Wahba et al. (1995) (not reported there), but was not included because it was not considered a direct cause of disease itself. The sequential Monte Carlo bootstrap tests for gly, dur, sch, and bmi all have p value 1/51 = .02; thus these four covariates are selected as important risk factors at significance level α = .05.

Figure 9 plots the estimated logit component for dur. The risk of progression of diabetic retinopathy increases up to a duration of about 15 years and decreases thereafter, which generally agrees with the analysis of Wahba et al. (1995). The linear logistic regression model (using the function glm in the R package) shows that dur is not significant at level α = .05. The curve in Figure 9 exhibits a hilly shape, which means that a quadratic function fits the curve better than a linear function. We refit the linear logistic model by intentionally including dur^2; the term dur^2 is significant, with a p value of .02. This fact confirms the discovery of the LBP and shows that LBP can be a valid screening tool to aid in determining the appropriate functional form for an individual covariate.

When fitting the two-factor interaction model in (6) with the constraints λ_ππ = λ_πs = λ_ss, we did not find the dur–bmi interaction of Wahba et al. (1995) here. We note that interaction terms tend to be washed out when there are only a few of them. However, further exploratory analysis may be done by rearranging the constraints and/or varying the tuning parameters subjectively. Note that the solution to the optimization problem is very sparse; in this example, we observed that approximately 90% of the coefficients are 0's in the solution.

8. BEAVER DAM EYE STUDY

The Beaver Dam Eye Study (BDES) is an ongoing population-based study of age-related ocular disorders. It aims to collect information related to the prevalence, incidence, and severity of age-related cataract, macular degeneration, and diabetic retinopathy. Between 1987 and 1988, 5924 eligible people (age 43–84 years) were identified in Beaver Dam, Wisconsin, and of those, 4926 (83.1%) participated in the baseline exam. Five- and ten-year follow-up data have been collected, and results are being reported. Many variables of various kinds have been collected, including mortality between baseline and the follow-ups. A detailed description of the study has been given by Klein, Klein, Linton, and DeMets (1991); recent reports include that of Klein, Klein, Lee, Cruickshanks, and Chappell (2001).

We are interested in the relation between the 5-year mortality of the nondiabetic study participants and possible risk factors at baseline. We focus on the nondiabetic participants because the pattern of risk factors differs between people with diabetes and the rest of the population. We consider 10 continuous and 8 categorical covariates, detailed information for which is given in Appendix D. The abbreviated names of the continuous covariates are pky, sch, inc, bmi, glu, cal, chl, hgb, sys, and age, and those of the categorical covariates are cv, sex, hair, hist, nout, mar, sum, and vtm. We deliberately take into account some "noisy" variables in the analysis, including hair, nout, and sum, which are not directly related to mortality in general; we include these to show the performance of the LBP approach, and they are not expected to be picked out eventually by the model. Y is assigned a value of 1 if a person participated in the baseline examination and died before the start of the first 5-year follow-up, and 0 otherwise. There are 4422 nondiabetic study participants in the baseline examination, 395 of whom have missing data in the covariates. For the purposes of this study we assume that the missing data are missing at random; thus these 395 subjects are not included in our analysis. This assumption is not necessarily valid, because age, blood pressure, body mass index, cholesterol, sex, smoking, and hemoglobin may well affect the missingness, but a further examination of the missingness is beyond the scope of the present study. In addition, we exclude another 10 participants who have either outlier values (pky > 158) or very abnormal records (bmi > 58 or hgb < 6). Thus we report an analysis of the remaining 4017 nondiabetic participants from the baseline population.

We fit the main-effects model incorporating categorical variables in (7). The sequential Monte Carlo bootstrap tests for six covariates, age, hgb, pky, sex, sys, and cv, all have Monte Carlo p values 1/51 = .02, whereas the test for glu is not significant, with a p value of 9/51 = .18. The threshold is chosen as q = L_(6) = .25. Figure 10 plots the L1 norm scores for all of the potential risk factors. Using the threshold (dashed line) .25 chosen by the sequential bootstrap test procedure, the LBP model identifies six important risk factors for the 5-year mortality: age, hgb, pky, sex, sys, and cv.

Compared with the LBP model, the linear logistic model with stepwise selection using the AIC, implemented by the function glm in R, misses the variable sys but selects three more variables: inc, bmi, and sum. Figure 11 depicts the estimated univariate logit components for the important continuous variables selected by the LBP model. All of the curves can be approximated reasonably well by linear models except that of sys, whose functional form exhibits a quadratic shape. This explains why sys is not selected by the linear logistic model. When we refit the logistic regression model including sys^2, the stepwise selection picked out both sys and sys^2.

Figure 10. L1 Norm Scores for the BDES Main-Effects Model.


Figure 11. Estimated Univariate Logit Components for Important Variables: (a) age; (b) hgb; (c) pky; (d) sys.

9. DISCUSSION

We have proposed the LBP approach for variable selection in high-dimensional nonparametric model building. In the spirit of the LASSO, LBP produces shrinkage functional estimates by imposing the l1 penalty on the coefficients of the basis functions. Using the proposed measure of importance for the functional components, LBP effectively selects important variables, and the results are highly interpretable. LBP can handle continuous variables and categorical variables simultaneously. Although in this article our continuous variables have all been on subsets of the real line, it is clear that other continuous domains are possible. We have used LBP in the context of the Bernoulli distribution, but it can be extended to other exponential family distributions as well (and of course to Gaussian data). We expect that larger numbers of variables than considered here may be handled, and we expect that there will be many other scientific applications of the method. We plan to provide freeware for public use.

We believe that this method is a useful addition to the toolbox of the data analyst. It provides a way to examine the possible effects of a large number of variables in a nonparametric manner, complementary to standard parametric models in its ability to find nonparametric terms that may be missed by parametric methods. It has an advantage over quadratically penalized likelihood methods when the aim is to examine a large number of variables or terms simultaneously, inasmuch as the l1 penalties result in sparse solutions. In practice it can be an efficient tool for examining complex datasets to identify and prioritize variables (and possibly interactions) for further study. Based only on the variables or interactions identified by the LBP, one can build more traditional parametric or penalized likelihood models, for which confidence intervals and theoretical properties are known.

APPENDIX A: PROOF OF LEMMA 1

For i = 1, …, n, we have −l(μ_λ^{[−i]}(x_i), τ) = −μ_λ^{[−i]}(x_i) τ + b(τ). Let f_λ^{[−i]} be the minimizer of the objective function

(1/n) Σ_{j ≠ i} [−l(y_j, f(x_j))] + J_λ(f).   (A.1)

Because

∂(−l(μ_λ^{[−i]}(x_i), τ))/∂τ = −μ_λ^{[−i]}(x_i) + b′(τ)

and

∂²(−l(μ_λ^{[−i]}(x_i), τ))/∂τ² = b″(τ) > 0,

we see that −l(μ_λ^{[−i]}(x_i), τ) achieves its unique minimum at the τ̂ that satisfies b′(τ̂) = μ_λ^{[−i]}(x_i), so τ̂ = f_λ^{[−i]}(x_i). Then for any f, we have

−l(μ_λ^{[−i]}(x_i), f_λ^{[−i]}(x_i)) ≤ −l(μ_λ^{[−i]}(x_i), f(x_i)).   (A.2)

Define y_{−i} = (y_1, …, y_{i−1}, μ_λ^{[−i]}(x_i), y_{i+1}, …, y_n)′. For any f,

I_λ(f, y_{−i}) = (1/n) { −l(μ_λ^{[−i]}(x_i), f(x_i)) + Σ_{j ≠ i} [−l(y_j, f(x_j))] } + J_λ(f)
  ≥ (1/n) { −l(μ_λ^{[−i]}(x_i), f_λ^{[−i]}(x_i)) + Σ_{j ≠ i} [−l(y_j, f(x_j))] } + J_λ(f)
  ≥ (1/n) { −l(μ_λ^{[−i]}(x_i), f_λ^{[−i]}(x_i)) + Σ_{j ≠ i} [−l(y_j, f_λ^{[−i]}(x_j))] } + J_λ(f_λ^{[−i]}).

The first inequality comes from (A.2); the second is due to the fact that f_λ^{[−i]}(·) is the minimizer of (A.1). Thus we have h_λ(i, μ_λ^{[−i]}(x_i), ·) = f_λ^{[−i]}(·).

APPENDIX B: DERIVATION OF ACV

Here we derive the ACV in (11) for the main-effects model; the derivation for two- or higher-order interaction models is similar. Write μ(x) = E(Y|X = x) and σ^2(x) = var(Y|X = x). For the functional estimate f, we define f_i = f(x_i) and μ_i = μ(x_i) for i = 1, …, n. To emphasize that an estimate is associated with the parameter λ, we also use f_{λi} = f_λ(x_i) and μ_{λi} = μ_λ(x_i). Now define

OBS(λ) = (1/n) Σ_{i=1}^n [−y_i f_{λi} + b(f_{λi})].

This is the observed CKL, which, because y_i and f_{λi} are usually positively correlated, tends to underestimate the CKL. To correct this, we use the leave-one-out CV in (9):

CV(λ) = (1/n) Σ_{i=1}^n [−y_i f_{λi}^{[−i]} + b(f_{λi})]
      = OBS(λ) + (1/n) Σ_{i=1}^n y_i (f_{λi} − f_{λi}^{[−i]})
      = OBS(λ) + (1/n) Σ_{i=1}^n y_i [(f_{λi} − f_{λi}^{[−i]})/(y_i − μ_{λi}^{[−i]})] × (y_i − μ_{λi}) / [1 − (μ_{λi} − μ_{λi}^{[−i]})/(y_i − μ_{λi}^{[−i]})].   (B.1)
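The last step in (B.1) uses a one-line algebraic identity, spelled out here for completeness:

(y_i − μ_{λi}) / [1 − (μ_{λi} − μ_{λi}^{[−i]})/(y_i − μ_{λi}^{[−i]})]
  = (y_i − μ_{λi})(y_i − μ_{λi}^{[−i]}) / [(y_i − μ_{λi}^{[−i]}) − (μ_{λi} − μ_{λi}^{[−i]})]
  = y_i − μ_{λi}^{[−i]},

so the ith summand reduces to exactly y_i (f_{λi} − f_{λi}^{[−i]}), matching the line above it.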


For exponential families we have μ_{λi} = b′(f_{λi}) and σ^2_{λi} = b″(f_{λi}). If we approximate the finite difference by the functional derivative, then we get

(μ_{λi} − μ_{λi}^{[−i]})/(y_i − μ_{λi}^{[−i]}) = [b′(f_{λi}) − b′(f_{λi}^{[−i]})]/(y_i − μ_{λi}^{[−i]}) ≈ b″(f_{λi}) (f_{λi} − f_{λi}^{[−i]})/(y_i − μ_{λi}^{[−i]}) = σ^2_{λi} (f_{λi} − f_{λi}^{[−i]})/(y_i − μ_{λi}^{[−i]}).   (B.2)

Denote

G_i ≡ (f_{λi} − f_{λi}^{[−i]})/(y_i − μ_{λi}^{[−i]});   (B.3)

then from (B.2) and (B.1) we get

CV(λ) ≈ OBS(λ) + (1/n) Σ_{i=1}^n G_i y_i (y_i − μ_{λi}) / (1 − G_i σ^2_{λi}).   (B.4)

Now we proceed to approximate G_i. Consider the main-effects model with

f(x) = b_0 + Σ_{α=1}^d b_α k_1(x^{(α)}) + Σ_{α=1}^d Σ_{j=1}^N c_{αj} K_1(x^{(α)}, x_{j*}^{(α)}).

Let m = Nd. Define the vectors b = (b_0, b_1, …, b_d)′ and c = (c_{11}, …, c_{dN})′ = (c_1, …, c_m)′. For α = 1, …, d, define the n × N matrix R_α = (K_1(x_i^{(α)}, x_{j*}^{(α)})). Let R = [R_1, …, R_d], and let T be the n × (d + 1) matrix whose ith row is (1, k_1(x_i^{(1)}), …, k_1(x_i^{(d)})).

Define f = (f_1, …, f_n)′; then f = Tb + Rc. (A small construction sketch follows.)
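In Python, the design matrices can be assembled as below; k1 and K1 are passed in as callables because their exact forms are the ones defined in Section 2, which we do not restate here, and K1 is assumed to be vectorized over broadcasting arrays.

import numpy as np

def design_matrices(X, Xstar, k1, K1):
    # X     : (n, d) covariates, each scaled to [0, 1]
    # Xstar : (N, d) subsampled basis points x_{j*}
    n, d = X.shape
    T = np.ones((n, d + 1))
    for a in range(d):
        T[:, a + 1] = k1(X[:, a])                # columns k1(x_i^(alpha))
    # R_alpha has entries K1(x_i^(alpha), x_{j*}^(alpha)); R stacks the blocks.
    R = np.hstack([K1(X[:, a][:, None], Xstar[None, :, a])
                   for a in range(d)])
    return T, R                                   # so that f = T @ b + R @ c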

The objective function in (10) can be expressed as

I_λ(b, c; y) = (1/n) Σ_{i=1}^n [−y_i (Σ_{α=1}^{d+1} T_{iα} b_α + Σ_{j=1}^m R_{ij} c_j) + b(Σ_{α=1}^{d+1} T_{iα} b_α + Σ_{j=1}^m R_{ij} c_j)] + λ_d Σ_{α=2}^{d+1} |b_α| + λ_s Σ_{j=1}^m |c_j|,   (B.5)

where, with a slight abuse of notation, the coefficient vector is re-indexed as b = (b_1, …, b_{d+1})′ so that the unpenalized b_1 corresponds to the intercept b_0.

With the observed response y, we denote the minimizer of (B.5) by (b̂, ĉ); when a small perturbation ε is imposed on the response, we denote the minimizer of I_λ(b, c; y + ε) by (b̃, c̃). Then we have f_λ^y = Tb̂ + Rĉ and f_λ^{y+ε} = Tb̃ + Rc̃. The l1 penalty in (B.5) tends to produce sparse solutions; that is, many components of b̂ and ĉ are exactly 0 in the solution. For simplicity of explanation, we assume that the first s components of b̂ and the first r components of ĉ are nonzero; that is,

b̂ = (b̂_1, …, b̂_s, b̂_{s+1}, …, b̂_{d+1})′, with b̂_1, …, b̂_s ≠ 0 and b̂_{s+1} = ··· = b̂_{d+1} = 0,

and

ĉ = (ĉ_1, …, ĉ_r, ĉ_{r+1}, …, ĉ_m)′, with ĉ_1, …, ĉ_r ≠ 0 and ĉ_{r+1} = ··· = ĉ_m = 0.

The general case is similar but notationally more complicated. The 0's in the solution are robust against a small perturbation in the data; that is, when the magnitude of the perturbation ε is small enough, the 0 elements stay at 0. This can be seen by transforming the optimization problem (B.5) into a nonlinear programming problem with a differentiable convex objective function and linear constraints and considering the Karush–Kuhn–Tucker (KKT) conditions (see, e.g., Mangasarian 1969). Thus it is reasonable to assume that the solution of (B.5) with the perturbed data y + ε has the form

b̃ = (b̃_1, …, b̃_s, b̃_{s+1}, …, b̃_{d+1})′, with b̃_1, …, b̃_s ≠ 0 and b̃_{s+1} = ··· = b̃_{d+1} = 0,

and

c̃ = (c̃_1, …, c̃_r, c̃_{r+1}, …, c̃_m)′, with c̃_1, …, c̃_r ≠ 0 and c̃_{r+1} = ··· = c̃_m = 0.

For convenience, we denote the first s components of b by b_* and the first r components of c by c_*. Correspondingly, let T_* be the submatrix containing the first s columns of T, and let R_* be the submatrix consisting of the first r columns of R. Then

f_λ^{y+ε} − f_λ^y = T(b̃ − b̂) + R(c̃ − ĉ) = [T_* R_*] ((b̃_* − b̂_*)′, (c̃_* − ĉ_*)′)′.   (B.6)

Now

[∂I_λ/∂b_*, ∂I_λ/∂c_*]′ at (b̃, c̃, y + ε) = 0 and [∂I_λ/∂b_*, ∂I_λ/∂c_*]′ at (b̂, ĉ, y) = 0.   (B.7)

The first-order Taylor approximation of [∂I_λ/∂b_*, ∂I_λ/∂c_*]′ at (b̃, c̃, y + ε) around (b̂, ĉ, y) gives

[∂I_λ/∂b_*; ∂I_λ/∂c_*] at (b̃, c̃, y + ε)
  ≈ [∂I_λ/∂b_*; ∂I_λ/∂c_*] at (b̂, ĉ, y)
  + [∂²I_λ/∂b_*∂b_*′, ∂²I_λ/∂b_*∂c_*′; ∂²I_λ/∂c_*∂b_*′, ∂²I_λ/∂c_*∂c_*′] at (b̂, ĉ, y) · (b̃_* − b̂_*; c̃_* − ĉ_*)
  + [∂²I_λ/∂b_*∂y′; ∂²I_λ/∂c_*∂y′] at (b̂, ĉ, y) · (y + ε − y)   (B.8)

when the magnitude of ε is small. Define

U ≡ [∂²I_λ/∂b_*∂b_*′, ∂²I_λ/∂b_*∂c_*′; ∂²I_λ/∂c_*∂b_*′, ∂²I_λ/∂c_*∂c_*′] at (b̂, ĉ, y)

and

V ≡ −[∂²I_λ/∂b_*∂y′; ∂²I_λ/∂c_*∂y′] at (b̂, ĉ, y).

From (B.7) and (B.8) we have

U (b̃_* − b̂_*; c̃_* − ĉ_*) ≈ Vε.

Then (B.6) gives us f_λ^{y+ε} − f_λ^y ≈ Hε, where

H ≡ [T_* R_*] U^{−1} V.   (B.9)

Now consider a special form of perturbation, ε_0 = (0, …, μ_{λi}^{[−i]} − y_i, …, 0)′; then f_λ^{y+ε_0} − f_λ^y ≈ ε_{0i} H_{·i}, where ε_{0i} = μ_{λi}^{[−i]} − y_i. Lemma 1


in Section 3 shows that f_λ^{[−i]} is the minimizer of I_λ(f, y + ε_0). Therefore, G_i in (B.3) is

G_i = (f_{λi} − f_{λi}^{[−i]})/(y_i − μ_{λi}^{[−i]}) = (f_{λi}^{[−i]} − f_{λi})/ε_{0i} ≈ h_{ii}.

From (B.4), an ACV score is then

ACV(λ) = (1/n) Σ_{i=1}^n [−y_i f_{λi} + b(f_{λi})] + (1/n) Σ_{i=1}^n h_{ii} y_i (y_i − μ_{λi}) / (1 − σ^2_{λi} h_{ii}).   (B.10)
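For Bernoulli data, the pieces of (B.10) assemble as in the sketch below. It relies on the observation (an assumption of this sketch, consistent with the derivation above) that on the active set the l1 term contributes no curvature, so U = S′WS/n and V = S′/n, giving H = S(S′WS)^{−1}S′ with S = [T_* R_*]:

import numpy as np

def acv(y, f_hat, S):
    # y     : (n,) 0/1 responses
    # f_hat : (n,) fitted logits f_lambda(x_i)
    # S     : (n, s + r) active columns [T*, R*]
    mu = 1.0 / (1.0 + np.exp(-f_hat))            # mu_i = b'(f_i)
    w = mu * (1.0 - mu)                          # sigma^2_{lambda i} = b''(f_i)
    H = S @ np.linalg.solve(S.T @ (S * w[:, None]), S.T)
    h = np.diag(H)                               # h_ii of (B.9)
    obs = np.mean(-y * f_hat + np.logaddexp(0.0, f_hat))
    return obs + np.mean(h * y * (y - mu) / (1.0 - w * h))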

APPENDIX C: WISCONSIN EPIDEMIOLOGIC STUDY OF DIABETIC RETINOPATHY

• Continuous covariates:
  X1 (dur): duration of diabetes at the time of baseline examination, years
  X2 (gly): glycosylated hemoglobin, a measure of hyperglycemia
  X3 (bmi): body mass index, kg/m^2
  X4 (sys): systolic blood pressure, mm Hg
  X5 (ret): retinopathy level
  X6 (pulse): pulse rate, count for 30 seconds
  X7 (ins): insulin dose, kg/day
  X8 (sch): years of school completed
  X9 (iop): intraocular pressure, mm Hg

• Categorical covariates:
  Z1 (smk): smoking status (0 = no, 1 = any)
  Z2 (sex): gender (0 = female, 1 = male)
  Z3 (asp): use of at least one aspirin for at least three months while diabetic (0 = no, 1 = yes)
  Z4 (famdb): family history of diabetes (0 = none, 1 = yes)
  Z5 (mar): marital status (0 = no, 1 = yes/ever)

APPENDIX D: BEAVER DAM EYE STUDY

• Continuous covariates:
  X1 (pky): pack years smoked, (packs per day/20) × years smoked
  X2 (sch): highest year of school/college completed, years
  X3 (inc): total household personal income, thousands/month
  X4 (bmi): body mass index, kg/m^2
  X5 (glu): glucose (serum), mg/dL
  X6 (cal): calcium (serum), mg/dL
  X7 (chl): cholesterol (serum), mg/dL
  X8 (hgb): hemoglobin (blood), g/dL
  X9 (sys): systolic blood pressure, mm Hg
  X10 (age): age at examination, years

• Categorical covariates:
  Z1 (cv): history of cardiovascular disease (0 = no, 1 = yes)
  Z2 (sex): gender (0 = female, 1 = male)
  Z3 (hair): hair color (0 = blond/red, 1 = brown/black)
  Z4 (hist): history of heavy drinking (0 = never, 1 = past/currently)
  Z5 (nout): winter leisure time (0 = indoors, 1 = outdoors)
  Z6 (mar): marital status (0 = no, 1 = yes/ever)
  Z7 (sum): part of day spent outdoors in summer (0 = < 1/4 day, 1 = > 1/4 day)
  Z8 (vtm): vitamin use (0 = no, 1 = yes)

[Received July 2002. Revised February 2004.]

REFERENCES

Bakin, S. (1999), "Adaptive Regression and Model Selection in Data Mining Problems," unpublished doctoral thesis, Australian National University.

Chen, S., Donoho, D., and Saunders, M. (1998), "Atomic Decomposition by Basis Pursuit," SIAM Journal on Scientific Computing, 20, 33–61.

Craig, B. A., Fryback, D. G., Klein, R., and Klein, B. (1999), "A Bayesian Approach to Modeling the Natural History of a Chronic Condition From Observations With Intervention," Statistics in Medicine, 18, 1355–1371.

Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation," Numerische Mathematik, 31, 377–403.

Davison, A. C., and Hinkley, D. V. (1997), Bootstrap Methods and Their Application, Cambridge, U.K.: Cambridge University Press.

Fan, J., and Li, R. Z. (2001), "Variable Selection via Penalized Likelihood," Journal of the American Statistical Association, 96, 1348–1360.

Ferris, M. C., and Voelker, M. M. (2000), "Slice Models in General Purpose Modeling Systems," Optimization Methods and Software, 17, 1009–1032.

Ferris, M. C., and Voelker, M. M. (2001), "Slice Models in GAMS," in Operations Research Proceedings, eds. P. Chamoni, R. Leisten, A. Martin, J. Minnemann, and H. Stadtler, New York: Springer-Verlag, pp. 239–246.

Frank, I. E., and Friedman, J. H. (1993), "A Statistical View of Some Chemometrics Regression Tools," Technometrics, 35, 109–148.

Fu, W. J. (1998), "Penalized Regression: The Bridge versus the LASSO," Journal of Computational and Graphical Statistics, 7, 397–416.

Gao, F., Wahba, G., Klein, R., and Klein, B. (2001), "Smoothing Spline ANOVA for Multivariate Bernoulli Observations, With Application to Ophthalmology Data," Journal of the American Statistical Association, 96, 127–160.

Girard, D. (1998), "Asymptotic Comparison of (Partial) Cross-Validation, GCV and Randomized GCV in Nonparametric Regression," The Annals of Statistics, 26, 315–334.

Gu, C. (2002), Smoothing Spline ANOVA Models, New York: Springer-Verlag.

Gu, C., and Kim, Y. J. (2002), "Penalized Likelihood Regression: General Formulation and Efficient Approximation," Canadian Journal of Statistics, 30, 619–628.

Gunn, S. R., and Kandola, J. S. (2002), "Structural Modelling With Sparse Kernels," Machine Learning, 48, 115–136.

Hastie, T. J., and Tibshirani, R. J. (1990), Generalized Additive Models, New York: Chapman & Hall.

Hutchinson, M. (1989), "A Stochastic Estimator for the Trace of the Influence Matrix for Laplacian Smoothing Splines," Communications in Statistics: Simulation, 18, 1059–1076.

Kim, K. (1995), "A Bivariate Cumulative Probit Regression Model for Ordered Categorical Data," Statistics in Medicine, 14, 1341–1352.

Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82–95.

Klein, R., Klein, B., Lee, K., Cruickshanks, K., and Chappell, R. (2001), "Changes in Visual Acuity in a Population Over a 10-Year Period: Beaver Dam Eye Study," Ophthalmology, 108, 1757–1766.

Klein, R., Klein, B., Linton, K., and DeMets, D. L. (1991), "The Beaver Dam Eye Study: Visual Acuity," Ophthalmology, 98, 1310–1315.

Klein, R., Klein, B., Moss, S., and Cruickshanks, K. (1998), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: XVII. The 14-Year Incidence and Progression of Diabetic Retinopathy and Associated Risk Factors in Type 1 Diabetes," Ophthalmology, 105, 1801–1815.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1984a), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: II. Prevalence and Risk of Diabetes When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 102, 520–526.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1984b), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: III. Prevalence and Risk of Diabetes When Age at Diagnosis Is 30 or More Years," Archives of Ophthalmology, 102, 527–532.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1989), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: IX. Four-Year Incidence and Progression of Diabetic Retinopathy When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 107, 237–243.

Knight, K., and Fu, W. J. (2000), "Asymptotics for Lasso-Type Estimators," The Annals of Statistics, 28, 1356–1378.

Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., and Klein, B. (2000), "Smoothing Spline ANOVA Models for Large Data Sets With Bernoulli Observations and the Randomized GACV," The Annals of Statistics, 28, 1570–1600.

Linhart, H., and Zucchini, W. (1986), Model Selection, New York: Wiley.

Mangasarian, O. (1969), Nonlinear Programming, New York: McGraw-Hill.

Murtagh, B. A., and Saunders, M. A. (1983), "MINOS 5.5 User's Guide," Technical Report SOL 83-20R, Stanford University, OR Dept.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 45, 205–223.

Tibshirani, R. J. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Ser. B, 58, 267–288.

Wahba, G. (1990), Spline Models for Observational Data, Vol. 59, Philadelphia: SIAM.

Wahba, G., Wang, Y., Gu, C., Klein, R., and Klein, B. (1995), "Smoothing Spline ANOVA for Exponential Families, With Application to the Wisconsin Epidemiological Study of Diabetic Retinopathy," The Annals of Statistics, 23, 1865–1895.

Wahba, G., and Wold, S. (1975), "A Completely Automatic French Curve," Communications in Statistics, 4, 1–17.

Xiang, D., and Wahba, G. (1996), "A Generalized Approximate Cross-Validation for Smoothing Splines With Non-Gaussian Data," Statistica Sinica, 6, 675–692.

Xiang, D., and Wahba, G. (1998), "Approximate Smoothing Spline Methods for Large Data Sets in the Binary Case," in Proceedings of the American Statistical Association Joint Statistical Meetings, Biometrics Section, pp. 94–98.

Yau, P., Kohn, R., and Wood, S. (2003), "Bayesian Variable Selection and Model Averaging in High-Dimensional Multinomial Nonparametric Regression," Journal of Computational and Graphical Statistics, 12, 23–54.

Zhang, H. H. (2002), "Nonparametric Variable Selection and Model Building via Likelihood Basis Pursuit," Technical Report 1066, University of Wisconsin-Madison, Dept. of Statistics.

Page 6: Variable Selection and Model Building viawahba/ftp1/zhang.lbp.pdf · Variable Selection and Model Building via ... and model build ing in the context of a smoothing spline ANOVA

664 Journal of the American Statistical Association September 2004

Now we develop a sequential Monte Carlo bootstrap testprocedure to determine q Essentially we test the variablesrsquoimportance one by one in their L1 norm rank order If onevariable passes the test (hence ldquoimportantrdquo) then it enters thenull model for testing the next variable otherwise the proce-dure stops After the first η (0 le η le d minus 1) variables enterthe model it is a one-sided hypothesis-testing problem to de-cide whether the next component f(η+1) is important or notWhen η = 0 the null model f is the constant say f = b0 andthe hypotheses are H0 L(1) = 0 versus H1 L(1) gt 0 When1 le η le d minus 1 the null model is f = b0 + f(1) + middot middot middot + f(η)and the hypotheses are H0 L(η+1) = 0 versus H1 L(η+1) gt 0Let the desired one-sided test level be α If the null distribu-tion of L(η+1) were knownthen we could get the critical valueα-percentile and make a decision of rejection or acceptanceIn practice calculating the exact α-percentile is difficult or im-possible however the Monte Carlo bootstrap test provides aconvenient approximation to the full test Conditional on theoriginal covariates x1 xn we generate ylowast(η)

1 ylowast(η)n

(response 0 or 1) using the null model f = b0 + f(1) +middot middot middot+ f(η)

as the true logit function We sample a total of T independentsets of data (x1 y

lowast(η)1t ) (xn y

lowast(η)nt ) t = 1 T from the

null model f fit the main-effects model for each set and com-pute L

lowast(η+1)t t = 1 T If exactly k of the simulated Llowast(η+1)

values exceed L(η+1) and none equals it then the Monte Carlop value is k+1

T +1 (See Davison and Hinkley 1997 for an intro-duction to the Monte Carlo bootstrap test)

Sequential Monte Carlo Bootstrap Test Algorithm

Step 1. Let η = 0 and f = b0. Test H0: L(1) = 0 versus H1: L(1) > 0. Generate T independent sets of data (x1, y*(0)_{1t}), ..., (xn, y*(0)_{nt}), t = 1, ..., T, from f = b0. Fit the LBP main-effects model and compute the Monte Carlo p value p0. If p0 < α, then go to step 2; otherwise, stop and define q as any number slightly larger than L(1).

Step 2. Let η = η + 1 and f = b0 + f(1) + ··· + f(η). Test H0: L(η+1) = 0 versus H1: L(η+1) > 0. Generate T independent sets of data (x1, y*(η)_{1t}), ..., (xn, y*(η)_{nt}) based on f, fit the main-effects model, and compute the Monte Carlo p value pη. If pη < α and η < d − 1, then repeat step 2; if pη < α and η = d − 1, then go to step 3; otherwise, stop and define q = L(η).

Step 3. Stop the procedure and define q = L(d).
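The following Python sketch outlines the control flow of the algorithm; fit_component_norm is a hypothetical helper (simulate responses from the null model with the first η components, refit the LBP main-effects model, and return the L1 norm of component η + 1), not the code used for this article:

import numpy as np

def sequential_threshold(x, L_obs, fit_component_norm, T=50, alpha=0.05):
    """Sketch of the sequential Monte Carlo bootstrap test.
    L_obs holds the observed component norms L(1) >= ... >= L(d)."""
    d = len(L_obs)
    for eta in range(d):
        # Simulate T datasets under the null model with eta components
        # and collect the bootstrap norms L*(eta+1).
        L_star = np.array([fit_component_norm(x, eta) for _ in range(T)])
        k = int(np.sum(L_star > L_obs[eta]))
        p = (k + 1.0) / (T + 1.0)
        if p >= alpha:
            # Component eta+1 fails: for eta = 0 take q slightly above L(1);
            # otherwise q = L(eta), as in step 2 of the algorithm.
            return 1.01 * L_obs[0] if eta == 0 else L_obs[eta - 1]
    return L_obs[d - 1]   # every component passes: q = L(d)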

The order of entry for sequential testing of the terms being entertained for the model is determined by the magnitude of the component L1 norms. There are other reasonable ways to determine the order of entry, and the particular strategy used can affect the results, as is the case for any stepwise procedure. For the LBP approach, the relative ranking among the important terms usually does not affect the final component selection, as long as the important terms are all ranked higher than the unimportant terms. Thus our procedure usually can distinguish between important and unimportant terms, as in most of our examples. When the distinction between important and unimportant terms is ambiguous with our method, the ambiguous terms can be recognized, and further investigation will be needed on these terms.

5. NUMERICAL COMPUTATION

Because the objective function in either (5) or (6) is not differentiable with respect to the coefficients b's and c's, some numerical methods for optimization fail to solve this kind of problem. We can replace the l1 norms in the objective function by nonnegative variables constrained linearly to be the corresponding absolute values, using standard mathematical programming techniques, and then solve a series of programs with nonlinear objective functions and linear constraints. Many optimization methods can solve such problems. We use MINOS (Murtagh and Saunders 1983), because it generally performs well with linearly constrained models and returns consistent results.
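As a minimal illustration of this variable-splitting device (on a simplified Bernoulli negative log-likelihood with a generic design matrix X, not the actual LBP program or the MINOS formulation):

import numpy as np
from scipy.optimize import minimize

def l1_fit_split(X, y, lam):
    """Minimize (1/n) sum[-y_i f_i + log(1 + e^{f_i})] + lam * ||c||_1
    by writing c = c_plus - c_minus with c_plus, c_minus >= 0, so the
    l1 norm becomes the linear term sum(c_plus + c_minus)."""
    n, m = X.shape

    def objective(z):
        c_plus, c_minus = z[:m], z[m:]
        f = X @ (c_plus - c_minus)
        nll = np.mean(-y * f + np.log1p(np.exp(f)))
        return nll + lam * np.sum(c_plus + c_minus)

    bounds = [(0, None)] * (2 * m)          # linear nonnegativity constraints
    res = minimize(objective, np.zeros(2 * m), method="L-BFGS-B", bounds=bounds)
    return res.x[:m] - res.x[m:]            # recover c

# Tiny usage example with synthetic data.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 5))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 1.0))))
print(l1_fit_split(X, y, lam=0.01))

Writing each coefficient as the difference of two nonnegative parts makes the penalty linear, so only smooth terms remain in the objective.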

Consider choosing the optimal λ by selecting the grid point for which ranGACV achieves a minimum. For each λ, to find ranGACV, the program (5), (6), (7), or (8) must be solved twice: once with y (the original problem) and once with y + ε (the perturbed problem). This often results in hundreds or thousands of individual solves, depending on the range for λ. To obtain solutions in a reasonable amount of time, we use an efficient solving approach known as slice modeling (see Ferris and Voelker 2000, 2001). Slice modeling is an approach to solving a series of mathematical programs with the same structure but different data. Because for LBP we can consider the values (λ, y) to be individual slices of data (because only these values change between solves), the program can be reduced to an example of nonlinear slice modeling. By applying slice modeling ideas (namely, maintaining the common program structure and "core" data shared between solves, and using previous solutions as starting points for later solves), we can improve efficiency and make the grid search feasible. We have developed efficient code for the main-effects and two-way interaction LBP models. The code is easy to use and runs fast. For example, in one simulation example with n = 1000, d = 10, and λ fixed, the main-effects model takes less than 2 seconds, and the two-way interaction model takes less than 50 seconds.
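The warm-starting idea can be sketched as follows, where solve and score are hypothetical stand-ins for the LBP solver and the ranGACV criterion:

import numpy as np

def grid_search_warm_start(X, y, lambdas, solve, score):
    """Sketch of the slice-modeling idea for the lambda grid search: the
    program structure and data are fixed and only lambda changes, so each
    solve starts from the previous solution instead of from scratch."""
    z0 = None                       # cold start only for the first grid point
    best = (np.inf, None, None)
    for lam in lambdas:
        z = solve(X, y, lam, z0)    # warm start from the last solution
        s = score(X, y, lam, z)
        if s < best[0]:
            best = (s, lam, z)
        z0 = z                      # carry the solution to the next slice
    return best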

6. SIMULATION

6.1 Simulation 1: Main-Effects Model

In this example there are a total of d = 10 covariates, X(1), ..., X(10). They are taken to be uniformly distributed in [0, 1], independently. The sample size is n = 1000. We use the simple random subsampling technique to select N = 50 basis functions. The perturbation ε is distributed as Bernoulli(.5), taking the two values +.25 and −.25. Four variables, X(1), X(3), X(6), and X(8), are important, and the others are noise variables. The true conditional logit function is

  f(x) = (4/3)x(1) + π sin(πx(3)) + 8(x(6))^5 + (2/(e − 1))e^{x(8)} − 5.
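For concreteness, data from this model can be generated as follows (a sketch; the seed is arbitrary):

import numpy as np

rng = np.random.default_rng(2024)

n, d = 1000, 10
X = rng.uniform(size=(n, d))        # X(1), ..., X(10) i.i.d. uniform on [0, 1]

# True conditional logit: only X(1), X(3), X(6), X(8) matter
# (0-based columns 0, 2, 5, 7).
f = (4.0 / 3.0) * X[:, 0] + np.pi * np.sin(np.pi * X[:, 2]) \
    + 8.0 * X[:, 5] ** 5 + (2.0 / (np.e - 1.0)) * np.exp(X[:, 7]) - 5.0

p = 1.0 / (1.0 + np.exp(-f))        # inverse logit
y = rng.binomial(1, p)              # Bernoulli responses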

We fit the main-effects LBP model and search the parameters (λπ, λs) globally. Because the true f is known, both CKL and ranGACV can be used for choosing the λ's. Figure 1 depicts the values of CKL(λ) and ranGACV(λ) as functions of (λπ, λs)


Figure 1. Contours and Three-Dimensional Plots for CKL(λ) and GACV(λ).

within the region of interest [2^−20, 2^−1] × [2^−20, 2^−1]. In the top row are the contours for CKL(λ) and ranGACV(λ), where the white cross "x" indicates the location of the optimal regularization parameter. Here λ_CKL = (2^−17, 2^−15) and λ_ranGACV = (2^−8, 2^−15). The bottom row shows their three-dimensional plots. In general, ranGACV(λ) approximates CKL(λ) quite well globally.

Using the optimal parameters, we fit the main-effects model and calculate the L1 norm scores for the individual components f1, ..., f10. Figure 2 plots the two sets of L1 norm scores, obtained using λ_CKL and λ_ranGACV, in decreasing order. The dashed line indicates the threshold chosen by the proposed sequential Monte Carlo bootstrap test algorithm. Using this threshold,

Figure 2. L1 Norm Scores for the Main-Effects Model (CKL and GACV).

variables X(6), X(3), X(1), and X(8) are correctly selected as "important" variables.

Figure 3 depicts the procedure of the sequential bootstrap tests to determine q. We fit the main-effects model using λ_ranGACV and sequentially test the hypotheses H0: L(η) = 0 versus H1: L(η) > 0, η = 1, ..., 10. In each plot of Figure 3, gray circles denote the L1 norms for the variables in the null model,


Figure 3. Monte Carlo Bootstrap Tests for Simulation 1 [(a) original, boot 1; (b) original, boot 2; (c) original, boot 3; (d) original, boot 4; (e) original, boot 5].



Figure 4. True (dashed) and Estimated (solid) Univariate Logit Components: (a) X1; (b) X3; (c) X6; (d) X8.

and black circles denote the L1 norms for those not in the null model. (In a color plot, gray circles are shown in green and black circles are shown in blue.) Along the horizontal axis, the variable being tested for importance is bracketed by a pair of brackets. Our experiment shows that the null hypotheses of the first four tests are all rejected at level α = .05, each with Monte Carlo p value 1/51 = .02. However, the null hypothesis for the fifth component, f2, is accepted, with p value 10/51 ≈ .20. Thus f6, f3, f1, and f8 are selected as "important" components, and q = L(4) = .21.

In addition to selecting important variables, LBP also produces functional estimates for the individual components in the model. Figure 4 plots the true main effects f1, f3, f6, and f8 and their estimates, fitted using λ_ranGACV. In each panel, the dashed line denotes the true curve and the solid line denotes the corresponding estimate. In general, the fitted main-effects model provides a reasonably good estimate for each important component. Altogether, we generated 20 datasets and fitted the main-effects model for each dataset, with the regularization parameters tuned separately. Throughout all of these 20 runs, variables X(1), X(3), X(6), and X(8) are always the four top-ranked variables. The results and figures shown here are based on the first dataset.

6.2 Simulation 2: Two-Factor Interaction Model

There are d = 4 continuous covariates, independently and uniformly distributed in [0, 1]. The true model is a two-factor interaction model, and the important effects are X(1), X(2), and their interaction. The true logit function f is

  f(x) = 4x(1) + π sin(πx(1)) + 6x(2) − 8(x(2))^3 + cos(2π(x(1) − x(2))) − 4.
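For reference, the true surface and the cross sections plotted later in Figure 6 can be evaluated directly (a sketch; the grid is illustrative):

import numpy as np

def f_true(x1, x2):
    """True two-factor interaction logit from Simulation 2."""
    return (4.0 * x1 + np.pi * np.sin(np.pi * x1) + 6.0 * x2
            - 8.0 * x2 ** 3 + np.cos(2.0 * np.pi * (x1 - x2)) - 4.0)

# Cross sections at x(1) = .2, .5, .8, as in Figure 6.
x2 = np.linspace(0.0, 1.0, 101)
sections = {x1: f_true(x1, x2) for x1 in (0.2, 0.5, 0.8)}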

We choose n = 1000 and N = 50, and use the same perturbation ε as in the previous example. There are five tuning parameters (λπ, λππ, λs, λπs, and λss) in the two-factor interaction model. In practice, extra constraints may be added on the parameters for different needs. Here we penalize all of the two-factor interaction terms equally, by setting λππ = λπs = λss. The optimal parameters are λ_CKL = (2^−10, 2^−10, 2^−15,

Figure 5. L1 Norm Scores for the Two-Factor Interaction Model (CKL and GACV).

2^−10, 2^−10) and λ_ranGACV = (2^−20, 2^−20, 2^−18, 2^−20, 2^−20). Figure 5 plots the ranked L1 norm scores. The dashed line represents the threshold q chosen by the Monte Carlo bootstrap test procedure. The LBP two-factor interaction model, using either λ_CKL or λ_GACV, correctly selects all of the important effects: X(1), X(2), and their interaction.

There is a strong interaction effect between variables X(1) and X(2), which is shown clearly by the cross-sectional plots in Figure 6. The solid lines are the cross sections of the true logit function f(x(1), x(2)) at the distinct values x(1) = .2, .5, .8, and the dashed lines are their corresponding estimates given by the LBP model. The parameters are tuned by the ranGACV criterion.

6.3 Simulation 3: Main-Effects Model Incorporating Categorical Variables

In this example, among the 12 covariates, X(1), ..., X(10) are continuous and Z(1) and Z(2) are categorical.


Figure 6. Cross-Sectional Plots for the Two-Factor Interaction Model: (a) x(1) = .2; (b) x(1) = .5; (c) x(1) = .8 (solid: true; dashed: estimate).


Figure 7. L1 Norm Scores for the Main-Effects Model Incorporating Categorical Variables (CKL and GACV).

The continuous variables are uniformly distributed in [0, 1], and the categorical variables are Bernoulli(.5) with values 0, 1. The true logit function is

  f(x) = (4/3)x(1) + π sin(πx(3)) + 8(x(6))^5 + (2/(e − 1))e^{x(8)} + 4z(1) − 7.

The important main effects are X(1), X(3), X(6), X(8), and Z(1). The sample size is n = 1000, and the basis size is N = 50. We use the same perturbation ε as in the previous examples. The main-effects model incorporating categorical variables in (7) is fitted. Figure 7 plots the ranked L1 norm scores for all of the covariates. The LBP main-effects models using λ_CKL and λ_GACV both select the important continuous and categorical variables correctly.

7. WISCONSIN EPIDEMIOLOGIC STUDY OF DIABETIC RETINOPATHY

The Wisconsin Epidemiologic Study of Diabetic Retinopathy (WESDR) is an ongoing epidemiologic study of a cohort of patients receiving their medical care in southern Wisconsin. All younger-onset diabetic persons (defined as under 30 years of age at diagnosis and taking insulin) and a probability sample of older-onset persons receiving primary medical care in an 11-county area of southwestern Wisconsin in 1979–1980 were invited to participate. Among 1210 identified younger-onset patients, 996 agreed to participate in the baseline examination in 1980–1982; of those, 891 participated in the first follow-up examination. Details about the study have been given by Klein, Klein, Moss, Davis, and DeMets (1984a,b, 1989), Klein, Klein, Moss, and Cruickshanks (1998), and others. A large number of medical, demographic, ocular, and other covariates were recorded in each examination. A multilevel retinopathy score was assigned to each eye, based on its stereoscopic color fundus photographs. This dataset has been extensively analyzed using a variety of statistical methods (see, e.g., Craig, Fryback, Klein, and Klein 1999; Kim 1995).

In this section we examine the relation of a large number of possible risk factors at baseline to the 4-year progression of diabetic retinopathy. Each person's retinopathy score was defined as the score for the worse eye, and 4-year progression of retinopathy was considered to occur if the retinopathy score degraded two levels from baseline. Wahba et al. (1995) examined risk factors for progression of diabetic retinopathy on a subset

of the younger-onset group whose members had no or nonproliferative retinopathy at baseline; the dataset comprised 669 persons. A model of the risk of progression of diabetic retinopathy in this population was built using an SS–ANOVA model (which has a quadratic penalty functional), using the predictor variables glycosylated hemoglobin (gly), duration of diabetes (dur), and body mass index (bmi); these variables are described further in Appendix C. That study began with these variables and two other (nonindependent) variables, age at baseline and age at diagnosis; these latter two were eliminated at the start. Although it was not discussed by Wahba et al. (1995), we report here that that study began with a large number (perhaps about 20) of potential risk factors, which were reduced to gly, dur, and bmi as likely the most important, after many extended and laborious parametric and nonparametric regression analyses of small groups of variables at a time, and by linear logistic regression, by the authors and others. At that time it was recognized that a (smoothly) nonparametric model selection method that could rapidly flag important variables in a dataset with many candidate variables was very desirable. For the purposes of the present study, we make the reasonable assumption that gly, dur, and bmi are the "truth" (i.e., the most important risk factors in the analyzed population); thus we are presented with a unique opportunity to examine the behavior of the LBP method in a real dataset where, arguably, the truth is known, by giving it many variables in this dataset and comparing the results to those of Wahba et al. (1995). Minor corrections and updatings of that dataset have been made (but are not believed to affect the conclusions), and we have 648 persons in the updated dataset used here.

We performed some preliminary winnowing of the many potential prediction variables available, reducing the set for examination to 14 potential risk factors. The continuous covariates are dur, gly, bmi, sys, ret, pulse, ins, sch, and iop, and the categorical covariates are smk, sex, asp, famdb, and mar; the full names for these are given in Appendix C. Because the true f is not known for real data, only ranGACV is available for tuning λ. Figure 8 plots the L1 norm scores of the individual functional components in the fitted LBP main-effects model. The dashed line indicates the threshold q = .39, which is chosen by the sequential bootstrap tests.

We note that the LBP picks out the three most important variables, gly, dur, and bmi, specified by Wahba et al. (1995). The LBP also chose sch (highest year of school/college completed). This variable frequently shows up in demographic studies when one looks for it, because it is likely a proxy for other variables related to disease, such as lifestyle and quality of medical care. It did show up in preliminary studies of Wahba et al.

Figure 8. L1 Norm Scores for the WESDR Main-Effects Model.


Figure 9. Estimated Logit Component for dur.

(1995) (not reported there), but was not included because it was not considered a direct cause of disease itself. The sequential Monte Carlo bootstrap tests for gly, dur, sch, and bmi all have p value 1/51 = .02; thus these four covariates are selected as important risk factors at the significance level α = .05.

Figure 9 plots the estimated logit for dur. The risk of progression of diabetic retinopathy increases up to a duration of about 15 years, then decreases thereafter, which generally agrees with the analysis of Wahba et al. (1995). The linear logistic regression model (using the function glm in the R package) shows that dur is not significant at level α = .05. The curve in Figure 9 exhibits a hilly shape, which means that a quadratic function fits the curve better than a linear function. We refit the linear logistic model by intentionally including dur^2; the term dur^2 is significant, with a p value of .02. This fact confirms the discovery of the LBP and shows that LBP can be a valid screening tool to aid in determining the appropriate functional form for the individual covariate.
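Such a follow-up refit is easy to reproduce. The article used the function glm in R; a rough Python analogue via statsmodels, with a hypothetical data frame (columns prog and dur and a placeholder file name), is:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame: `prog` is the binary 4-year progression indicator
# and `dur` is duration of diabetes; "wesdr.csv" is a placeholder file name.
df = pd.read_csv("wesdr.csv")

linear = smf.logit("prog ~ dur", data=df).fit()
quadratic = smf.logit("prog ~ dur + I(dur ** 2)", data=df).fit()
print(quadratic.summary())  # inspect the p value on the I(dur ** 2) term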

When fitting the two-factor interaction model in (6), with the constraints λππ = λπs = λss, we did not find the dur–bmi interaction of Wahba et al. (1995) here. We note that the interaction terms tend to be washed out if there are only a few interactions. However, further exploratory analysis may be done by rearranging the constraints and/or varying the tuning parameters subjectively. Note that the solution to the optimization problem is very sparse; in this example, we observed that approximately 90% of the coefficients are 0's in the solution.

8. BEAVER DAM EYE STUDY

The Beaver Dam Eye Study (BDES) is an ongoing population-based study of age-related ocular disorders. It aims to collect information related to the prevalence, incidence, and severity of age-related cataract, macular degeneration, and diabetic retinopathy. Between 1987 and 1988, 5924 eligible people (age 43–84 years) were identified in Beaver Dam, Wisconsin, and of those, 4926 (83.1%) participated in the baseline exam. 5- and 10-year follow-up data have been collected, and results are being reported. Many variables of various kinds have been collected, including mortality between baseline and the follow-ups. A detailed description of the study has been given by Klein, Klein, Linton, and DeMets (1991); recent reports include that of Klein, Klein, Lee, Cruickshanks, and Chappell (2001).

We are interested in the relation between the 5-year mortality occurrence for the nondiabetic study participants and possible risk factors at baseline. We focus on the nondiabetic participants because the pattern of risk factors for people with diabetes and the rest of the population differs. We consider 10 continuous and 8 categorical covariates, the detailed information for which is given in Appendix D. The abbreviated names of the continuous covariates are pky, sch, inc, bmi, glu, cal, chl, hgb, sys, and age, and those of the categorical covariates are cv, sex, hair, hist, nout, mar, sum, and vtm. We deliberately take into account some "noisy" variables in the analysis, including hair, nout, and sum, which are not directly related to mortality in general. We include these to show the performance of the LBP approach, and they are not expected to be picked out eventually by the model. Y is assigned a value of 1 if a person participated in the baseline examination and died before the start of the first 5-year follow-up, and 0 otherwise. There are 4422 nondiabetic study participants in the baseline examination, 395 of whom have missing data in the covariates. For the purpose of this study, we assume that the missing data are missing at random; thus these 395 subjects are not included in our analysis. This assumption is not necessarily valid, because age, blood pressure, body mass index, cholesterol, sex, smoking, and hemoglobin may well affect the missingness, but a further examination of the missingness is beyond the scope of the present study. In addition, we exclude another 10 participants who have either outlier values (pky > 158) or very abnormal records (bmi > 58 or hgb < 6). Thus we report an analysis of the remaining 4017 nondiabetic participants from the baseline population.

We fit the main-effects model incorporating categorical variables in (7). The sequential Monte Carlo bootstrap tests for six covariates, age, hgb, pky, sex, sys, and cv, all have Monte Carlo p values 1/51 = .02, whereas the test for glu is not significant, with p value 9/51 ≈ .18. The threshold is chosen as q = L(6) = .25. Figure 10 plots the L1 norm scores for all of the potential risk factors. Using the threshold (dashed line) .25 chosen by the sequential bootstrap test procedure, the LBP model identifies six important risk factors for the 5-year mortality: age, hgb, pky, sex, sys, and cv.

Compared with the LBP model, the linear logistic model with stepwise selection using the AIC, implemented by the function glm in R, misses the variable sys but selects three more variables: inc, bmi, and sum. Figure 11 depicts the estimated univariate logit components for the important continuous variables selected by the LBP model. All of the curves can be approximated reasonably well by linear models, except sys, whose functional form exhibits a quadratic shape. This explains why sys is not selected by the linear logistic model. When we refit the logistic regression model by including sys^2 in the model, the stepwise selection picked out both sys and sys^2.

Figure 10. L1 Norm Scores for the BDES Main-Effects Model.



Figure 11. Estimated Univariate Logit Components for Important Variables: (a) age; (b) hgb; (c) pky; (d) sys.

9. DISCUSSION

We have proposed the LBP approach for variable selection in high-dimensional nonparametric model building. In the spirit of the LASSO, LBP produces shrinkage functional estimates by imposing the l1 penalty on the coefficients of the basis functions. Using the proposed measure of importance for the functional components, LBP effectively selects important variables, and the results are highly interpretable. LBP can handle continuous variables and categorical variables simultaneously. Although in this article our continuous variables have all been on subsets of the real line, it is clear that other continuous domains are possible. We have used LBP in the context of the Bernoulli distribution, but it can be extended to other exponential distributions as well (and, of course, to Gaussian data). We expect that larger numbers of variables than those considered here may be handled, and we expect that there will be many other scientific applications of the method. We plan to provide freeware for public use.

We believe that this method is a useful addition to the toolbox of the data analyst. It provides a way to examine the possible effects of a large number of variables in a nonparametric manner, complementary to standard parametric models in its ability to find nonparametric terms that may be missed by parametric methods. It has an advantage over quadratically penalized likelihood methods when the aim is to examine a large number of variables or terms simultaneously, inasmuch as the l1 penalties result in sparse solutions. In practice, it can be an efficient tool for examining complex datasets to identify and prioritize variables (and possibly interactions) for further study. Based only on the variables or interactions identified by the LBP, one can build more traditional parametric or penalized likelihood models, for which confidence intervals and theoretical properties are known.

APPENDIX A: PROOF OF LEMMA 1

For i = 1, ..., n, we have −l(μ_λ^[−i](x_i), τ) = −μ_λ^[−i](x_i) τ + b(τ). Let f_λ^[−i] be the minimizer of the objective function

  (1/n) Σ_{j≠i} [−l(y_j, f(x_j))] + J_λ(f).   (A.1)

Because

  ∂(−l(μ_λ^[−i](x_i), τ))/∂τ = −μ_λ^[−i](x_i) + b′(τ)

and

  ∂²(−l(μ_λ^[−i](x_i), τ))/∂τ² = b″(τ) > 0,

we see that −l(μ_λ^[−i](x_i), τ) achieves its unique minimum at the τ̂ that satisfies b′(τ̂) = μ_λ^[−i](x_i). So τ̂ = f_λ^[−i](x_i). Then for any f, we have

  −l(μ_λ^[−i](x_i), f_λ^[−i](x_i)) ≤ −l(μ_λ^[−i](x_i), f(x_i)).   (A.2)

Define y_{−i} = (y_1, ..., y_{i−1}, μ_λ^[−i](x_i), y_{i+1}, ..., y_n)′. For any f,

  I_λ(f, y_{−i}) = (1/n){−l(μ_λ^[−i](x_i), f(x_i)) + Σ_{j≠i} [−l(y_j, f(x_j))]} + J_λ(f)
    ≥ (1/n){−l(μ_λ^[−i](x_i), f_λ^[−i](x_i)) + Σ_{j≠i} [−l(y_j, f(x_j))]} + J_λ(f)
    ≥ (1/n){−l(μ_λ^[−i](x_i), f_λ^[−i](x_i)) + Σ_{j≠i} [−l(y_j, f_λ^[−i](x_j))]} + J_λ(f_λ^[−i]).

The first inequality comes from (A.2). The second inequality is due to the fact that f_λ^[−i](·) is the minimizer of (A.1). Thus we have h_λ(i, μ_λ^[−i](x_i), ·) = f_λ^[−i](·).

APPENDIX B: DERIVATION OF ACV

Here we derive the ACV in (11) for the main-effects model; the derivation for two- or higher-order interaction models is similar. Write μ(x) = E(Y|X = x) and σ²(x) = var(Y|X = x). For the functional estimate f, we define f_i = f(x_i) and μ_i = μ(x_i) for i = 1, ..., n. To emphasize that an estimate is associated with the parameter λ, we also use f_λi = f_λ(x_i) and μ_λi = μ_λ(x_i). Now define

  OBS(λ) = (1/n) Σ_{i=1}^n [−y_i f_λi + b(f_λi)].

This is the observed CKL, which, because y_i and f_λi are usually positively correlated, tends to underestimate the CKL. To correct this, we use the leave-one-out CV in (9),

  CV(λ) = (1/n) Σ_{i=1}^n [−y_i f_λi^[−i] + b(f_λi)]
    = OBS(λ) + (1/n) Σ_{i=1}^n y_i (f_λi − f_λi^[−i])
    = OBS(λ) + (1/n) Σ_{i=1}^n y_i · [(f_λi − f_λi^[−i])/(y_i − μ_λi^[−i])] × (y_i − μ_λi)/[1 − (μ_λi − μ_λi^[−i])/(y_i − μ_λi^[−i])].   (B.1)


For exponential families, we have μ_λi = b′(f_λi) and σ²_λi = b″(f_λi). If we approximate the finite difference by the functional derivative, then we get

  (μ_λi − μ_λi^[−i])/(y_i − μ_λi^[−i]) = (b′(f_λi) − b′(f_λi^[−i]))/(y_i − μ_λi^[−i])
    ≈ b″(f_λi) (f_λi − f_λi^[−i])/(y_i − μ_λi^[−i]) = σ²_λi (f_λi − f_λi^[−i])/(y_i − μ_λi^[−i]).   (B.2)

Denote

  G_i ≡ (f_λi − f_λi^[−i])/(y_i − μ_λi^[−i]);   (B.3)

then from (B.2) and (B.1), we get

  CV(λ) ≈ OBS(λ) + (1/n) Σ_{i=1}^n G_i y_i (y_i − μ_λi)/(1 − G_i σ²_λi).   (B.4)

Now we proceed to approximate G_i. Consider the main-effects model with

  f(x) = b_0 + Σ_{α=1}^d b_α k_1(x^(α)) + Σ_{α=1}^d Σ_{j=1}^N c_αj K_1(x^(α), x_j*^(α)).

Let m = Nd. Define the vectors b = (b_0, b_1, ..., b_d)′ and c = (c_11, ..., c_dN)′ = (c_1, ..., c_m)′. For α = 1, ..., d, define the n × N matrix R_α = (K_1(x_i^(α), x_j*^(α))). Let R = [R_1, ..., R_d], and let T be the n × (d + 1) matrix whose ith row is (1, k_1(x_i^(1)), ..., k_1(x_i^(d))). Define f = (f_1, ..., f_n)′; then f = Tb + Rc. The objective function in (10) can be expressed as

  I_λ(b, c, y) = (1/n) Σ_{i=1}^n [−y_i (Σ_{α=1}^{d+1} T_iα b_α + Σ_{j=1}^m R_ij c_j) + b(Σ_{α=1}^{d+1} T_iα b_α + Σ_{j=1}^m R_ij c_j)] + λ_d Σ_{α=2}^{d+1} |b_α| + λ_s Σ_{j=1}^m |c_j|.   (B.5)

With the observed response y, we denote the minimizer of (B.5) by (b̂, ĉ). When a small perturbation ε is imposed on the response, we denote the minimizer of I_λ(b, c, y + ε) by (b̃, c̃). Then we have f_λ^y = Tb̂ + Rĉ and f_λ^{y+ε} = Tb̃ + Rc̃. The l1 penalty in (B.5) tends to produce sparse solutions; that is, many components of b̂ and ĉ are exactly 0 in the solution. For simplicity of explanation, we assume that the first s components of b̂ and the first r components of ĉ are nonzero; that is,

  b̂ = (b̂_1, ..., b̂_s, b̂_{s+1}, ..., b̂_{d+1})′, with b̂_1, ..., b̂_s ≠ 0 and b̂_{s+1} = ··· = b̂_{d+1} = 0,

and

  ĉ = (ĉ_1, ..., ĉ_r, ĉ_{r+1}, ..., ĉ_m)′, with ĉ_1, ..., ĉ_r ≠ 0 and ĉ_{r+1} = ··· = ĉ_m = 0.

The general case is similar, but notationally more complicated. The 0's in the solutions are robust against a small perturbation in the data; that is, when the magnitude of the perturbation ε is small enough, the 0 elements will stay at 0. This can be seen by transforming the optimization problem (B.5) into a nonlinear programming problem with a differentiable convex objective function and linear constraints, and considering the Karush–Kuhn–Tucker (KKT) conditions (see, e.g., Mangasarian 1969). Thus it is reasonable to assume that the solution of (B.5) with the perturbed data y + ε has the form

  b̃ = (b̃_1, ..., b̃_s, 0, ..., 0)′ and c̃ = (c̃_1, ..., c̃_r, 0, ..., 0)′, with b̃_1, ..., b̃_s ≠ 0 and c̃_1, ..., c̃_r ≠ 0.

For convenience, we denote the first s components of b̂ by b̂* and the first r components of ĉ by ĉ*. Correspondingly, let T* be the submatrix containing the first s columns of T, and let R* be the submatrix consisting of the first r columns of R. Then

  f_λ^{y+ε} − f_λ^y = T(b̃ − b̂) + R(c̃ − ĉ) = [T*  R*] ((b̃* − b̂*)′, (c̃* − ĉ*)′)′.   (B.6)

Now

  [∂I_λ/∂b*, ∂I_λ/∂c*]′ at (b̃, c̃, y + ε) = 0 and [∂I_λ/∂b*, ∂I_λ/∂c*]′ at (b̂, ĉ, y) = 0.   (B.7)

The first-order Taylor approximation of [∂I_λ/∂b*, ∂I_λ/∂c*]′ at (b̃, c̃, y + ε) around (b̂, ĉ, y) gives

  [∂I_λ/∂b*, ∂I_λ/∂c*]′_{(b̃,c̃,y+ε)} ≈ [∂I_λ/∂b*, ∂I_λ/∂c*]′_{(b̂,ĉ,y)}
    + [∂²I_λ/∂b*∂b*′  ∂²I_λ/∂b*∂c*′; ∂²I_λ/∂c*∂b*′  ∂²I_λ/∂c*∂c*′]_{(b̂,ĉ,y)} ((b̃* − b̂*)′, (c̃* − ĉ*)′)′
    + [∂²I_λ/∂b*∂y′; ∂²I_λ/∂c*∂y′]_{(b̂,ĉ,y)} (y + ε − y),   (B.8)

when the magnitude of ε is small. Define

  U ≡ [∂²I_λ/∂b*∂b*′  ∂²I_λ/∂b*∂c*′; ∂²I_λ/∂c*∂b*′  ∂²I_λ/∂c*∂c*′]_{(b̂,ĉ,y)}

and

  V ≡ −[∂²I_λ/∂b*∂y′; ∂²I_λ/∂c*∂y′]_{(b̂,ĉ,y)}.

From (B.7) and (B.8), we have

  U ((b̃* − b̂*)′, (c̃* − ĉ*)′)′ ≈ Vε.

Then (B.6) gives us f_λ^{y+ε} − f_λ^y ≈ Hε, where

  H ≡ [T*  R*] U^{−1} V.   (B.9)

Now consider a special form of perturbation, ε_0 = (0, ..., 0, μ_λi^[−i] − y_i, 0, ..., 0)′, whose only nonzero entry is the ith; then f_λ^{y+ε_0} − f_λ^y ≈ ε_{0i} H_{·i}, where ε_{0i} = μ_λi^[−i] − y_i and H_{·i} is the ith column of H. Lemma 1


in Section 3 shows that f_λ^[−i] is the minimizer of I(f, y + ε_0). Therefore, G_i in (B.3) is

  G_i = (f_λi − f_λi^[−i])/(y_i − μ_λi^[−i]) = (f_λi^[−i] − f_λi)/ε_{0i} ≈ h_ii.

From (B.4), an ACV score is then

  ACV(λ) = (1/n) Σ_{i=1}^n [−y_i f_λi + b(f_λi)] + (1/n) Σ_{i=1}^n h_ii y_i (y_i − μ_λi)/(1 − σ²_λi h_ii).   (B.10)
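For the Bernoulli case, where b(f) = log(1 + e^f), (B.10) is immediate to evaluate once the diagonal elements h_ii are available; a minimal sketch, with all inputs assumed precomputed:

import numpy as np

def acv(y, f, mu, sigma2, h_diag):
    """Sketch of the ACV score (B.10) for Bernoulli data: the observed
    negative log-likelihood OBS(lambda) plus the leave-one-out correction,
    with h_ii approximating G_i. Here b(f) = log(1 + e^f)."""
    obs = np.mean(-y * f + np.log1p(np.exp(f)))
    correction = np.mean(h_diag * y * (y - mu) / (1.0 - sigma2 * h_diag))
    return obs + correction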

APPENDIX C: WISCONSIN EPIDEMIOLOGIC STUDY OF DIABETIC RETINOPATHY

• Continuous covariates:
  X1 (dur): duration of diabetes at the time of baseline examination, years
  X2 (gly): glycosylated hemoglobin, a measure of hyperglycemia
  X3 (bmi): body mass index, kg/m²
  X4 (sys): systolic blood pressure, mm Hg
  X5 (ret): retinopathy level
  X6 (pulse): pulse rate, count for 30 seconds
  X7 (ins): insulin dose, kg/day
  X8 (sch): years of school completed
  X9 (iop): intraocular pressure, mm Hg

• Categorical covariates:
  Z1 (smk): smoking status (0 = no, 1 = any)
  Z2 (sex): gender (0 = female, 1 = male)
  Z3 (asp): use of at least one aspirin for at least three months while diabetic (0 = no, 1 = yes)
  Z4 (famdb): family history of diabetes (0 = none, 1 = yes)
  Z5 (mar): marital status (0 = no, 1 = yes/ever)

APPENDIX D: BEAVER DAM EYE STUDY

• Continuous covariates:
  X1 (pky): pack years smoked, (packs per day/20) × years smoked
  X2 (sch): highest year of school/college completed, years
  X3 (inc): total household personal income, thousands/month
  X4 (bmi): body mass index, kg/m²
  X5 (glu): glucose (serum), mg/dL
  X6 (cal): calcium (serum), mg/dL
  X7 (chl): cholesterol (serum), mg/dL
  X8 (hgb): hemoglobin (blood), g/dL
  X9 (sys): systolic blood pressure, mm Hg
  X10 (age): age at examination, years

• Categorical covariates:
  Z1 (cv): history of cardiovascular disease (0 = no, 1 = yes)
  Z2 (sex): gender (0 = female, 1 = male)
  Z3 (hair): hair color (0 = blond/red, 1 = brown/black)
  Z4 (hist): history of heavy drinking (0 = never, 1 = past/currently)
  Z5 (nout): winter leisure time (0 = indoors, 1 = outdoors)
  Z6 (mar): marital status (0 = no, 1 = yes/ever)
  Z7 (sum): part of day spent outdoors in summer (0 = < 1/4 day, 1 = > 1/4 day)
  Z8 (vtm): vitamin use (0 = no, 1 = yes)

[Received July 2002. Revised February 2004.]

REFERENCES

Bakin, S. (1999), "Adaptive Regression and Model Selection in Data Mining Problems," unpublished doctoral thesis, Australian National University.

Chen, S., Donoho, D., and Saunders, M. (1998), "Atomic Decomposition by Basis Pursuit," SIAM Journal on Scientific Computing, 20, 33–61.

Craig, B. A., Fryback, D. G., Klein, R., and Klein, B. (1999), "A Bayesian Approach to Modeling the Natural History of a Chronic Condition From Observations With Intervention," Statistics in Medicine, 18, 1355–1371.

Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation," Numerische Mathematik, 31, 377–403.

Davison, A. C., and Hinkley, D. V. (1997), Bootstrap Methods and Their Application, Cambridge, U.K.: Cambridge University Press.

Fan, J., and Li, R. Z. (2001), "Variable Selection via Penalized Likelihood," Journal of the American Statistical Association, 96, 1348–1360.

Ferris, M. C., and Voelker, M. M. (2000), "Slice Models in General Purpose Modeling Systems," Optimization Methods and Software, 17, 1009–1032.

——— (2001), "Slice Models in GAMS," in Operations Research Proceedings, eds. P. Chamoni, R. Leisten, A. Martin, J. Minnemann, and H. Stadtler, New York: Springer-Verlag, pp. 239–246.

Frank, I. E., and Friedman, J. H. (1993), "A Statistical View of Some Chemometrics Regression Tools," Technometrics, 35, 109–148.

Fu, W. J. (1998), "Penalized Regression: The Bridge versus the LASSO," Journal of Computational and Graphical Statistics, 7, 397–416.

Gao, F., Wahba, G., Klein, R., and Klein, B. (2001), "Smoothing Spline ANOVA for Multivariate Bernoulli Observations With Application to Ophthalmology Data," Journal of the American Statistical Association, 96, 127–160.

Girard, D. (1998), "Asymptotic Comparison of (Partial) Cross-Validation, GCV and Randomized GCV in Nonparametric Regression," The Annals of Statistics, 26, 315–334.

Gu, C. (2002), Smoothing Spline ANOVA Models, New York: Springer-Verlag.

Gu, C., and Kim, Y. J. (2002), "Penalized Likelihood Regression: General Formulation and Efficient Approximation," Canadian Journal of Statistics, 30, 619–628.

Gunn, S. R., and Kandola, J. S. (2002), "Structural Modelling With Sparse Kernels," Machine Learning, 48, 115–136.

Hastie, T. J., and Tibshirani, R. J. (1990), Generalized Additive Models, New York: Chapman & Hall.

Hutchinson, M. (1989), "A Stochastic Estimator for the Trace of the Influence Matrix for Laplacian Smoothing Splines," Communications in Statistics: Simulation, 18, 1059–1076.

Kim, K. (1995), "A Bivariate Cumulative Probit Regression Model for Ordered Categorical Data," Statistics in Medicine, 14, 1341–1352.

Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82–95.

Klein, R., Klein, B., Lee, K., Cruickshanks, K., and Chappell, R. (2001), "Changes in Visual Acuity in a Population Over a 10-Year Period: Beaver Dam Eye Study," Ophthalmology, 108, 1757–1766.

Klein, R., Klein, B., Linton, K., and DeMets, D. L. (1991), "The Beaver Dam Eye Study: Visual Acuity," Ophthalmology, 98, 1310–1315.

Klein, R., Klein, B., Moss, S., and Cruickshanks, K. (1998), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: XVII. The 14-Year Incidence and Progression of Diabetic Retinopathy and Associated Risk Factors in Type 1 Diabetes," Ophthalmology, 105, 1801–1815.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1984a), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: II. Prevalence and Risk of Diabetes When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 102, 520–526.

——— (1984b), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: III. Prevalence and Risk of Diabetes When Age at Diagnosis Is 30 or More Years," Archives of Ophthalmology, 102, 527–532.

——— (1989), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: IX. Four-Year Incidence and Progression of Diabetic Retinopathy When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 107, 237–243.

Knight, K., and Fu, W. J. (2000), "Asymptotics for Lasso-Type Estimators," The Annals of Statistics, 28, 1356–1378.

Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., and Klein, B. (2000), "Smoothing Spline ANOVA Models for Large Data Sets With Bernoulli Observations and the Randomized GACV," The Annals of Statistics, 28, 1570–1600.

Linhart, H., and Zucchini, W. (1986), Model Selection, New York: Wiley.

Mangasarian, O. (1969), Nonlinear Programming, New York: McGraw-Hill.

Murtagh, B. A., and Saunders, M. A. (1983), "MINOS 5.5 User's Guide," Technical Report SOL 83-20R, Stanford University, OR Dept.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 45, 205–223.

Tibshirani, R. J. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Ser. B, 58, 267–288.

Wahba, G. (1990), Spline Models for Observational Data, Vol. 59, Philadelphia: SIAM.

Wahba, G., Wang, Y., Gu, C., Klein, R., and Klein, B. (1995), "Smoothing Spline ANOVA for Exponential Families, With Application to the Wisconsin Epidemiological Study of Diabetic Retinopathy," The Annals of Statistics, 23, 1865–1895.

Wahba, G., and Wold, S. (1975), "A Completely Automatic French Curve," Communications in Statistics, 4, 1–17.

Xiang, D., and Wahba, G. (1996), "A Generalized Approximate Cross-Validation for Smoothing Splines With Non-Gaussian Data," Statistica Sinica, 6, 675–692.

——— (1998), "Approximate Smoothing Spline Methods for Large Data Sets in the Binary Case," in Proceedings of the American Statistical Association Joint Statistical Meetings, Biometrics Section, pp. 94–98.

Yau, P., Kohn, R., and Wood, S. (2003), "Bayesian Variable Selection and Model Averaging in High-Dimensional Multinomial Nonparametric Regression," Journal of Computational and Graphical Statistics, 12, 23–54.

Zhang, H. H. (2002), "Nonparametric Variable Selection and Model Building via Likelihood Basis Pursuit," Technical Report 1066, University of Wisconsin-Madison, Dept. of Statistics.

Page 7: Variable Selection and Model Building viawahba/ftp1/zhang.lbp.pdf · Variable Selection and Model Building via ... and model build ing in the context of a smoothing spline ANOVA

Zhang et al Likelihood Basis Pursuit 665

Figure 1 Contours and Three-Dimensional Plots for CKL(λ) and GACV(λ)

within the region of interest [2minus202minus1] times [2minus202minus1] In thetop row are the contours for CKL(λ) and ranGACV(λ) wherethe white cross ldquoxrdquo indicates the location of the optimal regular-ization parameter Here λCKL = (2minus172minus15) and λranGACV =(2minus82minus15) The bottom row shows their three-dimensionalplots In general ranGACV(λ) approximates CKL(λ) quitewell globally

Using the optimal parameters we fit the main-effects modeland calculate the L1 norm scores for the individual componentsf1 f10 Figure 2 plots two sets of L1 norm scores obtainedusing λCKL and λranGACV in their decreasing order The dashedline indicates the threshold chosen by the proposed sequen-tial Monte Carlo bootstrap test algorithm Using this threshold

Figure 2 L1 Norm Scores for the Main-Effects Model ( CKLGACV)

variables X(6) X(3) X(1) and X(8) are selected as ldquoimportantrdquovariables correctly

Figure 3 depicts the procedure of the sequential bootstraptests to determine q We fit the main-effects model usingλranGACV and sequentially test the hypotheses H0 L(η) = 0 ver-sus H1 L(η) gt 0 η = 1 10 In each plot of Figure 3 graycircles denote the L1 norms for the variables in the null model

(a) (b)

(c) (d)

(e)

Figure 3 Monte Carlo Bootstrap Tests for Simulation 1 [(a) orig-inal boot 1 (b) original boot 2 (c) original boot 3(d) original boot 4 (e) original boot 5]

666 Journal of the American Statistical Association September 2004

(a) (b)

(c) (d)

Figure 4 True ( ) and Estimated ( ) Univariate Logit Compo-nents (a) X1 (b) X3 (c) X6 (d) X8

and black circles denote the L1 norms for those not in the nullmodel (In a color plot gray circles are shown in green andblack circles are shown in blue) Along the horizontal axis thevariable being tested for importance is bracketed by a pair of Our experiment shows that the null hypotheses of the first fourtests are all rejected at level α = 05 based on their Monte Carlop value 151

= 02 However the null hypothesis for the fifthcomponent f2 is accepted with the p value 1051

= 20 Thusf6 f3 f1 and f8 are selected as ldquoimportantrdquo components andq = L(4) = 21

In addition to selecting important variables LBP also pro-duces functional estimates for the individual components in themodel Figure 4 plots the true main effects f1 f3 f6 and f8and their estimates fitted using λranGACV In each panel thedashed line denoted the true curve and the solid line denotesthe corresponding estimate In general the fitted main-effectsmodel provides a reasonably good estimate for each importantcomponent Altogether we generated 20 datasets and fitted themain-effects model for each dataset with regularization para-meters tuned separately Throughout all of these 20 runs vari-ables X(1) X(3) X(6) and X(8) are always the four top-rankedvariables The results and figures shown here are based on thefirst dataset

62 Simulation 2 Two-Factor Interaction Model

There are d = 4 continuous covariates independently anduniformly distributed in [01] The true model is a two-factorinteraction model and the important effects are X(1) X(2) andtheir interaction The true logit function f is

f (x) = 4x(1) + π sin(πx(1)

)+ 6x(2)

minus 8(x(2)

)3 + cos(2π

(x(1) minus x(2)

))minus 4

We choose n = 1000 and N = 50 and use the same perturba-tion ε as in the previous example There are five tuning para-meters (λπ λππ λs λπs and λss ) in the two-factor interactionmodel In practice extra constraints may be added on the pa-rameters for different needs Here we penalize all of the two-factor interaction terms equally by setting λππ = λπs = λss The optimal parameters are λCKL = (2minus102minus102minus15

Figure 5 L1 Norm Scores for the Two-Factor Interaction Model( CKL GACV)

2minus102minus10) and λranGACV = (2minus202minus202minus182minus202minus20)Figure 5 plots the ranked L1 norm scores The dashed line rep-resents the threshold q chosen by the Monte Carlo bootstraptest procedure The LBP two-factor interaction model using ei-ther λCKL or λGACV selects all of the important effects X(1)X(2) and their interaction effect correctly

There is a strong interaction effect between variables X(1)

and X(2) which is shown clearly by the cross-sectional plots inFigure 6 The solid lines are the cross-sections of the true logitfunction f (x(1) x(2)) at distinct values x(1) = 2 5 8 and thedashed lines are their corresponding estimates given by the LBPmodel The parameters are tuned by the ranGACV criterion

63 Simulation 3 Main-Effects Model IncorporatingCategorical Variables

In this example among the 12 covariates X(1) X(10) arecontinuous and Z(1) and Z(2) are categorical The continuous

(a) (b) (c)

Figure 6 Cross-Sectional Plots for the Two-Factor Interaction Model(a) x = 2 (b) x = 5 (c) x = 8 ( true estimate)

Zhang et al Likelihood Basis Pursuit 667

Figure 7 L1 Norm Scores for the Main-Effects Model IncorporatingCategorical Variables ( CKL GACV)

variables are uniformly distributed in [01] and the categori-cal variables are Bernoulli(5) with values 01 The true logitfunction is

f (x) = 4

3x(1) + π sin

(πx(3)

)+ 8(x(6)

)5

+ 2

e minus 1ex(8) + 4z(1) minus 7

The important main effects are X(1) X(3) X(6) X(8) and Z(1)The sample size is n = 1000 and the basis size is N = 50We use the same perturbation ε as in the previous exam-ples The main-effects model incorporating categorical vari-ables in (7) is fitted Figure 7 plots the ranked L1 norm scoresfor all of the covariates The LBP main-effects models usingλCKL and λGACV both select the important continuous and cat-egorical variables correctly

7 WISCONSIN EPIDEMIOLOGIC STUDY OFDIABETIC RETINOPATHY

The Wisconsin Epidemiologic Study of Diabetic Retinopa-thy (WESDR) is an ongoing epidemiologic study of a cohortof patients receiving their medical care southern WisconsinAll younger-onset diabetic persons (defined as under age 30of age at diagnosis and taking insulin) and a probability sam-ple of older-onset persons receiving primary medical care in an11-county area of southwestern Wisconsin in 1979ndash1980 wereinvited to participate Among 1210 identified younger onsetpatients 996 agreed to participate in the baseline examinationin 1980ndash1982 of those 891 participated in the first follow-upexamination Details about the study have been given by KleinKlein Moss Davis and DeMets (1984ab 1989) Klein KleinMoss and Cruickshanks (1998) and others A large numberof medical demographic ocular and other covariates wererecorded in each examination A multilevel retinopathy scorewas assigned to each eye based on its stereoscopic color fundusphotographs This dataset has been extensively analyzed usinga variety of statistical methods (see eg Craig Fryback Kleinand Klein 1999 Kim 1995)

In this section we examine the relation of a large numberof possible risk factors at baseline to the 4-year progression ofdiabetic retinopathy Each personrsquos retinopathy score was de-fined as the score for the worse eye and 4-year progression ofretinopathy was considered to occur if the retinopathy score de-graded two levels from baseline Wahba et al (1995) examinedrisk factors for progression of diabetic retinopathy on a subset

of the younger-onset group whose members had no or nonpro-liferative retinopathy at baseline the dataset comprised 669 per-sons A model of the risk of progression of diabetic retinopathyin this population was built using a SSndashANOVA model (whichhas a quadratic penalty functional) using the predictor variablesglycosylated hemoglobin (gly) duration of diabetes (dur) andbody mass index (bmi) these variables are described furtherin Appendix B That study began with these variables and twoother (nonindependent) variables age at baseline and age at di-agnosis these latter two were eliminated at the start Althoughit was not discussed by Wahba et al (1995) we report herethat that study began with a large number (perhaps about 20)of potential risk factors which were reduced to gly dur andbmi as likely the most important after many extended and la-borious parametric and nonparametric regression analyses ofsmall groups of variables at a time and by linear logistic re-gression by the authors and others At that time it was recog-nized that a (smoothly) nonparametric model selection methodthat could rapidly flag important variables in a dataset withmany candidate variables was very desirable For the purposesof the present study we make the reasonable assumption thatgly dur and bmi are the ldquotruthrdquo (ie the most important riskfactors in the analyzed population) and thus we are presentedwith a unique opportunity to examine the behavior of the LBPmethod in a real dataset where arguably the truth is knownby giving it many variables in this dataset and comparing theresults to those of Wahba et al (1995) Minor corrections andupdatings of that dataset have been made (but are not believedto affect the conclusions) and we have 648 persons in the up-dated dataset used here

We performed some preliminary winnowing of the manypotential prediction variables available reducing the set forexamination to 14 potential risk factors The continuous covari-ates are dur gly bmi sys ret pulse ins sch iop and the cate-gorical covariates are smk sex asp famdb mar the full namesfor these are given in Appendix C Because the true f is notknown for real data only ranGACV is available for tuning λFigure 8 plots the L1 norm scores of the individual functionalcomponents in the fitted LBP main-effects model The dashedline indicates the threshold q = 39 which is chosen by the se-quential bootstrap tests

We note that the LBP picks out three most important vari-ables gly dur and bmi specified by Wahba et al (1995)The LBP also chose sch (highest year of schoolcollege com-pleted) This variable frequently shows up in demographic stud-ies when one looks for it because it is likely a proxy for othervariables related to disease such as lifestyle and quality of med-ical care It did show up in preliminary studies of Wahba et al

Figure 8 L1 Norm Scores for the WESDR Main-Effects Model

668 Journal of the American Statistical Association September 2004

Figure 9 Estimated Logit Component for dur

(1995) (not reported there) but was not included because it wasnot considered a direct cause of disease itself The sequentialMonte Carlo bootstrap tests for gly dur sch and bmi all havep value 151

= 02 thus these four covariates are selected asimportant risk factors at the significance level α = 05

Figure 9 plots the estimated logit for dur The risk of progres-sion of diabetic retinopathy increases up to a duration of about15 years then decreases thereafter which generally agrees withthe analysis of Wahba et al (1995) The linear logistic regres-sion model (using the function glm in the R package) shows thatdur is not significant at level α = 05 The curve in Figure 9 ex-hibits a hilly shape which means that a quadratic function fitsthe curve better than a linear function We refit the linear lo-gistic model by intentionally including dur2 the term dur2 issignificant with a p value of 02 This fact confirms the discov-ery of the LBP and shows that LBP can be a valid screeningtool to aid in determining the appropriate functional form forthe individual covariate

When fitting the two-factor interaction model in (6) with theconstraints λππ = λπs = λss we did not find the durndashbmi inter-action of Wahba et al (1995) here We note that the interactionterms tend to be washed out if there are only a few interactionsHowever further exploratory analysis may be done by rearrang-ing the constraints andor varying the tuning parameters subjec-tively Note that the solution to the optimization problem is verysparse In this example we observed that approximately 90 ofthe coefficients are 0rsquos in the solution

8 BEAVER DAM EYE STUDY

The Beaver Dam Eye Study (BDES) is an ongoing popula-tion-based study of age-related ocular disorders It aims tocollect information related to the prevalence incidence andseverity of age-related cataract macular degeneration and di-abetic retinopathy Between 1987 and 1988 5924 eligiblepeople (age 43ndash84 years) were identified in Beaver DamWisconsin and of those 4926 (831) participated in thebaseline exam 5- and 10-year follow-up data have been col-lected and results are being reported Many variables of variouskinds have been collected including mortality between base-line and the follow-ups A detailed description of the study hasbeen given by Klein Klein Linton and DeMets (1991) recentreports include that of Klein Klein Lee Cruickshanks andChappell (2001)

We are interested in the relation between the 5-year mor-tality occurrence for the nondiabetic study participants andpossible risk factors at baseline We focus on the nondiabeticparticipants because the pattern of risk factors for people with

diabetes and the rest of the population differs We consider10 continuous and 8 categorical covariates the detailed in-formation for which is given in Appendix D The abbreviatednames of continuous covariates are pky sch inc bmi glu calchl hgb sys and age and those of the categorical covariatesare cv sex hair hist nout mar sum and vtm We deliberatelytake into account some ldquonoisyrdquo variables in the analysis in-cluding hair nout and sum which are not directly related tomortality in general We include these to show the performanceof the LBP approach and they are not expected to be pickedout eventually by the model Y is assigned a value of 1 if a per-son participated in the baseline examination and died beforethe start of the first 5-year follow-up and 0 otherwise Thereare 4422 nondiabetic study participants in the baseline exam-ination 395 of whom have missing data in the covariates Forthe purpose of this study we assume that the missing data aremissing at random thus these 335 subjects are not includedin our analysis This assumption is not necessarily valid be-cause age blood pressure body mass index cholesterol sexsmoking and hemoglobin may well affect the missingness buta further examination of the missingness is beyond the scopeof the present study In addition we exclude another 10 partic-ipants who have either outlier values pky gt 158 or very abnor-mal records bmi gt 58 or hgb lt 6 Thus we report an analysisof the remaining 4017 nondiabetic participants from the base-line population

We fit the main-effects model incorporating categorical vari-ables in (7) The sequential Monte Carlo bootstrap tests for sixcovariates age hgb pky sex sys and cv all have Monte Carlop values 151

= 02 whereas the test for glu is not signifi-cant with a p value 951 = 18 The threshold is chosen asq = L(6) = 25 Figure 10 plots the L1 norm scores for all ofthe potential risk factors Using the threshold (dashed line) 25chosen by the sequential bootstrap test procedure the LBPmodel identifies six important risk factors age hgb pky sexsys and cv for the 5-year mortality

Compared with the LBP model the linear logistic model withstepwise selection using the AIC implemented by the func-tion glm in R misses the variable sys but selects three morevariables inc bmi and sum Figure 11 depicts the estimatedunivariate logit components for the important continuous vari-ables selected by the LBP model All of the curves can be ap-proximated reasonably well by linear models except sys whosefunctional form exhibits a quadratic shape This explains whysys is not selected by the linear logistic model When we refitthe logistic regression model by including sys2 in the modelthe stepwise selection picked out both sys and sys2

Figure 10 L1 Norm Scores for the BDES Main-Effects Model

Zhang et al Likelihood Basis Pursuit 669

(a) (b)

(c) (d)

Figure 11 Estimated Univariate Logit Components for ImportantVariables (a) age (b) hgb (c) pky (d) sys

9 DISCUSSION

We have proposed the LBP approach for variable selection inhigh-dimensional nonparametric model building In the spirit ofLASSO LBP produces shrinkage functional estimates by im-posing the l1 penalty on the coefficients of the basis functionsUsing the proposed measure of importance for the functionalcomponents LBP effectively selects important variables andthe results are highly interpretable LBP can handle continuousvariables and categorical variables simultaneously Although inthis article our continuous variables have all been on subsets ofthe real line it is clear that other continuous domains are possi-ble We have used LBP in the context of the Bernoulli distribu-tion but it can be extended to other exponential distributions aswell (of course to Gaussian data) We expect that larger num-bers of variables than that considered here may be handled andwe expect that there will be many other scientific applicationsof the method We plan to provide freeware for public use

We believe that this method is a useful addition to the toolboxof the data analyst It provides a way to examine the possibleeffects of a large number of variables in a nonparametric man-ner complementary to standard parametric models in its abilityto find nonparametric terms that may be missed by paramet-ric methods It has an advantage over quadratically penalizedlikelihood methods when the aim is to examine a large numberof variables or terms simultaneously inasmuch as the l1 penal-ties result in sparse solutions In practice it can be an efficienttool for examining complex datasets to identify and prioritizevariables (and possibly interactions) for further study Basedonly on the variables or interactions identified by the LBP onecan build more traditional parametric or penalized likelihoodmodels for which confidence intervals and theoretical proper-ties are known

APPENDIX A PROOF OF LEMMA 1

For i = 1 n we have minusl(micro[minusi]λ (xi ) τ ) = minusmicro

[minusi]λ (xi )τ + b(τ)

Let fλ[minusi] be the minimizer of the objective function

1

n

sum

j =i

[minusl(yj f (xj )

)]+ Jλ(f ) (A1)

Because

part(minusl(micro[minusi]λ (xi ) τ ))

partτ= minusmicro

[minusi]λ (xi ) + b(τ )

and

part2(minusl(micro[minusi]λ (xi ) τ ))

part2τ= b(τ ) gt 0

we see that minusl(micro[minusi]λ (xi ) τ ) achieves its unique minimum at τ that

satisfies b(τ ) = micro[minusi]λ (xi ) So τ = f

[minusi]λ (xi ) Then for any f we have

minusl(micro

[minusi]λ (xi ) f

[minusi]λ (xi )

)le minusl(micro

[minusi]λ (xi ) f (xi )

) (A2)

Define yminusi = (y1 yiminus1micro[minusi]λ (xi ) yi+1 yn)prime For any f

Iλ(fyminusi )

= 1

n

minusl

(micro

[minusi]λ (xi ) f (xi )

)+sum

j =i

[minusl(yj f (xj )

)]+ Jλ(f )

ge 1

n

minusl

(micro

[minusi]λ (xi ) f

[minusi]λ (xi )

)+sum

j =i

[minusl(yj f (xj )

)]+ Jλ(f )

ge 1

n

minusl

(micro

[minusi]λ (xi ) f

[minusi]λ (xi )

)+sum

j =i

[minusl(yj f

[minusi]λ (xj )

)]

+ Jλ

(f

[minusi]λ

)

The first inequality comes from (A2) The second inequality is dueto the fact that f

[minusi]λ (middot) is the minimizer of (A1) Thus we have that

hλ(imicro[minusi]λ (xi ) middot) = f

[minusi]λ (middot)

APPENDIX B: DERIVATION OF ACV

Here we derive the ACV in (11) for the main-effects model; the derivation for two- or higher-order interaction models is similar. Write $\mu(x) = E(Y|X = x)$ and $\sigma^2(x) = \operatorname{var}(Y|X = x)$. For the functional estimate $f$, we define $f_i = f(x_i)$ and $\mu_i = \mu(x_i)$ for $i = 1, \ldots, n$. To emphasize that an estimate is associated with the parameter $\lambda$, we also use $f_{\lambda i} = f_\lambda(x_i)$ and $\mu_{\lambda i} = \mu_\lambda(x_i)$. Now define

$$\mathrm{OBS}(\lambda) = \frac{1}{n}\sum_{i=1}^n \bigl[-y_i f_{\lambda i} + b(f_{\lambda i})\bigr].$$

This is the observed CKL, which, because $y_i$ and $f_{\lambda i}$ are usually positively correlated, tends to underestimate the CKL. To correct this, we use the leave-one-out CV in (9):

$$\begin{aligned}
\mathrm{CV}(\lambda) &= \frac{1}{n}\sum_{i=1}^n \bigl[-y_i f_{\lambda i}^{[-i]} + b(f_{\lambda i})\bigr] \\
&= \mathrm{OBS}(\lambda) + \frac{1}{n}\sum_{i=1}^n y_i\bigl(f_{\lambda i} - f_{\lambda i}^{[-i]}\bigr) \\
&= \mathrm{OBS}(\lambda) + \frac{1}{n}\sum_{i=1}^n y_i\,
\frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} \times
\frac{y_i - \mu_{\lambda i}}{1 - \bigl(\mu_{\lambda i} - \mu_{\lambda i}^{[-i]}\bigr)\big/\bigl(y_i - \mu_{\lambda i}^{[-i]}\bigr)}. \tag{B1}
\end{aligned}$$

For exponential families, we have $\mu_{\lambda i} = b'(f_{\lambda i})$ and $\sigma^2_{\lambda i} = b''(f_{\lambda i})$. If we approximate the finite difference by the functional derivative, then we get

$$\frac{\mu_{\lambda i} - \mu_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}}
= \frac{b'(f_{\lambda i}) - b'(f_{\lambda i}^{[-i]})}{y_i - \mu_{\lambda i}^{[-i]}}
\approx b''(f_{\lambda i})\,\frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}}
= \sigma^2_{\lambda i}\,\frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}}. \tag{B2}$$

Denote

$$G_i \equiv \frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}}; \tag{B3}$$

then from (B2) and (B1), we get

$$\mathrm{CV}(\lambda) \approx \mathrm{OBS}(\lambda) + \frac{1}{n}\sum_{i=1}^n \frac{G_i\,y_i\,(y_i - \mu_{\lambda i})}{1 - G_i\,\sigma^2_{\lambda i}}. \tag{B4}$$
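As a quick sanity check on the linearization in (B2), the following short sketch (our illustration, assuming the Bernoulli cumulant $b(f) = \log(1 + e^f)$ from the article's setting; the two fitted values are made-up inputs) compares the exact increment $b'(f_{\lambda i}) - b'(f_{\lambda i}^{[-i]})$ with its first-order approximation $b''(f_{\lambda i})(f_{\lambda i} - f_{\lambda i}^{[-i]})$:

import numpy as np

def b_prime(f):
    # Bernoulli mean function: mu = b'(f) = e^f / (1 + e^f)
    return 1.0 / (1.0 + np.exp(-f))

def b_double_prime(f):
    # Bernoulli variance function: sigma^2 = b''(f) = mu * (1 - mu)
    mu = b_prime(f)
    return mu * (1.0 - mu)

# Hypothetical full-data and leave-one-out fitted logits; they are close
# together, as one expects when a single observation is dropped.
f_full, f_loo = 0.80, 0.75

exact = b_prime(f_full) - b_prime(f_loo)
approx = b_double_prime(f_full) * (f_full - f_loo)
print(exact, approx)  # approximately 0.0108 versus 0.0107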

Now we proceed to approximate $G_i$. Consider the main-effects model with

$$f(x) = b_0 + \sum_{\alpha=1}^d b_\alpha k_1\bigl(x^{(\alpha)}\bigr) + \sum_{\alpha=1}^d \sum_{j=1}^N c_{\alpha j} K_1\bigl(x^{(\alpha)}, x_{j*}^{(\alpha)}\bigr).$$

Let $m = Nd$. Define the vectors $b = (b_0, b_1, \ldots, b_d)'$ and $c = (c_{11}, \ldots, c_{dN})' = (c_1, \ldots, c_m)'$. For $\alpha = 1, \ldots, d$, define the $n \times N$ matrix $R_\alpha = \bigl(K_1\bigl(x_i^{(\alpha)}, x_{j*}^{(\alpha)}\bigr)\bigr)$. Let $R = [R_1, \ldots, R_d]$ and

$$T = \begin{pmatrix}
1 & k_1\bigl(x_1^{(1)}\bigr) & \cdots & k_1\bigl(x_1^{(d)}\bigr) \\
\vdots & \vdots & & \vdots \\
1 & k_1\bigl(x_n^{(1)}\bigr) & \cdots & k_1\bigl(x_n^{(d)}\bigr)
\end{pmatrix}.$$

Define $\mathbf{f} = (f_1, \ldots, f_n)'$; then $\mathbf{f} = Tb + Rc$. The objective function in (10) can be expressed as

$$\begin{aligned}
I_\lambda(b, c, y) = {}& \frac{1}{n}\sum_{i=1}^n \Biggl[-y_i\Biggl(\sum_{\alpha=1}^{d+1} T_{i\alpha} b_\alpha + \sum_{j=1}^m R_{ij} c_j\Biggr) + b\Biggl(\sum_{\alpha=1}^{d+1} T_{i\alpha} b_\alpha + \sum_{j=1}^m R_{ij} c_j\Biggr)\Biggr] \\
&+ \lambda_d \sum_{\alpha=2}^{d+1} |b_\alpha| + \lambda_s \sum_{j=1}^m |c_j|. \tag{B5}
\end{aligned}$$
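To make the bookkeeping in (B5) concrete, here is a minimal sketch of how $T$, $R$, and the penalized objective could be assembled (our illustration, not the authors' implementation; we assume the Bernoulli cumulant $b(f) = \log(1 + e^f)$, and the functions k1 and K1 below are stand-ins for the basis function $k_1$ and kernel sections $K_1$ of the smoothing spline ANOVA construction):

import numpy as np

def k1(x):
    # Stand-in for the parametric basis function k1; here the centered
    # linear term x - 1/2 on [0, 1], an assumption of this sketch.
    return x - 0.5

def K1(x, z):
    # Stand-in n x N block of kernel evaluations K1(x_i, z_j); any valid
    # reproducing kernel for the smooth component could be used instead.
    return np.minimum(x[:, None], z[None, :]) - x[:, None] * z[None, :]

def lbp_objective(b, c, X, Xstar, y, lam_d, lam_s):
    """Evaluate I_lambda(b, c; y) of (B5) for the main-effects model.

    X     : (n, d) covariates, b : (d+1,) parametric coefficients,
    Xstar : (N, d) basis points, c : (N*d,) kernel coefficients.
    """
    n, d = X.shape
    T = np.column_stack([np.ones(n)] + [k1(X[:, a]) for a in range(d)])
    R = np.hstack([K1(X[:, a], Xstar[:, a]) for a in range(d)])
    f = T @ b + R @ c                                 # f = Tb + Rc
    nll = np.mean(-y * f + np.logaddexp(0.0, f))      # Bernoulli part
    penalty = lam_d * np.abs(b[1:]).sum() + lam_s * np.abs(c).sum()
    return nll + penalty

Minimizing this objective over (b, c), for example with any convex solver that handles l1 terms, is what produces the sparse coefficient vectors discussed next.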

With the observed response $y$, we denote the minimizer of (B5) by $(\hat{b}, \hat{c})$. When a small perturbation $\epsilon$ is imposed on the response, we denote the minimizer of $I_\lambda(b, c, y + \epsilon)$ by $(\tilde{b}, \tilde{c})$. Then we have $\mathbf{f}_\lambda^y = T\hat{b} + R\hat{c}$ and $\mathbf{f}_\lambda^{y+\epsilon} = T\tilde{b} + R\tilde{c}$. The l1 penalty in (B5) tends to produce sparse solutions; that is, many components of $\hat{b}$ and $\hat{c}$ are exactly 0 in the solution. For simplicity of explanation, we assume that the first $s$ components of $\hat{b}$ and the first $r$ components of $\hat{c}$ are nonzero; that is,

$$\hat{b} = (\underbrace{\hat{b}_1, \ldots, \hat{b}_s}_{\neq 0}, \underbrace{\hat{b}_{s+1}, \ldots, \hat{b}_{d+1}}_{=0})'
\qquad\text{and}\qquad
\hat{c} = (\underbrace{\hat{c}_1, \ldots, \hat{c}_r}_{\neq 0}, \underbrace{\hat{c}_{r+1}, \ldots, \hat{c}_m}_{=0})'.$$

The general case is similar but notationally more complicated. The 0's in the solutions are robust against a small perturbation in the data; that is, when the magnitude of the perturbation $\epsilon$ is small enough, the 0 elements stay at 0. This can be seen by transforming the optimization problem (B5) into a nonlinear programming problem with a differentiable convex objective function and linear constraints and considering the Karush–Kuhn–Tucker (KKT) conditions (see, e.g., Mangasarian 1969). Thus it is reasonable to assume that the solution of (B5) with the perturbed data $y + \epsilon$ has the form

$$\tilde{b} = (\underbrace{\tilde{b}_1, \ldots, \tilde{b}_s}_{\neq 0}, \underbrace{\tilde{b}_{s+1}, \ldots, \tilde{b}_{d+1}}_{=0})'
\qquad\text{and}\qquad
\tilde{c} = (\underbrace{\tilde{c}_1, \ldots, \tilde{c}_r}_{\neq 0}, \underbrace{\tilde{c}_{r+1}, \ldots, \tilde{c}_m}_{=0})'.$$

For convenience, we denote the first $s$ components of $b$ by $b^*$ and the first $r$ components of $c$ by $c^*$. Correspondingly, let $T_*$ be the submatrix containing the first $s$ columns of $T$, and let $R_*$ be the submatrix consisting of the first $r$ columns of $R$. Then

$$\mathbf{f}_\lambda^{y+\epsilon} - \mathbf{f}_\lambda^y = T(\tilde{b} - \hat{b}) + R(\tilde{c} - \hat{c}) = [T_*\; R_*]\begin{bmatrix}\tilde{b}^* - \hat{b}^* \\ \tilde{c}^* - \hat{c}^*\end{bmatrix}. \tag{B6}$$

Now

$$\biggl[\frac{\partial I_\lambda}{\partial b^*}, \frac{\partial I_\lambda}{\partial c^*}\biggr]'_{(\tilde{b}, \tilde{c}, y+\epsilon)} = 0
\qquad\text{and}\qquad
\biggl[\frac{\partial I_\lambda}{\partial b^*}, \frac{\partial I_\lambda}{\partial c^*}\biggr]'_{(\hat{b}, \hat{c}, y)} = 0. \tag{B7}$$

The first-order Taylor approximation of $[\partial I_\lambda/\partial b^*, \partial I_\lambda/\partial c^*]'_{(\tilde{b}, \tilde{c}, y+\epsilon)}$ at $(\hat{b}, \hat{c}, y)$ gives

$$\begin{aligned}
\biggl[\frac{\partial I_\lambda}{\partial b^*}, \frac{\partial I_\lambda}{\partial c^*}\biggr]'_{(\tilde{b}, \tilde{c}, y+\epsilon)}
\approx{}& \biggl[\frac{\partial I_\lambda}{\partial b^*}, \frac{\partial I_\lambda}{\partial c^*}\biggr]'_{(\hat{b}, \hat{c}, y)}
+ \begin{bmatrix}
\dfrac{\partial^2 I_\lambda}{\partial b^* \partial b^{*\prime}} & \dfrac{\partial^2 I_\lambda}{\partial b^* \partial c^{*\prime}} \\
\dfrac{\partial^2 I_\lambda}{\partial c^* \partial b^{*\prime}} & \dfrac{\partial^2 I_\lambda}{\partial c^* \partial c^{*\prime}}
\end{bmatrix}_{(\hat{b}, \hat{c}, y)}
\begin{bmatrix}\tilde{b}^* - \hat{b}^* \\ \tilde{c}^* - \hat{c}^*\end{bmatrix} \\
&+ \begin{bmatrix}
\dfrac{\partial^2 I_\lambda}{\partial b^* \partial y'} \\
\dfrac{\partial^2 I_\lambda}{\partial c^* \partial y'}
\end{bmatrix}_{(\hat{b}, \hat{c}, y)} (y + \epsilon - y), \tag{B8}
\end{aligned}$$

when the magnitude of $\epsilon$ is small. Define

$$U \equiv \begin{bmatrix}
\dfrac{\partial^2 I_\lambda}{\partial b^* \partial b^{*\prime}} & \dfrac{\partial^2 I_\lambda}{\partial b^* \partial c^{*\prime}} \\
\dfrac{\partial^2 I_\lambda}{\partial c^* \partial b^{*\prime}} & \dfrac{\partial^2 I_\lambda}{\partial c^* \partial c^{*\prime}}
\end{bmatrix}_{(\hat{b}, \hat{c}, y)}
\qquad\text{and}\qquad
V \equiv -\begin{bmatrix}
\dfrac{\partial^2 I_\lambda}{\partial b^* \partial y'} \\
\dfrac{\partial^2 I_\lambda}{\partial c^* \partial y'}
\end{bmatrix}_{(\hat{b}, \hat{c}, y)}.$$

From (B7) and (B8), we have

$$U\begin{bmatrix}\tilde{b}^* - \hat{b}^* \\ \tilde{c}^* - \hat{c}^*\end{bmatrix} \approx V\epsilon.$$

Then (B6) gives us $\mathbf{f}_\lambda^{y+\epsilon} - \mathbf{f}_\lambda^y \approx H\epsilon$, where

$$H \equiv [T_*\; R_*]\,U^{-1}V. \tag{B9}$$

Now consider a special form of perturbation, $\epsilon_0 = (0, \ldots, \mu_{\lambda i}^{[-i]} - y_i, \ldots, 0)'$; then $\mathbf{f}_\lambda^{y+\epsilon_0} - \mathbf{f}_\lambda^y \approx \epsilon_{0i} H_{\cdot i}$, where $\epsilon_{0i} = \mu_{\lambda i}^{[-i]} - y_i$. Lemma 1 in Section 3 shows that $f_\lambda^{[-i]}$ is the minimizer of $I_\lambda(f, y + \epsilon_0)$. Therefore, $G_i$ in (B3) is

$$G_i = \frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} = \frac{f_{\lambda i}^{[-i]} - f_{\lambda i}}{\epsilon_{0i}} \approx h_{ii}.$$

From (B4), an ACV score is then

$$\mathrm{ACV}(\lambda) = \frac{1}{n}\sum_{i=1}^n \bigl[-y_i f_{\lambda i} + b(f_{\lambda i})\bigr] + \frac{1}{n}\sum_{i=1}^n \frac{h_{ii}\,y_i\,(y_i - \mu_{\lambda i})}{1 - \sigma^2_{\lambda i} h_{ii}}. \tag{B10}$$
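The following sketch turns (B9) and (B10) into a computation (our illustration under stated assumptions, not the authors' implementation). Differentiating the smooth part of (B5) twice on the active set gives $U = \frac{1}{n} S_*' W S_*$ and $V = \frac{1}{n} S_*'$, where $S_* = [T_*\; R_*]$ and $W = \operatorname{diag}(\sigma^2_{\lambda i})$, so that $H = S_*(S_*' W S_*)^{-1} S_*'$; we take this closed form as given and assume Bernoulli responses:

import numpy as np

def acv_score(y, f_hat, S_active):
    """ACV(lambda) of (B10) for Bernoulli data.

    y        : (n,) 0/1 responses
    f_hat    : (n,) fitted logits f_lambda(x_i) at the current lambda
    S_active : (n, s+r) columns of [T R] whose coefficients are nonzero
    """
    mu = 1.0 / (1.0 + np.exp(-f_hat))     # mu_i = b'(f_i)
    w = mu * (1.0 - mu)                   # sigma^2_i = b''(f_i)
    # H = S (S'WS)^{-1} S', the influence matrix of (B9) restricted to
    # the active coefficients; only its diagonal h_ii is needed.
    M = np.linalg.solve(S_active.T @ (S_active * w[:, None]), S_active.T)
    h = np.einsum('ij,ji->i', S_active, M)
    obs = np.mean(-y * f_hat + np.logaddexp(0.0, f_hat))
    correction = np.mean(h * y * (y - mu) / (1.0 - w * h))
    return obs + correction

Here S_active plays the role of $[T_*\; R_*]$ and is read off from which coefficients of the fitted model are nonzero at the current $\lambda$; comparing ACV$(\lambda)$ over a grid of candidate $\lambda$ values is then straightforward.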

APPENDIX C: WISCONSIN EPIDEMIOLOGIC STUDY OF DIABETIC RETINOPATHY

• Continuous covariates:
X1 (dur): duration of diabetes at the time of baseline examination, years
X2 (gly): glycosylated hemoglobin, a measure of hyperglycemia
X3 (bmi): body mass index, kg/m^2
X4 (sys): systolic blood pressure, mm Hg
X5 (ret): retinopathy level
X6 (pulse): pulse rate, count for 30 seconds
X7 (ins): insulin dose, kg/day
X8 (sch): years of school completed
X9 (iop): intraocular pressure, mm Hg

• Categorical covariates:
Z1 (smk): smoking status (0 = no, 1 = any)
Z2 (sex): gender (0 = female, 1 = male)
Z3 (asp): use of at least one aspirin for at least three months while diabetic (0 = no, 1 = yes)
Z4 (famdb): family history of diabetes (0 = none, 1 = yes)
Z5 (mar): marital status (0 = no, 1 = yes/ever)

APPENDIX D: BEAVER DAM EYE STUDY

• Continuous covariates:
X1 (pky): pack years smoked, (packs per day/20) × years smoked
X2 (sch): highest year of school/college completed, years
X3 (inc): total household personal income, thousands/month
X4 (bmi): body mass index, kg/m^2
X5 (glu): glucose (serum), mg/dL
X6 (cal): calcium (serum), mg/dL
X7 (chl): cholesterol (serum), mg/dL
X8 (hgb): hemoglobin (blood), g/dL
X9 (sys): systolic blood pressure, mm Hg
X10 (age): age at examination, years

• Categorical covariates:
Z1 (cv): history of cardiovascular disease (0 = no, 1 = yes)
Z2 (sex): gender (0 = female, 1 = male)
Z3 (hair): hair color (0 = blond/red, 1 = brown/black)
Z4 (hist): history of heavy drinking (0 = never, 1 = past/currently)
Z5 (nout): winter leisure time (0 = indoors, 1 = outdoors)
Z6 (mar): marital status (0 = no, 1 = yes/ever)
Z7 (sum): part of day spent outdoors in summer (0 = less than 1/4 day, 1 = more than 1/4 day)
Z8 (vtm): vitamin use (0 = no, 1 = yes)

[Received July 2002. Revised February 2004.]

REFERENCES

Bakin, S. (1999), "Adaptive Regression and Model Selection in Data Mining Problems," unpublished doctoral thesis, Australian National University.

Chen, S., Donoho, D., and Saunders, M. (1998), "Atomic Decomposition by Basis Pursuit," SIAM Journal on Scientific Computing, 20, 33–61.

Craig, B. A., Fryback, D. G., Klein, R., and Klein, B. (1999), "A Bayesian Approach to Modeling the Natural History of a Chronic Condition From Observations With Intervention," Statistics in Medicine, 18, 1355–1371.

Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation," Numerische Mathematik, 31, 377–403.

Davison, A. C., and Hinkley, D. V. (1997), Bootstrap Methods and Their Application, Cambridge, U.K.: Cambridge University Press.

Fan, J., and Li, R. Z. (2001), "Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties," Journal of the American Statistical Association, 96, 1348–1360.

Ferris, M. C., and Voelker, M. M. (2000), "Slice Models in General Purpose Modeling Systems," Optimization Methods and Software, 17, 1009–1032.

——— (2001), "Slice Models in GAMS," in Operations Research Proceedings, eds. P. Chamoni, R. Leisten, A. Martin, J. Minnemann, and H. Stadtler, New York: Springer-Verlag, pp. 239–246.

Frank, I. E., and Friedman, J. H. (1993), "A Statistical View of Some Chemometrics Regression Tools," Technometrics, 35, 109–148.

Fu, W. J. (1998), "Penalized Regression: The Bridge versus the LASSO," Journal of Computational and Graphical Statistics, 7, 397–416.

Gao, F., Wahba, G., Klein, R., and Klein, B. (2001), "Smoothing Spline ANOVA for Multivariate Bernoulli Observations, With Application to Ophthalmology Data," Journal of the American Statistical Association, 96, 127–160.

Girard, D. (1998), "Asymptotic Comparison of (Partial) Cross-Validation, GCV and Randomized GCV in Nonparametric Regression," The Annals of Statistics, 26, 315–334.

Gu, C. (2002), Smoothing Spline ANOVA Models, New York: Springer-Verlag.

Gu, C., and Kim, Y. J. (2002), "Penalized Likelihood Regression: General Formulation and Efficient Approximation," Canadian Journal of Statistics, 30, 619–628.

Gunn, S. R., and Kandola, J. S. (2002), "Structural Modelling With Sparse Kernels," Machine Learning, 48, 115–136.

Hastie, T. J., and Tibshirani, R. J. (1990), Generalized Additive Models, New York: Chapman & Hall.

Hutchinson, M. (1989), "A Stochastic Estimator for the Trace of the Influence Matrix for Laplacian Smoothing Splines," Communications in Statistics: Simulation, 18, 1059–1076.

Kim, K. (1995), "A Bivariate Cumulative Probit Regression Model for Ordered Categorical Data," Statistics in Medicine, 14, 1341–1352.

Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82–95.

Klein, R., Klein, B., Lee, K., Cruickshanks, K., and Chappell, R. (2001), "Changes in Visual Acuity in a Population Over a 10-Year Period: Beaver Dam Eye Study," Ophthalmology, 108, 1757–1766.

Klein, R., Klein, B., Linton, K., and DeMets, D. L. (1991), "The Beaver Dam Eye Study: Visual Acuity," Ophthalmology, 98, 1310–1315.

Klein, R., Klein, B., Moss, S., and Cruickshanks, K. (1998), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: XVII. The 14-Year Incidence and Progression of Diabetic Retinopathy and Associated Risk Factors in Type 1 Diabetes," Ophthalmology, 105, 1801–1815.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1984a), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: II. Prevalence and Risk of Diabetes When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 102, 520–526.

——— (1984b), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: III. Prevalence and Risk of Diabetes When Age at Diagnosis Is 30 or More Years," Archives of Ophthalmology, 102, 527–532.

——— (1989), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: IX. Four-Year Incidence and Progression of Diabetic Retinopathy When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 107, 237–243.

Knight, K., and Fu, W. J. (2000), "Asymptotics for Lasso-Type Estimators," The Annals of Statistics, 28, 1356–1378.

Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., and Klein, B. (2000), "Smoothing Spline ANOVA Models for Large Data Sets With Bernoulli Observations and the Randomized GACV," The Annals of Statistics, 28, 1570–1600.

Linhart, H., and Zucchini, W. (1986), Model Selection, New York: Wiley.

Mangasarian, O. (1969), Nonlinear Programming, New York: McGraw-Hill.

Murtagh, B. A., and Saunders, M. A. (1983), "MINOS 5.5 User's Guide," Technical Report SOL 83-20R, Stanford University, OR Dept.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 45, 205–223.

Tibshirani, R. J. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Ser. B, 58, 267–288.

Wahba, G. (1990), Spline Models for Observational Data, Vol. 59, Philadelphia: SIAM.

Wahba, G., Wang, Y., Gu, C., Klein, R., and Klein, B. (1995), "Smoothing Spline ANOVA for Exponential Families, With Application to the Wisconsin Epidemiological Study of Diabetic Retinopathy," The Annals of Statistics, 23, 1865–1895.

Wahba, G., and Wold, S. (1975), "A Completely Automatic French Curve," Communications in Statistics, 4, 1–17.

Xiang, D., and Wahba, G. (1996), "A Generalized Approximate Cross-Validation for Smoothing Splines With Non-Gaussian Data," Statistica Sinica, 6, 675–692.

——— (1998), "Approximate Smoothing Spline Methods for Large Data Sets in the Binary Case," in Proceedings of the American Statistical Association Joint Statistical Meetings, Biometrics Section, pp. 94–98.

Yau, P., Kohn, R., and Wood, S. (2003), "Bayesian Variable Selection and Model Averaging in High-Dimensional Multinomial Nonparametric Regression," Journal of Computational and Graphical Statistics, 12, 23–54.

Zhang, H. H. (2002), "Nonparametric Variable Selection and Model Building via Likelihood Basis Pursuit," Technical Report 1066, University of Wisconsin-Madison, Dept. of Statistics.

Page 8: Variable Selection and Model Building viawahba/ftp1/zhang.lbp.pdf · Variable Selection and Model Building via ... and model build ing in the context of a smoothing spline ANOVA

666 Journal of the American Statistical Association September 2004

(a) (b)

(c) (d)

Figure 4 True ( ) and Estimated ( ) Univariate Logit Compo-nents (a) X1 (b) X3 (c) X6 (d) X8

and black circles denote the L1 norms for those not in the nullmodel (In a color plot gray circles are shown in green andblack circles are shown in blue) Along the horizontal axis thevariable being tested for importance is bracketed by a pair of Our experiment shows that the null hypotheses of the first fourtests are all rejected at level α = 05 based on their Monte Carlop value 151

= 02 However the null hypothesis for the fifthcomponent f2 is accepted with the p value 1051

= 20 Thusf6 f3 f1 and f8 are selected as ldquoimportantrdquo components andq = L(4) = 21

In addition to selecting important variables LBP also pro-duces functional estimates for the individual components in themodel Figure 4 plots the true main effects f1 f3 f6 and f8and their estimates fitted using λranGACV In each panel thedashed line denoted the true curve and the solid line denotesthe corresponding estimate In general the fitted main-effectsmodel provides a reasonably good estimate for each importantcomponent Altogether we generated 20 datasets and fitted themain-effects model for each dataset with regularization para-meters tuned separately Throughout all of these 20 runs vari-ables X(1) X(3) X(6) and X(8) are always the four top-rankedvariables The results and figures shown here are based on thefirst dataset

62 Simulation 2 Two-Factor Interaction Model

There are d = 4 continuous covariates independently anduniformly distributed in [01] The true model is a two-factorinteraction model and the important effects are X(1) X(2) andtheir interaction The true logit function f is

f (x) = 4x(1) + π sin(πx(1)

)+ 6x(2)

minus 8(x(2)

)3 + cos(2π

(x(1) minus x(2)

))minus 4

We choose n = 1000 and N = 50 and use the same perturba-tion ε as in the previous example There are five tuning para-meters (λπ λππ λs λπs and λss ) in the two-factor interactionmodel In practice extra constraints may be added on the pa-rameters for different needs Here we penalize all of the two-factor interaction terms equally by setting λππ = λπs = λss The optimal parameters are λCKL = (2minus102minus102minus15

Figure 5 L1 Norm Scores for the Two-Factor Interaction Model( CKL GACV)

2minus102minus10) and λranGACV = (2minus202minus202minus182minus202minus20)Figure 5 plots the ranked L1 norm scores The dashed line rep-resents the threshold q chosen by the Monte Carlo bootstraptest procedure The LBP two-factor interaction model using ei-ther λCKL or λGACV selects all of the important effects X(1)X(2) and their interaction effect correctly

There is a strong interaction effect between variables X(1)

and X(2) which is shown clearly by the cross-sectional plots inFigure 6 The solid lines are the cross-sections of the true logitfunction f (x(1) x(2)) at distinct values x(1) = 2 5 8 and thedashed lines are their corresponding estimates given by the LBPmodel The parameters are tuned by the ranGACV criterion

63 Simulation 3 Main-Effects Model IncorporatingCategorical Variables

In this example among the 12 covariates X(1) X(10) arecontinuous and Z(1) and Z(2) are categorical The continuous

(a) (b) (c)

Figure 6 Cross-Sectional Plots for the Two-Factor Interaction Model(a) x = 2 (b) x = 5 (c) x = 8 ( true estimate)

Zhang et al Likelihood Basis Pursuit 667

Figure 7 L1 Norm Scores for the Main-Effects Model IncorporatingCategorical Variables ( CKL GACV)

variables are uniformly distributed in [01] and the categori-cal variables are Bernoulli(5) with values 01 The true logitfunction is

f (x) = 4

3x(1) + π sin

(πx(3)

)+ 8(x(6)

)5

+ 2

e minus 1ex(8) + 4z(1) minus 7

The important main effects are X(1) X(3) X(6) X(8) and Z(1)The sample size is n = 1000 and the basis size is N = 50We use the same perturbation ε as in the previous exam-ples The main-effects model incorporating categorical vari-ables in (7) is fitted Figure 7 plots the ranked L1 norm scoresfor all of the covariates The LBP main-effects models usingλCKL and λGACV both select the important continuous and cat-egorical variables correctly

7 WISCONSIN EPIDEMIOLOGIC STUDY OFDIABETIC RETINOPATHY

The Wisconsin Epidemiologic Study of Diabetic Retinopa-thy (WESDR) is an ongoing epidemiologic study of a cohortof patients receiving their medical care southern WisconsinAll younger-onset diabetic persons (defined as under age 30of age at diagnosis and taking insulin) and a probability sam-ple of older-onset persons receiving primary medical care in an11-county area of southwestern Wisconsin in 1979ndash1980 wereinvited to participate Among 1210 identified younger onsetpatients 996 agreed to participate in the baseline examinationin 1980ndash1982 of those 891 participated in the first follow-upexamination Details about the study have been given by KleinKlein Moss Davis and DeMets (1984ab 1989) Klein KleinMoss and Cruickshanks (1998) and others A large numberof medical demographic ocular and other covariates wererecorded in each examination A multilevel retinopathy scorewas assigned to each eye based on its stereoscopic color fundusphotographs This dataset has been extensively analyzed usinga variety of statistical methods (see eg Craig Fryback Kleinand Klein 1999 Kim 1995)

In this section we examine the relation of a large numberof possible risk factors at baseline to the 4-year progression ofdiabetic retinopathy Each personrsquos retinopathy score was de-fined as the score for the worse eye and 4-year progression ofretinopathy was considered to occur if the retinopathy score de-graded two levels from baseline Wahba et al (1995) examinedrisk factors for progression of diabetic retinopathy on a subset

of the younger-onset group whose members had no or nonpro-liferative retinopathy at baseline the dataset comprised 669 per-sons A model of the risk of progression of diabetic retinopathyin this population was built using a SSndashANOVA model (whichhas a quadratic penalty functional) using the predictor variablesglycosylated hemoglobin (gly) duration of diabetes (dur) andbody mass index (bmi) these variables are described furtherin Appendix B That study began with these variables and twoother (nonindependent) variables age at baseline and age at di-agnosis these latter two were eliminated at the start Althoughit was not discussed by Wahba et al (1995) we report herethat that study began with a large number (perhaps about 20)of potential risk factors which were reduced to gly dur andbmi as likely the most important after many extended and la-borious parametric and nonparametric regression analyses ofsmall groups of variables at a time and by linear logistic re-gression by the authors and others At that time it was recog-nized that a (smoothly) nonparametric model selection methodthat could rapidly flag important variables in a dataset withmany candidate variables was very desirable For the purposesof the present study we make the reasonable assumption thatgly dur and bmi are the ldquotruthrdquo (ie the most important riskfactors in the analyzed population) and thus we are presentedwith a unique opportunity to examine the behavior of the LBPmethod in a real dataset where arguably the truth is knownby giving it many variables in this dataset and comparing theresults to those of Wahba et al (1995) Minor corrections andupdatings of that dataset have been made (but are not believedto affect the conclusions) and we have 648 persons in the up-dated dataset used here

We performed some preliminary winnowing of the manypotential prediction variables available reducing the set forexamination to 14 potential risk factors The continuous covari-ates are dur gly bmi sys ret pulse ins sch iop and the cate-gorical covariates are smk sex asp famdb mar the full namesfor these are given in Appendix C Because the true f is notknown for real data only ranGACV is available for tuning λFigure 8 plots the L1 norm scores of the individual functionalcomponents in the fitted LBP main-effects model The dashedline indicates the threshold q = 39 which is chosen by the se-quential bootstrap tests

We note that the LBP picks out three most important vari-ables gly dur and bmi specified by Wahba et al (1995)The LBP also chose sch (highest year of schoolcollege com-pleted) This variable frequently shows up in demographic stud-ies when one looks for it because it is likely a proxy for othervariables related to disease such as lifestyle and quality of med-ical care It did show up in preliminary studies of Wahba et al

Figure 8 L1 Norm Scores for the WESDR Main-Effects Model

668 Journal of the American Statistical Association September 2004

Figure 9 Estimated Logit Component for dur

(1995) (not reported there) but was not included because it wasnot considered a direct cause of disease itself The sequentialMonte Carlo bootstrap tests for gly dur sch and bmi all havep value 151

= 02 thus these four covariates are selected asimportant risk factors at the significance level α = 05

Figure 9 plots the estimated logit for dur The risk of progres-sion of diabetic retinopathy increases up to a duration of about15 years then decreases thereafter which generally agrees withthe analysis of Wahba et al (1995) The linear logistic regres-sion model (using the function glm in the R package) shows thatdur is not significant at level α = 05 The curve in Figure 9 ex-hibits a hilly shape which means that a quadratic function fitsthe curve better than a linear function We refit the linear lo-gistic model by intentionally including dur2 the term dur2 issignificant with a p value of 02 This fact confirms the discov-ery of the LBP and shows that LBP can be a valid screeningtool to aid in determining the appropriate functional form forthe individual covariate

When fitting the two-factor interaction model in (6) with theconstraints λππ = λπs = λss we did not find the durndashbmi inter-action of Wahba et al (1995) here We note that the interactionterms tend to be washed out if there are only a few interactionsHowever further exploratory analysis may be done by rearrang-ing the constraints andor varying the tuning parameters subjec-tively Note that the solution to the optimization problem is verysparse In this example we observed that approximately 90 ofthe coefficients are 0rsquos in the solution

8 BEAVER DAM EYE STUDY

The Beaver Dam Eye Study (BDES) is an ongoing popula-tion-based study of age-related ocular disorders It aims tocollect information related to the prevalence incidence andseverity of age-related cataract macular degeneration and di-abetic retinopathy Between 1987 and 1988 5924 eligiblepeople (age 43ndash84 years) were identified in Beaver DamWisconsin and of those 4926 (831) participated in thebaseline exam 5- and 10-year follow-up data have been col-lected and results are being reported Many variables of variouskinds have been collected including mortality between base-line and the follow-ups A detailed description of the study hasbeen given by Klein Klein Linton and DeMets (1991) recentreports include that of Klein Klein Lee Cruickshanks andChappell (2001)

We are interested in the relation between the 5-year mor-tality occurrence for the nondiabetic study participants andpossible risk factors at baseline We focus on the nondiabeticparticipants because the pattern of risk factors for people with

diabetes and the rest of the population differs We consider10 continuous and 8 categorical covariates the detailed in-formation for which is given in Appendix D The abbreviatednames of continuous covariates are pky sch inc bmi glu calchl hgb sys and age and those of the categorical covariatesare cv sex hair hist nout mar sum and vtm We deliberatelytake into account some ldquonoisyrdquo variables in the analysis in-cluding hair nout and sum which are not directly related tomortality in general We include these to show the performanceof the LBP approach and they are not expected to be pickedout eventually by the model Y is assigned a value of 1 if a per-son participated in the baseline examination and died beforethe start of the first 5-year follow-up and 0 otherwise Thereare 4422 nondiabetic study participants in the baseline exam-ination 395 of whom have missing data in the covariates Forthe purpose of this study we assume that the missing data aremissing at random thus these 335 subjects are not includedin our analysis This assumption is not necessarily valid be-cause age blood pressure body mass index cholesterol sexsmoking and hemoglobin may well affect the missingness buta further examination of the missingness is beyond the scopeof the present study In addition we exclude another 10 partic-ipants who have either outlier values pky gt 158 or very abnor-mal records bmi gt 58 or hgb lt 6 Thus we report an analysisof the remaining 4017 nondiabetic participants from the base-line population

We fit the main-effects model incorporating categorical vari-ables in (7) The sequential Monte Carlo bootstrap tests for sixcovariates age hgb pky sex sys and cv all have Monte Carlop values 151

= 02 whereas the test for glu is not signifi-cant with a p value 951 = 18 The threshold is chosen asq = L(6) = 25 Figure 10 plots the L1 norm scores for all ofthe potential risk factors Using the threshold (dashed line) 25chosen by the sequential bootstrap test procedure the LBPmodel identifies six important risk factors age hgb pky sexsys and cv for the 5-year mortality

Compared with the LBP model the linear logistic model withstepwise selection using the AIC implemented by the func-tion glm in R misses the variable sys but selects three morevariables inc bmi and sum Figure 11 depicts the estimatedunivariate logit components for the important continuous vari-ables selected by the LBP model All of the curves can be ap-proximated reasonably well by linear models except sys whosefunctional form exhibits a quadratic shape This explains whysys is not selected by the linear logistic model When we refitthe logistic regression model by including sys2 in the modelthe stepwise selection picked out both sys and sys2

Figure 10 L1 Norm Scores for the BDES Main-Effects Model

Zhang et al Likelihood Basis Pursuit 669

(a) (b)

(c) (d)

Figure 11 Estimated Univariate Logit Components for ImportantVariables (a) age (b) hgb (c) pky (d) sys

9 DISCUSSION

We have proposed the LBP approach for variable selection inhigh-dimensional nonparametric model building In the spirit ofLASSO LBP produces shrinkage functional estimates by im-posing the l1 penalty on the coefficients of the basis functionsUsing the proposed measure of importance for the functionalcomponents LBP effectively selects important variables andthe results are highly interpretable LBP can handle continuousvariables and categorical variables simultaneously Although inthis article our continuous variables have all been on subsets ofthe real line it is clear that other continuous domains are possi-ble We have used LBP in the context of the Bernoulli distribu-tion but it can be extended to other exponential distributions aswell (of course to Gaussian data) We expect that larger num-bers of variables than that considered here may be handled andwe expect that there will be many other scientific applicationsof the method We plan to provide freeware for public use

We believe that this method is a useful addition to the toolboxof the data analyst It provides a way to examine the possibleeffects of a large number of variables in a nonparametric man-ner complementary to standard parametric models in its abilityto find nonparametric terms that may be missed by paramet-ric methods It has an advantage over quadratically penalizedlikelihood methods when the aim is to examine a large numberof variables or terms simultaneously inasmuch as the l1 penal-ties result in sparse solutions In practice it can be an efficienttool for examining complex datasets to identify and prioritizevariables (and possibly interactions) for further study Basedonly on the variables or interactions identified by the LBP onecan build more traditional parametric or penalized likelihoodmodels for which confidence intervals and theoretical proper-ties are known

APPENDIX A PROOF OF LEMMA 1

For i = 1 n we have minusl(micro[minusi]λ (xi ) τ ) = minusmicro

[minusi]λ (xi )τ + b(τ)

Let fλ[minusi] be the minimizer of the objective function

1

n

sum

j =i

[minusl(yj f (xj )

)]+ Jλ(f ) (A1)

Because

part(minusl(micro[minusi]λ (xi ) τ ))

partτ= minusmicro

[minusi]λ (xi ) + b(τ )

and

part2(minusl(micro[minusi]λ (xi ) τ ))

part2τ= b(τ ) gt 0

we see that minusl(micro[minusi]λ (xi ) τ ) achieves its unique minimum at τ that

satisfies b(τ ) = micro[minusi]λ (xi ) So τ = f

[minusi]λ (xi ) Then for any f we have

minusl(micro

[minusi]λ (xi ) f

[minusi]λ (xi )

)le minusl(micro

[minusi]λ (xi ) f (xi )

) (A2)

Define yminusi = (y1 yiminus1micro[minusi]λ (xi ) yi+1 yn)prime For any f

Iλ(fyminusi )

= 1

n

minusl

(micro

[minusi]λ (xi ) f (xi )

)+sum

j =i

[minusl(yj f (xj )

)]+ Jλ(f )

ge 1

n

minusl

(micro

[minusi]λ (xi ) f

[minusi]λ (xi )

)+sum

j =i

[minusl(yj f (xj )

)]+ Jλ(f )

ge 1

n

minusl

(micro

[minusi]λ (xi ) f

[minusi]λ (xi )

)+sum

j =i

[minusl(yj f

[minusi]λ (xj )

)]

+ Jλ

(f

[minusi]λ

)

The first inequality comes from (A2) The second inequality is dueto the fact that f

[minusi]λ (middot) is the minimizer of (A1) Thus we have that

hλ(imicro[minusi]λ (xi ) middot) = f

[minusi]λ (middot)

APPENDIX B DERIVATION OF ACV

Here we derive the ACV in (11) for the main-effects model thederivation for two- or higher-order interaction models is similar Writemicro(x) = E(Y |X = x) and σ 2(x) = var(Y |X = x) For the functionalestimate f we define fi = f (xi ) and microi = micro(xi ) for i = 1 nTo emphasize that an estimate is associated with the parameter λwe also use fλi = fλ(xi ) and microλi = microλ(xi ) Now define

OBS(λ) = 1

n

nsum

i=1

[minusyifλi + b(fλi )]

This is the observed CKL which because yi and fλi are usually pos-itively correlated tends to underestimate the CKL To correct thiswe use the leave-one-out CV in (9)

CV(λ) = 1

n

nsum

i=1

[minusyifλi[minusi] + b(fλi )

]

= OBS(λ) + 1

n

nsum

i=1

yi

(fλi minus fλi

[minusi])

= OBS(λ) + 1

n

nsum

i=1

yi

fλi minus f[minusi]λi

yi minus micro[minusi]λi

times yi minus microλi

1 minus (microλi minus micro[minusi]λi )(yi minus micro

[minusi]λi )

(B1)

670 Journal of the American Statistical Association September 2004

For exponential families we have microλi = b(fλi) and σ 2λi

= b(fλi)If we approximate the finite difference by the functional derivativethen we get

microλi minus micro[minusi]λi

yi minus micro[minusi]λi

= b(fλi) minus b(f[minusi]λi )

yi minus micro[minusi]λi

asymp b(fλi)fλi minus f

[minusi]λi

yi minus micro[minusi]λi

= σ 2λi

fλi minus f[minusi]λi

yi minus micro[minusi]λi

(B2)

Denote

Gi equiv fλi minus f[minusi]λi

yi minus micro[minusi]λi

(B3)

then from (B2) and (B1) we get

CV(λ) asymp OBS(λ) + 1

n

nsum

i=1

Giyi (yi minus microλi)

1 minus Giσ2λi

(B4)

Now we proceed to approximate Gi Consider the main-effectsmodel with

f (x) = b0 +dsum

α=1

bαk1(x(α)

)+dsum

α=1

Nsum

j=1

cαj K1(x(α) x

(α)jlowast

)

Let m = Nd Define the vectors b = (b0 b1 bd )prime and c =(c11 cdN )prime = (c1 cm)prime For α = 1 d define the n times N

matrix Rα = (K1(x(α)i

x(α)jlowast )) Let R = [R1 Rd ] and

T =

1 k1(x(1)1 ) middot middot middot k1(x

(d)1 )

1 k1(x(1)n ) middot middot middot k1(x

(d)n )

Define f = (f1 fn)prime then f = Tb + Rc The objective functionin (10) can be expressed as

Iλ(b cy) = 1

n

nsum

i=1

[minusyi

(d+1sum

α=1

Tiαbα +msum

j=1

Rij cj

)

+ b

(d+1sum

α=1

Tiαbα +msum

j=1

Rij cj

)]

+ λd

d+1sum

α=2

|bα | + λs

msum

j=1

|cj | (B5)

With the observed response y we denote the minimizer of (B5)by (b c) When a small perturbation ε is imposed on the responsewe denote the minimizer of Iλ(b cy + ε) by (b c) Then we havefyλ = T b + Rc and fy+ε

λ = Tb + Rc The l1 penalty in (B5) tends toproduce sparse solutions that is many components of b and c are ex-actly 0 in the solution For simplicity of explanation we assume thatthe first s components of b and the first r components of c are nonzerothat is

b = (b1 bs︸ ︷︷ ︸=0

bs+1 bd+1︸ ︷︷ ︸=0

)prime

and

c = (c1 cr︸ ︷︷ ︸=0

cr+1 cm︸ ︷︷ ︸=0

)prime

The general case is similar but notationally more complicatedThe 0rsquos in the solutions are robust against a small perturbation in

data That is when the magnitude of perturbation ε is small enoughthe 0 elements will stay at 0 This can be seen by transforming theoptimization problem (B5) into a nonlinear programming problemwith a differentiable convex objective function and linear constraintsand considering the KarushndashKuhnndashTucker (KKT) conditions (see egMangasarian 1969) Thus it is reasonable to assume that the solutionof (B5) with the perturbed data y + ε has the form

b = (b1 bs︸ ︷︷ ︸=0

bs+1 bd+1︸ ︷︷ ︸=0

)prime

and

c = (c1 cr︸ ︷︷ ︸=0

cr+1 cm︸ ︷︷ ︸=0

)prime

For convenience we denote the first s components of b by blowast andthe first r components of c by clowast Correspondingly let Tlowast be the sub-matrix containing the first s columns of T and let Rlowast be the submatrixconsisting of the first r columns of R Then

fy+ελ minus fy

λ = T(b minus b) + R(c minus c) = [Tlowast Rlowast ][

blowast minus blowastclowast minus clowast

] (B6)

Now[

partIλ

partblowast partIλ

partclowast]prime

(bcy+ε)

= 0 and

(B7)[partIλ

partblowast partIλ

partclowast]prime

(bcy)

= 0

The first-order Taylor approximation of [ partIλpartblowast partIλ

partclowast ]prime(bcy+ε)

at (b cy)

gives[

partIλ

partblowast partIλ

partclowast]

(bcy+ε)

asymp[ partIλ

partblowastpartIλpartclowast

]

(bcy)

+

part2Iλ

partblowast partblowastprimepart2Iλ

partblowast partclowastprimepart2Iλ

partclowast partblowastprimepart2Iλ

partclowast partclowastprime

(bcy)

[blowast minus blowastclowast minus clowast

]

+[ part2Iλ

partblowast partyprimepart2Iλ

partclowast partyprime

]

(bcy)

(y + ε minus y) (B8)

when the magnitude of ε is small Define

U equiv

part2Iλ

partblowast partblowastprimepart2Iλ

partblowast partclowastprimepart2Iλ

partclowast partblowastprimepart2Iλ

partclowast partclowastprime

(bcy)

and

V equiv minus[ part2Iλ

partblowast partyprimepart2Iλ

partclowast partyprime

]

(bcy)

From (B7) and (B8) we have

U[

blowast minus blowastclowast minus clowast

]asymp Vε

Then (B6) gives us fy+ελ minus fy

λ asymp Hε where

H equiv [ Tlowast Rlowast ] Uminus1V (B9)

Now consider a special form of perturbation ε0 = (0 micro[minusi]λi

minus yi

0)prime then fy+ε0λ minus fy

λ asymp ε0iHmiddoti where ε0i = micro[minusi]λi minus yi Lemma 1

Zhang et al Likelihood Basis Pursuit 671

in Section 3 shows that f[minusi]λ is the minimizer of I (fy + ε0) There-

fore Gi in (B3) is

Gi = fλi minus f[minusi]λi

yi minus micro[minusi]λi

= f[minusi]λi minus fλi

ε0iasymp hii

From (B4) an ACV score is then

ACV(λ) = 1

n

nsum

i=1

[minusyifλi + b(fλi )] + 1

n

nsum

i=1

hiiyi (yi minus microλi)

1 minus σ 2λihii

(B10)

APPENDIX C WISCONSIN EPIDEMIOLOGIC STUDYOF DIABETIC RETINOPATHY

bull Continuous covariatesX1 (dur) duration of diabetes at the time of baseline

examination yearsX2 (gly) glycosylated hemoglobin a measure of hy-

perglycemia X3 (bmi) body mass index kgm2

X4 (sys) systolic blood pressure mm HgX5 (ret) retinopathy levelX6 ( pulse) pulse rate count for 30 secondsX7 (ins) insulin dose kgdayX8 (sch) years of school completedX9 (iop) intraocular pressure mm Hg

bull Categorical covariatesZ1 (smk) smoking status (0 = no 1 = any)Z2 (sex) gender (0 = female 1 = male)Z3 (asp) use of at least one aspirin for (0 = no

1 = yes)at least three months while diabetic

Z4 ( famdb) family history of diabetes (0 = none1 = yes)

Z5 (mar) marital status (0 = no 1 = yesever)

APPENDIX D BEAVER DAM EYE STUDY

bull Continuous covariatesX1 ( pky) pack years smoked (packs per

day20)lowastyears smokedX2 (sch) highest year of schoolcollege completed

yearsX3 (inc) total household personal income

thousandsmonthX4 (bmi) body mass index kgm2

X5 (glu) glucose (serum) mgdLX6 (cal) calcium (serum) mgdLX7 (chl) cholesterol (serum) mgdLX8 (hgb) hemoglobin (blood) gdLX9 (sys) systolic blood pressure mm HgX10 (age) age at examination years

bull Categorical covariatesZ1 (cv) history of cardiovascular disease (0 = no 1 =

yes)Z2 (sex) gender (0 = female 1 = male)Z3 (hair) hair color (0 = blondred 1 = brownblack)Z4 (hist) history of heavy drinking (0 = never

1 = pastcurrently)Z5 (nout) winter leisure time (0 = indoors

1 = outdoors)Z6 (mar) marital status (0 = no 1 = yesever)Z7 (sum) part of day spent outdoors in summer

(0 =lt 14 day 1 =gt 14 day)Z8 (vtm) vitamin use (0 = no 1 = yes)

[Received July 2002 Revised February 2004]

REFERENCES

Bakin S (1999) ldquoAdaptive Regression and Model Selection in Data MiningProblemsrdquo unpublished doctoral thesis Australia National University

Chen S Donoho D and Saunders M (1998) ldquoAtomic Decomposition byBasis Pursuitrdquo SIAM Journal of the Scientific Computing 20 33ndash61

Craig B A Fryback D G Klein R and Klein B (1999) ldquoA BayesianApproach to Modeling the Natural History of a Chronic Condition From Ob-servations With Interventionrdquo Statistics in Medicine 18 1355ndash1371

Craven P and Wahba G (1979) ldquoSmoothing Noise Data With Spline Func-tions Estimating the Correct Degree of Smoothing by the Method of Gener-alized Cross-Validationrdquo Numerische Mathematik 31 377ndash403

Davison A C and Hinkley D V (1997) Bootstrap Methods and Their Ap-plication Cambridge UK Cambridge University Press

Fan J and Li R Z (2001) ldquoVariable Selection via Penalized LikelihoodrdquoJournal of the American Statistical Association 96 1348ndash1360

Ferris M C and Voelker M M (2000) ldquoSlice Models in General PurposeModeling Systemsrdquo Optimization Methods and Software 17 1009ndash1032

(2001) ldquoSlice Models in GAMSrdquo in Operations Research Proceed-ings eds P Chamoni R Leisten A Martin J Minnemann and H StadtlerNew York Springer-Verlag pp 239ndash246

Frank I E and Friedman J H (1993) ldquoA Statistical View of Some Chemo-metrics Regression Toolsrdquo Technometrics 35 109ndash148

Fu W J (1998) ldquoPenalized Regression The Bridge versus the LASSOrdquo Jour-nal of Computational and Graphical Statistics 7 397ndash416

Gao F Wahba G Klein R and Klein B (2001) ldquoSmoothing SplineANOVA for Multivariate Bernoulli Observations With Application to Oph-thalmology Datardquo Journal of the American Statistical Association 96127ndash160

Girard D (1998) ldquoAsymptotic Comparison of (Partial) Cross-Validation GCVand Randomized GCV in Nonparametric Regressionrdquo The Annals of Statis-tics 26 315ndash334

Gu C (2002) Smoothing Spline ANOVA Models New York Springer-VerlagGu C and Kim Y J (2002) ldquoPenalized Likelihood Regression General

Formulation and Efficient Approximationrdquo Canadian Journal of Statistics30 619ndash628

Gunn S R and Kandola J S (2002) ldquoStructural Modelling With SparseKernelsrdquo Machine Learning 48 115ndash136

Hastie T J and Tibshirani R J (1990) Generalized Additive ModelsNew York Chapman amp Hall

Hutchinson M (1989) ldquoA Stochastic Estimator for the Trace of the Influ-ence Matrix for Laplacian Smoothing Splinesrdquo Communication in StatisticsSimulation 18 1059ndash1076

Kim K (1995) ldquoA Bivariate Cumulative Probit Regression Model for OrderedCategorical Datardquo Statistics in Medicine 14 1341ndash1352

Kimeldorf G and Wahba G (1971) ldquoSome Results on Tchebycheffian SplineFunctionsrdquo Journal of Mathematics Analysis and Applications 33 82ndash95

Klein R Klein B Lee K Cruickshanks K and Chappell R (2001)ldquoChanges in Visual Acuity in a Population Over a 10-Year Period BeaverDam Eye Studyrdquo Ophthalmology 108 1757ndash1766

Klein R Klein B Linton K and DeMets D L (1991) ldquoThe Beaver DamEye Study Visual Acuityrdquo Ophthalmology 98 1310ndash1315

Klein R Klein B Moss S and Cruickshanks K (1998) ldquoThe WisconsinEpidemiologic Study of Diabetic Retinopathy XVII The 14-Year Incidenceand Progression of Diabetic Retinopathy and Associated Risk Factors inType 1 Diabetesrdquo Ophthalmology 105 1801ndash1815

Klein R Klein B Moss S E Davis M D and DeMets D L (1984a)ldquoThe Wisconsin Epidemiologic Study of Diabetic Retinopathy II Prevalenceand Risk of Diabetes When Age at Diagnosis is Less Than 30 Yearsrdquo Archivesof Ophthalmology 102 520ndash526

(1984b) ldquoThe Wisconsin Epidemiologic Study of Diabetic Retinopa-thy III Prevalence and Risk of Diabetes When Age at Diagnosis is 30 orMore Yearsrdquo Archives of Ophthalmology 102 527ndash532

(1989) ldquoThe Wisconsin Epidemiologic Study of Diabetic Retinopa-thy IX Four-Year Incidence and Progression of Diabetic Retinopathy WhenAge at Diagnosis Is Less Than 30 Yearsrdquo Archives of Ophthalmology 107237ndash243

Knight K and Fu W J (2000) ldquoAsymptotics for Lasso-Type EstimatorsrdquoThe Annals of Statistics 28 1356ndash1378

Lin X Wahba G Xiang D Gao F Klein R and Klein B (2000)ldquoSmoothing Spline ANOVA Models for Large Data Sets With BernoulliObservations and the Randomized GACVrdquo The Annals of Statistics 281570ndash1600

Linhart H and Zucchini W (1986) Model Selection New York WileyMangasarian O (1969) Nonlinear Programming New York McGraw-HillMurtagh B A and Saunders M A (1983) ldquoMINOS 55 Userrsquos Guiderdquo

Technical Report SOL 83-20R Stanford University OR Dept

672 Journal of the American Statistical Association September 2004

Ruppert D and Carroll R J (2000) ldquoSpatially-Adaptive Penalties for SplineFittingrdquo Australian and New Zealand Journal of Statistics 45 205ndash223

Tibshirani R J (1996) ldquoRegression Shrinkage and Selection via the LassordquoJournal of the Royal Statistical Society Ser B 58 267ndash288

Wahba G (1990) Spline Models for Observational Data Vol 59Philadelphia SIAM

Wahba G Wang Y Gu C Klein R and Klein B (1995) ldquoSmoothingSpline ANOVA for Exponential Families With Application to the WisconsinEpidemiological Study of Diabetic Retinopathyrdquo The Annals of Statistics 231865ndash1895

Wahba G and Wold S (1975) ldquoA Completely Automatic French CurverdquoCommunications in Statistics 4 1ndash17

Xiang D and Wahba G (1996) ldquoA Generalized Approximate Cross-Validation for Smoothing Splines With Non-Gaussian Datardquo StatisticaSinica 6 675ndash692

(1998) ldquoApproximate Smoothing Spline Methods for Large Data Setsin the Binary Caserdquo in Proceedings of the American Statistical AssociationJoint Statistical Meetings Biometrics Section pp 94ndash98

Yau P Kohn R and Wood S (2003) ldquoBayesian Variable Selectionand Model Averaging in High-Dimensional Multinomial NonparametricRegressionrdquo Journal of Computational and Graphical Statistics 12 23ndash54

Zhang H H (2002) ldquoNonparametric Variable Selection and Model Build-ing via Likelihood Basis Pursuitrdquo Technical Report 1066 University ofWisconsin-Madison Dept of Statistics

Page 9: Variable Selection and Model Building viawahba/ftp1/zhang.lbp.pdf · Variable Selection and Model Building via ... and model build ing in the context of a smoothing spline ANOVA

Zhang et al Likelihood Basis Pursuit 667

Figure 7 L1 Norm Scores for the Main-Effects Model IncorporatingCategorical Variables ( CKL GACV)

variables are uniformly distributed in [01] and the categori-cal variables are Bernoulli(5) with values 01 The true logitfunction is

f (x) = 4

3x(1) + π sin

(πx(3)

)+ 8(x(6)

)5

+ 2

e minus 1ex(8) + 4z(1) minus 7

The important main effects are X(1) X(3) X(6) X(8) and Z(1)The sample size is n = 1000 and the basis size is N = 50We use the same perturbation ε as in the previous exam-ples The main-effects model incorporating categorical vari-ables in (7) is fitted Figure 7 plots the ranked L1 norm scoresfor all of the covariates The LBP main-effects models usingλCKL and λGACV both select the important continuous and cat-egorical variables correctly

7 WISCONSIN EPIDEMIOLOGIC STUDY OFDIABETIC RETINOPATHY

The Wisconsin Epidemiologic Study of Diabetic Retinopa-thy (WESDR) is an ongoing epidemiologic study of a cohortof patients receiving their medical care southern WisconsinAll younger-onset diabetic persons (defined as under age 30of age at diagnosis and taking insulin) and a probability sam-ple of older-onset persons receiving primary medical care in an11-county area of southwestern Wisconsin in 1979ndash1980 wereinvited to participate Among 1210 identified younger onsetpatients 996 agreed to participate in the baseline examinationin 1980ndash1982 of those 891 participated in the first follow-upexamination Details about the study have been given by KleinKlein Moss Davis and DeMets (1984ab 1989) Klein KleinMoss and Cruickshanks (1998) and others A large numberof medical demographic ocular and other covariates wererecorded in each examination A multilevel retinopathy scorewas assigned to each eye based on its stereoscopic color fundusphotographs This dataset has been extensively analyzed usinga variety of statistical methods (see eg Craig Fryback Kleinand Klein 1999 Kim 1995)

In this section we examine the relation of a large numberof possible risk factors at baseline to the 4-year progression ofdiabetic retinopathy Each personrsquos retinopathy score was de-fined as the score for the worse eye and 4-year progression ofretinopathy was considered to occur if the retinopathy score de-graded two levels from baseline Wahba et al (1995) examinedrisk factors for progression of diabetic retinopathy on a subset

of the younger-onset group whose members had no or nonpro-liferative retinopathy at baseline the dataset comprised 669 per-sons A model of the risk of progression of diabetic retinopathyin this population was built using a SSndashANOVA model (whichhas a quadratic penalty functional) using the predictor variablesglycosylated hemoglobin (gly) duration of diabetes (dur) andbody mass index (bmi) these variables are described furtherin Appendix B That study began with these variables and twoother (nonindependent) variables age at baseline and age at di-agnosis these latter two were eliminated at the start Althoughit was not discussed by Wahba et al (1995) we report herethat that study began with a large number (perhaps about 20)of potential risk factors which were reduced to gly dur andbmi as likely the most important after many extended and la-borious parametric and nonparametric regression analyses ofsmall groups of variables at a time and by linear logistic re-gression by the authors and others At that time it was recog-nized that a (smoothly) nonparametric model selection methodthat could rapidly flag important variables in a dataset withmany candidate variables was very desirable For the purposesof the present study we make the reasonable assumption thatgly dur and bmi are the ldquotruthrdquo (ie the most important riskfactors in the analyzed population) and thus we are presentedwith a unique opportunity to examine the behavior of the LBPmethod in a real dataset where arguably the truth is knownby giving it many variables in this dataset and comparing theresults to those of Wahba et al (1995) Minor corrections andupdatings of that dataset have been made (but are not believedto affect the conclusions) and we have 648 persons in the up-dated dataset used here

We performed some preliminary winnowing of the manypotential prediction variables available reducing the set forexamination to 14 potential risk factors The continuous covari-ates are dur gly bmi sys ret pulse ins sch iop and the cate-gorical covariates are smk sex asp famdb mar the full namesfor these are given in Appendix C Because the true f is notknown for real data only ranGACV is available for tuning λFigure 8 plots the L1 norm scores of the individual functionalcomponents in the fitted LBP main-effects model The dashedline indicates the threshold q = 39 which is chosen by the se-quential bootstrap tests

We note that the LBP picks out three most important vari-ables gly dur and bmi specified by Wahba et al (1995)The LBP also chose sch (highest year of schoolcollege com-pleted) This variable frequently shows up in demographic stud-ies when one looks for it because it is likely a proxy for othervariables related to disease such as lifestyle and quality of med-ical care It did show up in preliminary studies of Wahba et al

Figure 8 L1 Norm Scores for the WESDR Main-Effects Model

668 Journal of the American Statistical Association September 2004

Figure 9 Estimated Logit Component for dur

(1995) (not reported there) but was not included because it wasnot considered a direct cause of disease itself The sequentialMonte Carlo bootstrap tests for gly dur sch and bmi all havep value 151

= 02 thus these four covariates are selected asimportant risk factors at the significance level α = 05

Figure 9 plots the estimated logit for dur The risk of progres-sion of diabetic retinopathy increases up to a duration of about15 years then decreases thereafter which generally agrees withthe analysis of Wahba et al (1995) The linear logistic regres-sion model (using the function glm in the R package) shows thatdur is not significant at level α = 05 The curve in Figure 9 ex-hibits a hilly shape which means that a quadratic function fitsthe curve better than a linear function We refit the linear lo-gistic model by intentionally including dur2 the term dur2 issignificant with a p value of 02 This fact confirms the discov-ery of the LBP and shows that LBP can be a valid screeningtool to aid in determining the appropriate functional form forthe individual covariate

When fitting the two-factor interaction model in (6) with theconstraints λππ = λπs = λss we did not find the durndashbmi inter-action of Wahba et al (1995) here We note that the interactionterms tend to be washed out if there are only a few interactionsHowever further exploratory analysis may be done by rearrang-ing the constraints andor varying the tuning parameters subjec-tively Note that the solution to the optimization problem is verysparse In this example we observed that approximately 90 ofthe coefficients are 0rsquos in the solution

8 BEAVER DAM EYE STUDY

The Beaver Dam Eye Study (BDES) is an ongoing popula-tion-based study of age-related ocular disorders It aims tocollect information related to the prevalence incidence andseverity of age-related cataract macular degeneration and di-abetic retinopathy Between 1987 and 1988 5924 eligiblepeople (age 43ndash84 years) were identified in Beaver DamWisconsin and of those 4926 (831) participated in thebaseline exam 5- and 10-year follow-up data have been col-lected and results are being reported Many variables of variouskinds have been collected including mortality between base-line and the follow-ups A detailed description of the study hasbeen given by Klein Klein Linton and DeMets (1991) recentreports include that of Klein Klein Lee Cruickshanks andChappell (2001)

We are interested in the relation between the 5-year mor-tality occurrence for the nondiabetic study participants andpossible risk factors at baseline We focus on the nondiabeticparticipants because the pattern of risk factors for people with

diabetes and the rest of the population differs We consider10 continuous and 8 categorical covariates the detailed in-formation for which is given in Appendix D The abbreviatednames of continuous covariates are pky sch inc bmi glu calchl hgb sys and age and those of the categorical covariatesare cv sex hair hist nout mar sum and vtm We deliberatelytake into account some ldquonoisyrdquo variables in the analysis in-cluding hair nout and sum which are not directly related tomortality in general We include these to show the performanceof the LBP approach and they are not expected to be pickedout eventually by the model Y is assigned a value of 1 if a per-son participated in the baseline examination and died beforethe start of the first 5-year follow-up and 0 otherwise Thereare 4422 nondiabetic study participants in the baseline exam-ination 395 of whom have missing data in the covariates Forthe purpose of this study we assume that the missing data aremissing at random thus these 335 subjects are not includedin our analysis This assumption is not necessarily valid be-cause age blood pressure body mass index cholesterol sexsmoking and hemoglobin may well affect the missingness buta further examination of the missingness is beyond the scopeof the present study In addition we exclude another 10 partic-ipants who have either outlier values pky gt 158 or very abnor-mal records bmi gt 58 or hgb lt 6 Thus we report an analysisof the remaining 4017 nondiabetic participants from the base-line population

We fit the main-effects model incorporating categorical vari-ables in (7) The sequential Monte Carlo bootstrap tests for sixcovariates age hgb pky sex sys and cv all have Monte Carlop values 151

= 02 whereas the test for glu is not signifi-cant with a p value 951 = 18 The threshold is chosen asq = L(6) = 25 Figure 10 plots the L1 norm scores for all ofthe potential risk factors Using the threshold (dashed line) 25chosen by the sequential bootstrap test procedure the LBPmodel identifies six important risk factors age hgb pky sexsys and cv for the 5-year mortality

Compared with the LBP model the linear logistic model withstepwise selection using the AIC implemented by the func-tion glm in R misses the variable sys but selects three morevariables inc bmi and sum Figure 11 depicts the estimatedunivariate logit components for the important continuous vari-ables selected by the LBP model All of the curves can be ap-proximated reasonably well by linear models except sys whosefunctional form exhibits a quadratic shape This explains whysys is not selected by the linear logistic model When we refitthe logistic regression model by including sys2 in the modelthe stepwise selection picked out both sys and sys2

Figure 10 L1 Norm Scores for the BDES Main-Effects Model

Zhang et al Likelihood Basis Pursuit 669

(a) (b)

(c) (d)

Figure 11 Estimated Univariate Logit Components for ImportantVariables (a) age (b) hgb (c) pky (d) sys

9 DISCUSSION

We have proposed the LBP approach for variable selection inhigh-dimensional nonparametric model building In the spirit ofLASSO LBP produces shrinkage functional estimates by im-posing the l1 penalty on the coefficients of the basis functionsUsing the proposed measure of importance for the functionalcomponents LBP effectively selects important variables andthe results are highly interpretable LBP can handle continuousvariables and categorical variables simultaneously Although inthis article our continuous variables have all been on subsets ofthe real line it is clear that other continuous domains are possi-ble We have used LBP in the context of the Bernoulli distribu-tion but it can be extended to other exponential distributions aswell (of course to Gaussian data) We expect that larger num-bers of variables than that considered here may be handled andwe expect that there will be many other scientific applicationsof the method We plan to provide freeware for public use

We believe that this method is a useful addition to the toolboxof the data analyst It provides a way to examine the possibleeffects of a large number of variables in a nonparametric man-ner complementary to standard parametric models in its abilityto find nonparametric terms that may be missed by paramet-ric methods It has an advantage over quadratically penalizedlikelihood methods when the aim is to examine a large numberof variables or terms simultaneously inasmuch as the l1 penal-ties result in sparse solutions In practice it can be an efficienttool for examining complex datasets to identify and prioritizevariables (and possibly interactions) for further study Basedonly on the variables or interactions identified by the LBP onecan build more traditional parametric or penalized likelihoodmodels for which confidence intervals and theoretical proper-ties are known

APPENDIX A: PROOF OF LEMMA 1

For $i = 1, \ldots, n$, we have $-l(\mu_\lambda^{[-i]}(x_i), \tau) = -\mu_\lambda^{[-i]}(x_i)\,\tau + b(\tau)$. Let $f_\lambda^{[-i]}$ be the minimizer of the objective function

$$\frac{1}{n} \sum_{j \ne i} \bigl[-l\bigl(y_j, f(x_j)\bigr)\bigr] + J_\lambda(f). \quad (A1)$$

Because

$$\frac{\partial\bigl(-l(\mu_\lambda^{[-i]}(x_i), \tau)\bigr)}{\partial \tau} = -\mu_\lambda^{[-i]}(x_i) + b'(\tau)
\qquad \text{and} \qquad
\frac{\partial^2\bigl(-l(\mu_\lambda^{[-i]}(x_i), \tau)\bigr)}{\partial \tau^2} = b''(\tau) > 0,$$

we see that $-l(\mu_\lambda^{[-i]}(x_i), \tau)$ achieves its unique minimum at the $\hat\tau$ that satisfies $b'(\hat\tau) = \mu_\lambda^{[-i]}(x_i)$. So $\hat\tau = f_\lambda^{[-i]}(x_i)$. Then for any $f$, we have

$$-l\bigl(\mu_\lambda^{[-i]}(x_i), f_\lambda^{[-i]}(x_i)\bigr) \le -l\bigl(\mu_\lambda^{[-i]}(x_i), f(x_i)\bigr). \quad (A2)$$

Define $y^{-i} = (y_1, \ldots, y_{i-1}, \mu_\lambda^{[-i]}(x_i), y_{i+1}, \ldots, y_n)'$. For any $f$,

$$\begin{aligned}
I_\lambda(f, y^{-i}) &= \frac{1}{n}\Bigl\{-l\bigl(\mu_\lambda^{[-i]}(x_i), f(x_i)\bigr) + \sum_{j \ne i}\bigl[-l\bigl(y_j, f(x_j)\bigr)\bigr]\Bigr\} + J_\lambda(f) \\
&\ge \frac{1}{n}\Bigl\{-l\bigl(\mu_\lambda^{[-i]}(x_i), f_\lambda^{[-i]}(x_i)\bigr) + \sum_{j \ne i}\bigl[-l\bigl(y_j, f(x_j)\bigr)\bigr]\Bigr\} + J_\lambda(f) \\
&\ge \frac{1}{n}\Bigl\{-l\bigl(\mu_\lambda^{[-i]}(x_i), f_\lambda^{[-i]}(x_i)\bigr) + \sum_{j \ne i}\bigl[-l\bigl(y_j, f_\lambda^{[-i]}(x_j)\bigr)\bigr]\Bigr\} + J_\lambda\bigl(f_\lambda^{[-i]}\bigr).
\end{aligned}$$

The first inequality comes from (A2). The second inequality is due to the fact that $f_\lambda^{[-i]}(\cdot)$ is the minimizer of (A1). Thus we have that $h_\lambda(i, \mu_\lambda^{[-i]}(x_i), \cdot) = f_\lambda^{[-i]}(\cdot)$.
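For concreteness, in the Bernoulli setting used throughout the article (a standard instantiation that the proof leaves implicit), the pieces above specialize as follows:

```latex
% Bernoulli log-likelihood: -l(y, f) = -yf + b(f), with b(f) = log(1 + e^f).
\begin{align*}
  b'(\tau)  &= \frac{e^{\tau}}{1 + e^{\tau}} = \mu(\tau), \\
  b''(\tau) &= \frac{e^{\tau}}{(1 + e^{\tau})^{2}} = \mu(\tau)\{1 - \mu(\tau)\} > 0,
\end{align*}
% so the strict convexity used in the proof holds, and the minimizing
% \hat\tau solving b'(\hat\tau) = \mu_\lambda^{[-i]}(x_i) is the logit
% \hat\tau = \log\bigl[\mu_\lambda^{[-i]}(x_i)\big/\{1 - \mu_\lambda^{[-i]}(x_i)\}\bigr].
```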

APPENDIX B: DERIVATION OF ACV

Here we derive the ACV in (11) for the main-effects model; the derivation for two- or higher-order interaction models is similar. Write $\mu(x) = E(Y|X = x)$ and $\sigma^2(x) = \operatorname{var}(Y|X = x)$. For the functional estimate $f$, we define $f_i = f(x_i)$ and $\mu_i = \mu(x_i)$ for $i = 1, \ldots, n$. To emphasize that an estimate is associated with the parameter $\lambda$, we also use $f_{\lambda i} = f_\lambda(x_i)$ and $\mu_{\lambda i} = \mu_\lambda(x_i)$. Now define

$$\mathrm{OBS}(\lambda) = \frac{1}{n} \sum_{i=1}^n \bigl[-y_i f_{\lambda i} + b(f_{\lambda i})\bigr].$$

This is the observed CKL, which, because $y_i$ and $f_{\lambda i}$ are usually positively correlated, tends to underestimate the CKL. To correct this, we use the leave-one-out CV in (9):

$$\begin{aligned}
\mathrm{CV}(\lambda) &= \frac{1}{n} \sum_{i=1}^n \bigl[-y_i f_{\lambda i}^{[-i]} + b(f_{\lambda i})\bigr] \\
&= \mathrm{OBS}(\lambda) + \frac{1}{n} \sum_{i=1}^n y_i \bigl(f_{\lambda i} - f_{\lambda i}^{[-i]}\bigr) \\
&= \mathrm{OBS}(\lambda) + \frac{1}{n} \sum_{i=1}^n y_i \, \frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} \times \frac{y_i - \mu_{\lambda i}}{1 - \bigl(\mu_{\lambda i} - \mu_{\lambda i}^{[-i]}\bigr)\big/\bigl(y_i - \mu_{\lambda i}^{[-i]}\bigr)}. \quad (B1)
\end{aligned}$$
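The last equality in (B1) is exact algebra rather than an approximation; writing out the step that the derivation leaves implicit:

```latex
\[
  1 - \frac{\mu_{\lambda i} - \mu_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}}
  = \frac{y_i - \mu_{\lambda i}}{y_i - \mu_{\lambda i}^{[-i]}},
  \qquad\text{so}\qquad
  \frac{y_i - \mu_{\lambda i}}
       {1 - (\mu_{\lambda i} - \mu_{\lambda i}^{[-i]})/(y_i - \mu_{\lambda i}^{[-i]})}
  = y_i - \mu_{\lambda i}^{[-i]},
\]
% and the product of the two factors in (B1) reduces to
% y_i (f_{\lambda i} - f_{\lambda i}^{[-i]}), matching the second line.
```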


For exponential families, we have $\mu_{\lambda i} = b'(f_{\lambda i})$ and $\sigma^2_{\lambda i} = b''(f_{\lambda i})$. If we approximate the finite difference by the functional derivative, then we get

$$\frac{\mu_{\lambda i} - \mu_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} = \frac{b'(f_{\lambda i}) - b'(f_{\lambda i}^{[-i]})}{y_i - \mu_{\lambda i}^{[-i]}} \approx b''(f_{\lambda i})\, \frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} = \sigma^2_{\lambda i}\, \frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}}. \quad (B2)$$

Denote

$$G_i \equiv \frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}}; \quad (B3)$$

then from (B2) and (B1), we get

$$\mathrm{CV}(\lambda) \approx \mathrm{OBS}(\lambda) + \frac{1}{n} \sum_{i=1}^n \frac{G_i\, y_i (y_i - \mu_{\lambda i})}{1 - G_i \sigma^2_{\lambda i}}. \quad (B4)$$

Now we proceed to approximate $G_i$. Consider the main-effects model with

$$f(x) = b_0 + \sum_{\alpha=1}^d b_\alpha k_1\bigl(x^{(\alpha)}\bigr) + \sum_{\alpha=1}^d \sum_{j=1}^N c_{\alpha j} K_1\bigl(x^{(\alpha)}, x_{j*}^{(\alpha)}\bigr).$$

Let $m = Nd$. Define the vectors $b = (b_0, b_1, \ldots, b_d)'$ and $c = (c_{11}, \ldots, c_{dN})' = (c_1, \ldots, c_m)'$. For $\alpha = 1, \ldots, d$, define the $n \times N$ matrix $R_\alpha = \bigl(K_1\bigl(x_i^{(\alpha)}, x_{j*}^{(\alpha)}\bigr)\bigr)$. Let $R = [R_1 \cdots R_d]$ and

$$T = \begin{bmatrix} 1 & k_1\bigl(x_1^{(1)}\bigr) & \cdots & k_1\bigl(x_1^{(d)}\bigr) \\ \vdots & \vdots & & \vdots \\ 1 & k_1\bigl(x_n^{(1)}\bigr) & \cdots & k_1\bigl(x_n^{(d)}\bigr) \end{bmatrix}.$$

Define $f = (f_1, \ldots, f_n)'$; then $f = Tb + Rc$. The objective function in (10) can be expressed as

$$I_\lambda(b, c, y) = \frac{1}{n} \sum_{i=1}^n \Biggl[-y_i \Biggl(\sum_{\alpha=1}^{d+1} T_{i\alpha} b_\alpha + \sum_{j=1}^m R_{ij} c_j\Biggr) + b\Biggl(\sum_{\alpha=1}^{d+1} T_{i\alpha} b_\alpha + \sum_{j=1}^m R_{ij} c_j\Biggr)\Biggr] + \lambda_d \sum_{\alpha=2}^{d+1} |b_\alpha| + \lambda_s \sum_{j=1}^m |c_j|. \quad (B5)$$
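To make the matrix setup concrete, here is a minimal numerical sketch, assuming the Bernoulli case (so $b(f) = \log(1 + e^f)$) and simple placeholder choices of $k_1$ and $K_1$; the paper's actual basis functions come from the smoothing spline ANOVA construction, and the real optimization of (B5) is done with the slice modeling machinery rather than by merely evaluating the objective:

```python
import numpy as np

def k1(x):
    # Placeholder for the basis function k1 (assumed here to be the
    # centered linear polynomial x - 1/2 on [0, 1]).
    return x - 0.5

def K1(x, z):
    # Placeholder reproducing kernel K1(x, z) on [0, 1]^2; any symmetric
    # positive definite kernel illustrates the structure of R_alpha.
    return np.minimum(x, z) - x * z

def lbp_objective(b, c, y, X, Xstar, lam_d, lam_s):
    """Evaluate I_lambda(b, c, y) in (B5) for Bernoulli data.
    X is n x d (covariates scaled to [0, 1]); Xstar is N x d (knots x_{j*})."""
    n, d = X.shape
    T = np.column_stack([np.ones(n)] + [k1(X[:, a]) for a in range(d)])
    R = np.column_stack([K1(X[:, [a]], Xstar[None, :, a])
                         for a in range(d)])            # n x (N d)
    f = T @ b + R @ c
    nll = np.mean(-y * f + np.log1p(np.exp(f)))         # -loglik, b(f) = log(1 + e^f)
    return nll + lam_d * np.abs(b[1:]).sum() + lam_s * np.abs(c).sum()

# Example call with synthetic data:
# rng = np.random.default_rng(0); n, d, N = 50, 3, 10
# X, Xstar, y = rng.random((n, d)), rng.random((N, d)), rng.integers(0, 2, 50)
# lbp_objective(np.zeros(d + 1), np.zeros(N * d), y, X, Xstar, .1, .1)
```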

With the observed response $y$, we denote the minimizer of (B5) by $(\hat b, \hat c)$. When a small perturbation $\varepsilon$ is imposed on the response, we denote the minimizer of $I_\lambda(b, c, y + \varepsilon)$ by $(\tilde b, \tilde c)$. Then we have $f_\lambda^y = T\hat b + R\hat c$ and $f_\lambda^{y+\varepsilon} = T\tilde b + R\tilde c$. The $l_1$ penalty in (B5) tends to produce sparse solutions; that is, many components of $\hat b$ and $\hat c$ are exactly 0 in the solution. For simplicity of explanation, we assume that the first $s$ components of $\hat b$ and the first $r$ components of $\hat c$ are nonzero; that is,

$$\hat b = (\underbrace{\hat b_1, \ldots, \hat b_s}_{\ne 0}, \underbrace{\hat b_{s+1}, \ldots, \hat b_{d+1}}_{=0})'
\qquad \text{and} \qquad
\hat c = (\underbrace{\hat c_1, \ldots, \hat c_r}_{\ne 0}, \underbrace{\hat c_{r+1}, \ldots, \hat c_m}_{=0})'.$$

The general case is similar but notationally more complicated. The 0's in the solutions are robust against a small perturbation in the data; that is, when the magnitude of the perturbation $\varepsilon$ is small enough, the 0 elements will stay at 0. This can be seen by transforming the optimization problem (B5) into a nonlinear programming problem with a differentiable convex objective function and linear constraints and considering the Karush–Kuhn–Tucker (KKT) conditions (see, e.g., Mangasarian 1969). Thus it is reasonable to assume that the solution of (B5) with the perturbed data $y + \varepsilon$ has the form

$$\tilde b = (\underbrace{\tilde b_1, \ldots, \tilde b_s}_{\ne 0}, \underbrace{\tilde b_{s+1}, \ldots, \tilde b_{d+1}}_{=0})'
\qquad \text{and} \qquad
\tilde c = (\underbrace{\tilde c_1, \ldots, \tilde c_r}_{\ne 0}, \underbrace{\tilde c_{r+1}, \ldots, \tilde c_m}_{=0})'.$$

For convenience, we denote the first $s$ components of $b$ by $b^*$ and the first $r$ components of $c$ by $c^*$. Correspondingly, let $T^*$ be the submatrix containing the first $s$ columns of $T$, and let $R^*$ be the submatrix consisting of the first $r$ columns of $R$. Then

$$f_\lambda^{y+\varepsilon} - f_\lambda^y = T(\tilde b - \hat b) + R(\tilde c - \hat c) = [T^* \; R^*] \begin{bmatrix} \tilde b^* - \hat b^* \\ \tilde c^* - \hat c^* \end{bmatrix}. \quad (B6)$$

Now

$$\biggl[\frac{\partial I_\lambda}{\partial b^*}, \frac{\partial I_\lambda}{\partial c^*}\biggr]'\bigg|_{(\tilde b, \tilde c, y+\varepsilon)} = 0
\qquad \text{and} \qquad
\biggl[\frac{\partial I_\lambda}{\partial b^*}, \frac{\partial I_\lambda}{\partial c^*}\biggr]'\bigg|_{(\hat b, \hat c, y)} = 0. \quad (B7)$$

The first-order Taylor approximation of $[\partial I_\lambda/\partial b^*, \partial I_\lambda/\partial c^*]'|_{(\tilde b, \tilde c, y+\varepsilon)}$ at $(\hat b, \hat c, y)$ gives

$$\begin{aligned}
\biggl[\frac{\partial I_\lambda}{\partial b^*}, \frac{\partial I_\lambda}{\partial c^*}\biggr]'\bigg|_{(\tilde b, \tilde c, y+\varepsilon)}
\approx{}& \biggl[\frac{\partial I_\lambda}{\partial b^*}, \frac{\partial I_\lambda}{\partial c^*}\biggr]'\bigg|_{(\hat b, \hat c, y)}
+ \begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial b^* \partial b^{*\prime}} & \dfrac{\partial^2 I_\lambda}{\partial b^* \partial c^{*\prime}} \\[2ex] \dfrac{\partial^2 I_\lambda}{\partial c^* \partial b^{*\prime}} & \dfrac{\partial^2 I_\lambda}{\partial c^* \partial c^{*\prime}} \end{bmatrix}_{(\hat b, \hat c, y)} \begin{bmatrix} \tilde b^* - \hat b^* \\ \tilde c^* - \hat c^* \end{bmatrix} \\
& + \begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial b^* \partial y'} \\[2ex] \dfrac{\partial^2 I_\lambda}{\partial c^* \partial y'} \end{bmatrix}_{(\hat b, \hat c, y)} (y + \varepsilon - y) \quad (B8)
\end{aligned}$$

when the magnitude of $\varepsilon$ is small. Define

$$U \equiv \begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial b^* \partial b^{*\prime}} & \dfrac{\partial^2 I_\lambda}{\partial b^* \partial c^{*\prime}} \\[2ex] \dfrac{\partial^2 I_\lambda}{\partial c^* \partial b^{*\prime}} & \dfrac{\partial^2 I_\lambda}{\partial c^* \partial c^{*\prime}} \end{bmatrix}_{(\hat b, \hat c, y)}
\qquad \text{and} \qquad
V \equiv -\begin{bmatrix} \dfrac{\partial^2 I_\lambda}{\partial b^* \partial y'} \\[2ex] \dfrac{\partial^2 I_\lambda}{\partial c^* \partial y'} \end{bmatrix}_{(\hat b, \hat c, y)}.$$

From (B7) and (B8), we have

$$U \begin{bmatrix} \tilde b^* - \hat b^* \\ \tilde c^* - \hat c^* \end{bmatrix} \approx V\varepsilon.$$

Then (B6) gives us $f_\lambda^{y+\varepsilon} - f_\lambda^y \approx H\varepsilon$, where

$$H \equiv [T^* \; R^*]\, U^{-1} V. \quad (B9)$$

Now consider a special form of perturbation, $\varepsilon_0 = (0, \ldots, \mu_{\lambda i}^{[-i]} - y_i, \ldots, 0)'$, with the single nonzero entry in the $i$th position; then $f_\lambda^{y+\varepsilon_0} - f_\lambda^y \approx \varepsilon_{0i} H_{\cdot i}$, where $\varepsilon_{0i} = \mu_{\lambda i}^{[-i]} - y_i$. Lemma 1 in Section 3 shows that $f_\lambda^{[-i]}$ is the minimizer of $I(f, y + \varepsilon_0)$. Therefore, $G_i$ in (B3) is

$$G_i = \frac{f_{\lambda i} - f_{\lambda i}^{[-i]}}{y_i - \mu_{\lambda i}^{[-i]}} = \frac{f_{\lambda i}^{[-i]} - f_{\lambda i}}{\varepsilon_{0i}} \approx h_{ii}.$$

From (B4), an ACV score is then

$$\mathrm{ACV}(\lambda) = \frac{1}{n} \sum_{i=1}^n \bigl[-y_i f_{\lambda i} + b(f_{\lambda i})\bigr] + \frac{1}{n} \sum_{i=1}^n \frac{h_{ii}\, y_i (y_i - \mu_{\lambda i})}{1 - \sigma^2_{\lambda i} h_{ii}}. \quad (B10)$$
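A minimal sketch of evaluating (B10) in the Bernoulli case, assuming the diagonal entries $h_{ii}$ of $H$ in (B9) have already been computed from the fitted sparsity pattern:

```python
import numpy as np

def acv_score(y, f_lambda, h_diag):
    """ACV(lambda) of (B10) for Bernoulli data: observed likelihood
    term plus the leave-one-out correction with G_i ~ h_ii."""
    mu = 1.0 / (1.0 + np.exp(-f_lambda))   # mu_{lambda i} = b'(f_{lambda i})
    sigma2 = mu * (1.0 - mu)               # sigma^2_{lambda i} = b''(f_{lambda i})
    obs = np.mean(-y * f_lambda + np.log1p(np.exp(f_lambda)))
    return obs + np.mean(h_diag * y * (y - mu) / (1.0 - sigma2 * h_diag))

# ACV is evaluated over a grid of candidate lambda values and the
# minimizer is selected; the randomized version mentioned in the
# article avoids forming H explicitly when n is large.
```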

APPENDIX C: WISCONSIN EPIDEMIOLOGIC STUDY OF DIABETIC RETINOPATHY

• Continuous covariates:
  X1 (dur): duration of diabetes at the time of baseline examination, years
  X2 (gly): glycosylated hemoglobin, a measure of hyperglycemia
  X3 (bmi): body mass index, kg/m²
  X4 (sys): systolic blood pressure, mm Hg
  X5 (ret): retinopathy level
  X6 (pulse): pulse rate, count for 30 seconds
  X7 (ins): insulin dose, kg/day
  X8 (sch): years of school completed
  X9 (iop): intraocular pressure, mm Hg

• Categorical covariates:
  Z1 (smk): smoking status (0 = no, 1 = any)
  Z2 (sex): gender (0 = female, 1 = male)
  Z3 (asp): use of at least one aspirin for at least three months while diabetic (0 = no, 1 = yes)
  Z4 (famdb): family history of diabetes (0 = none, 1 = yes)
  Z5 (mar): marital status (0 = no, 1 = yes/ever)

APPENDIX D: BEAVER DAM EYE STUDY

• Continuous covariates:
  X1 (pky): pack years smoked, (packs per day/20) × years smoked
  X2 (sch): highest year of school/college completed, years
  X3 (inc): total household personal income, thousands/month
  X4 (bmi): body mass index, kg/m²
  X5 (glu): glucose (serum), mg/dL
  X6 (cal): calcium (serum), mg/dL
  X7 (chl): cholesterol (serum), mg/dL
  X8 (hgb): hemoglobin (blood), g/dL
  X9 (sys): systolic blood pressure, mm Hg
  X10 (age): age at examination, years

• Categorical covariates:
  Z1 (cv): history of cardiovascular disease (0 = no, 1 = yes)
  Z2 (sex): gender (0 = female, 1 = male)
  Z3 (hair): hair color (0 = blond/red, 1 = brown/black)
  Z4 (hist): history of heavy drinking (0 = never, 1 = past/currently)
  Z5 (nout): winter leisure time (0 = indoors, 1 = outdoors)
  Z6 (mar): marital status (0 = no, 1 = yes/ever)
  Z7 (sum): part of day spent outdoors in summer (0 = < 1/4 day, 1 = > 1/4 day)
  Z8 (vtm): vitamin use (0 = no, 1 = yes)

[Received July 2002. Revised February 2004.]

REFERENCES

Bakin, S. (1999), "Adaptive Regression and Model Selection in Data Mining Problems," unpublished doctoral thesis, Australian National University.

Chen, S., Donoho, D., and Saunders, M. (1998), "Atomic Decomposition by Basis Pursuit," SIAM Journal on Scientific Computing, 20, 33–61.

Craig, B. A., Fryback, D. G., Klein, R., and Klein, B. (1999), "A Bayesian Approach to Modeling the Natural History of a Chronic Condition From Observations With Intervention," Statistics in Medicine, 18, 1355–1371.

Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation," Numerische Mathematik, 31, 377–403.

Davison, A. C., and Hinkley, D. V. (1997), Bootstrap Methods and Their Application, Cambridge, U.K.: Cambridge University Press.

Fan, J., and Li, R. Z. (2001), "Variable Selection via Penalized Likelihood," Journal of the American Statistical Association, 96, 1348–1360.

Ferris, M. C., and Voelker, M. M. (2000), "Slice Models in General Purpose Modeling Systems," Optimization Methods and Software, 17, 1009–1032.

Ferris, M. C., and Voelker, M. M. (2001), "Slice Models in GAMS," in Operations Research Proceedings, eds. P. Chamoni, R. Leisten, A. Martin, J. Minnemann, and H. Stadtler, New York: Springer-Verlag, pp. 239–246.

Frank, I. E., and Friedman, J. H. (1993), "A Statistical View of Some Chemometrics Regression Tools," Technometrics, 35, 109–148.

Fu, W. J. (1998), "Penalized Regression: The Bridge versus the LASSO," Journal of Computational and Graphical Statistics, 7, 397–416.

Gao, F., Wahba, G., Klein, R., and Klein, B. (2001), "Smoothing Spline ANOVA for Multivariate Bernoulli Observations, With Application to Ophthalmology Data," Journal of the American Statistical Association, 96, 127–160.

Girard, D. (1998), "Asymptotic Comparison of (Partial) Cross-Validation, GCV and Randomized GCV in Nonparametric Regression," The Annals of Statistics, 26, 315–334.

Gu, C. (2002), Smoothing Spline ANOVA Models, New York: Springer-Verlag.

Gu, C., and Kim, Y. J. (2002), "Penalized Likelihood Regression: General Formulation and Efficient Approximation," Canadian Journal of Statistics, 30, 619–628.

Gunn, S. R., and Kandola, J. S. (2002), "Structural Modelling With Sparse Kernels," Machine Learning, 48, 115–136.

Hastie, T. J., and Tibshirani, R. J. (1990), Generalized Additive Models, New York: Chapman & Hall.

Hutchinson, M. (1989), "A Stochastic Estimator for the Trace of the Influence Matrix for Laplacian Smoothing Splines," Communications in Statistics: Simulation, 18, 1059–1076.

Kim, K. (1995), "A Bivariate Cumulative Probit Regression Model for Ordered Categorical Data," Statistics in Medicine, 14, 1341–1352.

Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82–95.

Klein, R., Klein, B., Lee, K., Cruickshanks, K., and Chappell, R. (2001), "Changes in Visual Acuity in a Population Over a 10-Year Period: Beaver Dam Eye Study," Ophthalmology, 108, 1757–1766.

Klein, R., Klein, B., Linton, K., and DeMets, D. L. (1991), "The Beaver Dam Eye Study: Visual Acuity," Ophthalmology, 98, 1310–1315.

Klein, R., Klein, B., Moss, S., and Cruickshanks, K. (1998), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: XVII. The 14-Year Incidence and Progression of Diabetic Retinopathy and Associated Risk Factors in Type 1 Diabetes," Ophthalmology, 105, 1801–1815.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1984a), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: II. Prevalence and Risk of Diabetes When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 102, 520–526.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1984b), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: III. Prevalence and Risk of Diabetes When Age at Diagnosis Is 30 or More Years," Archives of Ophthalmology, 102, 527–532.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1989), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: IX. Four-Year Incidence and Progression of Diabetic Retinopathy When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 107, 237–243.

Knight, K., and Fu, W. J. (2000), "Asymptotics for Lasso-Type Estimators," The Annals of Statistics, 28, 1356–1378.

Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., and Klein, B. (2000), "Smoothing Spline ANOVA Models for Large Data Sets With Bernoulli Observations and the Randomized GACV," The Annals of Statistics, 28, 1570–1600.

Linhart, H., and Zucchini, W. (1986), Model Selection, New York: Wiley.

Mangasarian, O. (1969), Nonlinear Programming, New York: McGraw-Hill.

Murtagh, B. A., and Saunders, M. A. (1983), "MINOS 5.5 User's Guide," Technical Report SOL 83-20R, Stanford University, OR Dept.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 45, 205–223.

Tibshirani, R. J. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Ser. B, 58, 267–288.

Wahba, G. (1990), Spline Models for Observational Data (Vol. 59), Philadelphia: SIAM.

Wahba, G., Wang, Y., Gu, C., Klein, R., and Klein, B. (1995), "Smoothing Spline ANOVA for Exponential Families, With Application to the Wisconsin Epidemiological Study of Diabetic Retinopathy," The Annals of Statistics, 23, 1865–1895.

Wahba, G., and Wold, S. (1975), "A Completely Automatic French Curve," Communications in Statistics, 4, 1–17.

Xiang, D., and Wahba, G. (1996), "A Generalized Approximate Cross-Validation for Smoothing Splines With Non-Gaussian Data," Statistica Sinica, 6, 675–692.

Xiang, D., and Wahba, G. (1998), "Approximate Smoothing Spline Methods for Large Data Sets in the Binary Case," in Proceedings of the American Statistical Association Joint Statistical Meetings, Biometrics Section, pp. 94–98.

Yau, P., Kohn, R., and Wood, S. (2003), "Bayesian Variable Selection and Model Averaging in High-Dimensional Multinomial Nonparametric Regression," Journal of Computational and Graphical Statistics, 12, 23–54.

Zhang, H. H. (2002), "Nonparametric Variable Selection and Model Building via Likelihood Basis Pursuit," Technical Report 1066, University of Wisconsin-Madison, Dept. of Statistics.


diabetes and the rest of the population differs We consider10 continuous and 8 categorical covariates the detailed in-formation for which is given in Appendix D The abbreviatednames of continuous covariates are pky sch inc bmi glu calchl hgb sys and age and those of the categorical covariatesare cv sex hair hist nout mar sum and vtm We deliberatelytake into account some ldquonoisyrdquo variables in the analysis in-cluding hair nout and sum which are not directly related tomortality in general We include these to show the performanceof the LBP approach and they are not expected to be pickedout eventually by the model Y is assigned a value of 1 if a per-son participated in the baseline examination and died beforethe start of the first 5-year follow-up and 0 otherwise Thereare 4422 nondiabetic study participants in the baseline exam-ination 395 of whom have missing data in the covariates Forthe purpose of this study we assume that the missing data aremissing at random thus these 335 subjects are not includedin our analysis This assumption is not necessarily valid be-cause age blood pressure body mass index cholesterol sexsmoking and hemoglobin may well affect the missingness buta further examination of the missingness is beyond the scopeof the present study In addition we exclude another 10 partic-ipants who have either outlier values pky gt 158 or very abnor-mal records bmi gt 58 or hgb lt 6 Thus we report an analysisof the remaining 4017 nondiabetic participants from the base-line population

We fit the main-effects model incorporating categorical vari-ables in (7) The sequential Monte Carlo bootstrap tests for sixcovariates age hgb pky sex sys and cv all have Monte Carlop values 151

= 02 whereas the test for glu is not signifi-cant with a p value 951 = 18 The threshold is chosen asq = L(6) = 25 Figure 10 plots the L1 norm scores for all ofthe potential risk factors Using the threshold (dashed line) 25chosen by the sequential bootstrap test procedure the LBPmodel identifies six important risk factors age hgb pky sexsys and cv for the 5-year mortality

Compared with the LBP model the linear logistic model withstepwise selection using the AIC implemented by the func-tion glm in R misses the variable sys but selects three morevariables inc bmi and sum Figure 11 depicts the estimatedunivariate logit components for the important continuous vari-ables selected by the LBP model All of the curves can be ap-proximated reasonably well by linear models except sys whosefunctional form exhibits a quadratic shape This explains whysys is not selected by the linear logistic model When we refitthe logistic regression model by including sys2 in the modelthe stepwise selection picked out both sys and sys2

Figure 10 L1 Norm Scores for the BDES Main-Effects Model

Zhang et al Likelihood Basis Pursuit 669

(a) (b)

(c) (d)

Figure 11 Estimated Univariate Logit Components for ImportantVariables (a) age (b) hgb (c) pky (d) sys

9 DISCUSSION

We have proposed the LBP approach for variable selection inhigh-dimensional nonparametric model building In the spirit ofLASSO LBP produces shrinkage functional estimates by im-posing the l1 penalty on the coefficients of the basis functionsUsing the proposed measure of importance for the functionalcomponents LBP effectively selects important variables andthe results are highly interpretable LBP can handle continuousvariables and categorical variables simultaneously Although inthis article our continuous variables have all been on subsets ofthe real line it is clear that other continuous domains are possi-ble We have used LBP in the context of the Bernoulli distribu-tion but it can be extended to other exponential distributions aswell (of course to Gaussian data) We expect that larger num-bers of variables than that considered here may be handled andwe expect that there will be many other scientific applicationsof the method We plan to provide freeware for public use

We believe that this method is a useful addition to the toolboxof the data analyst It provides a way to examine the possibleeffects of a large number of variables in a nonparametric man-ner complementary to standard parametric models in its abilityto find nonparametric terms that may be missed by paramet-ric methods It has an advantage over quadratically penalizedlikelihood methods when the aim is to examine a large numberof variables or terms simultaneously inasmuch as the l1 penal-ties result in sparse solutions In practice it can be an efficienttool for examining complex datasets to identify and prioritizevariables (and possibly interactions) for further study Basedonly on the variables or interactions identified by the LBP onecan build more traditional parametric or penalized likelihoodmodels for which confidence intervals and theoretical proper-ties are known

APPENDIX A PROOF OF LEMMA 1

For i = 1 n we have minusl(micro[minusi]λ (xi ) τ ) = minusmicro

[minusi]λ (xi )τ + b(τ)

Let fλ[minusi] be the minimizer of the objective function

1

n

sum

j =i

[minusl(yj f (xj )

)]+ Jλ(f ) (A1)

Because

part(minusl(micro[minusi]λ (xi ) τ ))

partτ= minusmicro

[minusi]λ (xi ) + b(τ )

and

part2(minusl(micro[minusi]λ (xi ) τ ))

part2τ= b(τ ) gt 0

we see that minusl(micro[minusi]λ (xi ) τ ) achieves its unique minimum at τ that

satisfies b(τ ) = micro[minusi]λ (xi ) So τ = f

[minusi]λ (xi ) Then for any f we have

minusl(micro

[minusi]λ (xi ) f

[minusi]λ (xi )

)le minusl(micro

[minusi]λ (xi ) f (xi )

) (A2)

Define yminusi = (y1 yiminus1micro[minusi]λ (xi ) yi+1 yn)prime For any f

Iλ(fyminusi )

= 1

n

minusl

(micro

[minusi]λ (xi ) f (xi )

)+sum

j =i

[minusl(yj f (xj )

)]+ Jλ(f )

ge 1

n

minusl

(micro

[minusi]λ (xi ) f

[minusi]λ (xi )

)+sum

j =i

[minusl(yj f (xj )

)]+ Jλ(f )

ge 1

n

minusl

(micro

[minusi]λ (xi ) f

[minusi]λ (xi )

)+sum

j =i

[minusl(yj f

[minusi]λ (xj )

)]

+ Jλ

(f

[minusi]λ

)

The first inequality comes from (A2) The second inequality is dueto the fact that f

[minusi]λ (middot) is the minimizer of (A1) Thus we have that

hλ(imicro[minusi]λ (xi ) middot) = f

[minusi]λ (middot)

APPENDIX B DERIVATION OF ACV

Here we derive the ACV in (11) for the main-effects model thederivation for two- or higher-order interaction models is similar Writemicro(x) = E(Y |X = x) and σ 2(x) = var(Y |X = x) For the functionalestimate f we define fi = f (xi ) and microi = micro(xi ) for i = 1 nTo emphasize that an estimate is associated with the parameter λwe also use fλi = fλ(xi ) and microλi = microλ(xi ) Now define

OBS(λ) = 1

n

nsum

i=1

[minusyifλi + b(fλi )]

This is the observed CKL which because yi and fλi are usually pos-itively correlated tends to underestimate the CKL To correct thiswe use the leave-one-out CV in (9)

CV(λ) = 1

n

nsum

i=1

[minusyifλi[minusi] + b(fλi )

]

= OBS(λ) + 1

n

nsum

i=1

yi

(fλi minus fλi

[minusi])

= OBS(λ) + 1

n

nsum

i=1

yi

fλi minus f[minusi]λi

yi minus micro[minusi]λi

times yi minus microλi

1 minus (microλi minus micro[minusi]λi )(yi minus micro

[minusi]λi )

(B1)

670 Journal of the American Statistical Association September 2004

For exponential families we have microλi = b(fλi) and σ 2λi

= b(fλi)If we approximate the finite difference by the functional derivativethen we get

microλi minus micro[minusi]λi

yi minus micro[minusi]λi

= b(fλi) minus b(f[minusi]λi )

yi minus micro[minusi]λi

asymp b(fλi)fλi minus f

[minusi]λi

yi minus micro[minusi]λi

= σ 2λi

fλi minus f[minusi]λi

yi minus micro[minusi]λi

(B2)

Denote

Gi equiv fλi minus f[minusi]λi

yi minus micro[minusi]λi

(B3)

then from (B2) and (B1) we get

CV(λ) asymp OBS(λ) + 1

n

nsum

i=1

Giyi (yi minus microλi)

1 minus Giσ2λi

(B4)

Now we proceed to approximate Gi Consider the main-effectsmodel with

f (x) = b0 +dsum

α=1

bαk1(x(α)

)+dsum

α=1

Nsum

j=1

cαj K1(x(α) x

(α)jlowast

)

Let m = Nd Define the vectors b = (b0 b1 bd )prime and c =(c11 cdN )prime = (c1 cm)prime For α = 1 d define the n times N

matrix Rα = (K1(x(α)i

x(α)jlowast )) Let R = [R1 Rd ] and

T =

1 k1(x(1)1 ) middot middot middot k1(x

(d)1 )

1 k1(x(1)n ) middot middot middot k1(x

(d)n )

Define f = (f1 fn)prime then f = Tb + Rc The objective functionin (10) can be expressed as

Iλ(b cy) = 1

n

nsum

i=1

[minusyi

(d+1sum

α=1

Tiαbα +msum

j=1

Rij cj

)

+ b

(d+1sum

α=1

Tiαbα +msum

j=1

Rij cj

)]

+ λd

d+1sum

α=2

|bα | + λs

msum

j=1

|cj | (B5)

With the observed response y we denote the minimizer of (B5)by (b c) When a small perturbation ε is imposed on the responsewe denote the minimizer of Iλ(b cy + ε) by (b c) Then we havefyλ = T b + Rc and fy+ε

λ = Tb + Rc The l1 penalty in (B5) tends toproduce sparse solutions that is many components of b and c are ex-actly 0 in the solution For simplicity of explanation we assume thatthe first s components of b and the first r components of c are nonzerothat is

b = (b1 bs︸ ︷︷ ︸=0

bs+1 bd+1︸ ︷︷ ︸=0

)prime

and

c = (c1 cr︸ ︷︷ ︸=0

cr+1 cm︸ ︷︷ ︸=0

)prime

The general case is similar but notationally more complicatedThe 0rsquos in the solutions are robust against a small perturbation in

data That is when the magnitude of perturbation ε is small enoughthe 0 elements will stay at 0 This can be seen by transforming theoptimization problem (B5) into a nonlinear programming problemwith a differentiable convex objective function and linear constraintsand considering the KarushndashKuhnndashTucker (KKT) conditions (see egMangasarian 1969) Thus it is reasonable to assume that the solutionof (B5) with the perturbed data y + ε has the form

b = (b1 bs︸ ︷︷ ︸=0

bs+1 bd+1︸ ︷︷ ︸=0

)prime

and

c = (c1 cr︸ ︷︷ ︸=0

cr+1 cm︸ ︷︷ ︸=0

)prime

For convenience we denote the first s components of b by blowast andthe first r components of c by clowast Correspondingly let Tlowast be the sub-matrix containing the first s columns of T and let Rlowast be the submatrixconsisting of the first r columns of R Then

fy+ελ minus fy

λ = T(b minus b) + R(c minus c) = [Tlowast Rlowast ][

blowast minus blowastclowast minus clowast

] (B6)

Now[

partIλ

partblowast partIλ

partclowast]prime

(bcy+ε)

= 0 and

(B7)[partIλ

partblowast partIλ

partclowast]prime

(bcy)

= 0

The first-order Taylor approximation of [ partIλpartblowast partIλ

partclowast ]prime(bcy+ε)

at (b cy)

gives[

partIλ

partblowast partIλ

partclowast]

(bcy+ε)

asymp[ partIλ

partblowastpartIλpartclowast

]

(bcy)

+

part2Iλ

partblowast partblowastprimepart2Iλ

partblowast partclowastprimepart2Iλ

partclowast partblowastprimepart2Iλ

partclowast partclowastprime

(bcy)

[blowast minus blowastclowast minus clowast

]

+[ part2Iλ

partblowast partyprimepart2Iλ

partclowast partyprime

]

(bcy)

(y + ε minus y) (B8)

when the magnitude of ε is small Define

U equiv

part2Iλ

partblowast partblowastprimepart2Iλ

partblowast partclowastprimepart2Iλ

partclowast partblowastprimepart2Iλ

partclowast partclowastprime

(bcy)

and

V equiv minus[ part2Iλ

partblowast partyprimepart2Iλ

partclowast partyprime

]

(bcy)

From (B7) and (B8) we have

U[

blowast minus blowastclowast minus clowast

]asymp Vε

Then (B6) gives us fy+ελ minus fy

λ asymp Hε where

H equiv [ Tlowast Rlowast ] Uminus1V (B9)

Now consider a special form of perturbation ε0 = (0 micro[minusi]λi

minus yi

0)prime then fy+ε0λ minus fy

λ asymp ε0iHmiddoti where ε0i = micro[minusi]λi minus yi Lemma 1

Zhang et al Likelihood Basis Pursuit 671

in Section 3 shows that f[minusi]λ is the minimizer of I (fy + ε0) There-

fore Gi in (B3) is

Gi = fλi minus f[minusi]λi

yi minus micro[minusi]λi

= f[minusi]λi minus fλi

ε0iasymp hii

From (B4) an ACV score is then

ACV(λ) = 1

n

nsum

i=1

[minusyifλi + b(fλi )] + 1

n

nsum

i=1

hiiyi (yi minus microλi)

1 minus σ 2λihii

(B10)

APPENDIX C WISCONSIN EPIDEMIOLOGIC STUDYOF DIABETIC RETINOPATHY

bull Continuous covariatesX1 (dur) duration of diabetes at the time of baseline

examination yearsX2 (gly) glycosylated hemoglobin a measure of hy-

perglycemia X3 (bmi) body mass index kgm2

X4 (sys) systolic blood pressure mm HgX5 (ret) retinopathy levelX6 ( pulse) pulse rate count for 30 secondsX7 (ins) insulin dose kgdayX8 (sch) years of school completedX9 (iop) intraocular pressure mm Hg

bull Categorical covariatesZ1 (smk) smoking status (0 = no 1 = any)Z2 (sex) gender (0 = female 1 = male)Z3 (asp) use of at least one aspirin for (0 = no

1 = yes)at least three months while diabetic

Z4 ( famdb) family history of diabetes (0 = none1 = yes)

Z5 (mar) marital status (0 = no 1 = yesever)

APPENDIX D BEAVER DAM EYE STUDY

bull Continuous covariatesX1 ( pky) pack years smoked (packs per

day20)lowastyears smokedX2 (sch) highest year of schoolcollege completed

yearsX3 (inc) total household personal income

thousandsmonthX4 (bmi) body mass index kgm2

X5 (glu) glucose (serum) mgdLX6 (cal) calcium (serum) mgdLX7 (chl) cholesterol (serum) mgdLX8 (hgb) hemoglobin (blood) gdLX9 (sys) systolic blood pressure mm HgX10 (age) age at examination years

bull Categorical covariatesZ1 (cv) history of cardiovascular disease (0 = no 1 =

yes)Z2 (sex) gender (0 = female 1 = male)Z3 (hair) hair color (0 = blondred 1 = brownblack)Z4 (hist) history of heavy drinking (0 = never

1 = pastcurrently)Z5 (nout) winter leisure time (0 = indoors

1 = outdoors)Z6 (mar) marital status (0 = no 1 = yesever)Z7 (sum) part of day spent outdoors in summer

(0 =lt 14 day 1 =gt 14 day)Z8 (vtm) vitamin use (0 = no 1 = yes)

[Received July 2002 Revised February 2004]

REFERENCES

Bakin S (1999) ldquoAdaptive Regression and Model Selection in Data MiningProblemsrdquo unpublished doctoral thesis Australia National University

Chen S Donoho D and Saunders M (1998) ldquoAtomic Decomposition byBasis Pursuitrdquo SIAM Journal of the Scientific Computing 20 33ndash61

Craig B A Fryback D G Klein R and Klein B (1999) ldquoA BayesianApproach to Modeling the Natural History of a Chronic Condition From Ob-servations With Interventionrdquo Statistics in Medicine 18 1355ndash1371

Craven P and Wahba G (1979) ldquoSmoothing Noise Data With Spline Func-tions Estimating the Correct Degree of Smoothing by the Method of Gener-alized Cross-Validationrdquo Numerische Mathematik 31 377ndash403

Davison A C and Hinkley D V (1997) Bootstrap Methods and Their Ap-plication Cambridge UK Cambridge University Press

Fan J and Li R Z (2001) ldquoVariable Selection via Penalized LikelihoodrdquoJournal of the American Statistical Association 96 1348ndash1360

Ferris M C and Voelker M M (2000) ldquoSlice Models in General PurposeModeling Systemsrdquo Optimization Methods and Software 17 1009ndash1032

(2001) ldquoSlice Models in GAMSrdquo in Operations Research Proceed-ings eds P Chamoni R Leisten A Martin J Minnemann and H StadtlerNew York Springer-Verlag pp 239ndash246

Frank I E and Friedman J H (1993) ldquoA Statistical View of Some Chemo-metrics Regression Toolsrdquo Technometrics 35 109ndash148

Fu W J (1998) ldquoPenalized Regression The Bridge versus the LASSOrdquo Jour-nal of Computational and Graphical Statistics 7 397ndash416

Gao F Wahba G Klein R and Klein B (2001) ldquoSmoothing SplineANOVA for Multivariate Bernoulli Observations With Application to Oph-thalmology Datardquo Journal of the American Statistical Association 96127ndash160

Girard D (1998) ldquoAsymptotic Comparison of (Partial) Cross-Validation GCVand Randomized GCV in Nonparametric Regressionrdquo The Annals of Statis-tics 26 315ndash334

Gu C (2002) Smoothing Spline ANOVA Models New York Springer-VerlagGu C and Kim Y J (2002) ldquoPenalized Likelihood Regression General

Formulation and Efficient Approximationrdquo Canadian Journal of Statistics30 619ndash628

Gunn S R and Kandola J S (2002) ldquoStructural Modelling With SparseKernelsrdquo Machine Learning 48 115ndash136

Hastie T J and Tibshirani R J (1990) Generalized Additive ModelsNew York Chapman amp Hall

Hutchinson M (1989) ldquoA Stochastic Estimator for the Trace of the Influ-ence Matrix for Laplacian Smoothing Splinesrdquo Communication in StatisticsSimulation 18 1059ndash1076

Kim K (1995) ldquoA Bivariate Cumulative Probit Regression Model for OrderedCategorical Datardquo Statistics in Medicine 14 1341ndash1352

Kimeldorf G and Wahba G (1971) ldquoSome Results on Tchebycheffian SplineFunctionsrdquo Journal of Mathematics Analysis and Applications 33 82ndash95

Klein R Klein B Lee K Cruickshanks K and Chappell R (2001)ldquoChanges in Visual Acuity in a Population Over a 10-Year Period BeaverDam Eye Studyrdquo Ophthalmology 108 1757ndash1766

Klein R Klein B Linton K and DeMets D L (1991) ldquoThe Beaver DamEye Study Visual Acuityrdquo Ophthalmology 98 1310ndash1315

Klein R Klein B Moss S and Cruickshanks K (1998) ldquoThe WisconsinEpidemiologic Study of Diabetic Retinopathy XVII The 14-Year Incidenceand Progression of Diabetic Retinopathy and Associated Risk Factors inType 1 Diabetesrdquo Ophthalmology 105 1801ndash1815

Klein R Klein B Moss S E Davis M D and DeMets D L (1984a)ldquoThe Wisconsin Epidemiologic Study of Diabetic Retinopathy II Prevalenceand Risk of Diabetes When Age at Diagnosis is Less Than 30 Yearsrdquo Archivesof Ophthalmology 102 520ndash526

(1984b) ldquoThe Wisconsin Epidemiologic Study of Diabetic Retinopa-thy III Prevalence and Risk of Diabetes When Age at Diagnosis is 30 orMore Yearsrdquo Archives of Ophthalmology 102 527ndash532

(1989) ldquoThe Wisconsin Epidemiologic Study of Diabetic Retinopa-thy IX Four-Year Incidence and Progression of Diabetic Retinopathy WhenAge at Diagnosis Is Less Than 30 Yearsrdquo Archives of Ophthalmology 107237ndash243

Knight K and Fu W J (2000) ldquoAsymptotics for Lasso-Type EstimatorsrdquoThe Annals of Statistics 28 1356ndash1378

Lin X Wahba G Xiang D Gao F Klein R and Klein B (2000)ldquoSmoothing Spline ANOVA Models for Large Data Sets With BernoulliObservations and the Randomized GACVrdquo The Annals of Statistics 281570ndash1600

Linhart H and Zucchini W (1986) Model Selection New York WileyMangasarian O (1969) Nonlinear Programming New York McGraw-HillMurtagh B A and Saunders M A (1983) ldquoMINOS 55 Userrsquos Guiderdquo

Technical Report SOL 83-20R Stanford University OR Dept

672 Journal of the American Statistical Association September 2004

Ruppert D and Carroll R J (2000) ldquoSpatially-Adaptive Penalties for SplineFittingrdquo Australian and New Zealand Journal of Statistics 45 205ndash223

Tibshirani R J (1996) ldquoRegression Shrinkage and Selection via the LassordquoJournal of the Royal Statistical Society Ser B 58 267ndash288

Wahba G (1990) Spline Models for Observational Data Vol 59Philadelphia SIAM

Wahba G Wang Y Gu C Klein R and Klein B (1995) ldquoSmoothingSpline ANOVA for Exponential Families With Application to the WisconsinEpidemiological Study of Diabetic Retinopathyrdquo The Annals of Statistics 231865ndash1895

Wahba G and Wold S (1975) ldquoA Completely Automatic French CurverdquoCommunications in Statistics 4 1ndash17

Xiang D and Wahba G (1996) ldquoA Generalized Approximate Cross-Validation for Smoothing Splines With Non-Gaussian Datardquo StatisticaSinica 6 675ndash692

(1998) ldquoApproximate Smoothing Spline Methods for Large Data Setsin the Binary Caserdquo in Proceedings of the American Statistical AssociationJoint Statistical Meetings Biometrics Section pp 94ndash98

Yau P Kohn R and Wood S (2003) ldquoBayesian Variable Selectionand Model Averaging in High-Dimensional Multinomial NonparametricRegressionrdquo Journal of Computational and Graphical Statistics 12 23ndash54

Zhang H H (2002) ldquoNonparametric Variable Selection and Model Build-ing via Likelihood Basis Pursuitrdquo Technical Report 1066 University ofWisconsin-Madison Dept of Statistics

Page 11: Variable Selection and Model Building viawahba/ftp1/zhang.lbp.pdf · Variable Selection and Model Building via ... and model build ing in the context of a smoothing spline ANOVA

Zhang et al Likelihood Basis Pursuit 669

(a) (b)

(c) (d)

Figure 11 Estimated Univariate Logit Components for ImportantVariables (a) age (b) hgb (c) pky (d) sys

9 DISCUSSION

We have proposed the LBP approach for variable selection inhigh-dimensional nonparametric model building In the spirit ofLASSO LBP produces shrinkage functional estimates by im-posing the l1 penalty on the coefficients of the basis functionsUsing the proposed measure of importance for the functionalcomponents LBP effectively selects important variables andthe results are highly interpretable LBP can handle continuousvariables and categorical variables simultaneously Although inthis article our continuous variables have all been on subsets ofthe real line it is clear that other continuous domains are possi-ble We have used LBP in the context of the Bernoulli distribu-tion but it can be extended to other exponential distributions aswell (of course to Gaussian data) We expect that larger num-bers of variables than that considered here may be handled andwe expect that there will be many other scientific applicationsof the method We plan to provide freeware for public use

We believe that this method is a useful addition to the toolboxof the data analyst It provides a way to examine the possibleeffects of a large number of variables in a nonparametric man-ner complementary to standard parametric models in its abilityto find nonparametric terms that may be missed by paramet-ric methods It has an advantage over quadratically penalizedlikelihood methods when the aim is to examine a large numberof variables or terms simultaneously inasmuch as the l1 penal-ties result in sparse solutions In practice it can be an efficienttool for examining complex datasets to identify and prioritizevariables (and possibly interactions) for further study Basedonly on the variables or interactions identified by the LBP onecan build more traditional parametric or penalized likelihoodmodels for which confidence intervals and theoretical proper-ties are known

APPENDIX A PROOF OF LEMMA 1

For i = 1 n we have minusl(micro[minusi]λ (xi ) τ ) = minusmicro

[minusi]λ (xi )τ + b(τ)

Let fλ[minusi] be the minimizer of the objective function

1

n

sum

j =i

[minusl(yj f (xj )

)]+ Jλ(f ) (A1)

Because

part(minusl(micro[minusi]λ (xi ) τ ))

partτ= minusmicro

[minusi]λ (xi ) + b(τ )

and

part2(minusl(micro[minusi]λ (xi ) τ ))

part2τ= b(τ ) gt 0

we see that minusl(micro[minusi]λ (xi ) τ ) achieves its unique minimum at τ that

satisfies b(τ ) = micro[minusi]λ (xi ) So τ = f

[minusi]λ (xi ) Then for any f we have

minusl(micro

[minusi]λ (xi ) f

[minusi]λ (xi )

)le minusl(micro

[minusi]λ (xi ) f (xi )

) (A2)

Define yminusi = (y1 yiminus1micro[minusi]λ (xi ) yi+1 yn)prime For any f

Iλ(fyminusi )

= 1

n

minusl

(micro

[minusi]λ (xi ) f (xi )

)+sum

j =i

[minusl(yj f (xj )

)]+ Jλ(f )

ge 1

n

minusl

(micro

[minusi]λ (xi ) f

[minusi]λ (xi )

)+sum

j =i

[minusl(yj f (xj )

)]+ Jλ(f )

ge 1

n

minusl

(micro

[minusi]λ (xi ) f

[minusi]λ (xi )

)+sum

j =i

[minusl(yj f

[minusi]λ (xj )

)]

+ Jλ

(f

[minusi]λ

)

The first inequality comes from (A2) The second inequality is dueto the fact that f

[minusi]λ (middot) is the minimizer of (A1) Thus we have that

hλ(imicro[minusi]λ (xi ) middot) = f

[minusi]λ (middot)

APPENDIX B DERIVATION OF ACV

Here we derive the ACV in (11) for the main-effects model thederivation for two- or higher-order interaction models is similar Writemicro(x) = E(Y |X = x) and σ 2(x) = var(Y |X = x) For the functionalestimate f we define fi = f (xi ) and microi = micro(xi ) for i = 1 nTo emphasize that an estimate is associated with the parameter λwe also use fλi = fλ(xi ) and microλi = microλ(xi ) Now define

OBS(λ) = 1

n

nsum

i=1

[minusyifλi + b(fλi )]

This is the observed CKL which because yi and fλi are usually pos-itively correlated tends to underestimate the CKL To correct thiswe use the leave-one-out CV in (9)

CV(λ) = 1

n

nsum

i=1

[minusyifλi[minusi] + b(fλi )

]

= OBS(λ) + 1

n

nsum

i=1

yi

(fλi minus fλi

[minusi])

= OBS(λ) + 1

n

nsum

i=1

yi

fλi minus f[minusi]λi

yi minus micro[minusi]λi

times yi minus microλi

1 minus (microλi minus micro[minusi]λi )(yi minus micro

[minusi]λi )

(B1)

670 Journal of the American Statistical Association September 2004

For exponential families we have microλi = b(fλi) and σ 2λi

= b(fλi)If we approximate the finite difference by the functional derivativethen we get

microλi minus micro[minusi]λi

yi minus micro[minusi]λi

= b(fλi) minus b(f[minusi]λi )

yi minus micro[minusi]λi

asymp b(fλi)fλi minus f

[minusi]λi

yi minus micro[minusi]λi

= σ 2λi

fλi minus f[minusi]λi

yi minus micro[minusi]λi

(B2)

Denote

Gi equiv fλi minus f[minusi]λi

yi minus micro[minusi]λi

(B3)

then from (B2) and (B1) we get

CV(λ) asymp OBS(λ) + 1

n

nsum

i=1

Giyi (yi minus microλi)

1 minus Giσ2λi

(B4)

Now we proceed to approximate Gi Consider the main-effectsmodel with

f (x) = b0 +dsum

α=1

bαk1(x(α)

)+dsum

α=1

Nsum

j=1

cαj K1(x(α) x

(α)jlowast

)

Let m = Nd Define the vectors b = (b0 b1 bd )prime and c =(c11 cdN )prime = (c1 cm)prime For α = 1 d define the n times N

matrix Rα = (K1(x(α)i

x(α)jlowast )) Let R = [R1 Rd ] and

T =

1 k1(x(1)1 ) middot middot middot k1(x

(d)1 )

1 k1(x(1)n ) middot middot middot k1(x

(d)n )

Define f = (f1 fn)prime then f = Tb + Rc The objective functionin (10) can be expressed as

Iλ(b cy) = 1

n

nsum

i=1

[minusyi

(d+1sum

α=1

Tiαbα +msum

j=1

Rij cj

)

+ b

(d+1sum

α=1

Tiαbα +msum

j=1

Rij cj

)]

+ λd

d+1sum

α=2

|bα | + λs

msum

j=1

|cj | (B5)

With the observed response y we denote the minimizer of (B5)by (b c) When a small perturbation ε is imposed on the responsewe denote the minimizer of Iλ(b cy + ε) by (b c) Then we havefyλ = T b + Rc and fy+ε

λ = Tb + Rc The l1 penalty in (B5) tends toproduce sparse solutions that is many components of b and c are ex-actly 0 in the solution For simplicity of explanation we assume thatthe first s components of b and the first r components of c are nonzerothat is

b = (b1 bs︸ ︷︷ ︸=0

bs+1 bd+1︸ ︷︷ ︸=0

)prime

and

c = (c1 cr︸ ︷︷ ︸=0

cr+1 cm︸ ︷︷ ︸=0

)prime

The general case is similar but notationally more complicatedThe 0rsquos in the solutions are robust against a small perturbation in

data That is when the magnitude of perturbation ε is small enoughthe 0 elements will stay at 0 This can be seen by transforming theoptimization problem (B5) into a nonlinear programming problemwith a differentiable convex objective function and linear constraintsand considering the KarushndashKuhnndashTucker (KKT) conditions (see egMangasarian 1969) Thus it is reasonable to assume that the solutionof (B5) with the perturbed data y + ε has the form

b = (b1 bs︸ ︷︷ ︸=0

bs+1 bd+1︸ ︷︷ ︸=0

)prime

and

c = (c1 cr︸ ︷︷ ︸=0

cr+1 cm︸ ︷︷ ︸=0

)prime

For convenience we denote the first s components of b by blowast andthe first r components of c by clowast Correspondingly let Tlowast be the sub-matrix containing the first s columns of T and let Rlowast be the submatrixconsisting of the first r columns of R Then

fy+ελ minus fy

λ = T(b minus b) + R(c minus c) = [Tlowast Rlowast ][

blowast minus blowastclowast minus clowast

] (B6)

Now[

partIλ

partblowast partIλ

partclowast]prime

(bcy+ε)

= 0 and

(B7)[partIλ

partblowast partIλ

partclowast]prime

(bcy)

= 0

The first-order Taylor approximation of [ partIλpartblowast partIλ

partclowast ]prime(bcy+ε)

at (b cy)

gives[

partIλ

partblowast partIλ

partclowast]

(bcy+ε)

asymp[ partIλ

partblowastpartIλpartclowast

]

(bcy)

+

part2Iλ

partblowast partblowastprimepart2Iλ

partblowast partclowastprimepart2Iλ

partclowast partblowastprimepart2Iλ

partclowast partclowastprime

(bcy)

[blowast minus blowastclowast minus clowast

]

+[ part2Iλ

partblowast partyprimepart2Iλ

partclowast partyprime

]

(bcy)

(y + ε minus y) (B8)

when the magnitude of ε is small Define

U equiv

part2Iλ

partblowast partblowastprimepart2Iλ

partblowast partclowastprimepart2Iλ

partclowast partblowastprimepart2Iλ

partclowast partclowastprime

(bcy)

and

V equiv minus[ part2Iλ

partblowast partyprimepart2Iλ

partclowast partyprime

]

(bcy)

From (B7) and (B8) we have

U[

blowast minus blowastclowast minus clowast

]asymp Vε

Then (B6) gives us fy+ελ minus fy

λ asymp Hε where

H equiv [ Tlowast Rlowast ] Uminus1V (B9)

Now consider a special form of perturbation ε0 = (0 micro[minusi]λi

minus yi

0)prime then fy+ε0λ minus fy

λ asymp ε0iHmiddoti where ε0i = micro[minusi]λi minus yi Lemma 1

Zhang et al Likelihood Basis Pursuit 671

in Section 3 shows that f[minusi]λ is the minimizer of I (fy + ε0) There-

fore Gi in (B3) is

Gi = fλi minus f[minusi]λi

yi minus micro[minusi]λi

= f[minusi]λi minus fλi

ε0iasymp hii

From (B4) an ACV score is then

ACV(λ) = 1

n

nsum

i=1

[minusyifλi + b(fλi )] + 1

n

nsum

i=1

hiiyi (yi minus microλi)

1 minus σ 2λihii

(B10)

APPENDIX C WISCONSIN EPIDEMIOLOGIC STUDYOF DIABETIC RETINOPATHY

bull Continuous covariatesX1 (dur) duration of diabetes at the time of baseline

examination yearsX2 (gly) glycosylated hemoglobin a measure of hy-

perglycemia X3 (bmi) body mass index kgm2

X4 (sys) systolic blood pressure mm HgX5 (ret) retinopathy levelX6 ( pulse) pulse rate count for 30 secondsX7 (ins) insulin dose kgdayX8 (sch) years of school completedX9 (iop) intraocular pressure mm Hg

bull Categorical covariatesZ1 (smk) smoking status (0 = no 1 = any)Z2 (sex) gender (0 = female 1 = male)Z3 (asp) use of at least one aspirin for (0 = no

1 = yes)at least three months while diabetic

Z4 ( famdb) family history of diabetes (0 = none1 = yes)

Z5 (mar) marital status (0 = no 1 = yesever)

APPENDIX D BEAVER DAM EYE STUDY

bull Continuous covariatesX1 ( pky) pack years smoked (packs per

day20)lowastyears smokedX2 (sch) highest year of schoolcollege completed

yearsX3 (inc) total household personal income

thousandsmonthX4 (bmi) body mass index kgm2

X5 (glu) glucose (serum) mgdLX6 (cal) calcium (serum) mgdLX7 (chl) cholesterol (serum) mgdLX8 (hgb) hemoglobin (blood) gdLX9 (sys) systolic blood pressure mm HgX10 (age) age at examination years

bull Categorical covariatesZ1 (cv) history of cardiovascular disease (0 = no 1 =

yes)Z2 (sex) gender (0 = female 1 = male)Z3 (hair) hair color (0 = blondred 1 = brownblack)Z4 (hist) history of heavy drinking (0 = never

1 = pastcurrently)Z5 (nout) winter leisure time (0 = indoors

1 = outdoors)Z6 (mar) marital status (0 = no 1 = yesever)Z7 (sum) part of day spent outdoors in summer

(0 =lt 14 day 1 =gt 14 day)Z8 (vtm) vitamin use (0 = no 1 = yes)

[Received July 2002 Revised February 2004]

REFERENCES

Bakin S (1999) ldquoAdaptive Regression and Model Selection in Data MiningProblemsrdquo unpublished doctoral thesis Australia National University

Chen S Donoho D and Saunders M (1998) ldquoAtomic Decomposition byBasis Pursuitrdquo SIAM Journal of the Scientific Computing 20 33ndash61

Craig B A Fryback D G Klein R and Klein B (1999) ldquoA BayesianApproach to Modeling the Natural History of a Chronic Condition From Ob-servations With Interventionrdquo Statistics in Medicine 18 1355ndash1371

Craven P and Wahba G (1979) ldquoSmoothing Noise Data With Spline Func-tions Estimating the Correct Degree of Smoothing by the Method of Gener-alized Cross-Validationrdquo Numerische Mathematik 31 377ndash403

Davison A C and Hinkley D V (1997) Bootstrap Methods and Their Ap-plication Cambridge UK Cambridge University Press

Fan J and Li R Z (2001) ldquoVariable Selection via Penalized LikelihoodrdquoJournal of the American Statistical Association 96 1348ndash1360

Ferris M C and Voelker M M (2000) ldquoSlice Models in General PurposeModeling Systemsrdquo Optimization Methods and Software 17 1009ndash1032

(2001) ldquoSlice Models in GAMSrdquo in Operations Research Proceed-ings eds P Chamoni R Leisten A Martin J Minnemann and H StadtlerNew York Springer-Verlag pp 239ndash246

Frank I E and Friedman J H (1993) ldquoA Statistical View of Some Chemo-metrics Regression Toolsrdquo Technometrics 35 109ndash148

Fu W J (1998) ldquoPenalized Regression The Bridge versus the LASSOrdquo Jour-nal of Computational and Graphical Statistics 7 397ndash416

Gao F Wahba G Klein R and Klein B (2001) ldquoSmoothing SplineANOVA for Multivariate Bernoulli Observations With Application to Oph-thalmology Datardquo Journal of the American Statistical Association 96127ndash160

Girard D (1998) ldquoAsymptotic Comparison of (Partial) Cross-Validation GCVand Randomized GCV in Nonparametric Regressionrdquo The Annals of Statis-tics 26 315ndash334

Gu C (2002) Smoothing Spline ANOVA Models New York Springer-VerlagGu C and Kim Y J (2002) ldquoPenalized Likelihood Regression General

Formulation and Efficient Approximationrdquo Canadian Journal of Statistics30 619ndash628

Gunn S R and Kandola J S (2002) ldquoStructural Modelling With SparseKernelsrdquo Machine Learning 48 115ndash136

Hastie T J and Tibshirani R J (1990) Generalized Additive ModelsNew York Chapman amp Hall

Hutchinson M (1989) ldquoA Stochastic Estimator for the Trace of the Influ-ence Matrix for Laplacian Smoothing Splinesrdquo Communication in StatisticsSimulation 18 1059ndash1076

Kim K (1995) ldquoA Bivariate Cumulative Probit Regression Model for OrderedCategorical Datardquo Statistics in Medicine 14 1341ndash1352

Kimeldorf G and Wahba G (1971) ldquoSome Results on Tchebycheffian SplineFunctionsrdquo Journal of Mathematics Analysis and Applications 33 82ndash95

Klein R Klein B Lee K Cruickshanks K and Chappell R (2001)ldquoChanges in Visual Acuity in a Population Over a 10-Year Period BeaverDam Eye Studyrdquo Ophthalmology 108 1757ndash1766

Klein R Klein B Linton K and DeMets D L (1991) ldquoThe Beaver DamEye Study Visual Acuityrdquo Ophthalmology 98 1310ndash1315

Klein R Klein B Moss S and Cruickshanks K (1998) ldquoThe WisconsinEpidemiologic Study of Diabetic Retinopathy XVII The 14-Year Incidenceand Progression of Diabetic Retinopathy and Associated Risk Factors inType 1 Diabetesrdquo Ophthalmology 105 1801ndash1815

Klein R Klein B Moss S E Davis M D and DeMets D L (1984a)ldquoThe Wisconsin Epidemiologic Study of Diabetic Retinopathy II Prevalenceand Risk of Diabetes When Age at Diagnosis is Less Than 30 Yearsrdquo Archivesof Ophthalmology 102 520ndash526

(1984b) ldquoThe Wisconsin Epidemiologic Study of Diabetic Retinopa-thy III Prevalence and Risk of Diabetes When Age at Diagnosis is 30 orMore Yearsrdquo Archives of Ophthalmology 102 527ndash532

(1989) ldquoThe Wisconsin Epidemiologic Study of Diabetic Retinopa-thy IX Four-Year Incidence and Progression of Diabetic Retinopathy WhenAge at Diagnosis Is Less Than 30 Yearsrdquo Archives of Ophthalmology 107237ndash243

Knight K and Fu W J (2000) ldquoAsymptotics for Lasso-Type EstimatorsrdquoThe Annals of Statistics 28 1356ndash1378

Lin X Wahba G Xiang D Gao F Klein R and Klein B (2000)ldquoSmoothing Spline ANOVA Models for Large Data Sets With BernoulliObservations and the Randomized GACVrdquo The Annals of Statistics 281570ndash1600

Linhart H and Zucchini W (1986) Model Selection New York WileyMangasarian O (1969) Nonlinear Programming New York McGraw-HillMurtagh B A and Saunders M A (1983) ldquoMINOS 55 Userrsquos Guiderdquo

Technical Report SOL 83-20R Stanford University OR Dept

672 Journal of the American Statistical Association September 2004

Ruppert D and Carroll R J (2000) ldquoSpatially-Adaptive Penalties for SplineFittingrdquo Australian and New Zealand Journal of Statistics 45 205ndash223

Tibshirani R J (1996) ldquoRegression Shrinkage and Selection via the LassordquoJournal of the Royal Statistical Society Ser B 58 267ndash288

Wahba G (1990) Spline Models for Observational Data Vol 59Philadelphia SIAM

Wahba G Wang Y Gu C Klein R and Klein B (1995) ldquoSmoothingSpline ANOVA for Exponential Families With Application to the WisconsinEpidemiological Study of Diabetic Retinopathyrdquo The Annals of Statistics 231865ndash1895

Wahba G and Wold S (1975) ldquoA Completely Automatic French CurverdquoCommunications in Statistics 4 1ndash17

Xiang D and Wahba G (1996) ldquoA Generalized Approximate Cross-Validation for Smoothing Splines With Non-Gaussian Datardquo StatisticaSinica 6 675ndash692

(1998) ldquoApproximate Smoothing Spline Methods for Large Data Setsin the Binary Caserdquo in Proceedings of the American Statistical AssociationJoint Statistical Meetings Biometrics Section pp 94ndash98

Yau P Kohn R and Wood S (2003) ldquoBayesian Variable Selectionand Model Averaging in High-Dimensional Multinomial NonparametricRegressionrdquo Journal of Computational and Graphical Statistics 12 23ndash54

Zhang H H (2002) ldquoNonparametric Variable Selection and Model Build-ing via Likelihood Basis Pursuitrdquo Technical Report 1066 University ofWisconsin-Madison Dept of Statistics

Page 12: Variable Selection and Model Building viawahba/ftp1/zhang.lbp.pdf · Variable Selection and Model Building via ... and model build ing in the context of a smoothing spline ANOVA

670 Journal of the American Statistical Association September 2004

For exponential families we have microλi = b(fλi) and σ 2λi

= b(fλi)If we approximate the finite difference by the functional derivativethen we get

microλi minus micro[minusi]λi

yi minus micro[minusi]λi

= b(fλi) minus b(f[minusi]λi )

yi minus micro[minusi]λi

asymp b(fλi)fλi minus f

[minusi]λi

yi minus micro[minusi]λi

= σ 2λi

fλi minus f[minusi]λi

yi minus micro[minusi]λi

(B2)

Denote

Gi equiv fλi minus f[minusi]λi

yi minus micro[minusi]λi

(B3)

then from (B2) and (B1) we get

CV(λ) asymp OBS(λ) + 1

n

nsum

i=1

Giyi (yi minus microλi)

1 minus Giσ2λi

(B4)

Now we proceed to approximate Gi Consider the main-effectsmodel with

f (x) = b0 +dsum

α=1

bαk1(x(α)

)+dsum

α=1

Nsum

j=1

cαj K1(x(α) x

(α)jlowast

)

Let m = Nd Define the vectors b = (b0 b1 bd )prime and c =(c11 cdN )prime = (c1 cm)prime For α = 1 d define the n times N

matrix Rα = (K1(x(α)i

x(α)jlowast )) Let R = [R1 Rd ] and

T =

1 k1(x(1)1 ) middot middot middot k1(x

(d)1 )

1 k1(x(1)n ) middot middot middot k1(x

(d)n )

Define f = (f1 fn)prime then f = Tb + Rc The objective functionin (10) can be expressed as

Iλ(b cy) = 1

n

nsum

i=1

[minusyi

(d+1sum

α=1

Tiαbα +msum

j=1

Rij cj

)

+ b

(d+1sum

α=1

Tiαbα +msum

j=1

Rij cj

)]

+ λd

d+1sum

α=2

|bα | + λs

msum

j=1

|cj | (B5)

With the observed response $\mathbf{y}$, we denote the minimizer of (B.5) by $(\hat{\mathbf{b}}, \hat{\mathbf{c}})$. When a small perturbation $\boldsymbol{\varepsilon}$ is imposed on the response, we denote the minimizer of $I_\lambda(\mathbf{b}, \mathbf{c}; \mathbf{y} + \boldsymbol{\varepsilon})$ by $(\tilde{\mathbf{b}}, \tilde{\mathbf{c}})$. Then we have $\mathbf{f}^{\mathbf{y}}_\lambda = T\hat{\mathbf{b}} + R\hat{\mathbf{c}}$ and $\mathbf{f}^{\mathbf{y}+\boldsymbol{\varepsilon}}_\lambda = T\tilde{\mathbf{b}} + R\tilde{\mathbf{c}}$. The $l_1$ penalty in (B.5) tends to produce sparse solutions; that is, many components of $\hat{\mathbf{b}}$ and $\hat{\mathbf{c}}$ are exactly 0 in the solution. For simplicity of explanation, we assume that the first $s$ components of $\hat{\mathbf{b}}$ and the first $r$ components of $\hat{\mathbf{c}}$ are nonzero; that is,

$$\hat{\mathbf{b}} = (\underbrace{\hat{b}_1, \ldots, \hat{b}_s}_{\neq 0},\ \underbrace{\hat{b}_{s+1}, \ldots, \hat{b}_{d+1}}_{=0})' \quad\text{and}\quad \hat{\mathbf{c}} = (\underbrace{\hat{c}_1, \ldots, \hat{c}_r}_{\neq 0},\ \underbrace{\hat{c}_{r+1}, \ldots, \hat{c}_m}_{=0})'.$$

The general case is similar but notationally more complicated. The 0's in the solution are robust against a small perturbation in the data; that is, when the magnitude of the perturbation $\boldsymbol{\varepsilon}$ is small enough, the 0 elements stay at 0. This can be seen by transforming the optimization problem (B.5) into a nonlinear programming problem with a differentiable convex objective function and linear constraints and considering the Karush-Kuhn-Tucker (KKT) conditions (see, e.g., Mangasarian 1969). Thus it is reasonable to assume that the solution of (B.5) with the perturbed data $\mathbf{y} + \boldsymbol{\varepsilon}$ has the form

$$\tilde{\mathbf{b}} = (\underbrace{\tilde{b}_1, \ldots, \tilde{b}_s}_{\neq 0},\ \underbrace{\tilde{b}_{s+1}, \ldots, \tilde{b}_{d+1}}_{=0})' \quad\text{and}\quad \tilde{\mathbf{c}} = (\underbrace{\tilde{c}_1, \ldots, \tilde{c}_r}_{\neq 0},\ \underbrace{\tilde{c}_{r+1}, \ldots, \tilde{c}_m}_{=0})'.$$

For convenience, we denote the first $s$ components of $\mathbf{b}$ by $\mathbf{b}_*$ and the first $r$ components of $\mathbf{c}$ by $\mathbf{c}_*$ (and similarly for the hatted and tilded versions). Correspondingly, let $T_*$ be the submatrix containing the first $s$ columns of $T$ and let $R_*$ be the submatrix consisting of the first $r$ columns of $R$. Then

$$\mathbf{f}^{\mathbf{y}+\boldsymbol{\varepsilon}}_\lambda - \mathbf{f}^{\mathbf{y}}_\lambda = T(\tilde{\mathbf{b}} - \hat{\mathbf{b}}) + R(\tilde{\mathbf{c}} - \hat{\mathbf{c}}) = [\,T_*\ \ R_*\,]\begin{bmatrix} \tilde{\mathbf{b}}_* - \hat{\mathbf{b}}_* \\ \tilde{\mathbf{c}}_* - \hat{\mathbf{c}}_* \end{bmatrix}. \tag{B.6}$$

Now

$$\left[\frac{\partial I_\lambda}{\partial \mathbf{b}_*},\ \frac{\partial I_\lambda}{\partial \mathbf{c}_*}\right]'\Bigg|_{(\tilde{\mathbf{b}},\,\tilde{\mathbf{c}};\,\mathbf{y}+\boldsymbol{\varepsilon})} = \mathbf{0} \quad\text{and}\quad \left[\frac{\partial I_\lambda}{\partial \mathbf{b}_*},\ \frac{\partial I_\lambda}{\partial \mathbf{c}_*}\right]'\Bigg|_{(\hat{\mathbf{b}},\,\hat{\mathbf{c}};\,\mathbf{y})} = \mathbf{0}. \tag{B.7}$$

The first-order Taylor approximation of $[\partial I_\lambda/\partial \mathbf{b}_*,\ \partial I_\lambda/\partial \mathbf{c}_*]'$ evaluated at $(\tilde{\mathbf{b}}, \tilde{\mathbf{c}};\,\mathbf{y}+\boldsymbol{\varepsilon})$, expanded around $(\hat{\mathbf{b}}, \hat{\mathbf{c}};\,\mathbf{y})$, gives

$$\begin{bmatrix}\dfrac{\partial I_\lambda}{\partial \mathbf{b}_*}\\[1.5ex] \dfrac{\partial I_\lambda}{\partial \mathbf{c}_*}\end{bmatrix}_{(\tilde{\mathbf{b}},\,\tilde{\mathbf{c}};\,\mathbf{y}+\boldsymbol{\varepsilon})} \approx \begin{bmatrix}\dfrac{\partial I_\lambda}{\partial \mathbf{b}_*}\\[1.5ex] \dfrac{\partial I_\lambda}{\partial \mathbf{c}_*}\end{bmatrix}_{(\hat{\mathbf{b}},\,\hat{\mathbf{c}};\,\mathbf{y})} + \begin{bmatrix}\dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}_*\,\partial \mathbf{b}_*'} & \dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}_*\,\partial \mathbf{c}_*'}\\[1.5ex] \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}_*\,\partial \mathbf{b}_*'} & \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}_*\,\partial \mathbf{c}_*'}\end{bmatrix}_{(\hat{\mathbf{b}},\,\hat{\mathbf{c}};\,\mathbf{y})} \begin{bmatrix}\tilde{\mathbf{b}}_* - \hat{\mathbf{b}}_*\\ \tilde{\mathbf{c}}_* - \hat{\mathbf{c}}_*\end{bmatrix} + \begin{bmatrix}\dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}_*\,\partial \mathbf{y}'}\\[1.5ex] \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}_*\,\partial \mathbf{y}'}\end{bmatrix}_{(\hat{\mathbf{b}},\,\hat{\mathbf{c}};\,\mathbf{y})} (\mathbf{y} + \boldsymbol{\varepsilon} - \mathbf{y}) \tag{B.8}$$

when the magnitude of $\boldsymbol{\varepsilon}$ is small. Define

$$U \equiv \begin{bmatrix}\dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}_*\,\partial \mathbf{b}_*'} & \dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}_*\,\partial \mathbf{c}_*'}\\[1.5ex] \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}_*\,\partial \mathbf{b}_*'} & \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}_*\,\partial \mathbf{c}_*'}\end{bmatrix}_{(\hat{\mathbf{b}},\,\hat{\mathbf{c}};\,\mathbf{y})} \quad\text{and}\quad V \equiv -\begin{bmatrix}\dfrac{\partial^2 I_\lambda}{\partial \mathbf{b}_*\,\partial \mathbf{y}'}\\[1.5ex] \dfrac{\partial^2 I_\lambda}{\partial \mathbf{c}_*\,\partial \mathbf{y}'}\end{bmatrix}_{(\hat{\mathbf{b}},\,\hat{\mathbf{c}};\,\mathbf{y})}.$$

From (B.7) and (B.8) we have

$$U\begin{bmatrix}\tilde{\mathbf{b}}_* - \hat{\mathbf{b}}_*\\ \tilde{\mathbf{c}}_* - \hat{\mathbf{c}}_*\end{bmatrix} \approx V\boldsymbol{\varepsilon}.$$

Then (B.6) gives us $\mathbf{f}^{\mathbf{y}+\boldsymbol{\varepsilon}}_\lambda - \mathbf{f}^{\mathbf{y}}_\lambda \approx H\boldsymbol{\varepsilon}$, where

$$H \equiv [\,T_*\ \ R_*\,]\,U^{-1}V. \tag{B.9}$$

Now consider a special form of perturbation, $\boldsymbol{\varepsilon}_0 = (0, \ldots, \mu^{[-i]}_{\lambda i} - y_i, \ldots, 0)'$; then $\mathbf{f}^{\mathbf{y}+\boldsymbol{\varepsilon}_0}_\lambda - \mathbf{f}^{\mathbf{y}}_\lambda \approx \varepsilon_{0i} H_{\cdot i}$, where $\varepsilon_{0i} = \mu^{[-i]}_{\lambda i} - y_i$. Lemma 1 in Section 3 shows that $f^{[-i]}_\lambda$ is the minimizer of $I(f; \mathbf{y} + \boldsymbol{\varepsilon}_0)$. Therefore, $G_i$ in (B.3) is

$$G_i = \frac{f_{\lambda i} - f^{[-i]}_{\lambda i}}{y_i - \mu^{[-i]}_{\lambda i}} = \frac{f^{[-i]}_{\lambda i} - f_{\lambda i}}{\varepsilon_{0i}} \approx h_{ii}.$$

From (B.4), an ACV score is then

$$\mathrm{ACV}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\big[-y_i f_{\lambda i} + b(f_{\lambda i})\big] + \frac{1}{n}\sum_{i=1}^{n}\frac{h_{ii}\,y_i\,(y_i - \mu_{\lambda i})}{1 - \sigma^2_{\lambda i}\,h_{ii}}. \tag{B.10}$$
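In the Bernoulli case these pieces can be assembled directly. Differentiating (B.5) twice (on the active set, the $l_1$ terms contribute nothing to second derivatives) gives $U = n^{-1}X_*'WX_*$ and $V = n^{-1}X_*'$, where $X = [T\ R]$, $X_*$ holds the active columns, and $W = \mathrm{diag}(\sigma^2_{\lambda 1}, \ldots, \sigma^2_{\lambda n})$, so that $H = X_*(X_*'WX_*)^{-1}X_*'$. The sketch below computes the resulting ACV score under these assumptions; `acv_score` and its arguments are illustrative names.

```python
import numpy as np

def acv_score(theta_hat, X, y):
    """ACV(lambda) of (B.10) for Bernoulli data, b(f) = log(1 + e^f).

    theta_hat : (d+1+m,) minimizer of (B.5): the stacked coefficients (b, c)
    X         : (n, d+1+m) combined design matrix [T R]
    y         : (n,) binary responses
    """
    f = X @ theta_hat
    mu = 1.0 / (1.0 + np.exp(-f))            # mu_i = b'(f_i)
    w = mu * (1.0 - mu)                      # sigma^2_i = b''(f_i)
    Xs = X[:, np.abs(theta_hat) > 1e-10]     # active (nonzero) columns
    # H = X_* (X_*' W X_*)^{-1} X_*'; h_ii are its diagonal entries
    h = np.diag(Xs @ np.linalg.solve(Xs.T @ (w[:, None] * Xs), Xs.T))
    obs = np.mean(-y * f + np.logaddexp(0.0, f))
    return obs + np.mean(h * y * (y - mu) / (1.0 - w * h))
```

The GACV used in the article follows from this ACV by replacing the individual $h_{ii}$ with suitable trace averages, and the randomized version estimates those traces stochastically for large datasets (cf. Lin et al. 2000).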

APPENDIX C: WISCONSIN EPIDEMIOLOGIC STUDY OF DIABETIC RETINOPATHY

• Continuous covariates
  X1 (dur): duration of diabetes at the time of baseline examination, years
  X2 (gly): glycosylated hemoglobin, a measure of hyperglycemia
  X3 (bmi): body mass index, kg/m²
  X4 (sys): systolic blood pressure, mm Hg
  X5 (ret): retinopathy level
  X6 (pulse): pulse rate, count for 30 seconds
  X7 (ins): insulin dose, kg/day
  X8 (sch): years of school completed
  X9 (iop): intraocular pressure, mm Hg

• Categorical covariates
  Z1 (smk): smoking status (0 = no, 1 = any)
  Z2 (sex): gender (0 = female, 1 = male)
  Z3 (asp): use of at least one aspirin for at least three months while diabetic (0 = no, 1 = yes)
  Z4 (famdb): family history of diabetes (0 = none, 1 = yes)
  Z5 (mar): marital status (0 = no, 1 = yes/ever)

APPENDIX D: BEAVER DAM EYE STUDY

• Continuous covariates
  X1 (pky): pack-years smoked, (packs per day/20) × years smoked
  X2 (sch): highest year of school/college completed, years
  X3 (inc): total household personal income, thousands/month
  X4 (bmi): body mass index, kg/m²
  X5 (glu): glucose (serum), mg/dL
  X6 (cal): calcium (serum), mg/dL
  X7 (chl): cholesterol (serum), mg/dL
  X8 (hgb): hemoglobin (blood), g/dL
  X9 (sys): systolic blood pressure, mm Hg
  X10 (age): age at examination, years

• Categorical covariates
  Z1 (cv): history of cardiovascular disease (0 = no, 1 = yes)
  Z2 (sex): gender (0 = female, 1 = male)
  Z3 (hair): hair color (0 = blond/red, 1 = brown/black)
  Z4 (hist): history of heavy drinking (0 = never, 1 = past/currently)
  Z5 (nout): winter leisure time (0 = indoors, 1 = outdoors)
  Z6 (mar): marital status (0 = no, 1 = yes/ever)
  Z7 (sum): part of day spent outdoors in summer (0 = < 1/4 day, 1 = > 1/4 day)
  Z8 (vtm): vitamin use (0 = no, 1 = yes)

[Received July 2002. Revised February 2004.]

REFERENCES

Bakin, S. (1999), "Adaptive Regression and Model Selection in Data Mining Problems," unpublished doctoral thesis, Australian National University.

Chen, S., Donoho, D., and Saunders, M. (1998), "Atomic Decomposition by Basis Pursuit," SIAM Journal on Scientific Computing, 20, 33–61.

Craig, B. A., Fryback, D. G., Klein, R., and Klein, B. (1999), "A Bayesian Approach to Modeling the Natural History of a Chronic Condition From Observations With Intervention," Statistics in Medicine, 18, 1355–1371.

Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation," Numerische Mathematik, 31, 377–403.

Davison, A. C., and Hinkley, D. V. (1997), Bootstrap Methods and Their Application, Cambridge, U.K.: Cambridge University Press.

Fan, J., and Li, R. Z. (2001), "Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties," Journal of the American Statistical Association, 96, 1348–1360.

Ferris, M. C., and Voelker, M. M. (2000), "Slice Models in General Purpose Modeling Systems," Optimization Methods and Software, 17, 1009–1032.

Ferris, M. C., and Voelker, M. M. (2001), "Slice Models in GAMS," in Operations Research Proceedings, eds. P. Chamoni, R. Leisten, A. Martin, J. Minnemann, and H. Stadtler, New York: Springer-Verlag, pp. 239–246.

Frank, I. E., and Friedman, J. H. (1993), "A Statistical View of Some Chemometrics Regression Tools," Technometrics, 35, 109–148.

Fu, W. J. (1998), "Penalized Regression: The Bridge versus the LASSO," Journal of Computational and Graphical Statistics, 7, 397–416.

Gao, F., Wahba, G., Klein, R., and Klein, B. (2001), "Smoothing Spline ANOVA for Multivariate Bernoulli Observations, With Application to Ophthalmology Data," Journal of the American Statistical Association, 96, 127–160.

Girard, D. (1998), "Asymptotic Comparison of (Partial) Cross-Validation, GCV and Randomized GCV in Nonparametric Regression," The Annals of Statistics, 26, 315–334.

Gu, C. (2002), Smoothing Spline ANOVA Models, New York: Springer-Verlag.

Gu, C., and Kim, Y. J. (2002), "Penalized Likelihood Regression: General Formulation and Efficient Approximation," Canadian Journal of Statistics, 30, 619–628.

Gunn, S. R., and Kandola, J. S. (2002), "Structural Modelling With Sparse Kernels," Machine Learning, 48, 115–136.

Hastie, T. J., and Tibshirani, R. J. (1990), Generalized Additive Models, New York: Chapman & Hall.

Hutchinson, M. (1989), "A Stochastic Estimator for the Trace of the Influence Matrix for Laplacian Smoothing Splines," Communications in Statistics - Simulation and Computation, 18, 1059–1076.

Kim, K. (1995), "A Bivariate Cumulative Probit Regression Model for Ordered Categorical Data," Statistics in Medicine, 14, 1341–1352.

Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82–95.

Klein, R., Klein, B., Lee, K., Cruickshanks, K., and Chappell, R. (2001), "Changes in Visual Acuity in a Population Over a 10-Year Period: Beaver Dam Eye Study," Ophthalmology, 108, 1757–1766.

Klein, R., Klein, B., Linton, K., and DeMets, D. L. (1991), "The Beaver Dam Eye Study: Visual Acuity," Ophthalmology, 98, 1310–1315.

Klein, R., Klein, B., Moss, S., and Cruickshanks, K. (1998), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: XVII. The 14-Year Incidence and Progression of Diabetic Retinopathy and Associated Risk Factors in Type 1 Diabetes," Ophthalmology, 105, 1801–1815.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1984a), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: II. Prevalence and Risk of Diabetes When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 102, 520–526.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1984b), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: III. Prevalence and Risk of Diabetes When Age at Diagnosis Is 30 or More Years," Archives of Ophthalmology, 102, 527–532.

Klein, R., Klein, B., Moss, S. E., Davis, M. D., and DeMets, D. L. (1989), "The Wisconsin Epidemiologic Study of Diabetic Retinopathy: IX. Four-Year Incidence and Progression of Diabetic Retinopathy When Age at Diagnosis Is Less Than 30 Years," Archives of Ophthalmology, 107, 237–243.

Knight, K., and Fu, W. J. (2000), "Asymptotics for Lasso-Type Estimators," The Annals of Statistics, 28, 1356–1378.

Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., and Klein, B. (2000), "Smoothing Spline ANOVA Models for Large Data Sets With Bernoulli Observations and the Randomized GACV," The Annals of Statistics, 28, 1570–1600.

Linhart, H., and Zucchini, W. (1986), Model Selection, New York: Wiley.

Mangasarian, O. (1969), Nonlinear Programming, New York: McGraw-Hill.

Murtagh, B. A., and Saunders, M. A. (1983), "MINOS 5.5 User's Guide," Technical Report SOL 83-20R, Department of Operations Research, Stanford University.

Ruppert, D., and Carroll, R. J. (2000), "Spatially-Adaptive Penalties for Spline Fitting," Australian and New Zealand Journal of Statistics, 45, 205–223.

Tibshirani, R. J. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Ser. B, 58, 267–288.

Wahba, G. (1990), Spline Models for Observational Data, Vol. 59, Philadelphia: SIAM.

Wahba, G., Wang, Y., Gu, C., Klein, R., and Klein, B. (1995), "Smoothing Spline ANOVA for Exponential Families, With Application to the Wisconsin Epidemiological Study of Diabetic Retinopathy," The Annals of Statistics, 23, 1865–1895.

Wahba, G., and Wold, S. (1975), "A Completely Automatic French Curve," Communications in Statistics, 4, 1–17.

Xiang, D., and Wahba, G. (1996), "A Generalized Approximate Cross-Validation for Smoothing Splines With Non-Gaussian Data," Statistica Sinica, 6, 675–692.

Xiang, D., and Wahba, G. (1998), "Approximate Smoothing Spline Methods for Large Data Sets in the Binary Case," in Proceedings of the American Statistical Association Joint Statistical Meetings, Biometrics Section, pp. 94–98.

Yau, P., Kohn, R., and Wood, S. (2003), "Bayesian Variable Selection and Model Averaging in High-Dimensional Multinomial Nonparametric Regression," Journal of Computational and Graphical Statistics, 12, 23–54.

Zhang, H. H. (2002), "Nonparametric Variable Selection and Model Building via Likelihood Basis Pursuit," Technical Report 1066, University of Wisconsin-Madison, Dept. of Statistics.
