A STATISTICAL SHRINKAGE MODEL
AND ITS APPLICATIONS
Wenjiang J. Fu
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Public Health Sciences
University of Toronto
© Copyright by Wenjiang J. Fu, 1998
A Statistical Shrinkage Model And Its Applications
Doctor of Philosophy 1998
Wenjiang J. Fu
Department of Public Health Sciences
University of Toronto
Abstract
Bridge regression, a special type of penalized regression with penalty function λ Σ_j |β_j|^γ with γ ≥ 1, is considered. The Bridge estimator is obtained by solving the penalized score equations via the modified Newton-Raphson method for γ > 1 or the Shooting method for γ = 1. The Bridge estimator yields small variance with a little sacrifice of bias, and thus achieves small mean squared error and small prediction error when collinearity is present among regressors in a linear regression model. The concept of penalization is generalized via the penalized score equations, which allow the implementation of penalization regardless of the existence of joint likelihood functions. Penalization is then applied to generalized linear models and generalized estimating equations (GEE). The penalty parameter γ and the tuning parameter λ are selected via generalized cross-validation (GCV). A quasi-GCV is developed to select the parameters for the penalized GEE. Simulation studies show that the Bridge estimator performs well compared to the estimators of ridge regression (γ = 2) and the Lasso (γ = 1). Several data sets from public health studies are analyzed using the Bridge penalty model in the statistical settings of a linear regression model, a logistic regression model and a GEE model for binary outcomes.
To my parents,
my sister Shufen,
my wife Qi,
and my daughter Martina.
Acknowledgements
I am indebted to my supervisor, Professor R. Tibshirani, for introducing this very interesting research topic to me, and for his encouragement, support and supervision during my Ph.D. study.

I am grateful to my committee members, Professors P. Corey, J. Hsieh and D. Tritchler, and my internal and external examiners, Professors K. Knight and D. Hamilton, for their valuable suggestions and critiques.

I appreciate the valuable discussions with Professors R. Neal and J. Hsieh, which led to some very interesting points in my thesis.

I would like to thank Professor P. Corey for providing the environmental health data. I also would like to thank my friend Rafal for his help on some programming techniques.

I am also grateful to my parents-in-law, Shouyin and Peiyu, who took care of my daughter many late nights and weekends while I was studying in McMurrich Building.
Contents

Abstract

1 Introduction
  1.1 Introduction
  1.2 Some Background of Shrinkage Models
  1.3 Problems

2 Bridge Regressions
  2.1 Introduction
  2.2 Structure of the Bridge Estimators
  2.3 Algorithms for the Bridge and Lasso Estimators
  2.4 Variance of the Bridge Estimator
  2.5 Illustration of the Shrinkage Effect
  2.6 Bridge Regression for Orthonormal Matrix
  2.7 Bridge Penalty as Bayesian Prior
  2.8 Relation between Tuning Parameters λ and t

3 Penalized Score Equations
  3.1 Introduction
  3.2 Generalized Linear Models and Likelihood
  3.3 Quasi-Likelihood and Quasi-Score Functions
  3.4 Penalized Score Equations
  3.5 Algorithms for Penalized Score Equations

4 Penalized GEE
  4.1 Introduction
  4.2 Generalized Estimating Equations
  4.3 Penalized GEE

5 Selection of Shrinkage Parameters
  5.1 Introduction
  5.2 Cross-Validation and Generalized Cross-Validation
  5.3 Selection of Parameters λ and γ via the GCV
  5.4 Quasi-GCV for Penalized GEE

6 Simulation Studies
  6.1 A Linear Regression Model
  6.2 A Logistic Regression Model
  6.3 A Generalized Estimating Equations Model
  6.4 A Complicated Linear Regression Model

7 Applications: Analyses of Health Data
  7.1 Analysis of Prostate Cancer Data
  7.2 Analysis of Kyphosis Data
  7.3 Analysis of Environmental Health Data

8 Discussions and Future Studies
  8.1 Discussion
  8.2 Future Studies

References

A A FORTRAN Subroutine of the Shooting Method for the Lasso

B Mathematical Proof
List of Figures

1.1 Constrained areas of Bridge regressions
2.1 Solution of equation (2.1)
2.2 Algorithms for the Bridge estimators
2.3 Shrinkage effect of Bridge regressions for fixed λ > 0
2.4 Bridge penalty as a Bayesian prior with λ = 1
2.5 Bridge penalty as a Bayesian prior with λ = 0.5
2.6 Bridge penalty as a Bayesian prior with λ = 10
2.7 Relation between shrinkage parameters λ and t
5.1 Selection of parameters λ and γ via GCV
5.2 Selection of parameters λ and γ via quasi-GCV
6.1 Simulation with true β generated from the Bridge prior with γ = 1
6.2 Simulation with true β generated from the Bridge prior with γ = 1.5
6.3 Simulation with true β generated from the Bridge prior with γ = 2
6.4 Simulation with true β generated from the Bridge prior with γ = 3
6.5 Simulation with true β generated from the Bridge prior with γ = 4
7.1 Selection of parameters λ and γ for the prostate cancer data
7.2 Selection of parameters λ and γ for the kyphosis data
7.3 Comparison of prediction errors on test data by box plots
7.4 Selection of parameters λ and γ for the environmental health data
List of Tables

2.1 Bridge estimators and standard errors for orthonormal X
2.2 Bridge estimators and standard errors for non-orthonormal X
6.1 Model comparison for a linear regression model
6.2 Model comparison for a logistic regression model
6.3 Model comparison for a GEE model
6.4 Means and SEs of MSE and PSE for different γ
7.1 Estimates of the prostate cancer data
7.2 Comparison in model selection
7.3 Estimates of the kyphosis data
7.4 Comparison of prediction errors on test data over 100 random splits
7.5 Estimates of the environmental health data
Chapter 1
Introduction
1.1 Introduction
In many applied science or public health studies, the investigators are interested in relations between response variables and explanatory variables. For example, in a breast cancer study, it is of interest to know whether the probability of developing cancer in a population depends on some potential risk factors, such as a patient's diet, age, height and weight. Statistical analysis provides a scientific tool to investigate such relationships using data obtained from previous or current studies. The aim of statistical analysis is to identify the risk factors that contribute significantly to the presence or the occurrence of the event which is under investigation. Very often, the analysis is conducted through a statistical procedure called regression, which is based on probability theory and statistical modelling. Regression analysis provides information on the significance of the contribution of these risk factors to the event, and thus helps the investigators to make scientific decisions.

In some studies, certain explanatory variables present a linear relation, i.e. some variables depend linearly on some others. Such a phenomenon is called collinearity. Since the presence of collinearity among explanatory variables induces large variation and uncertainty in the regression models, the estimates of the model parameters have large variance, and prediction based on the models may perform very poorly. Therefore the models may not serve the needs of the investigators.

In this thesis, I investigate this collinearity problem and propose a method using a statistical technique: Bridge penalization. I also demonstrate through statistical simulations that this method works well in terms of estimation and prediction. Finally, I apply this method to several data sets from public health studies to achieve good statistical results.
1.2 Some Background of Shrinkage Models
Consider a linear regression problem

y = Xβ + ε,

where y is an n-vector of random responses, X is an n × p design matrix, β is a p-vector of regression parameters and ε is an n-vector of independently identically distributed (iid) random errors. Ordinary least-squares regression (OLS), or least-squares regression as it is known in the literature, minimizes the residual sum of squares

RSS(β) = (y − Xβ)^T (y − Xβ),

and yields an unbiased estimator β̂_ols = (X^T X)^{-1} X^T y if the matrix X is of full rank, with expected value

E(β̂_ols) = β

and variance

Var(β̂_ols) = (X^T X)^{-1} σ²,

where σ² is the common variance of each individual random error term ε_i. Since both the estimator β̂_ols and its variance are of simple form, computation of these quantities is very simple and can easily be done even without the help of computers if the number of regressors is small. In addition, the variance of β̂_ols is minimal among all linear unbiased estimators, i.e. for any linear unbiased estimator β̃, where β̃ = Ay and Eβ̃ = β, one has

Var(β̂_ols) ≤ Var(β̃).

Hence, β̂_ols is usually referred to as the Best Linear Unbiased Estimator (BLUE) under the Gauss-Markov conditions; see Sen and Srivastava (1990).
However, despite its simplicity, unbiasedness and minimum variance, β̂_ols is not always satisfactory, for the following reasons.

1) The estimator is not unique if the regression matrix X is less than full rank. In fact, there are infinitely many estimates which attain the minimum residual sum of squares.

2) The variance Var(β̂_ols) = (X^T X)^{-1} σ² becomes large if the regression matrix X is close to collinear. Hence the mean squared error (MSE) is large, since

MSE(β̂_ols) = E[(β̂_ols − β)^T (β̂_ols − β)] = tr Var(β̂_ols)

and

tr Var(β̂_ols) = σ² tr (X^T X)^{-1}.
For example, consider a simple linear regression problem of two regressors,

y = β₁x₁ + β₂x₂ + ε,

where ε is of normal distribution N(0, σ²). To illustrate the effect of collinearity between the regressors, we standardize the regression vectors x₁ and x₂ by setting x̄_j = 0 and ||x_j|| = 1 for j = 1, 2, and set σ² = 1 for simplicity. Then the sample correlation coefficient is r = x₁^T x₂, and

X^T X = [ 1  r ; r  1 ].

The variance-covariance matrix of the OLS estimator β̂_ols = (β̂₁, β̂₂)^T is thus

Var(β̂_ols) = (X^T X)^{-1} = (1/(1 − r²)) [ 1  −r ; −r  1 ],

and Var(β̂_i) = 1/(1 − r²) for i = 1, 2. If the regressors x₁ and x₂ are uncorrelated, i.e. r = 0, then Var(β̂_i) = 1 for i = 1, 2. However, if x₁ and x₂ are correlated, then Var(β̂_i) can be very large; for example, Var(β̂_i) = 10.26 for r = 0.95.
Table: increase of variance with correlation coefficient.
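The variance inflation Var(β̂_j) = 1/(1 − r²) quoted above is easy to tabulate. The short sketch below (the function name is my own; σ² = 1 as in the standardized example) reproduces the growth with r:

```python
# Variance of each OLS coordinate for two standardized, correlated
# regressors: Var(beta_j) = 1 / (1 - r^2), with sigma^2 = 1.
def ols_variance(r):
    return 1.0 / (1.0 - r ** 2)

for r in (0.0, 0.5, 0.9, 0.95, 0.99):
    print(f"r = {r:4.2f}   Var(beta_j) = {ols_variance(r):8.2f}")
```

At r = 0.95 this gives 10.26, matching the value cited in the text.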
Since mean squared error reflects the overall accuracy of estimation, and large MSE means poor estimation, predictions based on β̂_ols may perform very poorly if collinearity is present in X. For example, consider the prediction squared error (PSE) of a two-regressor case. The expectation of the prediction error at an arbitrary point (x*^T, y*) by the OLS estimator β̂ is

E(PSE) = E(y* − x*^T β̂)² = σ² + x*^T Var(β̂_ols) x*,

where ε* is the random error at the prediction point, and σ² is the variance of the random error. Then the PSE depends on the location of the vector x* in the feature space. Take a special case with high collinearity, X^T X = diag(1, 0.001); then E(PSE) = σ²[1 + x₁*² + 1000 x₂*²]. If |x₂*| ≪ max{1, |x₁*|}, then the prediction error is moderate. Otherwise, it is inflated largely by the factor of x₂*² due to high collinearity. Detailed discussions of multi-collinearity can be found in Seber (1977), Sen and Srivastava (1990), Hocking (1996), Lawson and Hanson (1974), Hoerl and Kennard (1970a, 1970b) and Frank and Friedman (1993).
To achieve better prediction, Hoerl and Kennard (1970a, 1970b) introduced ridge regression

min_β (y − Xβ)^T (y − Xβ), subject to Σ_j β_j² ≤ t.

While ridge regression shrinks the OLS estimator β̂_ols towards 0 and yields a biased estimator

β̂_rdg = (X^T X + λI)^{-1} X^T y,

where λ = λ(t) is a function of t, its variance is smaller than that of β̂_ols for λ > 0:

Var(β̂_rdg) = (X^T X + λI)^{-1} X^T X (X^T X + λI)^{-1} σ² < Var(β̂_ols).

Therefore better estimation can be achieved on the average in terms of MSE with a little sacrifice of bias. This is well known as the bias-variance trade-off.
To illustrate the shrinkage effect of ridge regression, consider the linear regression problem of two regressors in the above example. The variance of the ridge estimator is

Var(β̂_rdg) = (X^T X + λI)^{-1} X^T X (X^T X + λI)^{-1} σ²,

and the bias is

bias(β̂_rdg) = E(β̂_rdg) − β = −λ(X^T X + λI)^{-1} β.

Table: the variance, bias² and MSE of the ridge estimator; bias² and MSE are computed with true β = (1, 1)^T.
If x₁ and x₂ are uncorrelated, i.e. r = 0, then Var(β̂_rdg,j) = 1/(1 + λ)² = 0.25 for λ = 1, smaller than Var(β̂_ols,j) = 1 for λ = 0. If x₁ and x₂ are correlated, for example r = 0.9, then Var(β̂_rdg,j) = 0.15 for λ = 1, much smaller than Var(β̂_ols,j) = 5.26 for λ = 0. However, the squared bias increases with λ, as shown in the above table. The squared bias is computed with bias(β̂_j) = −λβ/(1 + λ + r) for the special case of β₁ = β₂ = β = 1.

It can be observed from the above table that the variance of the ridge estimator decreases with λ, while the squared bias increases. The MSE presents a trade-off between the bias and the variance: it decreases to a small value from λ = 0 to λ = 1, and increases from λ = 1 to λ = 5 or 10.
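The ridge quantities above are easy to verify numerically from the closed forms. The sketch below (the helper is my own, assuming the standardized two-regressor setup with σ² = 1 and true β = (1, 1)^T):

```python
# Per-coordinate variance, squared bias and MSE of the ridge estimator
# for X'X = [[1, r], [r, 1]], using
#   Var  = (X'X + lam I)^-1 X'X (X'X + lam I)^-1   (sigma^2 = 1)
#   bias = -lam (X'X + lam I)^-1 beta.
import numpy as np

def ridge_var_bias_mse(r, lam, beta=(1.0, 1.0)):
    xtx = np.array([[1.0, r], [r, 1.0]])
    inv = np.linalg.inv(xtx + lam * np.eye(2))
    var = inv @ xtx @ inv
    bias = -lam * inv @ np.array(beta)
    mse = np.trace(var) + bias @ bias
    return var[0, 0], bias[0] ** 2, mse

print(ridge_var_bias_mse(0.9, 1.0))  # per-coordinate variance ~ 0.15
```

At r = 0.9 the per-coordinate variance drops from about 5.26 at λ = 0 to about 0.15 at λ = 1, as stated in the text.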
Frank and Friedman (1993) introduced Bridge regression

min_β (y − Xβ)^T (y − Xβ), subject to Σ_j |β_j|^γ ≤ t.

It includes ridge regression with γ = 2 and subset selection with γ = 0 as special cases. For other values of γ > 0, it constrains the estimates to different regions around the origin in the parameter space, as shown in Figure 1.1 for t = 1. While Frank and Friedman did not solve for the estimator of Bridge regression for any given γ > 0, they pointed out that optimizing the value of γ was desirable.
Tibshirani (1996) introduced the Least Absolute Shrinkage and Selection Operator (Lasso)

min_β (y − Xβ)^T (y − Xβ), subject to Σ_j |β_j| ≤ t,

as a special case of the Bridge with γ = 1. The special shape of the constrained region of the Lasso, as pointed out in Tibshirani (1996), allows the Lasso estimator to attain a corner of the region defined by Σ_j |β_j| ≤ t and thus makes β̂_j = 0 for some j. Therefore the Lasso may shrink the OLS estimator β̂_ols towards 0 and potentially may set some β̂_j = 0. This can be seen clearly from the following formula of the Lasso estimator for orthonormal X,

β̂_j = sign(β̂_ols,j) (|β̂_ols,j| − C(t))₊,

where C(t) is a positive constant depending on t but independent of j. Clearly for orthonormal X, the Lasso shrinks large coordinates of the OLS estimator by a constant and small ones to 0. Thus it performs as a variable selection operator.
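For orthonormal X, the soft-thresholding rule above takes only a few lines; in this sketch the argument c plays the role of the constant C(t):

```python
# Soft-thresholding form of the Lasso for orthonormal X: each OLS
# coordinate is shrunk toward 0 by c, and coordinates with |b| <= c
# are set exactly to 0 (variable selection).
def soft_threshold(b_ols, c):
    sign = 1.0 if b_ols > 0 else -1.0
    return sign * max(abs(b_ols) - c, 0.0)

print([soft_threshold(b, 0.5) for b in (2.0, 0.3, -1.2)])
# large coordinates shrink by 0.5; the middle one is set exactly to 0
```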
Tibshirani (1996) used a combined quadratic programming method to solve for the
Figure 1.1: Constrained areas of Bridge regressions in two-dimensional parameter space (panels: γ > 2, γ = 2, γ = 1, γ < 1).
Lasso estimator by observing that the Lasso constraint Σ_j |β_j| ≤ t is equivalent to combining the 2^p linear constraints Σ_j w_j β_j ≤ t with w_j = ±1. The parameter t was optimized via generalized cross-validation (GCV). It was also shown through some intensive simulation studies that the Lasso not only shrinks the OLS estimator towards 0 and potentially achieves better estimation and prediction, but also selects variables in a continuous way. Such a variable selection process of the Lasso is more stable than the discrete process of adding one variable to or deleting one from the model.

Since the Bridge shrinks the OLS estimator towards 0, it is referred to in general as a shrinkage model, and the parameter t as the shrinkage parameter.
1.3 Problems
Although both ridge regression and the Lasso perform much better than OLS regession
when collinearity is present in X as shown in last section. in Frank and Friedman ( 1993)
and Tibshirani ( 1996), the fact that the Lasso outperforms the ridge in soine cases ancl
the ridge outperforms the Lasso in sorne others (Tibshirani 1996) raises the questions:
What is the optimal value of y that performs the best? How to select the optimal value
of y? How to solve for the Bridge estimator for any fixed > O in general?
To answer these questions, we need to cievelop some techniques that will allow the
optimal value of 7 to be selected based on the data itseif rather than based on some
subjective decision, for example, selecting -y = 1 (the Lasso) or y = 2 (the ridge).
In this thesis, I attempt to answer these questions by considering Bridge regression as a whole family, which includes the ridge and the Lasso as special members. Specifically, I study

min_β (y − Xβ)^T (y − Xβ), subject to Σ_j |β_j|^γ ≤ t, with γ ≥ 1.
In Chapter 2, I study the structure of the Bridge estimators and develop algorithms to solve for the Bridge estimator for any fixed γ ≥ 1. Particularly, I develop a new algorithm for the Lasso to make the computation much simpler and easier. The variance of the Bridge estimator is derived. The shrinkage effect of Bridge regression is illustrated through a simple example of linear regression, and is examined theoretically for the orthonormal regression matrix case. The Bridge penalty function is also studied as a Bayesian prior. In Chapter 3, I review generalized linear models, likelihood functions and quasi-likelihood. I extend Bridge regression to generalized linear models. I further generalize penalization to be independent of joint likelihood functions by introducing the penalized score equations. Algorithms solving the penalized score equations are also developed. In Chapter 4, I review generalized estimating equations (GEE) in longitudinal studies, and apply penalization to the GEE via the penalized score equations. In Chapter 5, I review the cross-validation and generalized cross-validation (GCV) methods. The shrinkage parameter γ and the tuning parameter λ are selected via the GCV for generalized linear models. A quasi-GCV is derived to select γ and λ for the penalized GEE. In Chapter 6, I compare the Bridge model with some other shrinkage models (no shrinkage, the Lasso and the ridge) through simulation studies. In Chapter 7, I analyze several data sets from public health studies using the Bridge penalty model. Chapter 8 gives general discussions and plans of some future studies. Appendix A gives a FORTRAN subroutine to compute the Lasso estimator via the Shooting method, and Appendix B gives the mathematical proofs.
Chapter 2
Bridge Regressions
2.1 Introduction
In Chapter 1, I briefly introduced regressions and shrinkage models. particularlp Bridge
regessions. Although Bridge regression was proposed. its estimators have not been sti~d-
ied yet. As Frank and Friedman (1993) pointed out, it is desirable to study Iiow to select
the optimal value of ni to achieve the best results.
In this chapter, I stildy Bridge regression and its estimators. 1 propose an algorit hm.
the modified Newton-Raphson method, to solve for the Bridge estimator for any fixed
y > 1. I also propose a new algorithrn, the Shooting method, to solve for the Lasso
estimator. The variance of the Bridge estimator is obtained via the delta tnethod. Tlie
shrinkage effect is demonstrated through a simple example and is examineci theoretically
for the orthonormal regression rnatrix case. The Bridge penalty function is also studied
as a Bayesian prior.
2.2 Structure of the Bridge Estimators
To solve Bridge regression for any given γ ≥ 1, we consider the following two problems:

(P1)  min_β (y − Xβ)^T (y − Xβ), subject to Σ_j |β_j|^γ ≤ t;

(P2)  min_β (y − Xβ)^T (y − Xβ) + λ Σ_j |β_j|^γ.

Problems (P1) and (P2) are equivalent, i.e. for a given λ ≥ 0 there exists a t ≥ 0 such that the two problems share the same solution, and vice versa. We refer to (P2) as a penalized regression with penalty λ Σ_j |β_j|^γ, and to λ as the tuning parameter.

Consider problem (P2). Let G(β, X, y, λ, γ) = RSS(β) + λ Σ_j |β_j|^γ. Then G → +∞ as the Euclidean norm ||β|| → +∞. Thus the function G can be minimized, i.e. there exists a β̂ such that

β̂ = arg min_β G(β, X, y, λ, γ).

Since the function |β_j|^γ is not differentiable at β_j = 0 in general, one can only take partial derivatives of G with respect to β_j at β_j ≠ 0, j = 1, …, p. Denote

S_j(β, X, y) = ∂RSS/∂β_j = 2x_j^T Xβ − 2x_j^T y

and

d(β_j, λ, γ) = λγ|β_j|^{γ−1} sign(β_j).

Setting ∂G/∂β_j = 0 leads to

(P3)  S_j(β, X, y) = −d(β_j, λ, γ), j = 1, …, p.

Problem (P2) can then be solved through (P3) as shown in the next section.
To illustrate how to solve problem (P3), we consider a simple example of linear regression with two regressors. The residual sum of squares is RSS = Σ_i (y_i − β₁x_{i1} − β₂x_{i2})². Taking partial derivatives of the function G with respect to β_j leads to the equations as in (P3),

2 Σ_i x_{ij}(β₁x_{i1} + β₂x_{i2} − y_i) = −λγ|β_j|^{γ−1} sign(β_j), j = 1, 2.

If the regressors x₁ and x₂ are uncorrelated, i.e. Σ_i x_{i1}x_{i2} = 0, each individual equation can be solved independently. If x₁ and x₂ are correlated, i.e. Σ_i x_{i1}x_{i2} ≠ 0, the equations can be solved iteratively as shown in the next section.
To develop an algorithm for solving (P3) in general, one needs to know the structure of the solutions. We study the structure through (P3) in this section and provide algorithms in the next section. We have the following theorems on (P3) for a more general function S_j.

Let β be a vector in the p-dimensional parameter space B, X an n × p matrix, and y a vector in an n-dimensional sample space Rⁿ. For fixed X, y, λ ≥ 0 and γ ≥ 1, define the following real functions:

S_j(·, X, y): B → R, β ↦ S_j(β, X, y), j = 1, …, p,

F(·, X, y): B → R, β ↦ F(β, X, y), non-negative, and

d(β_j, λ, γ) = λγ|β_j|^{γ−1} sign(β_j).

Denote S = (S₁, …, S_p)^T, and define the problem

(P2′)  Given γ ≥ 1 and λ ≥ 0, min_β ( F(β, X, y) + λ Σ_j |β_j|^γ ).

We have the following results for problem (P3).
Theorem 1. Given γ > 1 and λ > 0, if the function S defined above is continuously differentiable with respect to β and the Jacobian matrix (∂S/∂β) is positive semi-definite, then

1. (P3) has a unique solution β̂(λ, γ), which is continuous in (λ, γ);

2. The limit of the unique solution β̂(λ, γ) exists as γ → 1⁺. Denote the limit solution by β̂(λ, 1⁺).

Theorem 2. Given γ > 1 and λ > 0, if there exists a non-negative convex function F(β) with ∂F/∂β = S, and the Jacobian matrix (∂S/∂β) is positive definite, then

1. The unique solution of (P3) is equal to the unique solution of (P2′);

2. The limit of the unique solution of (P3), β̂(λ, 1⁺), is equal to the unique solution of (P2′) with γ = 1;

3. Particularly, if there exists a joint likelihood function L(β) and F = −2 log(L(β)), the unique solution of (P3) is equal to the Bridge estimator of (P2′), and the limit of the unique solution of (P3) is equal to the Lasso estimator of (P2′). For the Gaussian distribution, the solution of (P3) is equal to the Bridge estimator of (P2), and the limit of the solution of (P3) is equal to the Lasso estimator of (P2).

While the existence and uniqueness of the estimator of (P2′) is guaranteed by the convexity of the function F(β), which can be inferred from the Jacobian condition on S, Theorems 1 and 2 provide theoretical support for (P3), which yields a rather general approach to this penalization problem, as we shall see in Chapter 4. Particularly for the Gaussian distribution, F(β) = RSS(β), problem (P2′) simplifies to (P2), and its unique solution can be solved through (P3) as shown in the next section.
2.3 Algorithms for the Bridge and Lasso Estimators
To solve Bridge regression for any given γ ≥ 1 and λ > 0, we start with problem (P3). Although we only demonstrate our method below for Gaussian response variables, our algorithm applies to many other types of responses via the iteratively reweighted least-squares (IRLS) procedure.

Denote β by (β_j, β^{−j T})^T, where β^{−j} is a (p − 1)-vector consisting of the β_k's other than β_j, and let X_{−j} denote X with its j-th column removed. We study the j-th equation of (P3):

2x_j^T x_j β_j + 2x_j^T X_{−j} β^{−j} − 2x_j^T y = −λγ|β_j|^{γ−1} sign(β_j).   (2.1)

The left hand side of (2.1), LHS = 2x_j^T x_j β_j + 2x_j^T X_{−j} β^{−j} − 2x_j^T y, is, for fixed β^{−j}, a linear function of β_j with positive slope 2x_j^T x_j. The right hand side of (2.1), RHS = −λγ|β_j|^{γ−1} sign(β_j), is nonlinear in β_j. The function RHS has a different shape for different values of γ, as shown in Figure 2.1. It is continuous, differentiable and monotonically decreasing for γ > 1, except non-differentiable at β_j = 0 for 1 < γ < 2; it is a heavy-side function with a jump of height 2λ at β_j = 0 for γ = 1. Therefore, equation (2.1) has a unique solution for γ > 1, and a unique solution or no solution for γ = 1.

To compute the Bridge estimator for γ > 1, the Newton-Raphson method can be used. However, since the function d is not differentiable at β_j = 0 for γ < 2, modification is needed to achieve convergence to the solution. We develop the following modified Newton-Raphson method for γ > 1 in general by solving iteratively for the unique solution of the j-th equation of (P3).
Modified Newton-Raphson (M-N-R) Algorithm for the Bridge, γ > 1

(1). Start with β̂₀ = β̂_ols = (β̂₁, …, β̂_p)^T.

(2). At step m, for each j = 1, …, p, let S₀ = S_j(0, β̂^{−j}, X, y). Set β̂_j = 0 if S₀ = 0. Otherwise, if γ ≥ 2, apply the Newton-Raphson method to solve for the unique solution β̂_j of equation (2.1); if γ < 2, modify the function −d by changing one part to its tangent line at a certain point between the origin and the solution (the intersection of −d and S_j), as shown in
Figure 2.1: The functions in equation (2.1) for γ > 2, γ = 2, 1 < γ < 2 and γ = 1. Solid is the function −d, dashed is S_j. The vertical axis in each bottom panel has a scale of λ.
Figure 2.2 (upper left panel). Such a point can easily be found using the bisection method. Then the Newton-Raphson method is applied to equation (2.1) with the modified function −d to solve for the unique solution β̂_j. Form a new estimator β̂_m = (β̂₁, …, β̂_p)^T after updating all β̂_j.

(3). Repeat (2) until β̂_m converges.
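Because the LHS of (2.1) is increasing in β_j while the RHS is decreasing for γ > 1, the root of each coordinate equation can also be bracketed and found by bisection. The sketch below is a simplified stand-in for the modified Newton-Raphson step above (slower, but it needs no tangent-line modification); names and the bracketing interval are my own choices:

```python
# Solve the j-th Bridge equation (2.1) for one coordinate, gamma > 1:
#   2*xx*b + s0 = -lam*gamma*|b|^(gamma-1)*sign(b),
# where xx = x_j'x_j and s0 = S_j(0, beta^{-j}, X, y).
# f(b) = LHS - RHS is strictly increasing, so bisection finds the root.
def bridge_coordinate(s0, xx, lam, gamma, lo=-1e6, hi=1e6, tol=1e-10):
    def sign(b):
        return 1.0 if b > 0 else (-1.0 if b < 0 else 0.0)

    def f(b):
        return 2.0 * xx * b + s0 + lam * gamma * abs(b) ** (gamma - 1) * sign(b)

    if s0 == 0.0:          # as in step (2) of the M-N-R algorithm
        return 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:   # root lies above mid
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For γ = 2 this reduces to the ridge-type solution −s0/(2·xx + 2λ), which gives a quick sanity check.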
Remarks

1. To initialize β̂₀, the OLS estimator β̂_ols is always available. Even when p > n and X is less than full rank, any general estimate can be used for the initialization of β̂₀.

2. From the modified Newton-Raphson algorithm, one can see that if the Bridge estimator satisfies β̂_j = 0 for some j, then β̂^{−j} must satisfy S_j(0, β̂^{−j}, X, y) = 0. This implies that the (p − 1)-dimensional vector β̂^{−j} lies in a (p − 2)-dimensional manifold, which has zero measure. Therefore, one can conclude that β̂_j is almost surely non-zero.

To compute the Lasso estimator for any given λ > 0, one can use Theorem 1, which implies that the limit of the Bridge estimator, lim_{γ→1⁺} β̂(λ, γ), is equal to the Lasso estimator. However, taking the limit numerically is not recommended in practice, for the following reasons. From the computational point of view, it is obviously time-consuming, since the M-N-R algorithm has to be run many times, one for each single γ_i > 1 in a series {γ_i} with γ_i → 1⁺. From the theoretical point of view, it is misleading. Assume a sequence of γ_i tends to 1⁺, and the corresponding estimate sequence {β̂_i} of one coordinate is (0.1, …, 10⁻ᵏ, …) for increasing k. Numerically one cannot determine whether the limit of β̂_i is equal to 0. However, taking the limit theoretically leads to a new algorithm for the Lasso, which is simple, straightforward and fast, as shown below.
Figure 2.2: The algorithms (panels: M-N-R, Shooting). Solid is the function −d, dashed is S_j. The vertical axis in each bottom panel and the upper right panel has a scale of λ. Upper left: the dotted line represents the modification of −d to its tangent; upper right: S₀ > λ, the dotted line indicates the solution; lower left: |S₀| ≤ λ; lower right: S₀ < −λ, the dotted line indicates the solution.
We introduce a new algorithm for the Lasso: the Shooting method.

(1). p = 1. (P3) reduces to a single equation,

2x^T x β − 2x^T y = −λ sign(β).   (2.2)

Start with an initial estimate β̂₀, the OLS estimator. Shoot in the direction of slope 2x^T x from the point (β̂₀, 0) on the horizontal axis, as shown in Figure 2.2. If a point on the ceiling (−d = λ) is hit, as shown in the upper right panel, or if a point on the floor (−d = −λ) is hit, as shown in the lower right panel, then equation (2.2) has a unique solution, which has a simple closed form and is equal to the Lasso estimator. If no point is hit, i.e. the shot passes through the window as shown in the lower left panel, equation (2.2) has no solution. One can take the limit of the Bridge estimator theoretically; it is easy to prove that lim_{γ→1⁺} β̂(λ, γ) = 0. Therefore, set β̂ = 0 for the Lasso estimator.

(2). p > 1. Start with an initial value β̂₀, the OLS estimator. At step m, compute β̂_m by updating β̂_j for fixed β̂^{−j} using (1), j = 1, …, p. Iterate until β̂_m converges. We summarize the method as follows.
Shooting Algorithm for the Lasso

(1). Start with β̂₀ = β̂_ols = (β̂₁, …, β̂_p)^T.

(2). At step m, for each j = 1, …, p, let S₀ = S_j(0, β̂^{−j}, X, y) and set

β̂_j = (λ − S₀) / (2x_j^T x_j)   if S₀ > λ,
β̂_j = (−λ − S₀) / (2x_j^T x_j)  if S₀ < −λ,
β̂_j = 0                         if |S₀| ≤ λ,

where x_j is the j-th column vector of X. Set β̂_m = (β̂₁, …, β̂_p)^T after updating all β̂_j.

(3). Repeat (2) until β̂_m converges.
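The coordinate-wise updates above translate directly into code. Below is a minimal Python sketch of the Shooting method (the function name and NumPy setup are my own; the thesis's reference implementation is the FORTRAN subroutine of Appendix A):

```python
import numpy as np

def shooting_lasso(X, y, lam, n_iter=100):
    """Shooting updates for the Lasso objective
    (y - Xb)'(y - Xb) + lam * sum_j |b_j|, starting from OLS."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            xj = X[:, j]
            beta[j] = 0.0
            # S0 = S_j(0, beta^{-j}, X, y) = 2 xj'X beta - 2 xj'y
            s0 = 2.0 * xj @ (X @ beta) - 2.0 * xj @ y
            denom = 2.0 * xj @ xj
            if s0 > lam:
                beta[j] = (lam - s0) / denom
            elif s0 < -lam:
                beta[j] = (-lam - s0) / denom
            # else leave beta[j] = 0 (|S0| <= lam)
    return beta
```

For orthonormal X this reduces to soft-thresholding each OLS coordinate at λ/2, consistent with the closed form in Section 1.2.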
The convergence of the M-N-R algorithm and the Shooting algorithm is guaranteed by the following theorem.
Theorem 3. (Convergence of the Algorithms)
Given fixed λ > 0 and γ ≥ 1, β̂m in the modified Newton-Raphson algorithm (M-N-R) converges to the Bridge estimator of (P2); β̂m in the Shooting algorithm converges to the Lasso estimator of (P2).
Our experience tells us that both the M-N-R and the Shooting algorithms converge very fast, as can be seen from the mechanism of the convergence in the mathematical proof.
2.4 Variance of the Bridge Estimator
Since the Bridge estimator (γ > 1) is the unique solution of problem (P3) and is almost surely non-zero, its variance
can be derived as follows from (P3) using the delta method
where y0 is an arbitrary fixed point in the sample space. The variance estimate can be obtained by plugging in β̂ for β̂|y0 and replacing Var(y) with its estimate.
Denote F = (F1, . . . , Fp)ᵀ, where Fj = Sj(β, X, y) + d(βj, λ, γ). Hence F = 0 by (P3). For the Gaussian distribution, ∂F/∂y = −2Xᵀ and ∂F/∂β = 2XᵀX + 2D(β), where
By the Implicit Function Theorem, ∂β/∂y = −(∂F/∂β)⁻¹ (∂F/∂y). Therefore,
Two special cases are worth mentioning.
1. The OLS regression, i.e. λ = 0. The function D(β) becomes a zero matrix; thus
which is equal to Var(β̂ols), the variance of the OLS estimator.
2. The ridge regression, i.e. γ = 2. The function D(β) = λI, where I is the identity matrix, and
Var(β̂) = (XᵀX + λI)⁻¹ Xᵀ Var(y) X (XᵀX + λI)⁻¹,
which is equal to the variance of the ridge estimator.
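The derivation can be checked numerically. The sketch below is our own; it assumes the Jacobian ∂F/∂β = 2XᵀX + 2D(β) with D(β) = diag(λγ(γ − 1)|βj|^{γ−2}/2), which reduces to the two special cases just listed.

```python
import numpy as np

def bridge_variance(X, beta, lam, gamma, var_y):
    """Delta-method variance of the Bridge estimator (gamma > 1).

    F(beta, y) = 0 with dF/dbeta = 2 X'X + 2 D(beta) and dF/dy = -2 X',
    so dbeta/dy = (X'X + D)^{-1} X' and
    Var(beta) = (X'X + D)^{-1} X' Var(y) X (X'X + D)^{-1}.
    """
    # D(beta) = diag( lam * gamma * (gamma - 1) * |beta_j|^(gamma - 2) / 2 )
    D = np.diag(0.5 * lam * gamma * (gamma - 1.0) * np.abs(beta) ** (gamma - 2.0))
    A = np.linalg.inv(X.T @ X + D)
    return A @ X.T @ var_y @ X @ A
```

As a sanity check, λ = 0 returns the OLS variance (XᵀX)⁻¹XᵀVar(y)X(XᵀX)⁻¹ and γ = 2 returns the ridge sandwich with D = λI.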
Since the Lasso may set some β̂j = 0, the delta method does not apply. However, the bootstrap or the jackknife method (Shao and Tu 1995) can be used to compute the variance. A good variance estimator for the non-zero β̂j of the Lasso estimator can be found in Tibshirani (1996).
2.5 Illustration of the Shrinkage Effect
Sections 2.2 and 2.3 give the estimator and the algorithms of Bridge regression. Section
2.4 gives the variance of the Bridge estimator. In this section, we demonstrate how to
solve for the Bridge (Lasso) estimator, and illustrate the shrinkage effect of Bridge regression through simple examples.
An example with an orthonormal matrix X.
We consider a simple linear regression
with 40 observations, where the random error term ε has a normal distribution N(0, σ²). To make the regression matrix X orthonormal, we standardize the column vector xj of X by setting Σi xij = 0, j = 1, . . . , p, and
For simplicity, we set β0 = 0 and σ² = 1. Forty observations of the response Y are generated from the model with true values β1 = 1, β2 = −2 and β3 = 5. Since shrinkage has no impact on the intercept, the intercept is removed by centering Σi yi = 0. For a fixed pair of λ > 0 and γ ≥ 1, each individual equation of (P3)
reduces to
for j = 1, . . . , p. Then the solution is computed via the modified Newton-Raphson method for γ > 1 or the Shooting method for γ = 1. The standard errors are computed following the variance formula (2.3) in Section 2.4 for γ > 1. The bootstrapping method (Efron and Tibshirani 1993) is used to compute the standard errors for γ = 1.
The estimates and standard errors for different shrinkage functions are shown in Table 2.1. The standard errors of the Lasso estimator (γ = 1) are computed via 10000 bootstrap samples. It is shown that, in general, the parameter estimate and its standard error shrink monotonically with increasing λ for fixed γ. However, for the Lasso (γ = 1), the standard error of one of the coefficient estimates does not show a monotonic decreasing trend with λ: it equals 0.163 at λ = 0, 0.157 at λ = 10, but 0.354 at λ = 100. This is because the Lasso standard errors for λ > 0 are computed with a semi-parametric bootstrap method.
An example with a non-orthogonal matrix X.
We consider a similar linear regression
with 40 observations, where the random error term ε has a normal distribution N(0, σ²). The regression matrix X is not orthonormal and has the correlation coefficient matrix
We standardize the column vector xj of X by setting Σi xij = 0 and Σi xij² = 1. For simplicity, we set β0 = 0 and σ² = 1. Forty observations of the response Y are generated from the model with true values β1 = 2, β2 = 3 and β3 = −1. Since shrinkage has no impact
Table 2.1: The Bridge estimators and standard errors for orthonormal X
Table 2.2: The Bridge estimators and standard errors for non-orthogonal X
on the intercept, the intercept is removed by centering Σi yi = 0. For each pair of λ > 0 and γ ≥ 1, each individual equation of (P3) is
for j = 1, . . . , p. Then the solution is computed iteratively by the modified Newton-Raphson method for γ > 1 or the Shooting method for γ = 1. The standard errors are computed following the variance formula (2.3) in Section 2.4 for γ > 1. The bootstrapping method (Efron and Tibshirani 1993) is used to compute the standard errors for γ = 1.
The estimates and standard errors for different shrinkage functions are shown in Table 2.2. The standard errors of the Lasso estimator (γ = 1) are computed via 10000 bootstrap samples. It can be observed that the monotonicity of the parameter estimate and its standard error does not hold for this case in general, as can be seen from the estimate β̂3 and its standard error.
2.6 Bridge Regression for Orthonormal Matrix
In the last section, an example of Bridge regression for an orthonormal regression matrix X was given to illustrate the shrinkage effect. In this section, we study Bridge regression for an orthonormal regression matrix theoretically and show the different shrinkage effects for different values of γ.
For an orthonormal matrix X = (xij),
It can be seen that problem (P3) simplifies to p independent equations
for j = 1, . . . , p. The solution is then computed via the modified Newton-Raphson method for γ > 1 or via the Shooting method for γ = 1. To study the shrinkage effect of different values of γ, we compare the Bridge estimator, the solution of each single equation of (2.4), with the OLS estimator. Without causing any confusion, we omit the subscript j of βj and xij for simplicity.
Notice that equation (2.4) can be written as
The first term on the right-hand side is equal to the OLS estimator; the second term is due to the shrinkage and thus reflects the shrinkage effect. Therefore
To show the shrinkage effect of Bridge regression, we plot the absolute value of the Bridge estimator β̂bridge and compare it with the OLS estimator, whose absolute value is plotted on the diagonal as shown in Figure 2.3. It shows clearly that the Lasso (γ = 1) shrinks small OLS estimates to zero, and large ones by a constant; ridge regression (γ = 2) shrinks the OLS estimates proportionally; Bridge regression (1 < γ < 2) shrinks small OLS estimates at a large rate and large ones at a small rate; Bridge regression (γ > 2) shrinks small OLS estimates at a small rate and large ones at a large rate. In summary, Bridge regression with a large value of γ (γ ≥ 2) tends to retain small parameters, while a small value of γ (γ < 2) tends to shrink small parameters to zero.
Figure 2.3: Shrinkage effect of Bridge regressions for fixed λ > 0 (panels: γ = 1; 1 < γ < 2; γ = 2; γ > 2; horizontal axis: β̂ols). Solid - the Bridge estimator; Dashed - the OLS estimator.
Therefore, it can be inferred that if the true model includes many small but non-zero regression parameters, the Lasso will perform poorly while the Bridge with a large γ value will perform well. If the true model includes many zero parameters, the Lasso will perform well while the Bridge with a large γ value will perform poorly. Tibshirani (1996) obtained similar results by comparing the Lasso with the ridge through intensive simulation studies.
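The shrinkage patterns described in this section can be reproduced by solving the single orthonormal-case equation 2(β − c) + λγ|β|^{γ−1} sign(β) = 0, where c is the OLS coordinate. The sketch below is ours; bisection is one convenient choice since the left-hand side is monotone in β, and the γ = 1 branch is the familiar soft-threshold rule.

```python
import numpy as np

def bridge_orthonormal(c, lam, gamma, tol=1e-10):
    """Solve 2*(b - c) + lam*gamma*|b|^(gamma-1)*sign(b) = 0 for one coordinate.

    c is the OLS coordinate; the solution has the same sign as c and
    |b| <= |c|, so bisection on [0, |c|] suffices for gamma > 1.
    """
    if gamma == 1.0:                      # Lasso: closed-form soft threshold
        return np.sign(c) * max(abs(c) - lam / 2.0, 0.0)
    f = lambda b: 2.0 * (b - abs(c)) + lam * gamma * b ** (gamma - 1.0)
    lo, hi = 0.0, abs(c)
    while hi - lo > tol:                  # f is increasing in b, so bisect
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return np.sign(c) * 0.5 * (lo + hi)
```

For γ = 2 this gives the proportional ridge shrinkage c/(1 + λ); for γ > 2 small OLS coordinates are barely shrunk, matching Figure 2.3.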
2.7 Bridge Penalty as Bayesian Prior
In this section, we study the Bridge penalty function Σ |βj|^γ as a Bayesian prior distribution of the parameter β = (β1, . . . , βp)ᵀ.
From the Bayesian point of view, Bridge regression
min_β [RSS + λ Σ |βj|^γ]
can be regarded as maximizing the log posterior distribution of
where C is a constant. Thus the Bridge penalty λ Σ |βj|^γ can be regarded as the logarithm of the prior distribution
of the parameter β = (β1, . . . , βp)ᵀ, where C0 > 0 is a normalization constant. Since the prior is a summation, the parameters β1, . . . , βp are mutually independent and identically distributed. We thus omit the subscript j and study the prior C exp(−λ|β|^γ/2) of β only.
By simple algebra
where Γ(·) is the gamma function. Thus the probability density function of β is
where λ^{−1/γ} controls the window size of the density. In particular, when γ = 2, β has a Gaussian distribution. Therefore, the posterior distribution of (β|Y) is also Gaussian if Y has a Gaussian distribution. This is a very special property of the ridge estimator for linear regressions.
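The normalizing constant can be checked numerically. The density below follows from the substitution u = λ|β|^γ/2; the constant C = γλ^{1/γ}/(2^{1+1/γ}Γ(1/γ)) is our own worked-out form and should be read as a sketch.

```python
import math
import numpy as np

def bridge_prior_pdf(beta, lam, gamma):
    """Density proportional to exp(-lam * |beta|^gamma / 2).

    Normalizing constant from the substitution u = lam*|b|^gamma/2:
    C = gamma * lam**(1/gamma) / (2**(1 + 1/gamma) * Gamma(1/gamma)).
    For gamma = 2 this is exactly the N(0, 1/lam) density.
    """
    C = gamma * lam ** (1.0 / gamma) / (2.0 ** (1.0 + 1.0 / gamma) * math.gamma(1.0 / gamma))
    return C * np.exp(-lam * np.abs(beta) ** gamma / 2.0)
```

Integrating the density over a wide grid returns total mass 1 for any λ > 0 and γ ≥ 1, and at γ = 2 the value at β = 0 matches the Gaussian density sqrt(λ/(2π)).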
To compare the penalty functions for different values of λ and γ, we plot the density function fλ,γ(β) as shown in Figures 2.4 (λ = 1), 2.5 (λ = 0.5) and 2.6 (λ = 10). It can be observed that for fixed small values of λ, as in Figures 2.4 and 2.5, small values of γ put much mass on the tails, so the density has a large window size and tends to be flat, while large values of γ put much mass in the center around β = 0, so the density has a small window size and is less spread out. However, for fixed large values of λ, as shown in Figure 2.6, the window size does not change much, since λ^{−1/γ} is less than 1 and converges to 1 very fast as γ increases. While small values of γ put much mass very close to β = 0 with a peak at β = 0, large values of γ tend to distribute the mass evenly in the window of the density. When γ = 2, the density is a Gaussian density
It can thus be inferred that for a fixed small value of λ, the Bridge penalty with a small γ value favors models with large values of regression parameters, while the Bridge penalty with a large γ value favors models with small but non-zero values of regression parameters. For a fixed large value of λ, the Bridge penalty with a small γ value favors models with many zero regression parameters, while the Bridge penalty with a large γ value favors models with small but non-zero values of regression parameters. In particular, the Lasso (γ = 1) with
Figure 2.4: Bridge penalty as a Bayesian prior with λ = 1 (panels: γ = 1, 1.5, 2, 4).
Figure 2.5: Bridge penalty as a Bayesian prior with λ = 0.5 (panels: γ = 1, 1.5, 2, 4).
Figure 2.6: Bridge penalty as a Bayesian prior with λ = 10 (panels: γ = 1, 1.5, 2, 4).
large λ favors models with many zero parameters, and the Lasso with small λ favors models with large parameters. This result agrees with the conclusion for the orthonormal regression matrix in the last section.
2.8 Relation between Tuning Parameters X and t
In Section 2.1, we claimed that problems (P1) and (P2) are equivalent, i.e. for a given λ ≥ 0 there exists a t ≥ 0 such that (P1) and (P2) share the same solution, and vice versa. In this section, we study this relationship between λ and t for the special case of an orthonormal matrix X.
Notice that for fixed γ ≥ 1, the constrained area of (P1) is convex as shown in Figure 1.1. Hence, the Bridge estimator is achieved on the boundary of the constraint, which implies that t(λ) = Σ |β̂j(λ, γ)|^γ for fixed λ ≥ 0.
With an orthonormal matrix X, (P3) simplifies to p independent equations
as shown in Section 2.6. Since Σi xij yi = β̂ols,j, the j-th coordinate of the OLS estimator, the Bridge estimator β̂ = (β̂1, . . . , β̂p)ᵀ satisfies
Denoting cj = β̂ols,j and sj = β̂j/cj, the ratio of the Bridge estimate to the OLS estimate, one has
Hence
where the sj can be determined by solving the equations
derived from (2.5). Therefore, t(λ) can be computed by substituting the sj into the formula above. For the special case where cj = c, a constant independent of j, sj = s is also independent of j, and
t(λ) = (2p/(λγ)) c² s(1 − s).
Figure 2.7 shows the function t(λ) computed for a special case where cj = 1 with p = 2 for different γ values, γ = 1, 1.5, 2, 10. It demonstrates the one-to-one correspondence between t and λ. For this case, the threshold of t is t0 = Σ |β̂ols,j|^γ = p = 2, such that any t ≥ t0 yields β̂(t) = β̂ols. The threshold value of λ for the Lasso (β̂j = 0) is λ0 = 2, such that any λ ≥ λ0 yields β̂j(λ) = 0. It can be seen clearly from Figure 2.7 that t(λ) is a monotonically decreasing function for fixed γ ≥ 1. For γ > 1, λ has to be infinite in order to shrink all β̂j to 0. However, for γ = 1, any λ with λ ≥ λ0 = 2 shrinks all β̂j to 0; consequently, t(λ) = 0.
Figure 2.7: Relation between shrinkage parameters λ and t for an orthonormal matrix X.
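The correspondence in Figure 2.7 can be recomputed from the equations above. The sketch below is ours: it solves s^{γ−1} = 2c^{2−γ}(1 − s)/(λγ) for the shrinkage ratio s by bisection and evaluates t(λ) = (2p/(λγ)) c² s(1 − s) for the special case cj = c.

```python
def t_of_lambda(lam, gamma, p=2, c=1.0):
    """Map the penalty parameter lambda to the constraint size t
    (orthonormal X, all OLS coordinates equal to c)."""
    if lam == 0.0:
        return p * abs(c) ** gamma          # no shrinkage: t0 = sum |c_j|^gamma
    if gamma == 1.0 and lam >= 2.0 * abs(c):
        return 0.0                          # Lasso threshold: all coefficients zero
    # g(s) = s^(gamma-1) - 2 c^(2-gamma) (1-s) / (lam*gamma) is increasing in s
    g = lambda s: s ** (gamma - 1.0) - 2.0 * abs(c) ** (2.0 - gamma) * (1.0 - s) / (lam * gamma)
    lo, hi = 0.0, 1.0
    for _ in range(200):                    # bisection for the root in [0, 1]
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    s = 0.5 * (lo + hi)
    return (2.0 * p / (lam * gamma)) * c ** 2 * s * (1.0 - s)
```

For γ = 1 this reduces to the closed form t = 2 − λ below the threshold, and for every γ the map is monotonically decreasing in λ, as the figure shows.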
Chapter 3
Penalized Score Equations
3.1 Introduction
In Chapter 2, I obtained some theoretical results on the Bridge estimators through Theorems 1 and 2, and developed a general approach to solve for the Bridge estimators via (P3), i.e. the modified Newton-Raphson method for γ > 1 and the Shooting method for γ = 1. In this chapter, I proceed further in theory, introduce penalized score equations, and thus generalize the concept of penalization. The algorithms for the penalized score equations are given by the modified Newton-Raphson method and the Shooting method via the iteratively reweighted least-squares (IRLS) procedure. First, I review generalized linear models, likelihood functions and quasi-likelihood.
3.2 Generalized Linear Models and Likelihood
In many applied sciences, the response of interest is not a continuous variable ranging from negative to positive, like the temperature in Celsius. The response can be a proportion (fraction between 0 and 1), a number of subjects (positive integer), presence or absence of an event (dichotomous), or a degree of pain: none, mild, moderate, severe (polytomous), etc. Since the response is not continuous or the range of the response is not (−∞, +∞), a linear model, like
Y = β0 + β1x1 + · · · + βpxp + ε,
may not be appropriate.
Nelder and Wedderburn (1972) introduced generalized linear models, which are a natural extension of the linear regression model to a more general class of response variables that have a distribution in the exponential family
A generalized linear model (GLM) has three components:
1. The random component: the components of Y = (Y1, . . . , Yn)ᵀ are mutually independent and have an identical distribution in the exponential family with mean E(Y) = μ and variance V(μ).
2. The systematic component: the covariates x1, x2, . . . , xp produce a linear predictor
3. The link between the random and systematic components:
where g(·) is a monotone differentiable function called the link function. Hence a GLM can be written as
The most popular types of responses and their canonical link functions are the Gaussian response with identity link g(μ) = μ, the binomial response with logit link g(μ) = log(μ/(1 − μ)), and Poisson counts with log link g(μ) = log(μ), etc. Inference on the parameter β = (β1, . . . , βp)ᵀ is based on the likelihood function
and the maximum likelihood estimator (MLE) β̂mle, which is defined as
β̂mle = arg max_β L(β).
The MLE β̂mle can be computed via the Newton-Raphson method, the Fisher scoring method or the iteratively reweighted least-squares method described below.
By large sample theory, the MLE β̂mle is asymptotically consistent under regularity conditions (McCullagh and Nelder 1989),
where I(β) is the Fisher information matrix defined as
and l(β) = log L(β), the log-likelihood function.
To solve for the MLE β̂mle, we take the partial derivatives of the log-likelihood function l(β) with respect to β; β̂mle must satisfy the following equations
∂l/∂βj is called the score function of the likelihood l(β).
Newton-Raphson method
Taking the Taylor expansion of the score functions ∂l(β)/∂β and ignoring the quadratic term, one has
and
Then β̂mle can be computed by the iterative formula
The iteration continues until the convergence of the estimate β̂m or the deviance
where μmax is the mean of the response of the saturated model and is usually equal to y.
Fisher scoring method
Replacing the observed information matrix
in (3.2) of the Newton-Raphson method with the expected information matrix
where θ is assumed to be the true value of the parameter, one obtains the following Fisher scoring method to solve for the MLE β̂mle
where Eθ(·) depends on β only through θ. This simplifies the computation. The observed and expected Fisher information matrices are identical if Y follows a distribution in the exponential family with a canonical link function. Therefore, the Fisher scoring method coincides with the Newton-Raphson method (McCullagh and Nelder 1989, Hastie and Tibshirani 1990).
Iteratively reweighted least-squares (IRLS) method
Green (1984) introduced the following IRLS method to compute the MLE by taking the linear expansion of the link function (McCullagh and Nelder 1989)
An adjusted dependent variable z = η + (y − μ)/V(μ) is then defined for canonical links, where η is the linear predictor, and V(μ), a function of the mean μ, is the variance of Y. The MLE can be computed by regressing z on the matrix X with weights V(μ). The IRLS procedure can be outlined as follows.
The IRLS procedure
1. Start with an initial estimate β̂0;
2. Compute η = Xβ̂ and the weights V(μ) = diag(V1(μ1), . . . , Vn(μn));
3. Define the adjusted dependent variable z = η + [V(μ)]⁻¹(y − μ);
4. Regress z on X with weights V(μ) to obtain a new estimate β̂;
5. Iterate steps 2-4 until convergence is achieved.
An advantage of the IRLS procedure over the Newton-Raphson method or the Fisher scoring method is that it can be implemented through a weighted least-squares procedure with no extra effort, since the weighted least-squares procedure is a standard procedure and is easy to implement in most statistical software.
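As an illustration of the five steps, here is a minimal IRLS fit for a Poisson response with the canonical log link, for which V(μ) = μ. The function name and stopping rule are our own choices.

```python
import numpy as np

def irls_poisson(X, y, tol=1e-10, max_iter=100):
    """IRLS for Poisson regression with the canonical log link, V(mu) = mu.

    Follows the five steps in the text: adjusted dependent variable
    z = eta + (y - mu)/V(mu), then weighted least squares with weights V(mu).
    """
    n, p = X.shape
    beta = np.zeros(p)                       # step 1: initial estimate
    for _ in range(max_iter):
        eta = X @ beta                       # step 2: linear predictor
        mu = np.exp(eta)                     # canonical inverse link
        z = eta + (y - mu) / mu              # step 3: adjusted dependent variable
        W = mu                               # weights V(mu) = mu
        XtW = X.T * W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # step 4: weighted LS
        if np.max(np.abs(beta_new - beta)) < tol:      # step 5: iterate
            return beta_new
        beta = beta_new
    return beta
```

At convergence the score equations Xᵀ(y − μ) = 0 hold, which is a convenient check that the weighted least-squares iteration really computes the MLE.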
3.3 Quasi-Likelihood and Quasi-Score Functions
In the last section, we briefly reviewed generalized linear models and the distributions of the exponential family. Very often, once a probability function is specified, the likelihood function can be constructed and the MLE can be computed easily. However, in certain cases it is not necessary to specify the entire probability distribution, and thus the joint likelihood function, or it is not possible to specify the joint likelihood function.
Wedderburn (1974) introduced quasi-likelihood, which extends the generalized linear models in probability distribution. A quasi-likelihood requires that the variance of the random variable is a known function of the mean, V(μ), without specifying that the distribution is from the exponential family. First, a quasi-score of dimension 1 is defined as
U(μ, y) satisfies the three fundamental properties of the ordinary score functions of a likelihood function
E(U(μ, Y)) = 0,
and
Hence the integral
if it exists, has properties similar to those of a log-likelihood function.
We study the quasi-likelihood for the following two cases.
1. Independent observations
Since the observations are independent, the variance-covariance matrix is diagonal
where the functions V1, . . . , Vn are identical. The quasi-score in (3.5) is well defined, and so is the quasi-likelihood function in (3.6). The quasi-likelihood function Q(μ, y) plays the same role as the ordinary log-likelihood function in the generalized linear models. Inference can be made based on the quasi-likelihood estimator satisfying the quasi-score equations
Similar to the MLE of the generalized linear models, the estimator of the quasi-likelihood can be computed through the Fisher scoring method
This estimator is also asymptotically consistent, i.e.
under regularity conditions.
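A minimal quasi-score fit, assuming a log link and a user-supplied variance function V(μ): only the mean-variance relation is specified, with no full likelihood. The implementation details (Fisher scoring form, tolerance) are our own.

```python
import numpy as np

def quasi_score_fit(X, y, V, tol=1e-10, max_iter=200):
    """Fisher scoring for the quasi-score equations D' V(mu)^{-1} (y - mu) = 0
    with log link mu = exp(X beta), where V(mu) is a known variance function.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        D = X * mu[:, None]                  # D = d(mu)/d(beta) for the log link
        Vinv = 1.0 / V(mu)
        info = D.T @ (D * Vinv[:, None])     # D' V^{-1} D
        step = np.linalg.solve(info, D.T @ (Vinv * (y - mu)))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

For instance, V(μ) = μ reproduces the Poisson fit; the point estimate depends only on the assumed mean-variance relation, not on a full distributional specification.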
2. Dependent observations
Since the observations are dependent, the variance-covariance matrix V(μ) is no longer diagonal. In general, the quasi-score U = (U1, . . . , Un)ᵀ has
which implies that the vector field defined by the quasi-score U(μ, y) is path dependent. Thus there does not exist a scalar function Q(μ, y) whose partial derivatives are the quasi-scores. Therefore, the integral Q(μ, y) in (3.6) is path dependent and is not well defined. In such a case, inference cannot be made based on the function Q(μ, y). One would rather use the quasi-score function U(μ, y), which satisfies the three fundamental properties of the log-likelihood functions as pointed out previously. The asymptotic consistency also holds under some rather complicated conditions (McCullagh 1991).
Since the matrix of expected values of the partial derivatives of the quasi-score function U(μ, y) is symmetric but the matrix of partial derivatives itself is not, McCullagh (1991) pointed out the possibility of a decomposition of U(μ, y) into two terms, one main term with symmetric partial derivatives and one small "noise" term with asymmetric partial derivatives. Such a decomposition allows the study of the quasi-scores U(μ, y) via the quasi-likelihood of the first term without losing much information. Li and McCullagh (1994) studied potential functions and conservative estimating functions. They projected the estimating functions onto a subspace of conservative estimating functions, in which the estimating functions have symmetric partial derivatives and thus have a quasi-likelihood function. The quasi-likelihood is named the potential function of the estimating function.
The estimating functions are a broad class of functions whose equations yield the parameter estimators. The quasi-score functions are a special class of the estimating functions. They are linear in y and yield an asymptotically consistent estimator. The potential functions have asymptotic properties similar to those of the ordinary log-likelihood functions, as pointed out by Li and McCullagh, and thus may help to determine the desired solution from the possible multiple solutions of the quasi-score equations.
3.4 Penalized Score Equations
In previous sections, I reviewed generalized linear models, likelihood functions, score functions and quasi-likelihood. As a generalization of the likelihood function, quasi-likelihood focuses on the first two moments and the relation between them without specifying the entire likelihood. Similarly, one can generalize penalization via penalized score equations without specifying the entire likelihood. In order to introduce penalized score equations, we consider the results of Theorems 1 and 2 in Chapter 2. First, I give some remarks.
Remarks
1. Problem (P3) and its solution are independent of joint likelihood functions. Notice that no assumption is made in Theorem 1 on joint likelihood functions.
2. Theorems 1 and 2 apply to all distributions that have a concave joint likelihood function, particularly the most popular Gaussian, Poisson and binomial distributions in the exponential family. The Jacobian condition is satisfied by the minus gradient of the concave joint likelihood functions.
3. Theorem 2 implies that if the Bridge (Lasso) estimator for γ ≥ 1 is defined to be the unique solution of (P3), it has no conflict with the Bridge estimator of (P2). Therefore, one can regard the unique solution of (P3) as the Bridge estimator.
It follows from the above remarks that if a joint likelihood function exists, (P3) can be solved to obtain the Bridge estimator of (P2); if no joint likelihood function exists, (P3) can still be solved to obtain the unique solution as long as the Jacobian condition is satisfied. However, problem (P2) does not apply in such a case. Hence, one can always start with problem (P3) to solve for the estimator regardless of the existence of joint likelihood functions. Therefore, the concept of penalization and its estimator are generalized to be independent of joint likelihood functions. We introduce penalized score equations.
Consider the equations
Definition 1 (Penalized Score Equations)
Equation (3.8) with the function S satisfying the Jacobian condition that ∂S/∂β is positive-semi-definite is called the penalized score equations with the Bridge penalty Σ |βj|^γ.
Definition 2 (Bridge Estimator)
Given λ > 0 and γ > 1, the Bridge estimator is defined to be β̂(λ, γ), the unique solution of Equation (3.8); the Lasso estimator is defined to be β̂(λ, 1), the limit of β̂(λ, γ) as γ → 1+.
Through Definitions 1 and 2, the concept of penalization and its estimator are generalized. In fact, they can be further generalized for different penalty functions as follows.
Remarks.
1. The concept of penalized score equations can be extended in general to a penalty of Σ g(βj), where g is a smooth convex function. One can define penalized score equations using (P3) with the partial derivatives of different penalty functions.
2. The Bridge (Lasso) estimator defined in Definition 2 is independent of joint likelihood functions. It can thus be applied to cases in which no joint likelihood function exists.
The penalized score equations approach is broad compared with the classical approach to penalization, which minimizes the deviance, i.e. −2 log(Lik), plus a penalty function. Such a generalization is crucial to circumvent the difficulty of the non-existence of joint likelihood functions in regression problems where penalization is desirable due to highly correlated regressors. One major application is to apply this method to the GEE, in which no joint likelihood function exists in general. By solving the penalized GEE for the Bridge (Lasso) estimator, one can achieve better predictions overall when collinearity is present among regressors; see Chapter 4 for the algorithm and Chapter 6 for simulation results.
3.5 Algorithms for Penalized Score Equations
In Section 3.4, penalized score equations were introduced theoretically. To solve for the Bridge estimators, the modified Newton-Raphson algorithm and the Shooting algorithm were developed in Section 2.3. For Gaussian responses, the above algorithms can be applied directly. For non-Gaussian responses, the methods have to be applied via the IRLS procedure as follows.
Algorithm for the Bridge (Lasso) Estimator via the IRLS Procedure
1). Start with an initial value β̂0.
2). Define the adjusted dependent variable z based on the current estimate β̂: z = Xβ̂ + [V(μ)]⁻¹(y − μ);
3). Apply the M-N-R (Shooting) method to the linear regression of Wz on WX to update β̂, where W = V(μ)^{1/2};
4). Repeat steps 2) and 3) till convergence of β̂ is achieved.
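A sketch of this scheme for a Poisson response with the Lasso penalty (γ = 1), combining a Shooting update with the IRLS weighting. We take W = V(μ)^{1/2} so that ordinary least squares on (Wz, WX) reproduces weighted least squares with weights V(μ); iteration counts and tolerances are arbitrary choices of ours, not the thesis's.

```python
import numpy as np

def penalized_irls_lasso(X, y, lam, n_irls=50, n_cd=200, tol=1e-8):
    """Lasso-penalized Poisson regression via IRLS + Shooting.

    Each outer step forms the adjusted dependent variable z and weights
    V(mu) = mu, then runs the Shooting coordinate updates on the weighted
    linear regression of W z on W X with W = sqrt(mu).
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_irls):
        beta_prev = beta.copy()
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu              # adjusted dependent variable
        W = np.sqrt(mu)                      # W = V(mu)^{1/2}
        Xw, zw = X * W[:, None], W * z
        for _ in range(n_cd):                # Shooting on the weighted problem
            b_old = beta.copy()
            for j in range(p):
                r_j = zw - Xw @ beta + Xw[:, j] * beta[j]
                S0 = -2.0 * Xw[:, j] @ r_j
                denom = 2.0 * Xw[:, j] @ Xw[:, j]
                if S0 > lam:
                    beta[j] = (lam - S0) / denom
                elif S0 < -lam:
                    beta[j] = (-lam - S0) / denom
                else:
                    beta[j] = 0.0
            if np.max(np.abs(beta - b_old)) < tol:
                break
        if np.max(np.abs(beta - beta_prev)) < tol:
            break
    return beta
```

With λ = 0 the scheme reduces to the unpenalized IRLS fit, and a very large λ drives all coefficients to zero.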
Here, I would like to point out that even if no joint likelihood function exists, one can still apply the modified Newton-Raphson method or the Shooting method to obtain the Bridge (Lasso) estimator as long as the Jacobian condition is satisfied. The convergence of the above algorithm is guaranteed by the following theorem.
Theorem 4 (Convergence of the Algorithms)
Given fixed λ > 0, if ∂S/∂β is positive-definite, then
(1) the modified Newton-Raphson algorithm converges to the Bridge estimator of (P3) for γ > 1;
(2) the Shooting algorithm converges to the Lasso estimator of (P3) for γ = 1.
As pointed out in Section 2.3, the modified Newton-Raphson and Shooting algorithms converge very fast, even combined with the IRLS procedure.
Chapter 4
Penalized GEE
4.1 Introduction
In public health studies, investigators often observe a series of observations of interest over time. For example, in an asthma study, each of the subjects in the study is monitored for a period of time, say one year. The subject's asthmatic status is observed at each visit, along with some factors, such as the quality of the air in the surrounding area where the subject lives, the season, temperature, and humidity, etc. Very often, the main interest of the investigator is to find the relation between the response variable, like the asthmatic status, and a set of explanatory variables, like the quality of the air, humidity, temperature, etc. This type of study is in a special statistical setting, called longitudinal studies, and the goal is to identify the dependence of the time trend of the response on explanatory variables.
During the past two decades, longitudinal studies have attracted attention from many statisticians and public health researchers, and their applications can be found in many research areas, for example, medical studies, environmental studies and psychological studies (Laird and Ware 1982, Liang, Zeger and Qaqish 1993). Statistical methods in longitudinal studies include random effects models, conditional Markov chain models, and the generalized estimating equations method, etc. (Diggle, Liang and Zeger 1993). In this chapter, I focus on the generalized estimating equations method and apply penalization via the penalized score equations approach when collinearity is present among explanatory variables.
4.2 Generalized Estimating Equations
Consider a longitudinal study of K subjects. Each subject has a series of observations, the response variable yit and a vector of predictors xit, where i = 1, . . . , K and t = 1, . . . , ni for the i-th subject. When investigators are mainly interested in the effect of explanatory variables on the response variable, Liang and Zeger (1986) and Zeger and Liang (1986) proposed the following generalized estimating equations (GEE) based on the marginal distribution of the response Yit, f(yit) = exp[{yit θit − a(θit) + b(yit)}φ],
where
is the working covariance matrix, and Di = ∂{a′i(θ)}/∂β = Ai Δi Xi, Δi = diag(∂θit/∂ηit), ηit = xitᵀβ, Ai = diag(a″(θit)), and Si = yi − a′i(θ). To incorporate the correlation of observations from the same subject, the GEE assumes a certain correlation structure by specifying a working correlation matrix R(α). It has been shown that the estimator of (4.1) is consistent as K tends to infinity even if the working correlation matrix R(α) is specified incorrectly,
K^{1/2}(β̂ − β) → N(0, V),
where
V = (Σ Diᵀ Vi⁻¹ Di)⁻¹ (Σ Diᵀ Vi⁻¹ Cov(yi) Vi⁻¹ Di) (Σ Diᵀ Vi⁻¹ Di)⁻¹ (4.2)
is called the "sandwich estimator" of the variance. The efficiency will be improved if the correlation matrix is specified correctly. Detailed discussions can be found in Liang and Zeger (1986) and Zeger and Liang (1986).
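The sandwich formula (4.2) can be sketched directly, estimating Cov(yi) by the residual outer product ri riᵀ. The helper below is our own illustration, not the thesis's code; it assumes an identity link (so Di = Xi) and defaults to working independence (Vi⁻¹ = I).

```python
import numpy as np

def gee_sandwich(X_list, r_list, Vinv_list=None):
    """Sandwich variance (4.2) with Cov(y_i) estimated by r_i r_i'.

    X_list[i]: (n_i, p) design for subject i (identity link, so D_i = X_i);
    r_list[i]: (n_i,) residuals y_i - mu_i at the fitted estimate;
    Vinv_list: optional working inverse covariances, identity by default.
    """
    p = X_list[0].shape[1]
    bread = np.zeros((p, p))
    meat = np.zeros((p, p))
    for i, (Xi, ri) in enumerate(zip(X_list, r_list)):
        Vinv = np.eye(len(ri)) if Vinv_list is None else Vinv_list[i]
        bread += Xi.T @ Vinv @ Xi            # sum D_i' V_i^{-1} D_i
        u = Xi.T @ Vinv @ ri                 # D_i' V_i^{-1} r_i
        meat += np.outer(u, u)               # D_i' V_i^{-1} r_i r_i' V_i^{-1} D_i
    B = np.linalg.inv(bread)
    return B @ meat @ B
```

When every cluster has a single observation, this collapses to the familiar heteroskedasticity-robust covariance (XᵀX)⁻¹ (Σ ri² xi xiᵀ) (XᵀX)⁻¹.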
We consider the GEE in a regression setting. As in linear regressions, the potential problem of collinearity also occurs, i.e. if the explanatory variables in the GEE model are close to collinear, the variance of the estimator will be large and predictions based on the estimator may perform poorly. Therefore, penalization is desirable, as shown in the previous chapters. However, the classical approach to penalization, for example Bridge regression, requires the existence of joint likelihood functions, as discussed in Chapter 3. Since the GEE assumes a special structure of the correlation, there does not exist, in general, a joint likelihood function such that its potential score functions (the partial derivatives with respect to βj) would be the estimating functions of the GEE (McCullagh and Nelder 1989, McCullagh 1991). Such difficulty hinders the implementation of penalization in the GEE.
The penalized score equations approach generalizes penalization and provides the techniques to handle the collinearity problem in the GEE, since the penalized score equations do not depend on joint likelihood functions and can easily be applied via the iteratively reweighted least-squares (IRLS) procedure. In the following, I apply the penalized score equations to the GEE and solve the penalized GEE to achieve better estimation and prediction.
4.3 Penalized GEE
Since it was proposed, the GEE has been widely used in longitudinal studies. Although the GEE estimator is asymptotically consistent and efficient, one may find that the explanatory variables of interest are collinear or close to collinear, especially when a large number of explanatory variables are involved. This raises a question about the accuracy of estimation and prediction based on the parameter estimator of (4.1).
It is well known that penalization provides the techniques to handle the collinearity problem in linear regressions. The classical approach to penalization is to minimize the deviance of the model plus a penalty function. For example, if the joint likelihood function is L(β), then the penalization problem is
min_β (−2 log L(β) + λ Σ |βj|^γ)
for Bridge penalization.
However, as discussed in the last section, there does not exist a joint likelihood function L(β) for the GEE in general. To apply penalization to the GEE, one needs special techniques which do not depend on joint likelihood functions. The penalized score equations approach provides such a technique and thus serves the need.
In the following, I apply the Bridge penalty to the GEE. Similarly, one can consider
other types of penalty functions as discussed in Chapter 3.
Consider the following equations in the format of (P3),
where the function d(β_j, λ, γ) = λγ |β_j|^{γ−1} sign(β_j) and the S_j's are the minus estimating
functions of the GEE, or the minus score functions of some joint likelihood function,
if it exists. Therefore, it is natural to have the Jacobian condition on the function S =
(S_1, …, S_p)^T: the Jacobian ∂S/∂β is positive semi-definite.
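As a concrete sketch, the Bridge penalty derivative d(β_j, λ, γ) that enters the penalized score equations can be computed elementwise as follows. This is an illustrative helper in Python/NumPy; the function name and vectorized form are mine, not the thesis code.

```python
import numpy as np

def bridge_penalty_deriv(beta, lam, gamma):
    """Derivative of the Bridge penalty lam * sum |beta_j|**gamma:
    d(beta_j, lam, gamma) = lam * gamma * |beta_j|**(gamma - 1) * sign(beta_j).
    At beta_j = 0 this returns 0 (a convention for the nondifferentiable
    Lasso case gamma = 1)."""
    beta = np.asarray(beta, dtype=float)
    return lam * gamma * np.abs(beta) ** (gamma - 1.0) * np.sign(beta)

# For gamma = 2 (ridge) the derivative is the familiar 2 * lam * beta.
print(bridge_penalty_deriv([1.0, -0.5, 0.0], lam=0.1, gamma=2.0))
```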
Consider the estimating functions on the left-hand side of the GEE (4.1). Take the partial
derivative of the minus estimating function with respect to β and denote the derivative
by H,
since, by the regularity conditions (Liang and Zeger, 1986), the partial derivative
is bounded. Since S_i, i = 1, …, K, are mutually independent with expected value
E(S_i) = 0 and finite variance Var(S_i) ≤ C < ∞, where C is a large constant independent
of i, by the Weak Law of Large Numbers (Durrett, 1991, page 29) the first term of (4.4)
converges to 0 in L² and in probability as K tends to infinity, and the second term, a
positive definite matrix, converges to a positive semi-definite matrix. Hence H/K, and
thus H, satisfies a weak form of the Jacobian condition of Theorem 2.1 for sufficiently
large values of K. Therefore, the existence and uniqueness of the solution of problem
(4.3) are guaranteed. This implies that the Bridge estimator of (4.3) is well defined. One
can penalize the GEE via the penalized score equations approach. The penalized GEE
shrinks the GEE estimator towards 0 to achieve small variance and better prediction when
collinearity is present among regressors.
To solve for the estimator of the penalized GEE, one follows the procedure in Liang
and Zeger (1986) and applies penalization to the weighted least-squares step in the iteratively
reweighted least-squares (IRLS) procedure. The algorithm is outlined as follows.
Algorithm for Penalized GEE
(1). Start with an initial value β̂_0.
(2). Estimate the parameters α, φ and the working correlation matrix R(α) using Pearson or
deviance residuals based on the current estimate β̂.
(3). Define the adjusted dependent variable z = Dβ̂ + S.
(4). Update the estimator β̂ for fixed λ ≥ 0 and γ ≥ 1 by applying penalization to the
regression of z on X with weights V using the M-N-R (Shooting) method.
(5). Repeat steps (2) to (4) till convergence of β̂ is achieved.
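The loop above can be sketched in code. This is a minimal illustration rather than the thesis implementation: it assumes an independence working correlation (so step (2) is trivial and the iteration reduces to ordinary logistic IRLS) and uses the ridge penalty (γ = 2), for which the penalized weighted least-squares update of step (4) has a closed form; the Bridge and Lasso cases would replace that update with the modified Newton-Raphson or Shooting step. The function name is mine.

```python
import numpy as np

def penalized_irls_logistic(X, y, lam, n_iter=25, tol=1e-8):
    """Penalized IRLS sketch for binary responses.

    Assumed simplifications: independence working correlation and the
    ridge penalty (gamma = 2), giving a closed-form step (4)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))       # mean under the logit link
        w = mu * (1.0 - mu)                   # IRLS weights
        z = eta + (y - mu) / w                # adjusted dependent variable
        # Step (4): weighted least squares of z on X with a ridge penalty.
        XtWX = X.T @ (w[:, None] * X)
        beta_new = np.linalg.solve(XtWX + lam * np.eye(p), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

With λ = 0 this is ordinary logistic IRLS; λ > 0 shrinks the coefficients towards 0.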
Solving the penalized GEE for the Bridge (Lasso) estimators, one achieves better
estimation, and predictions may perform better when collinearity is present among the
explanatory variables, as demonstrated in Chapters 6 and 7.
Chapter 5
Selection of Shrinkage Parameters
5.1 Introduction
In regression problems, one frequently needs to select models according to the following
general rules: (1) to have a good fit to the data, and (2) to maintain a simple and inter-
pretable model. The former can usually be achieved by including as many explanatory
variables as possible in the model, while the latter by excluding variables that are not
statistically significant. However, if there are a large number of explanatory variables, it
is hard in general to choose a good model to satisfy both (1) and (2) simultaneously. Very
often, it is easy to have a large model with many regressors. Then over-fitting becomes a
major problem in these models.
Over-fitting occurs when models include more regressors than necessary and fit the
data extremely well at all given data points. The models perform very poorly in prediction
because the pattern in the data is mis-identified due to interpolation at the given data points
with many unnecessary regressors in the model. These models are misleading, and thus
over-fitting should be prevented as much as possible.
5.2 Cross-Validation and Generalized Cross-Validation
To handle the over-fitting problem, the cross-validation (CV) method was introduced (Stone,
1974). It selects a model by leaving out one observation at a time and minimizing the
average prediction error at the left-out points with the model built on the remaining data
points, i.e.,

min_λ CV(λ),

where

CV(λ) = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i^{−i})²,

ŷ_i^{−i} = x_i^T β̂^{−i}(λ), β̂^{−i}(λ) is the estimate of the model based on the observations excluding
(x_i, y_i), and λ is a tuning parameter for model selection. There are many applications of
cross-validation methods in model fitting and selection. Major references can be found
in Stone (1974), Hastie and Tibshirani (1990), Wahba (1990), Shao (1993) and Zhang
(1992).
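A direct (unoptimized) sketch of the leave-one-out computation, here for a ridge-penalized linear model with λ as the tuning parameter; the helper name is mine.

```python
import numpy as np

def loo_cv(X, y, lam):
    """CV(lam): average squared prediction error at each left-out point,
    with a ridge-penalized least-squares model refit on the remaining
    observations each time."""
    n, p = X.shape
    errors = []
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        # fit on the n - 1 remaining observations
        beta = np.linalg.solve(Xi.T @ Xi + lam * np.eye(p), Xi.T @ yi)
        errors.append((y[i] - X[i] @ beta) ** 2)
    return np.mean(errors)
```

One would evaluate `loo_cv` over a grid of λ values and keep the minimizer.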
Craven and Wahba (1979) introduced the generalized cross-validation (GCV) for linear
smoothing splines to optimize the smoothing parameter λ. It takes the form of

GCV(λ) = (1/n) ||(I − A(λ)) y||² / [ (1/n) tr(I − A(λ)) ]²

for the linear operator ŷ = A(λ)y of model Y = g + ε.
One advantage of the GCV is that it is not necessary to compute the estimates n
times, one for each single leave-out data point selected for cross-validation. It suffices to
compute the total deviance (RSS) of the full model, the degrees of freedom of the model
and the sample size. Therefore, it is less expensive computationally and can easily be
computed with an advanced programming language, such as S+.
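This shortcut can be sketched for a concrete linear smoother, here ridge regression with A(λ) = X(X^T X + λI)^{−1} X^T (an illustrative choice; the thesis applies the idea to smoothing splines and penalized models generally):

```python
import numpy as np

def gcv_ridge(X, y, lam):
    """GCV(lam) = (1/n)||(I - A)y||^2 / (1 - tr(A)/n)^2 for the ridge
    smoother A = X (X'X + lam I)^{-1} X'.  Only the RSS, the effective
    degrees of freedom tr(A) and n are needed -- no leave-one-out refits."""
    n, p = X.shape
    A = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - A @ y
    rss = resid @ resid
    df = np.trace(A)         # effective degrees of freedom
    return (rss / n) / (1.0 - df / n) ** 2
```

At λ = 0 (full-rank X) this reduces to the OLS residual sum of squares scaled by (1 − p/n)^{−2}/n.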
5.3 Selection of Parameters λ and γ via the GCV
To select the penalty parameters λ and γ, we use the generalized cross-validation (GCV)
method of Craven and Wahba. First, we have from (P3) that the Bridge estimator of the
linear regression model satisfies
We define the effective number of parameters p(λ, γ) of the model, following Craven and
Wahba, to assess the penalty effect on the degrees of freedom of the model,
where D is a p × p diagonal matrix of elements
and n_0 is the number of j such that β̂_j = 0 for γ = 1, compensating the loss of the inverse of
the zero entries on the diagonal of the matrix D due to β̂_j = 0. The GCV is defined as

GCV(λ, γ) = RSS / [ n (1 − p(λ, γ)/n)² ],

where n is the sample size. It can be re-written as

GCV(λ, γ) = n RSS / (n − p(λ, γ))²,

and be interpreted as the average quantity of squared residual over each remaining effective
degree of freedom out of the model.
To select the parameters λ and γ, we compute the GCV for each pair of (λ, γ) over a
grid of λ ≥ 0 and γ ≥ 1. λ and γ are selected to achieve the minimal value of the GCV,
as shown in Figure 5.1.
For generalized linear models, the GCV must be modified since the residual sum of
squares (RSS) is no longer meaningful for non-Gaussian response variables. Instead, the
deviance, −2 log(Lik), can be used to replace RSS in the GCV, where Lik is the joint
likelihood function of the response variable. The optimizing procedure remains the same.

Figure 5.1: Selection of parameters λ and γ via GCV
We consider two special cases for the effective number of parameters p(λ, γ).
1. λ = 0. No penalty is applied to the model. p(λ, γ) is the trace of the projection matrix
and is thus equal to p, the number of parameters in the linear model.
2. λ ≫ 1 and γ = 1. Since the Lasso shrinks the parameters and yields β̂_j = 0,
j = 1, …, p, for sufficiently large λ, D = diag(0) and n_0 = p. The model is thus null since
all β̂_j = 0. Hence the effective number of parameters of the model is equal to 0, which
agrees with the calculation of p(λ, γ) = p − p = 0.
For other cases, p(λ, γ) is greater than 0 and less than p, the number of parameters in
the model.
5.4 Quasi-GCV for Penalized GEE
The GCV method was used to select the parameters λ and γ for generalized linear models in
the last section. However, as pointed out in Chapter 4, no joint likelihood function exists for
the GEE in general. Hence, the GCV method does not apply for the penalized GEE and
thus must be modified.
To generalize the GCV method for the penalized GEE, the correlation structure must
be incorporated. By incorporating the correlation, one may achieve the same effect of
the GCV as in generalized linear models. Notice that the deviance used in the GCV for
generalized linear models is the sum of squares of deviance residuals. Although deviance
does not have a proper meaning in the GEE due to the correlation, the deviance residuals
can still be calculated at each single observation point as usual,
where L(y_kt, μ̂_kt) is the likelihood of observation y_kt based on its marginal distribution.
A weighted deviance D_w(λ, γ) for correlated observations is thus formed as follows by
incorporating the correlation structure into the deviance residuals to achieve a similar effect
to the deviance for independent observations,
where r_k is the deviance residual vector of subject k, and R_k(α) of dimension n_k × n_k is the
working correlation matrix.
We then define a quasi-GCV to be
where n is the effective number of degrees of freedom of the correlated observations y_kt,
t = 1, …, n_k, defined as
and |R_k(α)| is the sum of all elements Σ ρ_ij of R_k(α) = (ρ_ij). Since the correlation
structure of the GEE is estimated via either Pearson residuals or deviance residuals,
deviance residuals are recommended in order to incorporate the correlation structure into
the deviance residuals.

Figure 5.2: Selection of parameters λ and γ via quasi-GCV
The parameter selection procedure remains the same as for the generalized linear mod-
els, i.e. for each fixed pair of (λ, γ), compute the Bridge (Lasso) estimator β̂(λ, γ). Then
compute the effective number of parameters p(λ, γ). The quasi-GCV is thus computed
using (5.3) with the deviance residuals and correlation matrix R(α) obtained from the
last step of the IRLS procedure for the penalized GEE. The parameters λ and γ are then
selected over a grid to minimize the quasi-GCV, as shown in Figure 5.2.
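The two ingredients of the quasi-GCV can be sketched as follows, under assumed forms: D_w = Σ_k r_k^T R_k(α)^{−1} r_k for the weighted deviance (so that it reduces to the ordinary sum of squared deviance residuals when R_k = I), and the per-subject effective degrees of freedom n_k²/|R_k| summed over subjects. Both the function names and the exact aggregation across subjects are mine, not the thesis code.

```python
import numpy as np

def weighted_deviance(resid_by_subject, R_by_subject):
    """Assumed form D_w = sum_k r_k' R_k(alpha)^{-1} r_k: deviance residuals
    whitened by the working correlation of each subject."""
    return sum(r @ np.linalg.solve(R, r)
               for r, R in zip(resid_by_subject, R_by_subject))

def effective_n(R_by_subject):
    """Assumed aggregation: n_k^2 / |R_k| per subject, where |R_k| is the
    sum of all elements of R_k(alpha), summed over subjects."""
    return sum(R.shape[0] ** 2 / R.sum() for R in R_by_subject)

# With an identity working correlation the weighted deviance reduces to the
# ordinary deviance and the effective n to the actual number of observations.
r = [np.array([1.0, -2.0, 0.5])]
print(weighted_deviance(r, [np.eye(3)]))   # 1 + 4 + 0.25 = 5.25
print(effective_n([np.eye(3)]))            # 9 / 3 = 3.0
```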
Remarks
1. We refer to D_w(λ, γ) as the weighted deviance. It reduces to the deviance when the correlation
matrix R(α) reduces to the identity matrix for independent observations; accordingly, the
quasi-GCV reduces to the GCV.
2. The effective number of degrees of freedom of correlated observations depends on
the correlation coefficient matrix R(α). Since different values of λ and γ yield different
estimates and different values of R(α), n seems to vary with λ and γ. However, since the
effective number of degrees of freedom is intrinsic to the observations and the subject,
n must be independent of λ and γ. Therefore, a constant value of n should be used to
compute the quasi-GCV for different λ and γ. We recommend using the estimate of n
from λ = 0.
The weighted deviance is motivated by correlated Gaussian responses as follows.
Assume Y = (Y_1, …, Y_n)^T are correlated responses from the model Y = Xβ + ε with ε ∼ N(0, Σ), where Σ is a non-diagonal variance-covariance matrix of ε.
In order to apply the GCV method for independent responses, we take a transformation
Z = PY, where P = Λ^{−1/2} Q with Σ = Q^T Λ Q. Then Z follows a normal distribution
N(PXβ, I). Applying the GCV to Z, one has
i.e. the GCV is achieved by incorporating the correlation structure in the residuals.
Similarly, one incorporates the correlation structure into the deviance residuals as in (5.3) to
achieve the same effect for the penalized GEE.
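The whitening step above can be checked numerically; Sigma below is an arbitrary example covariance matrix, not taken from the thesis.

```python
import numpy as np

# With Sigma = Q' Lambda Q (eigendecomposition, Q orthogonal), the
# transformation P = Lambda^{-1/2} Q whitens the responses: P Sigma P' = I.
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.5],
                  [0.3, 0.5, 1.0]])
eigvals, eigvecs = np.linalg.eigh(Sigma)   # Sigma = V diag(eigvals) V'
Q = eigvecs.T                              # so that Sigma = Q' diag(eigvals) Q
P = np.diag(eigvals ** -0.5) @ Q
print(np.allclose(P @ Sigma @ P.T, np.eye(3)))  # True
```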
The effective number of degrees of freedom of correlated observations is also motivated
from correlated Gaussian observations. Assume Y = (Y_1, …, Y_n)^T follows the distribution
N(0, σ²R), where the matrix R = (ρ_ij) has diagonal elements ρ_ii = 1.
Consider the variance of the sample mean Ȳ:

Var(Ȳ) = (1/n²) Σ_i Σ_j Cov(Y_i, Y_j) = σ² |R| / n²,

where |R| = Σ_i Σ_j ρ_ij is the sum of all elements of R.
Notice that for the special case where the Y_i's are independent, R is thus an identity matrix
and Var(Ȳ) = σ²/n. The denominator n is the number of degrees of freedom of the n
independent observations Y_1, …, Y_n. By analogy, we define n²/|R|, the denominator
of (5.4), to be the effective number of degrees of freedom of the correlated observations
Y_1, …, Y_n. For non-negative correlation coefficients ρ_ij ≥ 0, this effective number of degrees
of freedom is between 1 and n, as the former is for n repeats of Y_1 and the latter is for n
independent observations (Y_1, …, Y_n).
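The two boundary cases and an intermediate exchangeable correlation can be checked numerically (the helper name is mine):

```python
import numpy as np

def effective_df(R):
    """n^2 / |R|, with |R| the sum of all elements of the correlation matrix."""
    n = R.shape[0]
    return n ** 2 / R.sum()

n = 5
# Independence: R = I gives n degrees of freedom.
print(effective_df(np.eye(n)))              # 25 / 5 = 5.0
# Perfect exchangeable correlation: all-ones R gives 1 (n repeats of Y_1).
print(effective_df(np.ones((n, n))))        # 25 / 25 = 1.0
# Exchangeable correlation rho = 0.5 lies strictly between 1 and n.
rho = 0.5
R = (1 - rho) * np.eye(n) + rho * np.ones((n, n))
print(effective_df(R))                      # 25 / (5 + 0.5 * 20) ~ 1.667
```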
There might be some problems with negative correlation. However, it is very rare in
practice to have a series of observations with negative correlation. Especially for longitudi-
nal studies, one expects positively correlated responses from the same subject. Therefore,
the effective number of degrees of freedom works well for longitudinal studies in general.
Chapter 6
Simulation Studies
In this chapter, I conduct a series of statistical simulations based on true models in order to
examine the shrinkage effect of Bridge regression. The Bridge penalty model is compared
with the no-penalty, the Lasso penalty and the ridge penalty models in the settings of linear
regression, logistic regression (the generalized linear model for the binomial distribution) and the
GEE for binary outcomes. The standardized mean squared error (MSE) of the regression
parameters,

MSE = ave (β̂ − β)^T (X^T X) (β̂ − β),

and the prediction squared error PSE = ave Dev(y, μ̂), averaged over replicates of the
model random error, are computed and compared for the different penalty models. For the
logistic regression and the GEE model, the misclassification error (MCE) is also computed
as an average over replicates of the model random error,

MCE = ave I(y, ẑ),

where I(·, ·) is the indicator function.
For each replicate of random error generated for the model, the PSE and MCE are com-
puted as an average at some randomly selected points in the covariate space having the
same correlation structure as X. The standard error of each quantity is also computed.
6.1 A Linear Regression Model
We compare the Bridge model with the OLS, the Lasso and the ridge in a simulation of
a simple model of 40 observations and 5 covariates,
where ε ∼ N(0, σ²). The signal-noise ratio is thus calculated by [vâr(Xβ)/σ²], where
vâr is the sample variance of (x_1^T β, …, x_n^T β), β is the true parameter and x_i is the
covariate vector of the i-th observation.
To examine the shrinkage effect on collinearity, we choose a regression matrix X with
strong linear correlation, as shown in the correlation matrix of X. The correlation coeffi-
cient between x_4 and x_5 is very large, ρ = 0.995. The matrix X is generated as follows.
First, a matrix of 40 × 5 is generated with random numbers of the standard normal distribu-
tion N(0, 1). Then the pairwise correlation coefficients of consecutive column vectors of
X are generated from the uniform distribution U(−1, 1). The pairwise correlation coefficients
of the consecutive column vectors are achieved by adding a multiple of the second column
to the first. To shrink the parameters of the regressors but not the intercept, we center
and scale the data by
where x_j is the j-th column vector of X.
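A sketch of this construction for the key pair (x_4, x_5); the seed and the direct mixing formula are illustrative choices rather than the thesis code.

```python
import numpy as np

# Start from independent N(0,1) columns, induce a target correlation between
# the last two columns by mixing them, then center and scale each column
# (so that the intercept is not penalized).
rng = np.random.default_rng(1998)          # seed is arbitrary
n, p, rho = 40, 5, 0.995
X = rng.standard_normal((n, p))
X[:, 4] = rho * X[:, 3] + np.sqrt(1 - rho ** 2) * X[:, 4]
X = (X - X.mean(axis=0)) / X.std(axis=0)   # center and scale each column
print(np.corrcoef(X[:, 3], X[:, 4])[0, 1])  # close to 0.995
```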
Since the Lasso performs well compared to the ridge if the true model has coefficients 0,
but performs poorly if the true model has small but non-zero coefficients, two sets of true
β were selected to examine the shrinkage effect on models with coefficients 0 and models
Correlation matrix of the linear model (6.1)

        x1       x2       x3       x4       x5
x1    1.000    0.110   -0.144    0.036    0.066
x2    0.110    1.000   -0.315    0.021    0.034
x3   -0.144   -0.315    1.000   -0.118   -0.109
x4    0.036    0.021   -0.118    1.000    0.995
x5    0.066    0.034   -0.109    0.995    1.000
with small but non-zero coefficients: β_true = (0, 0, 0.5, 0, −1)^T with intercept β_0 = 0 for
Model (a), and β_true = (0.5, 3, −0.1, 2.5, 9)^T with intercept β_0 = 0 for Model (b). The
response Y is generated from model (6.1) with a signal-noise ratio equal to 6.
Table 6.1 shows the parameter estimates, the standard errors in parentheses, the MSE
and PSE of the OLS, the Bridge, the Lasso and the ridge models. The standard errors of
β̂_4 and β̂_5 are relatively large in both Models (a) and (b) due to collinearity.
In Model (a), the Bridge and the Lasso achieve the smallest MSE = 1.104, followed by
the ridge MSE = 1.212. The OLS has the greatest MSE = 1.385 due to collinearity. The
Bridge and the Lasso also achieve the smallest prediction error PSE = 2.482, followed by
the ridge PSE = 2.745. The OLS has the greatest prediction error PSE = 3.178. The
reduction of MSE by the Bridge is 20% from the OLS, and the reduction of PSE by the
Bridge is 22% from the OLS.
In Model (b), the ridge achieves the smallest MSE = 127.90, followed closely by the
Bridge MSE = 129.60 and the Lasso MSE = 130.16. The OLS has the greatest MSE =
145.17. The ridge also achieves the smallest prediction error PSE = 286.42, followed by
the Bridge PSE = 290.29, and by the Lasso PSE = 292.10. The OLS has the greatest
Table 6.1: Model comparison by simulation of 200 runs

Model (b)
            OLS              Bridge           Lasso            Ridge
MSE     145.17(5.98)     129.60(5.50)     130.16(5.52)     127.90(5.70)
PSE     329.30(14.22)    290.29(12.87)    292.10(12.90)    286.42(13.26)
prediction error PSE = 329.30. The reduction of MSE by the Bridge is 11% from the
OLS, and the reduction of PSE by the Bridge is 12% from the OLS.
It is shown in the above example that the Bridge regression shrinks the OLS estimators
and achieves small variance, small mean squared error and small prediction error. It is
also demonstrated that the Bridge estimator performs well compared to the Lasso and
the ridge estimators, and performs better than the OLS estimator.
6.2 A Logistic Regression Model
We apply the Bridge penalty to a logistic regression model and compare the Bridge model with
logistic regression of no penalty, of the Lasso penalty and of the ridge penalty for the following
model of 20 binary responses and 3 regressors:
As above, we standardize the regression matrix by
To examine the shrinkage effect with the presence of collinearity, we choose a regression
matrix X such that the covariates x_2 and x_3 are highly correlated with correlation coeffi-
cient 0.9975. The matrix X is generated as follows. First, a matrix of 20 × 3 is generated
with random numbers of the standard normal distribution N(0, 1). Then a large multiple
of the third column vector is added to the second in order to achieve a high correlation
between the two columns.

Correlation matrix of the logistic regression model (6.2)

Since the Lasso penalty model and the ridge penalty model perform differently de-
pending on whether the true model has coefficients 0 or small non-zero coefficients,
two sets of true β are selected to examine the shrinkage effect of Bridge penalization:
β = (0.1, 0.1, −0.1)^T with intercept β_0 = 0.1 for Model (a), and β = (0, 1, −1)^T with
intercept β_0 = 0 for Model (b). The Bridge model is compared with logistic regression,
the Lasso and the ridge through a simulation of 100 runs. Table 6.2 shows the estimates,
the standard errors in the parentheses, MSE, MCE and PSE averaged at 20 randomly
selected points having the same correlation structure as X.
Overall, the standard errors of β̂_2 and β̂_3 are relatively large for both Models (a)
and (b) due to collinearity. The logistic regression estimator has the greatest standard
errors in both Models (a) and (b), which leads to poor performance in prediction with
the greatest MCE and PSE.
In Model (a), the ridge achieves the smallest MSE = 1.688, followed by the Bridge
MSE = 1.902 and the Lasso MSE = 1.907. The logistic regression has the greatest
MSE = 3.058. The ridge has the smallest prediction error PSE = 1.569, followed by the
Lasso PSE = 1.588 and the Bridge PSE = 1.590. The logistic regression has the greatest
prediction error PSE = 1.782. However, the Lasso has the smallest misclassification
Table 6.2: Model comparison by simulation of 100 runs

Model (a)
            Logistic         Bridge           Lasso            Ridge
MCE     0.519(0.012)     0.488(0.015)     0.485(0.015)     0.514(0.012)
PSE     1.782(0.047)     1.590(0.031)     1.588(0.038)     1.569(0.039)

Model (b)
            Logistic         Bridge           Lasso            Ridge
MCE     0.509(0.010)     0.477(0.013)     0.480(0.013)     0.494(0.010)
PSE     1.839(0.057)     1.621(0.040)     1.620(0.039)     1.645(0.054)
error MCE = 0.485, followed closely by the Bridge MCE = 0.488, and by the ridge
MCE = 0.514. The logistic regression has the greatest misclassification error MCE =
0.519. The reduction of MSE by the Bridge and the Lasso is about 38% from the logistic
regression. The reduction of PSE by the Bridge and the Lasso is about 11% from the
logistic regression. The reduction of MCE by the Bridge and the Lasso is about 6% from
the logistic regression.
In Model (b), the ridge achieves the smallest MSE = 1.931, followed by the Lasso MSE
= 2.149 and the Bridge MSE = 2.150. The logistic regression has the greatest MSE =
3.435. The Lasso achieves the smallest prediction error PSE = 1.620, followed closely by
the Bridge PSE = 1.621, and by the ridge PSE = 1.645. The logistic regression has the
greatest prediction error PSE = 1.839. The Bridge achieves the smallest misclassification
error MCE = 0.477, followed closely by the Lasso MCE = 0.480, and by the ridge MCE
= 0.494. The logistic regression has the greatest MCE = 0.509. The reduction of MSE
by the Bridge is 37% from the logistic regression. The reduction of PSE by the Bridge is
12% from the logistic regression. The reduction of MCE by the Bridge is about 6% from
the logistic regression.
It is shown in the above example that the Bridge penalization shrinks the estimator
towards 0 and achieves small mean squared error, small prediction error and small mis-
classification error for the logistic regression model. Therefore the Bridge estimator performs
well in prediction compared to the Lasso and the ridge estimators, and performs better
than the logistic regression estimator.
6.3 A Generalized Estimating Equations Model
We apply the Bridge penalization to the GEE via the penalized score equations approach
and compare the Bridge model with the GEE of no penalty, of the Lasso penalty and
of the ridge penalty for the following model of 20 subjects and 3 regressors. The matrix
X is the same as in the logistic regression model in the last section. Five binary responses
are generated for each subject with exchangeable positive correlation, and the covariates
remain the same for different observations within each subject. The correlated binary
responses are generated using a method by Lee (1993) for an exchangeable correlation
structure with positive pairwise correlation determined by a parameter 0 < ψ ≤ 1. As
ψ tends to 0, the Kendall coefficient τ converges to 1, while ψ = 1 corresponds to τ = 0
with independence as a special case. Here we choose ψ = 0.2 to generate the positively
correlated responses. The MSE, MCE and PSE at some randomly selected prediction
points are computed for each model. The PSE is defined to be the deviance averaged over
M = 20 randomly selected prediction points,

PSE = (1/M) Σ −2[ y_i log(μ̂_i) + (1 − y_i) log(1 − μ̂_i) ],

with the assumption that the prediction points are from independent subjects.
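The PSE above is the binomial deviance averaged over the M prediction points, which can be sketched as follows (the helper name is mine):

```python
import numpy as np

def pse(y, mu_hat):
    """PSE = (1/M) sum -2 [ y log(mu_hat) + (1 - y) log(1 - mu_hat) ],
    the binomial deviance averaged over M prediction points."""
    y = np.asarray(y, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    return np.mean(-2.0 * (y * np.log(mu_hat) + (1 - y) * np.log(1 - mu_hat)))

# A chance-level predictor mu_hat = 0.5 gives PSE = 2 log 2, about 1.386.
print(pse([0, 1, 1, 0], [0.5, 0.5, 0.5, 0.5]))
```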
The linear component of the GEE is
Since the Lasso penalty model and the ridge penalty model perform differently de-
pending on whether the true model has coefficients 0 or small but non-zero coefficients,
two sets of true β are selected to examine the shrinkage effect of the Bridge penalization:
Table 6.3: Model Comparison by Simulation of 100 Runs

Model (a)
            GEE              Bridge           Lasso            Ridge
MSE     5.190(0.349)     4.848(0.363)     4.845(0.363)     5.089(0.309)
MCE     0.341(0.010)     0.340(0.010)     0.340(0.010)     0.345(0.010)
PSE     1.244(0.019)     1.241(0.018)     1.241(0.018)     1.245(0.018)

Model (b)
            GEE              Bridge           Lasso            Ridge
MSE     4.807(0.295)     3.527(0.264)     3.527(0.264)     3.713(0.294)
MCE     0.460(0.012)     0.451(0.011)     0.451(0.011)     0.461(0.011)
PSE     1.413(0.013)     1.397(0.012)     1.397(0.012)     1.398(0.013)
β = (0, 2, −3)^T with intercept β_0 = 0 for Model (a), and β = (0.1, −0.1, 0.01)^T with
intercept β_0 = 0.1 for Model (b). Table 6.3 shows the parameter estimates, the standard
errors in parentheses, MSE, MCE and PSE. The standard errors of β̂_2 and β̂_3 are relatively
large in both Models (a) and (b) due to collinearity.
In Model (a), the Lasso achieves the smallest MSE = 4.845, followed closely by the
Bridge MSE = 4.848, and by the ridge MSE = 5.089. The GEE model has the greatest
MSE = 5.190. The Bridge and the Lasso achieve the smallest misclassification error MCE
= 0.340, followed by the GEE MCE = 0.341. The ridge has the greatest misclassification
error MCE = 0.345. The Bridge and the Lasso achieve the smallest prediction error PSE
= 1.241, followed by the GEE PSE = 1.244 and the ridge PSE = 1.245. The reduction
of MSE by the Bridge is 7% from the GEE. The reduction of MCE by the Bridge is 0.3%
from the GEE. The reduction of PSE by the Bridge is 0.25% from the GEE.
In Model (b), the Bridge and the Lasso achieve the smallest MSE = 3.527, followed
by the ridge MSE = 3.713. The GEE has the greatest MSE = 4.807. The Bridge and the
Lasso achieve the smallest misclassification error MCE = 0.451. The GEE
has MCE = 0.460. The Bridge and the Lasso achieve the smallest prediction error PSE
= 1.397, followed closely by the ridge PSE = 1.398. The GEE has the greatest prediction
error PSE = 1.413. The reduction of MSE by the Bridge and the Lasso is 27% from the
GEE. The reduction of MCE by the Bridge and the Lasso is 2% from the GEE.
The reduction of PSE by the Bridge and the Lasso is 1.2% from the GEE.
It is shown in the above example that the Bridge penalization shrinks the estimator
and achieves small mean squared error, misclassification error and prediction error for
the GEE model. The Bridge estimator performs well compared to the Lasso and ridge
estimators, and performs better than the GEE estimator.
6.4 A Complicated Linear Regression Model
In Section 6.1, a simple linear regression model was studied and the shrinkage effects of
different penalties, the OLS, the Bridge, the Lasso and the ridge, were compared in terms of
MSE and PSE with two typical sets of true parameters: one with zeros and the other with
small but non-zeros. In this section, we study the shrinkage effects of different penalties
on more complicated linear regression models with different correlation structures of the
regressors. The true parameters are generated from the prior distribution of the Bridge
penalty for different values of γ, as discussed in Section 2.7.
Model
We study a linear regression model of 10 regressors with sample size n = 30.
Ten regression matrices X_m, m = 1, …, 10, are generated from an orthonormal matrix X
of dimension 30 × 10 with different pairwise correlation coefficients {ρ}, generated from
a uniform distribution U(−1, 1).
Data
For each X_m, 30 true β_k, k = 1, …, 30, are generated, where each component of β_k is
generated from the Bridge prior π_γ(β) with λ = 1 and fixed γ ≥ 1. With each
X_m and β_k, 30 observations are generated from Y = X_m β_k + ε with iid normal random
errors ε_i from N(0, σ²) with a signal-noise ratio equal to 6. For the different penalty models,
the OLS, the Bridge, the Lasso and the ridge, the MSE and PSE are computed, with

PSE = ave (y_t − x_t^T β̂)²

averaged over 20 randomly selected points (x_t, y_t) generated from the same model, where
x_t, the covariate vector of each prediction point, consists of covariates having the same
correlation structure as X_m. Then the MSE and PSE are averaged over 50 replicates of
the model random error ε. Hence for each β_k generated from the prior distribution π_γ(β),
MSE and PSE are computed for the OLS, the Bridge, the Lasso and the ridge models.
Therefore 10 × 30 = 300 sets of MSE and PSE are computed. The above procedure is
repeated for different values of γ = 1, 1.5, 2, 3, 4.
Method
Since each set of MSE and PSE of the different penalties is computed from the same
β_k generated from π_γ(β), and their values vary in a large range with different β_k but
the differences between the models are relatively small, as shown in Figures 6.1-6.5, we
choose to compare the relative MSE_r and relative PSE_r to the OLS by setting the OLS
to be the baseline:

MSE_r = (MSE − MSE_OLS) / MSE_OLS

and

PSE_r = (PSE − PSE_OLS) / PSE_OLS.

It can be seen clearly from the plots of the MSE and PSE in the original scale that
the MSE's of the different penalty models are highly correlated, and so are the PSE's. It
is appropriate to compare the relative MSE and PSE rather than the original MSE and
PSE.
Result
For each fixed γ value, the means and their standard errors of the 300 sets of MSE_r and
PSE_r are computed and reported in Table 6.4. It is shown that for γ = 1 and 1.5, the
Bridge, the Lasso and the ridge have significant reductions of MSE and PSE from the OLS.
For γ = 1, the Bridge has the greatest reduction with MSE_r = −0.0860 and PSE_r =
−0.0021, followed closely by the Lasso with MSE_r = −0.0841 and PSE_r = −0.0020, and
followed by the ridge with MSE_r = −0.0595 and PSE_r = −0.0013. For γ = 1.5, the
ridge has the greatest reduction with MSE_r = −0.0566 and PSE_r = −0.0017, followed
by the Bridge with MSE_r = −0.0225 and PSE_r = −0.0009, and followed by the Lasso
with MSE_r = −0.0224 and PSE_r = −0.0009.
For γ = 2, 3 and 4, the ridge has a significant reduction of MSE and PSE from the
OLS, with MSE_r = −0.0519 and PSE_r = −0.0021 for γ = 2, MSE_r = −0.0566 and
PSE_r = −0.0016 for γ = 3, and MSE_r = −0.0577 and PSE_r = −0.0013 for γ = 4,
while both the Bridge and the Lasso have a significant increase of MSE and no significant
change of PSE from the OLS.
It is shown in Table 6.4 that the Bridge and the Lasso perform well for small γ values,
but not as well for large γ values. The ridge performs well for all of the γ values considered
here. It performs better than the Bridge and the Lasso for large values of γ (γ = 1.5,
2, 3 and 4), but not as well for the small γ value (γ = 1).

Table 6.4: Means and SE's of MSE_r and PSE_r for different γ

As discussed in Sections 5.6 and
2.7, a large value of γ generates small but non-zero regression parameters β for the model,
and a small value of γ generates large regression parameters β. It can thus be implied that
the Lasso performs well if the true model has large parameters, but performs poorly if
the true model has many small but non-zero parameters. Such a result agrees with the
results obtained in Sections 2.6 and 2.7. It also agrees with the results obtained through
intensive simulation in Tibshirani (1996). The Bridge demonstrates a similar effect to the
Lasso: it performs well for small γ values (γ = 1, 1.5), but does not for large γ values,
even though it can potentially select the best γ value.
In Figures 6.1 - 6.5 for fixed γ = 1, 1.5, 2, 3 and 4, respectively, on the right hand side are the box plots of the MSE_r and PSE_r, and on the left hand side are the plots of ten randomly selected sets of MSE and PSE in the original scale including the maximum and minimum. It is shown that the MSE's of different penalty models are highly correlated, and so are the PSE's. The values of the MSE's and the PSE's vary over a large range. It can be concluded that the comparison of MSE_r and PSE_r between different penalty models is appropriate rather than the comparison of the original MSE and PSE.
It is shown from the above results that Bridge regression achieves small MSE and PSE, and performs well compared to the Lasso and the ridge for linear regression models with large regression parameters, but may perform poorly if the true models have many small but non-zero parameters.
Summary of the Simulation Results
In summary, it can be concluded from the above simulation studies that the shrinkage estimators (the Bridge, the Lasso and the ridge) achieve smaller variance and better estimation than the non-shrinkage estimator by shrinking the parameters towards 0 with a little sacrifice of bias when collinearity is present in regression problems. For different cases, the shrinkage estimators of the Bridge, the Lasso and the ridge perform differently. In general, the Bridge estimator with a small value of γ, such as the Lasso estimator, tends to favor models with many zero parameters or models with large parameters, but does not perform well on models with many small but non-zero parameters in terms of estimation error and prediction error. The Bridge estimator with a large value of γ, such as the ridge estimator, tends to favor models with moderate parameters or models with many small but non-zero parameters, but does not perform as well as the Bridge and the Lasso estimators on models with many zero parameters or models with large parameters. However, the ridge estimator performs well for a wide range of γ values, which includes models with small but non-zero parameters and models with many zero parameters. Therefore, the ridge estimator is recommended in general to deal with the collinearity problem in regressions. In practice, one does not have much knowledge of the true model, so a training-and-testing method is recommended, as shown in Section 7.3 in the next chapter. This method randomly splits the data set into a training set and a test set, and builds several penalty models on the training set. Then the prediction errors on the test set are computed. The above procedure is repeated many times, and the averaged prediction errors of the penalty models over different random splits are compared. The model having the least prediction errors is selected to be an optimal model.
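The training-and-testing procedure just described can be sketched as follows. This is a minimal illustration, not the thesis's simulation code: the candidate estimators are stand-ins (OLS and ridge fits with fixed shrinkage values), and squared error on the test set replaces the deviance-based prediction error for simplicity.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate (X'X + lam*I)^{-1} X'y; lam = 0 gives OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def compare_by_random_splits(X, y, lams, n_splits=100, test_frac=0.3, seed=0):
    """Average test-set prediction error of each candidate shrinkage
    level over repeated random training/test splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(test_frac * n)
    avg = np.zeros(len(lams))
    for _ in range(n_splits):
        idx = rng.permutation(n)
        test, train = idx[:n_test], idx[n_test:]
        for k, lam in enumerate(lams):
            b = ridge_fit(X[train], y[train], lam)
            avg[k] += np.mean((y[test] - X[test] @ b) ** 2)
    return avg / n_splits  # the model with the smallest entry is selected

# toy example with two nearly collinear covariates
rng = np.random.default_rng(1)
z = rng.normal(size=200)
X = np.column_stack([z + 0.05 * rng.normal(size=200),
                     z + 0.05 * rng.normal(size=200)])
y = X @ np.array([1.0, 1.0]) + rng.normal(size=200)
avg = compare_by_random_splits(X, y, lams=[0.0, 1.0, 10.0])
print(avg)
```

The averaged errors are then compared across the candidate penalty models, exactly as in the procedure above.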
Figure 6.1: Simulation with true β generated from the Bridge prior with γ = 1. Left: ten randomly selected sets of MSE or PSE including the maximum and the minimum. Right: box plots of 300 sets of the relative MSE and PSE.
Figure 6.2: Simulation with true β generated from the Bridge prior with γ = 1.5. Left: ten randomly selected sets of MSE or PSE including the maximum and the minimum. Right: box plots of 300 sets of the relative MSE and PSE.
Figure 6.3: Simulation with true β generated from the Bridge prior with γ = 2. Left: ten randomly selected sets of MSE or PSE including the maximum and the minimum. Right: box plots of 300 sets of the relative MSE and PSE.
Figure 6.4: Simulation with true β generated from the Bridge prior with γ = 3. Left: ten randomly selected sets of MSE or PSE including the maximum and the minimum. Right: box plots of 300 sets of the relative MSE and PSE.
Figure 6.5: Simulation with true β generated from the Bridge prior with γ = 4. Left: ten randomly selected sets of MSE or PSE including the maximum and the minimum. Right: box plots of 300 sets of the relative MSE and PSE.
Chapter 7

Applications: Analyses of Health Data
In this chapter, I apply the Bridge penalty model to analyze several data sets obtained from public health studies to achieve good statistical results.
7.1 Analysis of Prostate Cancer Data
We apply Bridge regression to a prostate cancer data set. The data comes from a study by Stamey et al. (1989) to examine the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. The study had a total of 95 observations of male patients aged from 41 to 79 years. The covariates are log cancer volume (lcavol), log prostate weight (lweight), age, log of benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason) and percentage of Gleason scores 4 or 5 (pgg45). The data was later studied in Tibshirani (1996). A more detailed description of the data set can be found in either of the above papers.
Some linear correlation is present among the covariates as shown in the correlation coefficient matrix of X. The pairwise correlation coefficients are moderate, with the largest one, 0.752, between gleason and pgg45, and the next, 0.675, between lcavol and lcp. No strong linear relationship can be found through an examination of the condition number of the standardized covariate matrix X, the ratio of the greatest eigenvalue to the smallest eigenvalue of the matrix X'X, which is 16.9 for the covariates considered here.
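The condition number used here, the ratio of the largest to the smallest eigenvalue of X'X, can be computed directly; a minimal sketch:

```python
import numpy as np

def condition_number(X):
    """Ratio of the largest to the smallest eigenvalue of X'X,
    for a standardized covariate matrix X (thesis definition)."""
    eig = np.linalg.eigvalsh(X.T @ X)  # eigenvalues in ascending order
    return eig[-1] / eig[0]

# orthonormal columns give exactly 1; collinear columns blow the ratio up
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
print(condition_number(X))
```

A large value of this ratio signals strong collinearity among the columns of X.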
[Correlation matrix of the linear model for the prostate cancer data: variables lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45; the entries are not legible in this transcript.]

Two linear regression models are fitted to the centered data, one with no penalty, the other with the Bridge penalty. Table 7.1 shows the parameter estimates and their standard errors for the OLS model and the Bridge model. The OLS model has no vanishing coefficients, though some of them are not significant; for example, lcp, gleason and pgg45 are not significant. The Bridge estimator is obtained by the M-N-R or Shooting algorithm for each pair of fixed λ ≥ 0 and γ ≥ 1. The values of λ and γ are selected by the GCV as shown in Figure 7.1. A Lasso model with λ = 7.2 is selected. This Bridge model sets the coefficients of lcp and gleason to 0 and leaves no covariates with pairwise correlation coefficient greater than 0.6 in the model. The standard errors of the Lasso estimator were computed from 10000 bootstrap samples. It shows a much smaller standard error than the OLS due to the shrinkage effect.
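The bootstrap standard errors mentioned above can be sketched generically. Here `fit` is a placeholder for the estimator being resampled (the thesis uses the Lasso fit at the selected λ); OLS is used in the example only to keep the sketch self-contained.

```python
import numpy as np

def bootstrap_se(X, y, fit, n_boot=1000, seed=0):
    """Standard errors of fit(X, y) obtained by refitting on samples
    of the rows drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    reps = []
    for _ in range(n_boot):
        i = rng.integers(0, n, size=n)   # resample rows with replacement
        reps.append(fit(X[i], y[i]))
    return np.asarray(reps).std(axis=0, ddof=1)

# example with OLS standing in for the penalized estimator
ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
rng = np.random.default_rng(1)
X = rng.normal(size=(95, 3))
y = X @ np.array([0.7, 0.0, -0.1]) + rng.normal(size=95)
se = bootstrap_se(X, y, ols, n_boot=200)
print(se)
```

With a shrinkage estimator in place of `ols`, the resampled spread reflects the reduced variance noted in the text.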
While the OLS model yields a significant effect of the intercept, lcavol, lweight and svi, and a marginally significant effect of age and lbph, the Bridge model yields a significant effect of the intercept, lcavol, lweight and svi, and a marginally significant effect of lbph. The effect of age becomes non-significant in the Bridge model.
Figure 7.1: Selection of parameters λ and γ for the prostate cancer data.
Table 7.1: Estimates of the prostate cancer data
[Columns: OLS, Bridge (standard errors in parentheses); rows: intercept, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45. Legible entries include 2.478 (0.072) for the intercept in both models, together with 0.688 (0.108), 0.618 (0.090), 0.225 (0.084), -0.145 (0.082), 0.1?? (0.076) and -0.048 (0.046), whose alignment is not recoverable in this transcript.]
Two regressors, lcp and gleason, become zero in the Bridge model.

We compare the Bridge model with the model obtained from the subset selection by the leaps and bounds (L-B) method (Furnival and Wilson 1974, Seber 1977). The subset selection chooses the best model with the covariates lcavol, lweight, lbph and svi. The covariates age and pgg45 are in the Bridge model but not in the subset selection model. However, these two covariates are not significant at all. Therefore, the Bridge model agrees with the best model from the subset selection by the leaps and bounds method, as shown in Table 7.2.

Table 7.2: Comparison in model selection
[Columns: Predictor, OLS, Bridge, Subset(L-B); entries Y or N, where Y = significant effect in model and N = non-significant effect in model. Only scattered entries are legible in this transcript.]
Correlation matrix of the predictors of the kyphosis data

            age     age2    number   start
age        1.000   0.946   -0.023    0.059
age2       0.946   1.000   -0.004    0.076
number    -0.023  -0.004    1.000   -0.466
start      0.059   0.076   -0.466    1.000
7.2 Analysis of Kyphosis Data
We analyze the kyphosis data from a study of multiple level thoracic and lumbar laminectomy, a corrective spinal surgery commonly performed in children for tumor and congenital or developmental abnormalities such as syrinx, diastematomyelia and tethered cord. The study had 83 observations of children aged from 1 to 243 months. It was studied by Bell et al. (1994) and analyzed by Hastie and Tibshirani (1990) using the generalized additive model (GAM). A detailed description of this study can be found in Bell et al. (1994) or Hastie and Tibshirani (1990).
The outcome of this study is binary: either the presence (1) or the absence (0) of kyphosis. The predictors are age in months at the time of the operation, the starting vertebrae level and the number of vertebrae levels involved in the operation (start and number). The quadratic term age2 is also included to study the quadratic effect of age.

A strong linear relation can be observed from the correlation matrix. The coefficient between age and age2 is 0.946. The condition number of this matrix is 37.1, which also indicates that there exists a strong linear relationship among the covariates.
Two logistic regression models are fitted to the data: the no-penalty model and the
Figure 7.2: Selection of parameters λ and γ for the kyphosis data.
Bridge penalty model. A Lasso model with shrinkage parameter λ = 0.22 was selected via the GCV for the Bridge model, as shown in Figure 7.2. We compare the logistic regression model with the Bridge model in Table 7.3.
Table 7.3 shows the parameter estimates and their standard errors for both models. The standard errors of the Bridge estimates are obtained by the jackknife method (Shao and Tu 1995). It is shown that the logistic regression model yields a very significant effect of all the predictors considered. The age has an increasing-then-decreasing quadratic effect, the number has an increasing effect and the start has a decreasing effect. However, the Bridge model yields a very different result. It shrinks the estimate of number to non-significant, and the estimates of age and age2 to marginally significant. The effect of start and the intercept remain significant.

Table 7.3: Estimates of the kyphosis data (standard errors in parentheses)

            Logistic         Bridge(1)        Logistic(*)      Bridge(*,2)
intercept   -2.256 (0.546)   -2.249 (0.826)   -2.265 (0.534)   -2.258 (0.843)
age          4.863 (2.025)    4.418 (2.591)    4.714 (1.994)    4.204 (2.618)
age2        -4.513 (2.125)   -4.029 (2.812)   -4.077 (1.958)   -3.553 (2.636)
number       0.910 (0.427)    0.893 (0.763)    0.687 (0.381)    0.685 (0.680)
start       -1.016 (0.427)   -1.00? (0.??9)   -0.989 (0.343)   -0.998 (0.428)

*. Two outliers removed from the model: number = 14 or age = 243. 1. A Lasso model with λ = 0.22 is selected by the GCV. 2. A Lasso model with λ = 0.24 is selected by the GCV.

Hastie and Tibshirani (1990) fitted a GAM model on the entire data and obtained a quadratic age effect, an increasing number effect and a decreasing start effect. However, after removing two outliers from the model (one with number = 14, the other with age = 243), they ended with a best model, which yields a significant start effect and a marginally significant increasing-then-decreasing quadratic age effect. The effect of number becomes non-significant in the best GAM model. The result of the Bridge model agrees with that of the best GAM model in Hastie and Tibshirani (1990), and also agrees with the result reported in Bell et al. (1994).

To examine the robustness of the Bridge model, we further fit the logistic model and the Bridge model to the data with the two outliers removed. A Lasso model with λ = 0.24 is selected by the GCV for the Bridge model. As shown in Table 7.3, the logistic model shows a marginally significant effect of number with the outliers removed, while the Bridge model shows a non-significant effect of number. No major difference is thus observed in the model selection with the two outliers either included or excluded. The Bridge model is robust to the outliers in this data set. Therefore, it can be concluded that the Bridge penalty model performs very well for this kyphosis data.
7.3 Analysis of Environmental Health Data
In this section, we apply the Bridge penalty models to analyze a data set obtained from an environmental health study in Windsor, Ontario. The study was conducted from 1992 to 1993 to study the effect of air pollution on health. For years, there had been a concern over the air pollution in the Windsor area. The major source of the pollution is the industrial activity and municipal incineration in the urban region of Detroit, Michigan. The study was based on a population of asthmatics in Windsor, consisting of 39 subjects aged 12 years and older. Each subject had 21 records, 4 weeks apart, of asthmatic status and some variables assessing the quality of the air, for example, the ozone level, the carbon monoxide level, etc. The response of asthmatic status was recorded as the time interval in the evening during which the asthmatics suffered from the symptoms. For the analysis purpose, we dichotomize this variable and define a new response variable: Asthma Status = 1 if the night time interval is positive, or 0 otherwise. Therefore, we have a binary response variable, whether the asthmatics suffer from the symptoms or not, and a set of independent variables: the measures assessing the quality of the air, the mean temperature and the mean humidity, etc. There were 112 out of 819 total observations in which the asthmatics suffered. Since the 21 observations of each subject are correlated, the
Correlation matrix of the environmental health data

           clm.no  clm.no2  clm.trs  clm.oz  clm.co  clm.coh  clm.so2  mtemp   mhumd
clm.no      1.000   0.606    0.077   -0.458   0.414   0.660    0.275   -0.281  -0.025
clm.no2     0.606   1.000    0.109   -0.410   0.378   0.712    0.491   -0.160  -0.031
clm.trs     0.077   0.109    1.000    0.154  -0.028   0.079    0.040    0.065  -0.347
clm.oz     -0.458  -0.410    0.154    1.000  -0.149  -0.211   -0.052    0.705  -0.230
clm.co      0.414   0.378   -0.028   -0.149   1.000   0.692    0.379   -0.051   0.243
clm.coh     0.660   0.712    0.079   -0.211   0.692   1.000    0.484   -0.123   0.084
clm.so2     0.275   0.491    0.040   -0.052   0.379   0.484    1.000    0.035   0.006
mtemp      -0.281  -0.160    0.065    0.705  -0.051  -0.123    0.035    1.000  -0.084
mhumd      -0.025  -0.031   -0.347   -0.230   0.243   0.084    0.006   -0.084   1.000
generalized estimating equations (GEE) approach is adopted to study the relation between the asthmatic status and the pollutant factors.

Included in the GEE model are the following covariates: the closest measurement of nitrogen oxide (clm.no), nitrogen dioxide (clm.no2), total reduced sulphur (clm.trs), ozone (clm.oz), carbon monoxide (clm.co), coefficient of haze (clm.coh), sulphur dioxide (clm.so2), mean temperature (mtemp) and mean humidity (mhumd).
First, we examine the correlations between the covariates. Some collinearity is present as shown in the pairwise correlation coefficient matrix. The correlation coefficient is 0.712 between clm.coh and clm.no2, 0.705 between clm.oz and mtemp, etc. The condition number of the covariate matrix is 28.92, which indicates that there exists a moderate linear relation among these variables.
To compare the different penalty models, we split the data set into two: one training set and one test set. The test set consists of the observations of 9 randomly selected subjects, i.e. 9 × 21 = 189 observations. The 630 observations of the remaining 30 subjects are included in the training set.
Four GEE models of exchangeable working correlation structure with different penalties are fitted to the training set: no-penalty, the Bridge penalty, the Lasso penalty and the ridge penalty. For each penalty model, the shrinkage parameters are selected via the quasi-GCV. Then the prediction errors PSE and MCE of the selected model are computed at each point of the test set as

    PSE = Dev(y, ŷ),    MCE = I{|y − ŷ| ≥ 0.5},

where Dev is the deviance based on the marginal distribution, and I is the indicator function. The PSE and MCE are further averaged over different points of the test set.
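For a binary response y with fitted marginal probability ŷ, the two error measures can be sketched as follows; the binomial form of the marginal deviance is assumed here.

```python
import numpy as np

def prediction_errors(y, mu, eps=1e-12):
    """Mean binomial deviance (PSE) and mean misclassification
    indicator (MCE) over a test set, for binary response y and
    fitted probability mu."""
    mu = np.clip(mu, eps, 1 - eps)      # guard the logs
    dev = -2 * (y * np.log(mu) + (1 - y) * np.log(1 - mu))
    mce = (np.abs(y - mu) >= 0.5).astype(float)
    return dev.mean(), mce.mean()

y = np.array([1, 0, 1, 0])
mu = np.array([0.9, 0.2, 0.4, 0.6])
pse, mce = prediction_errors(y, mu)
print(pse, mce)  # mce = 0.5: the last two points are misclassified
```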
For each single split, the PSE and MCE computed as above depend on the split. To compare the different penalty models, one needs to repeat the above procedure for many different splits, and compare the prediction errors averaged over different random splits of the data set.
To assess the performance of the different models, we repeat the above procedure for 100 random splits of the data set. Table 7.4 shows the mean prediction errors and the standard errors over 100 random splits of the data set. The relative PSE and MCE to the baseline of the no-penalty GEE model, defined as

    PSE_r = (PSE − PSE_GEE) / PSE_GEE   and   MCE_r = (MCE − MCE_GEE) / MCE_GEE,
are also reported to examine the reduction of the prediction error from the no-penalty GEE model. It is shown that the ridge penalty model has a significant reduction of both PSE and MCE from the no-penalty model, while the Bridge and the Lasso have a significant reduction of MCE but no significant change of PSE. Figure 7.3 shows the box plots of MCE and PSE in the original scale for the different penalty models and the relative MCE and PSE. It is clearly shown that the ridge penalty model achieves the smallest MCE and PSE and thus performs the best in terms of prediction. It is also shown that the Bridge and the Lasso penalty models achieve better prediction in terms of MCE than the no-penalty GEE model, but not in terms of PSE. Therefore, it can be concluded that the ridge penalty model achieves the best prediction for this data set.

Table 7.4: Comparison of prediction errors on test data over 100 random splits (SEs in parentheses)

        GEE              Bridge           Lasso            Ridge
PSE     35.928 (2.818)   34.749 (2.946)   37.652 (2.996)   23.008 (2.430)
MCE      0.370 (0.028)    0.313 (0.029)    0.319 (0.030)    0.195 (0.019)

Having studied the performance of the different penalty models in terms of prediction errors, we compare the four different penalty GEE models with exchangeable correlation structure on the entire data set. Table 7.5 shows the estimates and the standard errors of the different models. Since a Lasso penalty model with λ = 0.5 is selected for the Bridge penalty model by the quasi-GCV as shown in Figure 7.4, the Lasso model is virtually the same as the Bridge model. A ridge penalty model with λ = 0.4 is selected by the quasi-GCV. The standard errors for the no-penalty GEE model are computed with the
Figure 7.3: Comparison of prediction errors on test data by box plots.
Figure 7.4: Selection of parameters λ and γ for the environmental health data.
"sandwich" estimator (4.2), while the standard errors for the other penalty models are computed with the jackknife method (Shao and Tu, 1995). The jackknife is applied on the subjects rather than on the observations since the subjects are independent but the observations are not.
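The subject-level (leave-one-subject-out) jackknife can be sketched as follows; `fit` is a placeholder for the penalized GEE fit, which is not reproduced here, and OLS is used in the example only so the sketch runs on its own.

```python
import numpy as np

def jackknife_se_by_subject(groups, X, y, fit):
    """Leave-one-subject-out jackknife standard errors.  `groups`
    labels each row with its subject; whole subjects are deleted so
    that the deleted units are independent."""
    ids = np.unique(groups)
    m = len(ids)
    reps = np.array([fit(X[groups != g], y[groups != g]) for g in ids])
    return np.sqrt((m - 1) / m * ((reps - reps.mean(axis=0)) ** 2).sum(axis=0))

# example: 30 subjects with 21 observations each, OLS as the stand-in fit
ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(30), 21)
X = rng.normal(size=(630, 2))
y = X @ np.array([0.5, -0.3]) + rng.normal(size=630)
se = jackknife_se_by_subject(groups, X, y, ols)
print(se)
```

Deleting subjects rather than single rows keeps the resampled units independent, which is the point made in the text.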
The no-penalty GEE model has only one significant effect, i.e. the effect of clm.trs. The negative effect of clm.trs is not satisfactory since it is known to be an air pollutant and is expected to have a positive effect. Therefore, the no-penalty GEE model does not yield a meaningful result. The Bridge and the Lasso models set the effects of clm.no, clm.oz, clm.co, clm.coh, clm.so2 and mtemp to zero. No effects in the Bridge and the Lasso models are significant. Therefore, both models fail to explain the variation of the
Table 7.5: Estimates of the environmental health data
[Columns: GEE, Bridge(1), Lasso(2), Ridge(3); rows: intercept and the covariates listed above. The entries are not legible in this transcript. 1. A Lasso model (γ = 1) with λ = 0.5 is selected. 2. A Lasso model with λ = 0.5 is selected. 3. A ridge model with λ = 0.4 is selected. *. Significant effect.]
response variable and thus are not satisfactory. However, the ridge penalty model yields a very different result. The negative significant effect of the total reduced sulphur (clm.trs) in the no-penalty GEE model becomes insignificant. The ridge penalty model yields a positive significant effect of the coefficient of haze (clm.coh), which is not significant in the no-penalty GEE model. All of the other covariates do not have a significant effect. Although this ridge penalty GEE model still fails to detect the effects of many pollutant factors from the data, it certainly supplies information on the significance of the contribution of the coefficient of haze to the asthmatic status. The positive significant effect means that the larger the coefficient of haze (or the more severe the haze), the more likely the asthmatics suffer from the polluted environment. This result is achieved only with the ridge penalty model, which has the smallest prediction errors among the different penalty GEE models as shown in the previous comparison based on prediction errors with random splits of the data set.
Overall, the GEE model with the ridge penalty achieves better prediction by shrinking the regression parameters, and yields a more meaningful result than the no-penalty GEE model for this environmental health study. Even though further investigation is still needed for the ridge penalty model in order to capture more information on the significance of the effects of pollutants other than the coefficient of haze, the ridge penalty model captures more information than the no-penalty GEE model and makes the interpretation of the model parameters more meaningful. It is demonstrated through this analysis that the penalized GEE model is very important and is potentially a good approach to handle collinearity among covariates of the GEE models.
Chapter 8
Discussions and Future Studies
8.1 Discussion
Regression is a widely used statistical tool for quantitative analysis in scientific research. Collinearity is a problem associated with regression. It influences estimation and prediction, and thus has a large impact on research.

Although there are many methods dealing with collinearity, for example, principal component analysis, the shrinkage model is still an important method, which yields a simple and easy-to-interpret linear or generalized linear regression model.
Bridge regression, as a special family of penalized regressions with two very important members, ridge regression and the Lasso, plays an important role in handling the collinearity problem. It yields small variance of the estimator and achieves good estimation and prediction by shrinking the estimator towards 0.
The simple and special structure of the Bridge estimators for γ ≥ 1 makes the computation very simple. The modified Newton-Raphson method for γ > 1 and the Shooting method for γ = 1 were developed based on the theoretical results on the structure of the Bridge estimators. Particularly, the Shooting method for the Lasso benefits from the theoretical result that the Lasso estimator is the limit of the Bridge estimator as γ tends to 1 from above. It has a very simple closed form at each single step, and a simple iteration leads to fast convergence. These properties make it very attractive computationally, as can be seen from the simple and concise programming code in Appendix A. In contrast, the combined quadratic programming method by Tibshirani (1996) has a finite-step (2p) convergence, and potentially has an even better convergence rate (0.5p to 0.75p) as pointed out by Tibshirani (1996). In addition, the tuning parameter of the combined quadratic programming method has a range of [0, 1] and is easy to optimize via grid search, while the Shooting method has no such standardized range, even though it has a threshold λ0 > 0 such that any tuning parameter λ ≥ λ0 sets the Lasso estimates β̂_j = 0 for j = 1, ..., p (Gill, Murray and Wright 1981). We believe that the Shooting method has a convergence rate of order p log(p), although a theoretical result on the order has not been obtained. It is easy to see that for orthogonal X, only p steps are required to solve the p independent equations in (P9) by the Shooting method. Both the modified Newton-Raphson method and the Shooting method can be applied to generalized linear models via the IRLS procedure without extra effort.
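The coordinate-wise character of the Shooting method can be sketched as follows. This is a minimal illustration, not the thesis's Appendix A code: it minimizes ||y − Xβ||² + λ Σ|β_j| by cyclic updates, each with the closed-form soft-threshold solution of the corresponding penalized score equation (threshold λ/2 under this parameterization).

```python
import numpy as np

def shooting_lasso(X, y, lam, n_iter=200):
    """Shooting method for the Lasso: cycle through the coordinates,
    solving each one-dimensional penalized problem in closed form."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)            # X_j' X_j
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]  # partial residual without b_j
            s = X[:, j] @ r_j
            b[j] = np.sign(s) * max(abs(s) - lam / 2, 0.0) / col_ss[j]
    return b

# a large enough lam (beyond the threshold lam0) sets every estimate to 0
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + 0.1 * rng.normal(size=50)
b1 = shooting_lasso(X, y, lam=1.0)
b2 = shooting_lasso(X, y, lam=1e6)
print(b1)
print(b2)   # all zeros
```

For orthogonal X the coordinates decouple, so a single pass over the p coordinates already solves the problem, matching the remark above.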
The classical approach to penalization depends on joint likelihood functions, and is thus limited to the cases in which a joint likelihood function exists. However, cases in which there does not exist a joint likelihood function have a broad range of applications in many areas of scientific research, for example, the GEE as discussed in Chapter 4. The classical approach to penalization does not apply in these cases even though penalization is desirable due to highly correlated regressors. The penalized score equations introduced in Chapter 5 generalize penalization to be independent of joint likelihood functions and yield a shrinkage estimator, which has the same properties as the estimators of penalized regressions with joint likelihood. Therefore the penalized score equations provide a technique which enables penalization to be applied to cases in which no joint likelihood function exists, such as the GEE.
As a new concept, the penalized score equations not only provide the techniques to handle the collinearity problem in the GEE, but also suggest different types of penalization. They present penalization in a different way compared to the constrained regressions. The former emphasize the solutions of the equations of (P3) as shown in Figure 2.1, while the latter emphasize the constrained area as shown in Figure 1.1. One can then consider many types of penalty and comprehend the structure of the estimators from the penalized score equations by studying the solutions of (P3) as in Figure 2.1.
The generalized estimating equations approach is an important statistical method in longitudinal studies. The consistency of the GEE estimator and the working covariance structure make it very attractive in longitudinal studies. However, the correlation structure may induce the non-existence of a joint likelihood function, which hinders the implementation of penalization in the classical way via joint likelihood functions. The penalized GEE is a method of applying a penalty to the GEE structure via the penalized score equations. It circumvents the difficulty of the non-existence of joint likelihood functions. Therefore, the penalized score equations provide a theoretical support to the penalized GEE.
The generalized cross-validation (GCV) method was proposed initially to optimize the tuning parameter of smoothing splines, which are linear operators. This technique is borrowed here to select the shrinkage parameters λ and γ, as suggested by Tibshirani (1996) for the Lasso. It is evidently true in the literature that the GCV method works well for linear operators, including ridge regression. The simulation results of the linear regression model in Chapter 6 show that the GCV does not always select the best value of γ for the Bridge regression model, even though Bridge regression has the potential to select γ from a wide range [1, ∞). The following facts may partially but not completely explain why the GCV does not select the best γ.
1. The Bridge operator is non-linear for γ ≠ 2. This can be seen clearly from (5.1) since the matrix D in (5.1) is a function of β. The non-linearity of the Bridge operator can be seen visually in Figure 2.3 for the special case of an orthonormal matrix. Since the Bridge operator (γ ≠ 2) performs very differently from the ridge operator (γ = 2) or the OLS operator (λ = 0), the linear approximation to the Bridge operator as in the GCV definition (5.2) does not yield the best γ value for the model selection.
2. The range of the γ value is limited to [1, ∞). As the simulation results show, the Bridge yields very similar results to those of the Lasso in many cases. In certain cases, the GCV achieves the minimum at γ = 1 as shown in Figure 7.1. This may be due to the truncation of the range of γ at γ = 1. A value of γ less than 1 might be selected by the GCV if the range [0, 1) of γ were also considered in the Bridge model. Hence the truncation of the range of γ at γ = 1 may contribute to the frequent selection of the Lasso (γ = 1) by the GCV.
Because of the above reasons, it is not a surprise that the Bridge model does not always perform the best in estimation and prediction compared to the other shrinkage models, the Lasso and the ridge. Therefore, new optimization techniques are desirable, especially for non-linear operators.
The quasi-GCV, motivated by correlated Gaussian responses, incorporates the working correlation structure of the correlated responses into the deviance residuals and yields a weighted deviance. The weighted deviance reduces to the deviance for independent responses, and accordingly, the quasi-GCV reduces to the GCV. By incorporating the working correlation structure, the quasi-GCV achieves the same effect in model selection for the penalized GEE as the GCV for generalized linear models. It selects the shrinkage parameters of the penalized GEE in a very easy and simple way, and yields good estimation and prediction. The quasi-GCV generalizes the GCV to correlated responses and performs well in model selection.
The effective number of degrees of freedom of the correlated observations, motivated by the correlated Gaussian responses, takes the correlation structure into consideration. It corrects the total degrees of freedom of the data from the total number of observations to a reduced degree for positively correlated observations. It reflects the effect of the correlation structure on the observations and thus closely captures the intrinsic relationship among the observations. Since the degrees of freedom play an important role in statistical inference and model adequacy checking, the correction of the degrees of freedom by the effective number of degrees of freedom is expected to have some effect on the interpretation of the GEE model.
Due to the lack of joint likelihood functions, many statistical procedures based on joint likelihood, for example, the likelihood ratio test, do not apply to the GEE models. Thus the standard errors of the regression estimates become more important for inference. However, the complexity of the estimates of the penalty models makes it very hard to obtain simple formulas for the standard errors of the Bridge estimators. Very often, the standard errors are computed from some semi-parametric methods, like the bootstrap or the jackknife, which rely on large sample theory. Caution must be used when the observations are correlated, or when the number of independent units is not large.
8.2 Future Studies
It has been shown theoretically that the Bridge estimator has a simple structure and can be computed via simple and efficient algorithms. It has also been demonstrated through simulation studies that the Bridge estimator performs well in terms of estimation and prediction for linear regression models, generalized linear models and GEE models. There are still many interesting aspects of the Bridge estimator and other shrinkage estimators which need further investigation. They are summarized as follows.
1. Theoretical results on asymptotic consistency.
It is not known whether the Bridge estimator, especially of the penalized GEE, is asymptotically consistent, although we believe the asymptotic consistency is true. It is also not known how an incorrect specification of the correlation structure influences the consistency of the Bridge estimator and the selection of the shrinkage parameters of the penalized GEE. Therefore, it is of great importance to study the asymptotic consistency of the Bridge estimator of the penalized score equations in general, and particularly to investigate how incorrect specification of the correlation structure of the GEE influences the selection of the shrinkage parameters.
2. New model selection methods, especially for non-linear operators.
As discussed in the last section, the GCV method was initially introduced to select the tuning
parameters for smoothing splines, which are linear operators. Since the Bridge operator
is non-linear, the GCV does not always select the best value of γ for the Bridge model.
It is desirable to develop methods to select the shrinkage parameter for non-linear
operators, particularly for the Bridge penalty.
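For a linear operator the GCV criterion has a closed form, which is what breaks down in the non-linear case. The sketch below (a Python illustration, not part of the thesis; the name `gcv_ridge` is ours) shows the quantity being minimized for ridge regression, where the hat matrix H(λ) = X(XᵀX + λI)⁻¹Xᵀ is linear in y and tr(H) plays the role of the effective degrees of freedom; no such hat matrix exists for the Bridge operator.

```python
import numpy as np

def gcv_ridge(X, y, lam):
    """GCV score for ridge regression: fitted values are H(lam) y for a
    fixed linear operator H, and tr(H) is the effective degrees of freedom."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y
    edf = np.trace(H)
    return (resid @ resid / n) / (1.0 - edf / n) ** 2

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, 0.0, 1.0, 0.0]) + rng.normal(size=50)
lams = [0.01, 0.1, 1.0, 10.0]
scores = [gcv_ridge(X, y, lam) for lam in lams]
best_lam = lams[int(np.argmin(scores))]
```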
3. The Bridge penalty model with γ < 1.
The Bridge penalty with γ < 1 is not considered in this thesis. It is also of great interest to
investigate this case. Some difficulties are expected, for example, multiple solutions
of the penalized score equations, so much work needs to be done both theoretically
and computationally.
4. Other types of penalties.
As discussed in Chapter 3, other types of penalties can be considered in the form of the
penalized score equations. It can be expected that different types of penalties may yield
different structures of the estimators, and may lead to different yet interesting models
and results.
Overall, further investigation is needed in the near future to comprehend this inter-
esting topic of penalization in statistical modelling.
References
Bell, D.F., Walker, J.L., O'Connor, G. and Tibshirani, R. (1994). Spinal deformity after
multiple-level cervical laminectomy in children. Spine, 19(4), 406-411.
Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions. Numerische
Mathematik, 31, 377-403.
Diggle, P.J., Liang, K.-Y. and Zeger, S.L. (1994). Analysis of Longitudinal Data. Claren-
don, Oxford.
Durrett, R. (1991). Probability: Theory and Examples. Wadsworth, Belmont.
Efron, B. and Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman and
Hall, New York.
Frank, I.E. and Friedman, J.H. (1993). A statistical view of some chemometrics regression
tools. Technometrics, 35(2), 109-148.
Furnival, G.M. and Wilson, R.W., Jr. (1974). Regressions by leaps and bounds. Techno-
metrics, 16, 499-511.
Gill, P.E., Murray, W. and Wright, M.H. (1981). Practical Optimization. Academic Press,
London.
Green, P.J. (1984). Iteratively reweighted least squares for maximum likelihood estimation,
and some robust and resistant alternatives (with discussion). Journal of the Royal Statistical
Society, B 46, 149-192.
Hastie, T.J. and Tibshirani, R.J. (1990). Generalized Additive Models. Chapman and
Hall, New York.
Hocking, R.R. (1996). Methods and Applications of Linear Models: Regression and the
Analysis of Variance. Wiley, New York.
Hoerl, A.E. and Kennard, R.W. (1970a). Ridge regression: biased estimation for nonorthog-
onal problems. Technometrics, 12(1), 55-67.
Hoerl, A.E. and Kennard, R.W. (1970b). Ridge regression: applications to nonorthogonal
problems. Technometrics, 12(1), 69-82.
Laird, N.M. and Ware, J.H. (1982). Random-effects models for longitudinal data. Biomet-
rics, 38, 963-974.
Lawson, C. and Hanson, R. (1974). Solving Least Squares Problems. Prentice-Hall.
Lee, A.J. (1993). Generating random binary deviates having fixed marginal distributions
and specified degrees of association. The American Statistician, 47(3), 209-215.
Li, B. and McCullagh, P. (1994). Potential functions and conservative estimating functions.
The Annals of Statistics, 22(1), 340-356.
Liang, K.-Y. and Zeger, S.L. (1986). Longitudinal data analysis using generalized linear
models. Biometrika, 73, 13-22.
Liang, K.-Y., Zeger, S.L. and Qaqish, B. (1992). Multivariate regression analyses for cate-
gorical data (with discussion). Journal of the Royal Statistical Society, B 54, 3-40.
McCullagh, P. (1991). Quasi-likelihood and estimating functions. In Statistical Theory
and Modelling: In Honour of Sir David Cox (D.V. Hinkley, N. Reid and E.J. Snell, eds.),
265-268. Chapman and Hall, London.
McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models (2nd ed.). Chapman
and Hall, London.
Nelder, J.A. and Wedderburn, R.W.M. (1972). Generalized linear models. Journal of the
Royal Statistical Society, A 135, 370-384.
Seber, G.A.F. (1977). Linear Regression Analysis. Wiley, New York.
Sen, A. and Srivastava, M. (1990). Regression Analysis: Theory, Methods, and Applications.
Springer, New York.
Shao, J. (1993). Linear model selection by cross-validation. Journal of the American
Statistical Association, 88, 486-494.
Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. Springer, New York.
Stamey, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E. and Yang, N.
(1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the
prostate. II. Radical prostatectomy treated patients. Journal of Urology, 141, 1076-1083.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Jour-
nal of the Royal Statistical Society, B 36, 111-147.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the
Royal Statistical Society, B 58(1), 267-288.
Wahba, G. (1990). Spline Models for Observational Data. Society for Industrial and
Applied Mathematics, Philadelphia.
Wedderburn, R.W.M. (1974). Quasi-likelihood functions, generalized linear models, and
the Gauss-Newton method. Biometrika, 61, 439-447.
Zeger, S.L. and Liang, K.-Y. (1986). Longitudinal data analysis for discrete and continuous
outcomes. Biometrics, 42, 121-130.
Zhang, P. (1992). On the distributional properties of model selection criteria. Journal of
the American Statistical Association, 87, 732-737.
Appendix A
A FORTRAN Subroutine of the
Shooting Method for the Lasso
In this appendix, we provide a FORTRAN subroutine of the Shooting method for the
Lasso estimator. This subroutine is self-contained and can be called from any FORTRAN
or S+ program.
Variables called in the subroutine:
N - sample size:
P - number of regressors;
X - regression matrix of dimension n x p;
Y - response variable of dimension n x 1;
B - regression parameters of dimension p x 1;
    input - the OLS estimates; output - the Lasso estimates;
LAM - the tuning parameter X for the Lasso penalty;
EPS - the threshold of the convergence, about 1.E-12.
Matrices used for working space:
BB, B0 - matrices of dimension p x 1;
XI, XB, YXB - matrices of dimension n x 1; S - matrix of dimension 1 x 1.
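Before the listing, it may help to record the update the inner loop computes. Writing S_j for the quantity S(1,1) in the code and ||x_j||^2 for NORM2(XI,N), the soft-thresholding step assigned to JUNK is the closed-form minimizer of the Lasso objective in the j-th coordinate with all other coordinates held fixed:

```latex
\hat\beta_j =
\begin{cases}
\dfrac{2S_j-\lambda}{2\|x_j\|^2}, & 2S_j > \lambda,\\[6pt]
\dfrac{2S_j+\lambda}{2\|x_j\|^2}, & 2S_j < -\lambda,\\[6pt]
0, & |2S_j| \le \lambda,
\end{cases}
\qquad
S_j = x_j^{\mathsf T}\Bigl(y-\sum_{k\ne j}x_k\beta_k\Bigr).
```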
      SUBROUTINE SHOOT(X,Y,B,N,P,LAM,EPS,BB,B0,XI,XB,YXB,S)
C     SHOOTING METHOD FOR THE LASSO. THE SUBROUTINE NAME AND THE
C     DECLARATIONS FOLLOW THE VARIABLE LIST ABOVE; WORKSPACE ARRAYS
C     ARE PASSED AS ARGUMENTS.
      INTEGER N, P, I, J, II, KK
      DOUBLE PRECISION X(N,P), Y(N,1), B(P,1), LAM, EPS
      DOUBLE PRECISION BB(P,1), B0(P,1), XI(N,1), XB(N,1), YXB(N,1)
      DOUBLE PRECISION S(1,1), JUNK, NORM2, DIST
      DO 1000 KK = 1, 1000
        DO 1 II = 1, P
    1   B0(II,1) = B(II,1)
        DO 10 I = 1, P
          DO 20 J = 1, P
            IF (J .EQ. I) THEN
              BB(J,1) = 0.
            ELSE
              BB(J,1) = B(J,1)
            END IF
   20     CONTINUE
          CALL MATM(XB,X,BB,N,P,1)
          CALL MATS(YXB,Y,XB,N,1)
          CALL MATSUBCOL(XI,X,N,P,I)
          CALL MATTM(S,XI,YXB,1,N,1)
          IF (-2.*S(1,1) .LT. -LAM) THEN
            JUNK = (2.*S(1,1)-LAM)/2./NORM2(XI,N)
          ELSE IF (-2.*S(1,1) .GT. LAM) THEN
            JUNK = (2.*S(1,1)+LAM)/2./NORM2(XI,N)
          ELSE
            JUNK = 0.
          END IF
          DO 5 II = 1, P
            IF (II .NE. I) THEN
              B(II,1) = B(II,1)
            ELSE
              B(II,1) = JUNK
            END IF
    5     CONTINUE
   10   CONTINUE
        IF (DIST(B,B0,P) .LT. EPS) GOTO 5000
 1000 CONTINUE
 5000 RETURN
      END
      FUNCTION NORM2(X,N)
C     SUM OF SQUARES OF THE N X 1 MATRIX X
      INTEGER N, I
      DOUBLE PRECISION NORM2, X(N,1)
      NORM2 = 0.
      DO 50 I = 1, N
        NORM2 = NORM2 + (X(I,1))**2
   50 CONTINUE
      RETURN
      END
      FUNCTION DIST(X,Y,N)
C     EUCLIDEAN DISTANCE BETWEEN THE N X 1 MATRICES X AND Y
      INTEGER N, I
      DOUBLE PRECISION DIST, X(N,1), Y(N,1)
      DIST = 0.
      DO 500 I = 1, N
        DIST = DIST + (X(I,1)-Y(I,1))**2
  500 CONTINUE
      DIST = SQRT(DIST)
      RETURN
      END
      SUBROUTINE MATM(A,B,C,N,M,P)
C     MATRIX MULTIPLICATION A = B*C
      INTEGER N, M, P, I, J, K
      DOUBLE PRECISION A(N,P), B(N,M), C(M,P)
      DO 10 I = 1, N
        DO 20 J = 1, P
          A(I,J) = 0.
          DO 30 K = 1, M
            A(I,J) = A(I,J) + B(I,K)*C(K,J)
   30     CONTINUE
   20   CONTINUE
   10 CONTINUE
      RETURN
      END
      SUBROUTINE MATTM(A,B,C,N,M,P)
C     MATRIX MULTIPLICATION A = T(B)*C, T(B) - TRANSPOSE OF B
      INTEGER N, M, P, I, J, K
      DOUBLE PRECISION A(N,P), B(M,N), C(M,P)
      DO 10 I = 1, N
        DO 20 J = 1, P
          A(I,J) = 0.
          DO 30 K = 1, M
            A(I,J) = A(I,J) + B(K,I)*C(K,J)
   30     CONTINUE
   20   CONTINUE
   10 CONTINUE
      RETURN
      END
      SUBROUTINE MATS(A,B,C,N,P)
C     MATRIX SUBTRACTION A = B-C
      INTEGER N, P, I, J
      DOUBLE PRECISION A(N,P), B(N,P), C(N,P)
      DO 100 I = 1, N
        DO 200 J = 1, P
          A(I,J) = B(I,J) - C(I,J)
  200   CONTINUE
  100 CONTINUE
      RETURN
      END
      SUBROUTINE MATSUBCOL(XI,X,N,P,I)
C     COPY COLUMN I OF X INTO THE N X 1 MATRIX XI
      INTEGER N, P, I, J
      DOUBLE PRECISION XI(N,1), X(N,P)
      DO 600 J = 1, N
        XI(J,1) = X(J,I)
  600 CONTINUE
      RETURN
      END
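For readers without a FORTRAN compiler, the same algorithm can be sketched in Python (this transliteration is not part of the thesis; the name `shooting_lasso` is ours). It mirrors the FORTRAN subroutine above: start from the OLS estimate, cycle through the coordinates, soft-threshold each one against the partial residual, and stop when successive iterates differ by less than EPS.

```python
import numpy as np

def shooting_lasso(X, y, lam, eps=1e-12, max_iter=1000):
    """Shooting method for the Lasso penalty lam * sum |b_j|: coordinate-wise
    soft-thresholding, starting from the OLS estimate."""
    n, p = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]    # input: the OLS estimates
    for _ in range(max_iter):
        b_old = b.copy()
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]      # residual excluding x_j
            s = X[:, j] @ r                     # S(1,1) in the FORTRAN code
            nj = X[:, j] @ X[:, j]              # NORM2(XI,N)
            if 2.0 * s > lam:
                b[j] = (2.0 * s - lam) / (2.0 * nj)
            elif 2.0 * s < -lam:
                b[j] = (2.0 * s + lam) / (2.0 * nj)
            else:
                b[j] = 0.0
        if np.linalg.norm(b - b_old) < eps:     # DIST(B,B0,P) .LT. EPS
            break
    return b

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
beta = np.array([3.0, 0.0, 0.0, 1.5, 0.0, 0.0])
y = X @ beta + 0.5 * rng.normal(size=100)
b_hat = shooting_lasso(X, y, lam=50.0)
```

Large coefficients are shrunk by roughly lam/(2 ||x_j||^2) while small ones are set exactly to zero, which is the selection behaviour of the Lasso.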
Appendix B
Mathematical Proof
In this appendix, I give the mathematical proofs of Theorems 1, 2 and 3. The proof of Theorem 4
is the same as that of Theorem 3, and is therefore omitted.
Let F = (F_1, ..., F_p), where F_j = S_j(β, X, y) + d(β_j, λ, γ), j = 1, ..., p.
Lemma 1  Given λ > 0, γ > 1. If the Jacobian (∂S/∂β) is positive-semi-definite, then
(∂F/∂β) is positive-definite at β_j ≠ 0, j = 1, ..., p.
Proof  Observe that
        ∂F/∂β = ∂S/∂β + D(β, λ, γ),
where D(β, λ, γ) = diag(λγ(γ − 1)|β_j|^{γ−2}). D is positive definite for γ > 1 and β_j ≠ 0,
j = 1, ..., p. This completes the proof.
Lemma 2  Given λ > 0. The function −d(β_j, λ, γ) = −λγ|β_j|^{γ−1} sign(β_j) converges to
the Heaviside function −λ sign(β_j) at β_j ≠ 0 as γ → 1+.
Proof  It is obvious by observing that the function d is continuous in γ at β_j ≠ 0.
Proof of Theorem 1.
1. First, it is easy to prove the existence of the solution of (P3) by mathematical induction
on the dimension p, and that the solution is almost surely non-zero. Secondly, the conditions
of the Implicit Function Theorem are satisfied by Lemma 1. Therefore, there exists a
unique solution β̂(λ, γ) satisfying (P3), and β̂(λ, γ) is continuous in (λ, γ).
2. We prove the existence of the limit of β̂(λ, γ) as γ → 1+ by mathematical induction.
(1). p = 1. If there is an intersection of the functions S(β, λ, γ) and −d(β, λ, γ), by the continuity
of these two functions and Lemma 2, lim_{γ→1+} β̂(λ, γ) exists and is equal to the coordinate
of the intersection. If there is no intersection of the functions S(β, λ, γ) and −d(β, λ, γ), it is
easy to prove that lim_{γ→1+} β̂(λ, γ) = 0. Therefore, the result holds for p = 1.
(2). In the remainder of the proof, we omit λ from the expressions because it is kept
constant. Assume that the result holds for all dimensions 1, ..., (p − 1). We prove that
it also holds for dimension p. Consider the sub-problem formed by the first p − 1 equations
of (P3) for fixed β_p. By the assumption, ∂S/∂β > 0, which implies that
∂(S_1, ..., S_{p−1})/∂(β_1, ..., β_{p−1}) > 0.
Then the result of Theorem 1 holds for this (p − 1)-dimensional sub-problem for fixed β_p.
Therefore, the limit of the unique solution (β̂_1(β_p, γ), ..., β̂_{p−1}(β_p, γ)) of this sub-problem
exists as γ → 1+. Substitute this solution into the last equation of (P3):
        S_p(β̂_1(β_p, γ), ..., β̂_{p−1}(β_p, γ), β_p) + d(β_p, γ) = 0.        (B.1)
Then we need to prove that this equation has a unique solution β̂_p(γ) of which the limit
exists as γ → 1+. Denote the first term of the left-hand side of (B.1) by L(β_p, γ).
By the chain rule,
        ∂L/∂β_p = (∂S_p/∂β_1)(∂β̂_1/∂β_p) + ··· + (∂S_p/∂β_{p−1})(∂β̂_{p−1}/∂β_p) + ∂S_p/∂β_p.
By the Implicit Function Theorem applied to the (p − 1)-dimensional sub-problem, the partial
derivatives ∂β̂_j/∂β_p, j = 1, ..., p − 1, satisfy the linear system (B.2) obtained by
differentiating the first p − 1 equations of (P3) with respect to β_p.
From (B.2), one can easily show that ∂L/∂β_p ≥ 0 by simple calculation in linear algebra.
Therefore, there exists a unique solution β̂_p(γ) satisfying equation (B.1).
To prove that the limit of β̂_p(γ) exists, notice that ∂L/∂β_p ≥ 0 for any γ > 1. Similarly, one
can prove that the solution of the following equation exists:
        S_p(β̂_1(β_p, 1+), ..., β̂_{p−1}(β_p, 1+), β_p) + d(β_p, γ) = 0,        (B.3)
where β̂_j(β_p, 1+) is the limit of the solution β̂_j(β_p, γ) for fixed β_p, j = 1, ..., p − 1.
Denote the solution of (B.3) by β̃_p(γ). By the assumption of the induction, lim_{γ→1+} β̃_p(γ)
exists.
Rewrite equation (B.1) as
        S_p(β̂_1(β_p, 1+), ..., β̂_{p−1}(β_p, 1+), β_p) + d(β_p, γ) + Δ(β_p, γ) = 0,        (B.4)
where the function Δ(β_p, γ) is defined as
        Δ(β_p, γ) = S_p(β̂_1(β_p, γ), ..., β̂_{p−1}(β_p, γ), β_p) − S_p(β̂_1(β_p, 1+), ..., β̂_{p−1}(β_p, 1+), β_p).
We need to prove that the solutions of (B.1) and (B.3) have the same limit. This can be
achieved by proving that
        |Δ(β_p, γ)| ≤ δ(γ),
where δ(γ) is independent of β_p and converges to 0 as γ → 1+. Since S_p is differentiable
with bounded partial derivatives ∂S_p/∂β_j, and β̂_j(β_p, γ) is differentiable with bounded partial
derivatives ∂β̂_j/∂β_p by the Implicit Function Theorem, and β̂_j(β_p, γ) → β̂_j(β_p, 1+) for any value
of β_p, it can be shown by functional analysis that there exists such a function δ(γ) → 0
uniformly in β_p.
This completes the proof of Theorem 1.
Proof of Theorem 2.
(1). Given λ > 0, γ > 1. Since there exists a joint likelihood function and ∂S/∂β is positive
definite, the function −2 log(Lik) is convex. By the same argument as in Lemma 1, the function
G(β, λ, γ) is convex and can be minimized uniquely at some finite point. Therefore, the
Bridge estimator is unique. Since (P3) has a unique solution β̂(λ, γ), which satisfies
β̂_j ≠ 0 almost surely for j = 1, ..., p, and the function G is differentiable at β̂(λ, γ),
G is minimized at β̂(λ, γ). By the uniqueness of the Bridge estimator of (P2), β̂(λ, γ) is
equal to the Bridge estimator of (P2).
(2). Given λ > 0. By Theorem 1, lim_{γ→1+} β̂(λ, γ) exists. We denote the limit by β̂(λ, 1+).
Since G(β, λ, γ) is continuous in (β, λ, γ), lim_{γ→1+} G(β̂(λ, γ), λ, γ) = G(β̂(λ, 1+), λ, 1).
Also notice that β̂(λ, γ) is the unique estimator minimizing G(β, λ, γ), and β̂_lasso(λ) is the
unique estimator minimizing G(β, λ, 1) since G is convex for γ = 1. We prove that β̂(λ, 1+) =
β̂_lasso(λ) by contradiction. If this were not true, by the uniqueness of the Lasso estimator,
        G(β̂_lasso(λ), λ, 1) < G(β̂(λ, 1+), λ, 1).
Take ε_0 > 0 such that
        ε_0 < |G(β̂(λ, 1+), λ, 1) − G(β̂_lasso(λ), λ, 1)|.
Since G(β̂(λ, γ), λ, γ) and G(β̂_lasso(λ), λ, γ) are continuous in γ, there exists a γ_0 > 1
such that, for 1 < γ < γ_0,
        |G(β̂(λ, γ), λ, γ) − G(β̂(λ, 1+), λ, 1)| < ε_0/2   and   |G(β̂_lasso(λ), λ, γ) − G(β̂_lasso(λ), λ, 1)| < ε_0/2,
so that G(β̂_lasso(λ), λ, γ) < G(β̂(λ, γ), λ, γ). However, this contradicts the fact that
        G(β̂(λ, γ), λ, γ) < G(β, λ, γ)
for any β ≠ β̂(λ, γ) by the uniqueness of the Bridge estimator. This completes the proof.
Proof of Theorem 3.
Notice that, by Theorem 1, the limit of the Bridge estimator as γ tends to 1+ is the Lasso
estimator. Taking this limit at each step of the modified Newton-Raphson (M-N-R)
algorithm leads to the Shooting algorithm. Hence the convergence of the M-N-R algorithm
implies the convergence of the Shooting algorithm, by simply taking the limit as γ tends
to 1+ at each step. Therefore it suffices to prove the convergence of the M-N-R algorithm.
We prove it for the following two cases.
(1). There exists a joint likelihood function. By Lemma 1, the function G(β, λ, γ) is
convex, so there exists a unique solution minimizing G, i.e. β̂_brg = arg min G. For p = 1, the
M-N-R algorithm converges to the unique solution of (P3), which is the Bridge estimator
by Theorem 2. Hence, it minimizes the function G in β. For p > 1 and fixed β_{−j}, updating
β_j by the M-N-R algorithm achieves the minimum of G in β_j for fixed β_{−j}. Denote
by G_{mj} the value of G and by β_{mj} the updated value of β after updating β_j at step m
of the M-N-R algorithm; one has
        G_{11} ≥ G_{12} ≥ ··· ≥ G_{1p} ≥ G_{21} ≥ ··· ≥ min G.
By the convexity of the function G and the uniqueness of the Bridge estimator which mini-
mizes G, G_{mj} converges to the unique minimum min(G) and β_{mj} converges to the unique
Bridge estimator β̂_brg. Consequently, the subsequence β_{mp}, which is equal to β_m by
definition, converges to β̂_brg.
(2). There exists no joint likelihood function. We prove that in a small neighbourhood
of the Bridge estimator β̂_brg of (P3), there exists a potential function of β such that the
gradient of this potential function is equal to the vector field of S. Then the convergence
can be proved through (1) above.
We prove there exists an approximation of such a potential function. Since the Jaco-
bian is positive definite, by Theorem 1, there exists a unique solution β̂_brg. Denote by Q
the matrix Q = (∂S/∂β)|_{β = β̂_brg}. Define the real function L(β) = ½ SᵀQ⁻¹S and take the
partial derivative with respect to β in a neighbourhood of β̂_brg:
        ∂L/∂β = (∂S/∂β)ᵀ Q⁻¹ S = S + Δ,
where Δ is negligible near β̂_brg by the continuity of the Jacobian (∂S/∂β) at β̂_brg. Therefore, the function
L(β) is an approximation to the potential function of which the gradient is equal to S.
This completes the proof.
Remark  The existence of a local potential function in a neighbourhood of some point
β of the vector field S does not imply the existence of a global potential function that
would have a gradient equal to S, since S may be path-dependent.