Lecture 1: Variable Selection - LASSO

Hao Helen Zhang



Outline

Modern Penalization Methods

Lq penalty, ridge

LASSO

computation, solution path, LARS algorithm
parameter tuning, cross validation
theoretical properties

ridge regression


Notations

data (Xi, Yi), i = 1, ..., n; n is the sample size

d: the number of predictors, Xi ∈ R^d

the full index set S = {1, 2, ..., d}; the selected index set given by a procedure is A, with size |A|

the linear coefficient vector β = (β1, ..., βd)^T

the true linear coefficients β0 = (β10, ..., βd0)^T

the true model A0 = {j : βj0 ≠ 0, j = 1, ..., d}


Motivations

Shrinkage methods make variable selection a continuous process:

more stable, not suffering from high variability

one unified selection and estimation framework; easy to make inferences

advances in theory: oracle properties, valid asymptotic inferences

In general, we standardize the inputs before applying the methods:

the X's are centered to mean 0 and scaled to variance 1

y is centered to mean 0

We then often fit the model without an intercept.


Ideal Variable Selection

The best results we can expect from a selection procedure:

able to obtain the correct model structure

keep all the important variables in the model
filter all the noisy variables out of the model

has some optimal inferential properties

consistency, optimal convergence rate
asymptotic normality, most efficient

True model: yi = β00 + ∑_{j=1}^d xij βj0 + εi

important index set: A0 = {j : βj0 ≠ 0, j = 1, ..., d}
unimportant index set: A0^c = {j : βj0 = 0, j = 1, ..., d}


Oracle Properties

An oracle performs as if the true model were known:

1. selection consistency:

β̂j ≠ 0 for j ∈ A0; β̂j = 0 for j ∈ A0^c

2. estimation consistency:

√n (β̂A0 − βA0) →d N(0, ΣI),

where βA0 = {βj, j ∈ A0} and ΣI is the covariance matrix we would obtain if the true model were known.


Penalized Regularization Framework

One revolutionary advance is the development of the regularization framework:

min_β L(β; y, X) + λ J(β)

L(β; y, X) is the loss function:

for OLS, L = ∑_{i=1}^n [yi − β0 − ∑_{j=1}^d βj xij]^2

for MLE-type methods, L is the negative log-likelihood

for Cox's PH models, L is the negative log partial likelihood

in supervised learning, L is the hinge loss (SVM) or the exponential loss (AdaBoost)


Lq Shrinkage Penalty

J(β) is the penalty function. The Lq family is

Jq(|β|) = ‖β‖_q^q = ∑_{j=1}^d |βj|^q,  q ≥ 0.

J0(|β|) = ∑_{j=1}^d I(βj ≠ 0) (L0 penalty; Donoho and Johnstone, 1988)

J1(|β|) = ∑_{j=1}^d |βj| (LASSO; Tibshirani, 1996)

J2(|β|) = ∑_{j=1}^d βj^2 (ridge; Hoerl and Kennard, 1970)

J∞(|β|) = maxj |βj| (supnorm penalty; Zhang et al., 2008)
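As a quick illustration (a sketch, not from the slides; the function name is ours), the whole family can be evaluated in a few lines of R:

## evaluate the Lq penalty Jq(|beta|) for a coefficient vector
lq.penalty <- function(beta, q) {
  if (q == 0) return(sum(beta != 0))           # L0 counts nonzero coefficients
  if (is.infinite(q)) return(max(abs(beta)))   # supnorm penalty
  sum(abs(beta)^q)
}

beta <- c(-2, 0, 0.5)
sapply(c(0, 1, 2, Inf), lq.penalty, beta = beta)
## 2.00 2.50 4.25 2.00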


[Figure 3.13, The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 3: contours of constant value of ∑_j |βj|^q for given values of q (q = 4, 2, 1, 0.5, 0.1).]


The LASSO

Applying the absolute penalty to the coefficients:

min_β ∑_{i=1}^n (yi − ∑_{j=1}^d xij βj)^2 + λ ∑_{j=1}^d |βj|

λ ≥ 0 is a tuning parameter

λ controls the amount of shrinkage; the larger λ, the greater the amount of shrinkage

What happens if λ → 0? (no penalty: we recover the OLS fit)

What happens if λ → ∞? (all coefficients are shrunk to zero)


Equivalent formulation for LASSO

min_β ∑_{i=1}^n (yi − ∑_{j=1}^d xij βj)^2, subject to ∑_{j=1}^d |βj| ≤ t.

Explicitly applies the magnitude constraint on the parameters

Making t sufficiently small will cause some coefficients to be exactly zero.

The lasso is a kind of continuous subset selection


[Figure 3.12, The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 3: estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid areas are the constraint regions |β1| + |β2| ≤ t and β1^2 + β2^2 ≤ t^2, respectively, while the ellipses are the contours of the least squares error function.]


LASSO Solution: soft thresholding

For the orthogonal design case with X^T X = I, the LASSO is equivalent to solving for each βj componentwise:

min_{βj} (βj − βj^ls)^2 + λ |βj|

The solution to the above problem is

βj^lasso = sign(βj^ls)(|βj^ls| − λ/2)_+

         = βj^ls − λ/2,  if βj^ls > λ/2
           0,            if |βj^ls| ≤ λ/2
           βj^ls + λ/2,  if βj^ls < −λ/2

shrinks big coefficients by a constant λ/2 towards zero

truncates small coefficients to exactly zero
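This rule is one line of R (a sketch, not from the slides; the function name is ours):

## soft-thresholding operator: shrink by lambda/2, truncate at zero
soft.threshold <- function(beta.ls, lambda) {
  sign(beta.ls) * pmax(abs(beta.ls) - lambda / 2, 0)
}

soft.threshold(c(-3, -0.1, 0.2, 2.5), lambda = 1)
## -2.5  0.0  0.0  2.0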


Comparison: Different Methods Under Orthogonal Design

When X is orthonormal, i.e. X^T X = I:

Best subset (of size M): βj^ls · I(|βj^ls| ≥ |β(M)^ls|), where β(M)^ls is the Mth largest coefficient in absolute value

keeps only the M largest coefficients ("hard thresholding")

Ridge regression: βj^ls / (1 + λ)

does a proportional shrinkage

Lasso: sign(βj^ls)(|βj^ls| − λ/2)_+

shifts each coefficient toward zero by a constant amount, then truncates it at zero with threshold λ/2
"soft thresholding", used often in wavelet-based smoothing
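A small numerical side-by-side of the three rules (a sketch, not from the slides; all names and values are illustrative), applied to the same least squares coefficients:

## three rules under an orthonormal design, applied to the same beta^ls
beta.ls <- c(2.5, -1.8, 0.9, -0.4, 0.1)
M <- 2; lambda <- 1

keep <- rank(-abs(beta.ls)) <= M                           # M largest |beta^ls|
best.subset <- ifelse(keep, beta.ls, 0)                    # hard thresholding
ridge <- beta.ls / (1 + lambda)                            # proportional shrinkage
lasso <- sign(beta.ls) * pmax(abs(beta.ls) - lambda/2, 0)  # soft thresholding

rbind(beta.ls, best.subset, ridge, lasso)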


How to Choose t or λ

The tuning parameter t or λ should be adaptively chosen to minimize the MSE or prediction error (PE). Common tuning methods include

a validation (tuning) error, cross validation, leave-one-out cross validation (LOOCV), AIC, or BIC.

If t is chosen larger than t0 = ∑_{j=1}^d |βj^ls|, then βj^lasso = βj^ls.

If t = t0/2, the least squares coefficients are shrunk by about 50% on average.

In practice, we sometimes standardize t by s = t/t0 ∈ [0, 1], and choose the best value of s from [0, 1].


K-fold Cross Validation

The simplest and most popular way to estimate the tuning error:

1. Randomly split the data into K roughly equal parts.
2. For each k = 1, ..., K: leave the kth part out and fit the model using the other K − 1 parts; denote the solution by β̂^{−k}. Calculate the prediction error of β̂^{−k} on the left-out kth part.
3. Average the prediction errors.

Define the index function (allocating fold memberships)

κ : {1, ..., n} → {1, ..., K}.

The K-fold cross-validation estimate of prediction error (PE) is

CV = (1/n) ∑_{i=1}^n (yi − xi^T β̂^{−κ(i)})^2

Typical choices: K = 5 or 10.
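A minimal hand-rolled sketch of this procedure for the lasso at a fixed shrinkage fraction s, on simulated data (everything here is illustrative; cv.lars, shown later, automates the whole curve):

library(lars)

set.seed(1)                                   # simulated data for illustration
n <- 100; d <- 10
x <- matrix(rnorm(n * d), n, d)
y <- drop(x %*% c(3, 1.5, 0, 0, 2, rep(0, 5)) + rnorm(n))

K <- 10; s <- 0.5
kappa <- sample(rep(1:K, length.out = n))     # random fold memberships kappa(i)

sq.err <- numeric(n)
for (k in 1:K) {
  fit.k <- lars(x[kappa != k, ], y[kappa != k])            # fit on the K-1 folds
  pred.k <- predict(fit.k, newx = x[kappa == k, , drop = FALSE],
                    s = s, mode = "fraction")$fit          # predict left-out fold
  sq.err[kappa == k] <- (y[kappa == k] - pred.k)^2
}
mean(sq.err)                                  # CV estimate of PE at fraction s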


Choose λ

Choose a sequence of λ values.

For each λ, fit the LASSO and denote the solution by β̂λ^lasso.

Compute the CV(λ) curve as

CV(λ) = (1/n) ∑_{i=1}^n (yi − xi^T β̂λ^{−κ(i)})^2,

which provides an estimate of the test error curve.

Find the best parameter λ* which minimizes CV(λ).

Fit the final LASSO model with λ*. The final solution is denoted β̂λ*^lasso.


[Figure, The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 7: prediction error (red) and tenfold cross-validation curve (green) estimated from a single training set; misclassification error is plotted against subset size p.]


One-standard Error for CV

One-standard-error rule:

choose the most parsimonious model whose error is no more than one standard error above the error of the best model.

used often in model selection (it favors a smaller model size).


Computation of LASSO

Quadratic programming

Shooting algorithm (Fu, 1998; Zhang and Lu, 2007)

LARS solution path (Efron et al., 2004)

the most efficient algorithm for the LASSO
designed for standard linear models

Coordinate descent algorithm (CDA; Friedman et al., 2010)

designed for GLMs
suitable for ultra-high-dimensional problems


R package: lars

Provides the entire sequence of coefficients and fits, starting from zero, up to the least squares fit.

library(lars)

lars(x, y, type = c("lasso", "lar", "forward.stagewise",
     "stepwise"), trace = FALSE, normalize = TRUE,
     intercept = TRUE, Gram, eps = .Machine$double.eps,
     max.steps, use.Gram = TRUE)

A "lars" object is returned, for which print, plot, predict, coef and summary methods exist.

Details in Efron, Hastie, Johnstone and Tibshirani (2004).

With the "lasso" option, it computes the complete lasso solution simultaneously for ALL values of the shrinkage parameter (λ, t, or s) at the same order of computational cost as a single least squares fit.


Details About LARS function

x: matrix of predictors

y: response

type: one of "lasso", "lar", "forward.stagewise", or "stepwise". Default is "lasso".

trace: if TRUE, lars prints out its progress

normalize: if TRUE, each variable is standardized to have unit L2 norm; otherwise it is left alone. Default is TRUE.

intercept: if TRUE, an intercept is included (and not penalized); otherwise no intercept is included. Default is TRUE.

Gram: the X^T X matrix; useful for repeated runs (e.g. bootstrap), where a large X^T X stays the same

eps: an effective zero

max.steps: a limit on the number of steps taken. Default is 8 · min(d, n − intercept), where n is the sample size.


[Figure 3.9, The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 3: profiles of lasso coefficients for the prostate cancer example (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45) as the tuning parameter t is varied. Coefficients are plotted versus s = t / ∑_j |βj^ls|. A vertical line is drawn at s = 0.5, the value chosen by cross-validation. Compare Figure 3.7 (shown later in these slides): the lasso profiles hit zero, while those for ridge do not.]


Prediction for LARS

predict(object, newx, s, type = c("fit", "coefficients"),
        mode = c("step", "fraction", "norm", "lambda"), ...)

coef(object, ...)

object: a fitted lars object

newx: if type="fit", newx should be the x values at which the fit is required; if type="coefficients", newx can be omitted

s: a value, or vector of values, indexing the path. Its meaning depends on mode. By default (mode="step"), s takes values between 0 and p.

type: if type="fit", predict returns the fitted values; if type="coefficients", predict returns the coefficients


“Mode” for Prediction in LARS

If mode="step", then s indexes the lars step number, and the coefficients corresponding to step s are returned.

If mode="fraction", then s should be a number between 0 and 1, and it refers to the ratio of the L1 norm of the coefficient vector relative to the norm at the full LS solution.

If mode="norm", then s refers to the L1 norm of the coefficient vector.

If mode="lambda", then s is the lasso regularization parameter λ.
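For example (hypothetical calls; object is a fitted lars object such as the diabetes fit on the next slide):

## coefficients at half the L1 norm of the full least squares fit
coef(object, s = 0.5, mode = "fraction")

## fitted values at the lasso regularization parameter lambda = 2
predict(object, newx = x, s = 2, mode = "lambda", type = "fit")$fit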


Example: Diabetes Data

library(lars)
data(diabetes)
attach(diabetes)

## fit LASSO
object <- lars(x, y)
plot(object)
summary(object)
predict(object, newx = x)

## fit LARS
object2 <- lars(x, y, type = "lar")

detach(diabetes)


R Code to Compute K-CV Error Curve for Lars

cv.lars(x, y, K = 10, index, type = c("lasso",
        "lar", "forward.stagewise", "stepwise"),
        mode = c("fraction", "step"), ...)

x: input to lars

y: input to lars

K: number of folds

index: abscissa values at which the CV curve should be computed. If mode="fraction", this is the fraction of the saturated |β|, with default index = seq(from = 0, to = 1, length = 100). If mode="step", this is the number of steps in the lars procedure.


R Code: CV.LARS

attach(diabetes)

fit.lasso <- lars(x, y)      # the lasso fit whose path we tune
lasso.s <- seq(0, 1, length = 100)
lasso.cv <- cv.lars(x, y, K = 10, index = lasso.s, mode = "fraction")
lasso.mcv <- which.min(lasso.cv$cv)

## minimum CV rule
bests1 <- lasso.s[lasso.mcv]
lasso.coef1 <- predict(fit.lasso, s = bests1, type = "coef",
                       mode = "fraction")

## one-standard-error rule: smallest s with CV error below the bound
bound <- lasso.cv$cv[lasso.mcv] + lasso.cv$cv.error[lasso.mcv]
bests2 <- lasso.s[min(which(lasso.cv$cv < bound))]
lasso.coef2 <- coef(fit.lasso, s = bests2, mode = "fraction")


Consistency - Definition

Let the true parameters be β0 and their estimates β̂n; the subscript n indicates the sample size.

Estimation consistency:

β̂n − β0 →p 0, as n → ∞.

Model selection consistency:

P({j : β̂j ≠ 0} = {j : β0j ≠ 0}) → 1, as n → ∞.

Sign consistency:

P(β̂n =s β0) → 1, as n → ∞,

where β̂n =s β0 ⇔ sign(β̂n) = sign(β0).

Sign consistency is stronger than model selection consistency.


Consistency of LASSO Estimator

Knight and Fu (2000) have shown:

Estimation consistency: the LASSO solution is estimation consistent for fixed p. In other words,

β̂^lasso(λn) →p β0, provided λn = o(n).

It is root-n consistent and asymptotically normal.

Model selection property: for λn ∝ √n, as n → ∞, there is a non-vanishing positive probability for the lasso to select the true model.

The ℓ1 penalization approach is called basis pursuit in signal processing (Chen, Donoho, and Saunders, 2001).


Model Selection Consistency of LASSO Estimator

Zhao and Yu (2006). On Model Selection Consistency of Lasso. JMLR, 7, 2541-2563.

If an irrelevant predictor is highly correlated with the predictors in the true model, the lasso may not be able to distinguish it from the true predictors with any amount of data and any amount of regularization.

Under the Irrepresentable Condition (IC), the LASSO is model selection consistent in both the fixed-p and large-p settings.

The IC is stated as

|[Xn(1)^T Xn(1)]^{-1} Xn(1)^T Xn(2)| < 1 − η,

i.e., the regression coefficients of the irrelevant covariates Xn(2) on the relevant covariates Xn(1) are constrained. The IC is almost necessary and sufficient for the lasso to select the true model.
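A numerical check of this condition in its strong form, which multiplies the regression-coefficient matrix by the sign vector of the true coefficients (a hypothetical sketch; the function name, data, and η value are ours):

## strong IC: max | X(2)'X(1) [X(1)'X(1)]^{-1} sign(beta(1)) | <= 1 - eta
ic.check <- function(x, beta0, eta = 0.01) {
  A0 <- which(beta0 != 0)                      # relevant covariates X(1)
  x1 <- x[, A0, drop = FALSE]
  x2 <- x[, -A0, drop = FALSE]                 # irrelevant covariates X(2)
  v <- t(x2) %*% x1 %*% solve(crossprod(x1), sign(beta0[A0]))
  max(abs(v)) <= 1 - eta
}

set.seed(1)
x <- matrix(rnorm(200 * 6), 200, 6)
ic.check(x, beta0 = c(3, 1.5, 0, 0, 0, 0))     # typically TRUE for independent columns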


Special Cases for LASSO to be Selection Consistent

Assume the true model size is |A0| = q.

The underlying model must satisfy a nontrivial condition for the lasso variable selection to be consistent (Zou, 2006).

The lasso is always consistent in model selection under the following special cases:

when p = 2;
when the design is orthogonal;
when the covariates have bounded constant correlation 0 < rn < 1/(1 + cq) for some c > 0;
when the design has power-decay correlation corr(Xi, Xj) = ρn^{|i−j|}, where |ρn| ≤ c < 1.


Ridge Regression (q = 2)

Applying the squared penalty to the coefficients:

min_β ∑_{i=1}^n (yi − ∑_{j=1}^d xij βj)^2 + λ ∑_{j=1}^d βj^2

λ ≥ 0 is a tuning parameter

The solution is

β̂^ridge = (X^T X + λ Id)^{-1} X^T y.

Note the similarity to the ordinary least squares solution, but with the addition of a "ridge" down the diagonal.

As λ → 0, β̂^ridge → β̂^ols.

As λ → ∞, β̂^ridge → 0.
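The closed form is a one-line computation (a sketch, not from the slides; the names are ours, and x and y are assumed centered so no intercept is needed):

## ridge solution by direct linear algebra
ridge.beta <- function(x, y, lambda) {
  solve(crossprod(x) + lambda * diag(ncol(x)), crossprod(x, y))
}

## sanity check: lambda -> 0 recovers OLS; a large lambda shrinks toward 0
set.seed(1)
x <- scale(matrix(rnorm(50 * 3), 50, 3))
y <- drop(x %*% c(2, -1, 0.5) + rnorm(50)); y <- y - mean(y)
cbind(ols = drop(ridge.beta(x, y, 0)), ridge = drop(ridge.beta(x, y, 10)))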


Ridge Regression Solution

In the special case of an orthonormal design matrix,

β̂j^ridge = β̂j^ols / (1 + λ).

This illustrates the essential "shrinkage" feature of ridge regression:

applying the ridge penalty has the effect of shrinking the estimates toward zero

the penalty introduces bias but reduces the variance of the estimate

The ridge estimator does not threshold, since the shrinkage is smooth (proportional to the original coefficient).


Invertibility

If the design matrix X is not of full rank, then

X^T X is not invertible

for OLS, there is no unique solution β̂^ols

This problem does not occur with ridge regression, however.

Theorem: For any design matrix X and any λ > 0, the quantity X^T X + λ Id is always invertible; thus, there is always a unique solution β̂^ridge. (Reason: X^T X is positive semidefinite, so every eigenvalue of X^T X + λ Id is at least λ > 0.)


Bias and Variance

The variance of the ridge regression estimate is

Var(β̂^ridge) = σ^2 W X^T X W,

where W = (X^T X + λ Id)^{-1}.

The bias of the ridge regression estimate is

Bias(β̂^ridge) = E(β̂^ridge) − β = −λ W β.

It can be shown that

the total variance ∑_{j=1}^d Var(β̂j^ridge) is monotone decreasing with respect to λ,

while the total squared bias ∑_{j=1}^d Bias^2(β̂j^ridge) is monotone increasing with respect to λ.
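These monotonicity claims are easy to verify numerically (a sketch on simulated data, with σ and β treated as known; all names are ours):

## total variance and total squared bias of ridge along a lambda grid
set.seed(1)
n <- 50; d <- 5; sigma <- 1
x <- scale(matrix(rnorm(n * d), n, d))
beta <- c(2, -1, 0.5, 0, 0)

var.bias <- function(lambda) {
  W <- solve(crossprod(x) + lambda * diag(d))
  c(total.var   = sigma^2 * sum(diag(W %*% crossprod(x) %*% W)),
    total.bias2 = sum((lambda * W %*% beta)^2))   # Bias = -lambda W beta
}

round(sapply(c(0, 0.5, 2, 10, 50), var.bias), 4)
## the variance row decreases and the squared-bias row increases as lambda grows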


Mean Squared Error (MSE) of Ridge

Existence theorem: there always exists a λ > 0 such that the MSE of β̂^ridge is less than the MSE of β̂^ols.

This is a rather surprising result with somewhat radical implications: even if the model we fit is exactly correct and follows the exact distribution we specify, we can always obtain a better estimator (in terms of MSE) by shrinking towards zero.


Bayesian Interpretation of Ridge Estimation

Penalized regression can be interpreted in a Bayesian context:

Suppose y = Xβ + ε, where ε ~ N(0, σ^2 In).

Given the prior β ~ N(0, τ^2 Id), the posterior mean of β given the data is

(X^T X + (σ^2/τ^2) Id)^{-1} X^T y,

which is exactly the ridge estimator with λ = σ^2/τ^2.


R function: lm.ridge (in package MASS)

library(MASS)

lm.ridge(formula, data, subset, na.action, lambda)

formula: a formula expression, as for regression models, of the form response ~ predictors

data: an optional data frame in which to interpret the variables occurring in formula

subset: an expression saying which subset of the rows of the data should be used in the fit; all observations are included by default

na.action: a function to filter missing data

lambda: a scalar or vector of ridge constants

If an intercept is present, its coefficient is not penalized.


Output for lm.ridge

Returns a list with components:

coef: matrix of coefficients, one row for each value of lambda. Note that these are not on the original scale; they are for use by the coef method.

scales: scalings used on the X matrix

Inter: whether an intercept was included

lambda: vector of lambda values

ym: mean of y

xm: column means of the x matrix

GCV: vector of GCV values


Example for Ridge Regression

library(MASS)

longley
names(longley)[1] <- "y"
lm.ridge(y ~ ., longley)
plot(lm.ridge(y ~ ., longley,
              lambda = seq(0, 0.1, 0.001)))

(demo)
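A natural follow-up (not in the slides): MASS provides a select method for ridge fits that reports λ values chosen by several criteria, including the GCV minimizer:

## pick lambda for the longley ridge path by GCV and related criteria
fit <- lm.ridge(y ~ ., longley, lambda = seq(0, 0.1, 0.001))
select(fit)    # prints the HKB, L-W, and GCV-based choices of lambda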


[Figure 3.7, The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 3: profiles of ridge coefficients for the prostate cancer example (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45) as the tuning parameter λ is varied. Coefficients are plotted versus df(λ), the effective degrees of freedom. A vertical line is drawn at df = 4.16, the value chosen by cross-validation.]


Lasso vs Ridge

Ridge regression:

is a continuous shrinkage method; it achieves better prediction performance than OLS through a bias-variance trade-off

cannot produce a parsimonious model, for it always keeps all the predictors in the model

The lasso:

does both continuous shrinkage and automatic variable selection simultaneously


Empirical Comparison

Tibshirani (1996) and Fu (1998) compared the prediction performance of the lasso, ridge, and bridge regression (Frank and Friedman, 1993) and found:

none of them uniformly dominates the other two

however, as variable selection becomes increasingly important in modern data analysis, the lasso is much more appealing owing to its sparse representation
