
  • Agenda

    Regularization: Ridge Regression and the LASSO

    Statistics 305: Autumn Quarter 2006/2007

    Wednesday, November 29, 2006


  • Agenda

    Agenda

    1 The Bias-Variance Tradeoff

    2 Ridge Regression
      Solution to the $\ell_2$ Problem
      Data Augmentation Approach
      Bayesian Interpretation
      The SVD and Ridge Regression

    3 Cross Validation
      K-Fold Cross Validation
      Generalized CV

    4 The LASSO

    5 Model Selection, Oracles, and the Dantzig Selector

    6 References

  • Part I: The Bias-Variance Tradeoff

    Part I

    The Bias-Variance Tradeoff


  • Part I: The Bias-Variance Tradeoff

    Estimating $\beta$

    As usual, we assume the model:

    $$y = f(z) + \varepsilon, \qquad \varepsilon \sim (0, \sigma^2)$$

    In regression analysis, our major goal is to come up with some good regression function $\hat{f}(z) = z^\top \hat{\beta}$

    So far, we've been dealing with $\hat{\beta}^{\text{ls}}$, the least squares solution:

    $$\hat{\beta}^{\text{ls}} = (Z^\top Z)^{-1} Z^\top y$$

    $\hat{\beta}^{\text{ls}}$ has well-known properties (e.g., Gauss-Markov, ML)

    But can we do better?

  • Part I: The Bias-Variance Tradeoff

    Choosing a good regression function

    Suppose we have an estimator $\hat{f}(z) = z^\top \hat{\beta}$

    To see if $\hat{f}(z) = z^\top \hat{\beta}$ is a good candidate, we can ask ourselves two questions:

    1.) Is $\hat{\beta}$ close to the true $\beta$?

    2.) Will $\hat{f}(z)$ fit future observations well?

  • Part I: The Bias-Variance Tradeoff

    1.) Is $\hat{\beta}$ close to the true $\beta$?

    To answer this question, we might consider the mean squared error of our estimate $\hat{\beta}$:

    i.e., consider the squared distance of $\hat{\beta}$ to the true $\beta$:

    $$\text{MSE}(\hat{\beta}) = E[\,\|\hat{\beta} - \beta\|^2\,] = E[(\hat{\beta} - \beta)^\top (\hat{\beta} - \beta)]$$

    Example: In least squares (LS), we know that (see the short simulation below):

    $$E[(\hat{\beta}^{\text{ls}} - \beta)^\top (\hat{\beta}^{\text{ls}} - \beta)] = \sigma^2 \, \text{tr}[(Z^\top Z)^{-1}]$$
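
    As a quick numerical sanity check of this identity, here is a small R simulation (the design Z, the coefficients, sigma = 2, and the number of replications are all invented for illustration) that estimates the mean squared error by Monte Carlo and compares it with the formula:

      set.seed(1)
      n <- 50; p <- 4; sigma <- 2
      Z <- matrix(rnorm(n * p), n, p)                # a fixed, simulated design
      beta <- c(1, -2, 0.5, 3)                       # invented "true" coefficients

      # Monte Carlo estimate of E || beta_hat_ls - beta ||^2
      mse_mc <- mean(replicate(5000, {
        y <- Z %*% beta + rnorm(n, sd = sigma)
        b <- solve(crossprod(Z), crossprod(Z, y))    # least squares fit
        sum((b - beta)^2)
      }))

      # theoretical value: sigma^2 * tr[(Z'Z)^{-1}]
      mse_theory <- sigma^2 * sum(diag(solve(crossprod(Z))))
      c(monte_carlo = mse_mc, theory = mse_theory)   # the two should agree closely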

  • Part I: The Bias-Variance Tradeoff

    2.) Will f (z) fit future observations well?

    Just because $\hat{f}(z)$ fits our data well, this doesn't mean that it will be a good fit to new data

    In fact, suppose that we take new measurements $y_i'$ at the same $z_i$'s:

    $$(z_1, y_1'),\ (z_2, y_2'),\ \ldots,\ (z_n, y_n')$$

    So if $\hat{f}(\cdot)$ is a good model, then $\hat{f}(z_i)$ should also be close to the new target $y_i'$

    This is the notion of prediction error (PE)

  • Part I: The Bias-Variance Tradeoff

    Prediction error and the bias-variance tradeoff

    So good estimators should, on average, have small prediction errors

    Let's consider the PE at a particular target point $z_0$ (see the board for a derivation):

    $$\text{PE}(z_0) = E_{Y|Z=z_0}\{(Y - \hat{f}(Z))^2 \mid Z = z_0\} = \sigma^2 + \text{Bias}^2(\hat{f}(z_0)) + \text{Var}(\hat{f}(z_0))$$

    Such a decomposition is known as the bias-variance tradeoff

    As the model becomes more complex (more terms included), local structure/curvature can be picked up

    But coefficient estimates suffer from high variance as more terms are included in the model

    So introducing a little bias in our estimate for $\beta$ might lead to a substantial decrease in variance, and hence to a substantial decrease in PE (a toy simulation of this tradeoff follows below)
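
    The tradeoff is easy to see in a toy simulation. The sketch below is illustrative only (the sinusoidal f, the noise level, and the use of polynomial degree as "model complexity" are all invented): training error keeps falling as the degree grows, while test error is roughly U-shaped.

      set.seed(1)
      f <- function(z) sin(2 * pi * z)               # invented "true" regression function
      n <- 100; sigma <- 0.3
      z  <- runif(n);  y  <- f(z)  + rnorm(n, sd = sigma)   # training data
      z2 <- runif(n);  y2 <- f(z2) + rnorm(n, sd = sigma)   # independent test data

      errs <- t(sapply(1:12, function(d) {           # d = polynomial degree (complexity)
        fit <- lm(y ~ poly(z, d))
        c(degree = d,
          train  = mean((y  - fitted(fit))^2),
          test   = mean((y2 - predict(fit, newdata = data.frame(z = z2)))^2))
      }))
      errs                                           # training error decreases; test error turns back up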

  • Part I: The Bias-Variance Tradeoff

    Depicting the bias-variance tradeoff

    [Figure: A graph depicting the bias-variance tradeoff. Squared error is plotted against model complexity, with curves for prediction error, squared bias, and variance.]

  • Part II: Ridge Regression

    Part II

    Ridge Regression


  • Part II: Ridge Regression


    Ridge regression as regularization

    If the $\beta_j$'s are unconstrained... they can explode, and hence are susceptible to very high variance

    To control variance, we might regularize the coefficients, i.e., control how large the coefficients grow

    Might impose the ridge constraint:

    $$\text{minimize} \; \sum_{i=1}^n (y_i - z_i^\top \beta)^2 \quad \text{s.t.} \; \sum_{j=1}^p \beta_j^2 \le t$$

    $$\text{minimize} \; (y - Z\beta)^\top (y - Z\beta) \quad \text{s.t.} \; \sum_{j=1}^p \beta_j^2 \le t$$

    By convention (very important!):

    Z is assumed to be standardized (mean 0, unit variance)

    y is assumed to be centered

  • Part II: Ridge Regression


    Ridge regression: $\ell_2$ penalty

    Can write the ridge constraint as the following penalized residual sum of squares (PRSS):

    $$\text{PRSS}(\beta)_{\ell_2} = \sum_{i=1}^n (y_i - z_i^\top \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2 = (y - Z\beta)^\top (y - Z\beta) + \lambda \|\beta\|_2^2$$

    Its solution may have smaller average PE than $\hat{\beta}^{\text{ls}}$

    $\text{PRSS}(\beta)_{\ell_2}$ is convex, and hence has a unique solution

    Taking derivatives, we obtain:

    $$\frac{\partial\, \text{PRSS}(\beta)_{\ell_2}}{\partial \beta} = -2 Z^\top (y - Z\beta) + 2\lambda \beta$$

  • Part II: Ridge Regression


    The ridge solutions

    The solution to $\text{PRSS}(\beta)_{\ell_2}$ is now seen to be:

    $$\hat{\beta}_\lambda^{\text{ridge}} = (Z^\top Z + \lambda I_p)^{-1} Z^\top y$$

    Remember that Z is standardized and y is centered

    The solution is indexed by the tuning parameter $\lambda$ (more on this later)

    Inclusion of $\lambda$ makes the problem non-singular even if $Z^\top Z$ is not invertible

    This was the original motivation for ridge regression (Hoerl and Kennard, 1970); a short R sketch of the closed form follows below
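
    A minimal R sketch of this closed form on simulated data (the design, the coefficients, and lambda = 10 are invented; Z is standardized and y centered, as the slides assume):

      set.seed(1)
      n <- 100; p <- 5; lambda <- 10
      Z <- scale(matrix(rnorm(n * p), n, p))         # standardized predictors
      y <- drop(Z %*% c(3, -2, 0, 0, 1) + rnorm(n))
      y <- y - mean(y)                               # centered response

      # ridge solution: (Z'Z + lambda * I_p)^{-1} Z'y
      beta_ridge <- solve(crossprod(Z) + lambda * diag(p), crossprod(Z, y))

      # compare with the unpenalized least squares solution
      beta_ls <- solve(crossprod(Z), crossprod(Z, y))
      cbind(ls = beta_ls, ridge = beta_ridge)        # ridge coefficients are shrunk toward 0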

  • Part II: Ridge Regression


    Tuning parameter $\lambda$

    Notice that the solution is indexed by the parameter $\lambda$

    So for each $\lambda$, we have a solution; hence, the $\lambda$'s trace out a path of solutions (see the next page and the sketch below)

    $\lambda$ is the shrinkage parameter:

    $\lambda$ controls the size of the coefficients and the amount of regularization

    As $\lambda \to 0$, we obtain the least squares solutions

    As $\lambda \to \infty$, we have $\hat{\beta}_{\lambda=\infty}^{\text{ridge}} = 0$ (intercept-only model)
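
    Because the solution is a simple closed form in lambda, the whole path can be traced by evaluating it on a grid. A small sketch on simulated data (the grid endpoints and data are invented):

      set.seed(1)
      Z <- scale(matrix(rnorm(100 * 5), 100, 5))     # simulated standardized predictors
      y <- drop(Z %*% c(3, -2, 0, 0, 1) + rnorm(100)); y <- y - mean(y)

      lambdas <- 10^seq(-2, 4, length.out = 50)      # grid of tuning parameters
      path <- sapply(lambdas, function(l)
        solve(crossprod(Z) + l * diag(ncol(Z)), crossprod(Z, y)))

      # 'path' is p x 50: column k holds the ridge coefficients at lambdas[k]
      matplot(log10(lambdas), t(path), type = "l",
              xlab = "log10(lambda)", ylab = "coefficient")
      # as lambda -> 0 the path approaches the LS solution; as lambda -> Inf it shrinks to 0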

  • Part II: Ridge Regression


    Ridge coefficient paths

    The $\lambda$'s trace out a set of ridge solutions, as illustrated below

    [Figure: Ridge regression coefficient paths for the diabetes data set found in the lars library in R. Coefficients for age, sex, bmi, map, tc, ldl, hdl, tch, ltg, and glu are plotted against the effective degrees of freedom (DF), which runs from 0 to 10.]

  • Part II: Ridge Regression


    Choosing $\lambda$

    Need a disciplined way of selecting $\lambda$:

    That is, we need to tune the value of $\lambda$

    In their original paper, Hoerl and Kennard introduced ridge traces:

    Plot the components of $\hat{\beta}_\lambda^{\text{ridge}}$ against $\lambda$

    Choose $\lambda$ for which the coefficients are not rapidly changing and have sensible signs

    No objective basis; heavily criticized by many

    Standard practice now is to use cross-validation (defer discussion until Part 3)

  • Part II: Ridge Regression


    Proving that $\hat{\beta}_\lambda^{\text{ridge}}$ is biased

    Let $R = Z^\top Z$. Then:

    $$\hat{\beta}_\lambda^{\text{ridge}} = (Z^\top Z + \lambda I_p)^{-1} Z^\top y$$

    $$= (R + \lambda I_p)^{-1} R (R^{-1} Z^\top y)$$

    $$= [R (I_p + \lambda R^{-1})]^{-1} R [(Z^\top Z)^{-1} Z^\top y]$$

    $$= (I_p + \lambda R^{-1})^{-1} R^{-1} R \, \hat{\beta}^{\text{ls}}$$

    $$= (I_p + \lambda R^{-1})^{-1} \hat{\beta}^{\text{ls}}$$

    So:

    $$E(\hat{\beta}_\lambda^{\text{ridge}}) = E\{(I_p + \lambda R^{-1})^{-1} \hat{\beta}^{\text{ls}}\} = (I_p + \lambda R^{-1})^{-1} \beta \ne \beta \quad (\text{if } \lambda \ne 0)$$

  • Part II: Ridge Regression


    Data augmentation approach

    The $\ell_2$ PRSS can be written as:

    $$\text{PRSS}(\beta)_{\ell_2} = \sum_{i=1}^n (y_i - z_i^\top \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2 = \sum_{i=1}^n (y_i - z_i^\top \beta)^2 + \sum_{j=1}^p \bigl(0 - \sqrt{\lambda}\, \beta_j\bigr)^2$$

    Hence, the $\ell_2$ criterion can be recast as another least squares problem for another data set

  • Part II: Ridge Regression


    Data augmentation approach continued

    The $\ell_2$ criterion is the RSS for the augmented data set:

    $$Z_\lambda = \begin{pmatrix} z_{1,1} & z_{1,2} & \cdots & z_{1,p} \\ \vdots & \vdots & & \vdots \\ z_{n,1} & z_{n,2} & \cdots & z_{n,p} \\ \sqrt{\lambda} & 0 & \cdots & 0 \\ 0 & \sqrt{\lambda} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{\lambda} \end{pmatrix}; \qquad y_\lambda = \begin{pmatrix} y_1 \\ \vdots \\ y_n \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$

    So:

    $$Z_\lambda = \begin{pmatrix} Z \\ \sqrt{\lambda}\, I_p \end{pmatrix}; \qquad y_\lambda = \begin{pmatrix} y \\ 0 \end{pmatrix}$$

  • Part II: Ridge Regression


    Solving the augmented data set

    So the least squares solution for the augmented data set is:

    $$(Z_\lambda^\top Z_\lambda)^{-1} Z_\lambda^\top y_\lambda = \left( (Z^\top,\ \sqrt{\lambda}\, I_p) \begin{pmatrix} Z \\ \sqrt{\lambda}\, I_p \end{pmatrix} \right)^{-1} (Z^\top,\ \sqrt{\lambda}\, I_p) \begin{pmatrix} y \\ 0 \end{pmatrix} = (Z^\top Z + \lambda I_p)^{-1} Z^\top y,$$

    which is simply the ridge solution (a quick numerical check follows below)
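
    This equivalence is easy to verify numerically; a small sketch on simulated data (lambda = 10 and the data are invented):

      set.seed(1)
      lambda <- 10
      Z <- scale(matrix(rnorm(100 * 5), 100, 5)); p <- ncol(Z)
      y <- drop(Z %*% c(3, -2, 0, 0, 1) + rnorm(100)); y <- y - mean(y)

      Z_aug <- rbind(Z, sqrt(lambda) * diag(p))      # stack Z on top of sqrt(lambda) * I_p
      y_aug <- c(y, rep(0, p))                       # append p zeros to y

      beta_aug   <- solve(crossprod(Z_aug), crossprod(Z_aug, y_aug))     # plain LS on augmented data
      beta_ridge <- solve(crossprod(Z) + lambda * diag(p), crossprod(Z, y))
      max(abs(beta_aug - beta_ridge))                # essentially zero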

  • Part II: Ridge Regression


    Bayesian framework

    Suppose we imposed a multivariate Gaussian prior for $\beta$:

    $$\beta \sim N\!\left(0, \; \frac{\sigma^2}{\lambda}\, I_p\right)$$

    Then the posterior mean (and also posterior mode) of $\beta$ is:

    $$\hat{\beta}_\lambda^{\text{ridge}} = (Z^\top Z + \lambda I_p)^{-1} Z^\top y$$

  • Part II: Ridge Regression


    Computing the ridge solutions via the SVD

    Recall $\hat{\beta}_\lambda^{\text{ridge}} = (Z^\top Z + \lambda I_p)^{-1} Z^\top y$

    When computing $\hat{\beta}_\lambda^{\text{ridge}}$ numerically, matrix inversion is avoided:

    Inverting $Z^\top Z$ can be computationally expensive: $O(p^3)$

    Rather, the singular value decomposition is utilized; that is,

    $$Z = U D V^\top,$$

    where:

    $U = (u_1, u_2, \ldots, u_p)$ is an $n \times p$ orthogonal matrix

    $D = \text{diag}(d_1, d_2, \ldots, d_p)$ is a $p \times p$ diagonal matrix consisting of the singular values $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$

    $V = (v_1, v_2, \ldots, v_p)$ is a $p \times p$ orthogonal matrix

  • Part II: Ridge Regression


    Numerical computation of $\hat{\beta}_\lambda^{\text{ridge}}$

    Will show on the board that:

    $$\hat{\beta}_\lambda^{\text{ridge}} = (Z^\top Z + \lambda I_p)^{-1} Z^\top y = V \, \text{diag}\!\left(\frac{d_j}{d_j^2 + \lambda}\right) U^\top y$$

    The result uses the eigen (or spectral) decomposition of $Z^\top Z$:

    $$Z^\top Z = (U D V^\top)^\top (U D V^\top) = V D U^\top U D V^\top = V D D V^\top = V D^2 V^\top$$

    (A short R sketch of this computation follows below.)
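
    A small sketch (simulated data, invented lambda) confirming that the SVD route reproduces the direct formula without explicitly inverting Z'Z + lambda I_p:

      set.seed(1)
      lambda <- 10
      Z <- scale(matrix(rnorm(100 * 5), 100, 5))
      y <- drop(Z %*% c(3, -2, 0, 0, 1) + rnorm(100)); y <- y - mean(y)

      sv <- svd(Z)                                   # Z = U D V'
      U <- sv$u; d <- sv$d; V <- sv$v

      beta_svd    <- V %*% (d / (d^2 + lambda) * crossprod(U, y))   # V diag(d_j/(d_j^2 + lambda)) U'y
      beta_direct <- solve(crossprod(Z) + lambda * diag(ncol(Z)), crossprod(Z, y))
      max(abs(beta_svd - beta_direct))               # essentially zero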

  • Part II: Ridge Regression


    $\hat{y}^{\text{ridge}}$ and principal components

    A consequence is that:

    $$\hat{y}_\lambda^{\text{ridge}} = Z \hat{\beta}_\lambda^{\text{ridge}} = \sum_{j=1}^p \left( u_j \, \frac{d_j^2}{d_j^2 + \lambda} \, u_j^\top \right) y$$

    Ridge regression has a relationship with principal components analysis (PCA):

    Fact: The derived variable $\gamma_j = Z v_j = u_j d_j$ is the $j$th principal component (PC) of Z

    Hence, ridge regression projects y onto these components with large $d_j$

    Ridge regression shrinks the coefficients of low-variance components

  • Part II: Ridge Regression


    Orthonormal Z in ridge regression

    If Z is orthonormal, so that $Z^\top Z = I_p$, then a couple of closed-form properties exist

    Let $\hat{\beta}^{\text{ls}}$ denote the LS solution for our orthonormal Z; then

    $$\hat{\beta}_\lambda^{\text{ridge}} = \frac{1}{1 + \lambda} \, \hat{\beta}^{\text{ls}}$$

    The optimal choice of $\lambda$ minimizing the expected prediction error is:

    $$\lambda^* = \frac{p \sigma^2}{\sum_{j=1}^p \beta_j^2},$$

    where $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^\top$ is the true coefficient vector (a small numerical check of the first identity follows below)
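
    The first identity is easy to check numerically. In the sketch below the orthonormal design is built from a QR decomposition, and lambda and the coefficients are invented:

      set.seed(1)
      lambda <- 10
      Q <- qr.Q(qr(matrix(rnorm(100 * 5), 100, 5)))  # orthonormal columns: Q'Q = I_5
      y <- drop(Q %*% c(2, -1, 0, 4, 0.5) + rnorm(100))

      beta_ls    <- drop(crossprod(Q, y))            # LS solution when Z'Z = I_p
      beta_ridge <- drop(solve(crossprod(Q) + lambda * diag(5), crossprod(Q, y)))
      max(abs(beta_ridge - beta_ls / (1 + lambda)))  # essentially zero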

  • Part II: Ridge Regression


    Smoother matrices and effective degrees of freedom

    A smoother matrix S is a linear operator satisfying:

    $$\hat{y} = S y$$

    Smoothers put the "hats" on y

    So the fits are a linear combination of the $y_i$'s, $i = 1, \ldots, n$

    Example: In ordinary least squares, recall the hat matrix

    $$H = Z (Z^\top Z)^{-1} Z^\top$$

    For $\text{rank}(Z) = p$, we know that $\text{tr}(H) = p$, which is how many degrees of freedom are used in the model

    By analogy, define the effective degrees of freedom (or effective number of parameters) for a smoother to be:

    $$\text{df}(S) = \text{tr}(S)$$

  • Part II: Ridge Regression


    Degrees of freedom for ridge regression

    In ridge regression, the fits are given by:

    $$\hat{y} = Z (Z^\top Z + \lambda I_p)^{-1} Z^\top y$$

    So the smoother or "hat" matrix in ridge takes the form:

    $$S_\lambda = Z (Z^\top Z + \lambda I_p)^{-1} Z^\top$$

    So the effective degrees of freedom in ridge regression are given by:

    $$\text{df}(\lambda) = \text{tr}(S_\lambda) = \text{tr}[Z (Z^\top Z + \lambda I_p)^{-1} Z^\top] = \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda}$$

    Note that $\text{df}(\lambda)$ is monotone decreasing in $\lambda$ (see the sketch below)

    Question: What happens when $\lambda = 0$?
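
    Via the SVD, df(lambda) needs only the singular values of Z; a small sketch on simulated data:

      set.seed(1)
      Z <- scale(matrix(rnorm(100 * 5), 100, 5))

      df_ridge <- function(Z, lambda) {
        d <- svd(Z)$d                                # singular values of Z
        sum(d^2 / (d^2 + lambda))                    # tr[ Z (Z'Z + lambda I)^{-1} Z' ]
      }

      sapply(c(0, 1, 10, 100, 1000), df_ridge, Z = Z)
      # monotone decreasing in lambda; equals p = ncol(Z) at lambda = 0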

  • Part III: Cross Validation

    Part III

    Cross Validation


  • Part III: Cross Validation

    How do we choose $\lambda$?

    We need a disciplined way of choosing $\lambda$

    Obviously want to choose the $\lambda$ that minimizes the mean squared error

    This issue is part of the bigger problem of model selection

  • Part III: Cross Validation

    Training sets versus test sets

    If we have a good model, it should predict well when we have new data

    In machine learning terms, we compute our statistical model $\hat{f}(\cdot)$ from the training set

    A good estimator $\hat{f}(\cdot)$ should then perform well on a new, independent set of data

    We test or assess how well $\hat{f}(\cdot)$ performs on the new data, which we call the test set

  • Part III: Cross Validation

    More on training and testing

    Ideally, we would separate our available data into both training and test sets

    Of course, this is not always possible, especially if we have only a few observations

    Hope to come up with the best-trained algorithm that will stand up to the test

    Example: The Netflix contest (http://www.netflixprize.com/)

    How can we try to find the best-trained algorithm?

  • Part III: Cross Validation

    K -fold cross validation

    Most common approach is K-fold cross validation:

    (i) Partition the training data T into K separate sets of equal size; suppose $T = (T_1, T_2, \ldots, T_K)$. Commonly chosen K's are K = 5 and K = 10.

    (ii) For each $k = 1, 2, \ldots, K$, fit the model $\hat{f}^{(\lambda)}_{-k}(z)$ to the training set excluding the $k$th fold $T_k$.

    (iii) Compute the fitted values for the observations in $T_k$, based on the training data that excluded this fold.

    (iv) Compute the cross-validation (CV) error for the $k$th fold:

    $$(\text{CV Error})^{(\lambda)}_k = |T_k|^{-1} \sum_{(z, y) \in T_k} \bigl( y - \hat{f}^{(\lambda)}_{-k}(z) \bigr)^2$$

  • Part III: Cross Validation

    K -fold cross validation (continued)

    The model then has overall cross-validation error:

    $$(\text{CV Error})^{(\lambda)} = K^{-1} \sum_{k=1}^K (\text{CV Error})^{(\lambda)}_k$$

    Select $\lambda^*$ as the one with minimum $(\text{CV Error})^{(\lambda)}$

    Compute the chosen model $\hat{f}_{\lambda^*}(z)$ on the entire training set $T = (T_1, T_2, \ldots, T_K)$

    Apply $\hat{f}_{\lambda^*}(z)$ to the test set to assess test error (a bare-bones R sketch of this recipe follows below)
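
    A bare-bones sketch of this recipe for choosing lambda in ridge regression (simulated data; K = 10 and the lambda grid are invented; for simplicity Z is standardized once globally rather than within each training fold):

      set.seed(1)
      Z <- scale(matrix(rnorm(200 * 5), 200, 5))
      y <- drop(Z %*% c(3, -2, 0, 0, 1) + rnorm(200)); y <- y - mean(y)

      K <- 10
      fold <- sample(rep(1:K, length.out = nrow(Z))) # random fold assignment
      lambdas <- 10^seq(-2, 4, length.out = 50)

      cv_error <- sapply(lambdas, function(l) {
        fold_err <- sapply(1:K, function(k) {
          tr <- fold != k                            # training rows: every fold except k
          b  <- solve(crossprod(Z[tr, ]) + l * diag(ncol(Z)),
                      crossprod(Z[tr, ], y[tr]))     # fit without fold k
          mean((y[!tr] - Z[!tr, , drop = FALSE] %*% b)^2)   # (CV Error)_k for this lambda
        })
        mean(fold_err)                               # average over the K folds
      })

      lambda_star <- lambdas[which.min(cv_error)]    # lambda with minimum overall CV error
      lambda_star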

  • Part III: Cross Validation

    Plot of CV errors and standard error bands

    [Figure: Cross validation errors from a ridge regression example on spam data, with standard error bands. Squared error (roughly 0.16 to 0.24) is plotted against effective degrees of freedom df (roughly 30 to 55).]

  • Part III: Cross Validation

    Cross validation with few observations

    Remark: Our data set might be small, so we might not have enough observations to put aside a test set:

    In this case, let all of the available data be our training set

    Still apply K-fold cross validation

    Still choose $\lambda^*$ as the minimizer of CV error

    Then refit the model with $\lambda^*$ on the entire training set

  • Part III: Cross Validation

    Leave-one-out CV

    What happens when K = n, so that each fold holds a single observation?

    This is called leave-one-out cross validation

    For squared error loss, there is a convenient approximation to CV$_{(1)}$, the leave-one-out CV error

  • Part III: Cross Validation

    Generalized CV for smoother matrices

    Recall that a smoother matrix S satisfies:

    $$\hat{y} = S y$$

    In many linear fitting methods (as in LS), we have:

    $$\text{CV}_{(1)} = \frac{1}{n} \sum_{i=1}^n \bigl( y_i - \hat{f}_{-i}(z_i) \bigr)^2 = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat{f}(z_i)}{1 - S_{ii}} \right)^2$$

    A convenient approximation to CV$_{(1)}$ is called the generalized cross validation, or GCV, error:

    $$\text{GCV} = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat{f}(z_i)}{1 - \frac{\text{tr}(S)}{n}} \right)^2$$

    Recall that tr(S) is the effective degrees of freedom, or effective number of parameters (a short R sketch for ridge regression follows below)
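
    A direct sketch of the GCV criterion for ridge regression, using the smoother matrix S_lambda from Part II (simulated data; the lambda grid is invented):

      set.seed(1)
      Z <- scale(matrix(rnorm(100 * 5), 100, 5))
      y <- drop(Z %*% c(3, -2, 0, 0, 1) + rnorm(100)); y <- y - mean(y)

      gcv_ridge <- function(Z, y, lambda) {
        n <- nrow(Z)
        S <- Z %*% solve(crossprod(Z) + lambda * diag(ncol(Z)), t(Z))  # smoother matrix S_lambda
        mean(((y - S %*% y) / (1 - sum(diag(S)) / n))^2)               # GCV criterion
      }

      lambdas <- 10^seq(-2, 4, length.out = 50)
      gcv <- sapply(lambdas, gcv_ridge, Z = Z, y = y)
      lambdas[which.min(gcv)]                        # the GCV-chosen lambda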

  • Part IV: The LASSO

    Part IV

    The LASSO


  • Part IV: The LASSO

    The LASSO: $\ell_1$ penalty

    Tibshirani (Journal of the Royal Statistical Society, 1996) introduced the LASSO: least absolute shrinkage and selection operator

    LASSO coefficients are the solutions to the $\ell_1$ optimization problem:

    $$\text{minimize} \; (y - Z\beta)^\top (y - Z\beta) \quad \text{s.t.} \; \sum_{j=1}^p |\beta_j| \le t$$

    This is equivalent to the loss function:

    $$\text{PRSS}(\beta)_{\ell_1} = \sum_{i=1}^n (y_i - z_i^\top \beta)^2 + \lambda \sum_{j=1}^p |\beta_j| = (y - Z\beta)^\top (y - Z\beta) + \lambda \|\beta\|_1$$

  • Part IV: The LASSO

    $\lambda$ (or t) as a tuning parameter

    Again, we have a tuning parameter $\lambda$ that controls the amount of regularization

    One-to-one correspondence with the threshold t: recall the constraint:

    $$\sum_{j=1}^p |\beta_j| \le t$$

    Hence, we have a path of solutions indexed by t

    If $t_0 = \sum_{j=1}^p |\hat{\beta}_j^{\text{ls}}|$ (equivalently, $\lambda = 0$), we obtain no shrinkage (and hence obtain the LS solutions as our solution)

    Often, the path of solutions is indexed by a fraction of the shrinkage factor, $t / t_0$

  • Part IV: The LASSO

    Sparsity and exact zeros

    Often, we believe that many of the $\beta_j$'s should be 0

    Hence, we seek a set of sparse solutions

    A large enough $\lambda$ (or small enough t) will set some coefficients exactly equal to 0!

    So the LASSO will perform model selection for us!

  • Part IV: The LASSO

    Computing the LASSO solution

    Unlike ridge regression, the LASSO has no closed-form solution

    The original implementation involves quadratic programming techniques from convex optimization

    But Efron et al. (Annals of Statistics, 2004) proposed LARS (least angle regression), which computes the LASSO path efficiently

    The lars package in R implements the LASSO (a short sketch of its use follows below)

    An interesting modification is called forward stagewise

    In many cases it is the same as the LASSO solution

    Forward stagewise is easy to implement: http://www-stat.stanford.edu/~hastie/TALKS/nips2005.pdf
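
    A sketch of fitting the LASSO path with the lars package mentioned above (this assumes the package is installed and follows its documented interface; the shrinkage fraction 0.5 is an arbitrary choice for illustration):

      # install.packages("lars")                     # if not already installed
      library(lars)
      data(diabetes)                                 # the diabetes data shipped with lars

      fit <- lars(diabetes$x, diabetes$y, type = "lasso")
      plot(fit)                                      # coefficient paths against |beta| / max|beta|
      coef(fit, s = 0.5, mode = "fraction")          # coefficients at 50% of the shrinkage factor t / t0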

  • Part IV: The LASSO

    Forward stagewise algorithm

    As usual, assume Z is standardized and y is centered

    Choose a small step size $\epsilon$. The forward-stagewise algorithm then proceeds as follows:

    1. Start with the initial residual r = y, and $\beta_1 = \beta_2 = \cdots = \beta_p = 0$.

    2. Find the predictor $Z_j$ ($j = 1, \ldots, p$) most correlated with r.

    3. Update $\beta_j \leftarrow \beta_j + \delta_j$, where $\delta_j = \epsilon \cdot \text{sign}\langle r, Z_j \rangle = \epsilon \cdot \text{sign}(Z_j^\top r)$.

    4. Set $r \leftarrow r - \delta_j Z_j$, and repeat Steps 2 and 3 many times.

    Try implementing forward stagewise yourself! It's easy! (A sketch follows below.)
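
    Taking up that suggestion, here is a direct R transcription of the four steps above (the data, the step size eps, and the number of passes are invented; Z is standardized and y centered, as assumed):

      forward_stagewise <- function(Z, y, eps = 0.01, n_steps = 5000) {
        beta <- rep(0, ncol(Z))                      # step 1: beta = 0, r = y
        r <- as.vector(y)
        for (s in 1:n_steps) {
          cors <- drop(crossprod(Z, r))              # step 2: inner products Z_j' r
          j <- which.max(abs(cors))                  #         most correlated predictor
          delta <- eps * sign(cors[j])               # step 3: small step in that direction
          beta[j] <- beta[j] + delta
          r <- r - delta * Z[, j]                    # step 4: update the residual
        }
        beta
      }

      set.seed(1)
      Z <- scale(matrix(rnorm(100 * 5), 100, 5))
      y <- drop(Z %*% c(3, -2, 0, 0, 1) + rnorm(100)); y <- y - mean(y)
      round(forward_stagewise(Z, y), 3)              # close to the least squares coefficients here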

  • Part IV: The LASSO

    Example: diabetes data

    Example taken from lars package documentation:

    Call:

    lars(x = x, y = y)

    R-squared: 0.518

    Sequence of LASSO moves:

    bmi ltg map hdl sex glu tc tch ldl age hdl hdl

    Var 3 9 4 7 2 10 5 8 6 1 -7 7

    Step 1 2 3 4 5 6 7 8 9 10 11 12


  • Part IV: The LASSO

    The LASSO, LARS, and Forward Stagewise paths

    [Figure: Comparison of the LASSO, LARS, and Forward Stagewise coefficient paths for the diabetes data set. Each panel plots standardized coefficients against |beta| / max|beta|; the three panels are labeled LASSO, LAR, and Forward Stagewise.]

  • Part V: Model Selection, Oracles, and the Dantzig Selector

    Part V

    Model Selection, Oracles, and the Dantzig Selector


  • Part V: Model Selection, Oracles, and the Dantzig Selector

    Comparing LS, Ridge, and the LASSO

    Even though $Z^\top Z$ may not be of full rank, both ridge regression and the LASSO admit solutions

    We have a problem when $p > n$ (more predictor variables than observations)

    But both ridge regression and the LASSO have solutions

    Regularization tends to reduce prediction error

  • Part V: Model Selection, Oracles, and the Dantzig Selector

    Variable selection

    The ridge and LASSO solutions are indexed by the continuous parameter $\lambda$:

    Variable selection in least squares is discrete:

    Perhaps consider best subsets, which is of order $O(2^p)$ (a combinatorial explosion compared to ridge and the LASSO), or stepwise selection

    In stepwise procedures, a new variable may be added into the model even with a minuscule improvement in $R^2$

    When applying stepwise selection to a perturbation of the data, we would probably have a different set of variables enter the model at each stage

    Many model selection techniques are based on Mallows' $C_p$, AIC, and BIC

  • Part V: Model Selection, Oracles, and the Dantzig Selector

    More comments on variable selection

    Now suppose $p > n$

    Of course, we would like a parsimonious model (Occam's Razor)

    Ridge regression produces coefficient values for each of the p variables

    But because of its $\ell_1$ penalty, the LASSO will set many of the variables exactly equal to 0!

    That is, the LASSO produces sparse solutions

    So the LASSO takes care of model selection for us

    And we can even see when variables jump into the model by looking at the LASSO path

  • Part V: Model Selection, Oracles, and the Dantzig Selector

    Variants

    Zou and Hastie (2005) propose the elastic net, which is a convex combination of the ridge and LASSO penalties

    The paper asserts that the elastic net can improve prediction error over the LASSO

    It still produces sparse solutions

    Frank and Friedman (1993) introduce bridge regression, which generalizes to $\ell_q$ penalties

    Regularization ideas have been extended to other contexts:

    Park (Ph.D. Thesis, 2006) computes $\ell_1$-regularized paths for generalized linear models

  • Part V: Model Selection, Oracles, and the Dantzig Selector

    High-dimensional data and underdetermined systems

    In many modern data analysis problems, we have $p \gg n$

    These comprise high-dimensional problems

    When fitting the model $y = z^\top \beta$, we can have many solutions

    i.e., our system is underdetermined

    It is reasonable to suppose that most of the coefficients are exactly equal to 0

  • Part V: Model Selection, Oracles, and the Dantzig Selector

    S-sparsity and Oracles

    Suppose that only S elements of $\beta$ are non-zero

    Candès and Tao call this S-sparsity

    Now suppose we had an Oracle that told us which components of $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^\top$ are truly non-zero

    Let $\beta^*$ be the least squares estimate of this ideal estimator:

    So $\beta^*$ is 0 in every component where $\beta$ is 0

    The non-zero elements of $\beta^*$ are computed by regressing y on only the S important covariates

  • Part V: Model Selection, Oracles, and the Dantzig Selector

    The Dantzig selector

    Candès and Tao developed the Dantzig selector $\hat{\beta}^{\text{Dantzig}}$:

    $$\text{minimize} \; \|\beta\|_1 \quad \text{s.t.} \; \|Z^\top r\|_\infty \le (1 + t^{-1}) \sqrt{2 \log p} \cdot \sigma$$

    Here, r is the residual vector and t > 0 is a scalar

    They showed that, with high probability,

    $$\|\hat{\beta}^{\text{Dantzig}} - \beta\|^2 = O(\log p) \, E\bigl( \|\beta^* - \beta\|^2 \bigr)$$

    So the Dantzig selector does comparably well as someone who was told which S variables to regress on

  • Part VI: References

    Part VI

    References


  • Part VI: References

    References

    Candès, E. and Tao, T. The Dantzig selector: statistical estimation when p is much larger than n. Available at http://www.acm.caltech.edu/~emmanuel/papers/DantzigSelector.pdf.

    Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2): 409-499.

    Frank, I. and Friedman, J. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35: 109-148.

    Hastie, T. and Efron, B. The lars package. Available from http://cran.r-project.org/src/contrib/Descriptions/lars.html.

  • Part VI: References

    References continued

    Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics.

    Hoerl, A.E. and Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12: 55-67.

    Seber, G. and Lee, A. (2003). Linear Regression Analysis, 2nd Edition. Wiley Series in Probability and Statistics.

    Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67: 301-320.