
  • Lecture 4

  • 1. Lasso

    The above motivates a similar algorithm, the Least Absolute Shrinkage and Selection Operator, i.e., LASSO.

    Assume we have a training set $\mathcal{T} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$.

  • As before assume that both $\mathbf{x}$ and $y$ have been centered, i.e. that

    $$\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^N \mathbf{x}_i = \mathbf{0}; \qquad \bar{y} = \frac{1}{N}\sum_{i=1}^N y_i = 0.$$

    This time we will replace the $\|\beta\|^2$ sum of squares regularization term in ridge regression with a new term,

    $$\|\beta\|_1 = \sum_{j=1}^p |\beta_j|.$$

  • We have as before

    "s œLasso

    argmin" ðóóóóóóóóóóóóóóóñóóóóóóóóóóóóóóóò"#Ðy X y X Ð Ñ m Þ" " ")X "- m

    _œLagrangian

    (3)[Note that for convenience we include the

    factor of in front in (3); this is used in the"#book and the same as substituting ]- -Ä #
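    As an illustration (not part of the lecture), here is a minimal numpy sketch of the Lagrangian objective (3); the function name `lasso_objective`, the variable `lam`, and the random data are all hypothetical:

```python
import numpy as np

def lasso_objective(beta, X, y, lam):
    """Lagrangian (3): (1/2)||y - X beta||^2 + lambda * ||beta||_1."""
    residual = y - X @ beta
    return 0.5 * residual @ residual + lam * np.abs(beta).sum()

# made-up centered data, just to evaluate the objective once
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
X -= X.mean(axis=0)                      # center the columns
y = X[:, 0] - 2 * X[:, 2] + rng.standard_normal(50)
y -= y.mean()                            # center the response
print(lasso_objective(np.zeros(5), X, y, lam=1.0))
```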

  • Equivalently, it can be shown using (reverse) Lagrange multipliers that we are minimizing

    $$\text{RSS} = (\mathbf{y} - X\beta)^T(\mathbf{y} - X\beta) = \|\mathbf{y} - X\beta\|^2 \qquad (4a)$$

    subject to a constraint of the form

    $$\sum_{j=1}^p |\beta_j| \le t_\lambda \qquad (4b)$$

    for some $t_\lambda$ which depends on $\lambda$.

  • This method is LASSO.

    Note that if as usual

    " "s œ ´ m m ßols argmin argmin" "

    RSS y X #

    then adding the constraint means that(4b)we have the following diagram:

  • Fig. 1: Feasible set of the constraint (4b) (blue) and the point (corner) $\beta$ that makes $\text{RSS} = \|\mathbf{y} - X\beta\|^2$ smallest.

  • Note that above we again assume we have centered data and centered $y$ values, i.e., $\bar{y} = 0$.

    Note above that the solution (intersection of red and blue regions) has $\beta_2 = 0$, i.e., this method zeroes out some of the $\beta_i$ instead of just shrinking them.

  • Above: Trevor Hastie, 2007 (https://web.stanford.edu/~hastie/TALKS/bradfest.pdf)

    The above figure shows the growth of the coefficients for a least squares regression as the $L_1$ constraint $t = \sum_j |\beta_j|$ is increased.

    Each curve traces the evolution of one of the coefficients $\beta_j$ as a function of $t$.

    Increasing $t$ is equivalent to decreasing the coefficient $\lambda$ in the above version (3).

  • Note: one can show the coefficients change piecewise linearly with $t$.

  • Notes:

    1. A convex optimization algorithm can be used to solve this problem for $\hat{\beta}^{\text{Lasso}}$.

    2. An alternative algorithm, Least Angle Regression (LAR), does something similar with regard to variable selection.

    3. We can choose $\lambda$ by cross-validation.

  • 4. This exact method also arises independently from Bayesian statistics through the use of a double exponential (Laplace) prior probability distribution on $\beta$, rather than a Gaussian prior distribution (a Gaussian prior distribution gives ridge regression); see the side calculation sketched after this list.

    5. This method has the ability to drop variables, and it shrinks their coefficients if they are not dropped.
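  • [A side calculation expanding note 4 (a standard argument, not spelled out in the slides): if $\mathbf{y}\,|\,X, \beta \sim N(X\beta, \sigma^2 I)$ and the $\beta_j$ have independent double exponential (Laplace) priors with density $\frac{1}{2b}e^{-|\beta_j|/b}$, then

    $$-\log p(\beta\,|\,\mathbf{y}, X) = \frac{1}{2\sigma^2}\|\mathbf{y} - X\beta\|^2 + \frac{1}{b}\sum_{j=1}^p |\beta_j| + \text{const},$$

    so the MAP estimate solves the Lasso problem (3) with $\lambda = \sigma^2/b$; a Gaussian prior replaces the second term by a multiple of $\|\beta\|^2$, giving ridge regression.]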

  • 2. Example: orthogonal regression

    Consider the simplified situation where the data matrix $X$ happens to have orthonormal columns.

    We have in this case (see text, problems) that

    $$\hat{\beta}_j^{\text{Lasso}} = \operatorname{sgn}(\hat{\beta}_j^{\text{ols}})\,\big(|\hat{\beta}_j^{\text{ols}}| - \lambda\big)_+ \qquad (5)$$

  • where OLS stands for ordinary least squares,

    and $(a)_+ = \begin{cases} a & \text{if } a > 0 \\ 0 & \text{otherwise.} \end{cases}$

    Here, $\operatorname{sgn}(a) = \begin{cases} 1 & \text{if } a \ge 0 \\ -1 & \text{if } a < 0. \end{cases}$

    [Equation (5) implies $\hat{\beta}_j^{\text{Lasso}}$ has the same sign as $\hat{\beta}_j^{\text{ols}}$ (since it has the same sign as $\operatorname{sgn}(\hat{\beta}_j^{\text{ols}})$), but it is shrunk.]
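    A tiny numerical sketch of equation (5) (illustrative only; the name `soft_threshold` and the sample values are mine, not the lecture's):

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Entrywise sgn(beta_ols) * (|beta_ols| - lam)_+ , i.e. equation (5);
    with orthonormal columns of X this is the Lasso solution."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

print(soft_threshold(np.array([2.5, -0.7, 0.3]), lam=1.0))
# -> [ 1.5 -0.   0. ]   large coefficients are shrunk, small ones are zeroed out
```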

  • Note one can show that solving the constraint version (4a), (4b) of Lasso above yields

    $$\hat{\beta}^{\text{Lasso}} = \hat{\beta}^{\text{ols}}$$

    if $t_\lambda \ge \sum_{j=1}^p |\hat{\beta}_j^{\text{ols}}|$.

    This can then be shown to correspond to $\lambda = 0$ in the regularization version (3) of Lasso.

  • More on the algorithm: finding

    argmin"

    Ðy X y X " "Ñ Ð ÑX

    such that is a 4œ"

    :

    4l l Ÿ >" quadratic

    programming problem, i.e. a problem thatcan be solved by the well known and easyto use. quadratic programming algorithm

  • More detail: we are effectively trying to solve a problem of the form

    $$\operatorname*{argmin}_{\mathbf{x}}\, \mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T\mathbf{x} \equiv \operatorname*{argmin}_{\mathbf{x}} f(\mathbf{x}) \qquad (15)$$

    with the constraint that $A\mathbf{x} \le \mathbf{b}$ and $E\mathbf{x} = \mathbf{d}$.

  • If the matrix $Q$ is positive semi-definite (i.e., all eigenvalues are non-negative), then this implies that $f(\mathbf{x})$ is convex. A global minimum exists if at least one $\mathbf{x}$ satisfies the constraints of (15) and $f$ is bounded below on the feasibility region (i.e. the region satisfying the constraints).

    Typical methods for solving this include:
    (a) interior point methods
    (b) active set methods
    (c) conjugate gradient methods
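    As a rough illustration (not the lecture's algorithm), the constrained form (4a)-(4b) can be handed to a generic constrained optimizer; the sketch below uses scipy's SLSQP rather than a dedicated QP solver, and all data and names are made up:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.standard_normal(100)

t = 2.0                                            # the L1 budget in (4b)
rss = lambda b: np.sum((y - X @ b) ** 2)           # objective (4a)
l1_budget = {"type": "ineq",                       # (4b): t - ||b||_1 >= 0
             "fun": lambda b: t - np.abs(b).sum()}

res = minimize(rss, x0=np.zeros(5), constraints=[l1_budget], method="SLSQP")
print(np.round(res.x, 3))                          # some coefficients driven to (near) zero
```

    In practice one substitutes $\beta = \beta^+ - \beta^-$ with $\beta^{\pm} \ge 0$, which turns the problem into a smooth quadratic program in $2p$ variables that the methods (a)-(c) above handle directly.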

  • 3. An alternative to LASSO -- Least Angle Regression (LAR)

    Definition 1: Given two vectors $\mathbf{x}$ and $\mathbf{y}$, their correlation is

    $$\operatorname{Corr}(\mathbf{x}, \mathbf{y}) \equiv \frac{\operatorname{Cov}(\mathbf{x}, \mathbf{y})}{\sqrt{V(\mathbf{x})\,V(\mathbf{y})}},$$

    where $\operatorname{Cov}(\mathbf{x}, \mathbf{y}) = \sum_i (x_i - \bar{x})(y_i - \bar{y})$, $V(\mathbf{x}) = \sum_i (x_i - \bar{x})^2$, and $\bar{x} = \frac{1}{p}\sum_{i=1}^p x_i$.

  • The LAR (least angle regression) algorithm:

    1. Start with $\hat{\beta}_j = 0$ for $j \ge 1$ and $\hat{\mathbf{y}} = \mathbf{0}$. Let $\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y}$.

    2. Find an $\tilde{\mathbf{x}}_j$ (column of $X$) such that $\operatorname{Corr}(\tilde{\mathbf{x}}_j, \mathbf{r})$ is maximized.

  • 3. Increase $\hat{\beta}_j$ from $0$ towards $\operatorname{sgn}(\operatorname{Corr}(\tilde{\mathbf{x}}_j, \mathbf{r}))$, computing the current residual $\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}$.

    4. Stop when $\exists\,\tilde{\mathbf{x}}_k$ such that $|\operatorname{Corr}(\tilde{\mathbf{x}}_k, \mathbf{r})| = |\operatorname{Corr}(\tilde{\mathbf{x}}_j, \mathbf{r})|$.

  • 5. Increase $(\hat{\beta}_j, \hat{\beta}_k)$ in their joint OLS (ordinary least squares) direction based on the current residual $\mathbf{r}$ (their correlations with $\mathbf{r}$ stay the same).

    Stop when there is an $\tilde{\mathbf{x}}_l$ that catches up, with $|\operatorname{Corr}(\tilde{\mathbf{x}}_l, \mathbf{r})| = |\operatorname{Corr}(\tilde{\mathbf{x}}_j, \mathbf{r})|$, then proceed similarly with $\tilde{\mathbf{x}}_l$ included with $\tilde{\mathbf{x}}_j, \tilde{\mathbf{x}}_k$ (note these are correlations across samples).

  • 6. Continue until all $\tilde{\mathbf{x}}_i$ have been included; we eventually arrive at the ordinary OLS solution.

    7. If we stop early, then we have not yet included the least needed variables.

    8. Complexity is $O(p^3 + Np^2)$, the same order as a single least squares fit.

    The behavior of the evolving coefficient vector $\beta(s)$ (parameterized appropriately) is *very* similar to the evolution $\beta(\lambda)$ of the LASSO coefficients as $\lambda$ increases:

  • Above: Trevor Hastie, Least Angle Regression, 2007 (https://web.stanford.edu/~hastie/TALKS/bradfest.pdf)

    [There is an iterative algorithm for LASSO that works very similarly to LAR.]

    [The so-called LARS package (Stanford) combines Lasso, LAR, and Forward Stagewise regression.]
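    For experimentation, scikit-learn provides an analogue of the LARS package; a hedged usage sketch (the data are made up, and I am assuming scikit-learn's `lars_path` interface):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 8))
y = X[:, 0] - 3 * X[:, 4] + rng.standard_normal(200)

# method="lar" gives the LAR path; method="lasso" gives the (very similar) Lasso path
alphas, active, coefs = lars_path(X, y, method="lar")
print(active)           # order in which variables enter the model
print(coefs.shape)      # (n_features, n_steps): the piecewise-linear coefficient paths
```

    The columns of `coefs` trace out coefficient paths like those in the Hastie figure above.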

  • 4. Coordinate descent

    The coordinate descent method can numerically implement LASSO in an efficient way.

    Start by writing (assume again $\mathbf{x}, \mathbf{y}$ centered, i.e., $\bar{x} = \bar{y} = 0$):

  • " "

    # # l œ ly X y X y X " " " " "# X- -l l"

    œ C B Ð Ñ l Ð Ñl"

    3œ"

    R

    3 35 5 5

    5œ" 5œ"

    : :#

    " - - " -

    [note that entry of B œ Ð3ß 5Ñ35 X xœ Ð Ñ ß3 5 i.e.5 3>2 >2 entry of sample]

  • [Now fix $j$, and freeze $\beta_k$ to a fixed value $\tilde{\beta}_k$ if $k \ne j$; note $\beta_j$ is the currently active variable]

    $$= \frac{1}{2}\sum_{i=1}^N \Big(y_i - \underbrace{\sum_{k\ne j} x_{ik}\tilde{\beta}_k}_{\tilde{y}_i} - x_{ij}\beta_j\Big)^2 + \lambda\sum_{k\ne j}|\tilde{\beta}_k| + \lambda|\beta_j|$$

    where we define $\tilde{y}_i$ as the underbraced term $\sum_{k\ne j} x_{ik}\tilde{\beta}_k$.

    [now optimize $\beta_j$ only; keep all other $\beta_k = \tilde{\beta}_k$ fixed]

  • $$= \frac{1}{2}\sum_{i=1}^N \big([y_i - \tilde{y}_i] - x_{ij}\beta_j\big)^2 + \lambda|\beta_j| + \underbrace{\lambda\sum_{k\ne j}|\tilde{\beta}_k|}_{\text{frozen; can leave out}}$$

    Now consider (writing $z_i = z_i^{(j)} = y_i - \tilde{y}_i$):

    $$\min_{\beta_j}\ \frac{1}{2}\sum_{i=1}^N (z_i - x_{ij}\beta_j)^2 + \lambda|\beta_j|$$

    (assume the normalization $\sum_i x_{ij}^2 = 1$)

  • $$= \frac{1}{2}\sum_i \big[x_{ij}^2\beta_j^2 - 2 z_i x_{ij}\beta_j + z_i^2\big] \pm \lambda\beta_j = \frac{1}{2}\beta_j^2 - \beta_j\sum_i z_i x_{ij} + \frac{1}{2}\sum_i z_i^2 \pm \lambda\beta_j.$$

    This yields a value curve as a function of $\beta_j$ which looks like:

  • So we just minimize this piecewise-quadratic expression over $\beta_j$; a sketch of the resulting update is below.
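    Putting the pieces together, here is a compact sketch of the resulting coordinate descent loop (assuming, as above, centered data and columns normalized so $\sum_i x_{ij}^2 = 1$; the function names and data are mine):

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/2)||y - X beta||^2 + lam * ||beta||_1 by cycling over coordinates."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual z_i = y_i - sum_{k != j} x_ik beta_k
            z = y - X @ beta + X[:, j] * beta[j]
            # the one-dimensional problem (1/2)beta_j^2 - beta_j <x_j, z> + lam|beta_j|
            # is minimized by soft-thresholding <x_j, z> at lam
            beta[j] = soft_threshold(X[:, j] @ z, lam)
    return beta

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 6))
X -= X.mean(axis=0)
X /= np.sqrt((X ** 2).sum(axis=0))       # normalize so sum_i x_ij^2 = 1
y = 4 * X[:, 1] - 3 * X[:, 3] + 0.5 * rng.standard_normal(100)
y -= y.mean()
print(np.round(lasso_coordinate_descent(X, y, lam=0.5), 3))
```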

  • 5. Linear Methods for Classification

    Our agenda: now look at classification rather than regression.

    1. Background on classification

    2. Three methods:
    (a) Linear discriminant analysis
    (b) Logistic regression
    (c) Separating hyperplanes

  • (a) Assumptions

    Recall: Regression vs. classification

    We always want to predict $y$ from $\mathbf{x} = (x_1, \ldots, x_p) \in \mathbb{R}^p$.

    Regression case: $y \in \mathbb{R}$ is a real number.

    Classification case: $y \in \{1, \ldots, K\}$,

    where $K$ = the number of classes.

  • In both cases: We always begin by assuming that the values $(\mathbf{x}, y)$ are instances of underlying random variables (RVs) $(\mathbf{X}, Y)$.

    Assume we have a fixed joint probability distribution $P$ on the space

    $$\mathbb{R}^p \times \mathcal{Y} = \{(\mathbf{x}, y) : \mathbf{x} \in \mathbb{R}^p,\ y \in \mathcal{Y}\}$$

    of possible values.

  • Here

    $$\mathcal{Y} = \begin{cases} \mathbb{R} & \text{in the regression case} \\ \{1, \ldots, K\} & \text{in the classification case.} \end{cases}$$

    Specifically, we will say that for a (measurable) set $A \subseteq \mathbb{R}^p \times \mathcal{Y}$ of possible values of $(\mathbf{X}, Y)$,

    $$\text{Probability that } (\mathbf{X}, Y) \in A \ \equiv\ P((\mathbf{X}, Y) \in A) \equiv P(A).$$

    In short we write

  • $$(\mathbf{X}, Y) \sim P(\mathbf{x}, y)$$

    In the regression case we often assume $y \in \mathbb{R}$, and that the measure $P$ has a joint probability density function

    $$\rho(\mathbf{x}, y) = \rho(x_1, \ldots, x_p, y).$$

    That is, for a set $A$ of possible values of $(\mathbf{x}, y)$:

  • $$P\{(\mathbf{x}, y) \in A\} = \int_A \rho(\mathbf{x}, y)\, d\mathbf{x}\, dy = \int_A \rho(x_1, \ldots, x_p, y)\, dx_1\, dx_2 \cdots dx_p\, dy$$

    where $\rho(\mathbf{x}, y) = \rho(x_1, \ldots, x_p, y)$ = the joint density function of $(X_1, \ldots, X_p, Y)$ (before we see them).

  • Our goal is to predict $Y$ through some function $f(\mathbf{X})$.

  • (b) Bayes method

    Emphasis: we will start by using the so-called Bayes method, in which the joint probability density function $\rho(\mathbf{x}, y)$ is assumed fully known for now.

    What is a good 'guess' of $Y$ given $\mathbf{X}$?

    If we actually know the joint distribution of $(\mathbf{X}, Y)$ (Bayes method):

  • We will find the 'best' guess at $Y$ by minimizing a given loss (penalty) function

    $$L(y, f(\mathbf{x})) = \text{estimate of our loss if the true value is } y \text{ but we guess } f(\mathbf{x}).$$

    So we choose our best estimate $\hat{f}(\mathbf{x})$ of $f(\mathbf{x})$ by first viewing $\mathbf{x} \to \mathbf{X}$ and $y \to Y$ as random variables, and then replacing these random variables again by $\mathbf{x}$ and $y$.

    We have

  • [note $\mathbf{X} = (X_1, \ldots, X_p)$ is not the data matrix here!]

    $$\hat{f} = \operatorname*{argmin}_{f} \text{EPE}(f) = \operatorname*{argmin}_{f} E_{(\mathbf{X}, Y)}[L(Y, f(\mathbf{X}))] = \operatorname*{argmin}_{f} E_{\mathbf{X}}\Big(\underbrace{E_{Y|\mathbf{x}}[L(Y, f(\mathbf{x}))\,|\,\mathbf{X} = \mathbf{x}]}_{\equiv\, E_{Y|\mathbf{X}}[L(Y, f(\mathbf{X}))]}\Big) \qquad (1)$$

  • [Note the notational convenience of writing

    $$E_{Y|\mathbf{X}}[L(Y, f(\mathbf{X}))\,|\,\mathbf{X} = \mathbf{x}] \equiv E_{Y|\mathbf{x}}[L(Y, f(\mathbf{x}))]. \qquad (1a)$$

    Both sides of (1a) mean the same thing since $\mathbf{X} = \mathbf{x}$ is fixed; the right side of (1a) is the more convenient notation.]

  • Remark: Note we have defined a new random variable $Y|\mathbf{x}$.

    This refers to the values of $Y$ conditioned on $\mathbf{X} = \mathbf{x}$, i.e., on the assumption that the rv $\mathbf{X}$ takes on a specific value.

    The probability measure (distribution) of $Y|\mathbf{x}$ on $\mathbb{R}$ is the conditional probability distribution of $Y$ given $\mathbf{X} = \mathbf{x}$.

  • [See probability notes for a more careful definition of conditional probability distributions]

  • (c) Minimization of error

    Note that to minimize the penalty (1) it is sufficient to minimize the inside expression

    $$E_{Y|\mathbf{x}}[L(Y, f(\mathbf{x}))\,|\,\mathbf{X} = \mathbf{x}]$$

    pointwise for each fixed $\mathbf{x}$.

  • For regression (with a continuous variable $Y \in \mathbb{R}$) we often choose squared error loss:

    $$L(Y, f(\mathbf{X})) = (Y - f(\mathbf{X}))^2.$$

    In this case we obtain

    $$\hat{f}(\mathbf{x}) = \operatorname*{arg\,min}_{f(\mathbf{x})} E_{Y|\mathbf{x}}[(Y - f(\mathbf{x}))^2\,|\,\mathbf{X} = \mathbf{x}] = \operatorname*{arg\,min}_{c} E_{Y|\mathbf{x}}[(Y - c)^2\,|\,\mathbf{X} = \mathbf{x}]$$

  • $$= E[Y|\mathbf{x}] \equiv E(Y\,|\,\mathbf{X} = \mathbf{x}) \qquad (2)$$

    [where now $\mathbf{x}$ is frozen, so we define the dummy variable $c = f(\mathbf{x})$]

    Note that above we use the following easy fact:

    For any random variable (RV) $Y$, the number $c$ that minimizes the average value of $(Y - c)^2$ is $c = E(Y)$.

  • So (2) shows that $\hat{f}(\mathbf{x}) = E(Y|\mathbf{x})$.
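    A quick numerical sanity check of this easy fact (purely illustrative; the exponential distribution used here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
Y = rng.exponential(scale=2.0, size=100_000)    # any RV will do; E[Y] = 2

cs = np.linspace(0.0, 4.0, 401)
mse = [np.mean((Y - c) ** 2) for c in cs]
print(cs[np.argmin(mse)], Y.mean())             # the minimizing c is (approximately) E[Y]
```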

    What do we do now for classification?

  • 6. Bayes estimate for classification case

    (a) Special case: $Y = 0, 1$:

    Now suppose that $Y = \begin{cases} 1 & \text{if class 1} \\ 0 & \text{if class 2.} \end{cases}$

    Then it makes sense to use

    $$\hat{f}(\mathbf{x}) = E(Y\,|\,\mathbf{X} = \mathbf{x})$$

    again here:

  • [after this just choose the value $0$ or $1$ that $\hat{f}(\mathbf{x})$ is closest to]

    So we have

    $$\hat{f}(\mathbf{x}) = E(Y\,|\,\mathbf{X} = \mathbf{x}) = \underbrace{P(\text{class 1}\,|\,\mathbf{X} = \mathbf{x})}_{Y=1}\cdot 1 + \underbrace{P(\text{class 2}\,|\,\mathbf{X} = \mathbf{x})}_{Y=0}\cdot 0 = \underbrace{P(\text{class 1}\,|\,\mathbf{X} = \mathbf{x})}_{Y=1}.$$

  • Now take the real number value $\hat{f}(\mathbf{x})$, and choose a threshold on $\hat{f}(\mathbf{x})$ for class $Y = 0$ or $Y = 1$.

    For example, pick the value in $\{0, 1\}$ closer to $\hat{f}(\mathbf{x})$ and make that $Y$.

  • (b) General multiclass case: $Y = 1, \ldots, K$ again:

    Now again consider $K$ classes $1, \ldots, K$.

    Let $Y$ be a categorical RV with these $K$ possible values.

    Let us estimate the category $Y$ of input $\mathbf{x}$ by $\hat{f}(\mathbf{x})$.

  • Above (in the two class case $K = 2$) we used a real-valued $\hat{f}(\mathbf{x})$ and found the closest value ($Y = 0$ or $Y = 1$).

    Here (in the $K$ class case) we will let $\hat{f}(\mathbf{x})$ take on just the possible class values $1, \ldots, K$.

  • We have

    $$\hat{f}(\mathbf{x}) = \operatorname*{arg\,min}_{f(\mathbf{X})} E_{\mathbf{X}, Y}[L(Y, f(\mathbf{X}))] = \operatorname*{arg\,min}_{f(\mathbf{x})} E_{\mathbf{X}}\, E_{Y|\mathbf{x}}[L(Y, f(\mathbf{x}))\,|\,\mathbf{X} = \mathbf{x}]$$

    (this can again be minimized pointwise over the internal expectation expression)

  • Now instead of $f(\mathbf{x}) = c =$ a real number, we only allow $f(\mathbf{x}) = k$ with $1 \le k \le K$ an integer.

  • (d) Minimization for general case

    Thus the best $f(\mathbf{x}) = \hat{f}(\mathbf{x})$ minimizes the internal expression:

    $$\hat{f}(\mathbf{x}) = \operatorname*{argmin}_{k}\, E_{Y|\mathbf{x}}[L(Y, k)\,|\,\mathbf{X} = \mathbf{x}]$$

    (note $y \in \{1, \ldots, K\}$).

  • So

    $$\hat{f}(\mathbf{x}) = \operatorname*{argmin}_{k}\, \sum_{y=1}^{K} L(y, k)\, P(Y = y\,|\,\mathbf{X} = \mathbf{x}) \qquad (3)$$

    Often we choose $L(y, k)$ to be a 0/1 loss, i.e.

    $$L(y, k) = \begin{cases} 1 & \text{if } y \ne k \\ 0 & \text{if } y = k. \end{cases}$$

    Then in (3) note that $L(y, k) = 0$ if $y = k$ and $L(y, k) = 1$ if $y \ne k$. Thus rewrite (3) as

  • $$\hat{y}(\mathbf{x}) = \hat{f}(\mathbf{x}) = \operatorname*{argmin}_{k}\, \sum_{y:\, y \ne k} P(y\,|\,\mathbf{X} = \mathbf{x}) = \operatorname*{argmin}_{k}\, \big[1 - P(k\,|\,\mathbf{X} = \mathbf{x})\big] = \operatorname*{argmax}_{k}\, P(k\,|\,\mathbf{X} = \mathbf{x})$$

    Minimizing (3) means that we want $\hat{y}$ to be the $k$ that makes $P(k\,|\,\mathbf{X} = \mathbf{x})$ the largest.
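    A small numerical illustration of this rewrite (the posterior probabilities are made up; with a general loss matrix the minimizer of (3) and the argmax rule need not agree, but with 0/1 loss they do):

```python
import numpy as np

K = 4
posterior = np.array([0.1, 0.5, 0.3, 0.1])   # hypothetical P(k | X = x), k = 1..K
L = 1.0 - np.eye(K)                          # 0/1 loss: L[y, k] = 1 if y != k, else 0

expected_loss = L.T @ posterior              # entry k: sum_y L(y, k) P(y | x), as in (3)
print(np.argmin(expected_loss) + 1)          # minimizer of (3)   -> class 2
print(np.argmax(posterior) + 1)              # argmax_k P(k | x)  -> class 2 (same answer)
```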

    This formula for $\hat{f}(\mathbf{x})$ is called Bayes's rule, in this discrete case and the above continuous case.

    The full expected prediction error

    $$\text{EPE}(\hat{f}) \equiv E_{(\mathbf{X}, Y)}[L(Y, \hat{f}(\mathbf{X}))]$$

    above is called the Bayes (error) rate.

  • (e) Bayes decision boundary

    Finally

    $$\{\mathbf{x} : P\{Y = j\,|\,\mathbf{X} = \mathbf{x}\} = P\{Y = k\,|\,\mathbf{X} = \mathbf{x}\}\}$$

    is called the Bayes decision boundary between classes $j$ and $k$.

    As an example, if we consider $\mathbf{x} \in \mathbb{R}^2$:

  • So above the boundary $Y = j$ is more likely, while below it $Y = k$ is more likely. On the Bayes decision boundary they are equally likely.
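    As a concrete (hypothetical) instance in $\mathbb{R}^2$: two Gaussian classes with equal priors, where the boundary is the set of points at which the two posteriors (equivalently, here, the two class-conditional densities) are equal:

```python
import numpy as np
from scipy.stats import multivariate_normal

# hypothetical two-class model in R^2 with equal priors P(Y=j) = P(Y=k) = 1/2
class_j = multivariate_normal(mean=[0.0, 1.0], cov=np.eye(2))
class_k = multivariate_normal(mean=[0.0, -1.0], cov=np.eye(2))

def more_likely_class(x):
    # with equal priors, comparing posteriors reduces to comparing densities
    return "j" if class_j.pdf(x) > class_k.pdf(x) else "k"

print(more_likely_class([0.3, 2.0]))    # above the boundary -> "j"
print(more_likely_class([0.3, -2.0]))   # below the boundary -> "k"
print(np.isclose(class_j.pdf([5.0, 0.0]), class_k.pdf([5.0, 0.0])))  # on the boundary x2 = 0 -> True
```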

  • (f) Summary

    Special case $K = 2$ classes (again).

    Assume again that $Y = 1$ or $0$.

    Then we can again use a real-valued $\hat{f}(\mathbf{x})$ and then choose the closest value $Y = 0$ or $1$ to $\hat{f}(\mathbf{x})$.

    Then a good choice for $f(\mathbf{x})$ is

    $$f(\mathbf{x}) = P\{Y = 1\,|\,\mathbf{X} = \mathbf{x}\}.$$

  • [So if $f(\mathbf{x}) > 1/2$ we choose $Y = 1$; otherwise $Y = 0$]

    This would be approximated by

    $$\hat{f}(\mathbf{x}) = \hat{P}\{Y = 1\,|\,\mathbf{X} = \mathbf{x}\}; \qquad (4)$$

    [here $\hat{P}$ represents a probability estimated from the dataset $\mathcal{T} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$].
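    One common way to produce such an estimate $\hat{P}\{Y = 1\,|\,\mathbf{X} = \mathbf{x}\}$ from the dataset is logistic regression (listed in the agenda above); a hedged scikit-learn sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.standard_normal((300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.standard_normal(300) > 0).astype(int)

model = LogisticRegression().fit(X, y)
p_hat = model.predict_proba([[1.0, -0.5]])[0, 1]   # estimate of P(Y = 1 | X = x)
print(p_hat, int(p_hat > 0.5))                     # threshold at 1/2 as in the note above
```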

  • Recall that using (4) we choose the estimated $Y = \hat{y} \in \{0, 1\}$ as being the value closest to $\hat{f}(\mathbf{x})$.

    This estimate $\hat{y}$ of $y$ is equivalent to letting

    $$\hat{y} = \operatorname*{argmax}_{k = 0, 1}\, \hat{P}\{Y = k\,|\,\mathbf{X} = \mathbf{x}\}.$$

    More generally, with $K$ classes and $y = 1, \ldots, K$.

  • This becomes the prediction

    $$\hat{y}(\mathbf{x}) = \operatorname*{argmax}_{k}\, \hat{P}(Y = k\,|\,\mathbf{X} = \mathbf{x}).$$

    In the $K$ class case we can see

    $$\text{EPE}(\mathbf{x}) = P\{\hat{y}(\mathbf{x}) \ne Y\}$$

    [assuming $L(y, k) = \begin{cases} 1 & \text{if } y \ne k \\ 0 & \text{otherwise} \end{cases}$]

    [one can easily show this; note we again condition on $\mathbf{x}$.]