
  • Lecture 4

  • 1. Lasso

    The above motivates a similar algorithm, the Least Absolute Shrinkage and Selection Operator, i.e., LASSO.

    Assume we have a training set $\mathcal{T} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$.

  • As before assume that both $\mathbf{x}$ and $y$ have been centered, i.e. that

    $$\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^N \mathbf{x}_i = \mathbf{0}; \qquad \bar{y} = \frac{1}{N}\sum_{i=1}^N y_i = 0.$$

    This time we will replace the $\|\beta\|^2$ sum of squares regularization term in ridge regression with a new term,

    $$\|\beta\|_1 = \sum_{j=1}^p |\beta_j|.$$

  • We have as before

    "s œLasso

    argmin" ðóóóóóóóóóóóóóóóñóóóóóóóóóóóóóóóò"#Ðy X y X Ð Ñ m Þ" " ")X "- m

    _œLagrangian

    (3)[Note that for convenience we include the

    factor of in front in (3); this is used in the"#book and the same as substituting ]- -Ä #
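    As an illustration (not part of the lecture), here is a minimal numpy sketch of the Lagrangian objective (3); the function name `lasso_objective`, the variable `lam`, and the random data are all hypothetical:

```python
import numpy as np

def lasso_objective(beta, X, y, lam):
    """Lagrangian (3): (1/2)||y - X beta||^2 + lambda * ||beta||_1."""
    residual = y - X @ beta
    return 0.5 * residual @ residual + lam * np.abs(beta).sum()

# made-up centered data, just to evaluate the objective once
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
X -= X.mean(axis=0)                      # center the columns
y = X[:, 0] - 2 * X[:, 2] + rng.standard_normal(50)
y -= y.mean()                            # center the response
print(lasso_objective(np.zeros(5), X, y, lam=1.0))
```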

  • Equivalently, it can be shown using (reverse) Lagrange multipliers that we are minimizing

    $$\text{RSS} = (\mathbf{y} - X\beta)^T(\mathbf{y} - X\beta) = \|\mathbf{y} - X\beta\|^2 \qquad (4a)$$

    subject to a constraint of the form

    $$\sum_{j=1}^p |\beta_j| \le t_\lambda \qquad (4b)$$

    for some $t_\lambda$ which depends on $\lambda$.

  • This method is LASSO.

    Note that if as usual

    " "s œ ´ m m ßols argmin argmin" "

    RSS y X #

    then adding the constraint means that(4b)we have the following diagram:

  • Fig. 1: Feasible set of the constraint (4b) (blue) and the point (corner) $\beta$ that makes $\text{RSS} = \|\mathbf{y} - X\beta\|^2$ smallest.

  • Note that above we again assume we have centered data and centered $y$ values, i.e., $\bar{y} = 0$.

    Note above that the solution (intersection of red and blue regions) has $\beta_2 = 0$, i.e., this method zeroes out some of the $\beta_i$ instead of just shrinking them.

  • Above: Trevor Hastie, 2007 (https://web.stanford.edu/~hastie/TALKS/bradfest.pdf)

    The above figure shows the growth of the coefficients for a least squares regression as the $L_1$ constraint $t = \sum_j |\beta_j|$ is increased.

    Each curve traces the evolution of one of the coefficients $\beta_j$ as a function of $t$.

    Increasing $t$ is equivalent to decreasing the coefficient $\lambda$ in the above version (3).

  • Note: one can show the coefficients change piecewise linearly with $t$.

  • Notes:

    1. A convex optimization algorithm can be used to solve this problem for $\hat{\beta}^{\text{Lasso}}$.

    2. An alternative algorithm, Least Angle Regression (LAR), does something similar with regard to variable selection.

    3. We can choose $\lambda$ by cross-validation.

  • 4. This exact method also arises independently from Bayesian statistics through the use of a double exponential (Laplace) prior probability distribution on $\beta$, rather than a Gaussian prior distribution (a Gaussian prior distribution gives ridge regression); see the side calculation sketched after this list.

    5. This method has the ability to drop variables, and it shrinks their coefficients if they are not dropped.
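  • [A side calculation expanding note 4 (a standard argument, not spelled out in the slides): if $\mathbf{y}\,|\,X, \beta \sim N(X\beta, \sigma^2 I)$ and the $\beta_j$ have independent double exponential (Laplace) priors with density $\frac{1}{2b}e^{-|\beta_j|/b}$, then

    $$-\log p(\beta\,|\,\mathbf{y}, X) = \frac{1}{2\sigma^2}\|\mathbf{y} - X\beta\|^2 + \frac{1}{b}\sum_{j=1}^p |\beta_j| + \text{const},$$

    so the MAP estimate solves the Lasso problem (3) with $\lambda = \sigma^2/b$; a Gaussian prior replaces the second term by a multiple of $\|\beta\|^2$, giving ridge regression.]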

  • 2. Example: orthogonal regression

    Consider the simplified situation where the data matrix $X$ happens to have orthonormal columns.

    We have in this case (see text, problems) that

    $$\hat{\beta}_j^{\text{Lasso}} = \operatorname{sgn}(\hat{\beta}_j^{\text{ols}})\,\big(|\hat{\beta}_j^{\text{ols}}| - \lambda\big)_+ \qquad (5)$$

  • where OLS stands for ordinary least squares,

    and $(a)_+ = \begin{cases} a & \text{if } a > 0 \\ 0 & \text{otherwise.} \end{cases}$

    Here, $\operatorname{sgn}(a) = \begin{cases} 1 & \text{if } a \ge 0 \\ -1 & \text{if } a < 0. \end{cases}$

    [Equation (5) implies $\hat{\beta}_j^{\text{Lasso}}$ has the same sign as $\hat{\beta}_j^{\text{ols}}$ (since it has the same sign as $\operatorname{sgn}(\hat{\beta}_j^{\text{ols}})$), but it is shrunk.]
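    A tiny numerical sketch of equation (5) (illustrative only; the name `soft_threshold` and the sample values are mine, not the lecture's):

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Entrywise sgn(beta_ols) * (|beta_ols| - lam)_+ , i.e. equation (5);
    with orthonormal columns of X this is the Lasso solution."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

print(soft_threshold(np.array([2.5, -0.7, 0.3]), lam=1.0))
# -> [ 1.5 -0.   0. ]   large coefficients are shrunk, small ones are zeroed out
```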

  • Note one can show that solving the constraint version (4a), (4b) of Lasso above yields

    $$\hat{\beta}^{\text{Lasso}} = \hat{\beta}^{\text{ols}}$$

    if $t_\lambda \ge \sum_{j=1}^p |\hat{\beta}_j^{\text{ols}}|$.

    This can then be shown to correspond to $\lambda = 0$ in the regularization version (3) of Lasso.

  • More on the algorithm: finding

    argmin"

    Ðy X y X " "Ñ Ð ÑX

    such that is a 4œ"

    :

    4l l Ÿ >" quadratic

    programming problem, i.e. a problem thatcan be solved by the well known and easyto use. quadratic programming algorithm

  • More detail: we are effectively trying to solve a problem of the form

    $$\operatorname*{argmin}_{\mathbf{x}}\, \mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T\mathbf{x} \equiv \operatorname*{argmin}_{\mathbf{x}} f(\mathbf{x}) \qquad (15)$$

    with the constraint that $A\mathbf{x} \le \mathbf{b}$ and $E\mathbf{x} = \mathbf{d}$.

  • If the matrix $Q$ is positive semi-definite (i.e., all eigenvalues are non-negative), then this implies that $f(\mathbf{x})$ is convex. A global minimum exists if at least one $\mathbf{x}$ satisfies the constraints of (15) and $f$ is bounded below on the feasibility region (i.e. the region satisfying the constraints).

    Typical methods for solving this include:
    (a) interior point methods
    (b) active set methods
    (c) conjugate gradient methods
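    As a rough illustration (not the lecture's algorithm), the constrained form (4a)-(4b) can be handed to a generic constrained optimizer; the sketch below uses scipy's SLSQP rather than a dedicated QP solver, and all data and names are made up:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.standard_normal(100)

t = 2.0                                            # the L1 budget in (4b)
rss = lambda b: np.sum((y - X @ b) ** 2)           # objective (4a)
l1_budget = {"type": "ineq",                       # (4b): t - ||b||_1 >= 0
             "fun": lambda b: t - np.abs(b).sum()}

res = minimize(rss, x0=np.zeros(5), constraints=[l1_budget], method="SLSQP")
print(np.round(res.x, 3))                          # some coefficients driven to (near) zero
```

    In practice one substitutes $\beta = \beta^+ - \beta^-$ with $\beta^{\pm} \ge 0$, which turns the problem into a smooth quadratic program in $2p$ variables that the methods (a)-(c) above handle directly.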

  • 3. An alternative to LASSO -- Least Angle Regression (LAR)

    Definition 1: Given two vectors $\mathbf{x}$ and $\mathbf{y}$, their correlation is

    $$\operatorname{Corr}(\mathbf{x}, \mathbf{y}) \equiv \frac{\operatorname{Cov}(\mathbf{x}, \mathbf{y})}{\sqrt{V(\mathbf{x})\,V(\mathbf{y})}},$$

    where $\operatorname{Cov}(\mathbf{x}, \mathbf{y}) = \sum_i (x_i - \bar{x})(y_i - \bar{y})$, $V(\mathbf{x}) = \sum_i (x_i - \bar{x})^2$, and $\bar{x} = \frac{1}{p}\sum_{i=1}^p x_i$.

  • The LAR (least angle regression) algorithm:

    1. Start with $\hat{\beta}_j = 0$ for $j \ge 1$ and $\hat{\mathbf{y}} = \mathbf{0}$. Let $\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y}$.

    2. Find an $\tilde{\mathbf{x}}_j$ (column of $X$) such that $\operatorname{Corr}(\tilde{\mathbf{x}}_j, \mathbf{r})$ is maximized.

  • 3. Increase $\hat{\beta}_j$ from $0$ towards $\operatorname{sgn}(\operatorname{Corr}(\tilde{\mathbf{x}}_j, \mathbf{r}))$, computing the current residual $\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}$.

    4. Stop when $\exists\,\tilde{\mathbf{x}}_k$ such that $|\operatorname{Corr}(\tilde{\mathbf{x}}_k, \mathbf{r})| = |\operatorname{Corr}(\tilde{\mathbf{x}}_j, \mathbf{r})|$.

  • 5. Increase $(\hat{\beta}_j, \hat{\beta}_k)$ in their joint OLS (ordinary least squares) direction based on the current residual $\mathbf{r}$ (their correlations with $\mathbf{r}$ stay the same).

    Stop when there is an $\tilde{\mathbf{x}}_l$ that catches up, with $|\operatorname{Corr}(\tilde{\mathbf{x}}_l, \mathbf{r})| = |\operatorname{Corr}(\tilde{\mathbf{x}}_j, \mathbf{r})|$, then proceed similarly with $\tilde{\mathbf{x}}_l$ included with $\tilde{\mathbf{x}}_j, \tilde{\mathbf{x}}_k$ (note these are correlations across samples).

  • 6. Continue until all $\tilde{\mathbf{x}}_i$ have been included; we eventually arrive at the ordinary OLS solution.

    7. If we stop early, then we have not yet included the least needed variables.

    8. Complexity is $O(p^3 + Np^2)$, the same order as a single least squares fit.

    The behavior of the evolving coefficient vector $\beta(s)$ (parameterized appropriately) is *very* similar to the evolution $\beta(\lambda)$ of the LASSO coefficients as $\lambda$ increases:

  • Above: Trevor Hastie, Least Angle Regression, 2007 (https://web.stanford.edu/~hastie/TALKS/bradfest.pdf)

    [There is an iterative algorithm for LASSO that works very similarly to LAR.]

    [The so-called LARS package (Stanford) combines Lasso, LAR, and Forward Stagewise regression.]
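    For experimentation, scikit-learn provides an analogue of the LARS package; a hedged usage sketch (the data are made up, and I am assuming scikit-learn's `lars_path` interface):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 8))
y = X[:, 0] - 3 * X[:, 4] + rng.standard_normal(200)

# method="lar" gives the LAR path; method="lasso" gives the (very similar) Lasso path
alphas, active, coefs = lars_path(X, y, method="lar")
print(active)           # order in which variables enter the model
print(coefs.shape)      # (n_features, n_steps): the piecewise-linear coefficient paths
```

    The columns of `coefs` trace out coefficient paths like those in the Hastie figure above.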

  • 4. Coordinate descent

    The coordinate descent method can numerically implement LASSO in an efficient way.

    Start by writing (assume again $\mathbf{x}, \mathbf{y}$ centered, i.e., $\bar{x} = \bar{y} = 0$):

  • " "

    # # l œ ly X y X y X " " " " "# X- -l l"

    œ C B Ð Ñ l Ð Ñl"

    3œ"

    R

    3 35 5 5

    5œ" 5œ"

    : :#

    " - - " -

    [note that entry of B œ Ð3ß 5Ñ35 X xœ Ð Ñ ß3 5 i.e.5 3>2 >2 entry of sample]

  • [Now fix $j$, and freeze $\beta_k$ to a fixed value $\tilde{\beta}_k$ if $k \ne j$; note $\beta_j$ is the currently active variable]

    $$= \frac{1}{2}\sum_{i=1}^N \Big(y_i - \underbrace{\sum_{k\ne j} x_{ik}\tilde{\beta}_k}_{\tilde{y}_i} - x_{ij}\beta_j\Big)^2 + \lambda\sum_{k\ne j}|\tilde{\beta}_k| + \lambda|\beta_j|$$

    where we define $\tilde{y}_i$ as the underbraced term $\sum_{k\ne j} x_{ik}\tilde{\beta}_k$.

    [now optimize $\beta_j$ only; keep all other $\beta_k = \tilde{\beta}_k$ fixed]

  • $$= \frac{1}{2}\sum_{i=1}^N \big([y_i - \tilde{y}_i] - x_{ij}\beta_j\big)^2 + \lambda|\beta_j| + \underbrace{\lambda\sum_{k\ne j}|\tilde{\beta}_k|}_{\text{frozen; can leave out}}$$

    Now consider (writing $z_i = z_i^{(j)} = y_i - \tilde{y}_i$):

    $$\min_{\beta_j}\ \frac{1}{2}\sum_{i=1}^N (z_i - x_{ij}\beta_j)^2 + \lambda|\beta_j|$$

    (assume the normalization $\sum_i x_{ij}^2 = 1$)

  • $$= \frac{1}{2}\sum_i \big[x_{ij}^2\beta_j^2 - 2 z_i x_{ij}\beta_j + z_i^2\big] \pm \lambda\beta_j = \frac{1}{2}\beta_j^2 - \beta_j\sum_i z_i x_{ij} + \frac{1}{2}\sum_i z_i^2 \pm \lambda\beta_j.$$

    This yields a value curve as a function of $\beta_j$ which looks like:

  • So we just minimize this piecewise-quadratic expression over $\beta_j$; a sketch of the resulting update is below.
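    Putting the pieces together, here is a compact sketch of the resulting coordinate descent loop (assuming, as above, centered data and columns normalized so $\sum_i x_{ij}^2 = 1$; the function names and data are mine):

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/2)||y - X beta||^2 + lam * ||beta||_1 by cycling over coordinates."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual z_i = y_i - sum_{k != j} x_ik beta_k
            z = y - X @ beta + X[:, j] * beta[j]
            # the one-dimensional problem (1/2)beta_j^2 - beta_j <x_j, z> + lam|beta_j|
            # is minimized by soft-thresholding <x_j, z> at lam
            beta[j] = soft_threshold(X[:, j] @ z, lam)
    return beta

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 6))
X -= X.mean(axis=0)
X /= np.sqrt((X ** 2).sum(axis=0))       # normalize so sum_i x_ij^2 = 1
y = 4 * X[:, 1] - 3 * X[:, 3] + 0.5 * rng.standard_normal(100)
y -= y.mean()
print(np.round(lasso_coordinate_descent(X, y, lam=0.5), 3))
```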

  • 5. Linear Methods for Classification

    Our agenda: now look at classification rather than regression.

    1. Background on classification

    2. Three methods:
    (a) Linear discriminant analysis
    (b) Logistic regression
    (c) Separating hyperplanes

  • (a) Assumptions

    Recall: Regression vs. classification

    We always want to predict $y$ from $\mathbf{x} = (x_1, \ldots, x_p) \in \mathbb{R}^p$.

    Regression case: $y \in \mathbb{R}$ is a real number.

    Classification case: $y \in \{1, \ldots, K\}$,

    where $K$ = the number of classes.

  • In both cases: We always begin by assuming that the values $(\mathbf{x}, y)$ are instances of underlying random variables (RVs) $(\mathbf{X}, Y)$.

    Assume we have a fixed joint probability distribution $P$ on the space

    $$\mathbb{R}^p \times \mathcal{Y} = \{(\mathbf{x}, y) : \mathbf{x} \in \mathbb{R}^p,\ y \in \mathcal{Y}\}$$

    of possible values.

  • Here

    $$\mathcal{Y} = \begin{cases} \mathbb{R} & \text{in the regression case} \\ \{1, \ldots, K\} & \text{in the classification case.} \end{cases}$$

    Specifically, we will say that for a (measurable) set $A \subseteq \mathbb{R}^p \times \mathcal{Y}$ of possible values of $(\mathbf{X}, Y)$,

    $$\text{Probability that } (\mathbf{X}, Y) \in A \ \equiv\ P((\mathbf{X}, Y) \in A) \equiv P(A).$$

    In short we write

  • $$(\mathbf{X}, Y) \sim P(\mathbf{x}, y)$$

    In the regression case we often assume $y \in \mathbb{R}$, and that the measure $P$ has a joint probability density function

    $$\rho(\mathbf{x}, y) = \rho(x_1, \ldots, x_p, y).$$

    That is, for a set $A$ of possible values of $(\mathbf{x}, y)$:

  • $$P\{(\mathbf{x}, y) \in A\} = \int_A \rho(\mathbf{x}, y)\, d\mathbf{x}\, dy = \int_A \rho(x_1, \ldots, x_p, y)\, dx_1\, dx_2 \cdots dx_p\, dy$$

    where $\rho(\mathbf{x}, y) = \rho(x_1, \ldots, x_p, y)$ = the joint density function of $(X_1, \ldots, X_p, Y)$ (before we see them).

  • Our goal is to predict $Y$ through some function $f(\mathbf{X})$.

  • (b) Bayes method

    Emphasis: we will start by using the so-called Bayes method, in which the joint probability density function $\rho(\mathbf{x}, y)$ is assumed fully known for now.

    What is a good 'guess' of $Y$ given $\mathbf{X}$?

    If we actually know the joint distribution of $(\mathbf{X}, Y)$ (Bayes method):

  • We will find the 'best' guess at $Y$ by minimizing a given loss (penalty) function

    $$L(y, f(\mathbf{x})) = \text{estimate of our loss if the true value is } y \text{ but we guess } f(\mathbf{x}).$$

    So we choose our best estimate $\hat{f}(\mathbf{x})$ of $f(\mathbf{x})$ by first viewing $\mathbf{x} \to \mathbf{X}$ and $y \to Y$ as random variables, and then replacing these random variables again by $\mathbf{x}$ and $y$.

    We have

  • [note $\mathbf{X} = (X_1, \ldots, X_p)$ is not the data matrix here!]

    $$\hat{f} = \operatorname*{argmin}_{f} \text{EPE}(f) = \operatorname*{argmin}_{f} E_{(\mathbf{X}, Y)}[L(Y, f(\mathbf{X}))] = \operatorname*{argmin}_{f} E_{\mathbf{X}}\Big(\underbrace{E_{Y|\mathbf{x}}[L(Y, f(\mathbf{x}))\,|\,\mathbf{X} = \mathbf{x}]}_{\equiv\, E_{Y|\mathbf{X}}[L(Y, f(\mathbf{X}))]}\Big) \qquad (1)$$

  • [Note the notational convenience of writing

    $$E_{Y|\mathbf{X}}[L(Y, f(\mathbf{X}))\,|\,\mathbf{X} = \mathbf{x}] \equiv E_{Y|\mathbf{x}}[L(Y, f(\mathbf{x}))]. \qquad (1a)$$

    Both sides of (1a) mean the same thing since $\mathbf{X} = \mathbf{x}$ is fixed; the right side of (1a) is the more convenient notation.]

  • Remark: Note we have defined a new random variable $Y|\mathbf{x}$.

    This refers to the values of $Y$ conditioned on $\mathbf{X} = \mathbf{x}$, i.e., on the assumption that the rv $\mathbf{X}$ takes on a specific value.

    The probability measure (distribution) of $Y|\mathbf{x}$ on $\mathbb{R}$ is the conditional probability distribution of $Y$ given $\mathbf{X} = \mathbf{x}$.

  • [See probability notes for a more careful definition of conditional probability distributions]

  • (c) Minimization of error

    Note that to minimize the penalty (1) it is sufficient to minimize the inside expression

    $$E_{Y|\mathbf{x}}[L(Y, f(\mathbf{x}))\,|\,\mathbf{X} = \mathbf{x}]$$

    pointwise for each fixed $\mathbf{x}$.

  • For regression (with a continuous variable $Y \in \mathbb{R}$) we often choose squared error loss:

    $$L(Y, f(\mathbf{X})) = (Y - f(\mathbf{X}))^2.$$

    In this case we obtain

    $$\hat{f}(\mathbf{x}) = \operatorname*{arg\,min}_{f(\mathbf{x})} E_{Y|\mathbf{x}}[(Y - f(\mathbf{x}))^2\,|\,\mathbf{X} = \mathbf{x}] = \operatorname*{arg\,min}_{c} E_{Y|\mathbf{x}}[(Y - c)^2\,|\,\mathbf{X} = \mathbf{x}]$$

  • $$= E[Y|\mathbf{x}] \equiv E(Y\,|\,\mathbf{X} = \mathbf{x}) \qquad (2)$$

    [where now $\mathbf{x}$ is frozen, so we define the dummy variable $c = f(\mathbf{x})$]

    Note that above we use the following easy fact:

    For any random variable (RV) $Y$, the number $c$ that minimizes the average value of $(Y - c)^2$ is $c = E(Y)$.

  • So (2) shows that $\hat{f}(\mathbf{x}) = E(Y|\mathbf{x})$.
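    A quick numerical sanity check of this easy fact (purely illustrative; the exponential distribution used here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
Y = rng.exponential(scale=2.0, size=100_000)    # any RV will do; E[Y] = 2

cs = np.linspace(0.0, 4.0, 401)
mse = [np.mean((Y - c) ** 2) for c in cs]
print(cs[np.argmin(mse)], Y.mean())             # the minimizing c is (approximately) E[Y]
```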

    What do we do now for classification?

  • 6. Bayes estimate for classification case

    (a) Special case: $Y = 0, 1$:

    Now suppose that $Y = \begin{cases} 1 & \text{if class 1} \\ 0 & \text{if class 2.} \end{cases}$

    Then it makes sense to use

    $$\hat{f}(\mathbf{x}) = E(Y\,|\,\mathbf{X} = \mathbf{x})$$

    again here:

  • [after this just choose the value $0$ or $1$ that $\hat{f}(\mathbf{x})$ is closest to]

    So we have

    $$\hat{f}(\mathbf{x}) = E(Y\,|\,\mathbf{X} = \mathbf{x}) = \underbrace{P(\text{class 1}\,|\,\mathbf{X} = \mathbf{x})}_{Y=1}\cdot 1 + \underbrace{P(\text{class 2}\,|\,\mathbf{X} = \mathbf{x})}_{Y=0}\cdot 0 = \underbrace{P(\text{class 1}\,|\,\mathbf{X} = \mathbf{x})}_{Y=1}.$$

  • Now take the real number value $\hat{f}(\mathbf{x})$, and choose a threshold on $\hat{f}(\mathbf{x})$ for class $Y = 0$ or $Y = 1$.

    For example, pick the value in $\{0, 1\}$ closer to $\hat{f}(\mathbf{x})$ and make that $Y$.

  • (b) General multiclass case: $Y = 1, \ldots, K$ again:

    Now again consider $K$ classes $1, \ldots, K$.

    Let $Y$ be a categorical RV with these $K$ possible values.

    Let us estimate the category $Y$ of input $\mathbf{x}$ by $\hat{f}(\mathbf{x})$.

  • Above (in the two class case $K = 2$) we used a real-valued $\hat{f}(\mathbf{x})$ and found the closest value ($Y = 0$ or $Y = 1$).

    Here (in the $K$ class case) we will let $\hat{f}(\mathbf{x})$ take on just the possible class values $1, \ldots, K$.

  • We have

    $$\hat{f}(\mathbf{x}) = \operatorname*{arg\,min}_{f(\mathbf{X})} E_{\mathbf{X}, Y}[L(Y, f(\mathbf{X}))] = \operatorname*{arg\,min}_{f(\mathbf{x})} E_{\mathbf{X}}\, E_{Y|\mathbf{x}}[L(Y, f(\mathbf{x}))\,|\,\mathbf{X} = \mathbf{x}]$$

    (this can again be minimized pointwise over the internal expectation expression)

  • Now instead of $f(\mathbf{x}) = c =$ a real number, we only allow $f(\mathbf{x}) = k$ with $1 \le k \le K$ an integer.

  • (d) Minimization for general case

    Thus the best $f(\mathbf{x}) = \hat{f}(\mathbf{x})$ minimizes the internal expression:

    $$\hat{f}(\mathbf{x}) = \operatorname*{argmin}_{k}\, E_{Y|\mathbf{x}}[L(Y, k)\,|\,\mathbf{X} = \mathbf{x}]$$

    (note $y \in \{1, \ldots, K\}$).

  • So

    $$\hat{f}(\mathbf{x}) = \operatorname*{argmin}_{k}\, \sum_{y=1}^{K} L(y, k)\, P(Y = y\,|\,\mathbf{X} = \mathbf{x}) \qquad (3)$$

    Often we choose $L(y, k)$ to be a 0/1 loss, i.e.

    $$L(y, k) = \begin{cases} 1 & \text{if } y \ne k \\ 0 & \text{if } y = k. \end{cases}$$

    Then in (3) note that $L(y, k) = 0$ if $y = k$ and $L(y, k) = 1$ if $y \ne k$. Thus rewrite (3) as

  • $$\hat{y}(\mathbf{x}) = \hat{f}(\mathbf{x}) = \operatorname*{argmin}_{k}\, \sum_{y:\, y \ne k} P(y\,|\,\mathbf{X} = \mathbf{x}) = \operatorname*{argmin}_{k}\, \big[1 - P(k\,|\,\mathbf{X} = \mathbf{x})\big] = \operatorname*{argmax}_{k}\, P(k\,|\,\mathbf{X} = \mathbf{x})$$

    Minimizing (3) means that we want $\hat{y}$ to be the $k$ that makes $P(k\,|\,\mathbf{X} = \mathbf{x})$ the largest.
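    A small numerical illustration of this rewrite (the posterior probabilities are made up; with a general loss matrix the minimizer of (3) and the argmax rule need not agree, but with 0/1 loss they do):

```python
import numpy as np

K = 4
posterior = np.array([0.1, 0.5, 0.3, 0.1])   # hypothetical P(k | X = x), k = 1..K
L = 1.0 - np.eye(K)                          # 0/1 loss: L[y, k] = 1 if y != k, else 0

expected_loss = L.T @ posterior              # entry k: sum_y L(y, k) P(y | x), as in (3)
print(np.argmin(expected_loss) + 1)          # minimizer of (3)   -> class 2
print(np.argmax(posterior) + 1)              # argmax_k P(k | x)  -> class 2 (same answer)
```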

    This formula for $\hat{f}(\mathbf{x})$ is called Bayes's rule, in this discrete case and the above continuous case.

    The full expected prediction error

    $$\text{EPE}(\hat{f}) \equiv E_{(\mathbf{X}, Y)}[L(Y, \hat{f}(\mathbf{X}))]$$

    above is called the Bayes (error) rate.

  • (e) Bayes decision boundary

    Finally

    $$\{\mathbf{x} : P\{Y = j\,|\,\mathbf{X} = \mathbf{x}\} = P\{Y = k\,|\,\mathbf{X} = \mathbf{x}\}\}$$

    is called the Bayes decision boundary between classes $j$ and $k$.

    As an example, if we consider $\mathbf{x} \in \mathbb{R}^2$:

  • So above the boundary $Y = j$ is more likely, while below it $Y = k$ is more likely. On the Bayes decision boundary they are equally likely.
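    As a concrete (hypothetical) instance in $\mathbb{R}^2$: two Gaussian classes with equal priors, where the boundary is the set of points at which the two posteriors (equivalently, here, the two class-conditional densities) are equal:

```python
import numpy as np
from scipy.stats import multivariate_normal

# hypothetical two-class model in R^2 with equal priors P(Y=j) = P(Y=k) = 1/2
class_j = multivariate_normal(mean=[0.0, 1.0], cov=np.eye(2))
class_k = multivariate_normal(mean=[0.0, -1.0], cov=np.eye(2))

def more_likely_class(x):
    # with equal priors, comparing posteriors reduces to comparing densities
    return "j" if class_j.pdf(x) > class_k.pdf(x) else "k"

print(more_likely_class([0.3, 2.0]))    # above the boundary -> "j"
print(more_likely_class([0.3, -2.0]))   # below the boundary -> "k"
print(np.isclose(class_j.pdf([5.0, 0.0]), class_k.pdf([5.0, 0.0])))  # on the boundary x2 = 0 -> True
```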

  • (f) Summary

    Special case $K = 2$ classes (again).

    Assume again that $Y = 1$ or $0$.

    Then we can again use a real-valued $\hat{f}(\mathbf{x})$ and then choose the closest value $Y = 0$ or $1$ to $\hat{f}(\mathbf{x})$.

    Then a good choice for $f(\mathbf{x})$ is

    $$f(\mathbf{x}) = P\{Y = 1\,|\,\mathbf{X} = \mathbf{x}\}.$$

  • [So if $f(\mathbf{x}) > 1/2$ we choose $Y = 1$; otherwise $Y = 0$]

    This would be approximated by

    $$\hat{f}(\mathbf{x}) = \hat{P}\{Y = 1\,|\,\mathbf{X} = \mathbf{x}\}; \qquad (4)$$

    [here $\hat{P}$ represents a probability estimated from the dataset $\mathcal{T} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$].
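    One common way to produce such an estimate $\hat{P}\{Y = 1\,|\,\mathbf{X} = \mathbf{x}\}$ from the dataset is logistic regression (listed in the agenda above); a hedged scikit-learn sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.standard_normal((300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.standard_normal(300) > 0).astype(int)

model = LogisticRegression().fit(X, y)
p_hat = model.predict_proba([[1.0, -0.5]])[0, 1]   # estimate of P(Y = 1 | X = x)
print(p_hat, int(p_hat > 0.5))                     # threshold at 1/2 as in the note above
```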

  • Recall that using (4) we choose the estimated $Y = \hat{y} \in \{0, 1\}$ as being the value closest to $\hat{f}(\mathbf{x})$.

    This estimate $\hat{y}$ of $y$ is equivalent to letting

    $$\hat{y} = \operatorname*{argmax}_{k = 0, 1}\, \hat{P}\{Y = k\,|\,\mathbf{X} = \mathbf{x}\}.$$

    More generally, with $K$ classes and $y = 1, \ldots, K$.

  • This becomes the prediction

    $$\hat{y}(\mathbf{x}) = \operatorname*{argmax}_{k}\, \hat{P}(Y = k\,|\,\mathbf{X} = \mathbf{x}).$$

    In the $K$ class case we can see

    $$\text{EPE}(\mathbf{x}) = P\{\hat{y}(\mathbf{x}) \ne Y\}$$

    [assuming $L(y, k) = \begin{cases} 1 & \text{if } y \ne k \\ 0 & \text{otherwise} \end{cases}$]

    [one can easily show this; note we again condition on $\mathbf{x}$.]