

CSE 446: Machine Learning

CSE 446: Machine Learning
Emily Fox, University of Washington
January 18, 2017

©2017 Emily Fox

Lasso Regression: Regularization for feature selection

CSE 446: Machine Learning

Feature selection task

©2017 Emily Fox


CSE 446: Machine Learning 3

Efficiency:
-  If size(w) = 100B, each prediction is expensive
-  If ŵ is sparse, computation only depends on the # of non-zeros

Interpretability: -  Which features are relevant for prediction?

Why might you want to perform feature selection?

©2017 Emily Fox

With many zeros in ŵ, the prediction sums only over the non-zero coefficients:

ŷi = Σ_{j : ŵj ≠ 0} ŵj hj(xi)
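To make the efficiency point concrete, here is a minimal NumPy sketch of prediction with a sparse coefficient vector; the names H (feature matrix with columns hj) and w_hat are illustrative, not from the slides.

```python
import numpy as np

def predict_sparse(H, w_hat):
    """Predict y_hat_i = sum_j w_hat[j] * h_j(x_i), using only non-zero coefficients.

    H: (N, D) matrix whose column j holds h_j(x_i) for all points.
    w_hat: (D,) coefficient vector, assumed mostly zeros.
    """
    nz = np.flatnonzero(w_hat)        # indices j with w_hat[j] != 0
    return H[:, nz] @ w_hat[nz]       # cost scales with the # of non-zeros, not D
```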

CSE 446: Machine Learning 4

Sparsity: Housing application

[Figure: predict a house's sale price ($ ?) from a long list of candidate features: lot size, single family, year built, last sold price, last sale price/sq.ft., finished sq.ft., unfinished sq.ft., finished basement sq.ft., # floors, flooring types, parking type, parking amount, cooling, heating, exterior materials, roof type, structure style, dishwasher, garbage disposal, microwave, range/oven, refrigerator, washer, dryer, laundry location, heating type, jetted tub, deck, fenced yard, lawn, garden, sprinkler system.]

©2017 Emily Fox


CSE 446: Machine Learning

Option 1: All subsets or greedy variants

©2017 Emily Fox

CSE 446: Machine Learning 6

Exhaustive approach: “all subsets”

Consider all possible models, each using a subset of features and indexed by which features are included.

How many models were evaluated?

©2017 Emily Fox

yi = εi

yi = w0 h0(xi) + εi

yi = w1 h1(xi) + εi

yi = w0 h0(xi) + w1 h1(xi) + εi

…

yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi

[0 0 0 … 0 0 0]

[1 0 0 … 0 0 0]

[0 1 0 … 0 0 0]

[1 1 0 … 0 0 0]

[1 1 1 … 1 1 1]

2^D possible models

2^8 = 256
2^30 = 1,073,741,824
2^1000 ≈ 1.07 x 10^301
2^100B = HUGE!!!!!!

Typically, computationally infeasible
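A minimal sketch of the "all subsets" search, assuming a design matrix H whose columns are the features (including a constant column) and scoring each subset by its RSS with scikit-learn's LinearRegression; the function name and scoring choice are illustrative. It makes the 2^D blow-up obvious: the loop body runs once per subset.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression

def all_subsets_rss(H, y):
    """Fit one linear model per subset of the D columns of H: 2^D models in total."""
    D = H.shape[1]
    scores = {}
    for k in range(D + 1):
        for subset in combinations(range(D), k):
            cols = list(subset)
            if cols:
                model = LinearRegression(fit_intercept=False).fit(H[:, cols], y)
                rss = np.sum((y - model.predict(H[:, cols])) ** 2)
            else:
                rss = np.sum(y ** 2)   # empty model: y_i = eps_i
            scores[subset] = rss
    return scores                       # 2^D entries: infeasible beyond small D
```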


CSE 446: Machine Learning 7

Choosing model complexity?

Option 1: Assess on validation set

Option 2: Cross validation

Option 3+: Other metrics for penalizing model complexity like BIC…

©2017 Emily Fox

CSE 446: Machine Learning 8

Greedy algorithms

-  Forward stepwise: start from a simple model and iteratively add the feature most useful to the fit (a sketch of this appears below)
-  Backward stepwise: start with the full model and iteratively remove the feature least useful to the fit
-  Combining forward and backward steps: in the forward algorithm, insert steps to remove features that are no longer as important
-  Lots of other variants, too.

©2017 Emily Fox
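A sketch of greedy forward selection under the same assumptions as above (design matrix H, least squares fit, RSS as the score); the stopping rule used here, a fixed number of features, is just one illustrative choice, and in practice you would choose it on a validation set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(H, y, max_features):
    """Greedily add, one at a time, the feature that most reduces RSS."""
    D = H.shape[1]
    selected, remaining = [], list(range(D))
    while remaining and len(selected) < max_features:
        def rss_with(j):
            cols = selected + [j]
            model = LinearRegression().fit(H[:, cols], y)
            return np.sum((y - model.predict(H[:, cols])) ** 2)
        best = min(remaining, key=rss_with)   # feature most useful to the fit
        selected.append(best)
        remaining.remove(best)
    return selected
```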


CSE 446: Machine Learning

Option 2: Regularize

©2017 Emily Fox

CSE 446: Machine Learning 10

Ridge regression: L2 regularized regression

Total cost =

measure of fit + λ measure of magnitude of coefficients

©2017 Emily Fox

RSS(w)  +  λ||w||2²   where   ||w||2² = w0² + … + wD²

Encourages small weights but not exactly 0
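A tiny sketch of the ridge objective as a function, assuming a design matrix H (columns hj, including the constant feature) and penalty lam; purely illustrative.

```python
import numpy as np

def ridge_cost(w, H, y, lam):
    """Total cost = RSS(w) + lambda * ||w||_2^2 (shrinks weights, but not exactly to 0)."""
    residuals = y - H @ w
    return residuals @ residuals + lam * np.sum(w ** 2)
```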


CSE 446: Machine Learning 11

Coefficient path – ridge

©2017 Emily Fox

[Plot: coefficient paths for ridge regression. x-axis: λ; y-axis: coefficients ŵj. Features: bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront. Coefficients shrink smoothly toward 0 as λ grows, but none become exactly 0.]

CSE 446: Machine Learning 12

Using regularization for feature selection

Instead of searching over a discrete set of solutions, can we use regularization?

- Start with full model (all possible features)

- “Shrink” some coefficients exactly to 0
  •  i.e., knock out certain features

- Non-zero coefficients indicate “selected” features

©2017 Emily Fox


CSE 446: Machine Learning 13

Thresholding ridge coefficients?

Why don’t we just set small ridge coefficients to 0?

©2017 Emily Fox

[Bar chart: ridge coefficient values for # bedrooms, # bathrooms, sq.ft. living, sq.ft. lot, floors, year built, year renovated, last sales price, cost per sq.ft., heating, # showers, waterfront, with a horizontal line at 0.]

CSE 446: Machine Learning 14

Thresholding ridge coefficients?

Selected features for a given threshold value

©2017 Emily Fox

[Same bar chart, with a small threshold band around 0: features whose ridge coefficients fall inside the band are discarded; the rest are the “selected” features.]


CSE 446: Machine Learning 15

Thresholding ridge coefficients?

Let’s look at two related features…

©2017 Emily Fox

[Same bar chart, highlighting two related features: # bathrooms and # showers, both of whose ridge coefficients fall below the threshold.]

Nothing measuring bathrooms was included!

CSE 446: Machine Learning 16

Thresholding ridge coefficients?

If only one of the features had been included…

©2017 Emily Fox

[Same bar chart, but with only one of the two related features included in the model: its coefficient absorbs the shared effect and now clears the threshold.]


CSE 446: Machine Learning 17

Thresholding ridge coefficients?

Would have included bathrooms in selected model

©2017 Emily Fox

[Same bar chart, with the single bathroom-related coefficient now above the threshold.]

Can regularization lead directly to sparsity?

CSE 446: Machine Learning 18

Try this cost instead of ridge…

Total cost = measure of fit + λ measure of magnitude of coefficients

©2017 Emily Fox

RSS(w)  +  λ||w||1   where   ||w||1 = |w0| + … + |wD|

Lasso regression (a.k.a. L1 regularized regression)

Leads to sparse solutions!
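A quick sanity-check sketch that L1 regularization really does zero out coefficients, using scikit-learn's Lasso on synthetic data; note that sklearn's alpha scales the penalty by 1/(2N), so it is not numerically identical to the λ in these slides.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only 3 of 20 features actually matter.
rng = np.random.default_rng(0)
H = rng.normal(size=(200, 20))
y = 3.0 * H[:, 0] - 2.0 * H[:, 1] + 0.5 * H[:, 2] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(H, y)            # alpha plays the role of lambda
print("non-zero coefficients:", np.flatnonzero(lasso.coef_))
```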


CSE 446: Machine Learning 19

Lasso regression: L1 regularized regression

Just like ridge regression, solution is governed by a continuous parameter λ

If λ = 0: no penalty, so the solution is the unregularized least squares fit ŵLS
If λ = ∞: ŵ = 0
If λ in between: 0 ≤ ||ŵ||1 ≤ ||ŵLS||1, with more coefficients exactly 0 as λ grows

RSS(w) + λ||w||1    (the tuning parameter λ balances fit and sparsity)

©2017 Emily Fox

CSE 446: Machine Learning 20

Coefficient path – ridge

©2017 Emily Fox

[Plot repeated from earlier for comparison: ridge coefficient paths vs. λ for bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront.]


CSE 446: Machine Learning 21

Coefficient path – lasso

©2017 Emily Fox

[Plot: coefficient paths for LASSO. x-axis: λ; y-axis: coefficients ŵj. Features: bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront. Unlike ridge, coefficients hit exactly 0 one at a time as λ increases.]
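To see this numerically, a short sketch using scikit-learn's lasso_path on synthetic data, counting how many coefficients are non-zero at each penalty value; the data and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 20))
y = 3.0 * H[:, 0] - 2.0 * H[:, 1] + 0.1 * rng.normal(size=200)

# Coefficients enter the model one by one as the penalty decreases.
alphas, coefs, _ = lasso_path(H, y)           # coefs has shape (n_features, n_alphas)
for a, c in zip(alphas, coefs.T):
    print(f"alpha={a:9.4f}   non-zeros={np.count_nonzero(c)}")
```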

CSE 446: Machine Learning

Fitting the lasso regression model (for given λ value)

©2017 Emily Fox


CSE 446: Machine Learning 23

How we optimized past objectives

To solve for ŵ, previously took gradient of total cost objective and either:

1)  Derived closed-form solution

2)  Used in gradient descent algorithm

©2017 Emily Fox

CSE 446: Machine Learning 24

Optimizing the lasso objective

Lasso total cost:

Issues: 1)  What’s the derivative of |wj|?

2)  Even if we could compute derivative, no closed-form solution

©2017 Emily Fox

RSS(w) + λ||w||1

gradients → subgradients

can use subgradient descent


CSE 446: Machine Learning

Aside 1: Coordinate descent

©2017 Emily Fox

CSE 446: Machine Learning 26

Coordinate descent

Goal: Minimize some function g

Often, hard to find minimum for all coordinates, but easy for each coordinate

Coordinate descent:

Initialize ŵ = 0 (or smartly…)
while not converged:
  pick a coordinate j
  ŵj ← the value of wj minimizing g(ŵ0, …, ŵj-1, wj, ŵj+1, …, ŵD), with all other coordinates held fixed

©2017 Emily Fox
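A generic sketch of the loop, assuming you can supply a routine that minimizes g over one coordinate with the others fixed (here called g_partial_min, an illustrative name); the later slides give that routine in closed form for least squares and for the lasso.

```python
import numpy as np

def coordinate_descent(g_partial_min, D, max_iters=100, tol=1e-6):
    """Repeatedly minimize g over one coordinate at a time.

    g_partial_min(w, j) must return the value of w_j that minimizes g
    with the other coordinates of w held fixed.
    """
    w = np.zeros(D)                       # initialize w_hat = 0 (or smartly...)
    for _ in range(max_iters):
        max_step = 0.0
        for j in range(D):                # round-robin over coordinates
            new_wj = g_partial_min(w, j)
            max_step = max(max_step, abs(new_wj - w[j]))
            w[j] = new_wj
        if max_step < tol:                # converged: largest step in a sweep is tiny
            break
    return w
```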


CSE 446: Machine Learning 27

Comments on coordinate descent

How do we pick the next coordinate?
-  At random (“random” or “stochastic” coordinate descent), round robin, …

No stepsize to choose! Super useful approach for many problems

-  Converges to optimum in some cases (e.g., “strongly convex”)

-  Converges for lasso objective

©2017 Emily Fox

CSE 446: Machine Learning

Aside 2: Normalizing features

©2017 Emily Fox


CSE 446: Machine Learning 29

Normalizing features

Scale training columns (not rows!) as:

  h̃j(xk) = hj(xk) / √( Σ_{i=1}^N hj(xi)² )        (sum over training points)

The denominator is the normalizer zj = √( Σ_{i=1}^N hj(xi)² ), computed from the training features only.

Apply the same training scale factors to test data:

  h̃j(xk) = hj(xk) / zj        (apply the training normalizer to each test point)

©2017 Emily Fox
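A minimal sketch of this normalization, with illustrative names (H_train, H_test): each training column is scaled to unit 2-norm, and the same training-derived scale factors are applied to the test columns.

```python
import numpy as np

def normalize_features(H_train, H_test):
    """Scale each training column to unit norm; reuse the SAME scales on the test columns."""
    z = np.sqrt(np.sum(H_train ** 2, axis=0))    # normalizer z_j, from training data only
    return H_train / z, H_test / z, z
```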

CSE 446: Machine Learning

Aside 3: Coordinate descent for unregularized regression (for normalized features)

©2017 Emily Fox


CSE 446: Machine Learning 31

Optimizing least squares objective one coordinate at a time

Fix all coordinates w-j and take partial w.r.t. wj

©2017 Emily Fox

RSS(w) = Σ_{i=1}^N ( yi - Σ_{j=0}^D wj hj(xi) )²

∂/∂wj RSS(w) = -2 Σ_{i=1}^N hj(xi) ( yi - Σ_{k=0}^D wk hk(xi) )

CSE 446: Machine Learning 32

Optimizing least squares objective one coordinate at a time

Set partial = 0 and solve

©2017 Emily Fox

RSS(w) = Σ_{i=1}^N ( yi - Σ_{j=0}^D wj hj(xi) )²

∂/∂wj RSS(w) = -2ρj + 2wj = 0        (with normalized features, so zj = 1)

⟹  ŵj = ρj
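To see where ρj comes from, here is a sketch of the step between the general partial derivative and the -2ρj + 2wj form on this slide, using the ρj and zj defined on the following slides:

```latex
% Separate feature j from the rest of the prediction:
%   y_i - \sum_k w_k h_k(x_i) = \big(y_i - \hat{y}_i(w_{-j})\big) - w_j h_j(x_i)
\frac{\partial}{\partial w_j}\mathrm{RSS}(\mathbf{w})
  = -2\sum_{i=1}^{N} h_j(\mathbf{x}_i)\Big(\big(y_i - \hat{y}_i(\mathbf{w}_{-j})\big) - w_j\,h_j(\mathbf{x}_i)\Big)
  = -2\rho_j + 2\,w_j \underbrace{\textstyle\sum_{i=1}^{N} h_j(\mathbf{x}_i)^2}_{=\,z_j}
```

With normalized features zj = 1, giving -2ρj + 2wj as above; the general 2 zj wj form reappears in the unnormalized algorithm later.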


CSE 446: Machine Learning 33

Coordinate descent for least squares regression

Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute:  ρj = Σ_{i=1}^N hj(xi) ( yi - ŷi(ŵ-j) )
    set:  ŵj = ρj

where ŷi(ŵ-j) is the prediction without feature j, so yi - ŷi(ŵ-j) is the residual without feature j.

©2017 Emily Fox
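A sketch of this loop in NumPy, assuming the columns of H have already been normalized to unit norm (so the update is simply ŵj ← ρj); names are illustrative.

```python
import numpy as np

def coordinate_descent_least_squares(H, y, max_iters=100, tol=1e-6):
    """Coordinate descent for least squares; columns of H are assumed unit-norm."""
    D = H.shape[1]
    w = np.zeros(D)
    for _ in range(max_iters):
        max_step = 0.0
        for j in range(D):
            y_hat_minus_j = H @ w - H[:, j] * w[j]     # prediction without feature j
            rho_j = H[:, j] @ (y - y_hat_minus_j)      # feature j vs. residual without j
            max_step = max(max_step, abs(rho_j - w[j]))
            w[j] = rho_j                               # unregularized update: w_j <- rho_j
        if max_step < tol:
            break
    return w
```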

CSE 446: Machine Learning

Coordinate descent for lasso (for normalized features)

©2017 Emily Fox


CSE 446: Machine Learning 35

Coordinate descent for least squares regression

Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute:  ρj = Σ_{i=1}^N hj(xi) ( yi - ŷi(ŵ-j) )        (residual without feature j)
    set:  ŵj = ρj

©2017 Emily Fox

CSE 446: Machine Learning 36

Coordinate descent for lasso

Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute:  ρj = Σ_{i=1}^N hj(xi) ( yi - ŷi(ŵ-j) )
    set:  ŵj =  ρj + λ/2   if ρj < -λ/2
                0          if ρj in [-λ/2, λ/2]
                ρj - λ/2   if ρj > λ/2

©2017 Emily Fox


CSE 446: Machine Learning 37

Soft thresholding

©2017 Emily Fox

[Plot: ŵj as a function of ρj: exactly 0 for ρj in [-λ/2, λ/2], and the identity line shifted toward 0 by λ/2 outside that interval.]

ŵj =  ρj + λ/2   if ρj < -λ/2
      0          if ρj in [-λ/2, λ/2]
      ρj - λ/2   if ρj > λ/2
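The update has a one-line implementation; a sketch of the soft-thresholding operator from this slide (normalized-feature version):

```python
def soft_threshold(rho, lam):
    """Shrink rho toward 0 by lambda/2; snap it to exactly 0 inside [-lambda/2, lambda/2]."""
    if rho < -lam / 2.0:
        return rho + lam / 2.0
    if rho > lam / 2.0:
        return rho - lam / 2.0
    return 0.0
```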

CSE 446: Machine Learning 38

How to assess convergence?

Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute:  ρj = Σ_{i=1}^N hj(xi) ( yi - ŷi(ŵ-j) )
    set:  ŵj =  ρj + λ/2   if ρj < -λ/2
                0          if ρj in [-λ/2, λ/2]
                ρj - λ/2   if ρj > λ/2

©2017 Emily Fox


CSE 446: Machine Learning 39

When to stop?

For convex problems, the algorithm will start to take smaller and smaller steps.

Measure the size of the steps taken in a full loop over all features:
-  stop when max step < ε

Convergence criteria

©2017 Emily Fox

CSE 446: Machine Learning 40

Other lasso solvers

©2017 Emily Fox

Classically: Least angle regression (LARS) [Efron et al. ‘04]

Then: Coordinate descent algorithm [Fu ‘98, Friedman, Hastie, & Tibshirani ’08]

Now:

•  Parallel CD (e.g., Shotgun, [Bradley et al. ‘11])

•  Other parallel learning approaches for linear models
-  Parallel stochastic gradient descent (SGD) (e.g., Hogwild! [Niu et al. ’11])

-  Parallel independent solutions then averaging [Zhang et al. ‘12]

•  Alternating directions method of multipliers (ADMM) [Boyd et al. ’11]


CSE 446: Machine Learning

Coordinate descent for lasso (for unnormalized features)

©2017 Emily Fox

CSE 446: Machine Learning 42

Coordinate descent for lasso with normalized features

Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute:  ρj = Σ_{i=1}^N hj(xi) ( yi - ŷi(ŵ-j) )
    set:  ŵj =  ρj + λ/2   if ρj < -λ/2
                0          if ρj in [-λ/2, λ/2]
                ρj - λ/2   if ρj > λ/2

©2017 Emily Fox


CSE 446: Machine Learning 43

Coordinate descent for lasso with unnormalized features

Precompute:  zj = Σ_{i=1}^N hj(xi)²

Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute:  ρj = Σ_{i=1}^N hj(xi) ( yi - ŷi(ŵ-j) )
    set:  ŵj =  (ρj + λ/2)/zj   if ρj < -λ/2
                0               if ρj in [-λ/2, λ/2]
                (ρj - λ/2)/zj   if ρj > λ/2

©2017 Emily Fox
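Putting the pieces together, a self-contained NumPy sketch of this algorithm (unnormalized features, zj precomputed, convergence measured by the largest step in a full sweep); variable names and the convergence tolerance are illustrative choices.

```python
import numpy as np

def soft_threshold(rho, lam, z):
    """w_hat_j from the slide: exactly 0 inside [-lambda/2, lambda/2], else shrink and rescale by z_j."""
    if rho < -lam / 2.0:
        return (rho + lam / 2.0) / z
    if rho > lam / 2.0:
        return (rho - lam / 2.0) / z
    return 0.0

def lasso_coordinate_descent(H, y, lam, tol=1e-6, max_iters=1000):
    """Coordinate descent for the lasso with unnormalized features."""
    D = H.shape[1]
    z = np.sum(H ** 2, axis=0)                 # precompute z_j = sum_i h_j(x_i)^2
    w = np.zeros(D)                            # initialize w_hat = 0 (or smartly...)
    for _ in range(max_iters):
        max_step = 0.0
        for j in range(D):
            y_hat_minus_j = H @ w - H[:, j] * w[j]     # prediction without feature j
            rho_j = H[:, j] @ (y - y_hat_minus_j)      # feature j vs. residual without j
            new_wj = soft_threshold(rho_j, lam, z[j])
            max_step = max(max_step, abs(new_wj - w[j]))
            w[j] = new_wj
        if max_step < tol:                     # stop when the largest step in a sweep is small
            break
    return w
```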

CSE 446: Machine Learning

How to choose λ

©2017 Emily Fox


CSE 446: Machine Learning 45

If sufficient amount of data…

©2017 Emily Fox

Split the data into a training set, a validation set, and a test set:
-  fit ŵλ on the training set
-  test performance of ŵλ on the validation set to select λ*
-  assess generalization error of ŵλ* on the test set
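A sketch of that recipe with scikit-learn's Lasso and train_test_split on synthetic data; the λ grid, split sizes, and variable names are illustrative choices, and sklearn's alpha is its own scaling of the penalty.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
H = rng.normal(size=(300, 20))
y = 3.0 * H[:, 0] - 2.0 * H[:, 1] + 0.1 * rng.normal(size=300)

# Fit w_hat_lambda on train, pick lambda* on validation, report error on test.
H_tmp, H_test, y_tmp, y_test = train_test_split(H, y, test_size=0.2, random_state=0)
H_train, H_val, y_train, y_val = train_test_split(H_tmp, y_tmp, test_size=0.25, random_state=0)

best = None
for lam in np.logspace(-3, 1, 20):
    model = Lasso(alpha=lam).fit(H_train, y_train)
    val_err = np.mean((y_val - model.predict(H_val)) ** 2)
    if best is None or val_err < best[0]:
        best = (val_err, lam, model)

test_err = np.mean((y_test - best[2].predict(H_test)) ** 2)
print(f"lambda* = {best[1]:.4f}, test MSE = {test_err:.4f}")
```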

CSE 446: Machine Learning

Summary for feature selection and lasso regression

©2017 Emily Fox


CSE 446: Machine Learning 47

Impact of feature selection and lasso

Lasso has changed machine learning, statistics, & electrical engineering

But, for feature selection in general, be careful about interpreting selected features

-  selection is only among the features included in the model in the first place

-  sensitive to correlations between features

-  result depends on algorithm used

-  there are theoretical guarantees for lasso under certain conditions

©2017 Emily Fox

CSE 446: Machine Learning 48

What you can do now…

•  Describe “all subsets” and greedy variants for feature selection

•  Analyze computational costs of these algorithms

•  Formulate lasso objective

•  Describe what happens to estimated lasso coefficients as tuning parameter λ is varied

•  Interpret lasso coefficient path plot

•  Contrast ridge and lasso regression

•  Estimate lasso regression parameters using an iterative coordinate descent algorithm

©2017 Emily Fox


CSE 446: Machine Learning

Deriving the lasso coordinate descent update

©2017 Emily Fox

OPTIONAL

CSE 446: Machine Learning 50

Optimizing lasso objective one coordinate at a time

Fix all coordinates w-j and take partial w.r.t. wj

©2017 Emily Fox

RSS(w) + λ||w||1 = Σ_{i=1}^N ( yi - Σ_{j=0}^D wj hj(xi) )² + λ Σ_{j=0}^D |wj|

derive without normalizing features


CSE 446: Machine Learning 51

Part 1: Partial of RSS term

©2017 Emily Fox

RSS(w) + λ||w||1 = Σ_{i=1}^N ( yi - Σ_{j=0}^D wj hj(xi) )² + λ Σ_{j=0}^D |wj|

∂/∂wj RSS(w) = -2 Σ_{i=1}^N hj(xi) ( yi - Σ_{k=0}^D wk hk(xi) )

CSE 446: Machine Learning 52

Part 2: Partial of L1 penalty term

©2017 Emily Fox

RSS(w) + λ||w||1 = Σ_{i=1}^N ( yi - Σ_{j=0}^D wj hj(xi) )² + λ Σ_{j=0}^D |wj|

∂/∂wj λ|wj| = ???

[Plot: |x| vs. x, which has a kink and is not differentiable at x = 0.]


CSE 446: Machine Learning 53

Subgradients of convex functions

Gradients lower bound convex functions:

Subgradients: Generalize gradients to non-differentiable points: -  Any plane that lower bounds function

©2017 Emily Fox

[Plot: a convex function g(x) with several lines that lower bound it and touch it at x; the subgradient is unique at x if the function is differentiable at x. Second plot: |x|, which has many lower-bounding lines at x = 0.]
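For reference, a sketch of the formal definition behind this picture (a standard convex-analysis fact, not spelled out on the slides), together with the subdifferential of the absolute value used on the next slide:

```latex
% v is a subgradient of a convex g at x if the line through (x, g(x)) with slope v
% lower-bounds g everywhere:
\partial g(x) = \{\, v \;:\; g(y) \ge g(x) + v\,(y - x) \ \text{for all } y \,\}
% For the absolute value:
\partial |x| =
\begin{cases}
  \{-1\}   & x < 0,\\
  [-1,\,1] & x = 0,\\
  \{+1\}   & x > 0.
\end{cases}
```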

CSE 446: Machine Learning 54

Part 2: Subgradient of L1 term

©2017 Emily Fox

RSS(w) + λ||w||1 = Σ_{i=1}^N ( yi - Σ_{j=0}^D wj hj(xi) )² + λ Σ_{j=0}^D |wj|

λ ∂|wj| =  -λ        when wj < 0
           [-λ, λ]   when wj = 0
           λ         when wj > 0

(for wj ≠ 0, ∂|wj| = wj/|wj| = sign(wj))


CSE 446: Machine Learning 55

Putting it all together…

©2017 Emily Fox

RSS(w) + λ||w||1 = Σ_{i=1}^N ( yi - Σ_{j=0}^D wj hj(xi) )² + λ Σ_{j=0}^D |wj|

∂wj [lasso cost] = 2 zj wj - 2ρj +  -λ        when wj < 0
                                    [-λ, λ]   when wj = 0
                                    λ         when wj > 0

                 =  2 zj wj - 2ρj - λ         when wj < 0
                    [-2ρj - λ, -2ρj + λ]      when wj = 0
                    2 zj wj - 2ρj + λ         when wj > 0

CSE 446: Machine Learning 56

Optimal solution: Set subgradient = 0

©2017 Emily Fox

∂wj [lasso cost] =  2 zj wj - 2ρj - λ        when wj < 0
                    [-2ρj - λ, -2ρj + λ]     when wj = 0
                    2 zj wj - 2ρj + λ        when wj > 0
                 =  0

Case 1 (wj < 0):  2 zj ŵj - 2ρj - λ = 0  ⟹  ŵj = (ρj + λ/2)/zj;  for ŵj < 0, need ρj < -λ/2

Case 2 (wj = 0):  ŵj = 0;  for ŵj = 0, need [-2ρj - λ, -2ρj + λ] to contain 0, i.e., ρj in [-λ/2, λ/2]

Case 3 (wj > 0):  2 zj ŵj - 2ρj + λ = 0  ⟹  ŵj = (ρj - λ/2)/zj;  for ŵj > 0, need ρj > λ/2


CSE 446: Machine Learning 57

Optimal solution: Set subgradient = 0

©2017 Emily Fox

Setting ∂wj [lasso cost] = 0 in all three cases gives:

ŵj =  (ρj + λ/2)/zj   if ρj < -λ/2
      0               if ρj in [-λ/2, λ/2]
      (ρj - λ/2)/zj   if ρj > λ/2

CSE 446: Machine Learning 58 ©2017 Emily Fox

Soft thresholding

[Plot: ŵj vs. ρj: exactly 0 for ρj in [-λ/2, λ/2], linear with slope 1/zj outside that interval.]

ŵj =  (ρj + λ/2)/zj   if ρj < -λ/2
      0               if ρj in [-λ/2, λ/2]
      (ρj - λ/2)/zj   if ρj > λ/2


CSE 446: Machine Learning 59

Coordinate descent for lasso

Precompute:  zj = Σ_{i=1}^N hj(xi)²

Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute:  ρj = Σ_{i=1}^N hj(xi) ( yi - ŷi(ŵ-j) )
    set:  ŵj =  (ρj + λ/2)/zj   if ρj < -λ/2
                0               if ρj in [-λ/2, λ/2]
                (ρj - λ/2)/zj   if ρj > λ/2

©2017 Emily Fox