lasso regression - courses.cs.washington.edu...activity in which brain regions can predict...

CSE/STAT 416: Intro to Machine Learning

CSE 416: Machine Learning, Emily Fox, University of Washington, April 12, 2018

©2018 Emily Fox

Lasso Regression: Regularization for feature selection

Symptom of overfitting

Often, overfitting is associated with very large estimated parameters $\hat{w}$.

[Figure: price (dollars) versus square feet (sq.ft.), with an overfit function $f_{\hat{w}}$ through the data points $(x, y)$. Label: OVERFIT.]

[Figure: machine learning pipeline with blocks for training data, feature extraction, ML model, quality metric, and ML algorithm; labels $y$, $h(x)$, $\hat{y}$, $\hat{w}$.]

Consider specific total cost

Total cost = measure of fit + measure of magnitude of coefficients

- measure of fit: $\text{RSS}(w)$
- measure of magnitude of coefficients: $\|w\|_2^2$

Want to balance:
i. How well function fits data
ii. Magnitude of coefficients

Consider resulting objective

What if $\hat{w}$ selected to minimize

$\text{RSS}(w) + \lambda \|w\|_2^2$

λ: tuning parameter = balance of fit and magnitude

Ridge regression (a.k.a. L2 regularization)
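As a minimal sketch of the objective above, the following Python evaluates RSS(w) + λ||w||_2^2 for a candidate w. The function names and the toy data are illustrative assumptions, not part of the slides; `H` holds one row of features h(x_i) per data point.

```python
def predict(w, h):
    """Prediction for one data point with feature vector h."""
    return sum(wj * hj for wj, hj in zip(w, h))

def rss(w, H, y):
    """Residual sum of squares: the measure of fit."""
    return sum((yi - predict(w, hi)) ** 2 for hi, yi in zip(H, y))

def ridge_cost(w, H, y, lam):
    """RSS(w) + lam * ||w||_2^2: fit plus penalized coefficient magnitude."""
    return rss(w, H, y) + lam * sum(wj ** 2 for wj in w)

# Toy data (assumed for illustration): y = x, with features [1, x].
H = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]

perfect_fit = ridge_cost([0.0, 1.0], H, y, lam=2.0)  # RSS = 0, penalty = 2 * 1
shrunk_fit = ridge_cost([0.0, 0.5], H, y, lam=2.0)   # RSS = 3.5, penalty = 2 * 0.25
print(perfect_fit, shrunk_fit)
```

With λ = 2 the perfectly fitting coefficients still cost less in this toy case; a larger λ would shift the balance toward the smaller coefficients.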

Measure of magnitude of regression coefficient

What summary # is indicative of size of regression coefficients?
- Sum?
- Sum of absolute values (L1 norm)?
- Sum of squares (squared L2 norm)
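To make the comparison concrete, here is a small sketch that computes the three candidate summaries for one coefficient vector. The function name and the example vector are assumptions chosen for illustration.

```python
def coefficient_summaries(w):
    """Return (plain sum, sum of absolute values, sum of squares) of w."""
    plain_sum = sum(w)                     # positive and negative entries cancel
    l1_norm = sum(abs(wj) for wj in w)     # sum of absolute values
    l2_norm_sq = sum(wj * wj for wj in w)  # sum of squares
    return plain_sum, l1_norm, l2_norm_sq

# Two huge coefficients of opposite sign plus one small coefficient.
w = [1000.0, -1000.0, 0.5]
plain_sum, l1_norm, l2_norm_sq = coefficient_summaries(w)
print(plain_sum, l1_norm, l2_norm_sq)
```

The plain sum looks small here even though two coefficients are very large, which is why it is a poor measure of magnitude; the other two summaries both flag the large entries.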

Feature selection task

Why might you want to perform feature selection?

Efficiency:
- If size(w) = 100B, each prediction is expensive
- If $\hat{w}$ sparse (many zeros), computation only depends on # of non-zeros

$\hat{y}_i = \sum_{\hat{w}_j \neq 0} \hat{w}_j h_j(x_i)$

Interpretability:
- Which features are relevant for prediction?
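The efficiency point can be sketched as follows: if only the non-zero coefficients are stored, the prediction loop touches only those entries. The dictionary representation and the names below are assumptions for illustration.

```python
def dense_predict(w, h):
    """Sum over all coefficients, including the zeros."""
    return sum(wj * hj for wj, hj in zip(w, h))

def sparse_predict(sparse_w, h):
    """Sum over non-zero coefficients only.

    sparse_w maps feature index j -> non-zero coefficient w_j.
    """
    return sum(wj * h[j] for j, wj in sparse_w.items())

w = [0.0, 2.0, 0.0, 0.0, -1.5]  # mostly zeros
sparse_w = {j: wj for j, wj in enumerate(w) if wj != 0.0}
h = [1.0, 3.0, 5.0, 7.0, 2.0]   # features h_j(x_i) for one data point

print(dense_predict(w, h), sparse_predict(sparse_w, h))
```

Both calls give the same prediction, but the sparse version performs two multiplications instead of five.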

Sparsity: Housing application

Which of the many candidate features matter for the price ($ ?) of a house?

Lot size, Single Family, Year built, Last sold price, Last sale price/sqft, Finished sqft, Unfinished sqft, Finished basement sqft, # floors, Flooring types, Parking type, Parking amount, Cooling, Heating, Exterior materials, Roof type, Structure style, Dishwasher, Garbage disposal, Microwave, Range / Oven, Refrigerator, Washer, Dryer, Laundry location, Heating type, Jetted Tub, Deck, Fenced Yard, Lawn, Garden, Sprinkler System, …

Sparsity: Reading your mind

[Figure: scale running from "very sad" to "very happy".]

Activity in which brain regions can predict happiness?

Option 1: All subsets or greedy variants

Find best model of size: 0, 1, 2, …, 8

[Figure: $\text{RSS}(\hat{w})$ versus # of features, built up over successive slides for sizes 0 through 8. For each size, the candidate models of that size are compared and the one with the lowest RSS is marked.]

Candidate features:
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront

Note: Not necessarily nested! The best model of size 2 need not contain the feature chosen for the best model of size 1.

Choosing model complexity?

Option 1: Assess on validation set

Option 2: Cross validation

Option 3+: Other metrics for penalizing model complexity like BIC…

Complexity of “all subsets”

How many models were evaluated?
- each indexed by features included

$y_i = \epsilon_i$ : [0 0 0 … 0 0 0]
$y_i = w_0 h_0(x_i) + \epsilon_i$ : [1 0 0 … 0 0 0]
$y_i = w_1 h_1(x_i) + \epsilon_i$ : [0 1 0 … 0 0 0]
$y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + \epsilon_i$ : [1 1 0 … 0 0 0]
$y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + \dots + w_D h_D(x_i) + \epsilon_i$ : [1 1 1 … 1 1 1]

Number of models: $2^D$

- $2^8 = 256$
- $2^{30} = 1{,}073{,}741{,}824$
- $2^{1000} = 1.071509 \times 10^{301}$
- $2^{100\text{B}}$ = HUGE!!!!!!

Typically, computationally infeasible
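The count above can be checked with a short sketch that enumerates every subset of a small feature list. The helper name `all_subsets` and the three-feature list are assumptions for illustration; the model fitting for each subset is left out.

```python
from itertools import combinations

def all_subsets(features):
    """Yield every subset of features, from size 0 up to all of them."""
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            yield subset

features = ["bedrooms", "bathrooms", "sqft_living"]
subsets = list(all_subsets(features))
print(len(subsets))          # one model per subset: 2 ** 3
print(2 ** 8, 2 ** 30)       # counts for 8 and 30 candidate features
print(float(2 ** 1000))      # about 1.07e301 for 1000 candidate features
```

Each added feature doubles the number of models to fit, which is why exhaustive search stops being practical well before the number of features gets large.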

Greedy algorithms

Forward stepwise: Start from a simple model and iteratively add the features most useful to fit

Backward stepwise: Start with full model and iteratively remove features least useful to fit

Combining forward and backward steps: In forward algorithm, insert steps to remove features no longer as important

Lots of other variants, too.
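Forward stepwise selection can be sketched as below. This is one possible reading of the procedure, assuming "most useful to fit" means the largest drop in training RSS; the helper names, the normal-equations solver, and the toy data are all assumptions for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_rss(H, y, cols):
    """Least squares fit using only the feature columns in cols; returns RSS."""
    if not cols:
        return sum(yi ** 2 for yi in y)  # model with no features: y_i = eps_i
    X = [[row[j] for j in cols] for row in H]
    d = len(cols)
    A = [[sum(x[a] * x[c] for x in X) for c in range(d)] for a in range(d)]
    g = [sum(x[a] * yi for x, yi in zip(X, y)) for a in range(d)]
    w = solve(A, g)
    return sum((yi - sum(wj * xj for wj, xj in zip(w, x))) ** 2
               for x, yi in zip(X, y))

def forward_stepwise(H, y, max_features):
    """Greedily add the feature that lowers training RSS the most."""
    selected, remaining = [], list(range(len(H[0])))
    while remaining and len(selected) < max_features:
        best = min(remaining, key=lambda j: fit_rss(H, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (assumed): y = 3 * h_1 + 1 * h_2, and h_0 is an uninformative constant.
H = [[1.0, 1.0, 0.0], [1.0, -1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 0.0, -1.0]]
y = [3.0, -3.0, 1.0, -1.0]
selected = forward_stepwise(H, y, max_features=2)
print(selected)
```

Each step fits at most D models, so the greedy search evaluates on the order of D^2 models in total instead of 2^D. The solver assumes the selected columns are not linearly dependent.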


Option 2: Regularize


Ridge regression: L2 regularized regression

Total cost = measure of fit + λ measure of magnitude of coefficients

$\text{RSS}(w) + \lambda \|w\|_2^2$, where $\|w\|_2^2 = w_0^2 + \dots + w_D^2$

Encourages small weights but not exactly 0

Coefficient path – ridge

[Figure: coefficient paths for ridge. Coefficients $\hat{w}_j$ (vertical axis, about −100000 to 200000) versus λ (horizontal axis, 0 to 200000) for bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront.]

Using regularization for feature selection

Instead of searching over a discrete set of solutions, can we use regularization?
- Start with full model (all possible features)
- “Shrink” some coefficients exactly to 0
  • i.e., knock out certain features
- Non-zero coefficients indicate “selected” features

Thresholding ridge coefficients?

Why don’t we just set small ridge coefficients to 0?

[Figure: bar chart of ridge coefficients for # bedrooms, # bathrooms, sq.ft. living, sq.ft. lot, floors, year built, year renovated, last sales price, cost per sq.ft., heating, # showers, waterfront. The same chart is repeated on the following slides with different bars highlighted.]

Selected features for a given threshold value

Let’s look at two related features…

Nothing measuring bathrooms was included!

If only one of the features had been included…

Would have included bathrooms in selected model

Can regularization lead directly to sparsity?
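The problem with thresholding can be reproduced in a small sketch. Assuming ridge is fit through its normal equations, a feature that appears twice has its weight split across the two copies, so each copy can fall below a threshold that a single copy would clear. The function names, the threshold, and the toy data are assumptions for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def ridge_fit(H, y, lam):
    """Minimize RSS(w) + lam * ||w||_2^2 by solving (H^T H + lam I) w = H^T y."""
    d = len(H[0])
    A = [[sum(row[a] * row[c] for row in H) + (lam if a == c else 0.0)
          for c in range(d)] for a in range(d)]
    g = [sum(row[a] * yi for row, yi in zip(H, y)) for a in range(d)]
    return solve(A, g)

def threshold_select(w, threshold):
    """Indices of coefficients whose magnitude reaches the threshold."""
    return [j for j, wj in enumerate(w) if abs(wj) >= threshold]

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
w_single = ridge_fit([[xi] for xi in x], y, lam=14.0)          # feature once
w_duplicated = ridge_fit([[xi, xi] for xi in x], y, lam=14.0)  # feature twice

print(w_single, threshold_select(w_single, 0.8))
print(w_duplicated, threshold_select(w_duplicated, 0.8))
```

With one copy the coefficient is 1.0 and survives the 0.8 threshold; with two identical copies each coefficient is about 0.67 and both are dropped, mirroring the bathrooms example.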

Try this cost instead of ridge…

Total cost = measure of fit + λ measure of magnitude of coefficients

$\text{RSS}(w) + \lambda \|w\|_1$, where $\|w\|_1 = |w_0| + \dots + |w_D|$

Lasso regression (a.k.a. L1 regularized regression)

Leads to sparse solutions!

Lasso regression: L1 regularized regression

Just like ridge regression, solution is governed by a continuous parameter λ

If λ=0:

If λ=∞:

If λ in between:

$\text{RSS}(w) + \lambda \|w\|_1$

tuning parameter = balance of fit and sparsity
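The slides do not say how the lasso objective is minimized. One common choice is coordinate descent with soft thresholding, sketched here under the assumption that every coefficient (including w_0) is penalized, as in the L1 norm defined earlier, and that no feature column is all zeros. The function names and toy data are illustrative.

```python
def soft_threshold(rho, t):
    """Shrink rho toward 0 by t, returning exactly 0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso_coordinate_descent(H, y, lam, n_iters=200):
    """Minimize RSS(w) + lam * ||w||_1 one coordinate at a time."""
    n, d = len(H), len(H[0])
    w = [0.0] * d
    z = [sum(H[i][j] ** 2 for i in range(n)) for j in range(d)]
    for _ in range(n_iters):
        for j in range(d):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                H[i][j] * (y[i] - sum(w[k] * H[i][k] for k in range(d) if k != j))
                for i in range(n)
            )
            # The 1-D problem z*w^2 - 2*rho*w + lam*|w| is minimized here.
            w[j] = soft_threshold(rho, lam / 2.0) / z[j]
    return w

# Toy data (assumed): two orthogonal features, y = 3 * h_0 + 1 * h_1.
H = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y = [3.0, -3.0, 1.0, -1.0]

w_lam0 = lasso_coordinate_descent(H, y, lam=0.0)
w_lam4 = lasso_coordinate_descent(H, y, lam=4.0)
w_lam12 = lasso_coordinate_descent(H, y, lam=12.0)
print(w_lam0, w_lam4, w_lam12)
```

On this toy data, λ = 0 recovers the least squares coefficients, λ = 4 shrinks the first coefficient and sets the second exactly to 0, and λ = 12 sets both to 0.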

Coefficient path – ridge

[Figure: coefficient paths for ridge, repeated for comparison. Coefficients $\hat{w}_j$ (vertical axis, about −100000 to 200000) versus λ (horizontal axis, 0 to 200000) for bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront.]

Coefficient path – lasso

[Figure: coefficient paths for lasso. Coefficients $\hat{w}_j$ (vertical axis, about −100000 to 200000) versus λ (horizontal axis, 0 to 200000) for the same eight features.]

Revisit polynomial fit demo

What happens if we refit our high-order polynomial, but now using lasso regression?

Will consider a few settings of λ …

How to choose λ: Cross validation

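If sufficient amount of data…

[Figure: data divided into a training set, a validation set, and a test set.]

- Training set: fit $\hat{w}_\lambda$
- Validation set: test performance of $\hat{w}_\lambda$ to select $\lambda^*$
- Test set: assess generalization error of $\hat{w}_{\lambda^*}$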
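A three-way split like the one above can be sketched in a few lines. The function name, the fixed seed, and the 80/10/10 sizes are assumptions for illustration; the slides do not prescribe particular proportions.

```python
import random

def train_validation_test_split(data, n_valid, n_test, seed=0):
    """Shuffle once, then carve out test, validation, and training sets."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    test = [data[i] for i in indices[:n_test]]
    valid = [data[i] for i in indices[n_test:n_test + n_valid]]
    train = [data[i] for i in indices[n_test + n_valid:]]
    return train, valid, test

data = list(range(100))  # stand-in for 100 data points
train, valid, test = train_validation_test_split(data, n_valid=10, n_test=10)
print(len(train), len(valid), len(test))
```

The training set is used to fit a model for each λ, the validation set to pick λ*, and the test set only once at the end to assess the chosen model.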

Start with smallish dataset

[Figure: a single block labeled "All data".]

Still form test set and hold out

[Figure: data divided into "Rest of data" and "Test set".]

How do we use the other data?

Rest of data: use for both training and validation, but not so naively

Recall naïve approach

[Figure: rest of data divided into a training set and a small validation set.]

Is validation set enough to compare performance of $\hat{w}_\lambda$ across λ values?

No

Choosing the validation set

[Figure: a small validation set placed somewhere other than the end of the data.]

Didn’t have to use the last data points tabulated to form validation set

Can use any data subset

Choosing the validation set

Which subset should I use?

ALL! average performance over all choices

K-fold cross validation

Preprocessing: Randomly assign data to K groups
(use same split of data for all other steps)

[Figure: rest of data divided into K blocks of $N/K$ points each.]

For k = 1, …, K
1. Estimate $\hat{w}_\lambda^{(k)}$ on the training blocks
2. Compute error on validation block: $\text{error}_k(\lambda)$

[Figure: the validation block moves across the five blocks in turn; block k yields $\hat{w}_\lambda^{(k)}$ and $\text{error}_k(\lambda)$ for k = 1, …, 5.]

Compute average error: $\text{CV}(\lambda) = \frac{1}{K} \sum_{k=1}^{K} \text{error}_k(\lambda)$

Repeat procedure for each choice of λ

Choose λ* to minimize CV(λ)
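The procedure above can be sketched generically. The `fit` and `error` callables are hypothetical placeholders for whatever model is being tuned; the one-feature lasso used in the usage example, the candidate λ values, and the toy data are likewise assumptions for illustration.

```python
import random

def k_fold_indices(n, K, seed=0):
    """Randomly assign n data points to K groups (same split for a given seed)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::K] for k in range(K)]

def cross_validation_error(data, lam, K, fit, error, seed=0):
    """CV(lam): average validation error over the K folds."""
    folds = k_fold_indices(len(data), K, seed)
    total = 0.0
    for k in range(K):
        valid = [data[i] for i in folds[k]]
        train = [data[i] for j in range(K) if j != k for i in folds[j]]
        model = fit(train, lam)        # step 1: estimate on the training blocks
        total += error(model, valid)   # step 2: error on the validation block
    return total / K

def choose_lambda(data, lambdas, K, fit, error):
    """Return the candidate lambda with the smallest CV error."""
    return min(lambdas, key=lambda lam: cross_validation_error(data, lam, K, fit, error))

# Usage with a one-feature lasso without intercept (closed form, assumed model).
def fit_1d_lasso(train, lam):
    rho = sum(x * y for x, y in train)
    z = sum(x * x for x, y in train)
    shrunk = max(abs(rho) - lam / 2.0, 0.0)
    return (shrunk if rho >= 0 else -shrunk) / z

def mse(w, valid):
    return sum((y - w * x) ** 2 for x, y in valid) / len(valid)

data = [(float(x), 2.0 * x) for x in range(1, 11)]  # noiseless y = 2x
best_lam = choose_lambda(data, [0.0, 10.0, 100.0], K=5, fit=fit_1d_lasso, error=mse)
print(best_lam)
```

Because the toy data are noiseless, any shrinkage only hurts validation error, so λ = 0 is selected here; with noisy data and many features a positive λ is typically chosen.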

What value of K?

Formally, the best approximation occurs for validation sets of size 1 (K=N): leave-one-out cross validation

Computationally intensive
- requires computing N fits of model per λ

Typically, K=5 or 10 (5-fold CV, 10-fold CV)

Choosing λ via cross validation for lasso

Cross validation is choosing the λ that provides best predictive accuracy

Tends to favor less sparse solutions, and thus smaller λ, than optimal choice for feature selection

See “Machine Learning: A Probabilistic Perspective”, Murphy, 2012, for further discussion


Practical concerns with lasso


Debiasing lasso

Lasso shrinks coefficients relative to LS solution → more bias, less variance

Can reduce bias as follows:
1. Run lasso to select features
2. Run least squares regression with only selected features

“Relevant” features no longer shrunk relative to LS fit of same reduced model

[Figure, used with permission of Mario Figueiredo (captions modified to fit course), with four panels:]
- True coefficients (D=4096, non-zero = 160)
- L1 reconstruction (non-zero = 1024, MSE = 0.0072)
- Debiased (non-zero = 1024, MSE = 3.26e-005)
- Least squares (non-zero = 0, MSE = 1.568)
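The two-step debiasing recipe can be sketched as follows, assuming the lasso coefficients have already been computed by some solver. The function names, the tolerance for "non-zero", and the toy numbers are assumptions for illustration; the lasso output shown is for the objective RSS(w) + 4||w||_1 on this toy data.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def debias(H, y, w_lasso, tol=1e-12):
    """Refit least squares on the features the lasso selected (non-zero w_j)."""
    support = [j for j, wj in enumerate(w_lasso) if abs(wj) > tol]
    w = [0.0] * len(w_lasso)
    if not support:
        return w
    # Normal equations restricted to the selected columns.
    A = [[sum(row[a] * row[c] for row in H) for c in support] for a in support]
    g = [sum(row[a] * yi for row, yi in zip(H, y)) for a in support]
    for j, wj in zip(support, solve(A, g)):
        w[j] = wj
    return w

# Toy data (assumed): two orthogonal features, y = 3 * h_0 + 1 * h_1.
H = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y = [3.0, -3.0, 1.0, -1.0]
w_lasso = [2.0, 0.0]  # step 1 output: second feature knocked out, first shrunk
w_debiased = debias(H, y, w_lasso)
print(w_debiased)
```

The selected coefficient returns to its least squares value on the reduced model (3.0 instead of the shrunk 2.0), while the dropped feature stays at 0.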

Issues with standard lasso objective

1. With group of highly correlated features, lasso tends to select amongst them arbitrarily
   - Often prefer to select all together
2. Often, empirically ridge has better predictive performance than lasso, but lasso leads to sparser solution

Elastic net aims to address these issues
- hybrid between lasso and ridge regression
- uses L1 and L2 penalties

See Zou & Hastie ‘05 for further discussion
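For reference, one common way to write an objective that combines both penalties is shown below. The separate tuning parameters $\lambda_1$ and $\lambda_2$ are notation assumed here, not taken from the slides; other parameterizations exist.

```latex
\hat{w}_{\text{elastic net}}
  = \arg\min_{w} \; \text{RSS}(w) + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2
```

Setting $\lambda_2 = 0$ gives the lasso objective and setting $\lambda_1 = 0$ gives the ridge objective, which is the sense in which elastic net is a hybrid of the two.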


Summary for feature selection and lasso regression


Impact of feature selection and lasso

Lasso has changed machine learning, statistics, & electrical engineering

But, for feature selection in general, be careful about interpreting selected features
- selection only considers features included
- sensitive to correlations between features
- result depends on algorithm used
- there are theoretical guarantees for lasso under certain conditions

What you can do now…
• Describe “all subsets” and greedy variants for feature selection
• Analyze computational costs of these algorithms
• Formulate lasso objective
• Describe what happens to estimated lasso coefficients as tuning parameter λ is varied
• Interpret lasso coefficient path plot
• Contrast ridge and lasso regression
• Implement K-fold cross validation to select lasso tuning parameter λ