
CSE 446: Machine Learning
Emily Fox
University of Washington
January 20, 2017
©2017 Emily Fox

Regularized Regression: Geometric intuition of solution
Plus: Cross validation


Coordinate descent for lasso (for normalized features)



Coordinate descent for least squares regression

Initialize ŵ = 0 (or smartly…)
while not converged:
    for j = 0, 1, …, D:
        compute: ρj = Σi hj(xi) (yi − ŷi(ŵ−j))
        set: ŵj = ρj

where ŷi(ŵ−j) is the prediction without feature j, so (yi − ŷi(ŵ−j)) is the residual without feature j.
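A minimal NumPy sketch of this loop, assuming the feature columns are normalized to unit 2-norm (so the closed-form coordinate update is simply ŵj = ρj); the function and variable names are illustrative, not from the lecture.

```python
import numpy as np

def coordinate_descent_ls(H, y, tol=1e-6, max_iters=1000):
    """Coordinate descent for least squares, assuming each column of
    the feature matrix H is normalized to unit 2-norm."""
    D = H.shape[1]
    w = np.zeros(D)                      # initialize ŵ = 0
    for _ in range(max_iters):
        max_step = 0.0
        for j in range(D):
            # prediction without feature j
            y_hat_minus_j = H @ w - H[:, j] * w[j]
            # ρ_j = Σ_i h_j(x_i) (y_i − ŷ_i(ŵ_{-j}))
            rho_j = H[:, j] @ (y - y_hat_minus_j)
            max_step = max(max_step, abs(rho_j - w[j]))
            w[j] = rho_j
        if max_step < tol:               # stop when max step < ε
            break
    return w
```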


Coordinate descent for lasso

Initialize ŵ = 0 (or smartly…)
while not converged:
    for j = 0, 1, …, D:
        compute: ρj = Σi hj(xi) (yi − ŷi(ŵ−j))
        set: ŵj = ρj + λ/2   if ρj < −λ/2
             0               if ρj in [−λ/2, λ/2]
             ρj − λ/2        if ρj > λ/2
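A corresponding sketch for the lasso update: the only change from the least squares version is that ρj is passed through the soft-thresholding rule above. Again a sketch under the unit-norm feature assumption, with illustrative names.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Lasso coordinate update: shrink ρ toward zero by λ/2,
    setting it exactly to zero inside [−λ/2, λ/2]."""
    if rho < -lam / 2:
        return rho + lam / 2
    elif rho > lam / 2:
        return rho - lam / 2
    return 0.0

def coordinate_descent_lasso(H, y, lam, tol=1e-6, max_iters=1000):
    """Coordinate descent for lasso, assuming unit-norm feature columns."""
    D = H.shape[1]
    w = np.zeros(D)
    for _ in range(max_iters):
        max_step = 0.0
        for j in range(D):
            y_hat_minus_j = H @ w - H[:, j] * w[j]   # prediction without feature j
            rho_j = H[:, j] @ (y - y_hat_minus_j)
            w_new = soft_threshold(rho_j, lam)
            max_step = max(max_step, abs(w_new - w[j]))
            w[j] = w_new
        if max_step < tol:
            break
    return w
```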


Soft thresholding

[Plot: ŵj as a function of ρj, the soft-thresholding function]

ŵj = ρj + λ/2   if ρj < −λ/2
     0          if ρj in [−λ/2, λ/2]
     ρj − λ/2   if ρj > λ/2


How to assess convergence?

Initialize ŵ = 0 (or smartly…)
while not converged:
    for j = 0, 1, …, D:
        compute: ρj = Σi hj(xi) (yi − ŷi(ŵ−j))
        set: ŵj = ρj + λ/2   if ρj < −λ/2
             0               if ρj in [−λ/2, λ/2]
             ρj − λ/2        if ρj > λ/2


When to stop?

Convergence criteria:
- For convex problems, the algorithm will start to take smaller and smaller steps.
- Measure the size of the steps taken in a full loop over all features; stop when the max step < ε.


Other lasso solvers


Classically: Least angle regression (LARS) [Efron et al. '04]

Then: Coordinate descent algorithm [Fu '98; Friedman, Hastie, & Tibshirani '08]

Now:
• Parallel CD (e.g., Shotgun [Bradley et al. '11])
• Other parallel learning approaches for linear models
  - Parallel stochastic gradient descent (SGD) (e.g., Hogwild! [Niu et al. '11])
  - Parallel independent solutions, then averaging [Zhang et al. '12]
• Alternating direction method of multipliers (ADMM) [Boyd et al. '11]


Coordinate descent for lasso (for unnormalized features)



Coordinate descent for lasso with unnormalized features

Precompute: zj = Σi hj(xi)²

Initialize ŵ = 0 (or smartly…)
while not converged:
    for j = 0, 1, …, D:
        compute: ρj = Σi hj(xi) (yi − ŷi(ŵ−j))
        set: ŵj = (ρj + λ/2)/zj   if ρj < −λ/2
             0                    if ρj in [−λ/2, λ/2]
             (ρj − λ/2)/zj        if ρj > λ/2
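A sketch of the unnormalized-feature variant; it differs from the normalized version only in the precomputed zj normalizers. Names are again illustrative.

```python
import numpy as np

def coordinate_descent_lasso_unnormalized(H, y, lam, tol=1e-6, max_iters=1000):
    """Lasso coordinate descent without requiring unit-norm features."""
    D = H.shape[1]
    z = (H ** 2).sum(axis=0)             # z_j = Σ_i h_j(x_i)², precomputed once
    w = np.zeros(D)
    for _ in range(max_iters):
        max_step = 0.0
        for j in range(D):
            y_hat_minus_j = H @ w - H[:, j] * w[j]
            rho_j = H[:, j] @ (y - y_hat_minus_j)
            if rho_j < -lam / 2:
                w_new = (rho_j + lam / 2) / z[j]
            elif rho_j > lam / 2:
                w_new = (rho_j - lam / 2) / z[j]
            else:
                w_new = 0.0
            max_step = max(max_step, abs(w_new - w[j]))
            w[j] = w_new
        if max_step < tol:
            break
    return w
```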


Geometric intuition for sparsity of lasso solution



Geometric intuition for ridge regression



Visualizing the ridge cost in 2D

RSS(w) + λ||w||2² = Σi (yi − w0 h0(xi) − w1 h1(xi))² + λ(w0² + w1²)



Visualizing the ridge solution

RSS(w) + λ||w||2² = Σi (yi − w0 h0(xi) − w1 h1(xi))² + λ(w0² + w1²)

[Figure: the ridge solution is the point where the RSS contours meet the circular contours of the L2 penalty; because the ball is round, this point generically has no coordinates exactly at zero.]


Geometric intuition for lasso



Visualizing the lasso cost in 2D

RSS(w) + λ||w||1 = Σi (yi − w0 h0(xi) − w1 h1(xi))² + λ(|w0| + |w1|)



Visualizing the lasso solution

RSS(w) + λ||w||1 = Σi (yi − w0 h0(xi) − w1 h1(xi))² + λ(|w0| + |w1|)

[Figure: the lasso solution is the point where the RSS contours meet the diamond-shaped contours of the L1 penalty; contact often occurs at a corner of the diamond, where some coordinates are exactly zero, which is why lasso solutions are sparse.]


Revisit polynomial fit demo

What happens if we refit our high-order polynomial, but now using lasso regression?

Will consider a few settings of λ …
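One way such a demo might look, sketched with scikit-learn; note that sklearn's Lasso parameterizes the penalty as alpha, with a different scaling than the lecture's λ, and the data here is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

# Made-up data standing in for the demo: a noisy low-order signal
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 30))
y = np.sin(4 * x) + rng.normal(scale=0.3, size=30)

# High-order polynomial features, as in the earlier demo
H = PolynomialFeatures(degree=16).fit_transform(x[:, None])

# Refit with lasso for a few settings of the penalty strength
for lam in [1e-4, 1e-2, 1e0]:
    model = Lasso(alpha=lam, max_iter=100000).fit(H, y)
    print(f"alpha={lam}: {np.count_nonzero(model.coef_)} nonzero coefficients")
```

Larger penalties zero out more of the polynomial coefficients, taming the wiggly high-order fit.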



How to choose λ: Cross validation



If sufficient amount of data…

Split the data: Training set | Validation set | Test set

- fit ŵλ on the training set
- test performance of ŵλ on the validation set to select λ*
- assess generalization error of ŵλ* on the test set
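A sketch of this three-way split, assuming scikit-learn and synthetic data; the λ grid and split fractions are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Training | validation | test split (60/20/20 here, purely illustrative)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit ŵλ on the training set; pick λ* on the validation set
best_lam, best_err = None, np.inf
for lam in [1e-3, 1e-2, 1e-1, 1.0]:
    model = Lasso(alpha=lam).fit(X_train, y_train)
    err = mean_squared_error(y_valid, model.predict(X_valid))
    if err < best_err:
        best_lam, best_err = lam, err

# Assess generalization error of ŵλ* on the held-out test set
final = Lasso(alpha=best_lam).fit(X_train, y_train)
print("λ* =", best_lam, " test MSE =", mean_squared_error(y_test, final.predict(X_test)))
```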


Start with smallish dataset


All data


Still form test set and hold out


Rest of data | Test set


How do we use the other data?


Rest of data: use for both training and validation, but not so naively


Recall naïve approach

Is the validation set enough to compare performance of ŵλ across λ values?


Split: Training set | small validation set

Answer: No — the validation set is small.


Choosing the validation set

Didn't have to use the last tabulated data points to form the validation set

Can use any data subset



Choosing the validation set


Which subset should I use?

ALL of them! Average performance over all choices


K-fold cross validation

Preprocessing: Randomly assign the data ("Rest of data") to K groups, each with N/K points
(use same split of data for all other steps)

[Figure: the "Rest of data" bar divided into K blocks of N/K points each]


For k = 1, …, K:
1. Estimate ŵλ(k) on the training blocks (all blocks except the k-th)
2. Compute error on the k-th validation block: errork(λ)

(The validation block rotates through k = 1, …, K, yielding ŵλ(1), error1(λ), …, ŵλ(K), errorK(λ).)

Compute average error: CV(λ) = (1/K) Σk errork(λ)



Repeat procedure for each choice of λ

Choose λ* to minimize CV(λ)

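A sketch of the full procedure with scikit-learn's KFold, on made-up data; note again that sklearn's alpha is a rescaled version of the lecture's λ.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Illustrative data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # K=5; same split reused for every λ

cv_error = {}
for lam in [1e-3, 1e-2, 1e-1, 1.0]:
    errors = []
    for train_idx, valid_idx in kf.split(X):
        model = Lasso(alpha=lam).fit(X[train_idx], y[train_idx])        # ŵλ(k)
        errors.append(mean_squared_error(y[valid_idx],
                                         model.predict(X[valid_idx])))  # error_k(λ)
    cv_error[lam] = np.mean(errors)                                     # CV(λ)

lam_star = min(cv_error, key=cv_error.get)  # λ* minimizes CV(λ)
print("λ* =", lam_star)
```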


What value of K?

Formally, the best approximation occurs for validation sets of size 1, i.e., K = N (leave-one-out cross validation)

Computationally intensive: requires computing N fits of the model per λ

Typically, K = 5 or 10 (5-fold or 10-fold CV)


Choosing λ via cross validation for lasso

Cross validation chooses the λ that provides the best predictive accuracy.

This tends to favor less sparse solutions, and thus smaller λ, than the optimal choice for feature selection.

Cf. "Machine Learning: A Probabilistic Perspective", Murphy, 2012, for further discussion.



Practical concerns with lasso



Issues with standard lasso objective

1. With a group of highly correlated features, lasso tends to select among them arbitrarily
   - often prefer to select them all together

2. Often, ridge empirically has better predictive performance than lasso, but lasso leads to a sparser solution

Elastic net aims to address these issues (see the sketch after this list):
- a hybrid between lasso and ridge regression
- uses both L1 and L2 penalties

See Zou & Hastie '05 for further discussion.
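Scikit-learn exposes this hybrid as ElasticNet, with l1_ratio trading off the L1 and L2 penalties; the data below is made up to show the correlated-features behavior.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)   # two highly correlated features
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=100)

# l1_ratio=1.0 recovers lasso, l1_ratio=0.0 recovers ridge; values in
# between mix the L1 and L2 penalties
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_[:2])   # the correlated pair tends to share weight rather
                         # than one being arbitrarily zeroed out, as in lasso
```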



Summary for feature selection and lasso regression



Impact of feature selection and lasso

Lasso has changed machine learning, statistics, & electrical engineering

But, for feature selection in general, be careful about interpreting selected features

- selection only considers features included

- sensitive to correlations between features

- result depends on algorithm used

- there are theoretical guarantees for lasso under certain conditions



What you can do now…

• Describe "all subsets" and greedy variants for feature selection

• Analyze computational costs of these algorithms

• Formulate lasso objective

• Describe what happens to estimated lasso coefficients as tuning parameter λ is varied

• Interpret lasso coefficient path plot

• Contrast ridge and lasso regression

• Estimate lasso regression parameters using an iterative coordinate descent algorithm

• Implement K-fold cross validation to select lasso tuning parameter λ



CSE 446: Machine Learning
Emily Fox
University of Washington
January 20, 2017

©2017 Emily Fox

Linear classifiers


Linear classifier: Intuition



Classifier


Input x: a sentence from a review, e.g., "Sushi was awesome, the food was awesome, but the service was awful."

Classifier (the MODEL): maps input x to output y, the predicted class

Output ŷ: either ŷ = +1 or ŷ = −1


Simple linear classifier

Input x: sentence from review

Score(x) = weighted sum of features of the sentence (each feature paired with a coefficient)

If Score(x) > 0:
    ŷ = +1
Else:
    ŷ = −1


A simple example: Word counts


Feature                            Coefficient
good                                1.0
great                               1.2
awesome                             1.7
bad                                -1.0
terrible                           -2.1
awful                              -3.3
restaurant, the, we, where, …       0.0
…                                   …

Input xi: "Sushi was great, the food was awesome, but the service was terrible."

Called a linear classifier, because the score is a weighted sum of the features.
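A tiny sketch of this classifier on the example sentence, using the coefficients from the table above; the tokenization is deliberately naive.

```python
# Coefficients copied from the table above
coef = {"good": 1.0, "great": 1.2, "awesome": 1.7,
        "bad": -1.0, "terrible": -2.1, "awful": -3.3}

sentence = "Sushi was great, the food was awesome, but the service was terrible."
words = sentence.lower().replace(",", "").replace(".", "").split()

# Score(x) = weighted sum of word counts; words not in the table get 0.0
score = sum(coef.get(w, 0.0) for w in words)
y_hat = +1 if score > 0 else -1
print(score, y_hat)   # 1.2 + 1.7 - 2.1 = 0.8 > 0, so ŷ = +1
```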


More generically…


feature 1 = h0(x) … e.g., 1 (constant)
feature 2 = h1(x) … e.g., x[1] = #awesome
feature 3 = h2(x) … e.g., x[2] = #awful, or log(x[7])·x[2] = log(#bad) × #awful, or tf-idf("awful")
…
feature D+1 = hD(x) … some other function of x[1], …, x[d]

Model: ŷi = sign(Score(xi))
Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) = Σj wj hj(xi) = wᵀh(xi)
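A minimal sketch of this generic form, with illustrative feature functions over a word-count dictionary and the weights used in the decision-boundary example that follows.

```python
import numpy as np

def score(x, w, h):
    """Score(x) = Σ_j w_j h_j(x) = wᵀ h(x) for a list of feature functions h."""
    return float(np.dot(w, [hj(x) for hj in h]))

def predict(x, w, h):
    """ŷ = sign(Score(x))."""
    return 1 if score(x, w, h) > 0 else -1

# Illustrative feature functions over a word-count dictionary x
h = [lambda x: 1.0,                  # h0: constant feature
     lambda x: x.get("awesome", 0),  # h1: x[1] = #awesome
     lambda x: x.get("awful", 0)]    # h2: x[2] = #awful
w = np.array([0.0, 1.0, -1.5])       # weights from the example below

print(predict({"awesome": 2, "awful": 1}, w, h))   # Score = 0.5 → +1
```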


Decision boundaries



Suppose only two words had non-zero coefficients


[Plot: #awful (vertical axis) versus #awesome (horizontal axis), both 0–4]

"Sushi was awesome, the food was awesome, but the service was awful."

Score(x) = 1.0 #awesome − 1.5 #awful

Input       Coefficient   Value
(constant)   w0            0.0
#awesome     w1            1.0
#awful       w2           -1.5
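For the example sentence above (#awesome = 2, #awful = 1): Score(x) = 1.0·2 − 1.5·1 = 0.5 > 0, so ŷ = +1.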


Decision boundary example


[Plot: the decision boundary 1.0 #awesome − 1.5 #awful = 0 in the (#awesome, #awful) plane, separating the Score(x) > 0 region from the Score(x) < 0 region]

Score(x) = 1.0 #awesome − 1.5 #awful

Input       Coefficient   Value
(constant)   w0            0.0
#awesome     w1            1.0
#awful       w2           -1.5

The decision boundary separates the + and − predictions.


For more inputs (linear features)…

[3D plot: axes x[1] = #awesome, x[2] = #awful, x[3] = #great]

Score(x) = w0 + w1 #awesome + w2 #awful + w3 #great


For general features…

For more general classifiers (not just linear features), the decision boundary can take more complicated shapes.
