TRANSCRIPT

CSE 446: Machine Learning
Linear classifiers: Overfitting and regularization
Emily Fox, University of Washington
January 25, 2017
©2017 Emily Fox

Logistic regression recap

Thus far, we focused on decision boundaries

[Figure: decision boundary in the (#awesome, #awful) plane; Score(x) > 0 on one side of the line, Score(x) < 0 on the other]
Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) = wTh(xi)

Relate Score(xi) to P(y=+1|x,ŵ)?

Compare and contrast regression models

• Linear regression with Gaussian errors
  yi = wTh(xi) + εi,   εi ~ N(0, σ^2)
  p(y|x,w) = N(y; wTh(x), σ^2)
• Logistic regression
  P(y=+1|x,w) = 1 / (1 + e^(-wTh(x)))
  P(y=-1|x,w) = e^(-wTh(x)) / (1 + e^(-wTh(x)))

[Figure: Gaussian density with mean µ = wTh(x) and spread σ for linear regression; a two-bar distribution over y ∈ {-1, +1} for logistic regression]

Understanding the logistic regression model

[Figure: sigmoid curve mapping Score(xi) to P(y=+1|xi,w)]

P(y=+1|xi,w) = sigmoid(Score(xi)) = 1 / (1 + e^(-wTh(xi)))

Predict most likely class

Input: x (sentence from review)
P̂(y|x) = estimate of class probabilities
If P̂(y=+1|x) > 0.5: ŷ = +1; else: ŷ = -1
Estimating P̂(y|x) improves interpretability: predict ŷ = +1 and tell me how sure you are.
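
To make the two steps concrete (compute P(y=+1|x,w) as the sigmoid of the score, then threshold at 0.5), here is a minimal NumPy sketch. The feature matrix H, the weights, and the function names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def predict_probability(H, w):
    """Estimate P(y=+1 | x, w) = sigmoid(Score(x)), with Score(x) = w^T h(x) for each row of H."""
    score = H @ w
    return 1.0 / (1.0 + np.exp(-score))

def predict_class(H, w):
    """Predict y_hat = +1 when P(y=+1 | x, w) > 0.5, else y_hat = -1."""
    return np.where(predict_probability(H, w) > 0.5, 1, -1)

# Toy example with h(x) = [1, #awesome, #awful]
H = np.array([[1.0, 2.0, 1.0],    # 2 "awesome" words, 1 "awful" word
              [1.0, 0.0, 3.0]])   # 0 "awesome" words, 3 "awful" words
w = np.array([0.0, 1.0, -1.5])
print(predict_probability(H, w))  # approx [0.62, 0.01]
print(predict_class(H, w))        # [ 1 -1]
```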

Gradient descent algorithm, cont’d

Step 5: Gradient over all data points

∂ℓ(w)/∂wj = Σ_i hj(xi) (1[yi = +1] - P(y=+1|xi,w))

• Sum over data points
• hj(xi): feature value
• 1[yi = +1] - P(y=+1|xi,w): difference between truth and prediction
• The contribution is small when the prediction is already right: yi = +1 with P(y=+1|xi,w) ≈ 1, or yi = -1 with P(y=+1|xi,w) ≈ 0

Summary of gradient ascent for logistic regression
init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇ℓ(w(t))|| > ε:
  for j = 0, …, D:
    partial[j] = Σ_i hj(xi) (1[yi = +1] - P(y=+1|xi,w(t)))
    wj(t+1) ← wj(t) + η partial[j]
  t ← t + 1
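
The loop above translates almost line for line into NumPy. This is a sketch under illustrative assumptions (features precomputed into a matrix H with a constant column for w0, labels in {-1, +1}, a fixed step size η), not the course's reference implementation.

```python
import numpy as np

def logistic_gradient_ascent(H, y, step_size=0.1, epsilon=1e-4, max_iter=10000):
    """Maximize the logistic log-likelihood by gradient ascent.

    H: (N, D+1) matrix whose rows are h(x_i); y: (N,) labels in {-1, +1}.
    """
    w = np.zeros(H.shape[1])                       # init w(1) = 0
    indicator = (y == 1).astype(float)             # 1[y_i = +1]
    for _ in range(max_iter):
        prob = 1.0 / (1.0 + np.exp(-(H @ w)))      # P(y=+1 | x_i, w(t))
        partial = H.T @ (indicator - prob)         # partial[j] = sum_i h_j(x_i)(1[y_i=+1] - P(y=+1|x_i,w))
        if np.linalg.norm(partial) < epsilon:      # stop when ||grad l(w(t))|| <= epsilon
            return w
        w = w + step_size * partial                # w_j(t+1) = w_j(t) + eta * partial[j]
    return w
```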

Overfitting in classification

Learned decision boundary

Feature | Value | Coefficient learned
h0(x) | 1 | 0.23
h1(x) | x[1] | 1.12
h2(x) | x[2] | -1.07

Quadratic features (in 2d)

Note: we are not including cross terms for simplicity

Feature | Value | Coefficient learned
h0(x) | 1 | 1.68
h1(x) | x[1] | 1.39
h2(x) | x[2] | -0.59
h3(x) | (x[1])^2 | -0.17
h4(x) | (x[2])^2 | -0.96

Degree 6 features (in 2d)

Note: we are not including cross terms for simplicity

Feature | Value | Coefficient learned
h0(x) | 1 | 21.6
h1(x) | x[1] | 5.3
h2(x) | x[2] | -42.7
h3(x) | (x[1])^2 | -15.9
h4(x) | (x[2])^2 | -48.6
h5(x) | (x[1])^3 | -11.0
h6(x) | (x[2])^3 | 67.0
h7(x) | (x[1])^4 | 1.5
h8(x) | (x[2])^4 | 48.0
h9(x) | (x[1])^5 | 4.4
h10(x) | (x[2])^5 | -14.2
h11(x) | (x[1])^6 | 0.8
h12(x) | (x[2])^6 | -8.6

Coefficient values getting large

Degree 20 features (in 2d)

Note: we are not including cross terms for simplicity

Feature | Value | Coefficient learned
h0(x) | 1 | 8.7
h1(x) | x[1] | 5.1
h2(x) | x[2] | 78.7
… | … | …
h11(x) | (x[1])^6 | -7.5
h12(x) | (x[2])^6 | 3803
h13(x) | (x[1])^7 | 21.1
h14(x) | (x[2])^7 | -2406
… | … | …
h37(x) | (x[1])^19 | -2*10^-6
h38(x) | (x[2])^19 | -0.15
h39(x) | (x[1])^20 | -2*10^-8
h40(x) | (x[2])^20 | 0.03

Often, overfitting is associated with very large estimated coefficients ŵ

Overfitting in classification

[Figure: error (classification error) plotted against model complexity]

Overfitting if there exists w* such that:
• training_error(w*) > training_error(ŵ)
• true_error(w*) < true_error(ŵ)

Overfitting in classifiers ⇒ overconfident predictions

Logistic regression model

P(y=+1|xi,w) = sigmoid(wTh(xi))

[Figure: sigmoid curve mapping wTh(xi), ranging from -∞ to +∞, onto probabilities between 0.0 and 1.0, passing through 0.5]

The subtle (negative) consequence of overfitting in logistic regression

Overfitting ⇒ large coefficient values
⇒ ŵTh(xi) is very positive (or very negative)
⇒ sigmoid(ŵTh(xi)) goes to 1 (or to 0)
⇒ the model becomes extremely overconfident of predictions

Effect of coefficients on logistic regression model

[Figure: P(y=+1|x,w) plotted against #awesome - #awful for three coefficient settings:
 (w0 = 0, w#awesome = +1, w#awful = -1),
 (w0 = 0, w#awesome = +2, w#awful = -2),
 (w0 = 0, w#awesome = +6, w#awful = -6);
 larger coefficients make the sigmoid steeper, so the same input gets a more extreme probability]

Input x: #awesome = 2, #awful = 1

Learned probabilities

Feature | Value | Coefficient learned
h0(x) | 1 | 0.23
h1(x) | x[1] | 1.12
h2(x) | x[2] | -1.07

Quadratic features: Learned probabilities

Feature | Value | Coefficient learned
h0(x) | 1 | 1.68
h1(x) | x[1] | 1.39
h2(x) | x[2] | -0.58
h3(x) | (x[1])^2 | -0.17
h4(x) | (x[2])^2 | -0.96

Overfitting ⇒ overconfident predictions

[Figures: learned probabilities for degree 6 and degree 20 features; tiny uncertainty regions]

Overfitting & overconfident about it!!!
We are sure we are right, when we are surely wrong!

Overfitting in logistic regression: Another perspective

Linearly-separable data

Data are linearly separable if there exist coefficients ŵ such that:
• ŵTh(xi) > 0 for all positive training data
• ŵTh(xi) < 0 for all negative training data

Note 1: If you are using D features, linear separability happens in a D-dimensional space.
Note 2: If you have enough features, data are (almost) always linearly separable, and then training_error(ŵ) = 0.

Effect of linear separability on coefficients

[Figure: decision boundary in the plane x[1] = #awesome vs. x[2] = #awful; Score(x) > 0 on one side, Score(x) < 0 on the other]

• Data are linearly separable with ŵ1 = 1.0 and ŵ2 = -1.5
• Data are also linearly separable with ŵ1 = 10 and ŵ2 = -15
• Data are also linearly separable with ŵ1 = 10^9 and ŵ2 = -1.5*10^9

Effect of linear separability on weights

[Figure: same plot, highlighting the point x[1] = 2, x[2] = 1]

• Data are linearly separable with ŵ1 = 1.0 and ŵ2 = -1.5
• Data are also linearly separable with ŵ1 = 10 and ŵ2 = -15
• Data are also linearly separable with ŵ1 = 10^9 and ŵ2 = -1.5*10^9

As the coefficients are scaled up, P(y=+1|xi,ŵ) for this point is pushed toward 1. Maximum likelihood estimation (MLE) prefers the most certain model, so coefficients go to infinity for linearly-separable data!!!
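
A quick numerical check of this effect (the toy data below is made up for illustration; the coefficient settings match the slide): rescaling a separating coefficient vector pushes P(y=+1|x,ŵ) for the example point toward 1 and keeps increasing the log-likelihood, which is why MLE drives coefficients to infinity on separable data.

```python
import numpy as np

def log_likelihood(H, y, w):
    """Sum of ln P(y_i | x_i, w) for logistic regression with labels in {-1, +1}."""
    margins = y * (H @ w)                       # y_i * w^T h(x_i)
    return -np.sum(np.log1p(np.exp(-margins)))

# Linearly separable toy data with h(x) = [#awesome, #awful]
H = np.array([[3.0, 1.0], [2.0, 0.0], [1.0, 2.0], [0.0, 3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
x = np.array([2.0, 1.0])                        # the example point: #awesome=2, #awful=1
w = np.array([1.0, -1.5])                       # one separating choice of (w1, w2)

for c in [1.0, 10.0, 1e9]:                      # w, 10*w, 10^9 * w all separate the data
    p = 1.0 / (1.0 + np.exp(-x @ (c * w)))      # P(y=+1 | x, c*w) approaches 1 as c grows
    print(c, p, log_likelihood(H, y, c * w))    # log-likelihood keeps increasing toward 0
```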

Overfitting in logistic regression is “twice as bad”

Learning tries to find a decision boundary that separates the data:
• Overly complex boundary
• If data are linearly separable, coefficients go to infinity! (e.g., ŵ1 = 10^9, ŵ2 = -1.5*10^9)

Penalizing large coefficients to mitigate overfitting

Desired total cost format

Want to balance:
i. How well the function fits the data
ii. Magnitude of the coefficients

Total quality = measure of fit - measure of magnitude of coefficients
• Measure of fit (data likelihood): large # = good fit to training data
• Measure of magnitude of coefficients: large # = overfit
• We want to balance the two

Consider resulting objective

Select ŵ to maximize:   ℓ(w) - λ ||w||_2^2
(L2 regularized logistic regression; λ is a tuning parameter balancing fit and magnitude; a small code sketch of this objective follows below)

Pick λ using:
• Validation set (for large datasets)
• Cross-validation (for smaller datasets)
(as in ridge/lasso regression)
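
A small sketch of the resulting objective, computing "measure of fit minus λ times measure of magnitude" for a candidate w; the function name and data handling are illustrative assumptions.

```python
import numpy as np

def l2_regularized_quality(H, y, w, lam):
    """Total quality = log-likelihood (measure of fit) - lambda * ||w||_2^2 (measure of magnitude)."""
    margins = y * (H @ w)
    log_likelihood = -np.sum(np.log1p(np.exp(-margins)))   # large = good fit to training data
    magnitude = np.sum(w ** 2)                              # large = likely overfit
    return log_likelihood - lam * magnitude
```

A larger λ trades fit for smaller coefficients; as the slide notes, λ itself is chosen with a validation set or by cross-validation.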

Regularization: degree 20 features, effect of regularization penalty λ

λ | 0 | 0.00001 | 0.001 | 1 | 10
Range of coefficients | -3170 to 3803 | -8.04 to 12.14 | -0.70 to 1.25 | -0.13 to 0.57 | -0.05 to 0.22
Decision boundary | [figure for each λ]

Coefficient path

[Figure: coefficients ŵj plotted against λ]

Regularization: degree 20 features, regularization reduces “overconfidence”

λ | 0 | 0.00001 | 0.001 | 1
Range of coefficients | -3170 to 3803 | -8.04 to 12.14 | -0.70 to 1.25 | -0.13 to 0.57
Learned probabilities | [figure for each λ]

Finding best L2 regularized linear classifier with gradient ascent

Understanding the contribution of L2 regularization

The term the L2 penalty adds to the gradient is -2λwj. Impact on wj:
• If wj > 0, then -2λwj < 0, pushing wj down toward 0
• If wj < 0, then -2λwj > 0, pushing wj up toward 0

Summary of gradient ascent for logistic regression + L2 regularization
init w(1) = 0 (or randomly, or smartly), t = 1
while not converged:
  for j = 0, …, D:
    partial[j] = Σ_i hj(xi) (1[yi = +1] - P(y=+1|xi,w(t)))
    wj(t+1) ← wj(t) + η (partial[j] - 2λ wj(t))
  t ← t + 1
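
Relative to plain gradient ascent, the only change is subtracting 2λ·wj(t) inside each update. A sketch under the same illustrative assumptions as before, not the course's reference code:

```python
import numpy as np

def l2_logistic_gradient_ascent(H, y, lam, step_size=0.1, n_iter=1000):
    """Gradient ascent on the L2-regularized log-likelihood  l(w) - lambda * ||w||_2^2."""
    w = np.zeros(H.shape[1])
    indicator = (y == 1).astype(float)
    for _ in range(n_iter):                              # "while not converged", approximated by a fixed budget
        prob = 1.0 / (1.0 + np.exp(-(H @ w)))
        partial = H.T @ (indicator - prob)               # gradient of the log-likelihood
        w = w + step_size * (partial - 2.0 * lam * w)    # w_j(t+1) = w_j(t) + eta * (partial[j] - 2*lambda*w_j(t))
    return w
```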

Summary of overfitting in logistic regression

What you can do now…
• Identify when overfitting is happening
• Relate large learned coefficients to overfitting
• Describe the impact of overfitting on decision boundaries and predicted probabilities of linear classifiers
• Motivate the form of the L2 regularized logistic regression quality metric
• Describe what happens to estimated coefficients as the tuning parameter λ is varied
• Interpret a coefficient path plot
• Estimate L2 regularized logistic regression coefficients using gradient ascent
• Describe the use of L1 regularization to obtain sparse logistic regression solutions

Linear classifiers: Scaling up learning via SGD
CSE 446: Machine Learning, Emily Fox, University of Washington, January 25, 2017

Stochastic gradient descent: Learning, one data point at a time

Why gradient ascent is slow…

[Figure: starting from w(t), computing the gradient ∇ℓ(w(t)) requires a pass over all the data before the coefficients can be updated to w(t+1); another full pass is needed for ∇ℓ(w(t+1)) and the update to w(t+2)]

Every update requires a full pass over the data.

More formally: How expensive is gradient ascent?

∂ℓ(w)/∂wj = Σ_i hj(xi) (1[yi = +1] - P(y=+1|xi,w))
• The sum runs over all data points
• Each term is the contribution of data point (xi, yi) to the gradient

Every step requires touching every data point!!!

The gradient sums over all data points, so one step of gradient ascent costs N times the per-point cost:

Time to compute contribution of xi,yi | # of data points (N) | Total time to compute 1 step of gradient ascent
1 millisecond | 1,000 | 1 second
1 second | 1,000 | 16.7 minutes
1 millisecond | 10 million | 2.8 hours
1 millisecond | 10 billion | 115.7 days

Stochastic gradient ascent

[Figure: from w(t), compute a gradient using only a small subset of the data, update to w(t+1), then repeat with different subsets for w(t+2), w(t+3), w(t+4), …]

Many updates for each pass over the data.

Example: Instead of all data points for the gradient, use 1 data point only???

Gradient ascent: the derivative sums over all data points,
  ∂ℓ(w)/∂wj = Σ_i hj(xi) (1[yi = +1] - P(y=+1|xi,w))
Stochastic gradient ascent: approximate the sum with a single data point i,
  ∂ℓ(w)/∂wj ≈ hj(xi) (1[yi = +1] - P(y=+1|xi,w)),   each time picking a different data point i

Stochastic gradient ascent for logistic regression

init w(1) = 0, t = 1
until converged:
  for i = 1, …, N:   (each time, pick a different data point i, instead of summing over all data points)
    for j = 0, …, D:
      partial[j] = hj(xi) (1[yi = +1] - P(y=+1|xi,w(t)))
      wj(t+1) ← wj(t) + η partial[j]
    t ← t + 1
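
A minimal sketch of this loop: one coefficient update per data point, with a different point each time. Shuffling the order every pass and the fixed step size are illustrative choices, not prescribed by the slide.

```python
import numpy as np

def stochastic_gradient_ascent(H, y, step_size=0.01, n_passes=10, seed=0):
    """Logistic regression by stochastic gradient ascent: one data point per update."""
    rng = np.random.default_rng(seed)
    N = H.shape[0]
    w = np.zeros(H.shape[1])
    indicator = (y == 1).astype(float)
    for _ in range(n_passes):                         # "until converged", approximated by a fixed number of passes
        for i in rng.permutation(N):                  # each time, pick a different data point i
            prob_i = 1.0 / (1.0 + np.exp(-(H[i] @ w)))
            partial = H[i] * (indicator[i] - prob_i)  # contribution of (x_i, y_i) only
            w = w + step_size * partial
    return w
```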

Stochastic gradient for the L2-regularized objective

What about the regularization term?

Total derivative with respect to wj = partial[j] - 2λwj, where partial[j] sums over all data points.

Stochastic gradient ascent (each time, pick a different data point i):
Total derivative ≈ partial[j] - 2λwj/N, with partial[j] computed from data point i only.

Each data point contributes 1/N of the regularization.
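
The per-point update then subtracts (2λ/N)·wj instead of the full 2λ·wj, so the regularization is spread across the N data points. A sketch, using the same illustrative setup as the previous block:

```python
import numpy as np

def l2_stochastic_gradient_ascent(H, y, lam, step_size=0.01, n_passes=10, seed=0):
    """Stochastic gradient ascent for the L2-regularized logistic regression objective."""
    rng = np.random.default_rng(seed)
    N = H.shape[0]
    w = np.zeros(H.shape[1])
    indicator = (y == 1).astype(float)
    for _ in range(n_passes):
        for i in rng.permutation(N):
            prob_i = 1.0 / (1.0 + np.exp(-(H[i] @ w)))
            partial = H[i] * (indicator[i] - prob_i)
            w = w + step_size * (partial - (2.0 * lam / N) * w)   # each point carries 1/N of the -2*lambda*w_j term
    return w
```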

Comparing computational time per step

Time to compute contribution of xi,yi | # of data points (N) | Total time for 1 step of gradient | Total time for 1 step of stochastic gradient
1 millisecond | 1,000 | 1 second | 1 millisecond
1 second | 1,000 | 16.7 minutes | 1 second
1 millisecond | 10 million | 2.8 hours | 1 millisecond
1 millisecond | 10 billion | 115.7 days | 1 millisecond

Gradient ascent (sum over all data points) ≈ stochastic gradient ascent (one data point per step).

Which one is better??? Depends…

Algorithm | Time per iteration | Total time to convergence for large data (in theory) | Total time to convergence for large data (in practice) | Sensitivity to parameters
Gradient | Slow for large data | Slower | Often slower | Moderate
Stochastic gradient | Always fast | Faster | Often faster | Very high

Summary of stochastic gradient
• Tiny change to gradient ascent
• Much better scalability
• Huge impact in the real world
• Very tricky to get right in practice

Why would stochastic gradient ever work???

Gradient is direction of steepest ascent

The gradient is the “best” direction, but any direction that goes “up” would be useful.

In ML, steepest direction is sum of “little directions” from each data point

The gradient is a sum over data points; for most data points, the contribution points “up”.

Stochastic gradient: Pick a data point and move in its direction

Most of the time, the total likelihood will increase.

Stochastic gradient ascent: most iterations increase the likelihood, but some decrease it; on average, we make progress.

until converged:
  for i = 1, …, N:
    for j = 0, …, D:
      wj(t+1) ← wj(t) + η hj(xi) (1[yi = +1] - P(y=+1|xi,w(t)))
    t ← t + 1

Convergence path

Convergence paths

[Figure: convergence paths of gradient and stochastic gradient]

Stochastic gradient convergence is “noisy”

[Figure: average log likelihood (higher is better) vs. total time, proportional to # passes over the data]

• Gradient usually increases the likelihood smoothly
• Stochastic gradient makes “noisy” progress
• Stochastic gradient achieves higher likelihood sooner, but it’s noisier

Eventually, gradient catches up

[Figure: average log likelihood (higher is better) over time for gradient and stochastic gradient]

Note: you should only trust the “average” quality of stochastic gradient.

The last coefficients may be really good or really bad!!

Stochastic gradient will eventually oscillate around a solution; for example, w(1000) was bad while w(1005) was good. How do we minimize the risk of picking bad coefficients? Minimize the noise: don’t return the last learned coefficients; output the average instead:

ŵ = (1/T) Σ_t w(t)
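
A sketch of that fix: run the same stochastic updates, but keep a running sum of the iterates and return their average ŵ = (1/T) Σ_t w(t) instead of the last w(t). Variable names are illustrative.

```python
import numpy as np

def stochastic_gradient_ascent_averaged(H, y, step_size=0.01, n_passes=10, seed=0):
    """Stochastic gradient ascent that outputs the average of all iterates w(t)."""
    rng = np.random.default_rng(seed)
    N = H.shape[0]
    w = np.zeros(H.shape[1])
    w_sum = np.zeros(H.shape[1])
    T = 0
    indicator = (y == 1).astype(float)
    for _ in range(n_passes):
        for i in rng.permutation(N):
            prob_i = 1.0 / (1.0 + np.exp(-(H[i] @ w)))
            w = w + step_size * H[i] * (indicator[i] - prob_i)
            w_sum += w                      # accumulate the iterates
            T += 1
    return w_sum / T                        # w_hat = (1/T) * sum_t w(t)
```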

Summary of why stochastic gradient works
• The gradient finds the direction of steepest ascent
• The gradient is a sum of contributions from each data point
• Stochastic gradient uses the direction from 1 data point
• On average it increases the likelihood, though it sometimes decreases it
• Stochastic gradient has “noisy” convergence

Online learning: Fitting models from streaming data

Batch vs online learning

Batch learning
• All data is available at the start of training time

Online learning
• Data arrives (streams in) over time
• Must train the model as the data arrives!

[Figure: batch learning feeds all data into the ML algorithm once, producing ŵ; online learning feeds data arriving at t = 1, 2, 3, 4, … into the ML algorithm, producing ŵ(1), ŵ(2), ŵ(3), ŵ(4), …]

Online learning example: Ad targeting

[Figure: the input xt (user info, page text) goes to the ML algorithm with current coefficients ŵ(t), which outputs ŷ = suggested ads (Ad1, Ad2, Ad3); the user clicks on Ad2, so yt = Ad2, and the coefficients are updated to ŵ(t+1)]

Online learning problem
• Data arrives over each time step t:
  - Observe input xt (info about the user, text of the webpage)
  - Make a prediction ŷt (which ad to show)
  - Observe the true output yt (which ad the user clicked on)

Need an ML algorithm to update the coefficients each time step!

Stochastic gradient ascent can be used for online learning!!!

• init w(1) = 0, t = 1
• Each time step t:
  - Observe input xt
  - Make a prediction ŷt
  - Observe the true output yt
  - Update coefficients:
    for j = 0, …, D:
      wj(t+1) ← wj(t) + η hj(xt) (1[yt = +1] - P(y=+1|xt,w(t)))
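
A sketch of the online version: each arriving (xt, yt) triggers exactly one stochastic-gradient update of the coefficients. The simulated stream, the feature vectors, and labels in {-1, +1} are illustrative assumptions (in the ad example, yt would instead encode which ad was clicked).

```python
import numpy as np

def online_update(w, h_t, y_t, step_size=0.01):
    """One online step: after observing h(x_t) and the true label y_t in {-1, +1}, update w."""
    prob = 1.0 / (1.0 + np.exp(-(h_t @ w)))            # P(y=+1 | x_t, w(t))
    return w + step_size * h_t * (float(y_t == 1) - prob)

# Simulated stream of (h(x_t), y_t) pairs
stream = [(np.array([1.0, 2.0, 1.0]), 1),
          (np.array([1.0, 0.0, 3.0]), -1)]

w = np.zeros(3)                                        # init w(1) = 0
for h_t, y_t in stream:
    y_hat = 1 if h_t @ w > 0 else -1                   # make a prediction y_hat_t
    w = online_update(w, h_t, y_t)                     # observe y_t, then update coefficients
```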

Summary of online learning
• Data arrives over time
• Must make a prediction every time a new data point arrives
• Observe the true class after the prediction is made
• Want to update the parameters immediately

Summary of stochastic gradient descent
What you can do now…
• Significantly speed up a learning algorithm using stochastic gradient
• Describe the intuition behind why stochastic gradient works
• Apply stochastic gradient in practice
• Describe online learning problems
• Relate stochastic gradient to online learning

Generalized linear models

Models for data

• Linear regression with Gaussian errors
  yi = wTh(xi) + εi,   εi ~ N(0, σ^2)
  p(y|x,w) = N(y; wTh(x), σ^2)
• Logistic regression
  P(y=+1|x,w) = 1 / (1 + e^(-wTh(x)))
  P(y=-1|x,w) = e^(-wTh(x)) / (1 + e^(-wTh(x)))

Examples of “generalized linear models”

A generalized linear model has E[y | x, w] = g(wTh(x)), where g is the link function.

• Linear regression: mean = wTh(x); link function g = identity function
• Logistic regression (assuming y in {0,1} instead of {-1,1} or some other set of values): mean = P(y=+1|x,w) (the mean is different if y is not in {0,1}); link function g = sigmoid function (needs slight modification if y is not in {0,1})

Stochastic gradient descent more formally

Learning problems as expectations

• Minimizing loss in training data:
  - Given a dataset, sampled i.i.d. from some distribution p(x) on features
  - A loss function, e.g., hinge loss, logistic loss, …
  - We often minimize the loss in the training data: the average of the per-point losses, (1/N) Σ_i L(w; xi, yi)
• However, we should really minimize the expected loss on all data, an integral over the data distribution
• So, we are approximating the integral by the average over the training data

Gradient ascent in terms of expectations

• “True” objective function: the expected objective over the data distribution, E[ℓ(w; x, y)]
• Taking the gradient: ∇ E[ℓ(w; x, y)] = E[∇ℓ(w; x, y)]
• “True” gradient ascent rule: w(t+1) ← w(t) + η E[∇ℓ(w(t); x, y)]
• How do we estimate the expected gradient?

SGD: Stochastic gradient ascent (or descent)

• “True” gradient: the expected gradient E[∇ℓ(w; x, y)]
• Sample-based approximation: the average of the per-point gradients ∇ℓ(w; xi, yi) over the training samples
• What if we estimate the gradient with just one sample???
  - Unbiased estimate of the gradient
  - Very noisy!
  - Called stochastic gradient ascent (or descent), among many other names
  - VERY useful in practice!!!
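
A small simulation of the “unbiased but very noisy” claim (all data below is synthetic and purely illustrative): the average of many one-sample gradient estimates matches the full-data average gradient, while any single estimate is far more variable.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 3
H = rng.normal(size=(N, D))
true_w = np.array([1.0, -2.0, 0.5])
y = np.where(H @ true_w + rng.normal(size=N) > 0, 1.0, -1.0)

w = np.array([0.5, -0.5, 0.0])                      # an arbitrary point at which to evaluate gradients
indicator = (y == 1).astype(float)
prob = 1.0 / (1.0 + np.exp(-(H @ w)))

full_gradient = H.T @ (indicator - prob) / N        # average gradient over the whole training set

# One-sample estimates: unbiased on average, very noisy individually
idx = rng.integers(0, N, size=20000)
one_sample = H[idx] * (indicator[idx] - prob[idx])[:, None]
print(full_gradient)
print(one_sample.mean(axis=0))                      # close to full_gradient
print(one_sample.std(axis=0))                       # much larger spread for a single estimate
```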