

CSE 446: Machine Learning

Linear classifiers: Overfitting and regularization

©2017 Emily Fox

Emily Fox, University of Washington, January 25, 2017

CSE 446: Machine Learning

Logistic regression recap

©2017 Emily Fox


CSE 446: Machine Learning 3

Thus far, we focused on decision boundaries

©2017 Emily Fox

[Plot: decision boundary in the (#awesome, #awful) plane, with Score(x) > 0 on one side and Score(x) < 0 on the other]

Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) = wTh(xi)

Relate Score(xi) to P(y=+1|x,ŵ)?

CSE 446: Machine Learning 4

Compare and contrast regression models

• Linear regression with Gaussian errors

  yi = wTh(xi) + εi,   εi ~ N(0, σ2)

  p(y|x,w) = N(y; wTh(x), σ2)

  (mean: μ = wTh(x))

• Logistic regression

  P(y=+1|x,w) = 1 / (1 + e^(-wTh(x)))

  P(y=-1|x,w) = e^(-wTh(x)) / (1 + e^(-wTh(x)))

©2017 Emily Fox


CSE 446: Machine Learning 5

Understanding the logistic regression model

©2017 Emily Fox

Score(xi) → P(y=+1|xi,w)

P(y=+1|xi,w) = sigmoid(Score(xi)) = 1 / (1 + e^(-wTh(xi)))

CSE 446: Machine Learning 6

P̂(y|x) = estimate of class probabilities

If P̂(y=+1|x) > 0.5:
   ŷ = +1
Else:
   ŷ = -1

©2017 Emily Fox

Input x: sentence from review

Predict most likely class: ŷ

Estimating P̂(y|x) improves interpretability:
- Predict ŷ = +1 and tell me how sure you are
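A minimal sketch of this prediction rule in Python/NumPy (the helper names and the feature vector h_x are illustrative, not from the slides):

    import numpy as np

    def sigmoid(score):
        # squash Score(x) into a probability in (0, 1)
        return 1.0 / (1.0 + np.exp(-score))

    def predict(w, h_x):
        # Score(x) = w^T h(x);  P(y=+1|x,w) = sigmoid(Score(x))
        p_plus = sigmoid(np.dot(w, h_x))
        y_hat = +1 if p_plus > 0.5 else -1
        return y_hat, p_plus

The returned probability is what makes the prediction interpretable: ŷ plus how sure the model is.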


CSE 446: Machine Learning

Gradient descent algorithm, cont’d

©2017 Emily Fox

CSE 446: Machine Learning 8   ©2017 Emily Fox

Step 5: Gradient over all data points

∂ℓ(w)/∂wj = Σ_{i=1}^{N} hj(xi) ( 1[yi = +1] − P(y=+1|xi,w) )

- Sum over data points
- hj(xi): feature value
- ( 1[yi = +1] − P(y=+1|xi,w) ): difference between truth and prediction
  (small when yi = +1 and P(y=+1|xi,w) ≈ 1, or when yi = -1 and P(y=+1|xi,w) ≈ 0)


CSE 446: Machine Learning 9

Summary of gradient ascent for logistic regression

©2017 Emily Fox

init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇ℓ(w(t))|| > ε:
   for j = 0, …, D:
      partial[j] = Σ_{i=1}^{N} hj(xi) ( 1[yi = +1] − P(y=+1|xi, w(t)) )
      wj(t+1) ← wj(t) + η partial[j]
   t ← t + 1
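A compact sketch of this loop in Python/NumPy, assuming a feature matrix H whose rows are h(xi) and labels yi in {-1, +1} (the function and variable names are illustrative, not from the slides):

    import numpy as np

    def gradient_ascent(H, y, eta=0.1, eps=1e-4, max_iter=1000):
        N, D1 = H.shape
        w = np.zeros(D1)                                 # init w = 0
        indicator = (y == 1).astype(float)               # 1[y_i = +1]
        for _ in range(max_iter):
            p_plus = 1.0 / (1.0 + np.exp(-H.dot(w)))     # P(y=+1 | x_i, w)
            partial = H.T.dot(indicator - p_plus)        # sum over data points
            if np.linalg.norm(partial) <= eps:           # stop once the gradient is small
                break
            w = w + eta * partial                        # ascent step
        return w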

CSE 446: Machine Learning

Overfitting in classification

©2017 Emily Fox


CSE 446: Machine Learning 11

Learned decision boundary

©2017 Emily Fox

Feature   Value   Coefficient learned
h0(x)     1        0.23
h1(x)     x[1]     1.12
h2(x)     x[2]    -1.07

CSE 446: Machine Learning 12

Quadratic features (in 2d)

©2017 Emily Fox

Note: we are not including cross terms for simplicity

Feature   Value      Coefficient learned
h0(x)     1           1.68
h1(x)     x[1]        1.39
h2(x)     x[2]       -0.59
h3(x)     (x[1])^2   -0.17
h4(x)     (x[2])^2   -0.96


CSE 446: Machine Learning 13

Degree 6 features (in 2d)

©2017 Emily Fox

Note: we are not including cross terms for simplicity

Feature    Value       Coefficient learned
h0(x)      1            21.6
h1(x)      x[1]          5.3
h2(x)      x[2]        -42.7
h3(x)      (x[1])^2    -15.9
h4(x)      (x[2])^2    -48.6
h5(x)      (x[1])^3    -11.0
h6(x)      (x[2])^3     67.0
h7(x)      (x[1])^4      1.5
h8(x)      (x[2])^4     48.0
h9(x)      (x[1])^5      4.4
h10(x)     (x[2])^5    -14.2
h11(x)     (x[1])^6      0.8
h12(x)     (x[2])^6     -8.6

Coefficient values getting large

CSE 446: Machine Learning 14

Degree 20 features (in 2d)

©2017 Emily Fox

Note: we are not including cross terms for simplicity

Feature    Value        Coefficient learned
h0(x)      1              8.7
h1(x)      x[1]           5.1
h2(x)      x[2]          78.7
…          …              …
h11(x)     (x[1])^6      -7.5
h12(x)     (x[2])^6    3803
h13(x)     (x[1])^7      21.1
h14(x)     (x[2])^7   -2406
…          …              …
h37(x)     (x[1])^19    -2×10^-6
h38(x)     (x[2])^19    -0.15
h39(x)     (x[1])^20    -2×10^-8
h40(x)     (x[2])^20     0.03

Often, overfitting associated with very large estimated coefficients ŵ


CSE 446: Machine Learning 15

Overfitting in classification

©2017 Emily Fox

[Plot: classification error vs. model complexity]

Overfitting if there exists w*:
• training_error(w*) > training_error(ŵ)
• true_error(w*) < true_error(ŵ)

CSE 446: Machine Learning

Overfitting in classifiers ⇒ Overconfident predictions

©2017 Emily Fox


CSE 446: Machine Learning 17

Logistic regression model

©2017 Emily Fox

[Plot: sigmoid curve mapping wTh(xi) in (-∞, +∞) to P(y=+1|xi,w) in [0.0, 1.0], passing through 0.5 at wTh(xi) = 0]

P(y=+1|xi,w) = sigmoid(wTh(xi))

CSE 446: Machine Learning 18

The subtle (negative) consequence of overfitting in logistic regression

Overfitting ⇒ large coefficient values
⇒ ŵTh(xi) is very positive (or very negative)
⇒ sigmoid(ŵTh(xi)) goes to 1 (or to 0)
⇒ model becomes extremely overconfident of predictions

©2017 Emily Fox


CSE 446: Machine Learning 19

Effect of coefficients on logistic regression model

©2017 Emily Fox

w0         0    |   w0         0    |   w0         0
w#awesome +1    |   w#awesome +2    |   w#awesome +6
w#awful   -1    |   w#awful   -2    |   w#awful   -6

[Plots: P(y=+1|x,w) as a function of #awesome - #awful for each coefficient setting]

Input x: #awesome = 2, #awful = 1
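To see the effect concretely, here is a quick check of the predicted probability for this input under each of the three coefficient settings (the numbers are computed here for illustration, not read off the slides):

    import numpy as np

    def p_plus(w0, w_awesome, w_awful, n_awesome=2, n_awful=1):
        score = w0 + w_awesome * n_awesome + w_awful * n_awful
        return 1.0 / (1.0 + np.exp(-score))

    print(p_plus(0, 1, -1))   # score = 1  ->  ~0.73
    print(p_plus(0, 2, -2))   # score = 2  ->  ~0.88
    print(p_plus(0, 6, -6))   # score = 6  ->  ~0.998

The decision boundary (#awesome = #awful) is identical in all three cases; only the reported confidence changes.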

CSE 446: Machine Learning 20

Learned probabilities

©2017 Emily Fox

Feature   Value   Coefficient learned
h0(x)     1        0.23
h1(x)     x[1]     1.12
h2(x)     x[2]    -1.07


CSE 446: Machine Learning 21

Quadratic features: Learned probabilities

©2017 Emily Fox

Feature   Value      Coefficient learned
h0(x)     1           1.68
h1(x)     x[1]        1.39
h2(x)     x[2]       -0.58
h3(x)     (x[1])^2   -0.17
h4(x)     (x[2])^2   -0.96

CSE 446: Machine Learning 22

Overfitting ⇒ Overconfident predictions

©2017 Emily Fox

[Panels: learned probabilities with degree 6 features vs. degree 20 features]

Tiny uncertainty regions

Overfitting & overconfident about it!!!

We are sure we are right, when we are surely wrong!


CSE 446: Machine Learning

Overfitting in logistic regression: Another perspective

©2017 Emily Fox

CSE 446: Machine Learning 24

Linearly-separable data

Data are linearly separable if:
• There exist coefficients ŵ such that:
  - For all positive training data: Score(xi) = ŵTh(xi) > 0
  - For all negative training data: Score(xi) = ŵTh(xi) < 0

©2017 Emily Fox

Note 1: If you are using D features, linear separability happens in a D-dimensional space

Note 2: If you have enough features, data are (almost) always linearly separable, and then training_error(ŵ) = 0


CSE 446: Machine Learning 25

Effect of linear separability on coefficients

©2017 Emily Fox

[Plot: training data in the (x[1] = #awesome, x[2] = #awful) plane, with Score(x) > 0 on one side of the boundary and Score(x) < 0 on the other]

Data are linearly separable with ŵ1 = 1.0 and ŵ2 = -1.5
Data also linearly separable with ŵ1 = 10 and ŵ2 = -15
Data also linearly separable with ŵ1 = 10^9 and ŵ2 = -1.5×10^9

CSE 446: Machine Learning 26

Effect of linear separability on weights

©2017 Emily Fox

Data are linearly separable with ŵ1 = 1.0 and ŵ2 = -1.5
Data also linearly separable with ŵ1 = 10 and ŵ2 = -15
Data also linearly separable with ŵ1 = 10^9 and ŵ2 = -1.5×10^9

[Plot: same (x[1] = #awesome, x[2] = #awful) data, highlighting the point x[1] = 2, x[2] = 1 and its P(y=+1|xi,ŵ) under each choice of coefficients]

Maximum likelihood estimation (MLE) prefers the most certain model ⇒ coefficients go to infinity for linearly-separable data!!!
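For the highlighted point (x[1] = 2, x[2] = 1), the score ŵ1·x[1] + ŵ2·x[2] is 1.0·2 − 1.5·1 = 0.5, then 10·2 − 15·1 = 5, then 5×10^8, so P(y=+1|x,ŵ) = sigmoid(score) grows from about 0.62 to about 0.99 to essentially 1 (values computed here for illustration, not from the slides). Rescaling ŵ leaves the boundary unchanged but keeps increasing the training likelihood, which is why MLE keeps pushing the coefficients outward.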


CSE 446: Machine Learning 27

Overfitting in logistic regression is "twice as bad"

Learning tries to find decision boundary that separates data:
• Overly complex boundary
• If data are linearly separable ⇒ coefficients go to infinity!
  (e.g., ŵ1 = 10^9, ŵ2 = -1.5×10^9)

©2017 Emily Fox

CSE 446: Machine Learning

Penalizing large coefficients to mitigate overfitting

©2017 Emily Fox


CSE 446: Machine Learning 29

Desired total cost format

Want to balance:

i. How well function fits data

ii. Magnitude of coefficients

Total quality = measure of fit − measure of magnitude of coefficients

• measure of fit (data likelihood): large # = good fit to training data
• measure of magnitude of coefficients: large # = overfit
• want to balance

©2017 Emily Fox

CSE 446: Machine Learning 30

Consider resulting objective

Select ŵ to maximize:

   ℓ(w) − λ ||w||₂²        (L2 regularized logistic regression)

λ: tuning parameter = balance of fit and magnitude

Pick λ using:
• Validation set (for large datasets)
• Cross-validation (for smaller datasets)
(as in ridge/lasso regression)

©2017 Emily Fox


CSE 446: Machine Learning 31

Regularization

Degree 20 features, effect of regularization penalty λ:

                        λ = 0           λ = 0.00001      λ = 0.001       λ = 1           λ = 10
Range of coefficients   -3170 to 3803   -8.04 to 12.14   -0.70 to 1.25   -0.13 to 0.57   -0.05 to 0.22
Decision boundary       [plots for each λ]

©2017 Emily Fox

CSE 446: Machine Learning 32

Coefficient path

©2017 Emily Fox

[Plot: coefficients ŵj as a function of λ]


CSE 446: Machine Learning 33

Regularization

Degree 20 features: regularization reduces "overconfidence"

                        λ = 0           λ = 0.00001      λ = 0.001       λ = 1
Range of coefficients   -3170 to 3803   -8.04 to 12.14   -0.70 to 1.25   -0.13 to 0.57
Learned probabilities   [plots for each λ]

©2017 Emily Fox

CSE 446: Machine Learning

Finding best L2 regularized linear classifier with gradient ascent

©2017 Emily Fox


CSE 446: Machine Learning 35

Understanding contribution of L2 regularization

©2017 Emily Fox

Term from L2 penalty: −2λwj

Impact on wj:
• If wj > 0, then −2λwj < 0, so the update pushes wj toward 0
• If wj < 0, then −2λwj > 0, so the update pushes wj toward 0

CSE 446: Machine Learning 36

Summary of gradient ascent for logistic regression + L2 regularization

©2017 Emily Fox

init w(1) = 0 (or randomly, or smartly), t = 1
while not converged:
   for j = 0, …, D:
      partial[j] = Σ_{i=1}^{N} hj(xi) ( 1[yi = +1] − P(y=+1|xi, w(t)) )
      wj(t+1) ← wj(t) + η ( partial[j] − 2λ wj(t) )
   t ← t + 1
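Relative to the unregularized sketch earlier, the only change is the −2λw term in the update; a minimal version under the same assumptions (illustrative names, labels in {-1, +1}):

    import numpy as np

    def l2_gradient_ascent(H, y, lam, eta=0.1, n_iters=1000):
        N, D1 = H.shape
        w = np.zeros(D1)
        indicator = (y == 1).astype(float)
        for _ in range(n_iters):
            p_plus = 1.0 / (1.0 + np.exp(-H.dot(w)))
            partial = H.T.dot(indicator - p_plus)      # likelihood part of the gradient
            w = w + eta * (partial - 2.0 * lam * w)    # L2 penalty shrinks each w_j toward 0
        return w

In practice the intercept feature h0(x) = 1 is often left unregularized; the slide's update does not single it out, so this sketch does not either.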


CSE 446: Machine Learning

Summary of overfitting in logistic regression

©2017 Emily Fox

CSE 446: Machine Learning 38

What you can do now…

• Identify when overfitting is happening
• Relate large learned coefficients to overfitting
• Describe the impact of overfitting on decision boundaries and predicted probabilities of linear classifiers
• Motivate the form of L2 regularized logistic regression quality metric
• Describe what happens to estimated coefficients as tuning parameter λ is varied
• Interpret coefficient path plot
• Estimate L2 regularized logistic regression coefficients using gradient ascent
• Describe the use of L1 regularization to obtain sparse logistic regression solutions

©2017 Emily Fox


CSE 446: Machine Learning

Linear classifiers: Scaling up learning via SGD

©2017 Emily Fox

Emily Fox, University of Washington, January 25, 2017

CSE 446: Machine Learning

Stochastic gradient descent: Learning, one data point at a time

©2017 Emily Fox


CSE 446: Machine Learning 41

Why gradient ascent is slow…

©2017 Emily Fox

[Diagram: w(t) → compute gradient ∇ℓ(w(t)) (pass over data) → update coefficients to w(t+1) → compute ∇ℓ(w(t+1)) (another pass over data) → update coefficients to w(t+2) → …]

Every update requires a full pass over data

CSE 446: Machine Learning 42

More formally: How expensive is gradient ascent?

©2017 Emily Fox

∂ℓ(w)/∂wj = Σ_{i=1}^{N} [ contribution of data point xi,yi to gradient ]    (sum over data points)


CSE 446: Machine Learning 43

Every step requires touching every data point!!!

©2017 Emily Fox

Time to compute contribution of xi,yi   # of data points (N)   Total time to compute 1 step of gradient ascent
1 millisecond                            1,000                  1 second
1 second                                 1,000                  16.7 minutes
1 millisecond                            10 million             2.8 hours
1 millisecond                            10 billion             115.7 days

CSE 446: Machine Learning 44

Stochastic gradient ascent

©2017 Emily Fox

[Diagram: w(t) → compute gradient using only a small subset of the data → update coefficients to w(t+1) → update to w(t+2) → w(t+3) → w(t+4) → …]

Many updates for each pass over data


CSE 446: Machine Learning 45

Example: Instead of all data points for gradient, use 1 data point only???

©2017 Emily Fox

Gradient ascent (sum over data points):
   partial[j] = Σ_{i=1}^{N} hj(xi) ( 1[yi = +1] − P(y=+1|xi, w) )

Stochastic gradient ascent (each time, pick a different data point i):
   partial[j] = hj(xi) ( 1[yi = +1] − P(y=+1|xi, w) )

CSE 446: Machine Learning 46

Stochastic gradient ascent for logistic regression

©2017 Emily Fox

init w(1) = 0, t = 1
until converged:
   for i = 1, …, N:   (each time, pick a different data point i; no sum over data points)
      for j = 0, …, D:
         partial[j] = hj(xi) ( 1[yi = +1] − P(y=+1|xi, w(t)) )
         wj(t+1) ← wj(t) + η partial[j]
      t ← t + 1
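A sketch of the stochastic loop under the same assumptions as the earlier code; shuffling the data order on each pass is a common practical choice that the slide does not specify:

    import numpy as np

    def stochastic_gradient_ascent(H, y, eta=0.1, n_passes=10, seed=0):
        rng = np.random.default_rng(seed)
        N, D1 = H.shape
        w = np.zeros(D1)
        indicator = (y == 1).astype(float)
        for _ in range(n_passes):
            for i in rng.permutation(N):                       # pick data points in random order
                p_plus = 1.0 / (1.0 + np.exp(-H[i].dot(w)))    # P(y=+1 | x_i, w)
                w = w + eta * H[i] * (indicator[i] - p_plus)   # update from one data point only
        return w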


CSE 446: Machine Learning 47

Stochastic gradient for L2-regularized objective

What about the regularization term?

©2017 Emily Fox

Total derivative = partial[j] − 2λ wj

Stochastic gradient ascent (each time, pick a different data point i):
   ≈ Total derivative: partial[j] from data point i  −  (2λ wj) / N

Each data point contributes 1/N to the regularization.

CSE 446: Machine Learning 48

Comparing computational time per step

©2017 Emily Fox

Time to compute contribution of xi,yi   # of data points (N)   Total time for 1 step of gradient   Total time for 1 step of stochastic gradient
1 millisecond                            1,000                  1 second                            1 millisecond
1 second                                 1,000                  16.7 minutes                        1 second
1 millisecond                            10 million             2.8 hours                           1 millisecond
1 millisecond                            10 billion             115.7 days                          1 millisecond

Gradient ascent  ≈  Stochastic gradient ascent


CSE 446: Machine Learning 49

Which one is better??? Depends…

©2017 Emily Fox

Algorithm              Time per iteration     In theory*   In practice*   Sensitivity to parameters
Gradient               Slow for large data    Slower       Often slower   Moderate
Stochastic gradient    Always fast            Faster       Often faster   Very high

* Total time to convergence for large data

CSE 446: Machine Learning 50

Summary of stochastic gradient

• Tiny change to gradient ascent
• Much better scalability
• Huge impact in real-world
• Very tricky to get right in practice

©2017 Emily Fox


CSE 446: Machine Learning

Why would stochastic gradient ever work???

©2017 Emily Fox

CSE 446: Machine Learning 52

Gradient is direction of steepest ascent

©2017 Emily Fox

Gradient is "best" direction, but any direction that goes "up" would be useful


CSE 446: Machine Learning 53

In ML, steepest direction is sum of "little directions" from each data point

©2017 Emily Fox

∇ℓ(w) = Σ_{i=1}^{N} [ "little direction" from data point i ]    (sum over data points)

For most data points, contribution points "up"

CSE 446: Machine Learning 54

Stochastic gradient: Pick a data point and move in direction

©2017 Emily Fox

Most of the time, total likelihood will increase


CSE 446: Machine Learning 55

Stochastic gradient ascent: Most iterations increase the likelihood, but some decrease it. On average, make progress.

©2017 Emily Fox

until converged:
   for i = 1, …, N:
      for j = 0, …, D:
         wj(t+1) ← wj(t) + η hj(xi) ( 1[yi = +1] − P(y=+1|xi, w(t)) )
      t ← t + 1

CSE 446: Machine Learning

Convergence path

©2017 Emily Fox


CSE 446: Machine Learning 57

Convergence paths

©2017 Emily Fox

[Plot: convergence paths over the likelihood surface for gradient vs. stochastic gradient]

CSE 446: Machine Learning 58

Stochastic gradient convergence is "noisy"

©2017 Emily Fox

[Plot: average log likelihood (higher is better) vs. total time, which is proportional to # passes over data]

Gradient usually increases likelihood smoothly

Stochastic gradient makes "noisy" progress

Stochastic gradient achieves higher likelihood sooner, but it's noisier


CSE 446: Machine Learning 59

Eventually, gradient catches up

©2017 Emily Fox

[Plot: average log likelihood (higher is better) vs. # passes over data, for gradient and stochastic gradient]

Note: should only trust "average" quality of stochastic gradient

CSE 446: Machine Learning 60

The last coefficients may be really good or really bad!!

©2017 Emily Fox

Stochastic gradient will eventually oscillate around a solution (e.g., w(1000) was bad, w(1005) was good).

How do we minimize the risk of picking bad coefficients?

Minimize noise: don't return the last learned coefficients; output the average:

   ŵ = (1/T) Σ_{t=1}^{T} w(t)


CSE 446: Machine Learning 61

Summary of why stochastic gradient works

• Gradient finds direction of steepest ascent
• Gradient is sum of contributions from each data point
• Stochastic gradient uses direction from 1 data point
• On average, increases likelihood; sometimes decreases it
• Stochastic gradient has "noisy" convergence

©2017 Emily Fox

CSE 446: Machine Learning

Online learning: Fitting models from streaming data

©2017 Emily Fox


CSE 446: Machine Learning

Batch vs online learning

Batch learning
• All data is available at start of training time

Online learning
• Data arrives (streams in) over time
  - Must train model as data arrives!

©2017 Emily Fox

[Diagram, batch: Data → ML algorithm → ŵ]

[Diagram, online: data arriving at t = 1, 2, 3, 4, … streams into the ML algorithm, which outputs ŵ(1), ŵ(2), ŵ(3), ŵ(4), … over time]

CSE 446: Machine Learning 64

Online learning example: Ad targeting

©2017 Emily Fox

[Diagram: website page → input xt (user info, page text) → ML algorithm with coefficients ŵ(t) → ŷ = suggested ads (Ad1, Ad2, Ad3) → user clicks on Ad2 → observed output yt = Ad2 → updated coefficients ŵ(t+1)]


CSE 446: Machine Learning 65

Online learning problem

• Data arrives over each time step t:
  - Observe input xt  (info of user, text of webpage)
  - Make a prediction ŷt  (which ad to show)
  - Observe true output yt  (which ad user clicked on)

©2017 Emily Fox

Need ML algorithm to update coefficients each time step!

CSE 446: Machine Learning 66

Stochastic gradient ascent can be used for online learning!!!

©2017 Emily Fox

• init w(1) = 0, t = 1
• Each time step t:
  - Observe input xt
  - Make a prediction ŷt
  - Observe true output yt
  - Update coefficients:
      for j = 0, …, D:
         wj(t+1) ← wj(t) + η hj(xt) ( 1[yt = +1] − P(y=+1|xt, w(t)) )
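A sketch of the same update wrapped as an online learner; the stream interface and the {-1, +1} labels are assumptions for illustration, not part of the slide:

    import numpy as np

    def online_logistic_sgd(stream, n_features, eta=0.1):
        # stream yields (h_x, y) pairs one at a time, with y in {-1, +1}
        w = np.zeros(n_features)
        for h_x, y in stream:
            p_plus = 1.0 / (1.0 + np.exp(-np.dot(w, h_x)))
            y_hat = +1 if p_plus > 0.5 else -1         # make a prediction before seeing y
            w = w + eta * h_x * ((y == 1) - p_plus)    # then update from this single observation
            yield y_hat, w.copy()                      # coefficients are available at every time step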


CSE 446: Machine Learning 67

Summary of online learning

Data arrives over time

Must make a prediction every time new data point arrives

Observe true class after prediction made

Want to update parameters immediately

©2017 Emily Fox

CSE 446: Machine Learning

Summary of stochastic gradient descent

©2017 Emily Fox


CSE 446: Machine Learning 69

What you can do now…

• Significantly speedup learning algorithm using stochastic gradient

• Describe intuition behind why stochastic gradient works

• Apply stochastic gradient in practice

• Describe online learning problems

• Relate stochastic gradient to online learning

©2017 Emily Fox

CSE 446: Machine Learning

Generalized linear models

©2017 Emily Fox


CSE 446: Machine Learning 71

Models for data

• Linear regression with Gaussian errors

  yi = wTh(xi) + εi,   εi ~ N(0, σ2)

  p(y|x,w) = N(y; wTh(x), σ2)

  (mean: μ = wTh(x))

• Logistic regression

  P(y=+1|x,w) = 1 / (1 + e^(-wTh(x)))

  P(y=-1|x,w) = e^(-wTh(x)) / (1 + e^(-wTh(x)))

©2017 Emily Fox

CSE 446: Machine Learning 72

Examples of "generalized linear models"

A generalized linear model has

   E[y | x, w] = g(wTh(x))

©2017 Emily Fox

                      Mean            Link function g
Linear regression     wTh(x)          identity function
Logistic regression   P(y=+1|x,w)     sigmoid function

(The logistic regression row assumes y in {0,1} instead of {-1,1} or some other set of values; the mean is different and the link function needs slight modification if y is not in {0,1}.)
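As a small illustration of the shared form E[y|x,w] = g(wTh(x)), here is a toy helper with the two link functions named in the table (not part of the lecture materials):

    import numpy as np

    def glm_mean(w, h_x, link="identity"):
        # E[y | x, w] = g(w^T h(x)) for a generalized linear model
        score = np.dot(w, h_x)
        if link == "identity":          # linear regression
            return score
        if link == "sigmoid":           # logistic regression with y in {0, 1}
            return 1.0 / (1.0 + np.exp(-score))
        raise ValueError("unknown link function")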


CSE 446: Machine Learning

Stochastic gradient descent more formally

©2017 Emily Fox

CSE 446: Machine Learning 74

Learning Problems as Expectations

• Minimizing loss in training data:
  - Given dataset:
    • Sampled iid from some distribution p(x) on features
  - Loss function, e.g., hinge loss, logistic loss, …
  - We often minimize loss in training data
• However, we should really minimize expected loss on all data
• So, we are approximating the integral by the average on the training data

©2017 Emily Fox
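The formulas these bullets refer to are blank in the transcript; in standard notation they are roughly the following (a reconstruction, not copied from the deck):

    \text{training objective:}\quad \frac{1}{N}\sum_{i=1}^{N} \ell(w;\, x_i, y_i)
    \qquad\approx\qquad
    \text{expected loss:}\quad \mathbb{E}_{(x,y)\sim p}\big[\,\ell(w;\, x, y)\,\big]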


CSE 446: Machine Learning 75

Gradient Ascent in Terms of Expectations

• “True” objective function:

• Taking the gradient:

• “True” gradient ascent rule:

• How do we estimate expected gradient?

©2017 Emily Fox

CSE 446: Machine Learning 76

SGD: Stochastic Gradient Ascent (or Descent)

• "True" gradient:
• Sample based approximation:
• What if we estimate gradient with just one sample???
  - Unbiased estimate of gradient
  - Very noisy!
  - Called stochastic gradient ascent (or descent)
    • Among many other names
  - VERY useful in practice!!!

©2017 Emily Fox
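The blank formulas on these last two slides correspond to estimating the expected gradient first by the training-set average and then by a single randomly drawn data point (again a standard-notation reconstruction, not copied from the deck):

    \nabla_w\, \mathbb{E}_{(x,y)\sim p}\big[\ell(w; x, y)\big]
    \;\approx\; \frac{1}{N}\sum_{i=1}^{N} \nabla_w\, \ell(w; x_i, y_i)
    \;\approx\; \nabla_w\, \ell(w; x_i, y_i) \quad\text{for a single randomly chosen } i

The single-sample estimate is unbiased but has high variance, which is exactly the noisy-but-correct-on-average behavior described in the stochastic gradient slides.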